This study externally validates our AMH-age based prediction of live birth for IVF. Furthermore, equivalent model performance was demonstrated in the EV cohort, with confirmation of the independent associations of AMH and age with live birth [3–7].
Recent literature has identified an array of factors that can influence the success of ART, and various prediction models use these factors to help determine a couple’s likelihood of success [11–13]. However, clinical use of such prediction models has remained limited, largely owing to a lack of external validation. Of the 29 pregnancy prediction models identified in a recent systematic review, only 8 were externally validated, and only 3 of these were applicable to IVF. Our model adds to this literature, allowing stratification of the probability of live birth prior to the commencement of treatment. A relevant difference from previously published models of live birth in IVF is that, whereas the majority of prediction models are based on variables measured during the IVF cycle (e.g. number and quality of embryos), the AMH-age model is based only on baseline characteristics, permitting its use by clinicians and patients prior to commencing stimulation.
A criticism of the original study was the cut-off points given to age and AMH levels in the nomogram and the potential for the predictive power of the model to be attenuated by these designated cut-offs. To address this potential weakness, we additionally investigated the use of age and AMH as continuous variables. The ROCAUC achieved with this approach was 0.65, identical to that achieved originally, suggesting that the use of the proposed cut-offs does not compromise the predicted probabilities generated and that alternative values would not improve predictions. This is reassuring and allows AMH and age to be displayed as categories, rather than continuous variables, in tables. This has a clear benefit for applying the model in a clinical environment: the patient’s age can simply be cross-tabulated with their AMH concentration, without the need to apply a complex logistic regression formula.
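The comparison described above, between a predictor used as a continuous value and the same predictor collapsed into categories, can be illustrated with a minimal sketch. The function names `roc_auc` and `categorise`, the toy scores and the cut-offs are all hypothetical and are not taken from the study; the sketch only shows how the ROCAUC can be computed as the proportion of concordant positive-negative pairs, with tied pairs counted as one half.

```python
from bisect import bisect_right


def roc_auc(scores, outcomes):
    """ROC AUC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative outcome")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def categorise(value, cutoffs):
    """Map a continuous value to an ordinal category index.
    `cutoffs` must be sorted in ascending order (hypothetical values here)."""
    return bisect_right(cutoffs, value)


# Toy data, invented purely for illustration (not study data).
continuous_scores = [0.10, 0.40, 0.35, 0.80]
live_birth = [0, 0, 1, 1]
hypothetical_cutoffs = [0.3, 0.6]

categorical_scores = [categorise(s, hypothetical_cutoffs) for s in continuous_scores]

print("AUC, continuous predictor: ", roc_auc(continuous_scores, live_birth))
print("AUC, categorised predictor:", roc_auc(categorical_scores, live_birth))
```

Categorisation introduces ties between patients who fall in the same band, so the two values need not agree on any given data set; similar values, as reported above, indicate that little discriminative information is lost by the chosen cut-offs.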
In the EV cohort the discriminative ability of the model was only moderate (ROCAUC: 0.66), meaning that the model has limited capacity to distinguish correctly between women who will and will not have a baby following IVF. However, ROC curves are primarily designed for diagnostic models (15); for prognostic models, accuracy is better assessed by examining calibration (16). Calibration is evaluated by determining the level of correspondence between the calculated pregnancy probabilities and the observed proportion of pregnancies. A well-calibrated model for IVF would be able to classify individuals according to whether they have a low, medium or high probability of achieving a live birth. In contrast to the relatively modest discrimination, the calibration of the model was found to be good (Figure 1).
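As an illustration of what assessing calibration involves, the following minimal sketch groups patients by predicted probability and compares the mean predicted probability in each group with the observed proportion of live births. The function name `calibration_table`, the group boundaries and the toy data are hypothetical and do not reproduce the analysis or the risk groups of this study.

```python
from bisect import bisect_right


def calibration_table(predicted, outcomes, bounds):
    """Compare predicted and observed event rates within risk groups.

    `bounds` are ascending probability thresholds separating the groups,
    e.g. [0.2, 0.4] gives low / medium / high (hypothetical thresholds).
    Returns one row per group; empty groups report None for both rates.
    """
    grouped = [[] for _ in range(len(bounds) + 1)]
    for p, y in zip(predicted, outcomes):
        grouped[bisect_right(bounds, p)].append((p, y))

    rows = []
    for idx, members in enumerate(grouped):
        n = len(members)
        if n == 0:
            rows.append({"group": idx, "n": 0,
                         "mean_predicted": None, "observed_rate": None})
            continue
        rows.append({
            "group": idx,
            "n": n,
            "mean_predicted": sum(p for p, _ in members) / n,
            "observed_rate": sum(y for _, y in members) / n,
        })
    return rows


# Toy data, invented purely for illustration (not study data).
predicted_probability = [0.10, 0.15, 0.30, 0.35, 0.50, 0.60]
live_birth = [0, 0, 0, 1, 1, 1]

for row in calibration_table(predicted_probability, live_birth, [0.2, 0.4]):
    print(row)
```

In a well-calibrated model the mean predicted probability and the observed rate are close within every group, which is the same comparison a calibration plot displays graphically.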
The strength of this study is that the sample size was more than twice that used for model derivation. However, the EV cohort differed from the original cohort in several characteristics, such as BMI and duration of infertility, and the intermediate outcomes of IVF also differed between the two cohorts. This largely reflects the differences between the demographic characteristics of the Italian and Scottish infertility populations and between IVF clinical practice in the two countries. In particular, the initial study was undertaken while the Italian law regulating assisted reproduction, which limited the number of inseminated oocytes to three and thereby reduced the number of embryos that could be generated for each patient, was still in force. This resulted in a discordance in the number of embryos transferred: in Modena all available embryos were transferred (mainly three), whereas single or double embryo transfer dominated in Glasgow. In the EV cohort, women were included irrespective of the cause of infertility, past medical history or type of stimulation. Despite these relatively important differences in patient characteristics, legislation and clinical practice, the proposed model still fitted very well, further highlighting the potential generalizability of the prognostic model.
The original study limited its analysis to age and AMH, as only these two factors were identified as predictive in the original multivariate analysis for model development (8). We are aware that additional characteristics, including BMI and the cause and duration of infertility, may influence results, and that the lack of association of these baseline factors with live birth may have reflected the size of the original cohort (14).
Finally, it should be acknowledged that the probabilities generated have relatively wide confidence intervals for all groups; a couple’s predicted likelihood can therefore vary considerably. For example, women aged below 31 with AMH levels less than 0.4 ng/mL are predicted a 13% chance of live birth; however, the confidence interval ranges from 4 to 36%, which offers little reassurance about their chances of a successful outcome. It would, however, be inappropriate to withhold treatment purely on the basis of the probability estimates derived from our nomogram. Even in women with an AMH below or close to the functional sensitivity of the assay, natural and assisted conception pregnancies have been reported [15–18]. Clinical consultations would therefore require both the clinician’s interpretation of the nomogram results and the individual patient’s opinion as to whether the probabilities produced could be of benefit. The greatest utility of this external validation may therefore be to confirm that AMH is an independent predictor of live birth and is worthy of evaluation in larger cohorts with detailed baseline phenotyping, with a view to assessing its utility in improving model performance [13, 19].