We present results from Ped-BERT’s pre-training stage and then evaluate Ped-BERT’s fine-tuned prediction ability for our two downstream tasks. We conclude by discussing the results of a few fairness tasks and how Ped-BERT could guide researchers and medical practitioners.
Ped-BERT pre-training evaluation
The Ped-BERT architecture is determined by computational constraints and by performance evaluation on the ‘pre-training validation set’, and features the following specifications: the input diagnosis embedding matrix is of size 120 \(\times\) 128, with the first dimension representing the length of the diagnosis vocabulary (115 unique two-digit diagnosis codes + OOV + [MASK] + [CLS] + [SEP] + padding token) and the second dimension representing the embedding size; the patient history is restricted to a maximum length of 40 tokens; the encoder is a stack of 6 identical layers; each of these layers contains a multi-head attention sublayer with 12 heads and a feedforward network sublayer with 128 hidden units; the dropout regularization rate is set to 0.1; pre-training runs for 15 epochs using the Adam optimizer with a learning rate of \(3\times 10^{-5}\) and a decay of 0.01. More details on hyperparameter search can be found in Supplementary Table S1.
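For readability, this specification can be collected into a single configuration object. The following Python dictionary is a minimal sketch that simply restates the hyperparameters above; the key names are ours, not from the Ped-BERT codebase.

```python
# Minimal sketch of the Ped-BERT pre-training configuration described above.
# Key names are illustrative; they do not come from the authors' code.
PED_BERT_CONFIG = {
    "vocab_size": 120,        # 115 two-digit codes + OOV, [MASK], [CLS], [SEP], padding
    "embedding_dim": 128,     # size of each diagnosis embedding
    "max_history_len": 40,    # maximum number of tokens per patient history
    "encoder_layers": 6,      # identical encoder layers
    "attention_heads": 12,    # heads per multi-head attention sublayer
    "ffn_hidden_units": 128,  # hidden units per feedforward sublayer
    "dropout": 0.1,
    "pretrain_epochs": 15,
    "optimizer": "adam",
    "learning_rate": 3e-5,
    "decay": 0.01,
}
```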
Ped-BERT is pre-trained using different specifications for the input embedding matrix. As mentioned in the “Methods” section, we define our baseline embeddings specification as the sum of diagnosis embeddings and positional encodings. We then augment this baseline by adding age embeddings (+ age), county embeddings (+ cnty), and age + county embeddings (+ age + cnty). Figure 4a presents several interesting findings: adding age embeddings improves the APS relative to baseline [0.52 vs. 0.51]; adding county embeddings to the baseline + age specification results in negligible APS differences [APS: 0.521 vs. 0.52]; and adding further embeddings (age and/or county) to the baseline specification results in negligible differences in ROC AUC. We also assess specifications that use the patient’s zip code instead of the county as additional embeddings and find that model performance falls below the base specification in terms of both APS and ROC AUC. In summary, our results suggest that, in the context of pediatric patients, augmenting a pre-trained model with information on the patient’s age at the time of the medical encounter has a modest positive impact on model performance, while adding the patient’s county of residence at the time of the visit does not improve the results further.
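As a concrete illustration of these input specifications, the sketch below sums diagnosis, positional, and (optionally) age and county embeddings in PyTorch. It assumes learned positional encodings and placeholder cardinalities for the age and county vocabularies; none of this is taken from the Ped-BERT implementation.

```python
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """Baseline (+ age, + cnty) input embedding specifications, as a sketch."""

    def __init__(self, vocab_size=120, dim=128, max_len=40,
                 n_ages=19, n_counties=58):  # n_ages/n_counties are placeholders
        super().__init__()
        self.diag = nn.Embedding(vocab_size, dim)  # diagnosis embeddings
        self.pos = nn.Embedding(max_len, dim)      # learned positional encodings
        self.age = nn.Embedding(n_ages, dim)       # age at medical encounter
        self.cnty = nn.Embedding(n_counties, dim)  # county of residence at visit

    def forward(self, diag_ids, age_ids=None, cnty_ids=None):
        positions = torch.arange(diag_ids.size(1), device=diag_ids.device)
        x = self.diag(diag_ids) + self.pos(positions)  # baseline specification
        if age_ids is not None:
            x = x + self.age(age_ids)                  # + age
        if cnty_ids is not None:
            x = x + self.cnty(cnty_ids)                # + age + cnty
        return x
```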
We proceed to evaluate the quality of our pre-trained embeddings using both intrinsic and extrinsic methods. The intrinsic assessment involves visual inspection of the embeddings and the cosine similarity among disease embeddings. For the extrinsic evaluation, we examine the embeddings’ effectiveness in predicting the patient gender distribution for specific disease codes.
To visually inspect Ped-BERT’s embeddings, we reduce the embedding space to 2D using t-SNE (see scikit-learn27 for implementation details). Figure 4b shows the reduced embeddings for the baseline + age input embeddings specification. The visualization reveals that similar diseases (such as those related to injury and poisoning, diseases of the respiratory system, and birth conditions) cluster together. Furthermore, diseases known to frequently co-occur (such as neoplasms and diseases of the blood and blood-forming organs) are also grouped closely. Upon closer examination of these 2D disease embedding clusters, a remarkable association with the International Classification of Diseases (ICD-9) codes becomes evident. Notably, this finding is interesting because we did not explicitly provide this information to Ped-BERT during the pre-training phase. We then report the cosine similarity between disease codes using Ped-BERT’s learned embeddings. Upon aggregation at the chapter level, we observe a range of similarity values, with minimum and maximum values of −0.318 and 1, respectively; the values at the 25th, 50th, and 95th percentiles are 0.093, 0.229, and 0.586, respectively (additional details are available in Supplementary Fig. S2).
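Both intrinsic checks can be reproduced with standard tooling. The sketch below uses scikit-learn’s t-SNE and cosine-similarity utilities on an embedding matrix assumed to have been extracted from the pre-trained model; the file name is hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

# (120, 128) diagnosis embedding matrix extracted from the pre-trained model;
# the file name below is hypothetical.
emb = np.load("ped_bert_diag_embeddings.npy")

# 2D projection for visual inspection of disease clusters (a Fig. 4b-style plot)
emb_2d = TSNE(n_components=2, random_state=0).fit_transform(emb)

# pairwise cosine similarity between disease codes
sim = cosine_similarity(emb)
upper = sim[np.triu_indices_from(sim, k=1)]  # unique off-diagonal pairs
print(np.percentile(upper, [25, 50, 95]))
```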
Finally, we conduct an extrinsic evaluation of Ped-BERT’s embeddings by assessing their performance in predicting the gender distribution of patients with congenital anomalies and tuberculosis. This evaluation is prompted by the growing body of evidence highlighting sex-specific disparities in the prevalence of congenital anomalies and tuberculosis, with research studies demonstrating higher prevalence rates among pediatric males28,29. As shown in Supplementary Fig. S3, Ped-BERT consistently predicts a higher prevalence of these two diseases among males when evaluated on the ‘pre-training validation set’, with a Fisher’s exact test p-value of 0.0862 (\(p<0.1\)).
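The test itself amounts to a 2×2 contingency comparison of predicted disease status by patient sex. A minimal sketch with SciPy follows; the cell counts are invented purely for illustration.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table: rows = disease predicted / not predicted,
# columns = male / female patients in the 'pre-training validation set'.
table = [[130, 70],
         [900, 940]]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.3f}, p = {p_value:.4f}")
```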
In summary, our current intrinsic and extrinsic evaluation results indicate that Ped-BERT has developed a substantial understanding of the contextual relationships between diseases.
Ped-BERT fine-tuning evaluation
The complete training procedure for Ped-BERT involves fine-tuning the model, which is initially trained as a general disease model for pediatric patients without task-specific objectives. In the fine-tuning stage, we adapt Ped-BERT to predict the principal medical diagnosis and the LoS at the subsequent pediatric visit. Specifically, we add a feedforward layer with 64 hidden units and an output layer with a softmax activation function on top of the pre-trained Ped-BERT. The model is fine-tuned for each task for 100 epochs using the Adam optimizer with a learning rate of \(3\times 10^{-4}\), a dropout rate of 0.3, and early stopping.
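A minimal sketch of this fine-tuning head follows, assuming the pre-trained encoder returns one vector per token and that the first ([CLS]) position summarizes the visit history; the class name and the pooling choice are ours, not the authors’.

```python
import torch
import torch.nn as nn

class FineTuneHead(nn.Module):
    """Pre-trained encoder + feedforward layer (64 units) + softmax output."""

    def __init__(self, encoder, encoder_dim=128, n_classes=100):
        super().__init__()
        self.encoder = encoder            # pre-trained Ped-BERT body
        self.dropout = nn.Dropout(0.3)
        self.ff = nn.Linear(encoder_dim, 64)
        self.out = nn.Linear(64, n_classes)

    def forward(self, x):
        h = self.encoder(x)[:, 0, :]      # assumed [CLS]-style pooling
        h = self.dropout(torch.relu(self.ff(h)))
        # softmax over next-visit diagnosis codes (or LoS classes); in practice
        # one would train on the logits with nn.CrossEntropyLoss
        return torch.softmax(self.out(h), dim=-1)
```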
To explore the impact of bidirectional self-attention versus constrained attention (where each disease token can attend only to context on its left), we pre-train a transformer decoder (TDecoder) following the original transformer architecture24. This involves utilizing our ‘pre-training training set’ and ‘pre-training validation set’, as for Ped-BERT, followed by fine-tuning the TDecoder model on our two downstream tasks. Additionally, we assess the efficacy of pre-training Ped-BERT against logistic regression (LR) and random forest (RF) classifiers, which take standard multi-hot inputs for up to three disease codes noted by clinical personnel during a medical encounter. We also introduce an untrained NN architecture (NN_REmb), with randomly initialized disease embeddings and a feedforward layer with 564 hidden units, into our comparison framework.
For pre-training the TDecoder model, we use Ped-BERT’s hyperparameter configuration to facilitate comparative evaluation (refer to Supplementary Fig. S4 for pre-training APS and ROC AUC results). The optimal configuration for LR, NN_REmb, and the fine-tuned TDecoder model involves training for 100 epochs using the Adam optimizer with a learning rate of \(3\times 10^{-4}\) and early stopping. LR employs a dropout rate of 0.1, while NN_REmb and the fine-tuned TDecoder model use a dropout rate of 0.3. The optimal configuration for the RF model comprises 10 trees with a maximum depth of 5, coupled with balanced bootstrapped sampling.
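The reported RF configuration maps directly onto scikit-learn; in the sketch below, the `class_weight="balanced_subsample"` setting is our reading of “balanced bootstrapped sampling”, not a detail confirmed by the source.

```python
from sklearn.ensemble import RandomForestClassifier

# RF configuration reported above; class_weight is our interpretation of
# "balanced bootstrapped sampling".
rf = RandomForestClassifier(
    n_estimators=10,                    # 10 trees
    max_depth=5,                        # maximum depth of 5
    class_weight="balanced_subsample",  # re-balance classes per bootstrap sample
    random_state=0,
)
```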
Our experimental setup utilizing the LR and RF algorithms focuses exclusively on the baseline embedding specification due to the curse of dimensionality and memory limitations when age and county, as well as their respective interaction terms, are considered as features. Furthermore, given the inherent challenges decision tree models face in handling multiclass classification tasks with a large number of classes (in our case, 100+), we apply the RF model exclusively to the LoS prediction task. We report each model’s performance as the average of five independent runs.
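For clarity, the multi-hot input used by the LR and RF baselines can be sketched as follows; the function name and the three-code cap per encounter mirror the description above, while the example code indices are invented.

```python
import numpy as np

def multi_hot(visit_codes, vocab_size=115):
    """Encode up to three diagnosis codes from one encounter as a multi-hot vector."""
    x = np.zeros(vocab_size, dtype=np.float32)
    x[list(visit_codes)[:3]] = 1.0  # at most three codes noted per encounter
    return x

# example with invented code indices: principal + two secondary diagnoses
x = multi_hot([12, 47, 88])
print(x.sum())  # -> 3.0
```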
For the diagnosis prediction task utilizing base embeddings/multi-hot inputs, the DL models obtain the best results relative to the LR classifier (APS between 0.374 and 0.392 vs. 0.277, and ROC AUC between 0.914 and 0.92 vs. 0.876). Among the DL models, Ped-BERT stands out by significantly outperforming both the NN_REmb and TDecoder models on both scores. The inclusion of additional embeddings such as age (+ age) or age + county (+ age + cnty) yields only marginal or negligible gains in ROC AUC across all models. However, there are improvements in APS, particularly for the NN_REmb model augmented with age embeddings, suggesting the potential utility of this additional feature for this architectural choice (Fig. 5a,b, square-green lines; Fig. 5e).
Similar trends, although larger in magnitude, are observed for the LoS task with base embeddings/multi-hot inputs, whereby the DL models outperform the LR and RF classifiers (APS between 0.751 and 0.769 vs. 0.619–0.693, and ROC AUC between 0.756 and 0.781 vs. 0.546–0.659). Once again, Ped-BERT outperforms both the NN_REmb and TDecoder models. Incorporating age embeddings into the base specification enhances the APS of the DL models by 2.2–3.8\(\%\) and the ROC AUC by 4.1–6.3\(\%\), effectively narrowing the performance gap between the DL models. No further performance improvements are observed upon adding county embeddings to the base + age specification (Fig. 5a,b, diamond-black lines; Fig. 5e).
We next examine in greater depth the Ped-BERT input configuration that yields the best prediction results. Specifically, in Fig. 5c, we report Ped-BERT’s ROC AUC for each principal diagnosis code as derived from the base + age embeddings specification. The results highlight Ped-BERT’s high predictive performance for specific conditions, including maternal causes of perinatal morbidity and mortality (AUC = 0.984), malignant neoplasm of genitourinary organs (AUC = 0.984), congenital anomalies (AUC = 0.945), pneumoconiosis and other lung diseases due to external agents (AUC = 0.903), and ischemic heart disease (AUC = 0.899). Conversely, lower prediction performance is observed for conditions such as hernia of abdominal cavity (AUC = 0.651), toxic effects of substances (AUC = 0.632), injury of nerves of spinal cord (AUC = 0.614), persons with potential health hazards related to personal and family history (AUC = 0.525), and injury to blood vessels (AUC = 0.496). Furthermore, Supplementary Fig. S5 assesses Ped-BERT’s suitability for detecting rare diseases, showing varying prediction performance levels across different diseases. Concretely, we report the ROC AUC scores for various genetic diseases listed as principal disease codes, including congenital anomalies of eyes (AUC = 0.912), cerebral degenerations manifesting in childhood (AUC = 0.677), diseases of white blood cells (AUC = 0.667), other diseases of the biliary tract (AUC = 0.660), diseases of the capillaries (AUC = 0.569), and other metabolic and immunity disorders (AUC = 0.560). For more details, refer to Supplementary Table S2, which provides additional information on the number of patients with these rare diseases in the ‘fine-tuning training set’, ‘fine-tuning validation set’, and ‘fine-tuning test set’. Finally, Fig. 5d illustrates Ped-BERT’s ROC AUC for the LoS prediction task based on the base + age specification. The prediction performance varies across the LoS classes. The highest performance is observed for patients seen in an emergency or inpatient setting but discharged on the same day (LoS \(\le\) 1 day, AUC = 0.91). This is followed by the prediction of LoS in an inpatient setting lasting more than 3 days (LoS > 3 days, AUC = 0.79) and LoS in an inpatient setting lasting 2–3 days (LoS \(\le\) 3 days, AUC = 0.73).
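The per-code and per-class AUCs reported above correspond to one-vs-rest ROC AUCs. A small sketch of how they could be computed is shown below, assuming arrays of integer labels and predicted class probabilities; the function name is ours.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_class_auc(y_true, y_prob):
    """One-vs-rest ROC AUC per diagnosis code or LoS class (a sketch).

    y_true: (n_samples,) integer class labels observed at the next visit.
    y_prob: (n_samples, n_classes) predicted probabilities from the model.
    """
    return {
        c: roc_auc_score((y_true == c).astype(int), y_prob[:, c])
        for c in np.unique(y_true)
    }
```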
The role of mother attributes data
To evaluate potential prediction enhancements for our two downstream tasks, and to further investigate the effectiveness of Ped-BERT’s pre-training, we expand our analysis by integrating the mother’s attributes data from the \(p.A_{bm}\) set into all model configurations. We observe significant improvements.
In the diagnosis prediction task using base embeddings/multi-hot inputs, LR exhibits the most substantial gains, with APS and ROC AUC improvements of \(6.9\%\) and \(2.5\%\), respectively. The gains for the DL models are smaller or insignificant, with APS gains of 1–1.9\(\%\) and ROC AUC gains of 0.06–1\(\%\). The DL models once again obtain the best results, with Ped-BERT slightly outperforming both the NN_REmb and TDecoder models in terms of APS. Adding the mother’s health attributes data to the age (+ age) or age and county (+ age + cnty) embedding specification yields similar APS and ROC AUC gains for the DL models (Fig. 5a,b, square-yellow lines; Fig. 5e).
Performance improvements are larger in magnitude for the LoS prediction task utilizing base inputs. The APS and ROC AUC gains for the non-DL models range from 0.02 to 12\(\%\) and from 1.2 to 20.4\(\%\), respectively. For the DL models, we observe APS and ROC AUC improvements ranging from 4.3 to 5.3\(\%\) and from 6.7 to 8.2\(\%\), respectively. Most importantly, incorporating pre- and post-partum mother health information significantly narrows the performance gap between the DL models. Finally, adding the mother’s data to the age (+ age) or age and county (+ age + cnty) embedding specification yields lower APS and ROC AUC gains than the base specification (Fig. 5a,b, diamond-red lines; Fig. 5e). In Supplementary Fig. S6, we report Ped-BERT’s ROC AUC performance with these additional features for each diagnosis code and LoS class, as derived from the base + age embeddings specification.
Fairness tasks
We are interested in determining whether next-visit diagnosis and LoS prediction errors are uniform across patient subgroups in our data. Figure 5 already offers some insight into the APS and ROC AUC performance for these tasks (overall and by disease code or LoS class), but it is desirable to understand how well Ped-BERT performs for different subgroups. For example, Fig. 2 identifies groups of mother–baby/patient demographics and health-related outcomes for the pairs used in this analysis. Our data also contains information on the mother’s country of birth, which is rarely available to researchers and unique to our study. As such, in this section, we assess the fine-tuned Ped-BERT’s prediction performance with fairness in mind, using the pre-trained baseline + age embeddings specification for this task.
For diagnosis prediction, we find minimal differences in ROC AUC performance across groups of patient gender and race, mother race and education, the month prenatal care began, the number of prenatal visits, and the number of times the mother visited a healthcare facility overnight or in an ER setting (Fig. 6, top and middle panels). Next, we create bins for the mother’s country of birth, for similar patient ages, for zip codes/counties at birth belonging to the same geographical region30, and for the number of times a patient was seen in an inpatient/ER setting. We find that Ped-BERT is slightly more susceptible to prediction errors depending on the mother’s country of birth and for patients with shorter medical histories (Fig. 6, bottom panel). Additionally, the integration of the mother’s health attributes data yields ROC AUC improvements across all demographic subgroups in our dataset, rather than solely in the overall evaluation (refer to Supplementary Fig. S7).
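These subgroup comparisons amount to computing ROC AUC separately within each bin. A minimal pandas sketch follows; the function and column names are hypothetical.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auc(df, group_col, label_col="y_true", score_col="y_score"):
    """ROC AUC per patient subgroup (a sketch; column names are hypothetical).

    df holds one row per prediction, with a binary label, a model score, and a
    subgroup attribute (e.g., patient gender, mother's country of birth).
    """
    return {
        group: roc_auc_score(sub[label_col], sub[score_col])
        for group, sub in df.groupby(group_col)
        if sub[label_col].nunique() == 2  # AUC is undefined for one-class groups
    }
```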
In contrast, we find significant subgroup differences in ROC AUC performance for the LoS prediction task. For example, Ped-BERT performs better for females than for males (AUC 0.840 vs. 0.818), for patients whose mothers had fewer than 10 inpatient/ER visits in the post-partum period as opposed to more than 10 (AUC 0.827 vs. 0.758), for patients aged 3 years and above (AUC 0.780 for 0–2 year-olds vs. 0.968 for 3–17 year-olds and 0.986 for those 17+), and for patients with a longer history of medical visits (AUC 0.762 for 3 visits vs. 0.982 for more than 7 visits and 0.871 for 4–6 visits) (Fig. 7). Once again, the integration of the mother’s attributes data in \(p.A_{bm}\) leads to ROC AUC improvements across all subgroups in our data (refer to Supplementary Fig. S8).
Research application
Our study allows medical researchers to assess optimal machine learning model configurations for enhancing diagnosis accuracy and LoS predictions based on the available input features. For example, Ped-BERT could be a good architectural choice when age information is generalized or masked to protect patient identities under privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA). These research outcomes, in turn, hold potential significance for clinical practitioners, who could integrate machine learning insights into their decision-making processes, thereby addressing uncertainties associated with potential medical conditions. Such integration facilitates informed scheduling of follow-up appointments, optimizing patient care delivery and potentially reducing the anticipated LoS.