In a recent study published in the journal Med, researchers trained machine learning (ML) models to analyze RNA molecular signatures in patients’ blood and evaluated their performance in distinguishing between common infectious pediatric diseases.
Their results show that ML models assessing differential gene expression can rapidly differentiate between 18 inflammatory and infectious diseases in children. Notably, these models’ diagnostic accuracy was comparable to that of health professionals reviewing conventional clinical data.
Given the poor diagnostic accuracy and severe delays of current diagnostic approaches, this proof of concept shows excellent promise in diagnosing illnesses during pediatric care in the future.
Study: Diagnosis of childhood febrile illness using a multi-class blood RNA molecular signature.
The limitations of today’s pediatric diagnoses
In both hospital and community settings, children seeking medical care most commonly present with inflammatory and infectious diseases.
Of these, only a small portion have severe bacterial infections or inflammatory conditions, presenting clinical teams with the conundrum of identifying and treating this group appropriately without over-treating the majority, who have self-limiting viral infections.
“Conventional diagnostic tests cannot distinguish the multitude of potential etiologies with sufficient speed and accuracy to inform initial treatment. Culture-based microbiological diagnosis is slow, and while molecular diagnostic techniques are faster, they are limited by the pathogens included in the panel and positive results may identify pathogens that are not the cause of the current illness, particularly for respiratory samples.”
Conventional viral pathogen detection often identifies a single viral pathogen but fails to capture infections involving multiple interacting microbes, which limits its diagnostic usefulness.
Most severe infections are localized in hard-to-access sites (especially the lungs), resulting in false negative reports despite severe clinical infection symptoms. Inflammatory conditions, including Kawasaki disease (KD) and juvenile idiopathic arthritis, do not currently have tests to confirm or refute diagnosis, resulting in severe delays in treatment initiation, or worse, disease misidentification.
Alarmingly, fewer than half of children admitted with a fever, or even admitted to a pediatric intensive care unit, ultimately receive a definitive diagnosis.
This forces healthcare professionals to rely on interventions involving broad-spectrum antibiotics for even the most harmless infections, thereby contributing to the growing problem of antimicrobial drug resistance.
Recently, RNA sequencing (RNA-seq) has been explored as an alternative diagnostic approach, not limited by waiting times associated with conventional diagnostic procedures.
A growing body of research shows that transcriptional signatures in whole-blood samples can rapidly and accurately distinguish between bacterial and viral infections, dengue, malaria, rotavirus, respiratory syncytial virus, tuberculosis (TB), and inflammatory conditions, including systemic lupus erythematosus (SLE) and KD.
A noteworthy limitation of these studies is that they focus on simplified binary distinctions – one-versus-one (bacterial or viral infection) or one-versus-all (TB or any other disease) – thereby reducing their practical clinical applications.
About the study
The present study employs a hybrid feature selection and classification approach, combining the least absolute shrinkage and selection operator (LASSO) with Ridge regression, to address the limitations of previous research in the field.
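One common reading of a LASSO and Ridge hybrid is a two-stage procedure: an L1-penalized fit discards uninformative features, and an L2-penalized fit then classifies using only the survivors. The sketch below illustrates that idea on a toy binary problem in plain Python. It is not the authors' pipeline: the data, the penalty strengths, and the helper `fit_logistic` are all hypothetical, and the study itself is multi-class.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, l1=0.0, l2=0.0, lr=0.1, epochs=500):
    """Logistic regression fitted by proximal gradient descent.

    l1 applies soft-thresholding (LASSO-style, can set weights to exactly 0);
    l2 adds a Ridge-style shrinkage term to the gradient.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
                  for row, yi in zip(X, y)]
        b -= lr * sum(errors) / n
        for j in range(d):
            grad = sum(e * row[j] for e, row in zip(errors, X)) / n + l2 * w[j]
            step = w[j] - lr * grad
            # Soft-threshold: shrink toward zero and clamp small weights to 0.
            w[j] = math.copysign(max(abs(step) - lr * l1, 0.0), step)
    return w, b

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise
# that is perfectly balanced across both classes.
X, y = [], []
for value in (2.0, 1.5, 1.0, 2.5):
    for noise in (1.0, -1.0):
        X.append([value, noise])
        y.append(1)
        X.append([-value, noise])
        y.append(0)

# Stage 1: the LASSO-style fit keeps only features with non-zero weights.
w_l1, _ = fit_logistic(X, y, l1=0.1)
selected = [j for j, wj in enumerate(w_l1) if wj != 0.0]

# Stage 2: the Ridge-style fit classifies using the selected features only.
X_sel = [[row[j] for j in selected] for row in X]
w_l2, b_l2 = fit_logistic(X_sel, y, l2=0.1)
predictions = [int(sigmoid(sum(w * x for w, x in zip(w_l2, row)) + b_l2) >= 0.5)
               for row in X_sel]
```

Here the L1 stage drops the noise feature and the L2 stage classifies on the remaining one. In a gene expression setting, the same pattern would reduce thousands of probes to a small panel before fitting the final classifier.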
Researchers trained ML classifiers on 12 gene expression microarray datasets and subsequently tested model performance on an independent patient cohort whose whole-blood RNA-seq data was acquired.
To discover the biomarker panel used for model training, 12 publicly available microarray datasets of children (n = 1,212) with acute febrile illness and healthy controls were used.
Control data was used to batch correct results using the COmbat CO-Normalization Using conTrols (COCONUT) method. Patients for whom clinical validation of illness was available were included in the study, while those with multiple potential causative pathogens were excluded.
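The idea behind control-based batch correction can be sketched in a few lines. COCONUT, as its name indicates, builds on the ComBat method and uses healthy controls as the reference. The example below is a deliberately simplified stand-in that only centers and scales each dataset by its own controls, so it should be read as an illustration of the principle rather than the COCONUT algorithm. The expression values and the function name `normalize_to_controls` are hypothetical.

```python
from statistics import mean, stdev

def normalize_to_controls(case_values, control_values):
    """Express case values in units of the same dataset's healthy-control spread."""
    center, spread = mean(control_values), stdev(control_values)
    return [(value - center) / spread for value in case_values]

# Hypothetical expression values for one gene measured on two platforms
# whose raw scales differ by a factor of ten.
dataset_a = {"controls": [10.0, 11.0, 12.0], "cases": [14.0, 15.0]}
dataset_b = {"controls": [100.0, 110.0, 120.0], "cases": [140.0, 150.0]}

corrected_a = normalize_to_controls(dataset_a["cases"], dataset_a["controls"])
corrected_b = normalize_to_controls(dataset_b["cases"], dataset_b["controls"])
```

After correction, cases from both hypothetical platforms sit on the same scale, because each is measured relative to healthy controls from its own dataset.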
This resulted in a final dataset of 338 bacterial, 290 viral, and 487 inflammatory cases. Malaria was the only identified parasitic pathogen in the dataset (n = 97). This dataset was randomly divided into training (75%) and test (25%) data using a stratified holdout approach to maintain class proportions.
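A stratified holdout split can be sketched as follows, using the class counts reported above. The function name, the random seed, and the rounding rule are assumptions made for illustration; the article does not describe the authors' exact procedure.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class counts as reported in the article.
counts = {"bacterial": 338, "viral": 290, "inflammatory": 487, "malaria": 97}
labels = [name for name, n in counts.items() for _ in range(n)]
train_idx, test_idx = stratified_holdout(labels)

# Share of each class that landed in the test set (close to 0.25 for all).
test_share = {
    name: sum(labels[i] == name for i in test_idx) / n
    for name, n in counts.items()
}
```

Splitting within each class, rather than across the pooled data, keeps smaller groups such as malaria from being under-represented in either partition by chance.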
Five ML models were trained and assessed, of which the LASSO + Ridge hybrid model was identified as the best-fit model that allowed cost-sensitivity evaluation.
Cost-sensitivity (also called ‘cost-sensitive learning’) is a model penalization approach that uses the consensus judgment of multiple field experts to assign weights to the harms of disease misidentification or delayed treatment. This allowed predictions to be prioritized in favor of the conditions for which the consequences of misdiagnosis are most severe.
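The effect of such weights can be illustrated with a minimum-expected-cost decision rule. This is one simple way to apply a cost matrix at prediction time and is not necessarily how the authors incorporated costs into their model. The cost values and probabilities below are hypothetical.

```python
def min_expected_cost_class(probabilities, cost):
    """Pick the prediction with the lowest expected misdiagnosis cost.

    probabilities: {true_class: probability}
    cost[true_class][predicted_class]: penalty for that confusion (0 if correct).
    """
    classes = list(probabilities)
    expected = {
        predicted: sum(probabilities[true] * cost[true][predicted] for true in classes)
        for predicted in classes
    }
    return min(expected, key=expected.get), expected

# Hypothetical costs: missing a bacterial infection is penalized five times
# more heavily than treating a viral infection as bacterial.
cost = {
    "bacterial": {"bacterial": 0.0, "viral": 5.0},
    "viral": {"bacterial": 1.0, "viral": 0.0},
}
probabilities = {"bacterial": 0.3, "viral": 0.7}
choice, expected = min_expected_cost_class(probabilities, cost)
```

With these numbers the rule selects "bacterial" even though "viral" is more probable, because the expected cost of missing a bacterial infection (0.3 × 5 = 1.5) exceeds the expected cost of over-treating a viral one (0.7 × 1 = 0.7).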
While the above approach is helpful for specific disease identification and long-term clinical intervention, most pediatric cases, especially severe infections, require immediate treatment of the broad group of causative agents (bacterial, viral, or inflammatory).
To address this need, all data were recategorized as viral, bacterial, or inflammatory and reanalyzed. Since TB and KD differ significantly from other bacterial and inflammatory conditions, respectively, in their pathology, management, and transcript signatures, they were treated as independent classes.
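The relabeling step described above can be sketched as a simple mapping in which TB and KD keep their own labels while every other diagnosis collapses to its broad category. The sample list is hypothetical and only reuses conditions named in this article.

```python
from collections import Counter

# TB and KD are kept as their own classes, as described above.
INDEPENDENT_CLASSES = {"TB", "KD"}

def broad_class(disease, category):
    """Map a specific diagnosis to its broad class; TB and KD stay separate."""
    return disease if disease in INDEPENDENT_CLASSES else category

# Hypothetical (disease, category) pairs, for illustration only.
samples = [
    ("TB", "bacterial"),
    ("KD", "inflammatory"),
    ("juvenile idiopathic arthritis", "inflammatory"),
    ("respiratory syncytial virus", "viral"),
    ("malaria", "malaria"),
]
relabeled = [broad_class(disease, category) for disease, category in samples]
class_counts = Counter(relabeled)
```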
“These predictions allow the model to reflect the diagnostic classification used in clinical decision making and simultaneously address multiple clinical questions. The clinical teams can be provided with the probabilities for each patient to belong in each class as an optimal input for decision making.”
The final LASSO-Ridge hybrid model was then validated on an independent dataset comprising whole-blood RNA-seq data from 411 patients, covering all broad diagnostic classes and the 18 diseases under study.
Finally, ML models were benchmarked against previous one-versus-all studies using linear model coefficients, receiver operating characteristic (ROC), and area under the curve (AUC) measures.
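One-versus-all AUC can be computed per class by treating that class as positive and every other class as negative. The sketch below uses the rank interpretation of AUC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The predicted probabilities are hypothetical.

```python
def auc(scores, is_positive):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, p in zip(scores, is_positive) if p]
    negatives = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def one_vs_all_auc(probabilities, labels):
    """AUC per class, with that class as positive and all others as negative."""
    return {
        c: auc([row[c] for row in probabilities], [label == c for label in labels])
        for c in sorted(set(labels))
    }

# Hypothetical predicted class probabilities for four patients.
probabilities = [
    {"bacterial": 0.7, "viral": 0.2, "TB": 0.1},
    {"bacterial": 0.4, "viral": 0.5, "TB": 0.1},
    {"bacterial": 0.3, "viral": 0.5, "TB": 0.2},
    {"bacterial": 0.1, "viral": 0.2, "TB": 0.7},
]
labels = ["bacterial", "viral", "bacterial", "TB"]
aucs = one_vs_all_auc(probabilities, labels)
```

In this toy example the TB class is perfectly separated (AUC of 1.0), while the bacterial class scores 0.75 because one bacterial patient received a lower bacterial probability than a viral patient did.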
Study findings
The LASSO-Ridge ML model identified 161 RNA probes comprising 155 genes capable of distinguishing between 18 possible pediatric conditions. Since 10 genes were underrepresented across the datasets or represented transcripts that could not be sufficiently verified, 145 genes were defined as the final biomarker cohort.
Broad class analyses revealed that all six included classes (viral, bacterial, malaria, TB, KD, inflammatory) could be accurately distinguished in one-versus-one and one-versus-all analyses.
Test set prediction results revealed that ML models can reliably predict most diagnostic classes, albeit with prediction performance being a function of training sample size.
However, broad-class classification was reliable independent of training sample size, which highlights the potential of RNA-seq data to inform early pediatric disease interventions.
A notable limitation of this study is the current dearth of RNA-seq data for model training. Apart from the 18 conditions under investigation, most pediatric illnesses do not have sufficient publicly available case-cohort training data, which prevents the models from being extended to more conditions.
This is because high-throughput RNA-seq of whole-blood samples is currently expensive and requires facilities and technical expertise beyond the scope of most diagnostic clinics.
“To ensure clinical utility, further development of the approach will require large prospective patient cohorts, with consistent, detailed, and accurate clinical phenotypes. By expanding the range of conditions included in the discovery of the transcript panels, it may be possible to improve the treatment of a large number of patients, particularly for rare and under-diagnosed conditions for which early detection and thus treatment could have a significant benefit.”
Conclusions
The present study shows how ML models can efficiently utilize a single whole-blood sample to accurately and rapidly diagnose and distinguish between common pediatric ailments.
The LASSO-Ridge hybrid model was identified as the best-performing model after model penalization via ‘cost-sensitive learning,’ an approach that prioritizes the accurate diagnoses of life-threatening ailments over the misidentification of less-morbid conditions.
Whole-blood RNA-seq analysis has thus been shown, at the proof-of-concept stage, to be a rapid and reliable alternative to conventional clinical diagnostic approaches, which can take days or weeks and leave more than half of febrile children without a definitive diagnosis.
“…given appropriate clinical cohorts and gene expression datasets, it may be possible to expand this principle to other populations such as adults, patients with co-morbidities, and populations affected by pathogens specific to certain geographic areas, such dengue, arbovirus infections, or zoonotic illnesses such as Lyme disease and typhus, which pose considerable diagnostic challenges.”
Thus, this study represents a proof of concept that may usher in a new era in pediatric diagnoses, with potentially life-saving outcomes.
With the gradual decline in expenses associated with next-generation sequencing and broader adoption of these tools, future clinicians may have access to diagnostic information in a matter of hours, significantly reducing misidentification, improving clinical outcomes, and indirectly reducing the global burden of antibiotic-resistant pathogens.