Developing an artificial intelligence diagnostic tool for paediatric distal radius fractures: a proof of concept study
Publication: The Annals of The Royal College of Surgeons of England
Volume 105, Number 8
Abstract
Introduction
In the UK, 1 in 50 children sustains a fracture each year, yet studies have shown that 34% of children sustaining an injury do not have a visible fracture on initial radiographs. Wrist fractures are particularly difficult to identify because the growth plate poses diagnostic challenges when interpreting radiographs.
Methods
We developed Convolutional Neural Network (CNN) image recognition software to detect fractures in radiographs from children. A consecutive data set of 5,000 radiographs of the distal radius in children aged less than 19 years from 2014 to 2019 was used to train the CNN. In addition, transfer learning from a VGG16 CNN pretrained on non-radiological images was applied to improve generalisation of the network and the classification of radiographs. Hyperparameter tuning techniques were used to optimise the model, and its output was compared with the radiology reports that accompanied the original images to determine diagnostic test accuracy.
Results
The training set consisted of 2,881 radiographs with a fracture and 1,571 without; 548 radiographs were outliers. With additional augmentation, the final data set consisted of 15,498 images. The data set was randomly split into three subsets: training (70%), validation (10%) and test (20%). After training for 20 epochs, the diagnostic test accuracy was 85%.
Discussion
Diagnosing paediatric wrist fractures with a CNN model is feasible. We demonstrated that this application could be used as a tool to improve diagnostic accuracy. Future work would involve developing automated pathways for diagnosis and treatment, reducing unnecessary hospital visits and allowing staff redeployment to other areas.
Introduction
Wrist and distal forearm fractures are among the most common presentations to the emergency department (ED) in children and adolescents.1 The unique properties of the immature skeleton result in specific fracture patterns in children.2,3 Furthermore, the growth plate (physis), which is still open in children, poses particular challenges when analysing radiographs. Not only can the physis mimic the appearance of a fracture to less experienced clinicians, but the fracture can also involve the physis itself.
Radiographs of acute injuries are provisionally interpreted by front-line medical staff in the ED and then formally reviewed by a radiologist, whose written report may not be available until some time after the radiograph was taken. By this time, the patient may already have been discharged from the ED. When there is a significant discrepancy between the front-line staff's initial interpretation of the radiograph and the radiologist's report, the patient may have to be recalled to initiate a different management pathway.
Subtle injuries can be missed altogether at initial presentation. One study reported that 3.1% of all fractures were not diagnosed at the initial visit to the ED.4 The same study also observed a diurnal distribution of errors, with more occurring at night, potentially due to less senior support during night shifts. The specific challenges and factors contributing to missed fractures on radiographs in children have been well documented in the literature.1–3 To overcome these challenges, studies have investigated the feasibility of artificial intelligence (AI)-based, automated analysis of radiographs to support front-line clinicians in fracture detection. To date, most fracture-detection models have been developed for the distal radius.5–9 AI support tools can significantly improve clinicians' ability to correctly detect fractures.7
Most studies in the recent literature have focused on radiographs of adult patients. Specific paediatric AI models have been developed for the distal tibia and the elbow, but not for the wrist.10,11 Like humans, AI models seem to struggle more with paediatric images. In one study of distal radius fractures, a subgroup analysis found that fracture detection was less accurate on paediatric radiographs than on adult radiographs, with a sensitivity of 92.7% vs 97.5%.12 A potential explanation is that the AI algorithm was not trained on a dedicated paediatric data set.
Our study aimed to train an AI model to detect fractures in paediatric wrist radiographs. To our knowledge this is the first study of its kind to use a specifically prepared data set with paediatric wrist radiographs only. Further development of the AI algorithm will enhance radiographic image interpretation in clinical practice and support front-line clinicians.
Methods
The objective of this study was to train and optimise an AI model to detect wrist fractures in paediatric radiographs and to test the accuracy of the model. We trained a Convolutional Neural Network (CNN) using retrospective data from paediatric wrist radiographs and existing text-based reports to categorise the image as ‘fracture’ or ‘no fracture’. Ethics approval was granted by the Health Research Authority (REC 20/PR/0211). The TRIPOD statement checklist was applied for reporting the development of the CNN.13,14
Data source
A data set of 5,000 retrospectively collected paediatric wrist radiographs was obtained from St George’s University Hospitals NHS Foundation Trust. All consecutive wrist radiographs from 2014 to 2018 of patients less than 19 years old were included until a sufficient quota of 5,000 images was reached. Images of poor quality that were not amenable to interpretation were excluded. Using the radiology text reports that had been produced by consultant musculoskeletal radiologists, the radiographs were labelled as ‘fracture’ or ‘no fracture’. All radiographs with fractures were subcategorised according to the bones involved (Figure 1).
Data processing
Data collection and de-identification were performed by the direct clinical care team. Radiographs were stored on the PACS (picture archiving and communication system) server of the hospital in DICOM (digital imaging and communications in medicine) format. DICOM is a container format holding imaging data and text metadata; the metadata comprised personal information about the patient, such as name and date of birth. The radiographs were downloaded from PACS on a radiology workstation, using software tools that allow mass downloading (RadiAnt DICOM Viewer, https://www.radiantviewer.com or Conquest DICOM, https://ingenium.home.xs4all.nl/dicom.html).
The downloaded radiographs were stored securely on a hospital computer, with one file per radiograph. Any text metadata containing personally identifiable information was removed prior to sharing with Kingston University, where the AI model was trained. The information removed from the DICOM images included the patient's name, age, sex, birth date, hospital identity number, NHS number, ethnic group, occupation, referring physician, institution name and study date. The text metadata were converted to a spreadsheet (xlsx) file of entries pairing each accession ID with the presence or absence of a fracture. The DICOM files were then split into 'fracture' and 'no fracture' groups.
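The paper does not name the software used for metadata removal; purely as an illustration, the tag blanking could be scripted with pydicom (the library choice, tag list and function names below are assumptions, not the study's actual tooling):

```python
# Illustrative de-identification sketch using pydicom (the actual tool used
# by the clinical care team is not named in the paper; this is an assumption).
import pydicom

# DICOM keywords corresponding to the personal fields listed above.
TAGS_TO_BLANK = [
    "PatientName", "PatientAge", "PatientSex", "PatientBirthDate",
    "PatientID",          # hospital identity number
    "OtherPatientIDs",    # commonly holds identifiers such as the NHS number
    "EthnicGroup", "Occupation", "ReferringPhysicianName",
    "InstitutionName", "StudyDate",
]

def deidentify(in_path: str, out_path: str) -> None:
    """Blank personally identifiable metadata and save a de-identified copy."""
    ds = pydicom.dcmread(in_path)
    for tag in TAGS_TO_BLANK:
        if hasattr(ds, tag):
            setattr(ds, tag, "")  # blank the value; the tag itself remains
    ds.save_as(out_path)

deidentify("radiograph.dcm", "radiograph_deidentified.dcm")
```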
In a few instances, no corresponding ID number was found in the spreadsheet for an existing DICOM file; these files were set aside for later revision of the model. The radiographs and the radiology reports were anonymised before further processing, which included image file conversion, downscaling, cropping and data augmentation. Augmentation produced a larger labelled data set, intended to make the AI model more robust to real-life data variability.
The data set was in DICOM format, which is unsuitable for input into a CNN, and was therefore converted to high-resolution PNG (Portable Network Graphics) format using DICOM viewer software. The data set contained outliers (images whose atypical features strongly change the model's predictions), which in our data set were radiographs showing intra-osseous metal rods, heavy bandages or alignment tags. These outliers were removed and the data set was refined. There were also 239 joined images (two or three radiographs combined in a single image), which were split manually using image editing software on a Windows operating system.
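The study performed the conversion with viewer software; a scripted equivalent, shown here only as a sketch using pydicom and Pillow (both library choices are assumptions), might look like this:

```python
# Programmatic equivalent of the DICOM-to-PNG conversion described above
# (the study used DICOM viewer software; this scripted route is an assumption).
import numpy as np
import pydicom
from PIL import Image

def dicom_to_png(in_path: str, out_path: str) -> None:
    """Read a DICOM radiograph, rescale to 8-bit greyscale and save as PNG."""
    ds = pydicom.dcmread(in_path)
    pixels = ds.pixel_array.astype(np.float32)
    # Normalise the raw detector values to the 0-255 display range.
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels *= 255.0 / pixels.max()
    Image.fromarray(pixels.astype(np.uint8)).save(out_path)

dicom_to_png("radiograph.dcm", "radiograph.png")
```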
This data set was then used to create augmented images by applying horizontal and vertical flips as well as random zoom. The augmentations were applied using a Python script with a Keras image data generator function,15 repeated for all files in the fracture and no fracture image directories.
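A minimal sketch of this augmentation step using the Keras image data generator cited above;15 the zoom range, target size, batch size and directory names are assumptions, as the paper does not state them:

```python
# Sketch of the described augmentation step with the Keras ImageDataGenerator;
# exact parameter values and directory layout are assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    horizontal_flip=True,  # random horizontal flips
    vertical_flip=True,    # random vertical flips
    zoom_range=0.2,        # random zoom; the exact range is an assumption
)

# flow_from_directory expects one sub-directory per class,
# e.g. data/fracture and data/no_fracture.
generator = datagen.flow_from_directory(
    "data",
    target_size=(224, 224),   # VGG16's expected input size
    batch_size=32,
    class_mode="categorical",
    save_to_dir="augmented",  # write augmented copies to disk
    save_format="png",
)

# Drawing batches from the generator writes augmented images to 'augmented/'.
for _ in range(10):
    next(generator)
```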
Model training, testing and validation
Training these neural networks with a high batch size (Table 1) required a considerable amount of computational power, which was provided by a Google Colab Pro account. The account was configured with Weights & Biases, a third-party application providing an application programming interface that reports the metrics of the classification models trained in all experiments.
Table 1 Data set size specifications

|           | Fracture | No fracture |
|-----------|----------|-------------|
| Augmented | 6,457    | 4,589       |
| Real      | 2,881    | 1,571       |
| Total     | 9,338    | 6,160       |
The AI model was based on a CNN, a type of deep-learning architecture that performs convolution operations on the image multiple times to extract features from it. For a given input image and output, the model learns relevant visual features by convolving the image with learned filters. Supervised CNN-based models usually need to be trained on large amounts of labelled data. To reduce the need for an even larger data set and to make training faster, transfer learning from a VGG16 CNN16 pretrained on non-radiological images was applied. Our training data set was used to fine-tune the last hidden layers of the VGG16.
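A minimal sketch of this transfer-learning setup, assuming the Keras implementation used for the rest of the pipeline; exactly which layers were left trainable is an assumption:

```python
# Transfer-learning sketch matching the description above: VGG16 pretrained
# on ImageNet, with only the last convolutional block left trainable.
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze everything up to the final convolutional block (block5), so that
# only the last hidden layers are fine-tuned on the wrist radiographs.
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")
```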
The validation data set was used to optimise the model's parameters and to perform internal validation. The trained model's ability to detect the presence of fractures in paediatric wrist radiographs was evaluated on the test data set, previously unseen by the model.
Training
Initial models were trained to obtain an overview of the model's accuracy and bias. The data set was cleaned and checked for outliers; the outliers were removed and the model was retrained. VGG16 with ImageNet weights was used to train an image classification model with two classes. Further augmentations were applied to increase the diversity of the data set and improve model generalisation, helping the network-training process avoid learning irrelevant features and noise.
Statistical analysis
Model performance was evaluated based on the accuracy of the model on the validation set during training. The loss function of the neural network was defined as binary cross entropy, used to update the weights and biases of the model through back propagation. Because this was the first study performed on the current data set, an accuracy metric was selected to evaluate the feasibility of this approach by comparing the AI model's fracture detection on paediatric wrist radiographs against the radiologists' reports.
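Written out, the binary cross-entropy loss over a batch of N images, where yi is the true label (0 or 1) and ŷi the predicted fracture probability, is:

```latex
\mathcal{L}_{\mathrm{BCE}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr]
```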
Results
The final training data set consisted of 2,881 real radiographs with a fracture and 1,571 without a fracture; 548 radiographs were outliers (Table 1). The initial model tended to start overfitting after reaching 70% training accuracy. To reduce overfitting, additional images were added to the data set using augmentation techniques, which encouraged the models to generalise the key features of the data set instead of memorising them. A first round of augmentation yielded an accuracy of 65%, with overfitting still beginning at 70% training accuracy. Further data augmentation, producing a total of 6,457 augmented images with a fracture and 4,589 with no fracture, brought the overall data set to 15,498 images (Table 1); this helped the network-training process avoid learning irrelevant features and noise, after which model generalisation improved and overfitting on the validation set was reduced. Consequently, the validation accuracy of the model improved markedly to 85%.
Deep-learning models have a large number of parameters, which can lead to overfitting with small data sets. Therefore, it was necessary to split the data set into: (1) a training set, to train the model and update weights; (2) a validation set, to select the best model during the training process; and (3) a test set, to evaluate the selected model outcome and report the results. In this study, the data set was randomly split into three subsets: training (70%), validation (10%) and test (20%).
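An illustrative sketch of such a random split; the fixed seed and the representation of the data as a list of file paths are assumptions:

```python
# Illustrative random 70/10/20 split of the image file list; the seed value
# and list-of-paths representation are assumptions.
import random

def split_dataset(paths, seed=42):
    """Shuffle image paths, then split into training, validation and test."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(0.70 * n)  # 70% for training
    n_val = int(0.10 * n)    # 10% for validation
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]  # remaining ~20% for testing
    return train, val, test
```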
Each model was trained for 20 epochs, and the training accuracy and validation accuracy of the final model are shown in Figures 2 and 3. The model was based on VGG16 with a few additional layers appended. A flatten layer was added just before the final activation layer, and the final layer was a dense layer with a softmax activation function, used to avoid the problem of a vanishing gradient. The model had pretrained weights from ImageNet. Input was fed through the model, the weights were updated according to the training data, and inference was then made with these weights for any image passed through the input layer of the network.
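A sketch of this architecture as described, assuming the Keras API used elsewhere in the study; the dropout rate and the choice of optimiser are assumptions, as the paper does not state them:

```python
# Sketch of the described architecture: VGG16 base, flatten layer, dropout
# and a final two-class softmax dense layer; dropout rate and optimiser
# are assumptions not stated in the paper.
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

x = layers.Flatten()(base.output)   # 7 x 7 x 512 feature maps -> 25,088 neurons
x = layers.Dropout(0.5)(x)          # reduces overfitting (rate assumed)
outputs = layers.Dense(2, activation="softmax")(x)  # fracture / no fracture

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam",               # optimiser choice assumed
              loss="binary_crossentropy",
              metrics=["accuracy"])
```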
Training loss and validation loss are shown in Figures 2 and 3, respectively. The model was evaluated using accuracy measured on unseen (test) data, defined as the number of correct predictions divided by the total number of predictions. By this metric, the test data accuracy was 85%.
The CNN has a feature extraction layer and a classification layer. The classification layer, as shown in Figure 4, arranges the final output of the convolution block into a column. In the algorithm implemented here, this gives 25,088 neurons (the flattened 7 × 7 × 512 output of VGG16's final convolutional block), which are multiplied by the weights and summed. This sum is passed to the activation function, which produces the probability of each class, fires the respective neuron and a decision is made. The dropout layer ensures the activation function does not fire all the neurons in the network, which reduces overfitting.
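In equation form, the classification layer computes, for the flattened feature vector x:

```latex
\mathbf{p} = \operatorname{softmax}(W\mathbf{x} + \mathbf{b}),
\qquad \mathbf{x} \in \mathbb{R}^{25088},\;
W \in \mathbb{R}^{2 \times 25088},\; \mathbf{b} \in \mathbb{R}^{2}
```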
Grad-CAM, a third-party library, was used to visualise the important features used to make predictions. A heat map of the key features is generated, with red indicating regions of greater significance and decreasing intensity through to blue for regions of lower significance, as shown in Figures 5–7. This gives a better understanding of what the model 'sees' when classifying an image as fracture or no fracture. The highlighted regions include the background and boundaries of the image. Where no heat map colour is superimposed on an image, those features of the image were not considered in making the decision.
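The specific Grad-CAM library used is not named in the paper; the underlying computation can be sketched directly with TensorFlow's GradientTape (the target layer name below assumes the VGG16 backbone):

```python
# Sketch of the Grad-CAM computation using tf.GradientTape; the study used a
# third-party Grad-CAM library, so these implementation details are assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model

def grad_cam(model, image, conv_layer_name="block5_conv3"):
    """Return a [0, 1] heat map of the regions driving the predicted class."""
    # Model returning both the last conv feature maps and the prediction.
    grad_model = Model(model.inputs,
                       [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, tf.argmax(preds[0])]
    # Gradient of the winning class score w.r.t. the feature maps.
    grads = tape.gradient(class_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))      # per-channel importance
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)  # weighted feature sum
    cam = tf.nn.relu(cam)                                # keep positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()   # normalise to [0, 1]
```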
Discussion
Distal radius fractures account for around 25% of fractures in the paediatric population.17 The incidence is increasing, with an ensuing increased burden on ED services for diagnosis and treatment.17 The cost implications are not insignificant, with the cost of treating paediatric forearm fractures in the US quoted as $2 billion per year.18
In addition, taking a child to the ED often necessitates time off work for parents for initial diagnosis and further follow-up.1–4 Improving pathways and automating systems can therefore have a positive impact both in terms of hospital resources and reducing unnecessary outpatient attendances for the child and their carer.
Studies using CNN-based AI to detect fractures of the distal radius in adults are well reported in the literature,5,8,9,19 with accuracy, sensitivity, specificity and the Youden index all showing that CNNs can outperform a group of radiologists in diagnosing adult distal radius fractures.5,8 Current commercially available CE-marked applications of AI models in paediatric musculoskeletal radiology have concentrated mainly on bone age, bone health, fractures around the elbow and diagnosing child abuse from inflicted fractures.20
Few models exist to detect distal radius fractures, with a recent systematic review identifying only two studies using AI for the distal radius.21 Of these, the study by Zhang et al used ultrasound, whereas Dupuis et al used radiographs to detect fractures in the entire paediatric appendicular skeleton.22,23 The systematic review noted that the ultrasound study of the distal radius was subject to bias because the ultrasound was performed by medical students on a 'convenience sample' of suspected wrist fractures in children.22 The review also noted that the studies had strict exclusion criteria (healing bones, certain types of fracture, treatment with a cast), reducing the applicability of the models in clinical practice. The ultrasound study reported a diagnostic test accuracy of 92% for the distal radius, but its design was weakened by selection bias.22
In our study, a large consecutive data set was used with augmentation to a final data set of 15,498 images. The AI model classified images into 'fracture' and 'no fracture' groups to assist radiologists in the detection of bone fractures. To evaluate the model, all the reported metrics were based on the test data set, which had not been seen by the AI model during the training process. Therefore, the accuracy of the model represents model performance without overfitting on the data set.
For future work, the accuracy of the AI model would be calculated on a per-radiograph and a per-study basis. Per-radiograph true-positive determination requires a fracture diagnosis that corresponds to a fracture diagnosis in the radiology report. Per-study true-positive determination requires at least one true-positive for either or both of the radiographs of a typical radiographic examination (anteroposterior and lateral view of the same wrist). The sensitivity, specificity, positive and negative predictive values, and area under the receiver operating characteristic curves would be estimated with 95% confidence intervals. The model’s false-positive and false-negative predictions would undergo further evaluation by a second reading of the original radiographic images.
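As an illustration of how these per-radiograph metrics could be computed, the following sketch uses scikit-learn (a library-choice assumption) with placeholder labels and scores; confidence intervals could then be estimated, for example by bootstrapping:

```python
# Sketch of the proposed per-radiograph metrics using scikit-learn (library
# choice assumed); y_true and y_score below are illustrative placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # report labels
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.8, 0.6])  # model outputs
y_pred = (y_score >= 0.5).astype(int)                         # 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)          # true-positive rate
specificity = tn / (tn + fp)          # true-negative rate
ppv = tp / (tp + fp)                  # positive predictive value
npv = tn / (tn + fn)                  # negative predictive value
auc = roc_auc_score(y_true, y_score)  # area under the ROC curve
```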
Because the model was trained on a limited sample of images, further diverse data would need to be collected to make it robust to the different types of 'noise' in images. For example, radiographic imaging devices may have different noise levels that are imperceptible to the human eye yet can still affect model performance; even static electromagnetic noise in the imaging room can introduce bias into the data set.
In addition, the current classifier model provides a single outcome to assist radiologists. This could be improved in future by developing image 'segmentation' models and 'explainable AI' to highlight the exact region of the fracture. The segmentation approach requires radiologists to annotate the fracture region so that a new AI-based segmentation model can learn the regional features.
Conclusion
In conclusion, the diagnosis of paediatric wrist fractures with a CNN is feasible and could help radiologists reduce the time taken to diagnose a child's fracture. The CNN trained here was VGG16 with ImageNet weights, and augmentations were applied to the data set to reduce overfitting. The model's accuracy improved significantly over multiple experiments, reaching 85% on test data.
Acknowledgements
This work has been funded by a grant from AO UK.
References
1. Naranje SM, Erali RA, Warner WC et al. Epidemiology of pediatric fractures presenting to emergency departments in the United States. J Pediatr Orthop 2016; 36: e45–e48.
2. George MP, Brixby S. Frequently missed fractures in pediatric trauma: a pictorial review of plain film radiography. Radiol Clin North Am 2019; 57: 843–855.
3. Segal LS, Shrader MW. Missed fractures in pediatric trauma patients. Acta Orthop Belg 2013; 79: 608–615.
4. Hallas P, Ellingsen T. Errors in fracture diagnosis in the emergency department – characteristics of patients and diurnal variation. BMC Emerg Med 2006; 6: 4.
5. Gan K, Xu D, Lin Y et al. Artificial intelligence detection of distal radius fractures: a comparison between convolutional neural networks and professional assessments. Acta Orthop 2019; 90: 394–400.
6. Kim DH, MacKinnon T. Artificial intelligence in fracture detection: transfer learning from deep convolutional neural networks. Clin Radiol 2018; 73: 439–445.
7. Lindsey R, Daluiski A, Chopra S et al. Deep neural network improves fracture detection by clinicians. Proc Natl Acad Sci USA 2018; 115: 11591–11596.
8. Thian YL, Li Y, Jagmohan P et al. Convolutional neural networks for automated fracture detection and localization on wrist radiographs. Radiol Artif Intell 2019; 1: e180001.
9. Blüthgen C, Becker AS, Vittoria de Martini I et al. Detection and localization of distal radius fractures: deep learning system versus radiologists. Eur J Radiol 2020; 126: 108925.
10. Starosolski ZA, Kan H, Annapragada AV. CNN-based radiographic acute tibial fracture detection in the setting of open growth plates. bioRxiv 2019: 506154.
11. Rayan JC, Reddy N, Kan JH et al. Binomial classification of pediatric elbow fractures using a deep learning multiview approach emulating radiologist decision making. Radiol Artif Intell 2019; 1: e180015.
12. Cheng C-T, Ho T-Y, Lee T-Y et al. Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs. Eur Radiol 2019; 29: 5469–5477.
13. Collins GS, Moons KGM. Reporting artificial intelligence prediction models. Lancet 2019; 393: 1577–1579.
14. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMC Med 2015; 13: 1.
15. Keras Team. Keras documentation: image data preprocessing. Keras.io; 2021. https://keras.io/api/preprocessing/image/.
16. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. Conference paper, International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
17. Nellans KW, Kowalski E, Chung KC. The epidemiology of distal radius fractures. Hand Clin 2012; 28: 113–125.
18. Ryan LM, Teach SJ, Searcy K et al. Epidemiology of pediatric forearm fractures in Washington, DC. J Trauma Acute Care Surg 2010; 69: S200–S205.
19. Oka K, Shiode R, Yoshii Y et al. Artificial intelligence to diagnosis distal radius fracture using biplane X-rays. J Orthop Surg Res 2021; 16: 694.
20. Offiah AC. Current and emerging artificial intelligence applications for pediatric musculoskeletal radiology. Pediatr Radiol 2022; 52: 2149–2158.
21. Shelmerdine S, Liu H, Arthurs OJ, Sebire NJ. Artificial intelligence for radiological pediatric fracture assessment: a systematic review. Insights Imaging 2022; 13: 94.
22. Zhang J, Boora N, Melendez S et al. Diagnostic accuracy of 3D ultrasound and artificial intelligence for detection of pediatric wrist injuries. Children 2021; 8: 431.
23. Olczak J, Fahlberg N, Maki A et al. Artificial intelligence for analysing orthopaedic trauma radiographs. Acta Orthop 2017; 88: 581–586.
Publication information
The Annals of The Royal College of Surgeons of England, Volume 105, Number 8, November 2023, pages 721–728. PubMed: 37642151.
Copyright © 2023. All rights reserved by the Royal College of Surgeons of England.
Accepted: 2 March 2023. Published online: 29 August 2023. Published in print: November 2023.