Computer-aided diagnosis of chest X-ray for COVID-19 diagnosis in … – Nature.com

This retrospective study was approved by the institutional review boards of eight hospitals (Kobe University Hospital, St. Luke's International Hospital, Nishinomiya Watanabe Hospital, Kobe City Medical Center General Hospital, Kobe City Nishi-Kobe Medical Center, Hyogo Prefectural Kakogawa Medical Center, Kita Harima Medical Center, and Hyogo Prefectural Awaji Medical Center); the requirement for acquiring informed consent was waived by the institutional review boards of these eight hospitals owing to the retrospective nature of the study. This study complied with the Declaration of Helsinki and Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan (https://www.mhlw.go.jp/file/06-Seisakujouhou-10600000-Daijinkanboukouseikagakuka/0000080278.pdf).

The CXR datasets used for developing and evaluating our DL model contain CXRs of the following three categories: normal CXR (NORMAL), non-COVID-19 pneumonia CXR (PNEUMONIA), and COVID-19 pneumonia CXR (COVID). Our DL model was developed using two public datasets (COVIDx and COVIDBIMCV) and one private dataset (COVIDprivate). COVIDx was built to accelerate the development of highly accurate and practical deep learning models for detecting COVID-19 cases (https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md)15. COVIDBIMCV was constructed from two public datasets: the PadChest dataset (https://github.com/auriml/Rx-thorax-automatic-captioning)16 and the BIMCV-COVID19+ dataset (https://github.com/BIMCV-CSUSP/BIMCV-COVID-19)17. COVIDprivate was based on the dataset previously collected from six hospitals, and the two public datasets (COVIDx and COVIDBIMCV) were the same as those in previous studies18,19. The details of these datasets are described in the Supplementary material. Compared with the previous study, CXRs were added to COVIDprivate in the current study: 37 NORMAL, 7 PNEUMONIA, and 31 COVID cases. COVIDprivate contained 530 CXRs (176 NORMAL, 146 PNEUMONIA, and 208 COVID).

In addition to COVIDprivate, CXRs were collected from two other medical institutions. In total, 168 CXRs (80 NORMAL, 37 PNEUMONIA, and 51 COVID) collected from one medical institution (Hospital A) were used for the internal validation of the DL model (as part of the validation set) and for the radiologists' reading practice conducted before the observer study. Moreover, 180 CXR cases (60 NORMAL, 60 PNEUMONIA, and 60 COVID) collected from another medical institution (Hospital B) were used as an unseen test set for the external validation of the DL model and for the observer study of radiologists.

At Hospital B, COVID was limited to patients diagnosed with COVID-19 pneumonia using RT-PCR whose CXR was obtained after symptom onset. The time of COVID-19 diagnosis was between January 24, 2020, and May 5, 2020. PNEUMONIA was defined as patients clinically diagnosed with bacterial pneumonia that improved with appropriate treatment. Patients who showed no pneumonia on CT or had lung metastasis of malignancy and acute exacerbation of interstitial pneumonia were excluded from PNEUMONIA. NORMAL was defined as the absence of abnormalities in the lung, mediastinum, thoracic cavity, or chest wall on CXR and CT. NORMAL and PNEUMONIA were limited to cases before the summer of 2019 (before the COVID-19 pandemic). The details of the unseen test set collected from Hospital B are described in the Supplementary material. The inclusion criteria for CXRs in COVIDprivate and Hospital A were the same as in the previous study19.

Table 1 lists the details of each CXR dataset. The 180 cases (the unseen test set) used for the external validation and reading sessions were adults aged 20 years or older. Among the 180 cases, NORMAL included 39 men and 21 women aged 58.1 ± 27.9 years. PNEUMONIA included 43 men and 17 women aged 76.2 ± 20.8 years. COVID included 46 men and 14 women aged 53.4 ± 38.6 years.

Our EfficientNet-based DL model was constructed in the same manner as described in previous papers18,19. Figure 1 shows a schematic of the construction of the DL model. There are two major differences in the DL model construction between the present study and the previous studies: first, the 168 CXRs collected from Hospital A were used for internal validation as part of the validation set; second, the 180 CXRs collected from Hospital B were used for external validation as the unseen test set. The DL model development set included the two public datasets, COVIDprivate, and the 168 CXRs collected from Hospital A. Five different random divisions into training and validation sets were created from the development set. In each division, 300, 300, and 90 images were randomly selected as the validation set from COVIDx, COVIDBIMCV, and COVIDprivate, respectively. The remaining images of COVIDx, COVIDBIMCV, and COVIDprivate were used as the training set. In addition, all 168 CXRs collected from Hospital A were included in the validation set. Model training was performed on the training set, and internal validation of diagnostic performance was performed on the validation set. The training of our DL model is also described in the Supplementary material.

Figure 1. Schematic illustration of dataset splitting and model training for our DL model. Abbreviations: DL, deep learning; COVIDx, public dataset used for COVID-Net; COVIDBIMCV, public dataset obtained from the PadChest and BIMCV-COVID19+ datasets; COVIDprivate, private dataset collected from six hospitals; Hospital A, dataset collected for internal validation and radiologists' practice before the observer study; Hospital B, dataset collected for external validation.
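The dataset splitting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `make_division`, the seeds, and the sizes of the two public datasets are hypothetical placeholders (only the COVIDprivate and Hospital A counts and the 300/300/90 validation counts come from the text), and the sketch assumes plain random selection without class stratification.

```python
import random

# Only the COVIDprivate (530) and Hospital A (168) sizes come from the text;
# the two public dataset sizes below are hypothetical placeholders.
DATASET_SIZES = {"COVIDx": 1000, "COVIDBIMCV": 1000, "COVIDprivate": 530}
# Number of images drawn from each dataset for the validation set (from the text).
VAL_COUNTS = {"COVIDx": 300, "COVIDBIMCV": 300, "COVIDprivate": 90}
HOSPITAL_A_SIZE = 168


def make_division(seed):
    """Create one random division as (train, val) lists of (dataset, index) pairs."""
    rng = random.Random(seed)
    train, val = [], []
    for name, size in DATASET_SIZES.items():
        indices = list(range(size))
        rng.shuffle(indices)
        n_val = VAL_COUNTS[name]
        val.extend((name, i) for i in indices[:n_val])
        train.extend((name, i) for i in indices[n_val:])
    # Every Hospital A image is placed in the validation set of every division.
    val.extend(("HospitalA", i) for i in range(HOSPITAL_A_SIZE))
    return train, val


# Five divisions, one per trained model; the seeds are arbitrary.
divisions = [make_division(seed) for seed in range(5)]
```

Each division yields a validation set of 300 + 300 + 90 + 168 = 858 images, with the remaining public and private images forming the training set.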

The inference results of the DL model were calculated using an ensemble of five trained models. For the 180 CXRs used for external validation, the probabilities obtained from the five trained models were averaged. This average was used both to evaluate the diagnostic performance of the DL model and to provide supporting information for radiologists during the observer study.
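The averaging step can be illustrated as follows. This is a hedged sketch rather than the authors' implementation: the function name `ensemble_average` and the example probabilities are made up for illustration, and each model is assumed to output one probability per class in the order NORMAL, PNEUMONIA, COVID.

```python
def ensemble_average(per_model_probs):
    """Average class probabilities over models for a single image.

    per_model_probs: list of per-model probability lists, all in the same
    class order (assumed here: NORMAL, PNEUMONIA, COVID).
    """
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]


# Illustrative outputs of five models for one CXR (made-up numbers).
example = [
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.0, 0.3, 0.7],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]
averaged = ensemble_average(example)
```

Because each model's probabilities sum to 1, the averaged probabilities also sum to 1.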

The DL model calculated the probabilities of NORMAL, PNEUMONIA, and COVID for each CXR, summing to 100%. We also created images using Grad-CAM and Grad-CAM++ as explainable artificial intelligence, which visualized the reasoning behind the diagnosis of the DL model20,21. The Grad-CAM and Grad-CAM++ images were used for the observer study. Min-max normalization with a linear transformation was performed on the original Grad-CAM and Grad-CAM++ images.
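Min-max normalization of a heatmap is a linear rescaling of its values to a fixed range. The sketch below assumes a target range of [0, 1] and a heatmap stored as a list of rows; both are assumptions, as is the handling of a constant heatmap (mapped to zeros), since the text does not specify these details. The function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(heatmap):
    """Linearly rescale a 2D heatmap (list of rows) to the range [0, 1].

    Assumption: a constant heatmap (max == min) is mapped to all zeros.
    """
    values = [v for row in heatmap for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in heatmap]
    scale = hi - lo
    return [[(v - lo) / scale for v in row] for row in heatmap]


# Illustrative 2x2 activation map (made-up numbers).
normalized = min_max_normalize([[2.0, 4.0], [6.0, 10.0]])
```

After normalization, the smallest activation maps to 0 and the largest to 1, so heatmaps from different images share a common display scale.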

Eight radiologists (with 5 to 20 years of experience in diagnostic radiology) performed the observer study at two medical facilities. For the 180 CXRs collected from Hospital B, each radiologist performed two reading sessions over a period of more than 1 month. One reading session was performed with reference to CXRs only, and the other was performed with reference to both CXRs and the results of the DL model. The order of the two sessions was randomly selected to reduce bias. The eight radiologists scored the probabilities of NORMAL, PNEUMONIA, and COVID on a 100% scale. In the reading session with the DL model, the radiologists referred to the probabilities of NORMAL, PNEUMONIA, and COVID calculated by the DL model. If there was any uncertainty regarding the probabilities of the DL model, the Grad-CAM and Grad-CAM++ results were available. The 168 CXRs collected from Hospital A were also processed with Grad-CAM and Grad-CAM++, and the DL model's diagnoses together with the Grad-CAM and Grad-CAM++ images of these 168 CXRs were presented to the radiologists in practice sessions before each reading session. The eight radiologists were taught how to interpret the Grad-CAM and Grad-CAM++ images before the observer study. There was no time limit for the reading and practice sessions. Prior to the reading sessions, only the approximate frequencies of the three categories were presented to the radiologists, and no other clinical information was provided. The novelty of this study lies in investigating whether radiologists changed their diagnoses by referring to our DL model for CXR and whether the diagnostic performance of the radiologists improved significantly.

After the observer study, one senior radiologist visually evaluated the 180 Grad-CAM++ images in the test set. The visual evaluation was performed on the images that were correctly diagnosed by the DL model. The radiologist visually examined the CXR and Grad-CAM++ images and determined whether the Grad-CAM++ images were typical or understandable. The typical Grad-CAM++ images are described in the Supplementary material. If abnormal findings on the CXR images were highlighted on the Grad-CAM++ images, the cases were considered understandable by the radiologist. In addition, for COVID, the radiologist counted the number of Grad-CAM++ images with highlighted regions outside the lung area.

We evaluated the diagnostic performance of the DL model alone and compared the results between reading sessions with and without the DL model. The evaluation metrics were accuracy, sensitivity, specificity, and area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Because three-category classification was performed, these metrics were calculated class-wise (one-vs-rest), except for accuracy. For the AUC, multi-reader multi-case statistical analysis was used to analyze the results of the eight radiologists. MRMCaov was used for the statistical analyses22. Although MRMCaov is a statistical method designed for binary classification, this study was designed to diagnose three categories: NORMAL, PNEUMONIA, and COVID. Therefore, the three-category classification was divided into three binary classifications (one-vs-rest): (1) NORMAL versus PNEUMONIA or COVID, (2) PNEUMONIA versus NORMAL or COVID, and (3) COVID versus NORMAL or PNEUMONIA. We then compared the class-wise AUC of the eight radiologists between reading sessions with and without the DL model. The difference in the AUC was statistically tested using MRMCaov. Because it was necessary to integrate the results from the eight radiologists, the class-wise MRMCaov was used in the present study. To control the family-wise error rate, Bonferroni correction was used; a p value less than 0.01666 was considered statistically significant. R (version 4.1.2) was used for the statistical analysis.
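The one-vs-rest decomposition can be illustrated for a single reader or model with the Python sketch below. It is not the MRMCaov analysis (an R package that pools results across readers), and the function names and example data are hypothetical. The sketch assumes that the predicted class is the one with the highest probability, that the AUC is the empirical (rank-based) estimate with ties counted as one half, and that the Bonferroni threshold corresponds to an overall significance level of 0.05 divided by three comparisons, which is consistent with the stated 0.01666 but not spelled out in the text.

```python
CLASSES = ["NORMAL", "PNEUMONIA", "COVID"]

# Assumption: overall alpha of 0.05 split over the three one-vs-rest comparisons.
BONFERRONI_ALPHA = 0.05 / len(CLASSES)


def predict(probs):
    """Return the class with the highest probability for one case."""
    return CLASSES[max(range(len(CLASSES)), key=lambda c: probs[c])]


def accuracy(labels, all_probs):
    """Three-class accuracy: fraction of cases whose top class matches the label."""
    hits = sum(predict(p) == y for y, p in zip(labels, all_probs))
    return hits / len(labels)


def one_vs_rest_metrics(labels, all_probs, positive):
    """Sensitivity, specificity, and empirical AUC for `positive` versus the rest.

    Assumes at least one positive and one negative case are present.
    """
    idx = CLASSES.index(positive)
    pos = [p for y, p in zip(labels, all_probs) if y == positive]
    neg = [p for y, p in zip(labels, all_probs) if y != positive]
    tp = sum(predict(p) == positive for p in pos)
    tn = sum(predict(p) != positive for p in neg)
    # Empirical AUC: probability that a positive case scores above a negative one.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p[idx] > n[idx]:
                wins += 1.0
            elif p[idx] == n[idx]:
                wins += 0.5
    return {
        "sensitivity": tp / len(pos),
        "specificity": tn / len(neg),
        "auc": wins / (len(pos) * len(neg)),
    }


# Illustrative data for four cases (made-up labels and probabilities).
labels = ["NORMAL", "PNEUMONIA", "COVID", "COVID"]
all_probs = [
    [0.80, 0.10, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.30, 0.60],
    [0.40, 0.35, 0.25],
]
covid_metrics = one_vs_rest_metrics(labels, all_probs, "COVID")
overall_accuracy = accuracy(labels, all_probs)
```

Repeating `one_vs_rest_metrics` for each of the three classes gives the three binary comparisons listed above; in the study, the resulting class-wise AUC differences were each tested against the Bonferroni-corrected threshold.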
