Optimized network based natural language processing approach to reveal disease comorbidities in COVID-19 … – Nature.com

Network based word-embedding (mpDisNet)

The OMIM database was used to collect 394 disease types to be used in mpDisNet models. Results from the reproduced model show that the majority of the high-scoring disorders are cancer-related terms, as can be observed in the reproduction score distribution (Fig. 1). The distribution indicates that mpDisNet scores are highly biased towards cancer-related terms (Supplementary doc: Similarity mpdisnet.xlsx). We discovered 10,563 disease-disease associations with a score higher than 0.9, computed using vector cosine similarities. 6838 out of 10,593 disease similarities contain at least one cancer-related term, which constitutes nearly 65% of the scores higher than 0.9.
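As a rough illustration of how such counts can be obtained, the sketch below scores every disease pair by cosine similarity, keeps the pairs above a threshold, and reports the share of those pairs that contain a cancer term. The embeddings, the disease names, and the helper names `high_scoring_pairs` and `cancer_fraction` are hypothetical; matching on the substring "cancer" is a simplifying assumption, not the term list used in the study.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


def high_scoring_pairs(vectors, threshold=0.9):
    """Return (disease_a, disease_b, score) for pairs scoring above threshold."""
    pairs = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine_similarity(vectors[a], vectors[b])
        if score > threshold:
            pairs.append((a, b, score))
    return pairs


def cancer_fraction(pairs, keyword="cancer"):
    """Share of pairs in which at least one disease name contains keyword."""
    if not pairs:
        return 0.0
    hits = sum(1 for a, b, _ in pairs if keyword in a or keyword in b)
    return hits / len(pairs)


# Hypothetical 3-dimensional embeddings, for illustration only.
disease_vectors = {
    "breast cancer": [0.9, 0.1, 0.3],
    "lung cancer": [0.8, 0.2, 0.3],
    "asthma": [0.1, 0.9, 0.2],
    "epilepsy": [0.2, 0.8, 0.3],
}

pairs = high_scoring_pairs(disease_vectors)
print(len(pairs), cancer_fraction(pairs))
```

With real embeddings the same two functions would be applied to all disease vectors at once; only the threshold of 0.9 is taken from the text.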

Score distribution of the mpDisNet (reproduction model) that represents the effect of the cancer-term dominance in the disease interaction scores. (a) Reproduced model score distribution of all disease scores from the mpDisNet trial. (b) Score distributions of cancer-terms in the range between 0.9 and 0.95. (c) Score distributions of the remaining (non-cancer) disease interactions.

Scores of cancer-related terms, as shown in Fig. 1a, are likewise the main reason for the higher score accumulation between 0.95 and 1. This has a significant impact on the overall distribution of scores for diseases other than cancer. Because of the large amount of cancer-related research and the disease's complications, cancer is strongly linked to all other diseases, resulting in higher comorbidity ratings. Removing the cancer-related elements from the disease similarity scores reduces the score accumulation in the high score range, as seen in Fig. 1c.

When cancer terms and their linked miRNAs are removed from the training data, significant changes in the score distribution are observed. This change in the distribution also indicates that a large number of highly connected elements, such as cancer terms, reduces the occurrence of other disease representations in the model. Since multiple pathway dysfunctions emerge in cancers, a larger number of related miRNAs have been reported in the literature. Indeed, as shown in Table 2, the number of discovered miRNAs for cancer terms is large in comparison to the median number of miRNAs (Fig. 2) per disease used in this study. Cancer-related miRNAs constitute the outlier points in Fig. 3a and lead to a high number of occurrences of cancers in the training data, as shown in Fig. 3b, d. The imbalance in the number of miRNAs between cancer and non-cancer diseases leads to a dominant occurrence of cancer terms over other diseases, which increases the probability that cancers are randomly selected in different sub-networks in the corpus and causes over-training of their vectors.

Comparison of the median number of miRNAs related to cancer-type diseases and to the rest of the OMIM disease dataset.

Effects of variations of miRNA counts in diseases and the disease frequencies in the training data. (a) Boxplot of the number of miRNAs of each disease, indicating a narrow IQR (box) and a high number of outliers (circles). (b) Occurrence frequencies of each disease in the training data in the non-modified version. (c) Scatter plot of the mean score of each disease and its frequency in the training data. (d) Positive correlation between the disease frequencies and the number of miRNAs.

Further, unbalanced occurrence of words (diseases) causes instabilities such as vector update rate disparities between high- and low-frequency words (Fig. 3). The degree of learning for each condition will eventually be affected by differences in the number of updates of the individual disease vectors17. As a result, the reliability of disease interaction scores will differ between relatively rare disorders and high-frequency disorders. In NLP models, this property can be used to classify words by the importance of their semantic information. However, in disease representations, there is no difference between the diseases in terms of information value, i.e., diseases cannot be classified as more or less important in our context as words can in other NLP problems. This is the main difference between real words and word-like representations. By removing highly connected diseases, we aim to increase the score reliability of the remaining diseases and consequently increase the prediction performance.

Prior to applying the approach to COVID-19 to find possible comorbidities, we aimed to increase the disease interaction identification performance. The heterogeneous miRNA-gene-disease network approach is mostly conserved in our analysis, which is based on data from miR2Disease and HMDD miRNA-disease interactions9,18. The random walk method based on meta-paths has also been preserved. However, changes have been made to increase the accuracy of the network method. In contrast to the original architecture, we used a TF-TF interaction network rather than the PPI network in order to represent the regulatory mechanisms more precisely. The cosine similarity of the disease vectors, which is one of the distance metrics used for quantifying word similarities in NLP applications, was used to analyze disease similarities (comorbidities) for performance evaluation of the method.
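To make the meta-path random walk idea concrete, here is a minimal sketch on a toy heterogeneous network. Each walk follows a fixed sequence of node types and yields one "sentence" of node names that a word-embedding model could be trained on. The node names, the edge lists, and the meta-path disease → miRNA → gene → gene → miRNA → disease are all assumptions chosen for illustration; the text does not specify the meta-path scheme, so this should not be read as the authors' configuration.

```python
import random


def add_reverse_edges(edges):
    """Copy typed adjacency lists and add the reverse direction for mixed-type edges."""
    full = {key: {node: list(nbrs) for node, nbrs in adj.items()}
            for key, adj in edges.items()}
    for (src_type, dst_type), adj in edges.items():
        if src_type == dst_type:
            continue  # same-type edges are assumed to be listed in both directions
        reverse = full.setdefault((dst_type, src_type), {})
        for src, dsts in adj.items():
            for dst in dsts:
                reverse.setdefault(dst, []).append(src)
    return full


def meta_path_walk(edges, start, meta_path, rng):
    """Walk from start, choosing at each step a random neighbour of the required type."""
    walk = [start]
    current = start
    for src_type, dst_type in zip(meta_path, meta_path[1:]):
        neighbours = edges.get((src_type, dst_type), {}).get(current, [])
        if not neighbours:
            break  # dead end: return the partial walk
        current = rng.choice(neighbours)
        walk.append(current)
    return walk


def build_corpus(edges, diseases, meta_path, walks_per_node, seed=0):
    """Generate walks_per_node walks from every disease node."""
    rng = random.Random(seed)
    return [meta_path_walk(edges, d, meta_path, rng)
            for d in diseases for _ in range(walks_per_node)]


# Hypothetical toy network (gene nodes stand in for transcription factors).
toy_edges = add_reverse_edges({
    ("disease", "mirna"): {"D1": ["m1", "m2"], "D2": ["m2"], "D3": ["m3"]},
    ("mirna", "gene"): {"m1": ["TF1"], "m2": ["TF1", "TF2"], "m3": ["TF2"]},
    ("gene", "gene"): {"TF1": ["TF2"], "TF2": ["TF1"]},
})
META_PATH = ["disease", "mirna", "gene", "gene", "mirna", "disease"]

corpus = build_corpus(toy_edges, ["D1", "D2", "D3"], META_PATH, walks_per_node=2)
for sentence in corpus:
    print(" ".join(sentence))
```

In this sketch a disease with many miRNA neighbours is reachable through more paths and therefore tends to appear in more sentences, which is the frequency imbalance discussed above.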

miRNA expression profiles of SARS-CoV-2 infected cells were collected from the Wyler dataset19. 24-h mock-infected samples were used as control samples. Infected Calu3 cell samples were analyzed to identify differentially expressed miRNAs (Calu3 4 h, 12 h, 24 h). 39 significantly differentially expressed miRNAs were identified (adj. p-value < 0.05), including hsa-mir-4485, hsa-mir-483 and hsa-mir-155.

The score distributions in the reproduced version and in our version, with improvements in the disease list and the transcription-factor implementation, are shown as heatmaps for all disease scores (Fig. 4).

Heatmaps representing the similarity scores in the reproduction and TF-integrated networks. All diseases are placed on the x and y axes and colored based on their similarity scores from green (1) to blue (0). (a) Heatmap of the reproduction scores. (b) Scores after cancer-associated diseases are removed (sub-frame of part A that matches the diseases in part E). (c) Scores of the transcription-factor integrated model, used instead of Human PPI, with cancer terms. (d) TF model without cancer terms (sub-frame of part B that matches the diseases in part F). (e) Updated training without cancer terms, with Human PPI, and (f) updated training with the TF interaction network without cancer terms. Black represents a zero score, which indicates that no association between the diseases was found. The regions between the black ticks on the x and y axes in (a) and (c) indicate the cancer (right) and non-cancer (left) diseases.

Excluding cancer terms and their related miRNAs from the heatmap data resulted in higher scores for non-cancer disease relationships. Figure 4a depicts the reproduction score distribution, with cancer-related terms gathered in the top-right side of the graph, which also has the highest scores. It was transformed into Fig. 4b by deleting cancer-related rows and columns from the heatmap, which makes the effect of cancer-related terms on the score distribution clearly visible. Figure 4e, on the other hand, was produced by retraining disease pair scores after removing cancer diseases and their corresponding miRNA set from the training dataset of the model. It can be seen that the overall score profile for non-cancer diseases has improved. Figure 4c shows the distribution of disease similarity scores including cancer terms when TF-TF interaction data is used instead of Human PPI. In this case, some disease relationships were lost, and the majority of disease scores were reduced. However, the scores of some of the rare disease interactions increased, which may be of significance. Figure 4f shows that when cancer terms are omitted from the TF-TF trials, the effect on the scores is similar to that in the upper row, again demonstrating the cancer domination in models where cancer interactions were included. Score distribution differences between the TF-TF regulatory map implementation and PPI can be seen when Fig. 4c and f are compared. Although the scores of the TF network models are lower than those of the PPI network models, the training time of the models decreased greatly because of the smaller vocabulary when the TF network is used.

Several diseases were found to have no comorbidity with the rest of the diseases (Table 3). All of these diseases have quite a limited number of miRNA connections in the disease-miRNA data in the network. In the Reproduction and Cancerless models, most of them have only one miRNA interaction. When the TF-TF interaction data was used, the non-comorbid disease list expanded to include some disorders with two miRNA connections that are not linked to the transcription factor interactions in the network.

ROC (Receiver Operating Characteristic) curves are often used to evaluate the performance of prediction algorithms by presenting the true positive rate and false positive rate of predictions as a curve. For this evaluation, information on true positives and true negatives is needed. Compiling True Positives (comorbidities) from the literature is relatively easy despite the scarcity of verified disease-disease interactions. However, finding True Negatives is far more challenging, as there is no literature data that directly reports non-comorbid pairs of diseases. One way is to designate disease pairs with a low relative risk (RR) score or no interaction information as non-comorbid. This technique classifies comorbidities not yet reported in the literature as False Positives (FP) in ROC curve calculations. As a result, True Positive (TP) scores are hampered when each disease has a small number of known comorbidities. To address this negative bias on AUROC, more comorbidity data is needed to increase the TP/FP ratio. We expanded the amount of clinical data in the validation set used for mpDisNet. As a result, the AUROC performance was greatly improved over the original implementation.

The performance of the original implementation of mpDisNet in terms of AUROC (area under the ROC curve) was 0.65, which was higher than the AUROC of the overlap method (0.58), a simpler methodology that finds comorbidities by comparing shared miRNA ratios between two disorders. The key drawback of the ROC analysis in the original implementation is the high number of predicted disease-disease interactions, which is around 90,000 [n × (n − 1)], compared to the small number of known disease interactions, which is 81. To expand this list, the disease pairs with RR higher than 1.5 in the MediCare dataset and the comorbid disease list of 81 pairs were merged after all disease names were converted to ICD-9. The generated final disease-disease scores (Supplementary file: rev_over_15.xlsx) were converted to a pivot table using the pandas Python package20. The compiled data were visualized as a heatmap (Fig. 5) with matplotlib v3.4 and the seaborn Python package21. Disease similarity scores were used to calculate TP and FP rates against the compiled list of comorbidities, and ROC curves were drawn for all cases (Fig. 6). The main objective of this improvement is to maximize the TPR to better understand the algorithm's true discovery performance. However, as previously stated, the algorithm's false discovery rate cannot be changed, because all disease interactions that have not yet been documented in the literature must be labeled as 0, which leads to false positives.
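The evaluation described above can be sketched as follows: the positive set is the union of the curated pairs and the pairs whose RR exceeds 1.5, every other scored pair is labeled 0, and the AUROC is computed from the similarity scores. The disease codes, scores, and RR values below are placeholders, and the rank-based AUROC (the probability that a random positive pair outscores a random negative pair) is one standard way to compute the area; the study's own tooling for this step is not stated.

```python
def pair_key(a, b):
    """Order-independent key for an undirected disease pair."""
    return tuple(sorted((a, b)))


def build_positive_set(curated_pairs, relative_risk, rr_threshold=1.5):
    """Union of curated comorbid pairs and pairs with RR above the threshold."""
    positives = {pair_key(a, b) for a, b in curated_pairs}
    positives |= {pair_key(a, b)
                  for (a, b), rr in relative_risk.items() if rr > rr_threshold}
    return positives


def label_pairs(similarity, positives):
    """Label each scored pair 1 if known comorbid, else 0 (undocumented pairs)."""
    scores, labels = [], []
    for (a, b), score in sorted(similarity.items()):
        scores.append(score)
        labels.append(1 if pair_key(a, b) in positives else 0)
    return scores, labels


def auroc(scores, labels):
    """Rank-based AUROC; ties between a positive and a negative count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Placeholder data: A-D stand in for ICD-9 coded diseases.
similarity = {("A", "B"): 0.95, ("A", "C"): 0.40, ("A", "D"): 0.85,
              ("B", "C"): 0.30, ("B", "D"): 0.70, ("C", "D"): 0.90}
curated = [("A", "B")]
relative_risk = {("D", "C"): 2.1, ("A", "C"): 1.2, ("B", "D"): 1.6}

positives = build_positive_set(curated, relative_risk)
scores, labels = label_pairs(similarity, positives)
print(round(auroc(scores, labels), 3))
```

Because every undocumented pair is treated as a negative, a high-scoring pair that is truly comorbid but unreported (such as A-D in the toy data) lowers the AUROC, which is the bias the extended validation set is meant to reduce.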

Diseases that have a Relative-Risk (RR) score of at least 1.5 in US Medicare data, visualized as a heatmap with matplotlib v3.4 and the seaborn Python package. The full-sized heatmap can be found in Supplementary Fig. 1 (disease_heatmap.pdf) and the full list of disease comorbidity scores from MediCare data can be found in the supplementary material (rev_over_15.xlsx).

Receiver Operating Characteristic (ROC) curves of all models. (a) ROC curve for the reproduced model with the same setup as the original mpDisNet implementation. (b) ROC curve of the cancer-removed model scores compared to the limited known disease interactions (81 pairs). (c) TF-TF substituted model scores compared to the limited disease data. (d) Reproduction of the original model with extended known disease pairs. (e) Cancer-removed model with extended disease pairs. (f) Cancer-removed and TF-integrated model with extended disease pairs.

To determine whether the implementation of the TF-TF regulation mechanism has a beneficial effect on the discovery of comorbidities, a comparison between the PPI network and the transcription factor-implemented network was made.

Figure 6a presents the reproduction of the original implementation, and the same AUROC is reproduced as expected. The modifications to the model (removal of cancer terms and use of TF-TF instead of PPI) did not improve the results when they were evaluated using the limited clinical data, as seen in Fig. 6b and c. However, in the second row of Fig. 6, the use of extended clinical data significantly improves the AUROC of each model compared to its counterpart in the first row.

We further hypothesized that the correlation between the scores of a disease pair may be a more accurate measure of similarity or comorbidity between them. A high positive correlation of scores indicates that the two diseases have similar scores with other diseases, and hence have a common profile of similarities in their mechanisms. This approach also mitigates the impact of low-frequency disorders having low scores due to a lack of miRNA connections.

We tested two alternative correlation metrics: Pearson and Spearman correlations were calculated between the similarity scores of each disease pair using our mpDisNet results, as shown in Fig. 7. Although both metrics produced similar results, both Pearson and Spearman correlation-based performances were kept in the ROC curves to reduce the effect of possible methodological bias. When correlations are used for performance evaluation, we find that the scores of the model that includes cancer terms also show a slightly improved AUROC. The improvement is more obvious in the Cancerless model than in the other models. The similarity between diseases appears more prominent when we keep PPI and remove cancer terms, as seen in Fig. 7b. In addition, the concordance of the Spearman and Pearson correlations in the Cancerless model could be evidence of improved score reliability compared to the other models. However, because the similarity scores between disease pairs are not normally distributed, the non-parametric Spearman correlation coefficient may be the more appropriate choice. Therefore, only the Spearman correlation of vector similarity scores was used to determine possible COVID comorbidities in the following section.
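One way to read the correlation-based score is sketched below: for a disease pair, take each disease's similarity scores against all remaining diseases and correlate the two profiles, once with Pearson and once with Spearman (Pearson on average ranks). The score values are invented, and excluding the two diseases themselves from the profiles is an assumption of this sketch rather than a detail given in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0  # constant profile: correlation undefined, 0 by convention here
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def symmetric_scores(pair_scores):
    """Expand {(a, b): score} into a nested dict usable in both directions."""
    table = {}
    for (a, b), s in pair_scores.items():
        table.setdefault(a, {})[b] = s
        table.setdefault(b, {})[a] = s
    return table


def profile_correlation(scores, a, b, method=spearman):
    """Correlate the score profiles of a and b over all other diseases."""
    others = sorted(d for d in scores if d not in (a, b))
    return method([scores[a][d] for d in others],
                  [scores[b][d] for d in others])


# Invented cosine similarity scores for five placeholder diseases.
scores = symmetric_scores({
    ("D1", "D2"): 0.5, ("D1", "D3"): 0.9, ("D1", "D4"): 0.7, ("D1", "D5"): 0.2,
    ("D2", "D3"): 0.8, ("D2", "D4"): 0.6, ("D2", "D5"): 0.1,
    ("D3", "D4"): 0.4, ("D3", "D5"): 0.3, ("D4", "D5"): 0.6,
})

print(profile_correlation(scores, "D1", "D2", method=spearman))
print(profile_correlation(scores, "D1", "D2", method=pearson))
```

In the toy data D1 and D2 have a modest direct score (0.5) but rank the other diseases identically, so their profile correlation is high; this is the situation in which the correlation-based measure can lift a pair that has a low raw similarity score.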

Updated ROC representations of the three main approaches. (a) Reproduction data with no correction (blue), Pearson-correlated scores (gold), and Spearman correlation (green). (b) Cancer removed, with correlation implemented. (c) Cancer removed and TF-integrated network with correlation tunings.

Although our modifications to the algorithm reduced the scores in general, we observed that the low scores of some disease pairs in the reproduction model increased in the modified models. For example, the Rheumatoid Arthritis and Depression comorbidity score, which is one of the known comorbidities in the literature22, increased from 0.79 to 0.90. Another elevated score is between epilepsy and chronic hepatitis. Clinical evidence suggests that these two disorders are strongly comorbid and should be further examined23. The algorithm cannot provide any direction information for disease comorbidities, because the direction of comorbidities cannot yet be implemented in the network structure. It is therefore not possible to assume causality, such as one disease being the cause of another. However, these findings could indicate that patients with one disease may have a higher genetic and regulatory inclination towards another disease that has a high similarity score with the first24.

We have used our results to investigate the comorbidity of COVID-19 with other diseases as a case study. Highly scored diseases that are potentially comorbid with COVID-19 were retrieved from the cancer-removed network training results with Spearman correlation of scores. The threshold was set to 0.9, since it was found to be the optimum threshold for the Cancerless model in the ROC curve performance analysis. The algorithm found 156 diseases (Supplementary table: COVID_comorbs.xlsx) with a similarity score of more than 0.9 and a correlation of more than 0.95, indicating a strong link to COVID-19 through associated genes and miRNAs. There are also 57 disorders with a score of 0.8 to 0.9, which suggests a moderate link between these diseases and COVID-19. We identified high-scored associations with disorders that had clinical evidence of increased risk with COVID-19 on the CDC website, such as Diabetes (0.996), Heart Diseases (0.989), Schizophrenia (0.994), and Hypertension (0.994) (https://www.cdc.gov/coronavirus/2019-ncov/science/science-briefs/underlying-evidence-table.html) (Supplementary Data: CDC_Diseases.xlsx). When the Spearman correlation is applied to the result file, the number of probable COVID disease interactions (scores over 0.95) increases from 97 to 210. Also, the overall score of CDC diseases increased from 0.92 (stdev 0.054) to 0.98 (stdev 0.012).
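The selection step can be illustrated with a small filter that assigns each candidate disease to a tier from its similarity score and its Spearman correlation with COVID-19. The cut-offs (0.9, 0.95, and the 0.8 to 0.9 band) come from the text; the disease names and values are placeholders, and treating a candidate with similarity above 0.9 but correlation at or below 0.95 as "other" is an assumption, since the text does not say how such cases were handled.

```python
from statistics import mean, stdev


def link_tier(similarity, correlation,
              strong_sim=0.9, strong_corr=0.95, moderate_low=0.8):
    """Classify a candidate comorbidity by the thresholds quoted in the text."""
    if similarity > strong_sim and correlation > strong_corr:
        return "strong"
    if moderate_low <= similarity <= strong_sim:
        return "moderate"
    return "other"  # assumption: anything else is left unclassified


def summarize(values):
    """Mean and sample standard deviation of a list of scores."""
    return mean(values), stdev(values)


# Placeholder candidates: disease -> (similarity score, Spearman correlation).
candidates = {
    "disease_a": (0.97, 0.99),
    "disease_b": (0.85, 0.90),
    "disease_c": (0.95, 0.80),
    "disease_d": (0.40, 0.30),
}

tiers = {name: link_tier(sim, corr) for name, (sim, corr) in candidates.items()}
strong = sorted(name for name, t in tiers.items() if t == "strong")
moderate = sorted(name for name, t in tiers.items() if t == "moderate")

print(strong, moderate)
print(summarize([sim for sim, _ in candidates.values()]))
```

The `summarize` helper mirrors the mean and standard deviation reported for the CDC-listed diseases, applied here to the placeholder scores only.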

Encouragingly, we further found strong links between immune response and infection in diseases such as Hepatitis, Infectious Disorders, and several lung-related diseases. High scores were also found for vessel- and artery-related disorders, such as coronary artery disease and aortic aneurysm, and for renal-related diseases. Additionally, various neurological and psychological disorders, such as Alzheimer's, Parkinson's, Depression, and Schizophrenia, may raise the impact of COVID-19 according to our results. Indeed, this link was recently shown for Parkinson's Disease in the literature25.

While applying NLP models to disease similarity networks is a promising approach, there are some challenges that should be tackled. The first is bias in the data. As stated in the first part of the Results and Discussion, over-representation of diseases such as cancer and its subtypes can substantially skew the disease representations, as shown in Fig. 1. The choice of network also has an impact on the outcome. Since the final goal is to trace back the disease similarities and identify the potential genes and metabolic activities that mediate them, it is important to keep only the interactions that are mechanistically meaningful. The original reliance on human PPI may not have offered the mechanistic precision that a TF-TF interaction network could. While the benefits of integrating TF-TF were not immediately obvious, exploring specific subtypes of regulatory mechanisms in future models could improve performance further. A critical limitation of the initial approach was the scant validation data, which limited how robustly the model could be evaluated. Diseases, influenced by factors like genetics and environment, require a model that captures this complexity. Word2vec and similar embeddings, while powerful, risk oversimplifying these complexities. A holistic view, potentially achievable by integrating diverse data sources such as clinical records and genomic databases, is desirable. While the introduction of correlation metrics illuminates aspects of disease similarity, it is essential to distinguish between mere correlation and actual causation. Lastly, the presented model could provide a quick and broad perspective on disease comorbidities because it is easy to implement. However, while this quick glimpse is valuable in cases such as pandemics, a deeper investigation of the underlying causes and intricacies of these disease connections is essential.
Going forward, continuous refinement and validation are not just beneficial but crucial for these applications.
