A novel hybrid supervised and unsupervised hierarchical ensemble for COVID-19 cases and mortality prediction … – Nature.com

Classification task

The target for this task comprised three classes of risk of new COVID-19 cases. Nine single classifiers, viz., Logistic Regression (GLM), Decision Tree, SVM with linear kernel, k-nearest neighbors (KNN), eXtreme Gradient Boosting (XGBoost), SVM with radial kernel (RBF), Random Forest, Naive Bayes, and a multilayer perceptron with three hidden layers and four neurons inside of each layer (MLP, c(4, 3, 3)), were used to compare the performance of the proposed ensembles.

Table 2 lists the most important features for the new COVID-19 case classification according to Boruta, Random Forest, and Decision Tree feature selectors (for each feature description, see Table 1). The listed features can help decision-makers select factors affecting COVID-19 spread and thus optimize medical care and/or restriction policy to minimize the epidemic impact, considering all aspects of human well-being.

The classification performance metrics for the nine weak classifiers and the proposed ensembles are summarized in Table 3. Among the single classifiers, the best results were obtained with the KNN model: Accuracy = 0.816, ROC-AUC = 0.797, and F1-score = 0.814. The developed ensembles increase all the metrics substantially. With Ensemble 1, Accuracy rose to 0.895, ROC-AUC to 0.897, and F1-score to 0.897. The proposed cut-off voting improvement in Ensemble 2 further increased all the metrics by approximately 2% compared to Ensemble 1 (Accuracy, ROC-AUC, and F1-score values are 0.912, 0.916, and 0.916, respectively). Hence, the developed hybrid hierarchical classifiers outperform single classification algorithms by more than 10% and are well-suited for COVID-19 spread prediction in real life.

Dynamic voting based on mathematical expectation is used. In this algorithm, the cut-off function of the classifier is trained in addition to the models themselves. Traditional stacking averages the base-model scores and cuts off by class with a constant coefficient of 0.5; with that fixed threshold, the efficiency of the algorithm drops sharply, to about 79%. The proposed cut-off method increases the overall efficiency of the ensemble by several percent. The essence of the algorithm is the choice of the cut-off coefficient. In this work, the voting input is a vector of independent classifier scores, which vote differently depending on the context. The method calculates the average score for each vote and adds it to a list of average scores. This list is treated as a set of independent scores to which the mathematical expectation is applied. The output is a cut-off coefficient close to the optimal class separation coefficient.
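The cut-off voting described above can be illustrated with a minimal Python sketch. It assumes binary (one-vs-rest) scores in [0, 1] and uses the plain sample mean as the empirical expectation; the function names and the sample scores are hypothetical and are not taken from the original implementation.

```python
from statistics import fmean


def learn_cutoff(score_vectors):
    """Return the empirical expectation of the per-sample average scores."""
    # One average per vote: the mean of the base-classifier scores.
    averages = [fmean(scores) for scores in score_vectors]
    # The expectation over the list of averages is the learned cut-off.
    return fmean(averages)


def vote(scores, cutoff=0.5):
    """Label a sample positive when its average score reaches the cut-off."""
    return int(fmean(scores) >= cutoff)


# Hypothetical scores from three base classifiers for five training samples.
train_scores = [
    [0.20, 0.30, 0.10],
    [0.35, 0.45, 0.40],
    [0.40, 0.50, 0.45],
    [0.30, 0.20, 0.25],
    [0.45, 0.55, 0.35],
]
cutoff = learn_cutoff(train_scores)

new_sample = [0.40, 0.45, 0.35]
fixed_label = vote(new_sample)            # constant 0.5 cut-off
dynamic_label = vote(new_sample, cutoff)  # learned cut-off
```

In this toy case the base classifiers are systematically conservative, so the constant 0.5 threshold rejects the new sample while the learned threshold accepts it.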

We used the nested fivefold cross-validation technique, as described in ref. 28, to perform additional tests. Nested cross-validation was used in addition to the usual fivefold cross-validation to validate the findings obtained with the proposed approach. Though this approach has its limitations, e.g., the assumption that the data splits are independent, it is widely used across the ML community. The difference between the Accuracy values across the five folds was 0.018. Next, we performed a more robust statistical test, viz., the Kolmogorov-Smirnov normality test. The obtained p-value was 0.793.
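The index bookkeeping behind nested fivefold cross-validation can be sketched as follows. This is an illustration only: it uses contiguous, unshuffled folds for simplicity, and the function names are hypothetical rather than those of any particular library.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_cv_splits(n_samples, outer_k=5, inner_k=5):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only
    from outer_train, so the outer test fold never influences tuning.
    """
    indices = list(range(n_samples))
    for test in k_fold_indices(n_samples, outer_k):
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        inner_splits = []
        for fold in k_fold_indices(len(train), inner_k):
            val = [train[j] for j in fold]
            val_set = set(val)
            inner_train = [i for i in train if i not in val_set]
            inner_splits.append((inner_train, val))
        yield train, test, inner_splits


splits = list(nested_cv_splits(50))
```

Hyperparameters are chosen on the inner splits, and each outer test fold is used once to score the model tuned without it.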

Table 4 shows the efficiency of the proposed ensembles for the whole dataset and for selected features. Feature selection allows for increasing all the analyzed metrics.

Regression task

For the regression task, the following regression models were used: linear model, polynomial regression, regression tree with CART algorithm, gradient boosted tree, random forest, L1 regularization for the linear model, and L2 regularization for the linear model. These models aimed to predict the number of confirmed COVID-19 cases and deaths. Table 5 summarizes the most important features affecting the prediction of the COVID-19 spread.

As follows from the comparison of Tables 2 and 5, virus pressure, i.e., a measure of virus transmission from neighboring counties, defined as the weighted average of the number of confirmed cases in the adjacent counties, is the most important feature for both classification and regression analysis. Besides, there is a subset of common features that were recognized as the most important in these two studies, viz., (i) the total population of the county, which is the second most important common feature, (ii) distance to the nearest international airport with average daily passenger load more than ten, (iii) daily average temperature, (iv) the longitude of the county barycenter, (v) number of total COVID-19 tests performed each day in the state of the county, and (vi) population ratio in the state. As we can see, the COVID-19 spread is affected by various factors: epidemiological, like the virus pressure; demographic, like the total population and population density; social, like the distance to the nearest international airport; climate, like daily average temperature; geographical, like the longitude of the county barycenter; and medical, like the number of total COVID-19 tests performed each day. These findings can help epidemiologists to analyze the spread and lifecycle of the virus and decision-makers to select the most important restriction factors and limitations to prevent the spread of the disease.
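Since virus pressure is defined above as a weighted average over adjacent counties, a short Python sketch may make the definition concrete. The weighting scheme is not specified in the text, so the weights below are purely illustrative, and the function name is hypothetical.

```python
def virus_pressure(neighbor_cases, weights):
    """Weighted average of confirmed cases in the adjacent counties."""
    if len(neighbor_cases) != len(weights):
        raise ValueError("one weight is required per adjacent county")
    total_weight = sum(weights)
    if total_weight == 0:
        # No adjacent counties (or all-zero weights): no external pressure.
        return 0.0
    weighted_sum = sum(c * w for c, w in zip(neighbor_cases, weights))
    return weighted_sum / total_weight


# Hypothetical county with three neighbors. The weights are an assumption
# (for example, a share of the common border), not the paper's definition.
pressure = virus_pressure([120, 40, 10], [0.5, 0.3, 0.2])
```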

Other factors affecting the number of COVID-19 cases and deaths, as seen in Table 4, are mainly social features, like social distancing, percentage of health-insured residents, median household income, and percent change in mobility trends in retail shops and recreation centers. The analysis of Table 2 reveals that, for the classification task, there are some additional factors affecting the chance of getting infected with coronavirus, viz., percentage of residents in the age group 25-29, immigrant student ratio, intensive care unit bed ratio, and the percent change in human encounters compared to the pre-COVID-19 period.

Table 6 lists the regression task performance evaluation for the six most common regression models and the proposed ensemble.

The proposed hybrid hierarchical ensemble combining both supervised and unsupervised learning allows us to increase the accuracy of the regression task by 11% in terms of MSE, 29% in terms of the area under the ROC curve, and 43% in terms of the MPP metric. Indeed, the ROC-AUC value increased from 0.609 for the best traditional regression model (Gradient Boosted Tree) to 0.790 for the proposed ensemble; MSE decreased from 112.6 to 101.3, and MPP from 18.8 to 13.1, respectively. Thus, using the proposed approach, it is possible to predict the number of COVID-19 cases and deaths based on demographic, geographic, climatic, traffic, public health, social-distancing-policy adherence, and political characteristics with sufficiently high accuracy.
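The percentages can be related to the raw values with a short calculation. The text does not state which denominator was used, so the sketch below computes the relative change against both the baseline value and the ensemble value; the function and variable names are illustrative.

```python
def relative_change(old, new, denominator="old"):
    """Absolute change between old and new, as a percentage of one of them."""
    base = old if denominator == "old" else new
    return abs(new - old) / base * 100.0


# (best single model, proposed ensemble) as reported in the text.
reported = {
    "ROC-AUC": (0.609, 0.790),
    "MSE": (112.6, 101.3),
    "MPP": (18.8, 13.1),
}

changes = {
    name: (
        relative_change(old, new, "old"),
        relative_change(old, new, "new"),
    )
    for name, (old, new) in reported.items()
}

for name, (vs_old, vs_new) in changes.items():
    print(f"{name}: {vs_old:.1f}% vs. baseline, {vs_new:.1f}% vs. ensemble")
```

Under this reading, the ROC-AUC figure is closest to the change relative to the baseline value (about 29.7%), while the MSE and MPP figures are closest to the change relative to the ensemble value (about 11.2% and 43.5%).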

Besides, we used a nested fivefold cross-validation technique (ref. 28) to perform a grid-search hyperparameter optimization. The tuning parameter α was set to a constant value of 1. RMSE was used to select the optimal model using the smallest value. The final values used for the model were α = 1 and λ = 0.211, with an MAE of 9.51, an RMSE of 20.11, and an R² value of 0.76.
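The selection rule described above (keep the candidate with the smallest RMSE, then report MAE, RMSE, and R2) can be sketched in plain Python. The validation targets, the candidate predictions, and the grid values used as dictionary keys are hypothetical; R2 is computed here as the coefficient of determination, which is an assumption about the definition used.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return sqrt(squared / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def select_by_rmse(y_true, candidates):
    """Return the grid value whose predictions give the smallest RMSE."""
    return min(candidates, key=lambda value: rmse(y_true, candidates[value]))


# Hypothetical validation targets and predictions for two grid values.
y_val = [10.0, 20.0, 30.0, 40.0]
candidates = {
    0.1: [12.0, 18.0, 33.0, 39.0],
    1.0: [15.0, 25.0, 25.0, 35.0],
}
best = select_by_rmse(y_val, candidates)
best_metrics = (
    mae(y_val, candidates[best]),
    rmse(y_val, candidates[best]),
    r_squared(y_val, candidates[best]),
)
```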

The developed way of cutting off the classifier or regressor, which is part of the ensemble, increases the overall efficiency of the ensemble by several percent. A vector of models with different contextual characteristics can provide reasonable generalized estimates.

Table 7 shows the efficiency of the proposed ensembles for the whole dataset and for selected features. Feature selection allows for increasing all the analyzed metrics.
