Classification and prediction with very imbalanced group sample sizes: an illustration with COVID-19 testing
- Biometrics & Biostatistics International Journal
Xin Qiao, Yishan Ding, Chengbin Ying, Hong Jiao, George Macready
This study explored the prediction of COVID-19 test results using statistical classification methods applied to available COVID-related data, such as demographic and symptom information. The performance of logistic regression, machine learning models, and latent class analysis in predicting from extremely imbalanced COVID data was compared. One technical challenge for statistical classification methods, the extreme imbalance in group sample sizes in the COVID data, was addressed. Oversampling was applied to the training dataset to mitigate the impact of this data structure on the training process. Further, an adjusted pooled sampling method based on the statistical classification results was proposed to improve the efficiency of COVID testing. Results indicate that some machine learning models (e.g., support vector machine) performed better than the traditional logistic regression model and latent class analysis under extremely imbalanced data conditions. Further, oversampling increased the sensitivity of various statistical classification methods when different cut-off values were applied. The adjusted pooled sampling was shown to be more efficient than the traditional pooled sampling method.
Keywords: COVID-19 data, machine learning, logistic regression, latent class analysis, unbalanced samples
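The abstract mentions oversampling the training data and evaluating sensitivity at different cut-off values. The sketch below illustrates both ideas on toy data. It is not the authors' implementation: it assumes plain random oversampling with replacement (the abstract does not say which oversampling variant was used), and the helper names random_oversample and sensitivity, along with all data values, are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class.

    Assumption: simple random oversampling with replacement; the paper
    may have used a different variant.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def sensitivity(y_true, scores, cutoff):
    """Fraction of true positives (label 1) with score >= cutoff."""
    pos_scores = [s for y, s in zip(y_true, scores) if y == 1]
    if not pos_scores:
        raise ValueError("no positive cases in y_true")
    return sum(s >= cutoff for s in pos_scores) / len(pos_scores)


# Toy training set (hypothetical): 5 positives, 95 negatives.
train_rows = [[i] for i in range(100)]
train_labels = [1] * 5 + [0] * 95

# Oversample the training data only; the test data stays untouched.
bal_rows, bal_labels = random_oversample(train_rows, train_labels)
balanced_counts = Counter(bal_labels)

# Toy predicted probabilities (hypothetical) to show the cut-off effect.
y_true = [1, 1, 0, 0]
scores = [0.6, 0.3, 0.4, 0.1]
sens_default = sensitivity(y_true, scores, cutoff=0.5)
sens_lowered = sensitivity(y_true, scores, cutoff=0.25)
```

In this toy example, lowering the cut-off from 0.5 to 0.25 catches the second positive case, so sensitivity rises at the cost of also flagging one negative case. Oversampling is restricted to the training split so that evaluation still reflects the original class proportions.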