Background: People with COVID-19 infection exhibit a diverse set of symptoms that vary greatly in nature and severity. Cluster analysis is a popular machine learning technique for extracting homogeneous groups from a heterogeneous population; however, it often yields results that are challenging to interpret and act on. This study investigates a novel multistage clustering technique to produce well-separated clusters.
Objectives: To investigate whether a multistage clustering technique can be used to describe cohort clusters that share common symptomatology within a community COVID-19 registry.
Methods: We obtained data from US participants in a community-based COVID-19 registry known as CARE (https://www.helpstopcovid19.com/). Via a web platform, participants report their symptoms, as well as test results, risk factors, and treatment of COVID-19. We used data from 4,063 people with a COVID-19 test result who recorded symptoms between 30 July 2020 and 19 January 2021.
Our novel clustering methodology grouped COVID-19-positive participants according to similar symptom drivers of positive test predictions and visualised them in an interpretable two-dimensional space. Rules-based descriptions were then derived for the clusters in this two-dimensional embedding.
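The stages described above can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not name the predictive model, the attribution method, the embedding, or the clustering algorithm, so logistic regression, additive log-odds contributions, PCA, k-means, and a shallow decision tree are all stand-ins chosen for illustration. The symptom names, the synthetic data, and the choice of three clusters are likewise hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
symptoms = ["headache", "chills", "decreased_smell", "cough"]  # hypothetical names

# Synthetic stand-in data: binary symptom flags and a simulated test result.
X = rng.integers(0, 2, size=(500, len(symptoms)))
logit = -1.0 + 1.5 * X[:, 2] + 0.8 * X[:, 0]
y = (rng.random(500) < 1 / (1 + np.exp(-logit))).astype(int)

# Stage 1 (assumed model): predict the test result from symptoms.
model = LogisticRegression().fit(X, y)

# Stage 2 (assumed attribution): per-participant "symptom drivers", taken here
# as each symptom's additive contribution to the log-odds relative to the
# average participant.
drivers = (X - X.mean(axis=0)) * model.coef_[0]

# Stage 3 (assumed embedding and clustering): keep positive participants only,
# embed their drivers in two dimensions, then cluster the embedding.
pos = y == 1
embedding = PCA(n_components=2, random_state=0).fit_transform(drivers[pos])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)

# Stage 4 (assumed rule extraction): describe the clusters with shallow
# decision rules expressed on the raw symptoms.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[pos], labels)
rules = export_text(tree, feature_names=symptoms)
print(rules)
```

Clustering on the drivers rather than on the raw symptom flags weights each symptom by how much it moves the prediction, which is one plausible reason the resulting groups could separate more cleanly.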
Results: Of the 4,063 participants, 2,479 (61.0%) had a positive COVID-19 test result and were used in the final stage of clustering analysis to identify six distinct symptom presentations. The six clusters were 1: Asymptomatic (approximately 12% of the population); 2: Headache and chills, without decreased smell (7%); 3: Headache, without chills or decreased smell (13%); 4: Headache and decreased smell (29%); 5: Decreased smell without headache (18%); 6: Two or more symptoms that are not headache or decreased smell (20%). Fewer than 1% of participants could not be assigned to one of the six clusters.
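As an illustration of how rules-based descriptions of this kind could be applied, the sketch below encodes one reading of the six cluster labels as a function. The exact rules derived in the study are not given in the abstract, so the rule order, the symptom keys (`headache`, `chills`, `decreased_smell`), and the treatment of a participant with a single symptom other than headache or decreased smell as unassigned are all assumptions.

```python
from typing import Optional, Set


def assign_cluster(symptoms: Set[str]) -> Optional[int]:
    """Return a cluster number 1-6, or None when no rule matches.

    Hypothetical encoding of the cluster labels; not the study's rule set.
    """
    headache = "headache" in symptoms
    chills = "chills" in symptoms
    smell = "decreased_smell" in symptoms

    if not symptoms:
        return 1  # Asymptomatic
    if headache and smell:
        return 4  # Headache and decreased smell
    if headache:
        # Headache without decreased smell, split on chills
        return 2 if chills else 3
    if smell:
        return 5  # Decreased smell without headache
    if len(symptoms) >= 2:
        return 6  # Two or more symptoms, none of them headache or decreased smell
    return None  # Assumed unassigned: exactly one other symptom


print(assign_cluster({"headache", "chills"}))
print(assign_cluster({"cough"}))
```

Under this reading the rules are mutually exclusive, so each participant matches at most one cluster.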
For comparison, we applied the clustering stage of this framework to the raw symptom data and found that the resulting clusters were heterogeneous and lacked meaningful structure (81% of participants fell into only two of the clusters).
Conclusions: We have proposed a novel multistage clustering technique for identifying distinct groups of symptomatology in a community-based registry that records symptoms reported by people with COVID-19. Future work will investigate common demographic and clinical features exhibited by each cluster cohort and will map clusters to outcomes to better understand clinical presentation, risk factors, and prognosis.