Study participants were volunteers from the COVID Symptom Study Biobank (CSSB, approved by Yorkshire & Humber NHS Research Ethics Committee Ref: 20/YH/0298). Individuals were recruited to the CSSB via the ZOE COVID Study (ZCS)56 using a smartphone-based app developed by Zoe Ltd, King’s College London, the Massachusetts General Hospital, Lund University, and Uppsala University, launched in the UK on 24 March 2020 (approved by the King’s College London Ethics Committee LRS-19/20–18,210). Via the app, participants self-report demographic information, symptoms potentially indicative of COVID-19 disease (through both closed/polar questions and free text), any SARS-CoV-2 testing, and SARS-CoV-2 and influenza vaccinations. Participants can be invited via email to participate in other studies, according to eligibility.
In October 2020, prior to UK vaccination roll-out, 15,564 adult participants from the ZCS were invited to join the CSSB. Invited participants had: (a) a self-reported SARS-CoV-2 test result, either a swab test (in this time period RT-PCR) at the start of illness or a subsequent antibody test, whether positive or negative; and (b) logged at least once every 14 days since start of illness, or since the start of logging if asymptomatic.
Initially, individuals were recruited in four groups based on understanding of Long COVID at that time, and prior to definitions being published: (1) Asymptomatic, with confirmed infection; (2) Short illness (≤ 14 days) after confirmed infection; (3) Long illness (≥ 28 days)5 after confirmed infection; and (4) Long symptom duration (≥ 28 days) but with a negative test for SARS-CoV-2 infection. Invited individuals were matched across these four groups by Euclidean distance for age, sex and BMI9. Participants were invited by email, and consented separately into the CSSB. Participants were sent home sampling kits by post in November 2020, and returned capillary blood samples for metabolomic analysis. This also enabled antibody testing using an ELISA method57 to confirm prior infection status of all participants, the current standard for retrospective ascertainment58. A subgroup also consented to send in stool samples for analysis of their gut microbiome.
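As a minimal sketch of matching by Euclidean distance, the example below standardises age, sex and BMI and then picks the nearest candidate for one individual. The matching procedure is not specified beyond the distance metric, so the z-score scaling, the 0/1 coding of sex, the nearest-neighbour rule, and all names and values here are illustrative assumptions rather than the study's recruitment algorithm.

```python
import math

def zscore_columns(rows):
    """Standardise each column to mean 0, SD 1 so no covariate dominates."""
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        scaled_columns.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]

def nearest_match(target, candidates):
    """Index of the candidate closest to `target` by Euclidean distance."""
    return min(range(len(candidates)),
               key=lambda i: math.dist(target, candidates[i]))

# Hypothetical rows: (age in years, sex coded 0/1, BMI in kg/m^2)
people = [
    (52, 1, 27.1),  # index 0: the individual to be matched
    (30, 0, 22.0),
    (51, 1, 26.5),
    (70, 1, 31.0),
]
scaled = zscore_columns(people)
best = nearest_match(scaled[0], scaled[1:]) + 1  # +1 maps back into `people`
```

Standardising first keeps age (tens of years) from swamping BMI and sex in the distance; a production matcher would also need to prevent one candidate being matched twice.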
Study participants were subsequently aligned using symptoms ascertained up to sample collection date, and the permissible gap in logging was further tightened to 7 days to increase accuracy of classification. Long-COVID groups were reshaped to match definitions published in November 20205 (see Table 5) corresponding to Ongoing Symptomatic COVID-19 (28–83 days, OSC28) and Post-COVID-19 Syndrome (≥ 84 days, PCS84). The same duration thresholds were applied to participants who reported symptoms around a negative test for SARS-CoV-2, who were presumed to have a non-COVID-19 illness. This yielded six groups for comparison: four SARS-CoV-2 positive groups (Asymptomatic, Acute COVID-19 (≤ 7 days), OSC28, PCS84) and two SARS-CoV-2 negative groups (Non-COVID-19 illness 28–83 days (NC28), Non-COVID-19 illness ≥ 84 days (NC84)).
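The six-group scheme can be written as a short rule set. The sketch below is illustrative only: the function name, the use of 0 days to denote an asymptomatic course, and the "Unclassified" label for durations outside the six primary groups are assumptions. A duration of exactly 84 days is assigned to the longer category, following the ≥ 84 days threshold given for NC84.

```python
def classify_duration_group(test_positive, symptom_days):
    """Assign one of the six primary comparison groups, or 'Unclassified'.

    test_positive: True for a positive SARS-CoV-2 test, False for negative.
    symptom_days:  total symptom duration in days (0 = no symptoms logged).
    """
    if test_positive:
        if symptom_days == 0:
            return "Asymptomatic"
        if symptom_days <= 7:
            return "Acute COVID-19"
        if 28 <= symptom_days <= 83:
            return "OSC28"
        if symptom_days >= 84:
            return "PCS84"
    else:
        if 28 <= symptom_days <= 83:
            return "NC28"
        if symptom_days >= 84:
            return "NC84"
    # Durations of 8-27 days, and short or absent illness after a negative
    # test, fall outside the six primary groups.
    return "Unclassified"
```

For example, `classify_duration_group(True, 40)` returns `"OSC28"`, while a positive participant with 15 days of symptoms is left unclassified.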
To check that groupings assigned by the recruitment algorithm were clinically accurate, symptom logging maps were scrutinised in a subsample (n = 115) by two researchers (MFÖ, CJS), independently and blind to algorithmic phenotype classification, before analysis. Final categories are detailed in Table 5.
Due to changes in logging stringency criteria, some participants also fell into additional, shorter categories of illness duration, detailed in Supplementary Table 1. These additional phenotypes have been reported in supplementary data tables, but not included in primary analysis as they were not recruited for this purpose, and their classification is less certain.
Capillary blood samples were obtained between November 2020 and January 2021, when participants had recovered. Samples were returned in plasma collection tubes; in initial processing, 20 µL was used for serology and the remainder was frozen. Samples were processed in March/April 2021 by Nightingale Health Oyj (Helsinki, Finland) using high-throughput nuclear magnetic resonance metabolomics, measuring 249 metabolites including lipids, lipoprotein subclasses with lipid concentrations within fourteen subclasses, lipoprotein size, fatty acid composition, and various low-molecular weight metabolites including amino acids, ketone bodies and glycolysis metabolites33. Of these, 37 are certified for clinical diagnostic use and formed the focus of our analysis (referred to herein as “Clinically Validated”)59. Quality control was performed and reported by Nightingale Health. Due to postal transit time, glucose, lactate, and pyruvate could not be assessed and have been excluded from analyses, and creatinine was unavailable. No concerns were raised about the other biomarkers. Metabolites measured using this panel have previously been associated with the risk of hospitalisation for COVID-19 in the UK Biobank19, including 25 of the clinically validated biomarkers used in an Infectious Diseases risk prediction score (ID Score) derived using Lasso regression19.
Sample collection and faecal sample processing
Two faecal samples per individual were collected and returned by post: faecal material from both collection tubes was homogenised in a Stomacher® bag, aliquoted and stored at −80 °C. The first 301 samples that would maintain a balance for BMI, age, and sex were selected for a pilot investigation of microbiome differences.
DNA extraction and sequencing
Genomic DNA (gDNA) was isolated from 1 g faecal sample, using a modified protocol of the MagMax Core Nucleic Acid Purification Kit and MagMax Core Mechanical Lysis Module60. Libraries were prepared using the Illumina DNA Prep (Illumina Inc., San Diego, CA, USA) following the manufacturer’s protocol. Libraries were sequenced (2 × 150 bp reads) using the S4 flow cell on the Illumina NovaSeq 6000 system.
Metagenome quality control and pre-processing
All metagenomes were quality controlled using the pre-processing pipeline (available at https://github.com/SegataLab/preprocessing). Briefly, pre-processing consisted of three main steps: (i) read-level quality control, (ii) removal of host sequence contaminants, and (iii) splitting and sorting of cleaned reads. Read-level quality control removed low-quality reads (quality score < Q20), fragmented short reads (< 75 bp), and reads with ambiguous nucleotides (> 2 Ns), using trim-galore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Host contaminant DNA sequences were identified using Bowtie 261 with the “--sensitive-local” parameter to remove both the phiX 174 Illumina spike-in and human-associated reads. Splitting and sorting allowed for creation of standard forward, reverse, and unpaired reads output files for each metagenome.
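To make the three read-level thresholds concrete, the sketch below applies them to a single read. It is a simplified stand-in and not how trim-galore works internally: it assumes Phred+33 quality encoding and interprets the Q20 criterion as a mean-quality cut-off, and the function names are hypothetical.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)

def passes_read_qc(sequence, quality_string, min_q=20, min_len=75, max_n=2):
    """Apply the three read-level filters described in the text."""
    if len(sequence) < min_len:            # fragmented short read (< 75 bp)
        return False
    if sequence.upper().count("N") > max_n:  # more than 2 ambiguous bases
        return False
    return mean_phred(quality_string) >= min_q  # low-quality read (< Q20)

# Hypothetical reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
good_read = passes_read_qc("ACGT" * 25, "I" * 100)
short_read = passes_read_qc("ACGT" * 10, "I" * 40)
```

Host-read removal and the splitting of paired and unpaired reads are separate steps and are not represented here.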
Taxonomic and functional profiling
The metagenomic analysis was performed using the bioBakery 3 suite of tools62. Taxonomic profiling and estimation of species’ relative abundances were performed with MetaPhlAn 3 (v. 3.0.7 with “--stat_q 0.1” parameter)62,63. MetaPhlAn 3 taxonomic profiles were used to compute three alpha diversity measures: (i) the number of species with positive relative abundance in the microbiome (‘Richness’), (ii) the Shannon diversity index, independent of richness, which measures how evenly microbes are distributed64, and (iii) the Simpson diversity index, which accounts for the proportion of species in a sample65. Similarly, species-level relative abundances were used to estimate microbiome dissimilarity between participants (beta diversity) using the Bray–Curtis dissimilarity metric, which accounts for the shared fraction of the microbiome between two individuals and their relative abundance values66. Functional potential profiling of metagenomes was performed with HUMAnN 3 (v. 3.0.0.alpha.3 and UniRef database release 2019_01)62,67, which produced pathway profiles and gene family abundances. We assessed beta diversity by computing a Principal Coordinates Analysis (PCoA)/Metric Multidimensional Scaling (MDS) based on the pairwise Bray–Curtis dissimilarity metric.
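The alpha and beta diversity measures named above have simple closed forms, sketched here in plain Python for relative-abundance vectors. Two choices are assumptions: the Shannon index uses the natural logarithm, and the Simpson index is given in its 1 − Σp² (Gini–Simpson) form, since the text does not state which variants were used. The example profiles are made up.

```python
import math

def richness(abundances):
    """Number of species with positive relative abundance."""
    return sum(1 for a in abundances if a > 0)

def _proportions(abundances):
    total = sum(abundances)
    return [a / total for a in abundances if a > 0]

def shannon(abundances):
    """Shannon index H = -sum(p * ln p) over species with p > 0."""
    return -sum(p * math.log(p) for p in _proportions(abundances))

def simpson(abundances):
    """Simpson diversity in the 1 - sum(p^2) form (an assumed variant)."""
    return 1.0 - sum(p * p for p in _proportions(abundances))

def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity: sum|a - b| / sum(a + b)."""
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    denominator = sum(a + b for a, b in zip(sample_a, sample_b))
    return numerator / denominator

# Hypothetical three-species profiles for two participants.
profile_a = [0.5, 0.5, 0.0]
profile_b = [0.5, 0.0, 0.5]
dissimilarity = bray_curtis(profile_a, profile_b)
```

A matrix of such pairwise dissimilarities is the input to PCoA/MDS, which is not reproduced here.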
BMI was derived from self-reported weight and height. Other self-reported information (obtained from ZCS app-reported data) included smoking and co-morbid illness (‘heart disease’, ‘diabetes’ (and type of diabetes), ‘lung disease’ (including asthma), hay fever, eczema, ‘kidney disease’ and current cancer (type, and cancer treatment)). Address data were linked to the UK Index of Multiple Deprivation (IMD), with the IMD rank decile used as a categorical variable measuring local area deprivation68,69,70,71. Frailty was assessed using the Prisma-7 scale, with a score > 2 indicating frailty72.
A subset of participants had participated in a dietary assessment during the COVID-19 pandemic, also recruited through the ZCS (published previously73,74). This included detailed information on vitamin supplementation (including omega-3 oils), physical activity, alcohol consumption, dietary habits and a food frequency questionnaire. These data were used to derive a diet quality score73, and a plant-based diet index73, analysed as continuous variables. Both have previously been associated with cardiovascular disease75, Type 2 Diabetes76, a lower risk of COVID-19 illness, and a lower risk of hospitalisation for COVID-19 during the early waves of the pandemic73.
The statistical analyses were performed using R software (v. 4.0.5) and Stata (v.17, StataCorp). Baseline characteristics were described by frequency and percentages. Descriptive data on those invited and those enrolled are presented in Supplementary Tables 2 and 3. Metabolites were all log-transformed and standardised (mean 0, standard deviation 1) as per protocol19. To account for zero values, a pseudo-count of 1 was added to all values prior to log transformation.
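The transformation amounts to computing z = (ln(x + 1) − mean) / SD for each metabolite. The sketch below shows this for one metabolite, assuming the natural logarithm and the sample standard deviation; the function name and values are hypothetical.

```python
import math
import statistics

def log_standardise(values, pseudo_count=1.0):
    """Add a pseudo-count, log-transform, then scale to mean 0 and SD 1."""
    logged = [math.log(v + pseudo_count) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # sample SD (n - 1 denominator)
    return [(x - mean) / sd for x in logged]

# Hypothetical concentrations for one metabolite, including a zero value.
concentrations = [0.0, 1.2, 3.4, 0.8]
z_scores = log_standardise(concentrations)
```

Because the values are standardised afterwards, the choice of logarithm base does not affect the result: changing base rescales every logged value by the same constant, which the division by the standard deviation removes.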
Initial analysis examined the association between duration of illness and each metabolite individually, using multinomial logistic regression (all adjusted for age, sex, and BMI). The Asymptomatic group was used as the reference category of the outcome variable. We also performed a secondary analysis, using the non-COVID-19 participants as the reference category. The primary analysis was then extended to assess the association between length of illness and ID Score, with Asymptomatic as the reference category. The Benjamini–Hochberg false discovery rate method was used to correct P-values for multiple testing77.
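For reference, the Benjamini–Hochberg adjustment can be sketched in a few lines: each P-value is multiplied by m/rank, where m is the number of tests and rank is its position in ascending order, and monotonicity is then enforced from the largest P-value downwards. The function name and the example P-values are illustrative and are not study results.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that each adjusted
    # value never exceeds the adjusted value of any larger P-value.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Made-up P-values for four tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The adjusted values correspond to what R's `p.adjust(p, method = "BH")` reports.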
To assess potential confounders, we performed eight sensitivity analyses additionally adjusting for: (1) self-reported co-morbidities (cardiovascular disease and diabetes), (2) Frailty, (3) IMD rank decile, (4) lifestyle variables (smoking status, frequency of alcohol consumption, frequency of physical activity), (5) self-reported use of any health supplement, (6) self-reported use of Omega-3 containing supplements, (7) Diet quality score, and (8) Healthy plant-based diet index (hPDI).
For microbiome analyses, differences between the alpha diversity distributions of the groups were assessed using the Wilcoxon rank-sum test within the ‘RClimMAWGEN’ package (P-value ≤ 0.05 considered significant). With a sample size of 300 individuals, we have 79% power at the 0.05 significance level, assuming a low effect size of 0.20. PERMANOVA, from the ‘adonis2’ function of the ‘vegan’ package, was used to test for differences between groups based on the beta diversity computed from the PCoA/MDS of the pairwise Bray–Curtis dissimilarities. For the microbial differential abundance analysis, we built a generalized linear model controlling for confounding factors, including age, sex, and BMI. Only species with a minimum prevalence of 20% were used in this statistical analysis78. P-values were corrected using the Benjamini–Hochberg method77.
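The 20% prevalence filter can be illustrated as follows, taking prevalence to mean the fraction of samples in which a species has non-zero relative abundance. That definition, the function name, and the species labels and values are assumptions for the purpose of the example.

```python
def prevalent_species(abundance_by_species, min_prevalence=0.20):
    """Keep species detected (abundance > 0) in at least 20% of samples."""
    kept = []
    for species, abundances in abundance_by_species.items():
        prevalence = sum(1 for a in abundances if a > 0) / len(abundances)
        if prevalence >= min_prevalence:
            kept.append(species)
    return kept

# Hypothetical relative abundances across five samples.
table = {
    "species_a": [0.0, 0.0, 0.0, 0.0, 1.5],  # 1 of 5 samples: kept
    "species_b": [0.0, 0.0, 0.0, 0.0, 0.0],  # never detected: dropped
    "species_c": [2.1, 0.0, 0.4, 0.9, 0.0],  # 3 of 5 samples: kept
}
kept = prevalent_species(table)
```

Filtering before model fitting reduces the number of tests that the subsequent multiple-testing correction has to account for.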
Spearman correlation analyses were conducted to associate the microbiome profiles of 301 individuals with their metabolome profiles, adjusting for confounding factors (age, sex, BMI). Correlation analyses were conducted using R version 3.6.0. The package ‘corrplot_0.90’ was used to compute the variance and the covariance or correlation; ‘pheatmap_1.0.12’ and ‘cor.mtest’ were used, respectively, to visualise the heatmap and to calculate the associated P values. Hierarchical clustering of both the top 39 most abundant microbial species and 39 metabolic profiles was conducted using hclust, implementing the ward.D2 agglomeration method. The ‘p.adjust’ function was used to perform Benjamini–Hochberg multiple testing correction77.
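Spearman's coefficient is the Pearson correlation of the ranked values, with tied observations given the average of their ranks. The sketch below shows the unadjusted coefficient only; it does not implement the adjustment for age, sex and BMI described above, and the function names and data are illustrative.

```python
import math
from statistics import fmean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of positions, 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = fmean(x), fmean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical species abundance and metabolite level in five individuals.
rho = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
```

Because only ranks are used, any monotonic relationship (here, a quadratic one) yields a coefficient of 1.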
The ZOE COVID Study was approved by the King’s College London Ethics Committee (Ref: LRS-19/20-18,210) and licensed under the Human Tissue Authority (reference 12,522). All ZCS participants provided informed consent for use of their data for COVID-19 research. The COVID Symptom Study Biobank and related studies, including this study, were approved by the Yorkshire & Humber NHS Research Ethics Committee (Ref: 20/YH/0298). CSSB participants were invited to join from the ZCS user base and provided informed consent to participate in the additional questionnaire and sample collection studies, and for linkage to app-collected data. All research and sample processing has been carried out in line with relevant guidelines including the Declaration of Helsinki.