4 December 2023
Last week, UK Biobank released the sequence analysis of half a million samples to approved researchers across academia, industry, charitable organizations, and government. Experts believe this milestone will greatly aid scientists in drug discovery and other biomedical developments.
“This is a fundamentally important project in terms of enabling access to whole-genome sequencing data matched to phenotypic data at a large scale,” says Rami Mehio, head of Software and Informatics at Illumina.
In 2006, UK Biobank announced its goal to study 500,000 individuals living in the United Kingdom to understand the environmental, lifestyle, and genetic correlates of disease. They recruited participants between the ages of 40 and 69, since diseases such as cancer, cardiovascular disease, dementia, and diabetes tend to start developing during this period.
Within four years, the study had recruited 500,000 consenting participants, and in 2012 UK Biobank launched its biomedical database. Then, in 2018, UK Biobank announced a major initiative to sequence the full genomes of those half-million participants.
“Whole-genome sequencing at this magnitude requires precise and highly sensitive technology,” says Mark Effingham, PhD, UK Biobank’s deputy chief executive officer. UK Biobank turned to Illumina as its preferred technology partner at the time.
The whole-genome sequencing (WGS) was performed on Illumina NovaSeq 6000 Sequencing Systems at deCODE Genetics in Iceland and the Wellcome Sanger Institute in the UK.
After the sequencing was completed, an industry consortium that helped fund the ambitious program, composed of leading pharmaceutical companies Amgen, AstraZeneca, GSK, and Johnson & Johnson, collaborated with Illumina to analyze the data. For this first step they used DRAGEN’s germline pipeline, known for its efficiency and accuracy. They selected the DRAGEN pipeline version to match that used by other large-scale population genomics initiatives, with the aim of cross-analyzing the data in the future. These other large initiatives include Singapore’s PRECISE program; the UK’s Genomics England; the National Institutes of Health’s All of Us Research Program in the US; and the Alliance for Genomic Discovery, led by Nashville Biosciences.
“Genomic analysis of whole genomes is quite computationally intensive, and at a scale such as this, speed, accuracy, reliability, and cost are all important factors in the choice of pipeline,” says Mehio. In a second step, the consortium enlisted Illumina to aggregate the cohort using the DRAGEN joint calling solution deployed on the Illumina Connected Analytics (ICA) cloud platform.
“To make this data useful for researchers, it is not enough to analyze each individual sample—you need to present these samples as an aggregated cohort,” says Mehio.
Illumina’s award-winning DRAGEN secondary analysis was used to mine each sample for variants as accurately as possible, and DRAGEN aggregation then combined the per-sample results into a single genetic dataset.
“This is probably the biggest aggregation of whole-genome sequencing in the world at this time,” Mehio continues. “The DRAGEN joint calling solution on ICA scales to hundreds of thousands of samples and solves the N+1 problem. So, adding another 10,000 samples to the cohort does not require the user to restart the joint calling from the beginning.”
“The DRAGEN algorithms have enabled the identification of an impressive ~1.5 billion variants from a half-million genomes,” Effingham adds.
“Our ICA infrastructure, hosted on Amazon Web Services, or AWS, is unique in its ability to scale and support a large cohort aggregation,” says Mehio. “We have now completed similar aggregations on multiple population-level projects. The platform is capable of large-scale compute and it enables researchers to mine and collaborate on their rich datasets. The platform is seeded with the most popular tools and, of course, all the major DRAGEN pipelines.”
Beyond enabling sequencing and analysis, Illumina is expanding researchers’ ability to learn more about the genetics of health and disease with additional data sourced from across the globe.
“Working with UK Biobank, we will explore technologies to analyze genomic data across these large cohorts, which gives researchers even more statistics to refine their models and get more precise in their drug discovery targets and polygenic risk scores,” Mehio explains.
“Such an enormous dataset promises invaluable insights,” Effingham says. “It will help researchers to better understand how genetics can advance drug discovery and, most importantly, improve patients’ health and well-being.”