From sample to insight: Streamlining analysis with the Illumina 5-base solution

In our previous blog, we introduced the Illumina 5-base solution—a fast and automation-compatible workflow that captures both genetic and epigenetic information from a single sample. In this second blog, we explore how to interpret the Illumina 5-base data across diverse applications, including genetic disease research, cancer detection, and population epigenetics.

The software pipeline is composed of: 

  • BCL Convert: Sequencing and raw read output.

  • DRAGEN secondary analysis: Outputs aligned reads, calls variants, and reports methylation. A DRAGEN Report summarizes key QC metrics across samples. 

  • Illumina Connected Multiomics: Tertiary analysis and biomarker discovery with complex data set visualization, differential methylation calling, variant-methylation joint analysis, and advanced multiomic analysis.

You can run the software pipeline on the cloud (DRAGEN on BaseSpace Sequencing Hub, Illumina Connected Analytics, Illumina Connected Multiomics) and automatically launch the pipeline from the sequencing system. You can also run DRAGEN secondary analysis on your DRAGEN server.  

In DRAGEN, we developed a new 5-base secondary analysis mode [1] that builds on the standard DRAGEN DNA workflow by incorporating methylation-aware logic into the core algorithms and integrating methylation output into pre-existing standardized data formats (Figure 1).

Figure 1: DRAGEN 5-base secondary analysis
Algorithm model updates to hash table builder, map/align, UMI collapsing, variant calling, and methylation reporting. Integrated small variant calls and methylation reporting per allele are output in the gVCF. Legacy methylation reporting is also supported (CX-report format). Quality control metrics (eg, M-bias, control genome methylation) confirm successful sequencing and analysis.

After run completion, the DRAGEN Report provides a comprehensive summary of quality control (QC) metrics (Figure 2). These QC metrics include additional methylation-specific metrics such as:

  • % Methylation in methylated/unmethylated control genomes: Small control genomes (Lambda phage and the pUC19 plasmid) are spiked into the 5-base library prep and serve as controls with known methylation levels. The Lambda genome has 0% methylation, and the pUC19 genome is synthetically methylated at CpG positions to over 97%. 

  • % Methylation in sample in CpG/CpH contexts: Methylation in mammalian genomes occurs predominantly in CpG cytosine contexts.

  • Read alignment rate to both DNA strands: 5-base reads should align in equal proportion to the original top (OT, also denoted +) or original bottom (OB, also denoted −) DNA strands. 

Figure 2: New methylation metrics in DRAGEN Reports provide simple quality control
(Left) In mammalian genomes, global percent methylation in CpG context is high (40–60%) whereas non-CpG contexts are unmethylated. Spike-in methylated and unmethylated control genomes show expected methylation levels in CpG contexts. (Right) Reads map to the expected DNA strand OT or OB in equal proportions.
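
The checks described above can be expressed as a small script. The following is a minimal sketch, not part of DRAGEN: the `MethylationQc` class, the `qc_flags` function, and every threshold except the 97% pUC19 level are assumptions chosen for illustration, and the metric values would need to be copied from the DRAGEN Report by hand or by your own parser.

```python
from dataclasses import dataclass


@dataclass
class MethylationQc:
    """Hypothetical container for methylation QC values taken from a report."""
    lambda_cpg_pct: float  # % CpG methylation in the unmethylated Lambda control
    puc19_cpg_pct: float   # % CpG methylation in the methylated pUC19 control
    sample_cph_pct: float  # % methylation in CpH contexts in the sample
    ot_reads: int          # reads aligned to the original top (+) strand
    ob_reads: int          # reads aligned to the original bottom (-) strand


def qc_flags(qc, max_lambda=1.0, min_puc19=97.0, max_cph=1.0, max_strand_skew=0.05):
    """Return a list of warnings; an empty list means every check passed.

    All thresholds are illustrative assumptions except min_puc19, which
    follows the >97% methylation level stated for the pUC19 control.
    """
    flags = []
    if qc.lambda_cpg_pct > max_lambda:
        flags.append("unmethylated control shows methylation")
    if qc.puc19_cpg_pct < min_puc19:
        flags.append("methylated control below expected level")
    if qc.sample_cph_pct > max_cph:
        flags.append("unexpected CpH methylation")
    total = qc.ot_reads + qc.ob_reads
    if total == 0:
        flags.append("no aligned reads")
    elif abs(qc.ot_reads / total - 0.5) > max_strand_skew:
        # OT and OB reads are expected in roughly equal proportion
        flags.append("OT/OB strand imbalance")
    return flags


good = MethylationQc(0.2, 98.5, 0.4, 500_000, 498_000)
bad = MethylationQc(3.0, 90.0, 0.4, 900, 100)
print(qc_flags(good))
print(qc_flags(bad))
```

Returning a list of warnings rather than a single pass/fail value keeps each failed check visible, which is convenient when screening many samples at once.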

In a traditional DNA library prep, the DNA strand of origin for a particular read is not known. In contrast, the 5-base library prep uses directional adapters that assign a sequenced fragment (ie, read pair) to its strand of origin. Commonly, the two DNA strands are denoted as original top (+) and original bottom (-). By convention, the reference genome sequence encodes an original top DNA sequence (Figure 3). Then, by sequencing to a standard coverage (greater than 30X), both DNA strands are represented in a genomic region. This representation enables visualization of sequence variants (present on both DNA strands at a given genomic position) and methylation of cytosine bases (present only on one strand at a given genomic position). In Figure 3C, a typical methylated region is depicted, where successive CpG dinucleotides are methylated. This region also contains a C>T heterozygous variant that, in contrast to CpG methylation, shows evidence of C>T mutations on both DNA strands at the specified genomic position.

Figure 3: Cytosine methylation produces a stranded signature that is distinct to sequence variants
(A) A DNA fragment that contains a methylated CpG dinucleotide undergoes C>T conversion in library prep. After sequencing and alignment to the reference genome (+) strand sequence, the fragment manifests as a C>T mutation on a (+) strand read or a G>A mutation on a (−) strand read.
(B) Unmethylated alleles A-C-G-T are sequenced similarly to a standard DNA library prep. The matching allele base is present on both the (+) and (−) strand.
(C) An illustration of a region that contains methylated CpG dinucleotides and a C>T heterozygous variant. Reads are grouped by DNA strand of origin (+, −). Methylated CpG dinucleotides are shown with mutations only present on one strand at a time, whereas the C>T sequence variant carries mutations on both strands simultaneously.
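
To make the stranded signature concrete, the sketch below classifies a reference cytosine from per-strand base counts. It is a toy heuristic for illustration only and is not the DRAGEN genotyper, which uses a probabilistic model. The function name `classify_c_site` and the `min_frac` cutoff are assumptions.

```python
def classify_c_site(plus_c, plus_t, minus_c, minus_t, min_frac=0.2):
    """Classify a reference-C position from per-strand base counts.

    plus_* are counts on reads from the (+) strand, minus_* on reads from
    the (-) strand, with bases expressed relative to the reference (+)
    sequence. min_frac is an illustrative cutoff, not a DRAGEN parameter.
    """
    plus_depth = plus_c + plus_t
    minus_depth = minus_c + minus_t
    if plus_depth == 0 or minus_depth == 0:
        return "insufficient strand coverage"
    plus_t_frac = plus_t / plus_depth
    minus_t_frac = minus_t / minus_depth
    # Conversion of a methylated reference C only affects (+) strand reads,
    # so T on (-) strand reads points to a true sequence variant.
    if minus_t_frac >= min_frac:
        return "C>T sequence variant"
    if plus_t_frac >= min_frac:
        return "methylated cytosine"
    return "unmethylated reference cytosine"


print(classify_c_site(2, 18, 20, 0))    # T only on (+) reads
print(classify_c_site(10, 10, 11, 9))   # T on both strands
print(classify_c_site(20, 0, 20, 0))    # no T on either strand
```

The same logic applies to a reference G with the strands swapped: methylation of the cytosine on the opposite strand appears as G>A on (−) strand reads only.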

You can inspect aligned reads in a genome browser to glean information on variant and methylation status in regions of interest. For example, in a patient with Kabuki syndrome, we find a variant in the KMT2D gene (chr12:49,024,720 G>C) and distinct episignatures (hyper- or hypomethylation) in regions of interest known to be associated with Kabuki syndrome (Figure 4).

Figure 4:  Visualization of variant and methylation status in a patient with Kabuki syndrome
(A) Differential methylation is observed in the subject relative to healthy controls, across regions shown to be indicative of Kabuki syndrome [2]. 
(B) A splicing variant in the KMT2D lysine methyltransferase gene (chr12:49,024,720 G>C) is known to be associated with Kabuki syndrome. The variant is visible in the Integrative Genomics Viewer (IGV) browser as heterozygous, only present on one allele. In addition, this region is hypermethylated in surrounding CpG sites. Methylated cytosines are read as thymines, which manifest as C>T on pink reads from the (+) strand and as G>A on blue reads from the (−) strand. The read strand is encoded in the read orientation (using IGV read coloring for first-of-pair strand).

From a DRAGEN Germline run, the variant caller calls a G>C heterozygous variant that is output in the small variant VCF file:

#CHROM  POS       REF  ALT  QUAL  FILTER
chr12   49024720  G    C    50    PASS

 

In the gVCF (or the legacy CX-report format), you can query per-cytosine methylation levels in the region of interest and convert them into a bedGraph format for visualization in a genome browser. With the Illumina 5-base solution, we now introduce methylation reporting directly in the gVCF file, which enables accurate genome-wide small variant and methylation reporting in a single file and high file compression. This compression supports anywhere from small single-sample analyses up to large population scale studies. The gVCF output is concordant with the VCF 4.5 specifications that introduce new methylation fields:

  • M5mC: Percentage methylation per cytosine allele

  • DPM5mC: Coverage per cytosine allele

  • INFO:M5mC: Cytosine allelic context 
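
The gVCF-to-bedGraph conversion mentioned above could look roughly like the following sketch. The exact layout of the methylation fields is an assumption here: the code assumes `M5mC` and `DPM5mC` appear as per-sample FORMAT keys whose values are a single number or a comma-separated list with `.` for missing entries, and it takes the first available number. Check the DRAGEN documentation for the actual encoding before relying on it. The helper names and the example record are hypothetical.

```python
def first_number(raw):
    """Return the first non-missing value of a comma-separated field, or None."""
    if raw is None:
        return None
    for token in raw.split(","):
        if token not in (".", ""):
            return float(token)
    return None


def gvcf_line_to_bedgraph(line, min_depth=1):
    """Convert one single-sample gVCF data line to a bedGraph row.

    Returns None for header lines and for records without a usable
    methylation value. Field layout is assumed, as described above.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None
    chrom, pos = fields[0], int(fields[1])
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    meth = first_number(sample.get("M5mC"))
    depth = first_number(sample.get("DPM5mC"))
    if meth is None or depth is None or depth < min_depth:
        return None
    # VCF POS is 1-based; bedGraph intervals are 0-based and half-open.
    return f"{chrom}\t{pos - 1}\t{pos}\t{meth:g}"


# Hypothetical record, made up for illustration only.
example = "chr12\t49024721\t.\tC\t.\t.\tPASS\t.\tGT:DP:M5mC:DPM5mC\t0/0:30:85.0:20"
print(gvcf_line_to_bedgraph(example))
```

Filtering on coverage (`min_depth`) before writing the track avoids displaying methylation percentages that rest on only one or two reads.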

Variant calling from 5-base data is highly accurate, due to the high data quality of the Illumina 5-base solution (high coverage uniformity, low error rates) and state-of-the-art DRAGEN algorithms that we tuned for 5-base data. For example, in small variant calling, we adapted the caller such that it maximizes the information available in the 5-base pileup. Specifically, we expand the caller model such that a thymine on a (+) strand read can be a methylated cytosine with a certain probability (similarly for adenines in − strand reads). Importantly, by using read evidence across both (+) and (−) strands, a genotype can be accurately resolved (Figure 5). In addition, DRAGEN can determine the methylation levels at each cytosine in a variant allele that has been called. As a result, the Illumina 5-base solution detects subtle, local interplay between variant and methylation, such as a C>G or G>C variant that alters methylation levels locally by changing a CpG into a CpH context or vice versa (Figure 5).

Figure 5: Small germline variant calling and methylation level estimation per cytosine in variant alleles 
(Top) Example of the three heterozygous genotypes that can contain a methylated C allele. In all cases, the DRAGEN 5-base genotyper calls both alleles by using information across both DNA strands. After genotyping, the methylation level per C or G reference allele is called by quantifying excess C>T or G>A mutations on the (+) or (−) strands, respectively. (Bottom left) Germline SNV calling accuracy of the Illumina 5-base solution for NA12878 is comparable with WGS (EM-seq and Bisulfite processed by Bis-SNP, 5-base and WGS by DRAGEN). (Bottom right) Discovery of heterozygous variants in NA12878 (C>G or G>C) that introduce an allele-specific change in methylation status (purple, orange) by altering the allelic cytosine context (CpG to CpH, or vice versa). For example, a G>C variant will convert a CGH allele into a CCH allele, where the CpG context is lost in the alternative CCH allele.
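
The context change caused by a C>G or G>C variant can be shown with a few lines of code. This is a minimal sketch on a made-up four-base sequence; the function names are hypothetical and only the (+) strand is considered.

```python
def cytosine_context(seq, i):
    """Return 'CpG' or 'CpH' for the cytosine at index i of seq."""
    if seq[i] != "C":
        raise ValueError("position is not a cytosine")
    if i + 1 >= len(seq):
        return "unknown"  # no downstream base available
    return "CpG" if seq[i + 1] == "G" else "CpH"


def apply_snv(seq, i, alt):
    """Return seq with the base at index i replaced by alt."""
    return seq[:i] + alt + seq[i + 1:]


ref_allele = "ACGT"                          # C at index 1 is in a CpG context
alt_allele = apply_snv(ref_allele, 2, "C")   # G>C variant at index 2

print(cytosine_context(ref_allele, 1))  # context on the reference allele
print(cytosine_context(alt_allele, 1))  # context on the alternative allele
```

On the reference allele the cytosine is followed by G (CpG), while on the alternative allele it is followed by C (CpH), so the two alleles of a heterozygous site can be expected to show different methylation levels.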

Small variant calling is also supported in somatic tumor-only and tumor-normal modes due to the previously mentioned algorithm updates. Figure 6 shows sensitivity plots for libraries that were prepared from mixtures of npDNA from NA12877 and NA12878 to mimic cfDNA with variants at 0.5%, 1%, and 2% variant allele frequencies (VAF) and from mixtures of gDNA from NA12877 and NA12878 to obtain truth variants at 2.5%, 5%, 10%, and 20% VAF. The 75 kb and 1 Mb labels refer to two panels from the Illumina Custom Enrichment Panel v2, designed for Illumina 5-Base DNA Prep with Enrichment, that target 75 kb and 1 Mb of the genome, respectively. We found that large variant calling is also accurate with adjustments to the core DRAGEN models (Figure 6). As a result, copy number variant (CNV) calling is available in DRAGEN 5-base runs, and we plan to release structural variant (SV) calling as well as short tandem repeat (STR) calling in future software releases.

Figure 6: DRAGEN 5-base also supports somatic small variant calling and can call large variants
(Top) Sensitivity of somatic SNV detection at various variant allele frequencies. (Middle) Large variant calling accuracy (CNV/SV) in HG002. CNV accuracy is determined with Witty.er against the GIAB v0.6 truth set [3]. SV accuracy is determined with Truvari Bench against the Genome in a Bottle T2T-Q100 HG002 SV v1.1 truth set with the SV confident BED file. (Bottom) A pathogenic copy number deletion is detected in the CREBBP gene for an individual with Rubinstein-Taybi syndrome.

Reads can be confidently labeled as methylated or unmethylated by classifying them based on the joint methylation status of the cytosines in a read. This is useful for applications that aim to detect low signals in individual reads, such as early cancer detection or minimal residual disease (MRD) screening. Taking a control genome such as the unmethylated Lambda genome, the methylated read classification error with the Illumina 5-base solution is less than 10 PPM (parts per million) for reads with two or more CpG dinucleotides (Figure 7).

Figure 7: Ultra-high accuracy of read classification by methylation status using the BAM XM tag 
(Left) Unmethylated Lambda phage DNA is processed with the Illumina 5-base DNA Prep and sequenced. Low frequency C>T errors can lead to a cytosine base being incorrectly marked as methylated in the BAM XM tag. (Right) Only a small fraction of Lambda phage DNA fragments are classified as methylated (less than 10 parts per million for read pairs with 2+ CpG dinucleotides). A fragment is classified as methylated when greater than 70% CpG cytosine bases are methylated in the read pair.
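
The classification rule in the caption (methylated when more than 70% of CpG cytosines in a read pair are methylated, for pairs with at least two CpG dinucleotides) can be sketched as follows. The code assumes the XM tag uses the Bismark-style methylation call string, where `Z` marks a methylated CpG cytosine and `z` an unmethylated one; that encoding and the function name are assumptions to verify against your BAM files.

```python
def classify_fragment(xm_read1, xm_read2="", min_cpg=2, threshold=0.70):
    """Classify a read pair from the XM strings of its two mates.

    Assumes Bismark-style call strings: 'Z' = methylated CpG cytosine,
    'z' = unmethylated CpG cytosine. Overlapping mates are not
    deduplicated in this sketch, so shared CpGs are counted twice.
    """
    xm = xm_read1 + xm_read2
    methylated = xm.count("Z")
    unmethylated = xm.count("z")
    total = methylated + unmethylated
    if total < min_cpg:
        return "unclassified"
    return "methylated" if methylated / total > threshold else "unmethylated"


print(classify_fragment("..Z..Z..z.", "Z..."))  # 3 of 4 CpGs methylated
print(classify_fragment("z..z..Z"))             # 1 of 3 CpGs methylated
print(classify_fragment("..Z.."))               # only one CpG in the pair
```

Requiring several CpGs per fragment is what drives the error rate down: a single C>T sequencing error can flip one call, but it rarely flips enough calls to push an unmethylated fragment over the threshold.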

Multiomics interpretation and integration

Illumina Connected Multiomics provides a powerful data science platform to streamline the 5-base methylation and genomic multiomic analyses. The platform enables teams to custom design complex workflows that transform raw data from DRAGEN to actionable biological insights. Multiple users can track the progress of an analysis, collaboratively perform data science experiments in parallel, and create interactive dashboards to communicate results. 

The platform ingests the outputs of DRAGEN and creates a multisample data structure that enables streamlined analysis at the cohort level. This architecture simplifies common tasks such as data quality filtering, unsupervised clustering, and differential methylation analysis. The following representative analysis workflow showcases the capabilities of Connected Multiomics with an acute myeloid leukemia (AML) patient cohort. A more detailed showcase of Connected Multiomics will come in a subsequent blog post.

Data quality control

The platform first ingests the outputs of DRAGEN and summarizes the data set at the multisample cohort level to streamline tasks such as filtering data by quality control metrics. Figure 8 illustrates a dashboard that visualizes the distribution of common whole-genome sequencing quality control metrics across the cohort.   

Figure 8: Quality control dashboard
A single sample is selected to illustrate how users can select specific data sets of interest to simultaneously observe multiple metrics corresponding to it.

Supervised and unsupervised clustering 

After the sample cohort is created, users can perform exploratory analyses, such as clustering, to visualize global differences in the cohort. Connected Multiomics allows users to cluster based on individual CpG positions as well as on larger features, such as promoter regions, where the CpG methylation is averaged over each feature. A custom feature set tailored to the biology or clinical context of the analysis can also be used to enhance clustering performance. Figure 9 illustrates how a user or team can explore the clustering performance over different UMAP parameters. 

Figure 9: Typical parameter screening for UMAP clustering

Differentially methylated region calling 

Connected Multiomics streamlines the identification of differentially methylated regions (DMRs) by integrating a widely used DMR caller that uses dispersion shrinkage for sequencing data (DSS) directly into its interactive sandbox environment. Sample groupings can be created from metadata or cluster labels from the PCA/UMAP tasks. DSS models the methylation at each CpG position as a beta-binomial distribution, and statistically significant differentially methylated positions between sample groups are stitched together to create DMRs [4]. Figure 10 shows how DMRs can be easily visualized and filtered for downstream analyses. AML patients carrying IDH mutations typically have hypermethylated phenotypes, which is reflected by the larger number of hypermethylated DMRs. diff.Methy represents the average methylation difference between the two sample groups over a specific genomic region, and length is the length of the DMR in base pairs. areaStat is the integrated statistical significance of all the CpG positions in a DMR and correlates most strongly with DMR length. Larger DMRs that have larger methylation differences will result in a larger absolute areaStat value. Significance labels are provided as a guide to help users interpret DMRs. However, users should ultimately evaluate the biological relevance of each DMR within the context of their specific study.

Figure 10: DSS DMR calling results volcano plot based on typically useful DMR metrics 
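
To illustrate how the DMR metrics relate to per-CpG results, the sketch below stitches significant CpG positions into regions and summarizes each one. It is a simplified stand-in and not the DSS implementation, which applies its own smoothing and merging criteria. The function names, thresholds, and input values are assumptions made up for illustration. areaStat is computed here as the sum of the per-CpG test statistics.

```python
def summarize(run):
    """Summarize a run of (pos, diff, stat, pval) tuples as one DMR."""
    start, end = run[0][0], run[-1][0]
    return {
        "start": start,
        "end": end,
        "length": end - start + 1,                           # base pairs
        "nCG": len(run),                                     # CpGs in the DMR
        "diff.Methy": sum(s[1] for s in run) / len(run),     # mean difference
        "areaStat": sum(s[2] for s in run),                  # summed statistics
    }


def stitch_dmrs(sites, p_threshold=0.05, max_gap=50, min_cpgs=3):
    """Group consecutive significant CpGs on one chromosome into DMRs.

    sites must be sorted by position. All thresholds are illustrative.
    """
    dmrs, current = [], []
    for site in sites:
        pos, pval = site[0], site[3]
        significant = pval < p_threshold
        # Close the open run at a non-significant CpG or a large gap.
        if current and (not significant or pos - current[-1][0] > max_gap):
            if len(current) >= min_cpgs:
                dmrs.append(summarize(current))
            current = []
        if significant:
            current.append(site)
    if len(current) >= min_cpgs:
        dmrs.append(summarize(current))
    return dmrs


# Made-up per-CpG results: (position, methylation difference, statistic, p-value)
sites = [
    (100, 0.4, 5.0, 0.001),
    (120, 0.5, 6.0, 0.001),
    (140, 0.3, 4.0, 0.010),
    (400, 0.0, 0.1, 0.900),
    (500, 0.4, 5.0, 0.001),
]
for dmr in stitch_dmrs(sites):
    print(dmr)
```

Because areaStat sums over CpGs, a long region with many moderately significant positions can outrank a short region with a few strong ones, which is why it tracks DMR length so closely.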

Conclusion

The Illumina 5-base solution and the DRAGEN pipeline redefine what is possible in genomics by combining genetic and epigenetic insights in a single, efficient workflow. We designed a software solution that simplifies and accelerates the path from sample to insight across multiple applications, including genetic disease, cancer biology, and population-scale studies. 

For more information on the Illumina 5-base solution, refer to Genome & Methylome Sequencing | Methylome analysis plus DNA variants.

References

[1] To run DRAGEN 5-base on cloud, see https://help.connected.illumina.com/dragen-5-base. To run DRAGEN 5-base from a local server, see https://help.dragen.illumina.com/product-guide/dragen-v4.4/dragen-recipes.

[2] Aref-Eshghi, E., Schenkel, L. C., Lin, H., Skinner, C., Ainsworth, P., Paré, G., … Sadikovic, B. (2017). The defining DNA methylation signature of Kabuki syndrome enables functional assessment of genetic variants of unknown clinical significance. Epigenetics, 12(11), 923–933. https://doi.org/10.1080/15592294.2017.1381807

[3] De La Vega, F. M., Irvine, S. A., Anur, P., Potts, K., Kraft, L., Torres, R., Kang, P., Truong, S., Lee, Y., Han, S., Onuchic, V., Han, J. (2025). Benchmarking of germline copy number variant callers from whole genome sequencing data for clinical applications. Bioinformatics Advances, 5(1), vbaf071. https://doi.org/10.1093/bioadv/vbaf071

[4] Feng, H. & Wu, H. (2019). Differential methylation analysis for bisulfite sequencing using DSS. Quant Biol.  https://doi.org/10.1007/s40484-019-0183-8