In our previous blog, we introduced the Illumina 5-base solution—a fast and automation-compatible workflow that captures both genetic and epigenetic information from a single sample. In this second blog, we explore how to interpret the Illumina 5-base data across diverse applications, including genetic disease research, cancer detection, and population epigenetics.
The software pipeline is composed of:
BCL Convert: Sequencing and raw read output.
DRAGEN secondary analysis: Outputs aligned reads, calls variants, and reports methylation. A DRAGEN Report summarizes key QC metrics across samples.
Illumina Connected Multiomics: Tertiary analysis and biomarker discovery with complex data set visualization, differential methylation calling, variant-methylation joint analysis, and advanced multiomic analysis.
You can run the software pipeline on the cloud (DRAGEN on BaseSpace Sequencing Hub, Illumina Connected Analytics, Illumina Connected Multiomics) and automatically launch the pipeline from the sequencing system. You can also run DRAGEN secondary analysis on your DRAGEN server.
In DRAGEN, we developed a new 5-base secondary analysis mode [1] that builds on the standard DRAGEN DNA workflow by incorporating methylation-aware logic into the core algorithms and integrating methylation output into existing standardized data formats (Figure 1).
Figure 1: DRAGEN 5-base secondary analysis
Algorithm model updates to hash table builder, map/align, UMI collapsing, variant calling, and methylation reporting. Integrated small variant calls and methylation reporting per allele are output in the gVCF. Legacy methylation reporting is also supported (CX-report format). Quality control metrics (eg, M-bias, control genome methylation) confirm successful sequencing and analysis.
After run completion, the DRAGEN Report provides a comprehensive summary of quality control (QC) metrics (Figure 2). These QC metrics include additional methylation-specific metrics such as:
% Methylation in methylated/unmethylated control genomes: Small control genomes (Lambda phage and the pUC19 plasmid) are spiked into the 5-base library prep and serve as controls with known methylation levels. The Lambda genome has 0% methylation, and the pUC19 genome is synthetically methylated at CpG positions to over 97%.
% Methylation in sample in CpG/CpH contexts: Methylation in mammalian genomes occurs predominantly in CpG cytosine contexts.
Read alignment rate to both DNA strands: 5-base reads should align in equal proportion to the original top (OT, also denoted +) or original bottom (OB, also denoted −) DNA strands.
Figure 2: New methylation metrics in DRAGEN Reports provide simple quality control
(Left) In mammalian genomes, global percent methylation in CpG context is high (40–60%) whereas non-CpG contexts are unmethylated. Spike-in methylated and unmethylated control genomes show expected methylation levels in CpG contexts. (Right) Reads map to the expected DNA strand OT or OB in equal proportions.
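The QC metrics above lend themselves to a simple automated gate. The following is a minimal sketch, not part of the DRAGEN Report itself: the `MethylationQC` record, the `check_methylation_qc` function, and every threshold except the 97% pUC19 level quoted above are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MethylationQC:
    """Hypothetical container for a few per-sample methylation QC values."""
    lambda_cpg_pct: float    # % CpG methylation in the unmethylated Lambda control
    puc19_cpg_pct: float     # % CpG methylation in the methylated pUC19 control
    sample_cph_pct: float    # % methylation in non-CpG (CpH) contexts of the sample
    ot_read_fraction: float  # fraction of aligned reads assigned to the OT (+) strand


def check_methylation_qc(qc, max_lambda_pct=1.0, min_puc19_pct=97.0,
                         max_cph_pct=2.0, max_strand_imbalance=0.05):
    """Return a list of human-readable QC failures (empty list means pass).

    Only min_puc19_pct comes from the text; the other limits are assumptions.
    """
    failures = []
    if qc.lambda_cpg_pct > max_lambda_pct:
        failures.append("unmethylated control shows unexpected methylation")
    if qc.puc19_cpg_pct < min_puc19_pct:
        failures.append("methylated control is below the expected level")
    if qc.sample_cph_pct > max_cph_pct:
        failures.append("non-CpG methylation is higher than expected")
    if abs(qc.ot_read_fraction - 0.5) > max_strand_imbalance:
        failures.append("reads are not balanced between OT and OB strands")
    return failures


sample = MethylationQC(lambda_cpg_pct=0.3, puc19_cpg_pct=98.2,
                       sample_cph_pct=0.4, ot_read_fraction=0.501)
print(check_methylation_qc(sample))
```

In practice the limits would be tuned to the sample type and sequencing depth rather than fixed as shown here.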
Figure 3: Cytosine methylation produces a stranded signature that is distinct from sequence variants
(A) A DNA fragment that contains a methylated CpG dinucleotide undergoes C>T conversion in library prep. After sequencing and alignment to the reference genome (+) strand sequence, the fragment manifests as a C>T mutation on a (+) strand read or a G>A mutation on a (−) strand read.
(B) Unmethylated alleles A-C-G-T are sequenced similarly to a standard DNA library prep. The matching allele base is present on both the (+) and (−) strand.
(C) An illustration of a region that contains methylated CpG dinucleotides and a C>T heterozygous variant. Reads are grouped by DNA strand of origin (+, −). Methylated CpG dinucleotides are shown with mutations only present on one strand at a time, whereas the C>T sequence variant carries mutations on both strands simultaneously.
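The stranded signature described in Figure 3 can be expressed as a small decision rule. This is an illustrative heuristic only, assuming per-strand base counts at a reference C position have already been tallied in (+) strand reference coordinates; the function name and the `min_alt_fraction` threshold are hypothetical, and the DRAGEN caller uses a probabilistic model rather than a fixed cutoff.

```python
def classify_reference_c_site(plus_c, plus_t, minus_c, minus_t,
                              min_alt_fraction=0.2):
    """Classify a reference C site from stranded C/T counts.

    plus_*  : counts from reads originating from the (+) strand
    minus_* : counts from reads originating from the (-) strand,
              expressed in (+) strand reference coordinates
    For a reference G site the roles of the strands are swapped (G>A on (-)).
    """
    def alt_fraction(alt, ref):
        total = alt + ref
        return alt / total if total else 0.0

    plus_frac = alt_fraction(plus_t, plus_c)
    minus_frac = alt_fraction(minus_t, minus_c)

    if plus_frac >= min_alt_fraction and minus_frac >= min_alt_fraction:
        # Mismatch supported by both strands: behaves like a true C>T variant.
        return "sequence variant"
    if plus_frac >= min_alt_fraction:
        # Mismatch confined to the (+) strand: methylation signature.
        return "methylated cytosine"
    if minus_frac >= min_alt_fraction:
        # Not expected for a reference C; flag for manual review.
        return "ambiguous"
    return "unmethylated reference"


print(classify_reference_c_site(plus_c=2, plus_t=18, minus_c=20, minus_t=0))
```

A site that is both heterozygous and methylated mixes the two signatures, which is why a joint model across strands is needed for real data.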
You can inspect aligned reads in a genome browser to glean information on variant and methylation status in regions of interest. For example, in a patient with Kabuki syndrome, we find a variant in the KMT2D gene (chr12:49,024,720 G>C) and distinct episignatures or hyper/hypomethylation in regions of interest known to be associated with Kabuki syndrome (Figure 4).
Figure 4: Visualization of variant and methylation status in a patient with Kabuki syndrome
(A) Differential methylation is observed in the subject relative to healthy controls, across regions shown to be indicative of Kabuki syndrome [2].
(B) A splicing variant in the KMT2D lysine methyltransferase gene (chr12:49,024,720 G>C) is known to be associated with Kabuki syndrome. The variant is visible in the Integrative Genomics Viewer (IGV) browser as heterozygous, present on only one allele. In addition, the surrounding CpG sites in this region are hypermethylated. Methylated cytosines are read as thymines, which manifest as C>T on pink reads from the (+) strand and as G>A on blue reads from the (−) strand. The read strand is encoded in the read orientation (using IGV read coloring for first-of-pair strand).
From a DRAGEN Germline run, the variant caller calls a G>C heterozygous variant that is output in the small variant VCF file:
| #CHROM | POS | REF | ALT | QUAL | FILTER |
|---|---|---|---|---|---|
| chr12 | 49024720 | G | C | 50 | PASS |
In the gVCF (or the legacy CX-report format), you can query per-cytosine methylation levels in the region of interest and convert them into a bedGraph format for visualization in a genome browser. With the Illumina 5-base solution, we now introduce methylation reporting directly in the gVCF file, which enables accurate genome-wide small variant and methylation reporting in a single file and high file compression. This compression supports anywhere from small single-sample analyses up to large population scale studies. The gVCF output is concordant with the VCF 4.5 specifications that introduce new methylation fields:
M5mC: Percentage methylation per cytosine allele
DPM5mC: Coverage per cytosine allele
INFO:M5mC: Cytosine allelic context
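As a rough sketch of the gVCF-to-bedGraph conversion mentioned above, the snippet below pulls `M5mC` and `DPM5mC` from the sample column of a single record and emits a bedGraph line. The record layout is an assumption made for illustration: it treats both fields as FORMAT keys with one value (or a comma-separated list whose first value is used), and the example record is made up. The VCF 4.5 specification and the DRAGEN documentation define the actual encoding. The function name `vcf_line_to_bedgraph` and the `min_depth` filter are hypothetical.

```python
def vcf_line_to_bedgraph(line, min_depth=1):
    """Convert one VCF data line to a bedGraph line, or return None.

    Assumes (hypothetically) that M5mC and DPM5mC appear as FORMAT keys of
    the first sample. bedGraph uses 0-based, half-open coordinates.
    """
    if line.startswith("#"):
        return None  # header line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None  # no FORMAT/sample columns
    chrom, pos = fields[0], int(fields[1])
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))

    # Take the first value if several per-allele values are listed.
    meth = sample.get("M5mC", ".").split(",")[0]
    depth = sample.get("DPM5mC", ".").split(",")[0]
    if meth == "." or depth == "." or int(depth) < min_depth:
        return None
    return f"{chrom}\t{pos - 1}\t{pos}\t{float(meth):g}"


# Made-up record for illustration only.
record = "chr12\t49024700\t.\tC\t.\t.\tPASS\t.\tGT:M5mC:DPM5mC\t0/0:85.5:40"
print(vcf_line_to_bedgraph(record))
```

For whole-genome files, a VCF library with indexed region queries would be a better fit than line-by-line string parsing.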
Variant calling from 5-base data is highly accurate, due to the high data quality of the Illumina 5-base solution (high coverage uniformity, low error rates) and state-of-the-art DRAGEN algorithms that we tuned for 5-base data. For example, in small variant calling, we adapted the caller such that it maximizes the information available in the 5-base pileup. Specifically, we expand the caller model such that a thymine on a (+) strand read can be a methylated cytosine with a certain probability (similarly for adenines in − strand reads). Importantly, by using read evidence across both (+) and (−) strands, a genotype can be accurately resolved (Figure 5). In addition, DRAGEN can determine the methylation levels at each cytosine in a variant allele that has been called. As a result, the Illumina 5-base solution detects subtle, local interplay between variant and methylation, such as a C>G or G>C variant that alters methylation levels locally by changing a CpG into a CpH context or vice versa (Figure 5).
Figure 5: Small germline variant calling and methylation level estimation per cytosine in variant alleles
(Top) Example of the three heterozygous genotypes that can contain a methylated C allele. In all cases, the DRAGEN 5-base genotyper calls both alleles by using information across both DNA strands. After genotyping, the methylation level per C or G reference allele is called by quantifying excess C>T or G>A mutations on the (+) or (−) strands, respectively. (Bottom left) Germline SNV calling accuracy of the Illumina 5-base solution for NA12878 is comparable with WGS (EM-seq and Bisulfite processed by Bis-SNP, 5-base and WGS by DRAGEN). (Bottom right) Discovery of heterozygous variants in NA12878 (C>G or G>C) that introduce an allele-specific change in methylation status (purple, orange) by altering the allelic cytosine context (CpG to CpH, or vice versa). For example, a G>C variant will convert a CGH allele into a CCH allele, where the CpG context is lost in the alternative CCH allele.
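To make the idea of a methylation-aware genotype model concrete, here is a toy likelihood calculation for a reference C site. It is a sketch under simplifying assumptions and is not the DRAGEN model: reads are given as (base, strand) pairs in (+) strand reference coordinates, a C allele on a (+) strand read appears as T with probability `meth`, sequencing errors are uniform with rate `err`, and the methylation level is chosen from a coarse grid. All function and parameter names are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def base_prob(obs, allele, strand, meth, err):
    """P(observed base | allele, strand) for the toy model."""
    if allele == "C" and strand == "+":
        # A methylated C on the (+) strand is read as T.
        true_bases = {"T": meth, "C": 1.0 - meth}
    else:
        true_bases = {allele: 1.0}
    return sum(p * ((1.0 - err) if obs == base else err / 3.0)
               for base, p in true_bases.items())


def call_genotype(reads, alleles=("C", "T"), err=0.01,
                  meth_grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return the (genotype, methylation level) pair with the highest likelihood."""
    best = None
    for genotype in combinations_with_replacement(alleles, 2):
        for meth in meth_grid:
            loglik = sum(
                math.log(0.5 * base_prob(obs, genotype[0], strand, meth, err)
                         + 0.5 * base_prob(obs, genotype[1], strand, meth, err))
                for obs, strand in reads)
            if best is None or loglik > best[0]:
                best = (loglik, genotype, meth)
    return best[1], best[2]


# T only on (+) strand reads: homozygous C, fully methylated.
methylated = [("T", "+")] * 10 + [("C", "-")] * 10
# T on both strands at about 50%: heterozygous C/T variant.
variant = [("C", "+")] * 5 + [("T", "+")] * 5 + [("C", "-")] * 5 + [("T", "-")] * 5

print(call_genotype(methylated))
print(call_genotype(variant))
```

The (−) strand reads carry the genotype information at a reference C, while the excess of T on the (+) strand carries the methylation information. This is the intuition behind using evidence from both strands.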
Figure 6: DRAGEN 5-base also supports somatic small variant calling and can call large variants
(Top) Sensitivity of somatic SNV detection at various variant allele frequencies. (Middle) Large variant calling accuracy (CNV/SV) in HG002. CNV accuracy is determined with Witty.er against the GIAB v0.6 truth set [3]. SV accuracy is determined with Truvari Bench against the Genome in a Bottle T2T-Q100 HG002 SV v1.1 truth set with the SV confident BED file. (Bottom) A pathogenic copy number deletion is detected in the CREBBP gene for an individual with Rubinstein-Taybi syndrome.
Figure 7: Ultra-high accuracy of read classification by methylation status using the BAM XM tag
(Left) Unmethylated Lambda phage DNA is processed with the Illumina 5-base DNA Prep and sequenced. Low frequency C>T errors can lead to a cytosine base being incorrectly marked as methylated in the BAM XM tag. (Right) Only a small fraction of Lambda phage DNA fragments are classified as methylated (less than 10 parts per million for read pairs with 2+ CpG dinucleotides). A fragment is classified as methylated when greater than 70% CpG cytosine bases are methylated in the read pair.
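The fragment-level rule in Figure 7 (at least two CpG dinucleotides, more than 70% of them methylated) can be sketched as follows. The sketch assumes the XM tag follows the Bismark-style convention in which `Z` marks a methylated CpG cytosine and `z` an unmethylated one. The function name is hypothetical, and the XM strings of the two mates are assumed to have been extracted from the BAM already.

```python
def classify_fragment(xm_read1, xm_read2, min_cpgs=2, min_fraction=0.7):
    """Classify a read pair from its two XM strings.

    Returns "methylated", "unmethylated", or None when the pair covers
    fewer than min_cpgs CpG cytosines. Overlapping mates would count shared
    CpGs twice; a fuller implementation would deduplicate by position.
    """
    xm = xm_read1 + xm_read2
    methylated = xm.count("Z")          # methylated CpG calls
    total = methylated + xm.count("z")  # all CpG calls
    if total < min_cpgs:
        return None
    return "methylated" if methylated / total > min_fraction else "unmethylated"


print(classify_fragment("..Z..Z.", "...Z..z"))
print(classify_fragment("..z..", "..z.Z"))
```

Requiring several concordant CpG calls per fragment is what makes the classification robust to isolated C>T errors.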
Multiomics interpretation and integration
Illumina Connected Multiomics provides a powerful data science platform to streamline the 5-base methylation and genomic multiomic analyses. The platform enables teams to custom design complex workflows that transform raw data from DRAGEN to actionable biological insights. Multiple users can track the progress of an analysis, collaboratively perform data science experiments in parallel, and create interactive dashboards to communicate results.
The platform ingests the outputs of DRAGEN and creates a multisample data structure that enables streamlined analysis at the cohort level. This architecture simplifies common tasks such as data quality filtering, unsupervised clustering, and differential methylation analysis. The following representative analysis workflow showcases the capabilities of Connected Multiomics with an acute myeloid leukemia (AML) patient cohort. A more detailed showcase of Connected Multiomics will come in a subsequent blog post.
Data quality control
The platform first ingests the outputs of DRAGEN and summarizes the data set at the multisample cohort level to streamline tasks such as filtering data by quality control metrics. Figure 8 illustrates a dashboard that visualizes the distribution of common whole-genome sequencing quality control metrics across the cohort.
Figure 8: Quality control dashboard
A single sample is selected to illustrate how users can select specific data sets of interest to simultaneously observe multiple metrics corresponding to it.
Supervised and unsupervised clustering
After the sample cohort is created, users can perform exploratory analyses, such as clustering, to visualize global differences in the cohort. Connected Multiomics allows users to cluster based on individual CpG positions as well as on larger features, such as promoter regions, where the CpG methylation is averaged over each feature. A custom feature set tailored to the biology or clinical context of the analysis can also be used to enhance clustering performance. Figure 9 illustrates how a user or team can explore the clustering performance over different UMAP parameters.
Figure 9: Typical parameter screening for UMAP clustering
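Averaging CpG methylation over features such as promoters, as described above, reduces a per-CpG table to one value per feature and sample before clustering. The snippet below is a minimal standalone sketch of that aggregation step. It does not represent how Connected Multiomics implements it, and the function name, input layout, and coordinates are assumptions for illustration.

```python
from statistics import mean


def average_methylation_by_feature(cpgs, features):
    """Average per-CpG methylation over genomic features.

    cpgs     : iterable of (chrom, position, methylation_percent)
    features : iterable of (name, chrom, start, end), 0-based half-open
    Features without any covered CpG are omitted from the result.
    """
    averaged = {}
    for name, chrom, start, end in features:
        values = [meth for c, pos, meth in cpgs
                  if c == chrom and start <= pos < end]
        if values:
            averaged[name] = mean(values)
    return averaged


cpgs = [("chr1", 100, 80.0), ("chr1", 150, 60.0), ("chr1", 900, 10.0)]
features = [("promoter_A", "chr1", 50, 200), ("promoter_B", "chr1", 500, 600)]
print(average_methylation_by_feature(cpgs, features))
```

Repeating this per sample yields a samples-by-features matrix that can be passed to PCA or UMAP. For genome-scale inputs, an interval index would replace the linear scan used here.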
Differentially methylated region calling
Connected Multiomics streamlines the identification of differentially methylated regions (DMRs) by integrating a widely used DMR caller, dispersion shrinkage for sequencing data (DSS), directly into its interactive sandbox environment. Sample groupings can be created from metadata or from cluster labels produced by the PCA/UMAP tasks. DSS models the methylation at each CpG position with a beta-binomial distribution, and statistically significant differentially methylated positions between sample groups are stitched together to create DMRs [4]. Figure 10 shows how DMRs can be easily visualized and filtered for downstream analyses. AML patients carrying IDH mutations typically have hypermethylated phenotypes, which is reflected by the larger number of hypermethylated DMRs. diff.Methy represents the average methylation difference between the two sample groups over a specific genomic region, and the length is the length of the DMR in base pairs. areaStat is the integrated statistical significance of all the CpG positions in a DMR and correlates most strongly with DMR length. Larger DMRs that have larger methylation differences will result in a larger areaStat absolute value. Significance labels are provided as a guide to help users interpret DMRs. However, users should ultimately evaluate the biological relevance of each DMR within the context of their specific study.
Figure 10: DSS DMR calling results volcano plot based on typically useful DMR metrics
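To illustrate how the DMR metrics relate to per-CpG results, the sketch below stitches significant CpG positions into regions and computes the length, diff.Methy, and areaStat for each. It is a simplified stand-in for the stitching step and not the DSS implementation. The input layout, the function names, and the `p_threshold`, `max_gap`, and `min_cpgs` parameters are assumptions, and areaStat is taken here as the sum of the per-CpG test statistics.

```python
def summarize_region(region):
    """Compute DMR-level metrics from a list of per-CpG result dicts."""
    return {
        "chrom": region[0]["chrom"],
        "start": region[0]["pos"],
        "end": region[-1]["pos"],
        "length": region[-1]["pos"] - region[0]["pos"] + 1,
        "nCG": len(region),
        "diff.Methy": sum(s["diff"] for s in region) / len(region),
        "areaStat": sum(s["stat"] for s in region),
    }


def stitch_dmrs(sites, p_threshold=0.05, max_gap=100, min_cpgs=3):
    """Merge nearby significant CpGs (sorted by chrom, pos) into DMRs."""
    dmrs, region = [], []
    for site in sites:
        if site["pval"] >= p_threshold:
            continue  # only significant positions seed or extend a region
        new_region = region and (site["chrom"] != region[-1]["chrom"]
                                 or site["pos"] - region[-1]["pos"] > max_gap)
        if new_region:
            if len(region) >= min_cpgs:
                dmrs.append(summarize_region(region))
            region = []
        region.append(site)
    if len(region) >= min_cpgs:
        dmrs.append(summarize_region(region))
    return dmrs


sites = [
    {"chrom": "chr1", "pos": 100, "diff": 0.4, "stat": 5.0, "pval": 0.001},
    {"chrom": "chr1", "pos": 120, "diff": 0.5, "stat": 6.0, "pval": 0.001},
    {"chrom": "chr1", "pos": 150, "diff": 0.3, "stat": 4.0, "pval": 0.01},
    {"chrom": "chr1", "pos": 160, "diff": 0.0, "stat": 0.1, "pval": 0.9},
    {"chrom": "chr1", "pos": 5000, "diff": 0.6, "stat": 7.0, "pval": 0.001},
]
for dmr in stitch_dmrs(sites):
    print(dmr)
```

Because areaStat accumulates over CpGs, it grows with both the number of CpGs in a region and the size of the per-CpG effect, which matches the behavior described above.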
Conclusion
The Illumina 5-base solution and the DRAGEN pipeline redefine what is possible in genomics by combining genetic and epigenetic insights in a single, efficient workflow. We designed a software solution that simplifies and accelerates the path from sample to insight across multiple applications, including genetic disease, cancer biology, and population-scale studies.
For more information on the Illumina 5-base solution, refer to "Genome & Methylome Sequencing | Methylome analysis plus DNA variants".
References
[1] To run DRAGEN 5-base on cloud, see https://help.connected.illumina.com/dragen-5-base. To run DRAGEN 5-base from a local server, see https://help.dragen.illumina.com/product-guide/dragen-v4.4/dragen-recipes.
[2] Aref-Eshghi, E., Schenkel, L. C., Lin, H., Skinner, C., Ainsworth, P., Paré, G., … Sadikovic, B. (2017). The defining DNA methylation signature of Kabuki syndrome enables functional assessment of genetic variants of unknown clinical significance. Epigenetics, 12(11), 923–933. https://doi.org/10.1080/15592294.2017.1381807
[3] De La Vega, F. M., Irvine, S. A., Anur, P., Potts, K., Kraft, L., Torres, R., Kang, P., Truong, S., Lee, Y., Han, S., Onuchic, V., & Han, J. (2025). Benchmarking of germline copy number variant callers from whole genome sequencing data for clinical applications. Bioinformatics Advances, 5(1), vbaf071. https://doi.org/10.1093/bioadv/vbaf071
[4] Feng, H. & Wu, H. (2019). Differential methylation analysis for bisulfite sequencing using DSS. Quant Biol. https://doi.org/10.1007/s40484-019-0183-8