Illumina Complete Long Reads software analysis workflow for human WGS

Kyria Roessler

Introduction

Next-generation sequencing (NGS) enables scientists to decipher the genome for a deeper understanding of biology. Proven Illumina sequencing by synthesis (SBS) chemistry combined with award-winning DRAGEN secondary analysis delivers whole-genome sequencing (WGS) data with outstanding accuracy.¹,² DRAGEN Multigenome (graph) further improves mapping accuracy in challenging regions by ~50%.¹ Still, there remains a small fraction of genic regions that is difficult to map with short reads alone and can benefit from the increased mappability of longer read lengths.

Illumina Complete Long Reads offers a streamlined workflow to make long-read sequencing accessible and help resolve these challenging regions of the human genome. With Illumina Complete Long Reads, both short and long reads can be generated on a single platform. In combination with DRAGEN informatics and machine learning methods, Illumina Complete Long Reads delivers accurate variant calling and phasing information from NGS technology. This article describes the fundamental principles behind Illumina Complete Long Reads human genome analysis.

How it works: Assay overview

The Illumina Complete Long Reads workflow (Figure 1) combines a proprietary library prep assay, proven Illumina SBS chemistry, and powerful DRAGEN secondary analysis to generate highly accurate long-read data with an N50 of 5–7 kb.
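For readers unfamiliar with the N50 metric quoted above, the following minimal Python sketch shows how it is commonly computed: the N50 is the read length at which reads of that length or longer contain at least half of all sequenced bases. The function name `n50` and the read lengths are illustrative assumptions, not part of the Illumina software.

```python
def n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical read lengths in base pairs (illustrative only).
example_lengths = [2000, 3000, 5000, 6000, 8000]
print(n50(example_lengths))  # 6000
```

In this toy data set the two longest reads (8000 + 6000 bases) already hold more than half of the 24,000 total bases, so the N50 is 6000.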

Library prep for Illumina Complete Long Reads

The efficient, single-day library preparation protocol is easy to scale for high-throughput studies and requires only 50 ng DNA input.* The assay uses tagmentation to make long genomic DNA fragments (> 10 kb), eliminating the need for additional shearing or size selection. Long, single-molecule DNA fragments are enzymatically marked with unique patterns of single base pair changes. These “land-marks” are introduced at low (4%–7%) frequency along the length of the DNA fragment. Each single-molecule fragment has a unique signature of land-marks to capture and preserve long-read information (without the use of complex barcodes or adapters). Land-marked long fragments are amplified, followed by a second tagmentation step to prepare the libraries for standard sequencing on Illumina systems.

* 50 ng DNA input is recommended; inputs as low as 10 ng DNA are possible.

Bioinformatics workflow

The analysis pipeline generates long reads and combines the data with a standard, unmarked WGS library† to produce long contiguous reads that are complete and accurate representations of the original single-molecule fragments.

† Requires 30× standard short-read human whole-genome data from the same sample for analysis. Illumina DNA PCR-Free Prep is recommended. Third-party WGS kits are also compatible. The unmarked library does not need to be prepared or sequenced in parallel; FASTQ files from a previously run sample can be used.

Figure 1: How the Illumina Complete Long Reads assay works

Illumina Complete Long Reads generation

The Illumina Complete Long Reads bioinformatics workflow for long-read generation includes standard genomic computational methods like alignment and variant calling. The workflow is packaged and available as a push-button app in BaseSpace Sequence Hub. The workflow uses land-marked and unmarked libraries and a reference genome as inputs. These inputs are then used to carry out a series of steps (Figure 2) to generate long reads from single molecules for comprehensive WGS analysis.

Identify land-marked sites on reads

The first step in the long-read generation process is to identify the marks present in the land-marked library. In confident-to-map regions, most land-marks can be identified by standard alignment and detection of nucleotides that differ from the reference genome.

For reads that come from regions that do not readily align to the reference genome (eg, repetitive regions), a different approach is needed to detect land-marks. Specific methods for k-mers (ie, informatically breaking up reads into small strings of nucleotides of “k” length) allow algorithms to determine relationships between reads without use of a reference genome. In difficult-to-map regions, marks are inferred by comparing k-mers from the land-marked and unmarked reads.³ If a k-mer in a marked read cannot be paired with any k-mer from the unmarked reads, it will be treated as a land-mark.
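The reference-free idea in the paragraph above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the production method: it uses plain contiguous k-mers, ignores sequencing errors, and the names `kmers` and `candidate_mark_positions` are hypothetical. A position in a marked read is flagged when every k-mer covering it is absent from the unmarked reads.

```python
def kmers(seq, k):
    """Yield (offset, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def candidate_mark_positions(marked_read, unmarked_reads, k=5):
    """Return positions in marked_read that are covered only by k-mers
    absent from the unmarked reads (toy model, no sequencing errors)."""
    unmarked_index = {km for read in unmarked_reads for _, km in kmers(read, k)}
    n = len(marked_read)
    covered_by_known = [False] * n
    covered_by_novel = [False] * n
    for i, km in kmers(marked_read, k):
        target = covered_by_known if km in unmarked_index else covered_by_novel
        for j in range(i, i + k):
            target[j] = True
    return [p for p in range(n) if covered_by_novel[p] and not covered_by_known[p]]


# Hypothetical sequences: the marked read differs at position 9 (G -> A).
unmarked = "ACGTTGCAAGCTTAGGCTA"
marked = "ACGTTGCAAACTTAGGCTA"
print(candidate_mark_positions(marked, [unmarked]))  # [9]
```

Only position 9 is reported because its neighbors are also covered by k-mers that do appear in the unmarked read. A real implementation would also have to tolerate sequencing errors and true variants, which this sketch does not attempt.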

Build weighted network of land-marked reads

After detecting all the land-marks in the reads, the next step is to identify connections among reads based on their shared marks. We use minimizer k-mers to index pairs of reads that are similar and optimize k-mer matching.⁴ All pairs that share a given minimizer k-mer can be compared in detail. The number of shared and conflicting land-marks determines the strength of evidence connecting reads (Figure 3). We build a weighted network of marked reads based on the strength of those connections.

Figure 3: Illustration of the process used to identify reads with shared land-marks
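A minimal Python sketch of the two ideas in this step follows: indexing reads by minimizer k-mers to find candidate pairs, then scoring a pair by its shared and conflicting land-marks. Everything here is an illustrative assumption: the parameters `k` and `w`, the representation of a read as `(start, end, marks)` in a common coordinate system, and the "shared minus conflicting" weight are not taken from the Illumina pipeline.

```python
from collections import defaultdict
from itertools import combinations


def minimizers(seq, k=4, w=3):
    """Return the (w, k)-minimizers of seq: the lexicographically smallest
    k-mer in every window of w consecutive k-mers."""
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kms[i:i + w]) for i in range(len(kms) - w + 1)}


def candidate_pairs(reads, k=4, w=3):
    """Index reads by minimizer and return the pairs of read names
    that share at least one minimizer."""
    index = defaultdict(set)
    for name, seq in reads.items():
        for m in minimizers(seq, k, w):
            index[m].add(name)
    pairs = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


def edge_weight(read_a, read_b):
    """Score two reads given as (start, end, marks), with end exclusive and
    marks a set of positions: shared marks minus conflicting marks,
    counted only inside the overlapping interval."""
    start = max(read_a[0], read_b[0])
    end = min(read_a[1], read_b[1])
    if start >= end:
        return 0
    marks_a = {p for p in read_a[2] if start <= p < end}
    marks_b = {p for p in read_b[2] if start <= p < end}
    return len(marks_a & marks_b) - len(marks_a ^ marks_b)


# Hypothetical reads: r1 and r2 overlap, r3 is unrelated.
reads = {"r1": "ACGTACGGTCA", "r2": "GTACGGTCATT", "r3": "TTTTTTTTTTT"}
print(candidate_pairs(reads))  # {('r1', 'r2')}
print(edge_weight((0, 100, {10, 40, 70}), (30, 130, {40, 70, 120})))  # 2
```

Restricting detailed comparison to pairs that share a minimizer avoids comparing every read against every other read, which is the motivation for the indexing step.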

Find groups of reads from the same template

The connections between reads form a graph of all reads. A series of decomposition and clustering methods is applied (such as removing conflicting or weak connections due to a low number of shared land-marks) to split the full network into strongly linked clusters (Figure 4). Each cluster is presumed to originate from a single molecule.

Figure 4: Illustration of the clustering process
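The simplest version of this decomposition can be sketched as: drop edges whose weight falls below a threshold, then take the connected components of what remains. The Python below is a hedged illustration only; the threshold `min_weight`, the edge weights, and the function name `cluster_reads` are assumptions, and the actual workflow applies a series of decomposition and clustering methods rather than a single pass.

```python
from collections import defaultdict


def cluster_reads(edges, min_weight=2):
    """Drop edges below min_weight, then return the connected components
    of the remaining graph as sorted lists of read names."""
    adjacency = defaultdict(set)
    nodes = set()
    for (a, b), weight in edges.items():
        nodes.update((a, b))
        if weight >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)
    clusters, seen = [], set()
    for node in sorted(nodes):
        if node in seen:
            continue
        # Depth-first traversal collects one component.
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical weighted edges between five reads.
edges = {
    ("r1", "r2"): 5,
    ("r2", "r3"): 4,
    ("r3", "r4"): -1,  # conflicting marks: removed
    ("r4", "r5"): 3,
    ("r1", "r5"): 1,   # too few shared marks: removed
}
print(cluster_reads(edges))  # [['r1', 'r2', 'r3'], ['r4', 'r5']]
```

Each resulting list stands in for a group of reads presumed to come from the same original molecule.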

Assemble each group of land-marked reads

From the final clusters, DRAGEN analysis uses k-mer–based, de Bruijn graph–like assembly methods to generate long-read contigs (Figure 5).

Figure 5: Illustration of the assembly process
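To make the k-mer-based assembly idea concrete, here is a textbook de Bruijn graph walk in Python. It is a sketch under strong assumptions (error-free reads from one cluster, no repeats longer than k-1, a single unbranched path) and is not the DRAGEN implementation, which the source describes only as "de Bruijn graph-like". The function name `assemble` and the reads are hypothetical.

```python
from collections import defaultdict


def assemble(reads, k=4):
    """Assemble reads into one contig by walking an unbranched de Bruijn
    path. Returns None if the graph branches, cycles, or lacks a unique start."""
    successors = defaultdict(set)
    in_degree = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            # Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix.
            prefix, suffix = read[i:i + k - 1], read[i + 1:i + k]
            if suffix not in successors[prefix]:
                successors[prefix].add(suffix)
                in_degree[suffix] += 1
    starts = [node for node in successors if in_degree[node] == 0]
    if len(starts) != 1:
        return None
    node = starts[0]
    contig = node
    visited = {node}
    while len(successors.get(node, ())) == 1:
        (node,) = successors[node]
        if node in visited:  # cycle: give up in this toy version
            return None
        visited.add(node)
        contig += node[-1]
    if successors.get(node):  # more than one way forward: ambiguous
        return None
    return contig


# Hypothetical overlapping reads from one cluster.
cluster = ["ATGGCGTG", "GCGTGCAA", "TGCAATC"]
print(assemble(cluster))  # ATGGCGTGCAATC
```

The three short reads overlap by several bases each, so their k-mers chain into a single path that spells out the longer contig.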

Remove land-marks from long reads

After land-marks are used to support generation of long reads, the marks can be removed. To distinguish land-marks from true variants, land-marked long reads are compared to unmarked reads. Any land-mark that does not match the corresponding unmarked reads is updated so that the final Illumina Complete Long Read reveals the true sequence (Figure 6). This comparison is similar to how land-marks are identified: it is performed in part using reference genome alignment and in part using k-mer indexing, especially in regions that are challenging to map. After obtaining an alignment of unmarked reads to land-marked long reads, a Bayesian model is applied to determine the final base calls of the long read and the corresponding quality scores.
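The source does not specify the Bayesian model, so the following Python sketch shows only a generic version of the idea: at one position of the long read, combine the bases observed in aligned unmarked reads into a posterior over the four bases, then convert the posterior error into a Phred-scaled quality. The uniform prior, the assumption that errors are spread evenly over the three other bases, and the name `call_base` are all assumptions for illustration.

```python
import math

BASES = "ACGT"


def call_base(observations, prior=None):
    """observations: list of (base, error_probability) from unmarked reads,
    with 0 < error_probability < 1.
    Returns (called_base, posterior_probability, phred_quality)."""
    prior = prior or {b: 0.25 for b in BASES}
    log_post = {}
    for candidate in BASES:
        total = math.log(prior[candidate])
        for base, err in observations:
            # Likelihood of observing `base` if the true base is `candidate`.
            p = 1.0 - err if base == candidate else err / 3.0
            total += math.log(p)
        log_post[candidate] = total
    # Normalize in log space for numerical stability.
    peak = max(log_post.values())
    weights = {b: math.exp(v - peak) for b, v in log_post.items()}
    norm = sum(weights.values())
    posterior = {b: w / norm for b, w in weights.items()}
    best = max(posterior, key=posterior.get)
    # Posterior error is the mass on the other bases, floored to cap the quality.
    error = max(sum(p for b, p in posterior.items() if b != best), 1e-10)
    return best, posterior[best], -10.0 * math.log10(error)


# Hypothetical pileup: three unmarked reads support G, one supports A.
pileup = [("G", 0.01), ("G", 0.01), ("G", 0.01), ("A", 0.01)]
base, prob, qual = call_base(pileup)
print(base, round(prob, 6), round(qual, 1))
```

If the land-marked long read carried an A at this position, the call of G from the unmarked reads would indicate that the A was a land-mark and should be replaced.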

Secondary analysis

After the Illumina Complete Long Read construction steps described above, Illumina Complete Long Reads and the unmarked short reads are used for secondary analysis (Figure 7). Complete long reads are first aligned to the genome using a modified version of Minimap2.

For small variant calling, results from DRAGEN small variant calling of long reads and short reads are merged into a single VCF file. DRAGEN small variant calling is capable of processing reads longer than 75 kb. A machine learning model (trained on variant calls from Genome in a Bottle) is used to combine and improve small variant calls obtained from long reads and standard short reads. Finally, a modified version of WhatsHap is used for phasing Illumina Complete Long Reads and merged small variants with new, comprehensive output files created to capture the haplotype information.
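As a rough picture of what merging two call sets involves, the Python sketch below takes the union of long-read and short-read calls and records which read type supports each variant. It deliberately does not reproduce the machine learning model described above; the tuple representation `(chrom, pos, ref, alt)`, the function name `merge_calls`, and the example variants are hypothetical.

```python
def merge_calls(long_read_calls, short_read_calls):
    """Union two call sets keyed by (chrom, pos, ref, alt) and record
    which read type(s) support each variant."""
    merged = {}
    for source, calls in (("long", long_read_calls), ("short", short_read_calls)):
        for variant in calls:
            merged.setdefault(variant, set()).add(source)
    # Sort by variant key so the output order is deterministic.
    return {variant: sorted(sources) for variant, sources in sorted(merged.items())}


# Hypothetical calls (illustrative only).
long_calls = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")]
short_calls = [("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA")]
for variant, sources in merge_calls(long_calls, short_calls).items():
    print(variant, sources)
```

Per-variant support labels of this kind are one plausible input to a downstream model that decides which calls to keep, although the features used by the actual model are not described in the source.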

For structural variant calling, the output of a long-read structural variant caller (Sniffles2)⁵ and the output of the short-read DRAGEN structural variant caller are merged into a single VCF file.

Figure 7: Illumina Complete Long Reads secondary analysis

Highly accurate WGS

Illumina Complete Long Read technology takes advantage of proven Illumina SBS chemistry and DRAGEN secondary analysis to further improve accuracy for human WGS. With PrecisionFDA Truth Challenge v2 data sets, the F1 score reflecting precision and recall for WGS using the Illumina Complete Long Read assay was 99.87% (Figure 8).⁶,⁷ Compared with standard WGS, Illumina Complete Long Read data demonstrate an overall reduction in false negatives and false positives in both SNPs and indels across multiple benchmark samples (Figure 9).

Figure 8: Highest accuracy with Illumina Complete Long Reads

Figure 9: Illumina Complete Long Read assay performs highly accurate variant calling for challenging genic regions
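The F1 score quoted in this section is the harmonic mean of precision and recall, so it falls when either false positives or false negatives rise. The short Python sketch below shows the standard calculation; the counts are hypothetical and are not the benchmark data behind the reported 99.87%.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return the F1 score: the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical variant counts (illustrative only).
score = f1_score(true_positives=990, false_positives=10, false_negatives=30)
print(f"{score:.4f}")  # 0.9802
```

Here precision is 990/1000 and recall is 990/1020, which gives an F1 of about 0.9802.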

Conclusion

Long-read information can help resolve the most challenging regions of the genome. Illumina Complete Long Reads makes comprehensive WGS easily accessible for genomics labs by enabling both long and short reads on the same instrument. Illumina Complete Long Reads offers advantages such as a streamlined, familiar lab workflow, minimal input requirements, large-scale library kit manufacturing, and contiguous reads for producing high-quality and comprehensive variant calling across genic regions.

Learn more

Long-read sequencing

Whole-genome sequencing

Illumina Complete Long Reads product line

Read how using Illumina Complete Long Reads increases accuracy for small variant calling: Comprehensive WGS with Illumina Complete Long Read Prep, Human technical note

Illumina Complete Long Read Prep, Human data sheet

DRAGEN secondary analysis

References

1. Mehio R, Ruehle M, Catreux S, et al. DRAGEN Wins at PrecisionFDA Truth Challenge V2 Showcase Accuracy Gains from Alt-aware Mapping and Graph Reference Genomes. Accessed May 16, 2023.
2. Illumina. Accuracy improvements in germline small variant calling with the DRAGEN Bio-IT Platform. Accessed May 16, 2023.
3. Leinonen M, Salmela L. Extraction of long k-mers using spaced seeds. IEEE/ACM Trans Comput Biol Bioinform. 2022;19(6):3444-3455. doi:10.1109/TCBB.2021.3113131
4. Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004;20(18):3363-3369. doi:10.1093/bioinformatics/bth408
5. Sedlazeck FJ, Rescheneder P, Smolka M, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15(6):461-468. doi:10.1038/s41592-018-0001-7
6. Illumina. Data on file. 2022.
7. PrecisionFDA. Truth Challenge V2: Calling Variants from Short and Long Reads in Difficult-to-Map Regions. precision.fda.gov/challenges/10. Accessed January 12, 2023.
