At Illumina, innovation and research align with a commitment to scientific access, data, and community, and we continuously strive to bring the latest advancements in genomics to our users. We are happy to announce that the HP Advanced Custom Recipe is now available to download. Our research teams developed this recipe by modifying the standard clustering protocol to improve sequencing in difficult-to-read regions. The HP Advanced Custom Recipe may be of interest to users who need higher variant calling performance in certain classes of repetitive sequences for research purposes, and we have therefore released it on our Advanced Research Protocols portal.
The HP Advanced Custom Recipe has been tested on the NextSeq 2000 P3 and P4 XLEAP-SBS reagent kits and has been demonstrated to significantly reduce errors and missed calls associated with strings of repeated nucleotides (homopolymers) and dinucleotide motifs.
Important notice
Sequencing recipes, scripts, and protocols released through the Advanced Research Protocols portal have been developed and tested by Illumina R&D scientists but have not gone through the formal product development process. As a result, official specifications may not apply when using these protocols; that is, the reported Q30 score, output, and run time may vary relative to instrument specifications. The Illumina products mentioned herein are intended for research use only, not for use in diagnostic procedures. Support for Illumina Research scripts falls outside the scope of Illumina’s standard service plan coverage; however, select on-demand services may be available. Contact your sales representative for more information.
Relevance of homopolymers and dinucleotide repeats
Homopolymers are runs of a single repeated nucleotide (for example, AAAAAAAAAAA), while dinucleotide repeats are runs of a repeated two-nucleotide motif (for example, ATATATATAT). Repetitive sequences can create challenges for alignment and variant calling, owing to their low complexity, heterogeneity, movement, and duplication within the genome.1,2 Sequencing artifacts in these contexts are marked by a higher rate of mismatches and increased soft-clipping, and can negatively impact variant calling performance.3,4 Illumina DRAGEN secondary analysis software provides speed and accuracy for variant calling and incorporates specialized methods that address the challenges created by repetitive sequences.5,6
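As an illustration of these two sequence contexts, the following minimal Python sketch locates perfect homopolymers and perfect dinucleotide repeats in a DNA string. The function names are hypothetical, and the default thresholds (≥ 10 bp and ≥ 5 repeats) are taken from the stratification definitions in Table 1. It assumes uppercase A/C/G/T input and is not the method used to build the stratifications.

```python
import re

def find_homopolymers(seq, min_len=10):
    """Return (start, end, base) for each perfect homopolymer of at least min_len bp.

    Coordinates are 0-based, end-exclusive. Assumes uppercase A/C/G/T input.
    """
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]

def find_dinucleotide_repeats(seq, min_repeats=5):
    """Return (start, end, motif) for each perfect dinucleotide repeat.

    The motif must contain two different bases, so homopolymers are excluded.
    """
    pattern = re.compile(r"(([ACGT])(?!\2)[ACGT])\1{%d,}" % (min_repeats - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]

# Toy sequence: an 11 bp A homopolymer and five copies of the AT motif.
example = "GGC" + "A" * 11 + "CGC" + "AT" * 5 + "GC"
print(find_homopolymers(example))
print(find_dinucleotide_repeats(example))
```

Both functions report only perfect repeats; interrupted or ambiguous (N-containing) runs would need a more tolerant definition.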
To assess the relevance of the sequence contexts addressed by this custom recipe, we cross-referenced genomic stratifications for these sequence contexts with the ClinVar database (August 25, 2024 release). Table 1 details the sizes of homopolymers, dinucleotide repeats, and associated flanking regions relative to the genome, as well as the density for each stratification of ClinVar germline pathogenic and likely pathogenic (P+LP) variants with review status ≥ 2 stars. While homopolymers and dinucleotide repeats have a lower density of ClinVar variants than the genome average, together with their flanking regions they account for a total of 1,205 such variants.
Stratification | Definition | Region size (% of genome) | Number of ClinVar P+LP 2+ variants | Density of ClinVar P+LP 2+ variants (variants per Mb) |
---|---|---|---|---|
Homopolymers (≥ 10 bp) | Perfect homopolymers of length ≥ 10 bp | 0.50% | 73 | 5.1 |
Homopolymer flanks (50 bp) | 50 bp regions flanking perfect homopolymers of length ≥ 10 bp | 3.27% | 693 | 7.4 |
Dinucleotide repeats (≥ 5 repeats) | Perfect repeats of dinucleotide motifs with size ≥ 5 repeats | 0.30% | 59 | 6.8 |
Dinucleotide repeat flanks (50 bp) | 50 bp regions flanking perfect repeats of dinucleotide motifs with size ≥ 5 repeats | 1.53% | 380 | 8.6 |
Exome | All exons | 3.3% | 59,450 | 616.8 |
Genome-wide (autosomes) | All autosomal chromosomes (chr1-chr22) | 100% | 66,016 | 23.0 |
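The arithmetic behind Table 1 can be sketched as follows. The counts and percentages are copied from the table. The autosomal genome size is not stated in the text, so it is back-calculated here from the genome-wide row, which is an assumption. Because the published percentages are rounded, the recomputed densities match the table only approximately.

```python
# (region size as % of genome, number of ClinVar P+LP 2+ variants), from Table 1
table1 = {
    "Homopolymers (>= 10 bp)": (0.50, 73),
    "Homopolymer flanks (50 bp)": (3.27, 693),
    "Dinucleotide repeats (>= 5 repeats)": (0.30, 59),
    "Dinucleotide repeat flanks (50 bp)": (1.53, 380),
}

# Total variants across the four repeat-related stratifications.
total_variants = sum(count for _, count in table1.values())

# ASSUMPTION: autosomal genome size inferred from the genome-wide row
# (66,016 variants at 23.0 variants per Mb); not a value given in the text.
GENOME_MB = 66016 / 23.0

def density_per_mb(percent_of_genome, variant_count, genome_mb=GENOME_MB):
    """Variants per Mb for a region covering percent_of_genome of the genome."""
    region_mb = genome_mb * percent_of_genome / 100
    return variant_count / region_mb

print("Total variants:", total_variants)
for name, (percent, count) in table1.items():
    print(f"{name}: {density_per_mb(percent, count):.2f} variants per Mb")
```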
How does the recipe perform?
To assess the performance of the HP Advanced Custom Recipe, we first sequenced NA24385 (HG002) with PCR-free whole-genome sequencing. NA24385 is a human cell line sample that has been well characterized by the Genome in a Bottle Consortium, which generated a truth set of small variants that can be used for benchmarking purposes.7
We prepared replicates of NA24385 (HG002) using the TruSeq DNA PCR-Free library prep kit, and sequenced libraries using 2x151 bp read length with NextSeq 2000 P4 XLEAP-SBS reagents. We sequenced the libraries in three runs performed with the HP Advanced Custom Recipe and three runs performed with the default recipe available in the NextSeq 1000/2000 Control Software Suite v1.7.1. We then analyzed the sequencing data with the DRAGEN Germline v4.3.6 workflow after downsampling to 30× coverage for variant calling comparisons. Average run metrics across the three HP Advanced Custom Recipe runs are shown in Table 2.
Metric | HP Advanced Custom Recipe with XLEAP-SBS P4 Reagent Kit (300 Cycles) [c] | NextSeq 1000/2000 XLEAP-SBS P4 Reagent Kit (300 Cycles) Specification [a, b, c] |
---|---|---|
Reads passing filter | 1.76 B | 1.8 B |
Yield (Gb) | 515 Gb | 540 Gb |
Quality score | 91.39% ≥ Q30 | 90% ≥ Q30 |
Run time | 47 hours, 27 minutes | 44 hours |
a. Output specifications are based on an Illumina PhiX control library at supported cluster densities.
b. Quality scores are based on an Illumina PhiX control library; performance may vary based on library type and quality, insert size, loading concentration, and other experimental factors.
c. Run time includes cluster generation, sequencing, and base calling.
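To put the Table 2 values side by side, the sketch below computes the relative difference between the averaged HP Advanced Custom Recipe runs and the published specification. The numbers are copied from Table 2 and the helper name is hypothetical. Because the specification is based on a PhiX control library (notes a and b) while the test runs used human PCR-free libraries, the comparison is indicative only.

```python
def relative_change_percent(observed, reference):
    """Percent difference of observed relative to reference."""
    return (observed - reference) / reference * 100

# (observed with HP Advanced Custom Recipe, specification), from Table 2
metrics = {
    "Reads passing filter (B)": (1.76, 1.8),
    "Yield (Gb)": (515, 540),
    "Run time (hours)": (47 + 27 / 60, 44),
}

for name, (observed, reference) in metrics.items():
    change = relative_change_percent(observed, reference)
    print(f"{name}: {change:+.1f}% relative to specification")

# Q30 is already a percentage, so report the difference in percentage points.
q30_delta_points = 91.39 - 90
print(f"Bases >= Q30: {q30_delta_points:+.2f} percentage points")
```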
Homopolymer resolution
To investigate correct resolution of homopolymer regions, we calculated the accuracy of the homopolymer length reported in sequenced reads. To exclude any potential variant from the accuracy calculation, we only considered homopolymers in the confident regions of the NIST 4.2.1 truth set that lie more than 50 bp from any true variant. For those homopolymers, we compared the length reported in the reads that fully span the event to the length in the human reference GRCh38 to calculate an accuracy metric. This assessment demonstrated high accuracy for short homopolymers (< 10 bp) with both the XLEAP-SBS Standard Recipe and the HP Advanced Custom Recipe. For longer homopolymers, which are more challenging for sequencing technologies, the HP Advanced Custom Recipe yielded a significant improvement in accuracy compared to the XLEAP-SBS Standard Recipe.
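The idea of this accuracy metric can be illustrated with a simplified Python sketch. It is a toy model, not the analysis pipeline used for this assessment: reads are plain strings, a read "fully spans" the homopolymer only if it contains both flanking sequences exactly, and all function names, flanks, and reads are hypothetical. A real implementation would work from read alignments.

```python
import re

def observed_run_length(read, left_flank, base, right_flank):
    """Length of the run of `base` between the two flanks in a read.

    Returns None when the read does not contain both flanks, that is,
    when it does not fully span the homopolymer in this toy model.
    """
    pattern = re.escape(left_flank) + "(" + re.escape(base) + "*)" + re.escape(right_flank)
    match = re.search(pattern, read)
    return len(match.group(1)) if match else None

def homopolymer_length_accuracy(reads, left_flank, base, ref_length, right_flank):
    """Fraction of fully spanning reads that report the reference length."""
    lengths = [observed_run_length(r, left_flank, base, right_flank) for r in reads]
    spanning = [n for n in lengths if n is not None]
    if not spanning:
        return None
    return sum(n == ref_length for n in spanning) / len(spanning)

# Hypothetical locus: a 12 bp A homopolymer in the reference.
left, right, ref_len = "GCTC", "GTCA", 12
reads = [
    "TTGCTC" + "A" * 12 + "GTCAGG",  # spans, correct length
    "GCTC" + "A" * 11 + "GTCA",      # spans, one base short
    "GCTC" + "A" * 12 + "GTCATT",    # spans, correct length
    "CTC" + "A" * 12 + "GT",         # truncated flanks, excluded
]
accuracy = homopolymer_length_accuracy(reads, left, "A", ref_len, right)
print(accuracy)
```

Restricting the calculation to spanning reads mirrors the text above: a read that ends inside the homopolymer cannot report its full length and would otherwise be miscounted as an error.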
Variant calling performance
We measured analytical sensitivity and specificity for small variants called with the DRAGEN 4.3.6 multigenome (graph) aligner against the NIST 4.2.1 benchmarking set. Variant calling errors are shown genome-wide, in homopolymers ≥ 10 bp and their 50 bp flanks, and in dinucleotide repeats of ≥ 5 units and their 50 bp flanks.
On average, the custom recipe delivers an 8% reduction in small variant errors, with consistent benefits in low-complexity regions. Notably, the NIST 4.2.1 benchmark covers only 69% of homopolymers and their flanks, so it can only partially demonstrate the impact of improved resolution in these regions.
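For readers reproducing this kind of comparison, a percent reduction in errors can be computed as in the sketch below. It assumes that "errors" means false positives plus false negatives against the benchmark, which the text does not state explicitly. The counts are hypothetical placeholders chosen to give an 8% reduction and are not results from this study.

```python
def total_errors(false_positives, false_negatives):
    """ASSUMPTION: errors are counted as false positives plus false negatives."""
    return false_positives + false_negatives

def percent_reduction(baseline_errors, new_errors):
    """Percent reduction in errors relative to the baseline recipe."""
    return (baseline_errors - new_errors) / baseline_errors * 100

# HYPOTHETICAL counts for illustration only; not measured values.
default_errors = total_errors(false_positives=5000, false_negatives=7500)
custom_errors = total_errors(false_positives=4600, false_negatives=6900)

print(f"{percent_reduction(default_errors, custom_errors):.1f}% fewer errors")
```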
Examples of improved support for variant calling
Figures 5–7 provide examples of true variant locations in the NA24385 (HG002) genome where the HP Advanced Custom Recipe has enabled correct variant calling through improved resolution of homopolymers and dinucleotide repeats.
How to access the HP Advanced Custom Recipe
The HP Advanced Custom Recipe has been tested on, and is compatible with, only the NextSeq 2000 P3 and P4 XLEAP-SBS Reagent Kits. It is not currently available on other platforms or kit configurations. To enable the custom recipe for a sequencing run, visit the Advanced Research Protocols web page and download the recipe file.
References:
1. Liao X, Zhu W, Zhou J, et al. Repetitive DNA sequence detection and its role in the human genome. Commun Biol. 2023;6:954. doi:10.1038/s42003-023-05322-y
2. Rajan-Babu I-S, Dolzhenko E, Eberle MA, Friedman JM. Sequence composition changes in short tandem repeats: heterogeneity, detection, mechanisms and clinical implications. Nat Rev Genet. 2024;25:476-499. doi:10.1038/s41576-024-00696-z
3. Singer-Berk M, Gudmundsson S, Baxter S, et al. Advanced variant classification framework reduces the false positive rate of predicted loss-of-function variants in population sequencing data. Am J Hum Genet. 2023;110(9):1496-1508. doi:10.1016/j.ajhg.2023.08.005
4. Stoler N, Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genom Bioinform. 2021;3(1):lqab019. doi:10.1093/nargab/lqab019
5. Behera S, Catreux S, Rossi M, et al. Comprehensive and accurate genome analysis at scale using DRAGEN accelerated algorithms. Nat Biotechnol. 2024. Published 2024 Oct 25. doi:10.1038/s41587-024-02382-1
6. Illumina. Fully featured genome: Expanding the hunt for genomic variation with DRAGEN STR. illumina.com/science/genomics-research/articles/str-expansionhunter.html. Published October 10, 2022. Accessed September 13, 2024.
7. National Institute of Standards and Technology. Genome in a Bottle. nist.gov/programs-projects/genome-bottle. Accessed September 13, 2024.