Predicting the clinical impact of human mutation with deep neural networks

Hong Gao and Kyle Farh; published April 21, 2021

Introduction

Millions of human genomes and exomes have been sequenced, but their clinical applications remain limited due to the difficulty of distinguishing disease-causing mutations from benign genetic variation^1,2. Because of their deleterious effects on fitness, clinically significant genetic variants tend to be extremely rare in the population³. Therefore, the observation of a variant at high frequencies in the population is strong evidence in favor of benign consequence^2,4, enabling pathogenic mutations to be systematically identified by process of elimination. Assaying common variation across diverse human populations is an effective strategy for cataloguing benign variants⁵, but the total amount of common variation in present day humans is limited. Out of more than 70 million potential missense variants in the reference genome, only roughly 1 in 1000 are present at greater than 0.1% overall population allele frequency^5,6.

Outside of modern human populations, chimpanzees are the next closest extant species, sharing 99.4% amino acid sequence identity with humans⁷. The near-identity of protein-coding sequence in humans and chimpanzees suggests that natural selection operating on chimpanzee protein-coding variants might also model the fitness consequences of identical mutations in humans. If polymorphisms that are identical-by-state similarly affect fitness in the two species, the presence of a variant at high allele frequencies in chimpanzee populations should indicate benign consequence in human, expanding the catalog of known benign variants substantially. This is the hypothesis we set out to test, starting with chimpanzee variants.

We demonstrated that common primate variants tend to be benign in the human population. Using hundreds of thousands of common variants from population sequencing of six non-human primate species as training data, we developed PrimateAI, a deep neural network that predicts pathogenic mutations with high accuracy.

Common variants in other primates are largely benign in human

The recent availability of aggregated exome data, comprising 123,136 humans collected in the Exome Aggregation Consortium (ExAC) and Genome Aggregation Database (gnomAD), allows us to measure the impact of natural selection on missense and synonymous mutations across the allele frequency spectrum⁵. Singleton variants (observed only once in the cohort) closely match the expected 2.2:1 missense:synonymous ratio predicted by de novo mutation after adjusting for confounding factors (Fig. 1a)⁸, but at higher allele frequencies the number of observed missense variants decreases due to the purging of deleterious mutations by natural selection.
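The ratio described above can be sketched in a few lines of Python. This is only an illustration: the three-bin scheme, the tuple layout, and the toy counts are assumptions, not the binning or data used in the study.

```python
from collections import Counter

COMMON_AF = 0.001  # 0.1% allele frequency, the "common" threshold quoted in the introduction

def frequency_bin(allele_count, allele_number):
    """Place a variant in a coarse allele-frequency bin (hypothetical binning)."""
    if allele_count == 1:
        return "singleton"
    return "common" if allele_count / allele_number > COMMON_AF else "rare"

def missense_synonymous_ratio(variants):
    """variants: iterable of (consequence, allele_count, allele_number) tuples."""
    counts = Counter()
    for consequence, allele_count, allele_number in variants:
        counts[(frequency_bin(allele_count, allele_number), consequence)] += 1
    ratios = {}
    for name in ("singleton", "rare", "common"):
        synonymous = counts[(name, "synonymous")]
        # A bin with no synonymous variants has an undefined ratio.
        ratios[name] = counts[(name, "missense")] / synonymous if synonymous else float("nan")
    return ratios

# Toy cohort (made-up counts): singletons at 2.2:1, common variants depleted of missense.
toy = (
    [("missense", 1, 1000)] * 22
    + [("synonymous", 1, 1000)] * 10
    + [("missense", 50, 1000)] * 5
    + [("synonymous", 50, 1000)] * 5
)
ratios = missense_synonymous_ratio(toy)
```

A falling ratio from the singleton bin to the common bin is the signature of purifying selection that Fig. 1a describes.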

Figure 1. Missense:synonymous ratios across the human allele frequency spectrum.

Primate variants were obtained from the great ape genome sequencing project and dbSNP^9,10. We first examined common chimpanzee variants that are identical-by-state with human variants (Fig. 1b), and found that the missense:synonymous ratio is largely constant across the human allele frequency spectrum, which is consistent with the absence of negative selection against common chimpanzee variants in the human population. The low missense:synonymous ratio observed in human variants that are identical-by-state with common chimpanzee variants is consistent with the larger effective population size in chimpanzee, which enables more efficient filtering of mildly deleterious variation^11,12.

We next identified human variants that are identical-by-state with variation observed in at least one of six non-human primate species. Variation in each of the six species largely represents common variants, based on the limited number of individuals sequenced and the low missense:synonymous ratios observed for each species. Similar to chimpanzee, we find that the missense:synonymous ratios for variants from the six non-human primate species are roughly equal across the human allele frequency spectrum, other than a mild depletion of missense variation at common allele frequencies (Fig. 2), which is expected due to the inclusion of a minority of rare variants.

We find that human missense variants that are identical-by-state with observed primate variants are strongly enriched for benign consequence in the ClinVar database¹³. After excluding variants of uncertain significance and those with conflicting annotations, ClinVar variants that are present in at least one non-human primate species are annotated as Benign or Likely Benign on average 90% of the time, compared to 35% for ClinVar missense variants in general (Fig. 3). The pathogenicity of ClinVar annotations for primate variants is slightly greater than that observed from sampling a similarly sized cohort of healthy humans (~95% Benign or Likely Benign consequence).
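The ClinVar comparison amounts to a filtered proportion. The sketch below assumes simplified label strings and a made-up list of annotations; it is not the authors' pipeline and the numbers are illustrative only.

```python
# Label sets are assumptions chosen to mirror the categories named in the text.
EXCLUDED = {"Uncertain significance", "Conflicting interpretations"}
BENIGN = {"Benign", "Likely benign"}

def benign_fraction(annotations):
    """Fraction of annotations that are Benign or Likely benign,
    after dropping uncertain and conflicting entries."""
    kept = [label for label in annotations if label not in EXCLUDED]
    if not kept:
        return float("nan")
    return sum(label in BENIGN for label in kept) / len(kept)

# Toy annotations for variants also seen in a non-human primate (made-up data).
primate_overlap = (
    ["Benign"] * 6
    + ["Likely benign"] * 3
    + ["Pathogenic"] * 1
    + ["Uncertain significance"] * 4
)
fraction = benign_fraction(primate_overlap)  # 9 of the 10 retained labels are benign
```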

A deep learning network for variant pathogenicity classification

The importance of variant classification for clinical applications has inspired numerous attempts to use supervised machine learning to address the problem, but these efforts have been hindered by the lack of an adequately-sized truth dataset containing confidently labeled benign and pathogenic variants for training^14-24. Existing databases of human expert curated variants cover a small fraction of the genome, with ~50% of the variants in the ClinVar database coming from only 200 genes (~1% of human protein-coding genes). Moreover, systematic studies reveal that many human expert annotations have questionable supporting evidence^5,25, underscoring the difficulty of interpreting rare variants that may be observed in only a single patient. To reduce human interpretation biases, recent classifiers have been trained on common human polymorphisms or fixed human-chimpanzee substitutions^26-29, but these classifiers also use as their input the prediction scores of earlier classifiers that were trained on human curated databases. Objective benchmarking of the performance of these various methods has been elusive in the absence of an independent, bias-free truth dataset³⁰.

Variation from the six non-human primates (chimpanzee, bonobo, gorilla, orangutan, rhesus, and marmoset) contributes over 300,000 unique missense variants that are non-overlapping with common human variation, and largely represent common variants of benign consequence that have been through the sieve of purifying selection, greatly enlarging the training dataset available for machine learning approaches. On average, each primate species contributes the equivalent of 50K variants, more variants than the current total in the whole of the ClinVar database. Additionally, this content is free from biases in human interpretation.

Using a dataset consisting of common human variants and primate variation, we trained a novel deep residual network, PrimateAI (https://github.com/Illumina/PrimateAI), which takes as input the amino acid sequence flanking the variant of interest and the orthologous sequence alignments in other species (Fig. 4a)³¹. Unlike existing classifiers which employ human-engineered features, our deep learning network learns to extract features directly from primary sequence. To incorporate information about protein structure, we trained separate networks to predict secondary structure and solvent accessibility from sequence alone^32,33, and then included these as sub-networks in the full model (Fig. 4b). Given the small number of human proteins that have been successfully crystallized, inferring structure from primary sequence has the advantage of avoiding biases due to incomplete protein structure and functional domain annotation. The total depth of the network, with protein structure included, was 36 layers of convolutions, consisting of roughly 400,000 trainable parameters.
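To make the input representation concrete, here is a minimal sketch of how the flanking amino acid sequence of a variant might be turned into a numeric array. The window size, the padding symbol, and the function names are assumptions for illustration; the sketch also leaves out the orthologous alignments and structure sub-networks that the full model uses.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def flanking_window(protein, pos, flank):
    """Residues from pos-flank to pos+flank (0-based pos), padded with '-' past either end."""
    return "".join(
        protein[i] if 0 <= i < len(protein) else "-"
        for i in range(pos - flank, pos + flank + 1)
    )

def one_hot(window):
    """One row of 20 zeros/ones per residue; padding symbols stay all-zero."""
    rows = []
    for aa in window:
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:
            row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows

def encode_variant(protein, pos, alt, flank=3):
    """Return one-hot encodings of the reference window and the same window with the substitution."""
    ref_window = flanking_window(protein, pos, flank)
    alt_window = ref_window[:flank] + alt + ref_window[flank + 1:]
    return one_hot(ref_window), one_hot(alt_window)

# Hypothetical protein fragment with a K-to-R substitution at position 1.
ref_oh, alt_oh = encode_variant("MKTAYIAK", pos=1, alt="R", flank=3)
```

Encoding the reference and alternate windows separately lets a convolutional network compare the two sequences directly, which is one plausible way to learn features from primary sequence without hand-engineered inputs.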

To train a classifier using only variants with benign labels, we framed the prediction problem as whether a given mutation is likely to be observed as a common variant in the population. Several factors influence the probability of observing a variant at high allele frequencies, of which we are interested only in deleteriousness. We matched each variant in the benign training set with an unlabeled missense mutation, controlling for the confounding factors, and trained the deep learning network to distinguish between benign variants and matched controls⁸. As the number of unlabeled variants greatly exceeds the size of the labeled benign training dataset, we trained eight networks in parallel, each using a different set of unlabeled variants matched to the benign training dataset, to obtain a consensus prediction.
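The matching and ensembling steps can be sketched as follows. The confounder key (a trinucleotide context standing in for mutation rate, plus a coverage bin), the dictionary fields, and the use of a plain average for the consensus are all assumptions made for illustration, not details confirmed by the text.

```python
import random
from statistics import mean

def confounder_key(variant):
    """Hypothetical matching key: mutational context and sequencing-coverage bin."""
    return (variant["context"], variant["coverage_bin"])

def match_controls(benign, unlabeled, rng):
    """Draw one unlabeled variant per benign variant with the same confounder key,
    sampling without replacement within each key."""
    pools = {}
    for variant in unlabeled:
        pools.setdefault(confounder_key(variant), []).append(variant)
    for pool in pools.values():
        rng.shuffle(pool)
    matched = []
    for variant in benign:
        pool = pools.get(confounder_key(variant))
        if pool:  # benign variants with no available match are skipped
            matched.append(pool.pop())
    return matched

def consensus(models, variant):
    """Average the scores of several independently trained models."""
    return mean(model(variant) for model in models)

# Toy data: two benign variants and ten unlabeled candidates (made up).
benign = [
    {"id": "b1", "context": "ACG", "coverage_bin": "high"},
    {"id": "b2", "context": "TTA", "coverage_bin": "low"},
]
unlabeled = [
    {"id": f"u{i}", "context": ctx, "coverage_bin": cov}
    for i, (ctx, cov) in enumerate([("ACG", "high")] * 5 + [("TTA", "low")] * 5)
]

# Eight differently seeded control sets, one per network in the ensemble.
control_sets = [match_controls(benign, unlabeled, random.Random(seed)) for seed in range(8)]
```

Each control set would be paired with the same benign set to train one network; at prediction time `consensus` combines the eight outputs.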

Example of Pathogenicity Prediction

Using only primary amino acid sequence as its input, the deep learning network accurately assigns high pathogenicity scores to residues at critical protein functional domains, as shown for the voltage-gated sodium channel SCN2A (Fig. 5), a major disease gene in epilepsy, autism, and intellectual disability. The structure of SCN2A consists of four homologous repeats, each containing six transmembrane helices (S1-S6)^34,35. Upon membrane depolarization, the positively-charged S4 transmembrane helix moves towards the extracellular side of the membrane, causing the S5/S6 pore-forming domains to open via the S4-S5 linker. Mutations in the S4, S4-S5 linker, and S5 domains, which are clinically associated with early onset epileptic encephalopathy³⁶, are predicted by the network to have the highest pathogenicity scores in the gene, and are depleted for variants in the healthy population.

We compared the performance of our network with existing classification algorithms, using 10,000 common primate variants that were withheld from training. Because ~50% of all newly arising human missense variants are filtered by purifying selection at common allele frequencies (Fig. 1a), we determined the 50th-percentile score for each classifier using randomly selected variants that were matched to the 10,000 common primate variants by mutational rate and sequencing coverage, and evaluated the accuracy of each classifier at that threshold (Fig. 6). Our deep learning network (91% accuracy) surpassed the performance of other classifiers (80% accuracy for the next best model) at assigning benign consequence to the 10,000 withheld common primate variants. Roughly half the improvement over existing methods comes from using the deep learning network, and half comes from augmenting the training dataset with primate variation, as compared to the accuracy of the network trained with human variation data only (Fig. 6).
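The evaluation at the 50th-percentile threshold can be illustrated with a short sketch. It assumes that higher scores mean more pathogenic and uses made-up scores; the function name is hypothetical.

```python
from statistics import median

def benign_call_accuracy(matched_control_scores, withheld_benign_scores):
    """Set the threshold at the median score of matched random variants, then
    report the fraction of withheld benign variants scoring below it."""
    threshold = median(matched_control_scores)
    correct = sum(score < threshold for score in withheld_benign_scores)
    return threshold, correct / len(withheld_benign_scores)

# Toy scores (made up): matched random variants and withheld common primate variants.
control_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
primate_scores = [0.05, 0.2, 0.4, 0.6]
threshold, accuracy = benign_call_accuracy(control_scores, primate_scores)
```

Because the threshold is defined per classifier from its own score distribution, tools with different score scales can be compared on the same footing.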

To test classification of variants of uncertain significance in a clinical scenario, we evaluated the ability of the deep learning network to distinguish between de novo mutations occurring in patients with neurodevelopmental disorders versus healthy controls. By prevalence, neurodevelopmental disorders constitute one of the largest categories of rare genetic diseases³⁷, and recent trio sequencing studies have implicated a central role for de novo missense and protein-truncating mutations^38-41. We classified each confidently called de novo missense variant in 4,293 affected individuals from the Deciphering Developmental Disorders cohort (DDD)⁴², versus de novo missense variants from 2,517 unaffected siblings in the Simons Simplex Collection cohort (SSC)⁴³, and assessed the difference in prediction scores between the two distributions with the Wilcoxon rank-sum test (Fig. 7a). The deep learning network clearly outperforms other classifiers on this task (Fig. 7b).
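For readers unfamiliar with the Wilcoxon rank-sum test, the following is a minimal pure-Python sketch using the normal approximation with midranks for ties. It omits the tie correction to the variance, the scores are made up, and in practice a library routine such as `scipy.stats.ranksums` would normally be used instead.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.
    Returns (z, p); z > 0 means x tends to have larger values than y."""
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Toy prediction scores (made up) for de novo variants in cases and controls.
case_scores = [0.90, 0.80, 0.85, 0.70, 0.95]
control_scores = [0.10, 0.20, 0.30, 0.40, 0.60]
z, p = rank_sum_test(case_scores, control_scores)
```

A smaller p-value means the classifier's scores separate the two cohorts more cleanly, which is how the methods are ranked in Fig. 7b.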

We next sought to estimate the accuracy of the deep learning network at classifying benign versus pathogenic mutations within the same gene. Given that the DDD population largely consists of index cases of affected children without affected first degree relatives, it is essential to show that the classifier has not inflated its accuracy by favoring pathogenicity in genes with de novo dominant modes of inheritance. We restricted the analysis to 605 genes that were nominally significant for disease association in the DDD study, calculated from protein-truncating variation only⁴². Within these genes, de novo missense mutations are enriched 3:1 compared to expectation (Fig. 8a), indicating that ~67% are pathogenic. The deep learning network was able to discriminate pathogenic and benign de novo variants within the same set of genes (Fig. 8b), outperforming other methods by a large margin (Fig. 8c).
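The step from a 3:1 enrichment to roughly 67% pathogenic can be written out explicitly. The symbols O (observed de novo missense count) and E (count expected by chance) are introduced here for illustration, under the assumption that the expected background is benign and the excess is pathogenic.

```latex
\frac{O}{E} = 3
\quad\Longrightarrow\quad
\text{fraction pathogenic} \approx \frac{O - E}{O} = 1 - \frac{E}{O} = 1 - \frac{1}{3} \approx 0.67
```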

At a binary cutoff of ≥ 0.803 (Fig. 9a), 65% of de novo missense mutations in cases are classified by the deep learning network as pathogenic, compared to 14% of de novo missense mutations in controls, corresponding to a classification accuracy of 88% (Fig. 9b). Given frequent incomplete penetrance and variable expressivity in neurodevelopmental disorders⁴⁴, this figure likely underestimates the accuracy of our classifier due to the inclusion of partially penetrant pathogenic variants in controls.
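Applying the binary cutoff is a simple thresholding step, sketched below with made-up scores. Only the 0.803 cutoff comes from the text; the function name and data are hypothetical, and the sketch does not reproduce how the 88% accuracy figure was derived.

```python
CUTOFF = 0.803  # binary cutoff quoted in the text

def fraction_called_pathogenic(scores, cutoff=CUTOFF):
    """Fraction of variants whose score is at or above the cutoff."""
    return sum(score >= cutoff for score in scores) / len(scores)

# Toy scores (made up) for de novo missense variants in cases and controls.
case_scores = [0.95, 0.81, 0.803, 0.40]
control_scores = [0.10, 0.30, 0.85, 0.50]
case_rate = fraction_called_pathogenic(case_scores)
control_rate = fraction_called_pathogenic(control_scores)
```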

Our results suggest that systematic primate population sequencing is an effective strategy to classify the millions of human variants of uncertain significance that currently limit clinical genome interpretation. The accuracy of our deep learning network on both withheld common primate variants and clinical variants increases with the number of benign variants used to train the network. Cataloging common variation from additional primate species would improve interpretation for millions of variants of uncertain significance, further advancing the clinical utility of human genome sequencing.

Acknowledgements

We would like to thank J. K. Pritchard, M. E. Hurles, J. W. Belmont, and R. E. Green for insightful discussions. We would like to thank the Genome Aggregation Database (gnomAD) and the groups that provided exome and genome variant data to this resource. Yanjun Li and Xiaolin Li were partially supported by R01GM110240 from the National Institute of General Medical Sciences and by the National Science Foundation (grants CNS-1747783, CNS-1624782, and OAC-1229576). We would like to acknowledge the authors of the original paper, including Laksshman Sundaram, Samskruthi Reddy Padigepati, Jeremy F. McRae, Yanjun Li, Jack A. Kosmicki, Nondas Fritzilas, Jorg Hakenberg, Anindita Dutta, John Shon, Jinbo Xu, Serafim Batzoglou, and Xiaolin Li.

External links

Publication: https://pubmed.ncbi.nlm.nih.gov/30038395/

Software: https://github.com/Illumina/PrimateAI

Primate polymorphisms from the great ape genome project: https://eichlerlab.gs.washington.edu/greatape/data.html

Primate polymorphisms from the dbSNP database: https://www.ncbi.nlm.nih.gov/snp/

PrimateAI scores of 70 million variants: https://basespace.illumina.com/s/cPgCSmecvhb4

References

MacArthur, D. G. et al. Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469-476, doi:10.1038/nature13127 (2014).
Rehm, H. L., J. S. Berg, L. D. Brooks, C. D. Bustamante, J. P. Evans, M. J. Landrum, D. H. Ledbetter, D. R. Maglott, C. L. Martin, R. L. Nussbaum, S. E. Plon, E. M. Ramos, S. T. Sherry, M. S. Watson. ClinGen--the Clinical Genome Resource. N. Engl. J. Med. 372, 2235-2242 (2015).
Bamshad, M. J., S. B. Ng, A. W. Bigham, H. K. Tabor, M. J. Emond, D. A. Nickerson, J. Shendure. Exome sequencing as a tool for Mendelian disease gene discovery. Nat. Rev. Genet. 12, 745–755 (2011).
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 17, 405-424, doi:10.1038/gim.2015.30 (2015).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285-291, doi:10.1038/nature19057 (2016).
Liu, X., X. Jian, E. Boerwinkle. dbNSFP: A lightweight database of human nonsynonymous SNPs and their functional predictions. Human Mutation 32, 894–899 (2011).
Chimpanzee Sequencing Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69-87, doi:10.1038/nature04072 (2005).
Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat Genet 46, 944-950, doi:10.1038/ng.3050 (2014).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, 308-311, doi:10.1093/nar/29.1.308 (2001).
Prado-Martinez, J. et al. Great ape genome diversity and population history. Nature 499, 471-475 (2013).
Kimura, M. The neutral theory of molecular evolution. Cambridge University Press, 1983
de Manuel, M. et al. Chimpanzee genomic diversity reveals ancient admixture with bonobos. Science 354, 477-481, doi:10.1126/science.aag2602 (2016).
Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res 44, D862-868, doi:10.1093/nar/gkv1222 (2016).
Ng, P. C. & Henikoff, S. Predicting deleterious amino acid substitutions. Genome Res 11, 863-874, doi:10.1101/gr.176601 (2001).
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat Methods 7, 248-249, doi:10.1038/nmeth0410-248 (2010).
Chun, S., J. C. Fay. Identification of deleterious mutations within three human genomes. Genome Research 19, 1553-1561 (2009).
Schwarz, J. M., C. Rödelsperger, M. Schuelke, D. Seelow. MutationTaster evaluates disease-causing potential of sequence alterations. Nat. Methods 7, 575–576 (2010).
Reva, B., Antipin, Y. & Sander, C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 39, e118, doi:10.1093/nar/gkr407 (2011).
Dong, C. et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet 24, 2125-2137, doi:10.1093/hmg/ddu733 (2015).
Carter, H., Douville, C., Stenson, P. D., Cooper, D. N. & Karchin, R. Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics 14 Suppl 3, S3, doi:10.1186/1471-2164-14-S3-S3 (2013).
Choi, Y., Sims, G. E., Murphy, S., Miller, J. R. & Chan, A. P. Predicting the functional effect of amino acid substitutions and indels. PLoS One 7, e46688, doi:10.1371/journal.pone.0046688 (2012).
Gulko, B., Hubisz, M. J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat Genet 47, 276-283, doi:10.1038/ng.3196 (2015).
Shihab, H. A. et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics 31, 1536-1543, doi:10.1093/bioinformatics/btv009 (2015).
Quang, D., Chen, Y. & Xie, X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31, 761-763, doi:10.1093/bioinformatics/btu703 (2015).
Bell, C. J., D. L. Dinwiddie, N. A. Miller, S. L. Hateley, E. E. Ganusova, J. Midge, R. J. Langley, L. Zhang, C. L. Lee, R. D. Schilkey, J. E. Woodward, H. E. Peckham, G. P. Schroth, R. W. Kim, S. F. Kingsmore. Comprehensive carrier testing for severe childhood recessive diseases by next generation sequencing. Sci. Transl. Med. 3, 65ra64 (2011).
Kircher, M., D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, J. Shendure. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310-315 (2014).
Smedley, D. et al. A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease. Am J Hum Genet 99, 595-606, doi:10.1016/j.ajhg.2016.07.005 (2016).
Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet 99, 877-885, doi:10.1016/j.ajhg.2016.08.016 (2016).
Jagadeesh, K. A., A. M. Wenger, M. J. Berger, H. Guturu, P. D. Stenson, D. N. Cooper, J. A. Bernstein, G. Bejerano. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nature Genetics 48, 1581-1586 (2016).
Grimm, D. G. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Human Mutation 36, 513-523 (2015).
He, K., X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).
Heffernan, R. et al. Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci Rep 5, 11476, doi:10.1038/srep11476 (2015).
Wang, S., J. Peng, J. Ma, J. Xu. Protein secondary structure prediction using deep convolutional neural fields. Scientific Reports 6, 18962-18962 (2016).
Payandeh, J., Scheuer, T., Zheng, N. & Catterall, W. A. The crystal structure of a voltage-gated sodium channel. Nature 475, 353-358, doi:10.1038/nature10238 (2011).
Shen, H. et al. Structure of a eukaryotic voltage-gated sodium channel at near-atomic resolution. Science 355, eaal4326, doi:10.1126/science.aal4326 (2017).
Nakamura, K. et al. Clinical spectrum of SCN2A mutations expanding to Ohtahara syndrome. Neurology 81, 992-998, doi:10.1212/WNL.0b013e3182a43e57 (2013).
Vissers, L. E., Gilissen, C. & Veltman, J. A. Genetic studies in intellectual disability and related disorders. Nat Rev Genet 17, 9-18, doi:10.1038/nrg3999 (2016).
Neale, B. M. et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485, 242-245, doi:10.1038/nature11011 (2012).
Sanders, S. J. et al. De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485, 237-241, doi:10.1038/nature10945 (2012).
De Rubeis, S. et al. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209-215, doi:10.1038/nature13772 (2014).
Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223-228, doi:10.1038/nature14135 (2015).
Deciphering Developmental Disorders Study. Prevalence and architecture of de novo mutations in developmental disorders. Nature 542, 433-438, doi:10.1038/nature21062 (2017).
Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216-221, doi:10.1038/nature13908 (2014).
Zhu, X., Need, A. C., Petrovski, S. & Goldstein, D. B. One gene, many neuropsychiatric disorders: lessons from Mendelian diseases. Nat Neurosci 17, 773-781, doi:10.1038/nn.3713 (2014).
