University of Dundee Mapping genetic variations to three-dimensional protein structures to enhance variant interpretation

semanticscholar(2017)

Abstract
The translation of personal genomics to precision medicine depends on the accurate interpretation of the multitude of genetic variants observed for each individual. However, even when genetic variants are predicted to modify a protein, their functional implications may be unclear. Many diseases are caused by genetic variants affecting important protein features, such as enzyme active sites or interaction interfaces. The scientific community has catalogued millions of genetic variants in genomic databases and thousands of protein structures in the Protein Data Bank. Mapping mutations onto three-dimensional (3D) structures enables atomic-level analyses of protein positions that may be important for the stability or formation of interactions; these may explain the effect of mutations and in some cases even open a path for targeted drug development. To accelerate progress in the integration of these data types, we held a two-day Gene Variation to 3D (GVto3D) workshop to report on the latest advances and to discuss unmet needs. The overarching goal of the workshop was to address the question: what can be done together as a community to advance the integration of genetic variants and 3D protein structures that could not be done by a single investigator or laboratory? Here we describe the workshop outcomes, review the state of the field, and propose the development of a framework with which to promote progress in this arena. The framework will include a set of standard formats, common ontologies, a common application programming interface to enable interoperation of the resources, and a Tool Registry to make it easy to find and apply the tools to specific analysis problems. Interoperability will enable integration of diverse data sources and tools and collaborative development of variant effect prediction methods. 
Glusman et al. Genome Medicine (2017) 9:113
* Correspondence: Gustavo@SystemsBiology.org, Institute for Systems Biology, Seattle, WA 98109, USA. Full list of author information is available at the end of the article.
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background
Recent progress in DNA-sequencing technologies has ushered in an era of rapid and cost-effective genome sequencing, enabling clinical applications [1] and the potential for personalized systems medicine [2] through the understanding of an individual’s genetic risks and by integration with longitudinal phenotype measurements [3]. The detailed knowledge of an individual’s genotype poses a significant interpretation challenge: while genetic variants disrupting transcript structure and protein-coding sequences (for example, nonsense mutations) have long been considered “low hanging fruit” relative to variants in non-coding sequences, the field still struggles with interpreting missense mutations, which are more common, and more frequently associated with disease [4]. This has led to an increasing number of variants of uncertain significance (VUS). To address the resulting annotation and reporting challenges [5, 6], the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) have released variant interpretation guidelines based on pathogenicity [7].
The interpretation of variants relies on a combination of multiple lines of evidence, including the frequency of the variant in the population (common variants are less likely to be pathogenic), the mode of segregation in pedigrees (for example, de novo mutations not observed in parents are more likely to be pathogenic than those that are inherited), the mode of presentation in affected individuals (for example, single dominant variant, single variant in homozygous state, two variants in compound heterozygous state), the predicted effect on RNA and protein sequence and structure, and prior knowledge accumulated in curated databases. Many computational tools have been developed to support these assessments (Additional file 1: Table S1). However, multiple challenges remain in the rapidly evolving field of clinical variant interpretation, including differences in allele frequency among different populations, a growing but still incomplete understanding of how variants affect gene regulation, the sequence and structure of RNA and protein products, and the partial, inconsistently presented and sometimes conflicting knowledge in databases. To assess the potential pathogenicity of genetic variants, singly or in combinations, it is useful to assess their frequency in control or general populations, as already mentioned. Public databases are burgeoning with information about genetic variants in humans and in many model organisms. Resources such as dbSNP [8], dbVar [9], COSMIC [10], cBioPortal [11], UniProt [12], Kaviar [13], ClinVar [14], HGMD [15], ExAC, and gnomAD [16] provide data on hundreds of millions of single-nucleotide variants (SNVs) and other types of genetic variations. Each database has a different focus, different sources of data, processing methods, level of coverage, and degree of metadata associated with each variation; some focus only on human variation, while others cover many species.
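The frequency line of evidence described above can be illustrated with a minimal sketch. It assumes a hypothetical `Variant` record holding per-population allele frequencies, and an illustrative rarity threshold of 1% that is not taken from the article or from any guideline. Using the highest frequency across populations reflects the point that a variant rare in one population may be common in another.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Variant:
    # Hypothetical record; names and population labels are made up.
    name: str
    population_af: Dict[str, float]  # allele frequency per population


def max_population_af(variant: Variant) -> float:
    """Highest allele frequency seen in any population (0.0 if unobserved)."""
    return max(variant.population_af.values(), default=0.0)


def rare_variants(variants: List[Variant], af_threshold: float = 0.01) -> List[Variant]:
    """Keep variants that are rare in every population.

    The 1% default is an illustrative assumption, not a recommended cutoff.
    """
    return [v for v in variants if max_population_af(v) < af_threshold]


variants = [
    Variant("var_A", {"pop1": 0.20, "pop2": 0.15}),    # common everywhere
    Variant("var_B", {"pop1": 0.0001, "pop2": 0.03}),  # rare in pop1 only
    Variant("var_C", {"pop1": 0.0002, "pop2": 0.0}),   # rare everywhere
    Variant("var_D", {}),                              # never observed
]
kept = [v.name for v in rare_variants(variants)]
print(kept)
```

A frequency filter of this kind only removes unlikely candidates; the remaining lines of evidence (segregation, presentation, predicted effect, curated knowledge) still have to be weighed for the variants that pass.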
Similarly, each database has differing mechanisms for data access and differing levels of cross-referencing. The biomedical research community is fortunate to have access to such a wealth of information, but its sheer size and disparate nature are also daunting. In addition to public databases, hundreds of DNA- and RNA-sequencing experiments are revealing manifold genetic variants and mutations each year, and an increasing number of these can be linked to protein structure. For example, protein structure analysis of a novel variant in the ubiquitin-protein ligase TRIM11, observed in individuals affected with inflammatory bowel disease, helped determine that the variant is more likely to affect protein–protein interactions rather than protein folding and stability [17]. Functionally important somatic variants in cancer may form statistically significant spatial clusters in three-dimensional protein structure, which are not detectable in one-dimensional sequence, such as kidney-cancer-specific variants in the tumor suppressor gene VHL, which are proximal to the binding site of VHL for its ubiquitination target HIF1A [18]. Simultaneously, there has been great progress in characterizing the 3D structures of proteins [19, 20], both experimentally and computationally. Essentially, all publicly available experimentally derived structures are deposited in the Protein Data Bank (PDB) [21]. When experimentally determined structures are not available for proteins, structural models may be used instead. Protein Model Portal [22] aggregates precomputed models from multiple resources, whereas most methods generate models interactively on request, for example, I-TASSER [23], ModWeb [24], Phyre2 [25], HHpred [26], or SWISS-MODEL [27].
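The following sketch illustrates why variants can cluster in 3D without clustering in sequence. It is not the statistical method of reference [18]: it only lists pairs of mutated residues that are far apart in sequence yet close in space. The coordinates are made up, and the 8 Å distance cutoff and 20-residue sequence separation are illustrative assumptions.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def spatial_pairs(
    coords: Dict[int, Coord],
    cutoff: float = 8.0,    # assumed distance cutoff, in angstroms
    min_seq_sep: int = 20,  # assumed minimum separation in sequence
) -> List[Tuple[int, int]]:
    """Return residue pairs that are close in 3D but distant in sequence."""
    pairs = []
    for (i, a), (j, b) in combinations(sorted(coords.items()), 2):
        if j - i >= min_seq_sep and math.dist(a, b) <= cutoff:
            pairs.append((i, j))
    return pairs


# Made-up C-alpha coordinates keyed by residue number (not from a real structure).
mutated = {
    10: (0.0, 0.0, 0.0),
    12: (3.0, 0.0, 0.0),
    95: (0.0, 5.0, 0.0),
    150: (30.0, 30.0, 30.0),
}
close = spatial_pairs(mutated)
print(close)
```

Residues 10 and 12 are neighbors in sequence and are skipped; residue 95 is far from both in sequence but within the cutoff in space, which is the kind of proximity a one-dimensional view would miss. A real analysis would also need a significance test against a background of random positions.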
Currently available homology models with 40–50% sequence identity to experimental structures already cover approximately 40% of the residues in the human proteome [28], although this does not always include the full-length protein in the correct quaternary structure, but often only specific domains. Beyond simply having 3D models of proteins, it is crucial to annotate the functional substructures in these models with such information as the locations of ligand-binding and active sites, functional domains, regions that are externally accessible versus in the protected interior, protein–protein interaction interfaces, and other structural features that might be related to function [29]. However, the connections between genetic variations and protein structure are not always easy to find. A few computational tools have begun to emerge (cBioPortal [11], COSMIC-3D [30], CRAVAT [31], Jalview [32], MuPIT [33], MutDB [34], STRUM [35], Cancer3D [36]) that enable users to take individual genetic variations, or a list of them, and visualize these in the context of protein structures. For example, CRAVAT [31] allows a user to upload a variant call format (VCF) file [37] (a file format used for representing DNA sequence variations) containing many genetic variants and assess which of those variants map to proteins, and then to explore individual variants in a 3D visualization of each protein when available. STRUM [35] allows users to visualize the structural model of a protein while, in addition, providing the profiles of the folding free-energy changes induced by the single-nucleotide polymorphisms (SNPs) or mutations. The starting point of STRUM is the wild-type sequence with SNPs or mutations, whereas I-TASSER is used to generate 3D protein models from which the impact of genetic mutations on protein stability can be more accurately calculated compared with the sequence-based approaches. Other tools, such as Jalview [32], provide a workbench for exploring variants in context with multiple sequence alignments, molecular structures, and annotations. COSMIC-3D and cBioPortal [11] map and visualize variants in their databases on 3D protein structures. The VIPUR pipeline [38] goes one step further and allows automatic interpretation of the effect of the mutation on the protein structure. The input to VIPUR is the wild-type sequence and the mutation of interest, and, based on the availability of a known structure or homology model, the tool maps the mut
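As a concrete picture of the VCF input mentioned above for tools such as CRAVAT, the sketch below reads the first five fixed columns (CHROM, POS, ID, REF, ALT) of tab-separated VCF data lines. It is a simplified reader written for illustration, not the parser used by any of the tools named here, and the two records are made up.

```python
from typing import Dict, Iterable, List


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse the CHROM, POS, ID, REF and ALT columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt = fields[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),                          # 1-based position
        "id": None if var_id == "." else var_id,  # "." marks a missing value
        "ref": ref,
        "alt": alt.split(","),                    # several ALT alleles are comma-separated
    }


def read_vcf(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Skip header lines (starting with '#') and blank lines; parse the rest."""
    return [
        parse_vcf_line(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


# Made-up example content; positions and identifiers are hypothetical.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t12345\t.\tA\tG\t.\tPASS\t.\n",
    "2\t67890\trs0000\tC\tT,G\t50\tPASS\t.\n",
]
records = read_vcf(example)
print(records)
```

Mapping such records onto a protein structure additionally requires transcript and protein annotation to convert genomic positions into residue numbers, which this sketch does not attempt.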