The Ninth International Biocuration Conference will be held in the hometown of the Swiss-Prot Database, Geneva, Switzerland, from April 10-14, 2016.
# | Abstract | Authors | |
---|---|---|---|
1 |
Each year, millions of people around the world suffer the consequences of misdiagnosis and ineffective treatment of various diseases, especially intractable and rare diseases. Integrating the many kinds of data related to human diseases helps not only to identify drug targets, connect genetic variation to phenotypes and understand the molecular pathways relevant to novel treatments, but also to couple clinical care with biomedical research. To this end, we built the Rare disease Annotation & Medicine (RAM) standards-based database, which provides a reference for mapping and extracting disease-specific information from a multitude of biomedical resources such as free-text articles in MEDLINE and Electronic Medical Records (EMRs). RAM integrates disease-specific concepts from ICD-9, ICD-10, SNOMED-CT and MeSH (http://www.nlm.nih.gov/mesh/MBrowser.html) extracted from the Unified Medical Language System (UMLS), based on the UMLS Concept Unique Identifier (CUI) of each disease term. We also integrated phenotypes from OMIM for each disease term, linking underlying mechanisms to clinical observations. Moreover, we used disease-manifestation (D-M) pairs from existing biomedical ontologies as prior knowledge to automatically recognize D-M-specific syntactic patterns in full-text articles in MEDLINE. Because most record-based disease information in public databases is in textual format, we extracted disease terms and their related biomedical descriptive phrases from Online Mendelian Inheritance in Man (OMIM), the National Organization for Rare Disorders (NORD) and Orphanet using the UMLS Metathesaurus. Currently, RAM contains 2,842 standardized rare disease records, 27,329 phenotypes and 75,335 symptoms. Each record-based disease term in RAM now has 8 annotation fields covering definition, synonyms, symptoms, phenotypes, causes, diagnosis, treatment and cross-linkage; 5 of these are presented in both textual and standards-based structured formats. We continue to update RAM by mapping and extending standardized concepts for the content of each annotation field through ICD, SNOMED-CT, the Human Phenotype Ontology (HPO) and UMLS CUIs, making RAM a standardized information system for biomedical information exchange and data integration between standard-compliant databases.
|
Jinmeng Jia and Tieliu Shi | |
2 |
Mechanistic comprehension of the impact of treatment on disease initiation and progression calls for
techniques that convert ever-increasing literature-based scientific knowledge into a format that is
suitable for modelling, reasoning, and data interpretation. The BEL Information Extraction workFlow
(BELIEF) facilitates the transformation of unstructured information described in the literature into
structured knowledge. BELIEF automatically extracts causal molecular relationships from text and encodes
them in BEL statements. BEL (Biological Expression Language) is a standard language for representing,
integrating, storing, and exchanging biological knowledge extracted from scientific literature in a
computable form. The assembled BEL statements are used to build biological network models that can link
the upstream events to downstream measurable and quantifiable entities, represented as differential
gene expression.
Using BELIEF, we have built a biological network model that describes xenobiotic metabolism in the liver context. The liver is a critical organ responsible for the elimination of toxic compounds by converting them into suitable forms during the xenobiotic metabolism process. During this process, the liver is susceptible to injury, although the molecular aspects of this injury are not completely understood. Through the BELIEF platform, we were able to create a network model that captures important players in the context of liver xenobiotic metabolism. Nuclear receptors and transcription factors, such as the aryl hydrocarbon receptor, play a pivotal role in this comprehensive network model. This network model also includes the signalling pathways that lead to the activation of the phase I enzymes (e.g., the CYP1 family), which convert lipophilic chemical compounds into their hydrophilic forms. It also covers the activation of phase II conjugation enzymes (mainly the transferases), responsible for processes such as glucuronidation, sulfation, methylation and acetylation, as well as phase III membrane transporters, responsible for the elimination of xenobiotic metabolites. This two-layered causal biological network model may represent a step forward in toxicological risk assessment and improve our understanding of how different toxicants are metabolized by the liver. |
Justyna Szostak, Iro Oikonomidi, Giuseppe Lo Sasso, Marja Talikka, Sam Ansari, Juliane Fluck, Sumit Madan, Florian Martin, Manuel-C. Peitsch and Julia Hoeng | |
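To make the notion of a BEL statement concrete, here is a minimal Python sketch that holds a few causal statements as subject-relation-object triples and lists the downstream targets of one node. The statements are written in BEL-like syntax purely for illustration; they echo the abstract's mention of the aryl hydrocarbon receptor and CYP1 enzymes and are not taken from the BELIEF-derived network itself.

```python
# Minimal illustration (not the BELIEF pipeline itself): causal statements of the
# kind BEL expresses, held as (subject, relation, object) triples. The specific
# statements below are hypothetical examples in BEL-like syntax.
from collections import defaultdict

statements = [
    ("p(HGNC:AHR)", "increases", "r(HGNC:CYP1A1)"),
    ("p(HGNC:AHR)", "increases", "r(HGNC:CYP1B1)"),
    ("a(CHEBI:benzo[a]pyrene)", "increases", "act(p(HGNC:AHR))"),
]

# Build a simple downstream map: subject -> list of (relation, object).
network = defaultdict(list)
for subj, rel, obj in statements:
    network[subj].append((rel, obj))

for rel, obj in network["p(HGNC:AHR)"]:
    print(f"p(HGNC:AHR) {rel} {obj}")
```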
3 |
The Gene Ontology (GO) is a valuable resource for the functional annotation of gene products. In line
with users' demands to accurately capture, interpret, analyze and evaluate the available functional
information on genes and their attributes, its role in the scientific community is ever more important.
GO is constantly expanding and improving to best represent our knowledge in various areas of biology,
reflecting the development of new methods and the exponential growth of biosample resources and their
experimental characterization. The current version of GO contains just under 45,000 terms, to which
over 250 million annotations have been made. It is therefore critical to be able to easily and quickly
mine and visualise the available information.
The Gene Ontology Annotation (GOA) team would like to announce the release of a new version of our popular web-based tool for browsing and interpreting the GO and all associated annotations, QuickGO. Benefiting from the constant communication with our international community (lab scientists, expert reviewers, bioinformaticians, clinicians and curators), the new QuickGO offers many more features in addition to improved speed, stability and data visualisation. Addressing the recent expansion of GO scope and aims, QuickGO now provides functional annotations not only to proteins, but also to protein complexes and RNA. The new user interface allows quick and easy searching of gene names, accessions, GO terms and annotations as well as evidence codes - recently connected to the Evidence Code Ontology. The interface, as proven by extensive user testing, is friendly and highly intuitive for users with different levels of GO experience. Additionally, it provides integrated contextual help covering all aspects of functionality. Following the needs and requests of the scientific community contributing to and employing GO data, the filtering options in QuickGO have been significantly redesigned. The basic and most popular filters can now be applied directly while visualizing annotations of interest, while experienced users have access to advanced filters to retrieve, interpret and further analyze their datasets. The front page of the QuickGO browser has direct links to major functions, including our GO slimming wizard. This tool allows mapping of the more granular GO terms to a smaller number of higher-level, broader parent terms, and thus enables a quick overview of a genome or of large experimental datasets, which is of utmost importance for GO enrichment analysis. The wizard's design has been completely redeveloped to focus on simplicity and intuitive browsing, while preserving its accuracy and functionality. The release of this new version of QuickGO, powered by recent innovations in technology, reflects the development of GO and biological sciences. It addresses the various needs of our users, from contributing to GO annotation to applying it to discover functional information about genes and gene products, interpreting results from particular biological disciplines or looking for new ways to direct research. |
Aleksandra Shypitsyna, Melanie Courtot, Elena Speretta, Alexander Holmes, Tony Sawford, Tony Wardell, Sangya Pundir, Xavier Watkins, Maria Martin and Claire O'Donovan | |
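As a concrete illustration of programmatic access to GO terms of the kind QuickGO exposes, the following sketch retrieves one term over HTTP. The service path and the response fields used here are assumptions based on the EBI QuickGO REST services and are not stated in the abstract.

```python
# Sketch only: fetch basic information for a GO term from the QuickGO REST
# service. The endpoint path below is an assumption, not taken from the abstract.
import requests

url = "https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/GO:0008150"
response = requests.get(url, headers={"Accept": "application/json"})
response.raise_for_status()

# The JSON layout (a "results" list of terms with "id" and "name") is assumed.
for term in response.json().get("results", []):
    print(term.get("id"), "-", term.get("name"))
```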
4 |
Genetic screens, transcriptome profiling, and other kinds of genome-wide surveys report lists of genes
as their main result. Analysis of this data is greatly facilitated by comparison to sets of genes
categorized by properties they have in common. Gene sets define 'what the genes do' by describing
biological processes and results of genomic studies in a very simple and intuitive manner. For example,
all genes known to function in a signaling pathway constitute a gene set. Gene Set Enrichment Analysis
(GSEA) methods gain additional power by considering many genes as a group. The Molecular Signatures
Database (MSigDB) was originally developed to supply gene sets for GSEA. A decade later, MSigDB is one
of the most widely used and comprehensive repositories of gene sets. The MSigDB now contains over 13
thousand gene sets, which are organized into eight collections according to their derivation. The
collections include genes grouped by chromosomal locations (C1), canonical pathways and lists curated
from papers (C2), genes sharing cis-regulatory motifs (C3), clusters of genes co-expressed in microarray
compendia (C4), genes grouped according to gene ontology associations (C5), as well as expression
signatures of oncogenic pathway activation (C6) and immunology (C7). A special collection of 'hallmark'
gene sets (H) consists of signatures derived from multiple 'founder' sets. The hallmarks summarize the
original founder sets into signatures that display coherent expression in specific, clearly defined
biological states.
|
Arthur Liberzon, Helga Thorvaldsdóttir, Pablo Tamayo and Jill P. Mesirov | |
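Gene sets such as those in MSigDB are commonly exchanged as GMT files (one tab-separated line per set: name, description, then member genes). The following minimal sketch parses that layout; the file name is a placeholder, and the format details reflect the standard GMT convention rather than anything stated in the abstract.

```python
# Minimal sketch: read gene sets from a GMT file (tab-separated: set name,
# description, then member genes). "hallmark.gmt" is a placeholder file name.
def read_gmt(path):
    gene_sets = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip malformed lines
            name, _description, genes = fields[0], fields[1], fields[2:]
            gene_sets[name] = set(genes)
    return gene_sets

if __name__ == "__main__":
    sets = read_gmt("hallmark.gmt")
    print(len(sets), "gene sets loaded")
```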
5 |
With the rapid development of computer science and information technology, a large amount of medical
data are generated from medical research, medical experiments, medical services, health care, health
management, etc. These data are essential for clinical diagnosis, research and hospital management. How
to manage, store, organize, and use multiple types of data from various sources effectively has become
an important challenge. Currently, multi-source data management and sharing are based on datasets rather than individual data records, which makes effective sharing difficult to achieve. Therefore, following the medical data lifecycle, this study explored building a medical scientific data platform for efficient data sharing, management and utilization. We have designed and developed a platform, called the Medical Science Data Management and Sharing Platform (SDMSPM). The architecture of SDMSPM contains four layers: a support layer, a data storage layer, a functional layer and an application layer, supported by a data standards system and a security system. The platform uses a portal and a browser/server (B/S) framework, focusing on unstructured data storage, data submission through a web interface and in batch mode, metadata specifications, data association, data classification and data sharing. It is worth mentioning that the system uses MongoDB, with BSON-format documents and a dynamic schema, to store unstructured data, so that various types of data can be integrated more easily and quickly. In addition, we have established a dataset metadata specification, a data record metadata specification and dataset classification criteria for dataset and record management. We have also carried out semantic computing based on existing data fields and the Medical Subject Headings (MeSH) thesaurus to discover more relationships between different datasets, enriching relationship storage and providing association services. Experimental data come from the Population and Health Science Data Sharing Platform of China. We have also produced a dedicated data analysis for the cancer field, combining Google Trends and PubMed statistics. HTML5, D3.js, ECharts and other visualization techniques are applied for dynamic data display. User permissions and sharing permissions have also been set up to ensure the security of the sharing platform.
|
Zhang Ze, He Xiaolin, Sun Xiaokang, Qian Qing and Wu Sizhu | |
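The abstract above notes that SDMSPM stores unstructured records in MongoDB using BSON documents and a dynamic schema. The sketch below illustrates that storage pattern with pymongo; the database, collection and field names are invented for illustration and are not part of SDMSPM.

```python
# Sketch of the storage pattern described above: heterogeneous records in a
# schema-less MongoDB collection. Database/collection/field names are invented
# for illustration; they are not taken from SDMSPM.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
records = client["sdmspm_demo"]["data_records"]

# Two records with different shapes can live in the same collection.
records.insert_one({"dataset": "cancer_cohort_01", "type": "imaging",
                    "modality": "CT", "mesh_terms": ["Neoplasms"]})
records.insert_one({"dataset": "cancer_cohort_01", "type": "survey",
                    "questions": 42, "language": "zh"})

# Query across both shapes by the shared dataset field.
for doc in records.find({"dataset": "cancer_cohort_01"}):
    print(doc["type"], doc.get("modality", doc.get("questions")))
```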
6 |
The Type 2 Diabetes Knowledge Portal (http://www.type2diabetesgenetics.org) is an open-access resource
for human genetic information on type 2 diabetes (T2D). It is a central repository for data from large
genomic studies that identify DNA variants whose presence is linked to altered risk of having T2D or
related traits such as high body mass index. Pinpointing these DNA variants, and the genes they affect,
will spur novel insights about how T2D develops and suggest new potential targets for drugs or other
therapies to treat T2D.
The T2D Knowledge Portal aggregates data in a framework that facilitates analysis across disparate data sets while properly crediting researchers and protecting patient privacy. It provides a user-friendly interface that enables scientists to search for information on particular genes or variants and to build queries for the sets of variants associated with particular traits. Data will be added to the knowledgebase on an ongoing basis via the collection of existing data sets and the incorporation of new data sets as they become available, continually increasing the power of analyses that can be performed. Currently, data are stored in a Data Collection Center at the Broad Institute. In the near future, federated nodes at other sites will also receive data and will connect with the T2D Knowledge Portal to allow analyses across all data at all sites. Such federation will enable each node to protect individual patient data in accordance with local regulations while facilitating global access to analyses of the data. Financial support for this project is provided by the Accelerating Medicines Partnership in Type 2 Diabetes—a collaboration of the National Institutes of Health, five major pharmaceutical companies, and three large non-profits—and by the Carlos Slim Foundation. |
Maria Costanzo and Accelerating Medicines Partnership | |
7 |
eMouseAtlas develops tools and resources that enable high-end visualisation of embryo data in the
context of a web browser. Mouse embryo anatomy is delineated to a very high standard and is assigned
EMAPA anatomy ontology terms, enabling queries across EMAGE and GXD gene expression databases. A section
viewer allows visualisation of anatomy on arbitrary sections through mouse embryos – much like a
virtual microtome – whilst a 3D anatomy pop-up window allows users to visualise the delineated
anatomical components in an interactive 3D-context as either a wireframe or surface-rendered model. The
new viewer uses IIP3D and WebGL technology to allow interactive exploration of 3D anatomy in an
HTML5-compatible and WebGL-enabled web browser, without the need for data download.
Both the WebGL navigation tool for the section viewer and the 3D anatomy pop-up window require surface generation of embryo models, and this requires curation effort and sub-sampling of 3D models. We report on our efforts to streamline this process to enable high-throughput visualisation of 3D embryo data. Furthermore, we report on our use of this 3D viewer in prototype web-based visualisation of 3D gene expression and phenotype data. URL: http://www.emouseatlas.org/eAtlasViewer_ema/application/ema/anatomy/EMA27.php |
Chris Armit, Bill Hill, Nick Burton, Lorna Richardson, Liz Graham, Yiya Yang and Richard Baldock | |
8 |
As the rate of data production continues to increase in the field of cancer research it is becoming
increasingly important not only to capture that data but also to develop tools for its efficient
searching and visualization. The Mouse Tumor Biology Database (MTB; http://tumor.informatics.jax.org)
contains data on hundreds of tumor types, some having thousands of records. We have implemented a
faceted, iterative search tool to facilitate efficient searching of large volumes of heterogeneous data
to allow researchers to rapidly identify mouse models of human cancer most relevant to their
research.
The faceted search page presents a summary of all data in MTB. As search terms are selected by the user, the corresponding search results are dynamically displayed. Subsequent search terms are limited to those relevant given the preceding search term selection and search results are updated automatically. Multiple terms may be combined to iteratively focus in on a small set of data. Results may be sorted by clicking the column header of the desired sort field. Color-coded tumor frequency summary information is provided to further aid in data review. Hyperlinks direct the user to more detailed data regarding each model. Access to MTB is provided free of charge and presents researchers with tools to facilitate the identification of experimental models for cancer research. The data in MTB are primarily related to genetically engineered mouse models and Patient Derived Xenograft (PDX) models of human cancer. Data related to cancer cell lines are not comprehensively represented in MTB. Our expert biocurators use controlled vocabularies and internationally accepted gene nomenclature standards to aid the integration of data from a variety of sources. The primary source for data in MTB is the published literature supplemented by submissions from the cancer genetics research community. MTB is supported by NCI grant CA089713. |
Debra M. Krupke, Dale A. Begley, Steven B. Neuhauser, Joel E. Richardson, John P. Sundberg, Carol J. Bult and Janan T. Eppig | |
9 |
Open biological data are distributed over many resources making them challenging to integrate, to update
and to disseminate quickly. Wikidata is a growing, open community database which can serve this purpose
and also provides tight integration with Wikipedia.
In order to improve the state of biological data, facilitate data management and dissemination, we imported all human and mouse genes, and all human and mouse proteins into Wikidata. In total, 59,721 human genes and 73,355 mouse genes have been imported from NCBI and 27,306 human proteins and 16,728 mouse proteins have been imported from the Swissprot subset of UniProt. As Wikidata is open and can be edited by anybody, our corpus of imported data serves as the starting point for integration of further data by scientists, the Wikidata community and citizen scientists alike. The first use case for these data is to populate Wikipedia Gene Wiki infoboxes directly from Wikidata with the data integrated above. This enables immediate updates of the Gene Wiki infoboxes as soon as the data in Wikidata are modified. Although Gene Wiki pages are currently only on the English language version of Wikipedia, the multilingual nature of Wikidata allows for usage of the data we imported in all 280 different language Wikipedias. Apart from the Gene Wiki infobox use case, a SPARQL endpoint and exporting functionality to several standard formats (e.g. JSON, XML) enable use of the data by scientists by both direct query and via mediating domain-specific applications. In summary, we created a fully open and extensible data resource for human and mouse molecular biology and biochemistry data. This resource enriches all the Wikipedias with structured information and serves as a new linking hub for the biological semantic web. |
Sebastian Burgstaller-Muehlbacher, Andra Waagmeester, Elvira Mitraka, Julia Turner, Tim Putman, Justin Leong, Chinmay Naik, Paul Pavlidis, Lynn Schriml, Benjamin Good and Andrew Su | |
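The imported gene and protein items described above can be queried through Wikidata's public SPARQL endpoint. The sketch below counts human gene items; the property and item identifiers used (P351, P703, Q15978631) are assumptions of this example rather than details given in the abstract.

```python
# Sketch: count human gene items in Wikidata via its public SPARQL endpoint.
# The property/item identifiers used here (P351 = Entrez Gene ID, P703 = found
# in taxon, Q15978631 = Homo sapiens) are assumptions of this example.
import requests

query = """
SELECT (COUNT(?gene) AS ?n) WHERE {
  ?gene wdt:P351 ?entrez_id ;      # has an Entrez Gene ID
        wdt:P703 wd:Q15978631 .    # found in taxon: Homo sapiens
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={"query": query, "format": "json"})
response.raise_for_status()
print(response.json()["results"]["bindings"][0]["n"]["value"])
```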
10 |
Delineating protein and genetic interaction networks is critical to understanding complex biological
pathways underlying both normal and diseased states. To further this understanding, the Biological
General Repository for Interaction Datasets (BioGRID) (www.thebiogrid.org) curates genetic and protein
interactions for human and major model organisms, including yeast, worm, fly, and mouse. As of February
2016, BioGRID contains over 1,052,000 interactions manually curated from high throughput data sets and
low throughput studies, as documented in more than 45,800 publications. This includes over 350,000 human
interactions focused on normal and disease-related processes such as the Ubiquitin Proteasome System
(UPS), which is implicated in metabolic, cardiovascular and neurodegenerative diseases, as well as
cancer. Recently, BioGRID has begun to incorporate chemical-protein interaction data, and thereby allow
the association of drugs, toxins and other small molecules with genetic and protein interaction
networks. To date, BioGRID has incorporated over 25,000 manually curated small molecule-target
interactions from DrugBank (www.drugbank.ca), which encompass more than 2,100 unique human proteins and
over 4,300 bioactive molecules. Interestingly, 815 of these human genes are known to be associated with
approximately 810 diseases. Visualization of these drug-target associations is facilitated by a new
interactive Network Viewer that is embedded in BioGRID search page results. The combination of expertly
curated genetic, protein and chemical interaction data into a single resource should facilitate
network-based approaches to drug discovery. The entire BioGRID interaction dataset is freely available
and may be downloaded in easily computable standardized formats.
|
Rose Oughtred, Bobby-Joe Breitkreutz, Lorrie Boucher, Christie S. Chang, Jennifer M. Rust, Andrew Chatr-Aryamontri, Nadine Kolas, Lara O'Donnell, Chandra L. Theesfeld, Chris Stark, Kara Dolinski and Mike Tyers | |
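BioGRID interaction data can be downloaded in tab-delimited formats, as the abstract notes. The sketch below scans such a file for interactions involving one gene; the file name and column headers are assumptions about the tab-delimited export, not guarantees from the abstract.

```python
# Sketch: scan a tab-delimited BioGRID download for interactions involving a
# gene of interest. The file name and the column headers referenced here
# ("Official Symbol Interactor A/B", "Experimental System") are assumptions.
import csv

def interactions_for(path, symbol):
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            a = row.get("Official Symbol Interactor A", "")
            b = row.get("Official Symbol Interactor B", "")
            if symbol in (a, b):
                yield a, b, row.get("Experimental System", "")

if __name__ == "__main__":
    for a, b, system in interactions_for("BIOGRID-ALL.tab2.txt", "TP53"):
        print(a, b, system)
```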
11 |
Protein databases have usually been built through resource-intensive manual curation. The influx of
high-throughput sequencing data calls for more automated curation solutions to keep up with the
ever-growing volume. PRINTS, one of the oldest protein databases, is the target of our research,
which aims at developing an automated model of inter-database integration and building an
enriched fingerprint knowledge base. The main focus of the work is fingerprint annotation, not sequence
content.
More precise and machine-readable cross-references to other databases are made for automatic processing and reasoning; biomedical ontologies, and the Gene Ontology in particular, are used in all relevant places. The extracted knowledge is represented using Semantic Web technologies (RDF and OWL), allowing complex SPARQL queries that span multiple online databases. The main research objective is to construct an ontology of fingerprints, in which they are described using GO terms as well as terms from other biomedical ontologies. The current state of this work will be presented and open problems discussed. |
Ognyan Kulev | |
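To illustrate the kind of Semantic Web querying described above, the following sketch loads an RDF/OWL export with rdflib and runs a simple SPARQL query over it. The file name and the query pattern are hypothetical; the actual fingerprint ontology defines its own classes and properties.

```python
# Sketch: load an RDF/OWL export and run a SPARQL query over it with rdflib.
# The file name and the query used are hypothetical; the real fingerprint
# ontology would define its own terms.
from rdflib import Graph

g = Graph()
g.parse("fingerprints.owl", format="xml")  # hypothetical local OWL/RDF export

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?fingerprint ?label WHERE {
  ?fingerprint rdfs:label ?label .
}
LIMIT 10
"""
for fingerprint, label in g.query(query):
    print(fingerprint, label)
```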
12 |
The EMBL-EBI Complex Portal (www.ebi.ac.uk/intact/complex), a central service for information on
macromolecular complexes, provides manually curated information on stable protein complexes. We provide
unique identifiers, names and synonyms, a list of complex members with their unique identifiers (UniProt,
ChEBI, RNAcentral), function, binding and stoichiometry annotations, descriptions of their topology,
assembly structure, ligands and associated diseases as well as cross-references to the same complex in
other databases (e.g. ChEMBL, GO, PDB, Reactome). Complexes are curated into the IntAct database using
IntAct tools, rules and quality control and are available for search and download via a dedicated
website. They are available for use as annotation objects in a number of resources, including the Gene
Ontology where they can be accessed via Protein2GO. We have also developed a novel JavaScript
visualisation tool that creates a schematic view of the topology and stoichiometry of each complex. The
viewer generates the graphic on the fly, based on the latest database version. Our focus for the next
two years lies in increasing content and coverage by importing 'unreviewed' complexes from GO, PDB and
Reactome, in addition to increased manual curation, and further improving our graphical options by
including 3D structural viewers as well as pathway and expression overlays. A pipeline into InterMine,
including incorporation of the complex viewer, has been developed, enabling export of organism-specific
complexes to model organism resources.
This is a collaborative project, which has already been contributed to by groups such as UniProtKB, Saccharomyces Genome Database, the UCL Gene Annotation Team and MINT database. We welcome groups who are willing to contribute their expertise and will make editorial access and training available to you. Individual complexes will also be added to the dataset, on request. Contact us on intact-help@ebi.ac.uk for further information. |
Birgit Meldal, Colin Combe, Josh Heimbach, Henning Hermjakob and Sandra Orchard | |
13 |
The US National Cancer Institute (NCI) cancer Nanotechnology Laboratory (caNanoLab) data portal is an
online nanomaterial database that allows users to submit and retrieve information on well-characterized
nanomaterials used in biomedicine, including composition, in vitro and in vivo experimental
characterizations, experimental protocols, and related publications. Currently, more than 1,100 curated
nanomaterial records are publicly accessible and can be queried directly from the caNanoLab homepage.
The primary customers of the data are the cancer nanotechnology research community including clinicians
and the NCI Alliance for Nanotechnology in Cancer. However, the content of caNanoLab is relevant to the
broader biomedical research field with interests in the use of nanotechnology for the development of
diagnostics and therapeutics. The database structure is based on characterization assays required for
clinical regulatory review performed by the Nanotechnology Characterization Laboratory (NCL) and the
Centers of Cancer Nanotechnology Excellence (CCNEs) in the NCI Alliance for Nanotechnology in Cancer.
The caNanoLab data model was informed by standards such as the NanoParticle Ontology (NPO) and ISA-TAB.
Class names and attributes are maintained in the NCI cancer Data Standards Repository (caDSR) and
definitions for caNanoLab concepts are maintained in the NCI Thesaurus. The curation of nanotechnology
information is accomplished by selecting relevant publications, manually extracting reported text and
data, and submitting extracted information into caNanoLab. Curated caNanoLab data are converted to
ISA-TAB-Nano files to enable data exchange between individual users or other databases, which is an
interest of the caNanoLab team, along with participation in activities focused on the development of
standards enabling data exchange and supporting interoperability between databases. The caNanoLab team
is also engaged in many activities to better serve the needs of the nanotechnology research community.
Activities range from engaging publication vendors to facilitate linkages between publications and
nanotechnology databases, to working with other groups to develop data standards and guidelines for data
submission and sharing, including community-based programs such as the NCI Nanotechnology Working Group
(Nano WG) and the National Nanotechnology Initiative (NNI), a federal initiative to develop data
standards and deposition guidelines.
|
Mervi Heiskanen, Stephanie Morris, Sharon Gaheen, Michal Lijowski and Juli Klemm | |
14 |
The Protein Data Bank (PDB) is the single global repository for three-dimensional structures of
biological macromolecules and their complexes. Over the past decade, the size and complexity of
macromolecules and their complexes with small molecules deposited to the PDB have increased
significantly. The PDB archive now holds more than 115,000 experimentally determined structures of
biological macromolecules, which are all publicly accessible without restriction. These structures,
including ribosomes and viruses, provide essential information for understanding biochemical processes
and structure-based drug discovery. It is crucial to transform acquired data into readily usable
information and knowledge. The PDB archive represents one of the best-curated and most heavily used
digital data resources in Biology.
Data for each archival entry must be organized and categorized in a meaningful way to support effective data sharing. The PDBx/mmCIF dictionary uses controlled vocabularies to define deposited data items and metadata. Biocuration software tools have been built to use this dictionary to maintain data consistency across the PDB archive. To support scientific advancement and ensure the best data quality and completeness, a working group of community experts in structural biology software works with the wwPDB to enable direct use of PDBx/mmCIF format files across the structure determination pipeline. An overview of creating archive requirements and designing a data model for biological experimental data contributed by multiple data providers will be presented. wwPDB members are RCSB PDB (supported by NSF, NIH, and DOE), PDBe (EMBL-EBI, Wellcome Trust, BBSRC, NIGMS, and EU), PDBj (NBDC-JST) and BMRB (NLM). RCSB PDB, Rutgers, The State University of New Jersey, Piscataway, New Jersey, United States; PDBe, EMBL-European Bioinformatics Institute, Hinxton, United Kingdom; PDBj, Institute for Protein Research, Osaka University, Osaka, Japan; BMRB, BioMagResBank, University of Wisconsin-Madison, Madison, Wisconsin, United States |
Jasmine Young, John Westbrook, Stephen Burley, RCSB PDB Team and wwPDB Team | |
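As an illustration of working with PDBx/mmCIF data items of the kind the dictionary defines, the sketch below reads two standard items from an entry file using Biopython's MMCIF2Dict. The file name is a placeholder, and this is a reader-side example rather than part of the wwPDB deposition or biocuration tooling.

```python
# Sketch: read selected PDBx/mmCIF data items from an entry file using
# Biopython's MMCIF2Dict. "4hhb.cif" is a placeholder file name; the items
# shown (_struct.title, _exptl.method) are standard mmCIF categories.
from Bio.PDB.MMCIF2Dict import MMCIF2Dict

mmcif = MMCIF2Dict("4hhb.cif")

# In recent Biopython versions each value is a list of strings, one per row
# of the category.
print("Title: ", mmcif.get("_struct.title", ["?"])[0])
print("Method:", mmcif.get("_exptl.method", ["?"])[0])
```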
15 |
The Drosophila anatomy ontology (DAO) defines the broad anatomy of the fruitfly Drosophila melanogaster,
a genetic model organism. It contains over 9000 classes, with close to half of these corresponding to
neuroanatomical terms. These terms are used by curators when capturing data from papers, and by users
when searching for information on FlyBase or Virtual Fly Brain.
When the DAO was first developed over 20 years ago, the majority of classes did not include textual information, such as a definition, synonyms or references. These details are essential for curators to be accurate and for users to understand the data. The initial DAO also lacked formalisation, which is critical to ensure correct classification with minimal intervention as the ontology grows. In the last few years, we have made a significant effort to add the missing textual information and to increase the number of inferred classifications. Recently, this work has focused on reviewing the DAO in a systematic manner, making use of the classification into 11 different organ systems, such as muscle, integumentary, adipose, etc. Classes within each organ system have been reviewed together, making it much easier to correct inconsistencies or duplications, and to spot patterns that can be used to write formal definitions. We have so far reviewed classes that belong to 9 of the 11 organ systems in the DAO. This work has increased the number of terms with a definition from 73% to 88%. Future work will focus on completing this effort, by revising the terms in the remaining 2 organ systems. |
Marta Costa, David Osumi-Sutherland, Steven Marygold and Nick Brown | |
16 |
BioSharing (http://www.biosharing.org) is a curated, web-based, searchable portal of three linked
registries of content standards, databases, and data policies in the life sciences, broadly encompassing
the biological, natural and biomedical sciences. Launched in 2011 and built by the same core team as the
successful MIBBI portal, BioSharing harnesses community curation to collate and cross-reference
resources across the life sciences from around the world. Every record is designed to be interlinked,
providing a detailed description not only on the resource itself, but also on its relations with other
life science infrastructures. Serving a variety of stakeholders, BioSharing cultivates a growing
community, to which it offers diverse benefits. It is a resource for funding bodies and journal
publishers to navigate the metadata landscape of the biological sciences; an educational resource for
librarians and information advisors; a publicising platform for standard and database
developers/curators; and a research tool for bench and computer scientists to plan their work.
With over 1,300 records, BioSharing content can be searched using simple or advanced searches, filtered via a filtering matrix, or grouped via the 'Collection' feature. Examples are the NPG Scientific Data and BioMed Central Collections, which collate and link the recommended standards and repositories from their data policies for authors. Similarly, other publishers, projects and organizations are creating Collections by selecting and filtering standards and databases relevant to their work, such as the BD2K bioCADDIE project. As a community effort, BioSharing offers users the ability to 'claim' records, allowing them to update those records. Each claimant also has a user profile that can be linked to their resources, publications and ORCID ID, thus providing visibility for them as an individual. Here, we introduce BioSharing to the International Society of Biocuration, and encourage members to register on the website and claim the record for their database, metadata standard or policy. |
Peter McQuilton, Alejandra Gonzalez-Beltran, Allyson Lister, Eamonn Maguire, Philippe Rocca-Serra, Milo Thurston and Susanna-Assunta Sansone | |
17 |
Genomic Epidemiology, the use of microbial genomic sequences to perform infectious disease outbreak
investigation and surveillance, is increasingly being deployed by many public health agencies worldwide.
During foodborne outbreaks, contextual information is key for identifying sources of pathogen
contamination and exposure. While sequence data usually adheres to a few standardized formats,
additional data such as surveillance and exposure information are mostly unstructured and without
interoperable standards. Currently, public health workers must rely heavily on computational text and
data mining for time-consuming manual curation and analysis of large datasets. A solution providing a
framework for integrating these diverse data types is the use of ontologies. Ontologies, well-defined
and standardized vocabularies interconnected by logical relationships, support logical reasoning over
the data annotated in their terms. Canada's Integrated Rapid Infectious Disease Analysis (IRIDA) project
is developing open-source, user-friendly tools for incorporating microbial genomic data into
epidemiological analyses to support real-time infectious disease surveillance and investigation. Our
research efforts include the development of a Genomic Epidemiology Application Ontology (GenEpiO), which
is crucial for epidemiological and genomics data integration.
To determine the scope and priorities of GenEpiO development, we interviewed public health stakeholders and domain experts and surveyed reporting forms and databases. User activities, lab management software, information and workflows, exposure tracking and reporting systems were profiled to better characterize users' needs. Community standards were reviewed to determine the utility of different ontologies for fulfilling the identified requirements. Laboratory and epidemiological resources were mined for important fields, terms and descriptors. Our work indicates that no single ontology currently covers all attributes required for a genomic epidemiology program. Furthermore, the very breadth of many ontologies hinders their practical use in real time by users with little bioinformatics expertise. With this in mind, user profiles and data requirements were harmonized with different ontological standards to create a single resource. An initial OWL file containing metadata fields and terms describing isolate source attribution, clinical data, whole genome sequencing processes, quality metrics, patient demographics/histories/comorbidities and exposures was created adhering to the best practices of the Open Biological and Biomedical Ontology (OBO) Consortium. This application ontology was made more robust through testing in different pathogen surveillance initiatives. Key gaps in domain vocabulary requiring expansion were also identified, e.g. antimicrobial resistance, whole genome sequencing result reporting, food description and epidemiology. IRIDA's GenEpiO is being developed for integrating important laboratory, clinical and epidemiological data fields. Implementation of GenEpiO will facilitate data standardization and integration, validation and interoperability. Improved querying will facilitate automation of many analyses. Since harmonization of the genomic epidemiology ontology can only be achieved by consensus and wide adoption, IRIDA is currently forming an international consortium to build partnerships and solicit domain expertise. The methods developed in this work are also being applied to other datasets such as those associated with the Canadian Healthy Infant Longitudinal Development (CHILD) study. GenEpiO is a highly anticipated development that will enhance infectious disease investigations, but is also applicable to broader comparative genomic data mining. |
Emma Griffiths, Damion Dooley, Melanie Courtot, Josh Adam, Franklin Bristow, Joao A. Carrico, Bhavjinder K. Dhillon, Alex Keddy, Thomas Matthews, Aaron Petkau, Julie Shay, Geoff Winsor, Robert Beiko, Lynn M. Schriml, Eduardo Taboada, Gary Van Domselaar, Morag Graham, Fiona Brinkman and William Hsiao | |
18 |
An essential part of any useful 'omics' dataset is accurate information on provenance of the material
under investigation. Contextual information of a sample, a fundamental unit of a material entity
isolated from the surrounding environment and subjected to the investigation, is typically captured as a
set of key-value pairs in a sample record, an information artefact about the sampled material.
Requirements for the sample contextual information are shaped by the nature of the sample as well as the
method of investigation. It is therefore critical that an opinion on the contextual data is formulated
in a community of domain experts.
Here we describe ongoing efforts of the marine community to harmonise sample contextual data reporting across scientific domains, including the genomic, oceanographic and biodiversity data along with phenotypic traits of aquacultures and characteristics of bioactive natural products originating from marine microbial and microalgae strains promising for blue biotechnology. We will focus on the community-agreed contextual data standardisation and ontologies integration that significantly simplifies reporting of the sample contextual information to public data archives and leads to better discoverability of 'omics' datasets associated with the samples. |
Petra ten Hoopen, Guy Cochrane and EMBRIC Consortium | |
19 |
The Evidence and Conclusion Ontology (ECO) is a community standard for describing biological research
evidence in a controlled and structured way. Annotations at the world's most heavily used biological
databases (e.g. UniProt, SwissProt, GO, various model organisms, et cetera) are associated with ECO
terms, which represent different types of evidence and thus document the supporting evidence for those
annotations. Evidence terms described by ECO include experimental and computational methods, author
statements curated from the literature, inferences drawn by curators, combinatorial methods, and even
statements of provenance. Because ECO is an ontology, where terms with standard definitions are
networked to one another using defined relationships, it is possible to conduct selective data queries
leveraging the structure of the ontology and automate quality control mechanisms for large-scale data
management. A growing number of resources are coming to rely on ECO, and we are actively developing ECO
to meet their evidence needs. Here we describe recent developments involving the ECO project and some of
its recent collaborations, most notably: (i) release of a new ECO website that contains user
documentation, a news section with up-to-date relevant information, visualization tools, and other
useful information; (ii) improvements to the ontology structure; (iii) moving ECO development to GitHub;
(iv) addition of numerous experimental evidence types; and (v) addition of new evidence classes
describing computationally derived evidence, for example 'position-specific scoring matrix motif search
evidence'. At present ECO is used in over 30 applications (of which we are aware). Recently, we have
worked with a number of groups to expand representation of evidence in ECO. These groups included
SwissProt (diverse experimental assays), UniProt (detection techniques), IntAct (biological system
reconstruction), Gene Ontology (logical inference & synapse research techniques), CollecTF (motif
prediction), Planteome (genotype-phenotype associations), Ontology of Microbial Phenotypes (microbial
assays), and so on. In addition, we have begun collaborating with the Ontology for Biomedical
Investigations (OBI) on representing evidence and conclusions, which we hope will ultimately serve as a
community model for cross-ontology coordination. As ECO continues to grow as a resource, we are seeking
new users and new use cases. Our goal is to become the de facto community standard for representing
evidence in biological curation and database annotations. This material (the ontology and related
resources) is based upon work supported by the National Science Foundation (of the USA) under Award
Number 1458400.
|
Marcus Chibucos, Suvarana Nadendla, Shoshannah Ball, Dustin Olley, Kimuel Villanova, Dinara Sagitova, Ivan Erill and Michelle Giglio | |
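One way to exploit the ontology structure described above is to load the ECO OBO release and walk a term's superclasses. The sketch below uses the obonet package; the download URL and the example identifier ECO:0000269 are assumptions of this illustration rather than details given in the abstract.

```python
# Sketch: load ECO from its OBO release and list the superclasses of one term.
# The download URL and the example identifier (ECO:0000269, assumed to be
# "experimental evidence used in manual assertion") are assumptions here.
import networkx
import obonet

graph = obonet.read_obo("http://purl.obolibrary.org/obo/eco.obo")

term = "ECO:0000269"
# In obonet graphs edges point from a term to its parents, so the term's
# ancestors in the ontology are reached via networkx.descendants.
for parent in networkx.descendants(graph, term):
    print(parent, graph.nodes[parent].get("name"))
```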
20 |
Genetic interactions reveal the functional roles genes play in different biological contexts and reflect
the buffering capacity of biological systems towards genetic or environmental perturbation. Charting the
genetic structure of biological networks is essential for understanding the basis of human health and
disease. A network-based approach to the study of disease requires consistent description of genetic
interactions in humans and in genetically tractable organisms that serve as instructive models for human
biology. Unified descriptors of genetic interactions are also needed to allow accurate comparisons of
mutant phenotypes across different species. Toward this end, WormBase (www.wormbase.org) and BioGRID
(www.thebiogrid.org) have collaborated on the development of a new Genetic Interaction (GI) Ontology,
the goal of which is to unify the nomenclature and interpretation of genetic interactions within the
research community and across various Model Organism Databases (MODs). This GI Ontology encapsulates
coherent definitions of all known genetic interaction types based on structured descriptors that
delineate specific relationships often shared between different interaction types. In order to ensure
consistent descriptions across multiple species, the GI Ontology has been developed with support from
other major MODs, including SGD, CGD, PomBase, FlyBase, and ZFIN. The GI ontology can be readily
combined with species-specific phenotype and tissue ontologies in order to precisely capture the varied
effects and contexts of genetic interactions. This compatibility will be extended to the comprehensive
cross-species phenotype ontology, UberPheno, as developed by the Monarch Initiative
(www.monarchinitiative.org). The BioGRID database will implement the GI Ontology for the curation of
genetic interactions in human and model organisms, including yeast, worm, fish and fly. Adoption of
standardized GI terms will facilitate the integration of genetic interaction datasets that can now be
produced by large-scale CRISPR/Cas9-based screens in human cells and other organisms. Cross-species
comparisons of genetic interaction networks will provide key insights into complex human diseases caused
by multiple genetic perturbations. The GI Ontology has been integrated as a separate Genetic
Interactions branch of the well-established Proteomics Standards Initiative - Molecular Interaction
(PSI-MI) ontology (www.obofoundry.org/ontology/mi.html).
|
Christian A. Grove, Rose Oughtred, Raymond Lee, Kara Dolinski, Mike Tyers, Anastasia Baryshnikova and Paul W. Sternberg | |
21 |
The advent of genome-wide measurements, such as produced by transcriptomics, has led to the publication
of thousands of studies and of the corresponding data files. It is assumed that the availability of
these data to the wider research community will facilitate re-analysis and meta-analysis, leading to
novel insights that go beyond the primary purpose of these studies.
A major challenge in integrating such public studies, however, is the heterogeneity with which they are described. Not only is basic sample information frequently missing, but the descriptions provided often use different vocabularies. For over ten years, the curation team at the Swiss company NEBION has developed and continuously improved application ontologies describing important biological dimensions such as tissues, genotypes, diseases, cancers, perturbations and other factors. The main goal of ontology development was to standardize experimental descriptions for optimal use in data analysis. As a result, the ontologies were built with minimal redundancy, slim tree depth, and vocabularies that most biologists understand. In this talk, we will present the main concepts behind our ontologies. In particular, we will discuss how a new Anatomy ontology was developed in collaboration with the CALIPHO group at the SIB in Geneva and successfully applied in an analysis tool such as GENEVESTIGATOR. |
Kirsten Hochholzer, Jana Sponarova and Philip Zimmermann | |
22 |
The complete dictyBase overhaul and introduction of state of the art software infrastructure will allow
curators to begin annotating new biological features and use existing annotations to represent and
connect data in novel ways. Curated protein interactions via the Gene Ontology (GO) will be used to
represent protein-protein interactions. Curators already privately annotate spatial expression with the
Dictyostelium anatomy ontology and we recently started annotating Dictyostelium disease orthologs with
their respective disease ontology (DO) terms. The updated database will also allow representing GO
annotations with 'GO extensions', which add deeper context to those annotations. In the near future
HTML5 technology will revolutionize the way curators add annotations to the database, allowing the
direct editing of gene pages. Furthermore, it will open the door for direct community annotations on the
gene page for interested users.
Basu S, Fey P, Jimenez-Morales D, Dodson RJ, Chisholm RL. dictyBase 2015: Expanding data and annotations in a new software environment. Genesis. 2015 PMID: 26088819 |
Petra Fey, Siddhartha Basu, Robert Dodson and Rex L. Chisholm | |
23 |
Enzymes play an essential role in all life processes including metabolism, cell communication and DNA
replication. In humans, about 27% of enzymes are associated with diseases, making them important targets
for drug discovery. In addition, their extensive use in biomedicine and industry highlights the
necessity of repositories for enzyme-related data. The UniProt Knowledgebase (UniProtKB) fulfills this
need by providing the scientific community with accurate, concise and easy to access information with
the aim of facilitating enzyme research.
UniProtKB collects and centralises functional information on proteins, with accurate, consistent and rich annotation. For enzymes, which represent between 20-40% of proteomes, UniProtKB provides, in addition to the core annotation, information about EC classification, catalytic activity, cofactors, enzyme regulation, kinetics and pathways all based on the critical assessment of experimental data. Computer-based analysis and, if available, structural data are used to enrich the annotation of the sequence with the identification of active sites and binding sites. Mutagenesis and variants are also annotated. Collectively, they provide valuable information to understand the aetiology of diseases and for the development of medical drugs. By providing accurate annotation of enzymes across a wide range of species, UniProtKB is a valuable resource to make enzyme research easier for scientists and health researchers working in both academia and industry. |
Rossana Zaru, Elisabeth Coudert, Kristian Axelsen, Anne Morgat, and UniProt Consortium | |
24 |
The Universal Protein Resource (UniProt) provides the scientific community with a comprehensive and
richly curated resource of protein sequences and functional information. The centerpiece of UniProt is
the knowledgebase (UniProtKB) which is composed of the expert curated UniProtKB/Swiss-Prot section and
its automatically annotated complement, UniProtKB/TrEMBL.
Expert curation combines the manually verified sequence with experimental evidence derived from biochemical and genetic analyses, 3D-structures, mutagenesis experiments, information about protein interactions and post-translational modifications. Besides harvesting, interpreting, standardizing and integrating data from the literature and numerous resources, curators also check, and often correct, gene model predictions. For plants, this task is focused on Arabidopsis thaliana and Oryza sativa subsp. japonica. As of the end of January 2016, 14'358 manually reviewed entries from Arabidopsis thaliana are present in UniProtKB/Swiss-Prot, including most of the proteins associated with at least one publication containing some functional characterization. Manual expert curation of UniProtKB/Swiss-Prot is complemented by expert-driven automatic annotation of UniProtKB/TrEMBL entries to build a comprehensive, high-quality set of proteins covering the complete proteome of Arabidopsis thaliana. This complete set, currently containing 31'477 proteins, is downloadable from the UniProt Web site (http://www.uniprot.org/proteomes/UP000006548). It is based on the latest data provided by the community, and we are completing the knowledgebase by importing missing information from EnsemblPlants. We recently started to collaborate with Araport, the Arabidopsis portal, and we provide Araport with all the gene model corrections that we introduced on the basis of our trans-species family annotation. Data from high-throughput proteomics experiments constitute a rich potential source of annotations for UniProtKB. Certified experimental peptides that uniquely match the product of a single gene are used to generate annotations describing post-translational modifications and protein processing events, and to confirm protein existence. Around 2'600 Arabidopsis entries now contain annotations extracted from 11 reviewed articles describing large-scale proteomics experiments. |
Michel Schneider, Damien Lieberherr and UniProt Consortium | |
25 |
UniProt provides the scientific community with a comprehensive, high-quality and freely accessible
resource of protein sequence and functional information. Each UniProt Knowledgebase (UniProtKB) entry
contains as much information as possible and includes core mandatory data (the amino acid sequence,
protein name or description, taxonomic data and citation information) as well as widely accepted
biological ontologies, classifications, cross-references, and clear indications of the quality of
annotation in the form of evidence attribution of experimental and computational data. The fruit fly
Drosophila melanogaster has been utilised as a model organism for genetic investigations for over a
century. The Drosophila protein annotation program at UniProtKB focuses on the manual annotation of
characterised D. melanogaster proteins, and UniProtKB currently contains 3,273 reviewed entries from D.
melanogaster. This number continues to increase with each release while existing reviewed entries are
revisited and updated as new information becomes available. The UniProt manual curation process for
Drosophila will be presented, with emphasis on how UniProtKB entries are structured to aid the retrieval
of information by the user.
|
Kate Warner | |
26 |
Zebrafish are increasingly used to model and study human disease. Publications are reporting zebrafish
mutants, wildtype fish treated with Morpholinos, TALENs, or CRISPRs, or zebrafish that have been exposed
to chemical treatments as models of human disease. To facilitate the curation of this information from
publications, curation interfaces have been developed at ZFIN (zfin.org) to annotate zebrafish models of
human disease utilizing the Disease Ontology (DO, http://disease-ontology.org/) in conjunction with
pertinent genotype, sequence targeting reagents and experimental conditions. Each disease model
annotation has an evidence code to indicate how it is supported (TAS: Traceable Author Statement; IC:
Inferred By Curator), along with a citation to the original publication. Disease model annotations can
be viewed on publication pages as well as on disease term pages. Disease term pages include information
about the disease term like disease name, synonyms, definition, cross-references and ontological
relationships. In addition, the disease term page has a section that lists and links to the human genes
known to be associated with the disease, OMIM pages and the corresponding zebrafish orthologs,
providing easy access to related information such as zebrafish gene expression data and zebrafish mutant
phenotype data. In addition to the data view pages, ZFIN produces download files of these annotations
making this information more readily available for the biomedical research community. To enable
searching for zebrafish models of human disease, the category 'Human Disease' has been added to the ZFIN
single box search results, making it easy to find specific disease terms, see relevant genes, and
associated models. Likewise, a 'Human Disease' filter was added to the single box search Gene results to
filter gene sets for those that are associated with a specific human disease. Taken together, the
addition of the disease ontology, curated zebrafish models of human diseases, and added data search
support will streamline the identification of zebrafish models of human diseases.
|
Yvonne Bradford, Sridhar Ramachandran, Sabrina Toro and Doug Howe | |
27 |
Rhea (www.rhea-db.org) is a comprehensive and non-redundant resource of expert curated biochemical
reactions described using species from the ChEBI (Chemical Entities of Biological Interest) ontology of
small molecules.
Rhea has been designed for the functional annotation of enzymes and the description, analysis and reconciliation of genome-scale metabolic networks. All Rhea reactions are balanced at the level of mass and charge. Rhea includes enzyme-catalyzed reactions (covering the IUBMB Enzyme Nomenclature list and additional reactions), transport reactions and spontaneously occurring reactions. Reactions involving complex macromolecules such as proteins, nucleic acids and other polymers that lie outside the scope of ChEBI are also included. Rhea reactions are extensively curated with links to source literature and are mapped to other publicly available metabolic resources such as MetaCyc/EcoCyc, KEGG, Reactome and UniPathway. Rhea reactions are used as a reference for the reconciliation of genome-scale metabolic networks in the MetaNetX resource (www.metanetx.org) and serve as the basis for the computational generation of a library of lipid structures and analytes in SwissLipids (www.swisslipids.org). Here we describe recent developments in Rhea, which include a new website and substantial growth of Rhea through sustained literature curation efforts. At the time of writing, Rhea (release 69, of January 2016) includes 8805 unique approved reactions involving 7688 unique reaction participants, and cites 8500 unique PubMed identifiers. |
Anne Morgat, Thierry Lombardot, Kristian B. Axelsen, Lucila Aimo, Anne Niknejad, Nevila Hyka-Nouspikel, Elisabeth Coudert, Steven Rosanoff, Joseph Onwubiko, Nicole Redaschi, Lydie Bougueleret, Ioannis Xenarios and Alan Bridge | |
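The mass and charge balance mentioned above can be illustrated with a toy check that compares element counts and total charge on both sides of a reaction. The formulas and charges below are typed in by hand for ATP hydrolysis; Rhea itself derives them from ChEBI, and this sketch is not part of the Rhea infrastructure.

```python
# Toy sketch of the mass/charge balance property described above: compare
# element counts and total charge on both sides of a reaction. Formulas and
# charges are entered by hand here; Rhea derives them from ChEBI.
from collections import Counter
import re

def parse_formula(formula):
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts

def side_totals(side):
    elements, charge = Counter(), 0
    for coeff, formula, q in side:
        for element, n in parse_formula(formula).items():
            elements[element] += coeff * n
        charge += coeff * q
    return elements, charge

# ATP + H2O -> ADP + phosphate + H+ (formulas/charges of the major species
# at physiological pH, typed in manually for this example).
left  = [(1, "C10H12N5O13P3", -4), (1, "H2O", 0)]
right = [(1, "C10H12N5O10P2", -3), (1, "HO4P", -2), (1, "H", +1)]

print("balanced:", side_totals(left) == side_totals(right))
```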
28 |
The Gene Ontology (GO) offers a unified framework to describe the roles of gene products in a
species-independent manner. Descriptive terms are linked with genes, gene products, or proteins. When
direct experimental support is available, genes are associated with the appropriate term and the
annotations tagged as having direct evidence. However, the vast majority of proteins will likely never
be studied experimentally. Predicting the function of these uncharacterized proteins requires methods
that are both efficient and accurate. To this end, the GO consortium has developed PAINT, the
Phylogenetic Annotation and INference Tool, based on the protein families generated by the PANTHER
protein family classification system.
We present here some general observations about phylogenetic annotation inference and primary GO annotations, as illustrated by 36 protein families that participate in two well-characterized cellular processes, apoptosis and autophagy. These processes are well conserved in animals and eukaryotes respectively, and phylogenetic analysis with PAINT reveals their elaboration during evolution. We show that annotation integration via phylogenetic relationships can be used to select high confidence annotations that represent the core functions of protein families. The GO phylogenetic annotation project is extending the coverage of proteins annotated, providing a coherent annotation corpus across a number of representative species. In addition, PAINT improves the quality of the entire set of GO annotations by uncovering discrepancies and inaccuracies in the primary annotations. |
Marc Feuermann, Pascale Gaudet, Huaiyu Mi, Suzanna Lewis and Paul Thomas | |
29 |
Ion channels allow ions to flow across membranes in all living cells. They play an important role in key
physiological processes such as nervous transmission, muscle contraction, learning and memory,
secretion, cell proliferation, regulation of blood pressure, fertilization and cell death. In humans, 344
genes encode ion channels. Mutations in more than 126 ion channel and ion channel-interacting protein
genes have been reported to cause diseases, known as channelopathies. Knowledge on the effect of these
mutations is spread throughout the scientific literature. The consolidation of this data on a single
platform will help scientists and clinicians to have a source of validated information for diagnosis and
therapeutic counseling.
The NavMutPredict project focuses specifically on the voltage-gated sodium channel gene family (SCN). The aim is to use all pertinent knowledge concerning mutations in the 9 human sodium channel proteins and their impact on the biophysical properties of the channels. Ultimately, this information should help predict the pathogenicity of newly discovered genetic variations. To do this, we extract information from the biomedical literature, especially findings related to pathologies and functional data. This data is captured using our biocuration software platform, the BioEditor. This tool allows the capture of structured annotations with a very high level of precision using standardized vocabularies and ontologies, such as the Gene Ontology or the Ion Channel ElectroPhysiology Ontology, developed by our group. So far, the BioEditor contains 791 variants found in the SCN proteins and 4127 annotations. All this data will be available to the scientific community via neXtProt and will undoubtedly be a useful resource for a better understanding of ion channel function, essential for understanding channelopathies and developing drugs for new treatments. |
Aurore Britan, Valerie Hinard, Monique Zahn and Pascale Gaudet | |
30 |
The Gene Ontology (GO) is a community resource that represents biological knowledge of physiological
gene functions through the use of a structured and controlled vocabulary. As part of an ongoing project
to improve the representation of specific biological domains in GO, we have focused on autophagy, a
process in which cells digest parts of their own cytoplasm and organelles. This allows for both
recycling of macromolecules under conditions of cellular stress and remodelling of the intracellular
structure during cell differentiation. Well-conserved across species, autophagy is involved in several
pathophysiological events relevant to human health, including cancer, metabolic disorders, and
cardiovascular and pulmonary diseases, as well as neurodegenerative processes such as Parkinson's
disease. Autophagy is also implicated in the response to aging and to exercise.
We have made significant modifications to the ontology structure and hierarchy for autophagy (www.ebi.ac.uk/QuickGO/GTerm?id=GO:0006914), such as making chaperone-mediated autophagy a direct child of autophagy, rather than a synonym. Some existing terms were renamed to reflect their use in the literature and also new terms were created, e.g. 'protein lipidation involved in autophagosome assembly'. Furthermore, we have created terms such as 'mitophagy in response to mitochondrial depolarization' and 'parkin-mediated mitophagy in response to mitochondrial depolarization', because the recruitment of the Parkin / PINK1 pathway as a result of mitochondrial membrane depolarisation is a key part of the selective autophagy of mitochondria (mitophagy). In some cases, it has been necessary to introduce taxon constraints, which restrict usage of some GO terms to specific taxa; e.g. molecular evidence of classical microautophagy was found only in yeast literature. A similar, yet distinct, process known as late endosomal microautophagy was reported initially in mammals, but it is uncertain whether this should be restricted to multicellular eukaryotes. In addition to improving the ontology, substantial effort was applied to annotate the human and mouse proteins involved in autophagy and the regulation of autophagy. So far we have associated 337 GO terms with 249 human proteins, through the expert curation of 60 papers. It was expected that all of the proteins in the Reactome pathway for macroautophagy (http://www.reactome.org/PathwayBrowser/#R-HSA-1632852) would also be annotated with autophagy-related GO terms; however, we found 8 (out of the 67 proteins in the pathway) discrepancies. These differences were reconciled following further literature searches and GO annotation. Through expansion and refinement of the ontology and the annotation of selected literature, we have substantially enriched and updated the representation of autophagy in the GO resource. This work will support the rapid evaluation of new experimental data, and thus help further elucidate the role of autophagy in disease. Funded by NIH:NHGRI HG22073; Gene Ontology Consortium (PIs: JA Blake, JM Cherry, S Lewis, PW Sternberg and P Thomas) and Parkinson's UK: G-1307. |
Paul Denny, Marc Feuermann, David Hill, Paola Roncaglia and Ruth Lovering | |
31 |
MicroRNA (miRNA) regulation of developmental and cellular processes is a relatively new field of study,
however the data generated from such research has so far not been organised optimally to allow inclusion
of this data in pathway and network analyses tools. The association of gene products with terms from the
Gene Ontology (GO) has proven highly effective for large-scale analysis of functional data, but until
recently there has been no substantial effort dedicated to applying GO terms to miRNAs. This lack of
functional annotation for miRNAs has been identified as a problem when performing functional analysis,
where scientists have to rely on annotation of the miRNA gene targets rather than that of the miRNAs. We
have recognised this gap and have started an initiative to curate miRNAs with GO functional annotation,
using the Gene Ontology Consortium guidelines for curation of miRNAs http://wiki.geneontology.org/index.php/MicroRNA_GO_annotation_manual.
Our plan over the next few years is to build a resource comprising high-quality, reliable functional annotations for cardiovascular-related miRNAs; annotations that will be a valuable addition to the advancement of miRNA research in this field. Funded by British Heart Foundation grant RG/13/5/30112. |
Rachael Huntley, Tony Sawford, Maria Martin, Manuel Mayr and Ruth Lovering | |
32 |
SABIO-RK (http://sabio.h-its.org) is a web-accessible, manually curated database that has been
established as a resource for biochemical reactions and their kinetic properties, with a focus on
supporting the computational modelling of biochemical reaction networks. It contains
annotations to controlled vocabularies and ontologies and is interlinked with other
databases. A flexible way of exporting database search results in table-like format is provided. Users
can tailor their own custom-made export format by selecting which properties of the entries in the result
set they want to export. Both the export and the import of data are possible via SBML format.
The existing ways for SABIO-RK to collect feedback from users have recently been extended to improve the SABIO-RK database content and to better match user requirements. If a search returns an empty result, the user is presented with the opportunity to directly request addition of the corresponding data. Similarly, the user can also proactively ask, via SABIO's services, for curation of individual papers or, for example, pathway- and organism-associated data. These user requests can be hidden or made visible to the public as a curation priority list. SABIO-RK is part of the data management node NBI-SysBio within the de.NBI (German Network of Bioinformatics Infrastructure) program, a newly established BMBF-funded initiative with the mission of providing comprehensive first-class bioinformatics services to users in the life sciences. |
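As noted above, SABIO-RK search results can be exchanged in SBML format. Below is a minimal sketch of inspecting such an export with the libsbml Python bindings; the file name is a placeholder for whatever a user actually downloads from SABIO-RK.

```python
# Minimal sketch: inspect reactions and kinetic laws in an SBML file
# exported from SABIO-RK (the file name is hypothetical).
import libsbml

doc = libsbml.readSBML("sabiork_export.sbml")
model = doc.getModel()
if model is None:
    raise SystemExit("No SBML model found in file")

for reaction in model.getListOfReactions():
    reactants = [s.getSpecies() for s in reaction.getListOfReactants()]
    products = [s.getSpecies() for s in reaction.getListOfProducts()]
    print(reaction.getId(), reactants, "->", products)
    law = reaction.getKineticLaw()
    if law is not None:
        # The rate law as an infix formula string, e.g. a Michaelis-Menten expression
        print("  kinetic law:", law.getFormula())
```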
Maja Rey, Ulrike Wittig, Renate Kania, Andreas Weidemann and Wolfgang Muller | |
33 |
Enzymes play a vital role in all life processes and are used extensively in biomedicine and
biotechnology. Information about enzymes is available in several different Bioinformatics resources,
each of which is built with different communities in mind. Researchers are not always aware of how much
is available for them and often go to the same resources, missing out on potentially valuable
information. The Enzyme Portal brings all of this public information together in one place – it is
a unique, invaluable resource for scientists and health researchers working in both academia and
industry. EMBL-EBI relaunched the Enzyme Portal in October 2015. Now fully integrated with UniProt and
the EBI Search, the Enzyme Portal integrates information from several resources, saving researchers
valuable time by providing a concise, cross-linked summary of their enzyme or protein of interest.
The Enzyme Portal searches major public resources including the UniProt Knowledgebase (UniProt KB), the Protein Data Bank in Europe (PDBe), Rhea, Reactome, IntEnz, ChEBI and ChEMBL, and provides a summary of catalytic activity, protein function, small-molecule chemistry, biochemical pathways, drug-like compounds and taxonomy information. It also provides cross-references to the underlying data resources, making it easier to explore the data further. Within the Enzyme Portal, the powerful EBI Search engine can now search a more refined set of enzymes from UniProt, including new enzymes and orthologs in a wide range of species. The sequence search is now based on the enzyme sequence library, using EBI-NCBI Blast. The service offers new search entry points to enzymes according to disease, Enzyme Classification, taxonomy (model organisms) and pathway. There is also a basket functionality, which allows users to store and compare multiple enzymes. The Enzyme Portal's interface is now more intuitive, re-designed based on usability research. Enzyme summaries on the result and entry pages now offer clearer descriptions and set out more complete orthology relationships. Users can easily view enzyme summaries with annotations from PDBe, Rhea, Reactome, IntEnz, ChEBI, ChEMBL and Europe PMC, all of which now provide data to the Enzyme Portal through web services. |
Joseph Onwubiko, Sangya Pundir, Xavier Watkins, Rosanna Zaru, Steven Rosanoff, Claire O'Donovan and Maria Martin | |
34 |
Meaningful visualization of sequence function annotation at both gene and protein level is one of the
cornerstones in Biology and Bioinformatics. In the protein field, sequence annotations, a.k.a. protein
features, are used to describe regions or sites of biological interest; for instance, secondary
structures, domains, post-translational modifications and binding sites, amongst others, play a critical
role in the understanding of what the protein does. With the growth in the amount and complexity of
biological data, integration and visualization become increasingly important in order to expose
different data aspects that might be otherwise unclear or difficult to grasp.
Here we present the UniProt Features Viewer, a BioJS component bringing together curated and large-scale experimental data. Our viewer displays protein features in different tracks providing an intuitive and compact picture of co-localized elements; initial tracks currently include domains & sites, molecule processing, post-translational modifications, sequence information, secondary structure, topology, mutagenesis and natural variants. Each track can be expanded to reveal a more in-depth view of the underlying data, e.g., topological domain, trans-membrane and intra-membrane. The variant track offers a novel visualization using a matrix which maps amino acids to their position on the sequence, therefore allowing the display of a large number of variants in a restricted space. The UniProt Features Viewer presents tracks under a ruler that represents the sequence length of the protein. Anchors located on the left and right sides of the ruler make it easier for users to zoom in to a particular region. Zooming can also be done via an icon located on top of the categories, by positioning the cursor on top of the features area and scrolling, and by using gestures on desktops and mobile devices. Customization is also possible; in particular, category tracks can be hidden so users can focus on those categories more relevant to their research. Features can be selected by clicking on them; on feature selection, additional information such as description, positions and evidence will be displayed in a tool tip. Some of the type tracks provide a specifically tailored view, e.g., for variant data, or use distinctive shapes, e.g., triangles for modified residues or hexagons for glycosylation sites. Modularity and easy integration are core to the UniProt Features Viewer. It has already been integrated into the CTTV website in order to provide a graphical view of proteins, e.g., https://www.targetvalidation.org/target/ENSG00000157764. Other groups such as InterMine (http://intermine.org) have already expressed their interest in using it. Our viewer has also been tested with users in order to assess usability of the product. We will continue to integrate selected large-scale experimental data; we plan to include proteomics-related data, i.e., peptides, as well as antigenic data, i.e., epitope bindings. |
Leyla Jael Garcia, Xavier Watkins, Sangya Pundir, Maria Martin and Uniprot Consortium | |
35 |
Protein 3D-structures provide essential information about protein function in health and disease.
UniProtKB/Swiss-Prot biocurators make use of this wealth of data, combining 3D-structure data with
information derived from the scientific literature to verify protein function and mode of action,
validate enzyme active sites, identify physiologically relevant ligand binding sites and
post-translational modifications, and interactions between proteins, or proteins and nucleic acids. This
information is shown in a structured format in the UniProtKB/Swiss-Prot entries to facilitate retrieval
of specific pieces of information: protein function and subunit structure, cofactor requirements, the
role of specific residues, domains and regions, post-translational modifications, membrane topology,
etc., with evidence tags to indicate the sources of the information. Information from well-characterized
proteins is then propagated to close family members. As a result, out of roughly 550'000
UniProtKB/Swiss-Prot entries, ca. 88'000 contain information about metal-binding sites, 137'000 contain
information about the binding sites for nucleotides or other small organic ligands and about 95'000
contain active site annotations, to cite only the most abundant types of annotation.
In UniProtKB, cross-references to PDB and PDBSum facilitate access to experimental 3D-structures, while cross-references to Swiss Model repository (SMR) and Protein Model Portal facilitate access to theoretical models. In January 2016, UniProtKB/Swiss-Prot contained 123'700 cross-references to PDB, corresponding to over 22'900 entries, mostly from model organisms. Over 25% (5'700) of the 20'200 human entries have a cross-reference to PDB, and the majority of these have at least one matching literature citation. The situation is similar for other model organisms. |
Ursula Hinz and Uniprot Consortium | |
36 |
Non-coding RNAs (ncRNAs) such as microRNAs (miRNAs) are frequently dysregulated in cancer
and have shown great potential as tissue-based markers for cancer classification and prognostication.
ncRNAs are present in membrane-bound vesicles, such as exosomes, in extracellular human body fluids.
Circulating miRNAs are also present in human plasma and serum, where they cofractionate with the Argonaute2 (Ago2)
protein and with high-density lipoprotein (HDL). Since miRNAs and the other ncRNAs circulate in the
bloodstream in highly stable, extracellular forms, they may be used as blood-based biomarkers for
cancer and other diseases. A knowledge base of non-invasive biomarkers is a fundamental tool for
biomedical research.
Data is manually collected from ExoCarta, a database of exosomal proteins, RNA and lipids, and from PubMed. Articles containing information on circulating RNAs are collected by querying the PubMed database using keywords such as 'microRNA', 'miRNA', 'extracellular' and 'circulating'. Data is then manually extracted from the retrieved papers. General information about miRNAs is obtained from miRBase. The aim of miRandola is to collect data concerning RNAs contained not only in exosomes but in all extracellular forms, functionally enriched with information such as diseases, processes, functions, associated tissues, and their potential roles as biomarkers. Here, we present an updated version of the miRandola database, a comprehensive manually curated collection and classification of extracellular circulating RNAs. The first version of the database was published in 2012 and contained 89 papers, 2132 entries and 581 unique mature miRNAs. Now, we have updated the database with 271 papers, 2695 entries, 673 miRNAs and 12 long non-coding RNAs. RNAs are classified into several categories, based on their extracellular form: RNA-Ago2, RNA-exosome, RNA-microvesicles, RNA-HDL and RNA-circulating. Moreover, the database contains several tools, allowing users to infer the potential biological functions of circulating miRNAs, their connections with phenotypes and the drug effects on cellular and extracellular miRNAs. miRandola is the first online resource which gathers all the available data on circulating RNAs in a unique environment. It represents a useful reference tool for anyone investigating the role of extracellular RNAs as biomarkers as well as their physiological function and their involvement in pathologies. miRandola is constantly updated (usually once a year) and the online submission system is a crucial feature which helps ensure that the system is always up-to-date. The future direction of the database is to be a resource for all the potential non-invasive biomarkers such as cell-free DNA, circular RNA and circulating tumor cells (CTCs). miRandola is available online at: http://atlas.dmi.unict.it/mirandola/ References 1) Francesco Russo, Sebastiano Di Bella, Giovanni Nigita, Valentina Macca, Alessandro Laganà, Rosalba Giugno, Alfredo Pulvirenti, Alfredo Ferro. miRandola: Extracellular Circulating microRNAs Database. PLoS ONE 7(10): e47786. doi:10.1371/journal.pone.0047786 2) Francesco Russo*, Sebastiano Di Bella*, Vincenzo Bonnici, Alessandro Laganà, Giuseppe Rainaldi, Marco Pellegrini, Alfredo Pulvirenti, Rosalba Giugno, Alfredo Ferro. A knowledge base for the discovery of function, diagnostic potential and drug effects on cellular and extracellular miRNAs. BMC Genomics 2014, 15(Suppl 3):S4. doi:10.1186/1471-2164-15-S3-S4 |
Francesco Russo, Sebastiano Di Bella, Giovanni Nigita, Federica Vannini, Gabriele Berti, Flavia Scoyni, Alessandro Laganà, Alfredo Pulvirenti, Rosalba Giugno, Marco Pellegrini, Kirstine Belling, Søren Brunak and Alfredo Ferro | |
37 |
Ribosomal 5S RNA (5S rRNA) is a conserved component of the large subunit of all cytoplasmic and the
majority of organellar ribosomes in all organisms. Due to its small size, abundance and conservation, 5S
rRNA was used for many years as a model molecule in studies on RNA structure and RNA-protein interactions,
as well as a molecular marker for phylogenetic analyses. 5SRNAdb is the first database that provides a
high-quality reference set of ribosomal 5S RNAs (5S rRNA) across the three domains of life.
To reduce the redundancy of the data set, each individual database record represents a unique 5S rRNA sequence identified for a particular species. Identical sequences from the same species deposited under distinct accession numbers in the GenBank/ENA databases are collapsed into single records with links to the original GenBank records. All of the records in the database are available in the form of manually curated structural sequence alignments in which each column corresponds to a particular position in the general model of the secondary structure of 5S rRNA. Each individual sequence record or a consensus sequence of multiple records is visualized as a secondary structure diagram showing the most general model based on all sequences from a particular group or from the set of records defined by the user. To make the comparison of alignments and general structure models possible, both the alignments and secondary structure diagrams are produced on templates including all positions present in the master alignments of all sequences from the respective taxonomic domains (i.e. Archaea, Bacteria, Eukaryota and organelles). The content of the alignment can be customized by users. Sequences can be added to the alignment by providing record identifiers or by performing a database search. The nucleotide statistics and secondary structure models are dynamically recalculated to match the current set of sequences in the updated alignment. Alignments can also be generated from scratch by adding subsequent search results. The user interface of the 5S rRNA database was designed to incorporate several solutions enhancing the efficiency of data mining. All browse and search results are shown in separate collapsible windows allowing users to adjust the amount of information visible on each page. The database is available online at http://combio.pl/5srnadb/ |
Maciej Szymanski, Andrzej Zielezinski and Wojciech Karłowski | |
38 |
Although it has been well-documented that glycosylation regulates the development and progression of
cancer through involvement in fundamental processes like cell signaling, invasion, and tumor
angiogenesis, much remains to be done for a full understanding of the cancer glycome and
glycoproteome. Glycosylation of proteins can be altered during malignant progression not only with respect to the
glycan structures but also in their association with the glycoproteins (sites of attachment and their
occupancy). Glycosylation is one of the most prominent post-translational modifications, predicted to
affect more than half of all human proteins, but understanding of its full extent is incomplete,
especially in the cancer context. Furthermore, because glycosylation is a coordinated enzymatic pathway,
observation of altered glycosylation products could be rationalized in terms of changes in enzyme
expression during neoplastic transformation. To study the interplay of proteins involved in
glycosylation, both glycoproteins and glycosyltransferases, we conducted genome-wide next-generation
sequencing (NGS) analysis of RNA-Seq samples labeled as liver hepatocellular carcinoma (LIHC) from The
Cancer Genome Atlas (TCGA). Using BioXpress, a curated gene expression and disease association database,
we identified all genes for the given type of cancer which are differentially expressed between
corresponding tumor and normal pairs. We then cross-referenced this gene list with a comprehensive list
of glycan binding proteins (GBPs) and glycosyltransferases. To generate the glycosylation-specific gene
list, we first retrieved the curated list of human GBPs and glycosyltransferases from the Consortium for
Functional Glycomics (CFG) Functional Glycomics Gateway (http://www.functionalglycomics.org/fg/). We
then retrieved all human UniProtKB/SwissProt entries with keywords Glycoprotein or Glycosyltransferase.
Reported differentially expressed genes were then filtered to report those genes involved in
glycosylation. Additionally, we retrieved expression information from BGEE, the database for Gene
Expression Evolution, and we further reduced the genes of interest to those designated to be orthologous
across a subset of organisms to study the cancer-associated glycogenes from an evolutionary perspective.
To demonstrate the critical function of curation in studies of this type, we re-analyzed the subset of
genes with predicted glycosylation sites derived from the NetNGlyc server (http://www.cbs.dtu.dk/). From
this simple comparison, we can readily see that the completeness, and perhaps more crucially the
correctness, of glycosylation-related annotations directly impacts our ability to derive functional
understanding from such an analysis. We plan to apply this pipeline to a comprehensive pan-cancer study
to determine possible glyco-profiles associated with gene expression in different types of cancer, and
to automate the entire pipeline through the BioXpress engine.
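The cross-referencing step described above amounts to intersecting the differentially expressed gene list with the combined glycogene list. A minimal sketch of that filter is shown below, assuming plain text files of gene symbols; the file names and formats are hypothetical.

```python
# Minimal sketch of the gene-list intersection step (file names are hypothetical).
def read_symbols(path):
    """Read one gene symbol per line, ignoring blank lines."""
    with open(path) as fh:
        return {line.strip().upper() for line in fh if line.strip()}

# Differentially expressed genes in LIHC tumor/normal pairs (e.g. exported from BioXpress)
deg = read_symbols("lihc_differentially_expressed.txt")

# Glycosylation-specific gene list: CFG GBPs/glycosyltransferases plus
# UniProtKB/Swiss-Prot entries with the Glycoprotein or Glycosyltransferase keyword
glycogenes = read_symbols("cfg_glycogenes.txt") | read_symbols("uniprot_glyco_keyword.txt")

glyco_deg = sorted(deg & glycogenes)
print(f"{len(glyco_deg)} differentially expressed glycosylation-related genes")
```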
|
Hayley Dingerdissen, Radoslav Goldman and Raja Mazumder | |
39 |
In my talk I will describe three possible states of the protein molecules and the corresponding
databases and servers for predictions of the disordered and amyloidogenic regions, folding nucleus and
handedness. Disordered regions play important roles in protein adaptation to challenging environmental
conditions. Flexible and disordered residues have the highest propensities to alter the protein packing.
Therefore, identification of disordered/flexible regions is important for structural and functional
analysis of proteins. We created the first library of disordered regions based on the known protein
structures from the clustered protein data bank. Recently we analyzed the occurrence of the disordered
patterns in 122 eukaryotic and bacterial proteomes to create the HRaP database. Amyloid fibrils
formation in organs and tissues causes serious human diseases. Therefore identification of protein
regions responsible for amyloid formation is one of important tasks of theoretical and experimental
investigations. Recently the role of a mirror image conformation as a subtle effect in protein folding
has been considered. The understanding of chirality both in protein structures and amyloid
suprastructures is an important issue in molecular biology now. We are the first who have investigated
the relationship of the protein handedness with the rate of protein folding.
|
Oxana Galzitskaya | |
40 |
The Triticeae Toolbox (T3) triticeaeatoolbox.org is a database for wheat, barley, and oat that contains
genotype and phenotype data used by plant breeders. To allow breeders to select markers for developing
new germplasm, we have done a meta-analysis on all the trials. We performed genome-wide association studies
(GWAS) on each of the 334 phenotype experiments, 55 genotype trials, and 147 traits. The genotypes were
imputed with Beagle version 4 using a 1.3 million SNP haplotype map for better resolution. The
resulting quantitative trait loci (QTL) are identified by location in the reference genome and displayed in the
JBrowse genome browser. The QTLs are prioritized by the gene annotation. The tables provide links to
EnsemblPlants for protein identification and comparative genomics. The website will also be using
QTLNetMiner (ondex.rothamsted.ac.uk/QTLNetMiner) to integrate gene information with annotation,
biochemical pathways, gene expression, comparative genomics, and publications. QTLNetMiner also
provides us with a network viewer to visualize the connections of the integrated information.
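For readers unfamiliar with the underlying computation, a single-marker association scan reduces to testing each SNP against the trait. The sketch below uses simulated dosages and a simple linear regression; it is illustrative only and is not the actual T3 analysis pipeline (which, for example, imputes genotypes with Beagle and prioritizes QTL by annotation).

```python
# Minimal sketch of a single-marker association scan (not the T3 production
# pipeline; a real GWAS would also correct for population structure/kinship).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_lines, n_snps = 200, 1000
genotypes = rng.integers(0, 3, size=(n_lines, n_snps))   # 0/1/2 allele dosages
phenotype = rng.normal(size=n_lines)                      # one trait measurement per line

pvalues = np.empty(n_snps)
for j in range(n_snps):
    # Regress the trait on the dosage at SNP j; the slope p-value tests association
    result = stats.linregress(genotypes[:, j], phenotype)
    pvalues[j] = result.pvalue

# Report markers passing a Bonferroni-corrected threshold as candidate QTL
hits = np.where(pvalues < 0.05 / n_snps)[0]
print("candidate QTL at SNP indices:", hits)
```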
|
Clay Birkett, David Matthews, Peter Bradbury and Jean-Luc Jannink | |
41 |
The Biological General Repository for Interaction Datasets (BioGRID) (http://www.thebiogrid.org) is an
open source database for protein and genetic interactions, protein post-translational modifications and
drug/chemical interactions, all manually curated from the primary biomedical literature. As of February
2016, BioGRID contains over 1,052,000 interactions captured from high throughput data sets and low
throughput studies experimentally documented in more than 45,800 publications. Comprehensive curation of
the literature has been completed for protein and genetic interactions in the budding yeast S.
cerevisiae and the fission yeast S. pombe and protein interactions in the model plant A. thaliana. The
comprehensive curation of human interaction data is currently not feasible due to the vast number of
relevant biomedical publications each month. To address this in part, we have taken the approach of
themed curation of interactions implicated in central cell biological processes, in particular those
implicated in human disease. In order to enrich for publications that contain relevant interaction data,
we are using state-of-the-art text mining methods, which have effectively doubled the rate and coverage
of our manual curation throughput over the past five years. To date, we have curated themed human
interaction data in the ubiquitin-proteasome system (UPS), the autophagy system, the chromatin
modification (CM) network and the DNA damage response (DDR) network. With the recent development of
CRISPR/Cas9 genome editing technology, the stage is set to explosively expand the landscape of human
genetic interaction data through genome-wide phenotypic and genetic interaction screens. In conjunction
with WormBase and other model organism database partners, we have developed a genetic interaction
ontology to allow rigorous annotation of genetic interactions across all species, including humans. We
are currently building a dedicated portal with BioGRID to allow interrogation of high-throughput human
genetic interaction data. A curation pipeline has also been established to capture chemical/drug
interaction data from the literature and other existing resources (see poster by Oughtred et al.). All
BioGRID data is archived as monthly releases and freely available to all users via the website search
pages and customizable dataset downloads. These new developments in data curation, along with the
intuitive query tools in BioGRID that allow facile data mining and visualization, should help enable
fundamental and applied discovery by the biomedical research community.
|
Lorrie Boucher, Rose Oughtred, Jennifer Rust, Christie Chang, Bobby-Joe Breitkreutz, Nadine Kolas, Lara O'Donnell, Chris Stark, Andrew Chatr-Aryamontri, Kara Dolinski and Mike Tyers | |
42 |
Knowledge on transcription factors (TF) and their binding sites (TFBS) provides the basis for a wide spectrum
of studies in regulatory genomics, from reconstruction of regulatory networks to functional annotation
of transcripts and sequence variants. While TFs may recognize different sequence patterns in different
conditions, it is pragmatic to have a single generic model for each particular TF as a baseline for
practical applications. We provide the HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO,
http://hocomoco.autosome.ru) containing DNA binding patterns for nearly six hundred human and four
hundred mouse TFs.
ChIP-Seq data appears to be the most informative data source on TF binding in vivo. Yet, a ChIP-Seq peak does not guarantee that the tested protein binds DNA directly, without any mediators, and provides only an approximate location of the protein binding site. In vitro technologies like HT-SELEX guarantee direct binding but usually yield less accurate and more biased motif models. The precise location of binding sites in genomic segments can be predicted by computational methods based on binding motif models such as positional weight matrices (PWMs) or dinucleotide PWMs (diPWMs). To obtain a comprehensive, non-redundant, and unbiased set of PWMs we developed a pipeline that integrates multiple ChIP-Seq and HT-SELEX datasets, and validates the resulting models against in vivo data. We used 1690 human and mouse publicly available ChIP-Seq data sets, reproduced read mapping and peak calling with the same pipeline, combined them with 542 HT-SELEX datasets, and supplied them to the ChIPMunk motif discovery tools to obtain PWMs and diPWMs. The resulting motifs were subject to manual curation. As valid models we selected those that were (i) similar to those already known, (ii) consistent within a TF family, or, at least, (iii) with a clearly exhibited consensus (based on LOGO representation). About 50% of models passed the curation and were included in automated benchmarking. The tested motifs were added to the aggregated set of TFBS models obtained from JASPAR, HOMER, SWISSREGULON, the original HT-SELEX databases, as well as the previous release (v9) of HOCOMOCO. Iterative benchmarking identified multiple {PWM, ChIP-Seq dataset} pairs with predicted TFBS consistent with ChIP-Seq peak calls. Benchmarking allowed discarding ChIP-Seq datasets that were not confirmed by any TF binding model, and binding models that did not perform well on any ChIP-Seq dataset. Similarly, we constructed the largest collection to date of dinucleotide PWM models for dozens of human and mouse TFs. To facilitate practical application, all HOCOMOCO models are linked to gene and protein databases (Entrez Gene, HGNC, UniProt) and accompanied by precomputed thresholds. For TFs involved in epigenetic modifications (DNA methylation, histone modification, chromatin remodeling, etc.), the TF data are cross-linked with the sister database EpiFactors, which contains information about epigenetic regulators, their complexes, targets and products (http://epifactors.autosome.ru/). In particular, it includes transcription activity data (FANTOM5 CAGE) across a collection of human primary cell samples for approximately 200 cell types, 255 different cancer cell lines, and 134 human postmortem tissues. |
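As background for readers unfamiliar with PWM scanning, the sketch below scores every position of a DNA sequence against a toy position weight matrix. It is purely illustrative: the matrix values are made up and this is not the HOCOMOCO/ChIPMunk code.

```python
# Illustrative PWM scan (toy log-odds matrix, not an actual HOCOMOCO model).
pwm = {
    "A": [ 1.2, -0.5, -1.0,  0.8],
    "C": [-0.7,  1.0, -0.3, -1.2],
    "G": [-1.1,  0.4,  1.3, -0.6],
    "T": [ 0.2, -1.3, -0.9,  0.9],
}
motif_len = 4

def score_window(window):
    """Sum the per-position log-odds weights for one window."""
    return sum(pwm[base][i] for i, base in enumerate(window))

sequence = "ACGTGACCATG"
for start in range(len(sequence) - motif_len + 1):
    window = sequence[start:start + motif_len]
    print(start, window, round(score_window(window), 2))

# In practice, windows scoring above a model-specific threshold
# (HOCOMOCO ships precomputed thresholds) are reported as putative TFBS.
```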
Ilya Vorontsov, Ivan Kulakovskiy, Ivan Yevshin, Anastasiia Soboleva, Artem Kasianov, Haitham Ashoor, Wail Ba-Alawi, Vladimir Bajic, Yulia Medvedeva, Fedor Kolpakov and Vsevolod Makeev | |
43 |
The UniProt Knowledgebase (UniProtKB) endeavours to provide the scientific community with a
comprehensive and freely accessible resource of protein sequence and functional information on a large
variety of model organisms to further scientific research. The nematode worm, Caenorhabditis elegans
(C. elegans), is a transparent roundworm, which is approximately 1 mm in length and has a relatively short
life cycle. C. elegans is adopted by thousands of scientists worldwide as a versatile model organism for
studying gene and protein function in complex biological processes. It was the first
multicellular organism to be sequenced and was also the organism in which RNA interference was first
discovered. Such scientific breakthroughs have paved the way for numerous other large scale sequencing
and knockout projects in other multicellular organisms. Nevertheless, C. elegans still remains an
essential model organism and practical genetic tool in which to study gene and protein function. The
UniProtKB Caenorhabditis protein annotation project focuses on the manual annotation of experimentally
characterised C. elegans proteins and also contains entries from other Caenorhabditis species including
briggsae, drosophilae, remanei and vulgaris. To date, there are 3,671 manually curated protein entries
for C. elegans with this number continuously increasing through on-going annotation. Each comprehensive
UniProtKB entry presents primary protein data (including amino acid sequence, gene and protein names and
taxonomic data), amalgamated data from a range of sources (including scientific literature, model
organism databases and sequence analysis tools) as well as biological ontologies, classifications and
cross-references in order to provide an accurate overview. In particular, UniProtKB works closely with
both the nematode worm research community and with WormBase, the database of the biology and genome of
C. elegans and related nematode species, to ensure that UniProtKB presents detailed and current
proteomes, sequences and functional annotation of nematode proteins in a clear, informative and
organised manner in order to facilitate and promote further nematode research.
|
Hema Bye-A-Jee | |
44 |
The wide availability of web-based desktop and mobile applications has tremendously accelerated the pace
of online content generation and consumption. Researchers are producing and consuming more and more
online content than ever, in various forms such as documents, images, audios, videos, software codes,
and many more. There is countless information hidden in multimedia content, though it is often not
discoverable due to the lack of relevant structured and curated annotations.
iCLiKVAL (Interactive Crowdsourced Literature Key-Value Annotation Library) is a web-based application (http://iclikval.riken.jp) that uses the power of crowdsourcing to collect annotation information for all scientific literature and media found online. The iCLiKVAL browser extension is an open-source, easy-to-use tool, which uses the iCLiKVAL API to save free-form, but structured, annotations as 'key-value' pairs with an optional 'relationship' between them. The basic idea is to map the online media to a unique URI (Uniform Resource Identifier) and then to assign semantic value to the media to make information easier to find and allow for much richer data searches and integration with other data sources. The browser extension allows users to bookmark the content or to mark it for 'review later'. It can be used in offline mode, and the data is automatically synchronized when the browser is back online. To use this browser extension, users need to be registered with the iCLiKVAL web application. Currently, it is available for the Google Chrome browser, and it will later be available for other popular cross-platform browsers. |
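To make the key-value-relationship idea concrete, here is a minimal sketch of what one such annotation record could look like; the field names are illustrative and do not reflect the actual iCLiKVAL API schema.

```python
# Illustrative annotation record for a URI-identified media item
# (field names are hypothetical, not the actual iCLiKVAL API schema).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    uri: str                      # unique identifier of the media (DOI, PubMed URL, video URL, ...)
    key: str                      # free-form key, ideally from a shared vocabulary (e.g. MeSH)
    value: str                    # free-form value
    relationship: Optional[str]   # optional relationship between key and value
    author: str                   # registered user who contributed the annotation

example = Annotation(
    uri="https://doi.org/10.1371/journal.pone.0047786",
    key="organism",
    value="Homo sapiens",
    relationship="studied in",
    author="curator42",
)
print(example)
```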
Naveen Kumar and Todd Taylor | |
45 |
Literature curation by model organism databases (MODs) results in the interconnection of papers, genes,
gene functions, and other experimentally supported biological information, and aims to make research
data more discoverable and accessible to the research community. That said, there is a paucity of
quantitative data about how literature curation affects access and reuse of the curated data. One
potential measure of data reuse is the citation rate of the article used in curation. If articles and
their corresponding data are easier to find, then we might expect that curated articles would exhibit
different citation profiles when compared to articles that are not curated. That is, what are the
effects of having scholarly articles curated by MODs on their citation rates? To address this question
we have been comparing the citation behavior of different groups of articles and asking the following
questions: (1) given a collection of 'similar' articles about Arabidopsis, is there a difference in the
citation numbers between articles that have been curated in TAIR and ones that have not, (2) for
articles annotated in TAIR, is there a difference in the citation behavior before vs. after curation,
and (3) is there a difference in citation behavior between Arabidopsis articles added to TAIR's
database and those that are not in TAIR? Our data indicate that curated articles do have a different
citation profile than non-curated articles, which appears to result from increased visibility in TAIR. We
believe data of this type can be used to quantify the impact of literature curation on data reuse and
may also be useful for MODs and funders seeking incentives for community literature curation. This
project is a research partnership between TAIR (The Arabidopsis Information Resource) and Elsevier Labs.
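One simple way to examine question (1) is to compare the citation-count distributions of curated and non-curated article sets with a non-parametric test. The sketch below uses made-up counts and is illustrative only; it is not the TAIR/Elsevier Labs analysis itself.

```python
# Illustrative comparison of citation counts between curated and non-curated
# articles (toy numbers; not the TAIR/Elsevier Labs analysis).
from scipy.stats import mannwhitneyu

curated_citations = [12, 30, 7, 45, 22, 18, 9, 27]    # hypothetical per-article counts
noncurated_citations = [5, 14, 3, 20, 8, 11, 6, 10]   # hypothetical per-article counts

stat, pvalue = mannwhitneyu(curated_citations, noncurated_citations,
                            alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {pvalue:.3f}")
```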
|
Tanya Berardini, Leonore Reiser, Michael Lauruhn and Ron Daniel Jr. | |
46 |
Scientific media comes in a variety of languages and formats, including journal articles, books, images,
videos, blog and database entries, and so on. In the case of textual media, there is often additional
information, such as tables, figures and supplementary data, associated with or embedded in the text.
While there are many good resources for browsing, searching and annotating some of this media, there is
no single place to search them all at once, and generalized search engines such as Google do not allow
for the type of comprehensive and precise searches that researchers require. One could argue that any
scientific media that is on the web is therefore connected, but much of it remains offline (e.g., books)
or is inaccessible (not open source, only found in libraries, etc.) and is therefore neither
discoverable nor connected. To address these issues, we created iCLiKVAL (http://iclikval.riken.jp/), an
easy-to-use web-based tool that uses the power of crowdsourcing to accumulate annotation information for
all scientific media found online (and potentially offline). Annotations in the form of
key-relationship-value tuples (in any language), added by users through a variety of options, make
information easier to find and allow for much richer data searches by basically linking all media
together through common terminology.
Users can create or join common interest groups, both public and private, to annotate related media together as a community. Users can also create and edit their own controlled vocabulary lists, or import established vocabularies such as Medical Subject Headings (MeSH) and Gene Ontology (GO) terms, and then easily select which lists they would like to use for auto-suggest terms in the key, value and relationship fields. Within the user groups, vocabulary and bookmark lists can be shared so that everyone uses the same standards and works towards a common goal. In addition, we have implemented a notification center, several customization options, and a one-stop annotations feature where users can view and edit all of their own annotations. Most of the pages used for tracking progress, such as annotations, bookmarks, search history, reviews and vocabularies, are searchable, sortable and filterable, so users can quickly find what they are looking for. Our goal is the ability to add key-value pairs to any type of scientific media. While iCLiKVAL was initially limited to the annotation of PubMed articles, we recently added the capability to curate media from YouTube, Flickr and SoundCloud. And, to really broaden our scope, anything with a digital object identifier (DOI) can now be annotated using iCLiKVAL, allowing for the inclusion of hundreds of millions of media objects and more. While the interface is very intuitive and easy to use on almost every browser and platform, we have also created a Chrome Browser extension that allows any non-password-protected online media to be bookmarked and annotated, to facilitate the linking of all scientific media. The iCLiKVAL database is completely searchable, and all of the collected data is freely available to registered users via our API. |
Todd Taylor and Naveen Kumar | |
47 |
Intracellular defense against pathogens is a fundamental component of human innate immunity. However,
the numbers of genes defined as being part of innate immunity are largely inconsistent among existing
annotation datasets. Therefore, there is a need for better criteria to identify the subset of genes
acting as intracellular effectors. In this study, we aim to approach this using machine learning
methods. Our primary analysis with a shallow implementation of a classifier of innate immunity genes
suggested that better cross-validation results can be obtained with the use of innate immunity genes
that are covered by more existing annotation datasets as the true-positives, highlighting the importance
of high-quality training data. We have thus launched a crowd-sourcing curation project in order to
define two high-quality training datasets for the machine learning: one consisting of innate immunity
genes, the other of non-immunity-related genes. In total, 2000 genes were randomly selected,
among which about half are putative innate immunity genes covered by public datasets; the more
datasets a gene is covered by, the higher the likelihood that it is included. The other
half are putative non-immunity-related genes that were randomly selected from the remaining human genes,
excluding those that are covered by any existing innate-immunity database, those that are homologous to
these genes and those that are adaptive-immunity-related. For each gene we collected annotations
from public databases, expression changes upon interferon stimulation, Gene Ontology classifications and gene
knockout phenotypes. Based on this information, curators will be asked to vote whether a gene is definitely
innate-immunity-related, unsure, or definitely not. Each curator will be asked to vote randomly on 250
genes out of the 2000; each gene will be voted on by at least 11 curators. Then we will calculate a
consensus for each curation task based on the 'majority' rule and assign genes accordingly to the
'positive' and 'negative' datasets. We will train the machine learning algorithms on the two datasets
with various genetic, genomic and evolutionary features that we have already collected for all human
genes. Through feature selection we should be able to obtain a list of features that are informative in
distinguishing innate-immunity genes from others. In addition, we will apply the resulting model to
other genes to search for putative new innate immunity genes, and submit them for further experimental
validation.
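A minimal sketch of the majority-rule consensus step described above is shown below, assuming each gene's votes are recorded as the three labels offered to curators; the vote data and structure are illustrative.

```python
# Illustrative majority-rule consensus over curator votes
# (vote data are made up; labels mirror the three options described above).
from collections import Counter

votes = {
    "GENE_A": ["innate", "innate", "unsure", "innate", "not", "innate",
               "innate", "unsure", "innate", "innate", "innate"],
    "GENE_B": ["not", "not", "unsure", "not", "not", "not",
               "not", "innate", "not", "not", "not"],
}

def consensus(gene_votes):
    """Return the most frequent label among the (at least 11) votes."""
    label, count = Counter(gene_votes).most_common(1)[0]
    return label

positive, negative = [], []
for gene, gene_votes in votes.items():
    label = consensus(gene_votes)
    if label == "innate":
        positive.append(gene)       # goes into the 'positive' training set
    elif label == "not":
        negative.append(gene)       # goes into the 'negative' training set
    # genes with an 'unsure' consensus are left out of both training sets

print("positive:", positive, "negative:", negative)
```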
|
Weihua Chen, Antonio Rausell, Fellay Jacques, Amalio Telenti and Evgeny Zdobnov | |
48 |
PomBase obtains its highest-quality data by manual curation of the fission yeast literature, which
provides experimentally supported annotations representing gene structure, function, expression and
more. Approximately 5000 papers suitable for manual curation have been published on fission yeast to
date, of which about 2100 have been fully curated.
To supplement the work of its small staff of professional curators, PomBase has developed a community curation model that enables researchers to participate directly in curating data from their own publications. As of April 2015, the fission yeast community has contributed annotations for over 260 publications. Community curation improves the visibility of recent publications, and enables researchers and professional curators to work together to ensure that PomBase presents comprehensive, up-to-date and accurate representation of published results. Furthermore, because PomBase is one of only three databases that provide manual literature curation for fungal species, electronic data transfer of high-confidence fission yeast annotations to other fungal species is an essential source of functional data for the latter. Community contributions to PomBase therefore support research not only within the fission yeast community, but also throughout the broader community studying fungi. |
Antonia Lock, Midori Harris, Kim Rutherford, Valerie Wood and Jurg Bahler | |
49 |
PomBase, the model organism database for fission yeast, makes the comprehensive and detailed
representation of phenotypes one of its key features. We have made considerable progress in developing a
modular ontology, the Fission Yeast Phenotype Ontology (FYPO), for phenotype descriptions, and in making
phenotype annotations for single mutants available. Canto, the PomBase community curation tool, provides
an intuitive interface for curators and community users alike to associate alleles with FYPO terms and
supporting metadata such as evidence, experimental conditions, and annotation extensions that capture
expressivity and penetrance. The PomBase web site displays phenotype annotations on gene pages, and
supports FYPO term searching by name or ID.
We are now extending the PomBase phenotype annotation resources to annotate phenotypes observed in double mutants, triple mutants, etc. The Chado database underlying PomBase supports annotation of specific alleles, singly or in combinations, by associating phenotypes with genotypes which in turn link to their constituent alleles. Extensive additions and adaptations of the Canto phenotype annotation interface enable curators and researchers to capture multi-allele phenotype data and metadata. We invite comments on extending the PomBase gene page display and search options to accommodate and use the new data. |
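The genotype-to-allele linkage described above can be pictured with a small data-model sketch; the class and field names below are illustrative and do not reflect the actual Chado schema used by PomBase.

```python
# Illustrative data model: phenotype annotations attach to genotypes,
# which link to their constituent alleles (names are hypothetical,
# not the actual PomBase/Chado schema).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Allele:
    gene: str
    description: str          # e.g. "deletion" or a specific amino acid change

@dataclass
class Genotype:
    alleles: List[Allele]     # one allele for single mutants, several for double/triple mutants

@dataclass
class PhenotypeAnnotation:
    genotype: Genotype
    fypo_term: str            # Fission Yeast Phenotype Ontology term ID
    evidence: str
    conditions: List[str] = field(default_factory=list)

double_mutant = Genotype(alleles=[Allele("cdc2", "deletion"), Allele("wee1", "deletion")])
annotation = PhenotypeAnnotation(
    genotype=double_mutant,
    fypo_term="FYPO:0000001",   # placeholder term ID
    evidence="microscopy",
    conditions=["32 C"],
)
print(annotation)
```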
Midori Harris, Antonia Lock, Kim Rutherford, Mark Mcdowall and Valerie Wood | |
50 |
RNA interference (RNAi) represents a powerful strategy for the systematic abrogation of gene expression.
High-throughput screening experiments, performed for a wide variety of underlying biological questions,
result in the description of loss-of-function phenotypes across many fields in biology. These phenotypes
represent a major contribution to the annotation of gene function.
The GenomeRNAi database holds information on RNAi phenotypes and reagents, aiming to provide a platform for data mining and screen comparisons. The database is populated by manual data curation from the literature, or via direct submission by data producers. Structured annotation guidelines are applied for curatorial review. Where possible, a controlled vocabulary is defined for given data fields. At present, the database contains 452 experiments in human, and 201 experiments in Drosophila, providing more than 2.5 million individual gene-phenotype associations in total. A recent major contribution has been the Broad Institute's Achilles project with genome-wide shRNA screening data for 216 different cancer cell lines. GenomeRNAi also integrates information on efficiency and specificity for more than 400,000 RNAi reagents, obtained by running a quality assessment pipeline on a regular basis. The GenomeRNAi website (www.genomernai.org) features search functions (by gene, reagent, phenotype or screen details), as well as options for browsing and downloading experiments. Further features are a 'frequent hitter' page, and a functionality for overlaying genes sharing the same phenotype onto known gene networks provided by the String database (www.string-db.org). GenomeRNAi data are well integrated with external resources, providing e.g. mutual links with FlyBase, GeneCards and UniProt. GenomeRNAi functional data have also been incorporated into the FlyMine tool. Given the sharp increase in data volume, we are currently working on visualisation options for a more intuitive, overview-like display of the data. These will be presented at the conference. |
Esther E. Schmidt, Oliver Pelz, Benedikt Rauscher and Michael Boutros | |
51 |
The International Mouse Phenotyping Consortium (IMPC) was founded with the purpose of building a truly
comprehensive, functional catalogue of the mouse genome by means of high-throughput phenotyping and
classification of single-gene knock-out mouse lines, allowing discovery of new mouse models for human
diseases. As of January 2016, over 95,000 specimens belonging to 3,274 KO lines (and controls) had been
phenotyped, and around 1,260,000 experimental procedures were performed. To effectively and usefully
benefit from this wealth of data, new robust statistical analysis methodologies have been developed and
implemented to test for the existence of phenodeviants, and to automatically classify them using the
Mammalian Phenotype ontology developed by Mouse Genome Informatics (MGI) at Jackson Laboratory.
From the onset of the project until the present day, the strategies and mechanisms used in annotating the data have changed and evolved; a mix of manual and automated annotation is used, with the former being performed at the very specialised level and only in certain data contexts – such as anatomical or image observation. To date, we have identified 1461 phenotypes, all annotated according to the IMPC phenotyping pipeline that includes as part of its standardized protocols the ontological outcome(s) to be assigned when a statistically significant outlier strain is detected for a phenotype test. This process, developed with biologists from all the IMPC partner centres, is an important step to ensure that the automatic annotation of phenodeviancy reflects the underlying biological system that is altered. Apart from contributing to maintaining and expanding MGI's MP terms catalogue, the IMPC also employs programmatic solutions to infer relationships between mouse and human phenotype representations, making the most use of resources such as MP, OMIM, Orphanet and PhenoDigm to discover new mouse models of disease and provide insights into disease mechanism. |
Luis Santos, Ann-Marie Mallon and Mouse Phenotyping Informatics Infrastructure | |
52 |
Scientific research is inherently a collaborative task; in our case it is a dialog among different
researchers to reach a shared understanding of the underlying biology. To facilitate this dialog we have
developed two web-based annotation tools: Apollo (http://genomearchitect.org/), a genomic feature
editor, designed to support structural annotation of gene models, and Noctua
(http://noctua.berkeleybop.org/), a biological-process model builder designed for describing the
functional roles of gene products. Here we wish to outline an inventory of essential requirements that,
in our experience, enable an annotation tool to meet the needs of both professional biocurators as well
as other members of the research community. Below are the general requirements, beyond specific
functional requirements, that any annotation tool must satisfy. These include:
- Real time updates to allow geographically dispersed curators to conduct joint efforts;
- Immediate communication between curators through parallel chat mechanisms;
- Rigorously documenting the experimental or inferential basis for all of the annotations that are made with credit assigned through citations;
- Well supported history mechanisms providing the ability to comment on versions, browse versions to see different edits and commentary, and revert to earlier versions;
- Providing different levels of permissions for users and administrators, for example so that a curator might 'doodle' within their own work area before releasing their version for feedback from others;
- Offering incentives for adoption, such as facilitating the publication process;
- Prompt responsiveness to users' requests and informative documentation, and dedicated resources for training and user support, from online seminars to video tutorials to repositories with teaching materials;
- Functional stability and ease of migrating forward when new software is released;
- And, most importantly, a publishing mechanism, such that biocurators and other contributors receive credit for their insights and contributions to our collective understanding of biology. |
Monica C Munoz-Torres, Chris Mungall, Nathan Dunn, Seth Carbon, Heiko Dietze, Nicole Washington, Jeremy Nguyen Xuan, Paul Thomas and Suzanna Lewis | |
53 |
The current incarnation of GigaDB relies upon the paid curators at GigaScience to read, understand and
annotate each entry manually. This gives the best possible annotations, but obviously cannot be
sustained with ever-increasing numbers of submissions. We are investing significant time and effort to
enable more people to provide those annotations, such as authors (through a submission wizard) who
obviously know their data best of all; the reviewers who have an active interest in the articles and
are therefore in a prime position to suggest relevant annotations; as well as the general users - either
casually clicking through or seriously making use of the data. Everyone can add something, if only they
were given the tools to do so.
Here we present the current and already active addition of hypothes.is, a web curation layer, to the GigaDB pages, which allows anyone to add comments and annotations to our pages, either publicly or to make private notes about the data for themselves or their 'group'. In addition, we will show some future initiatives that we are planning to help keep data held in GigaDB up to date with as much metadata as is available, making the data as discoverable and as useful to as many people as possible. |
Christopher Hunter, Xiao Sizhe, Peter Li, Laurie Goodman and Scott Edmunds | |
54 |
The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) hosts, maintains and presents nucleotide
sequence data along with associated sample and experimental metadata. Functional annotation has always
gone hand-in-hand with the storage of traditional sequence records, providing an interpretation of
genetic features together with the sequence itself. Such information enables discoverability of the
data, regardless of whether the annotation is supported by experiment or inference alone. As nucleotide
sequencing has become cheaper and more productive, particularly in the area of whole genome sequencing,
the number of assembled sequences, and therefore functional annotations, has continued to grow at an
unprecedented rate. ENA has been addressing these challenges with developments over the last six years.
These include: (1) an increase in the number of checklists useful for simple and common functional
annotations (such as bacterial genes, rRNA genes and phylogenetic markers); (2) provision of editable
skeleton files (known as 'templates'), useful for the more advanced user in submitting more complex
annotation; (3) support for the submission of viral genomes through the Assembly Pipeline; and (4)
extension of automatic biological rule-based validations at the time of submission. Such ongoing changes
are not only providing a smoother experience and faster turnaround for the user but are also re-shaping
the role of the ENA biocurator towards a more sustainable model of biocuration.
|
Richard Gibson, Blaise Alako, Clara Amid, Ana Cerdeno-Tarraga, Iain Cleland, Neil Goodgame, Petra ten Hoopen, Suran Jayathilaka, Simon Kay, Rasko Leinonen, Xin Liu, Swapna Pallreddy, Nima Pakseresht, Jeena Rajan, Marc Rossello, Nicole Silvester, Dmitriy Smirnov, Ana Luisa Toribio, Daniel Vaughan, Vadim Zalunin and Guy Cochrane | |
55 |
FlyBase has had a successful community curation tool in place for several years
called Fast-Track Your Paper (FTYP). It is a web application that presents users with a series of forms and creates curation records from their input. These records then feed into our curation pipeline with no special processing necessary. The records serve to help triage publications for further curation by FlyBase staff and to expedite the linking of genes to publications in our database. This web application works in conjunction with another system that emails the authors of recently published Drosophila papers requesting that they fill out the forms. We have consistently received an approximately fifty percent response rate to these emails over about five years' time, leading to some seven thousand curation records created by users. We will present more usage statistics and experiences from the deployment of the tool over this time. The technical details of the rewrite of the original tool, released this past year, will also be presented. We will also discuss possible future directions of our community curation efforts at FlyBase. |
Gary Grumbling, Jim Thurmond, Josh Goodman, Thom Kaufman and Flybase Consortium | |
56 |
The National Bioscience Database Center (NBDC; http://biosciencedbc.jp/en/) was founded in 2011 as a core
institution for the integration of life science databases in Japan, taking over its mission from the
Integrated Database Project (2006-2010). To integrate and promote the use of life science databases
scattered among many research institutes, we have conducted the following activities:
- Strategic planning for database research and development
- Enhancement of life science databases
- Sharing research data
- International cooperation

We have also released, expanded and refined the following services:

Integbio Database Catalog (http://integbio.jp/dbcatalog/?lang=en): The catalog includes basic information on the greater part of the life science databases in Japan and on major life science databases around the world, such as URL, database maintenance site, category, organism and operational status. We curate this information according to a consistent curation policy, and we are preparing to release an RDF-formatted version of the catalog.

Life Science Database Cross Search (http://biosciencedbc.jp/dbsearch/index.php?lang=en): Cross Search is a search service across more than 160 databases, including literature and patent publications. The system is composed of many servers hosted in 5 organizations under 4 ministries. To realize distributed search among remote organizations, we adopted the full-text search engine Hyper Estraier, and we have now begun migrating to Elasticsearch in order to cope with the growing index data. Curators play an important role in adding a new database to this system: they investigate the crawling region of each database, predict the URIs of each entry, and classify attribute information. Cross-search curators therefore need backgrounds in both information technology and life science. We constructed a streamlined workflow for this advanced curation, which allows us to add new databases tenfold faster than before. Users can also search 'deep-web' data (which even Google cannot find) from this cross-search system, which can thus serve as an infrastructure for comprehensive database search.

Life Science Database Archive (http://dbarchive.biosciencedbc.jp/index-e.html): The archive collects life science databases scattered among many research institutes, and we stably keep and maintain them over the long term. To help users understand the databases, we provide and curate detailed metadata for each of them. Each metadata record links to information on researchers (ORCID, researchmap), articles (PubMed, J-GLOBAL) and funds (Life science projects in Japan, J-GLOBAL). To promote reuse, databases in the archive are in principle published under Creative Commons Attribution-Share Alike (CC BY-SA).

Integrated Search: Connecting databases organically enables users to search them with complex conditions. To realize this:
- We are RDFizing the Integbio Database Catalog and the databases in the Life Science Database Archive.
- We have released an RDF portal (http://integbio.jp/rdf/) that collects life science data in RDF format.
- We are developing tools for RDF search in collaboration with the Database Center for Life Science (DBCLS; http://dbcls.rois.ac.jp/en/). |
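As a rough illustration of the kind of full-text query that the Cross Search migration to Elasticsearch enables, the sketch below runs a single match query from Python; the index name, field name and search term are hypothetical placeholders, not details of the NBDC system.

```python
# Minimal full-text query sketch; assumes a local Elasticsearch node and a
# hypothetical index "cross_search" with a "description" field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="cross_search",
    body={"query": {"match": {"description": "glycosyltransferase"}}, "size": 10},
)

# Print the score and a (hypothetical) title field for each hit.
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```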
Shigeru Yatsuzuka, Jun-Ichi Onami, Tomoe Nobusada, Atsuko Miyazaki, Hideki Hatanaka and Toshihisa Takagi | |
57 |
IMGT®, the international ImMunoGeneTics information system, http://www.imgt.org, is the global
reference in immunogenetics and immunoinformatics [1]. By managing the extreme diversity and complexity
of the antigen receptors of the adaptive immune response, the immunoglobulins (IG) or antibodies and the
T cell receptors (TR) [2, 3] (2×10^12 different specificities per individual), IMGT® is at the
origin of immunoinformatics, a science at the interface between immunogenetics and bioinformatics [4].
IMGT® is based on the concepts of IMGT-ONTOLOGY [5], and these concepts are used for expert
annotation and standardized knowledge in IMGT/LIGM-DB, the IMGT® database of IG and TR nucleotide
sequences from human and other vertebrate species, and in IMGT/GENE-DB, the IMGT® gene and allele
database. The IMGT/LIGM-DB biocuration pipeline of IG and TR sequences includes IMGT/LIGMotif, for the
analysis of large genomic DNA sequences, and IMGT/Automat, for the automatic annotation of rearranged
cDNA sequences. Analysis results are checked for consistency, both manually and by using IMGT®
tools (IMGT/NtiToVald, IMGT/V-QUEST, IMGT/BLAST, etc.). The annotated sequences are integrated in
IMGT/LIGM-DB and include the sequence identification (IMGT® keywords), the gene and allele
classification (IMGT® nomenclature), the constitutive and specific motif description (IMGT®
labels in capital letters, no plural), and the translation of the coding regions (IMGT® unique
numbering) [4, 5]. For genomic IMGT/LIGM-DB sequences containing either an IG or TR variable (V),
diversity (D) or joining (J) gene in germline configuration or a constant (C) gene, the gene and allele
information is entered in IMGT/GENE-DB. In parallel, the IMGT® Repertoire is updated (Locus
representations, Gene tables and Protein displays (for new genes), Alignments of alleles (for new and/or
confirmatory alleles)) and the IMGT® reference directory [1, 4] is completed (sequences used for
gene and allele comparison and assignment in IMGT® tools (IMGT/V-QUEST, IMGT/HighV-QUEST for next
generation sequencing (NGS), IMGT/DomainGapAlign) and databases (IMGT/2Dstructure-DB,
IMGT/3Dstructure-DB)). An IMGT/GENE-DB entry also provides information on the rearranged cDNA and gDNA
entries (with links to IMGT/LIGM-DB) and on the three-dimensional structures (with links to
IMGT/3Dstructure-DB). IMGT/GENE-DB is the official repository of IG and TR genes and alleles. IMGT®
gene names were approved by HGNC and endorsed by WHO-IUIS, the World Health Organization
(WHO)-International Union of Immunological Societies (IUIS) Nomenclature Subcommittee for IG and TR.
Reciprocal links exist between IMGT/GENE-DB and HGNC, NCBI and Vega. The definition of antibodies
published by the WHO International Nonproprietary Name (INN) Programme is based on the IMGT®
concepts [6], and allows easy retrieval via IMGT/mAb-DB queries [1, 4]. The IMGT® standardized
annotation has made it possible to bridge the gaps, for IG or antibodies and TR, between fundamental and
medical research, veterinary research, repertoire analysis, biotechnology related to antibody
engineering, diagnostics and therapeutic approaches.
[1] Lefranc M-P et al. Nucleic Acids Res 43:413-422 (2015) PMID: 25378316, [2] Lefranc M-P, Lefranc G. The Immunoglobulin FactsBook (2001), [3] Lefranc M-P, Lefranc G. The T cell receptor FactsBook (2001), [4] Lefranc M-P. Front Immunol 5:22 (2014) PMID: 24600447, [5] Giudicelli V, Lefranc M-P. Front Genet 3:79 (2012) PMID: 22654892, [6] Lefranc M-P. mAbs 3(1):1-2 (2011) PMID: 21099347. |
Geraldine Folch, Joumana Jabado-Michaloud, Marine Peralta, Melanie Arrivet, Imene Chentli, Melissa Cambon, Pascal Bento, Souphatta Sasorith, Typhaine Paysan-Lafosse, Patrice Duroux, Veronique Giudicelli, Sofia Kossida and Marie-Paule Lefranc | |
58 |
The FlyBase 'Gene Groups' resource provides sets of Drosophila melanogaster genes that share common
features. These groups are currently restricted to well-defined, easily delimited groupings such as
evolutionarily related gene families (e.g. actins, Wnts), subunits of macromolecular complexes (e.g.
ribosome, SAGA complex), sets of genes whose products share a common molecular function (e.g.
phosphatases, GPCRs, ubiquitin ligases) and gene complexes (e.g. Enhancer of split complex). Gene Groups
are manually curated from the primary literature, ensuring that the groups are of high quality and
fully attributed. FlyBase has integrated the building of this resource with a review of gene annotation
data. First, for each group the membership is checked to ensure that the group is complete and
represents the current research literature and genome annotation. Second, the Gene Ontology (GO)
annotation is reviewed to ensure that groups of genes are annotated with terms that reflect their core
common biology. Third, a review of the gene nomenclature is conducted to improve consistency and reflect
community usage. Gene Group data in FlyBase are presented in the form of Gene Group Reports that include
convenient download and analysis options, together with links to equivalent gene groups at other
databases. This new resource will enable researchers with diverse backgrounds and interests to easily
view and analyse acknowledged sets of fly genes.
|
Helen Attrill and Giulia Antonazzo | |
59 |
Taxonomy encompasses the description, identification, nomenclature and classification of organisms.
Unfortunately, the scientific literature and data repositories are plagued by incorrect taxonomic
assignments, with organism names that are erroneously assigned, ambiguous, outdated, or simply
misspelled; such errors complicate data integration and exploitation. It is therefore crucial to build
and maintain taxonomy databases that provide standardized nomenclature and identifiers, and to promote
their usage in the research and bioinformatics community. There are many taxonomy standardization
efforts that all rely on expert curation. At UniProt we employ the NCBI taxonomy database as our base
for taxonomic descriptions. We systematically review and curate the taxonomic assignment for every
organism which enters UniProtKB/Swiss-Prot and discuss and resolve inconsistencies and errors with the
NCBI taxonomists on a daily basis. This informal collaboration is a significant contribution to
maintaining an important resource that is used by many other bioinformatics databases.
|
Sandrine Pilbout, Teresa Manuela Batista Neto and Nicole Redaschi | |
60 |
UniProt provides the scientific community with a comprehensive, high-quality and freely accessible
resource of protein sequence and functional information. It facilitates scientific discovery by
organising biological knowledge and enabling researchers to rapidly comprehend complex areas of biology.
Information in the UniProt Knowledgebase (UniProtKB) is integrated from a range of sources such as
scientific literature, protein sequence analysis tools, other databases and automatic annotation systems
to provide an integrated overview of available protein knowledge. As the data are derived from multiple
disparate sources, it is important that users can easily trace the origin of all information. To achieve
this, UniProt makes use of a subset of evidence codes from the Evidence Ontology to indicate data
origin. This system allows users to trace the source of all information and to differentiate easily
between experimental and computationally-derived data. An overview of the evidence codes used, how
these are displayed to users and how they can be used to retrieve specific categories of data will be
presented. All data are freely available from www.uniprot.org.
|
Michele Magrane, Cecilia Arighi, Sylvain Poux, Nicole Redaschi, Maria Martin, Claire O'Donovan and Uniprot Consortium | |
61 |
The National Center for Biotechnology Information (NCBI) has developed a robust prokaryotic genome
annotation pipeline which is offered as a service to GenBank submitters and is used to annotate RefSeq
prokaryotic genomes. NCBI's annotation pipeline integrates annotation quality standards which were
developed in part through a series of microbial assembly and annotation workshops held by NCBI. These
workshops defined annotation quality criteria for complete genomes, standards for reporting support
evidence, and protein naming protocols which are used by UniProt, GenBank, and RefSeq for eukaryotic and
prokaryotic genomes. In 2015 the RefSeq project implemented a number of changes impacting the
prokaryotic genomes data set. These include: a) development of a new framework for evidence based
protein naming, name curation, and evidence tracking; b) increased collaboration to improve protein
names; c) completion of the transition to a new data model for managing prokaryotic protein sequences;
d) expanded assembly and annotation quality testing; e) increased capacity in NCBI's prokaryotic genome
annotation pipeline; and f) comprehensive reannotation of all RefSeq prokaryotic genomes. These
developments expand on the established annotation QA, evidence, and name standards. As a result, RefSeq
provides significant annotation consistency which facilitates comparative genomics. Our more rigorous QA
criteria resulted in the suppression of over two thousand RefSeq prokaryotic genomes that do not meet
these criteria, thus ensuring continued high quality of the RefSeq prokaryotic genomes dataset. Based on
our new infrastructure to support protein name curation, we have established evidence-based and tracked
names for approximately 25% of the RefSeq proteins thus far. Curation of protein names and name evidence
uses a multi-faceted approach that leverages Hidden Markov Models (HMMs), domain architectures,
available curated name data from Swiss-Prot and the Enzyme Commission, collaboration, and curation by
NCBI staff. We use HMMs from several sources, including TIGRFAMs and Pfam, and are creating new HMMs
(NCBIfams) when further refinement is needed. The RefSeq group collaborates with NCBI's Conserved
Domains Database (CDD) group to provide functional names based on reviewed domain architectures. We
collaborate with individual scientists and expert databases to provide the best names for some classes
of proteins. NCBI staff curate name information both at the level of the support evidence (HMMs), and at
the level of individual RefSeq protein names. Database tables support tracking the name update history,
the biocurator or collaborator source of the update, and all supporting evidence for the name at the time
of the update. We plan to start reporting functional evidence information on RefSeq protein records in
2016. The presentation will summarize the current RefSeq prokaryotic genomes process flows for genome
annotation and curation/collaboration, our current assembly and annotation quality criteria, examples of
annotation improvements resulting from collaboration and NCBI staff curation, and a proposal for
reporting sets of functional evidence in the context of NCBI sequence displays.
|
Kim Pruitt, Stacy Ciufo, Michael Dicuccio, Daniel Haft, Wenjun Li and Kathleen O'Neill | |
62 |
The Encyclopedia of DNA elements (ENCODE) project, currently in its 10th year of production-scale operation, is a
collaborative effort toward cataloging genomic annotations. The research institutes within the
consortium have produced data from more than 5,000 experiments using a variety of techniques to study
the structure, regulation, and transcription profiles of human and mouse genomes. Furthermore, the
ENCODE site (https://www.encodeproject.org/) has incorporated data generated through other projects
involving genomic assays of fly and worm genomes. All of the data displayed on the ENCODE portal first
passes through the Data Coordination Center (DCC) for basic validation and metadata standardization. As
the amount of data that goes through the DCC continues to grow exponentially, it is necessary to
increase the attention and effort given to the curation and organization of metadata. At the ENCODE DCC,
we have made extensive use of a variety of tools to aid in capturing and integrating experimental details.
The ENCODE DCC's active contribution to ontology databases and our use of ontologies as a means to
standardize metadata allows for the ease of identifying and comparing related experiments. Additionally,
the ENCODE project is structured such that the DCC interacts with production centers from the proposal
of the project all the way through data submission. This results in constant and efficient metadata
modeling, as well as high quality data even before publication. Here, we discuss the strategies employed
by the ENCODE DCC to maximize accessibility to epigenomic data and analysis.
|
Jason Hilton, Cricket Sloan, Ben Hitz, Esther Chan, Jean Davidson, Idan Gabdank, J Seth Strattan and J Michael Cherry | |
63 |
The National Center for Biotechnology Information (NCBI) hosts two resources BioProject and BioSample,
which facilitate the capture of structured metadata for diverse biological research projects and samples
represented in NCBI's archival databases. BioProject (http://www.ncbi.nlm.nih.gov/bioproject/) is an
organizational framework to access information about a research initiative. Links to data submitted to
NCBI and other International Nucleotide Sequence Database Collaboration (INSDC) archives are aggregated
in BioProject for easy access to the submitted data. Publications, grant information and descriptive
information are also captured. BioProjects can exist in a hierarchical structure to represent studies
that are part of a larger research initiative. The BioSample (http://www.ncbi.nlm.nih.gov/biosample/)
database stores descriptive information or metadata about the biological materials used in studies for
which data is submitted to NCBI and other INSDC archives. BioSample packages represent broad categories
of sample types and help guide submitters to provide the appropriate descriptive attributes. Rich sample
descriptions are important for data users to fully understand the context of the experiments and allow
them to more fully interpret the outcomes. As with BioProject, links to submitted data are presented in
BioSample for access to all of the data generated from a particular sample. BioProjects and BioSamples
are created as part of the data submission process or linkages can be asserted to previously registered
records. Both of these resources provide users with an entry point to the submitted data stored in the
archives and allow them to retrieve data via queries on specific attributes of interest to
their research.
|
Karen Clark, Tanya Barrett and Ilene Mizrachi | |
64 |
Classification algorithms are intensively used in discovering new information in large datasets. In this
work several classification algorithms are applied to a dataset of prokaryotic organisms. A comparative
analysis of the algorithms is performed based on the variations of the data types, dataset dimensions
and presence/absence of the attributes. The analysis indicates which of the considered classification
models is most suitable for this dataset. Results obtained in this analysis can be useful in further
research devoted to grouping the considered organisms.
|
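The abstract does not name the classifiers or attributes used; as a generic sketch of this kind of comparative analysis, the snippet below cross-validates several scikit-learn models on a synthetic feature matrix standing in for the prokaryotic attribute data.

```python
# Generic comparison of classification algorithms by 5-fold cross-validation.
# X and y are synthetic placeholders for the prokaryotic attribute data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "SVM (RBF kernel)": SVC(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # accuracy per fold
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```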
Milana Grbic | |
65 |
With the huge body of information already published and the rapid increase in the number of papers
indexed in PubMed every year, it is becoming increasingly important to be able to search in a systematic
way through biological data. Compounding this issue is the fact that most data are published in the form
of figures. In figure format, the experimental data, which provide the evidence for scientific claims,
are not machine-readable and, therefore, are neither re-usable nor discoverable. In this way valuable
data might be lost or unnecessarily repeated.
To address these issues, we have initiated the SourceData project (http://sourcedata.embo.org) with the aim to provide a scalable, structured way to annotate and search through data resulting from hypothesis-driven research in cell and molecular biology. To this end, we have developed a biocuration interface for data editors, to extract and integrate machine-readable metadata, which can be added to figures and their underlying source data. This process can be integrated into the publication workflow and hence does not require authors to alter their submission process. Taking advantage of the information provided by authors in figure legends, data editors identify biological entities in experiments and specify their role in the experimental design. Furthermore these entities are disambiguated, by linking them to entries in existing biological databases. Once a paper is curated by data editors, a secondary validation interface is presented to authors so that they can review the process and ensure the result is an accurate reflection of their data. The resulting semantic information is used to build a searchable 'scientific knowledge graph' that is objectively grounded on published data and not on the potentially subjective interpretation of the results. The resulting SourceData search platform and SmartFigure viewer enable targeted searches of scientific papers based on their data content, thus complementing current keyword-based search strategies. By enhancing the discoverability of research data, SourceData aims at stimulating new discoveries by helping scientists to find, compare and combine data over multiple studies. |
Robin Liechti, Lou Goetz, Sara El-Gebali, Nancy George, Isaac Crespo, Ioannis Xenarios and Thomas Lemberger | |
66 |
The task of structured output learning is to learn a function that enables prediction of complex objects
such as sequences, trees or graphs for a given input. One of the problems where such methods can be
applied is protein function prediction. With the growing number of newly discovered proteins and the
slow and expensive experimental methods for their functional annotation, the need for fast and accurate
tools for protein function prediction has risen in the past several years. Reliable information on
protein function is especially important in the context of human diseases, since many of them can occur
due to alteration of function upon mutation. In protein function prediction, the aim is to find one or
more functions that a protein performs in a cell according to characteristics such as its primary
sequence, phylogenetic information, protein-protein interactions, etc. The space of all known protein
functions is defined by a directed acyclic graph known as the Gene Ontology (GO), where each node
represents one function and each edge encodes a relationship such as is-a, part-of, etc. Each output, on
the other hand, represents a subgraph of the GO, consistent in the sense that it contains a protein's
functions propagated to the root of the ontology.
In this research, we developed a structured output predictor that determines protein function according to the histogram of 4-grams that appear in the protein's sequence. The predictor is based on the machine learning method of structural support vector machines (SSVM), which represents a generalization of the well-known SVM optimizers to structured outputs. Adjusting SSVM to this specific problem required the development of an optimization algorithm that maximizes an objective function over the vast set of all possible consistent subgraphs of protein functional terms, as well as a careful choice of loss functions. To investigate the influence of the organism that a protein originates from on the quality of protein function prediction, we constructed 5 prediction models trained on proteins of single organisms (human, rat, mouse, E. coli and A. thaliana) and cross-tested each model on proteins from each other organism. The results obtained are comparable with the last CAFA (Critical Assessment of Function Annotation) competition results: for rat, mouse and A. thaliana the results are in the top 15%. As expected, the best results for an organism are obtained by the model trained on proteins of the organism itself, except for mouse and rat, for which the model trained on human proteins performed better. The results suggest a dependence of the developed predictor on the volume and quality of training data, and confirm the protein function similarity of evolutionarily close organisms. |
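A minimal sketch of the 4-gram featurization step described above; the exact alphabet handling and normalization used by the SSVM predictor may differ.

```python
# Overlapping 4-gram histogram of a protein sequence. With the 20 standard
# amino acids there are 20**4 = 160,000 possible 4-grams, so a sparse
# counter is a natural representation.
from collections import Counter

def four_gram_histogram(sequence: str) -> Counter:
    """Count overlapping 4-grams in a protein sequence."""
    return Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))

# Toy example on a short, made-up sequence fragment.
print(four_gram_histogram("MKTAYIAKQR"))
```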
Jovana Kovacevic and Predrag Radivojac | |
67 |
Introduction: Biological databases serve an important need by storing, organizing and presenting
research data to the research community, as the sheer amount of data published makes it essentially
impossible for an expert to have full knowledge of the corpus pertinent to her/his research field. The rapid
increase in the number of published articles also poses a challenge for curated databases to remain
up-to-date. To help the scientific community and database curators to deal with this issue, we have
developed an application - neXtA5 - which prioritizes the literature for specific curation requirements.
Methods: Our system, neXtA5, is composed of two main elements. The first is a text-mining module that annotates MEDLINE weekly along 5 axes: diseases, the three aspects of the Gene Ontology (GO), and protein interactions. It then stores its findings in our local database, BioMed. Additional entities such as species or chemical compounds are also extracted and displayed to further facilitate the work of curators. The second element is an Information Retrieval component, which uses the density of entities in abstracts to prioritize the publications. The ranking function is performed independently on the five different annotation axes.

Results: Based on the Text REtrieval Conference evaluation model, we increased precision from 0.28 to 0.42 for the disease axis, and from 0.36 to 0.44 for the GO Biological Process axis. We are currently optimizing parameters to improve the Molecular Function, Cellular Component and protein-protein interaction axes.

Conclusion: Our application aims to improve the efficiency of annotators by providing a semi-automated curation workflow. A user-friendly interface powered by a set of JSON web services is currently being integrated into the neXtProt annotation pipeline. Available at: http://babar.unige.ch:8082/neXtA5 |
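A toy sketch of the density-based prioritization idea: abstracts with a higher density of recognized entities for a given axis are ranked first. The dictionary lookup and example terms below are illustrative stand-ins for the actual neXtA5 text-mining module.

```python
# Rank abstracts by the density of recognized entities for one axis.
# Entity recognition here is a toy dictionary lookup.
def entity_density(abstract: str, entities: set) -> float:
    tokens = abstract.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.strip(".,;()") in entities)
    return hits / len(tokens)

disease_terms = {"glioma", "melanoma", "carcinoma"}  # illustrative only

abstracts = {
    "PMID:1": "We report a new glioma line derived from a carcinoma model.",
    "PMID:2": "The protein localizes to the mitochondrial membrane.",
}

ranked = sorted(abstracts,
                key=lambda pmid: entity_density(abstracts[pmid], disease_terms),
                reverse=True)
print(ranked)  # abstracts richest in disease entities come first
```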
Luc Mottin, Julien Gobeill, Emilie Pasche, Pierre-Andre Michel, Isabelle Cusin, Pascale Gaudet and Patrick Ruch | |
68 |
Duplication is a key data quality problem in many domains. It is related to redundancy (near-identical
instances), incompleteness (fragmentary records), and inconsistency (records with contradictory
information) – issues that undermine the efficiency of data retrieval and the accuracy of answers
to queries. It is problematic in bioinformatics databases, where data volumes are rapidly increasing.
The high data volume makes fully manual approaches, recognized as the most precise way to determine
whether a pair is a duplicate, infeasible. Automatic approaches may be feasible, but methods to date have
examined only limited types of duplicates under simple assumptions. A further fundamental issue is that
the definition of 'duplicate' is context dependent. A pair considered as duplicates by one community, or
for one task, may not necessarily be so in another context; a duplicate detection method that achieves
high performance in a dataset gathered under restrictive assumptions may not necessarily be effective on
another dataset.
We have collected records that can be regarded as duplicates under a range of assumptions, and created related benchmarks. We built three DNA sequence database benchmarks, based on information drawn from a range of resources, including information derived by mapping between databases. Each benchmark has distinct characteristics. We quantitatively measure these characteristics and argue for their complementary value in evaluation of duplication detection techniques. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs. They are also the first benchmarks targeting the primary nucleotide databases. The records cover the 21 most studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. For example, the benchmark derived from Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of tremendous value for development and evaluation of duplicate detection methods. They also represent the diversity and complexity of duplication in biological databases. |
Qingyu Chen, Justin Zobel and Karin Verspoor | |
69 |
Ever-increasing scientific literature enhances our understanding of how toxicants impact biological
systems. In order to utilize this information in the growing field of systems biology and toxicology,
the published knowledge must be transformed into a structured format highly efficient for modelling,
reasoning, and ultimately high throughput data analysis and interpretation. Consequently, there is an
increasing demand from systems biologists and toxicologists to access such knowledge in a computable
format, in this case biological network models.
The Biological Expression Language (BEL) is a machine- and human-readable language that represents molecular relationships and events as semantic triples: subject–relationship–object. These triples are called BEL statements. BEL statements, together with their supporting evidence, are encapsulated in a BEL document. The BEL document also captures additional information such as article references, the PMID of the scientific article, and a large annotation set that accurately defines the context of the knowledge, such as the organism, tissue and disease state. BEL statements can be computationally assembled into biological network models. To facilitate encoding and curation of biological findings in BEL, a BEL Information Extraction workFlow (BELIEF) was developed. BELIEF contains a text mining pipeline for the automatic generation of BEL-compliant knowledge statements and a web-based curation interface, the BELIEF Dashboard, that facilitates manual curation of the automatically generated BEL statements. The text mining pipeline is UIMA-based and accommodates several named entity recognition processes and relationship extraction methods in order to detect concepts and BEL relationships from any text resource. The BELIEF Dashboard reuses the output of the BELIEF pipeline to create a web-based interface and facilitate the manual curation task. Although BEL itself enables curation given the human-readability of its syntax, BELIEF simplifies the curation process by highlighting named entities and disambiguating gene and protein names. In addition to the new features of BELIEF, we also present competitive performance results based on the BioCreative V BEL track evaluation. The BELIEF pipeline automatically extracts normalized concepts with a best F-score of 76.8%. The detection of full relationships and of entirely correct statements was achieved with F-scores of 43.1% (second best) and 30.8% (highest), respectively. Participation in the Interactive task (IAT) track of BioCreative V yielded a System Usability Scale (SUS) score of 67. Given the complexity of the task for untrained users, this score indicates high usability for BELIEF. In conclusion, BELIEF simplifies the curation process and facilitates the construction of biological network models that can be fully contextualized and used for the interpretation of systems biology and systems toxicology data. This workflow is currently being further developed for use in an industrial setup for product risk assessment. |
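To make the triple structure concrete, here is a minimal representation of a BEL statement with its evidence; the example statement, namespaces and PMID are illustrative and not taken from the BELIEF corpus.

```python
# Minimal subject-relationship-object representation of a BEL statement,
# together with its evidence sentence; all content below is illustrative.
from dataclasses import dataclass

@dataclass
class BELStatement:
    subject: str    # e.g. a protein abundance term
    relation: str   # e.g. "increases" or "decreases"
    obj: str        # downstream entity or biological process
    evidence: str   # sentence supporting the statement
    citation: str   # PMID of the source article

stmt = BELStatement(
    subject='p(HGNC:TNF)',
    relation='increases',
    obj='bp(GO:"inflammatory response")',
    evidence="TNF treatment enhanced the inflammatory response in ...",
    citation="PMID:00000000",
)
print(stmt.subject, stmt.relation, stmt.obj)
```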
Sam Ansari, Justyna Szostak, Sumit Madan, Sven Hodapp, Philipp Senger, Marja Talikka, Juliane Fluck and Julia Hoeng | |
70 |
We used machine learning in order to put to the test the functional annotation consistency of the five
most curated species in the Gene Ontology Annotation (GOA) database: Homo sapiens, Saccharomyces
cerevisiae S288c, Rattus norvegicus, Drosophila melanogaster and Mus musculus. The studied task was to
automatically infer a list of Gene Ontology (GO) concepts from some MEDLINE citations. Machine learning
systems aim to exploit a knowledge base containing already annotated contents (gene products, GO
concepts, PMIDs), in order to propose a ranked list of GO concepts for any not yet curated input PMID.
In other words, such systems learn from the knowledge base in order to reproduce the curators' behavior.
In this study, we used the GOCat system, our local GO classifier that implements a k-Nearest Neighbors
algorithm. For the design of the knowledge base, we used GOA in order to collect the MEDLINE citations
associated with a gene and a set of GO concepts; we thus populated the knowledge base with an equal
amount of 9,000 MEDLINE abstracts for each species, along with their associated GO concepts. 1,000
supplementary abstracts (200 for each species) were used to build the test set: their associated GO
concepts were hidden, and GOCat had to recover them. As the knowledge base and the test set come from
the same source, the ability to assign a GO concept should directly depend on the consistency of the
annotations. GOCat outputs a list of generated GO concepts ranked by confidence scores. The performances
of GOCat are measured using two standard metrics. Top Precision (P0) is the fraction of the proposed GO
concepts that are correct in the top of the GOCat ranking. Recall at 20 (R20) is the fraction of the GO
concepts that were in GOA and that were successfully proposed by GOCat in the first 20 ranks. The
baseline results for the whole test set (1,000 abstracts) are 0.45 for P0 and 0.57 for R20. Yet,
significant differences are observed across species. Homo sapiens and Saccharomyces cerevisiae S288c
abstracts obtain better results than the baseline: respectively 0.52 for P0 (+16%) and 0.60 for R20
(+5%), and 0.45 for P0 (equal) and 0.67 for R20 (+18%). In contrast, Rattus norvegicus and Mus musculus
obtain results lower than the baseline: respectively 0.38 for P0 (-16%) and 0.52 for R20 (-9%), and 0.42
for P0 (-7%) and 0.47 for R20 (-18%). Drosophila melanogaster performances are similar to the baseline.
These results show that the Homo sapiens functional annotation in GOA is more consistent than the
Rattus norvegicus and Mus musculus ones.
|
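The two metrics can be sketched as follows; the definitions are paraphrased from the abstract (averaging the per-abstract values over the test set gives P0 and R20), and the exact GOCat implementation may differ.

```python
# Per-abstract versions of the two metrics; averaging top_precision over the
# test set gives P0, averaging recall_at_20 gives R20.
def top_precision(ranked_concepts, gold_concepts):
    """1.0 if the top-ranked GO concept is in the gold annotation, else 0.0."""
    return 1.0 if ranked_concepts and ranked_concepts[0] in gold_concepts else 0.0

def recall_at_20(ranked_concepts, gold_concepts):
    """Fraction of gold GO concepts recovered within the first 20 ranks."""
    if not gold_concepts:
        return 0.0
    return len(set(ranked_concepts[:20]) & set(gold_concepts)) / len(gold_concepts)

# Toy example with made-up GO identifiers.
predicted = ["GO:0006915", "GO:0008283", "GO:0016049"]
gold = {"GO:0006915", "GO:0007049"}
print(top_precision(predicted, gold), recall_at_20(predicted, gold))
```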
Julien Gobeill, Luc Mottin, Emilie Pasche and Patrick Ruch | |
71 |
Text2LOD: building high-quality linked open annotation data concerning biological interests
Biological features and metadata of organisms are mainly described in the literature, and it is therefore difficult for computational methods to analyze and understand a plethora of genomic data in terms of biological aspects. Many efforts have been made to extract biological knowledge from the literature by using text mining technologies and by developing domain-specific dictionaries and ontologies. However, knowledge of some biologically interesting aspects, such as the environments or places where each organism grows and lives, has not been fully extracted and stored in structured formats. We are therefore developing a system to automatically extract such information from the full texts of papers that describe the genome sequence of an organism. Currently, we are building gold standard data sets focusing on several biologically interesting aspects, namely habitat environments, sampling places, cell sizes, growth temperature and pH of (targeted) organisms/microbes/microbial species. Three domain experts are annotating papers obtained from the PMC Open Access subset. The total number of annotated papers is 2,627 as of this writing, and the total number of annotations is 3,718. The most annotated aspect is living environments, with 1,395 papers and 1,517 annotations, respectively. While we continue to annotate papers, we are also developing the extraction system. We employ a supervised machine learning approach and template-based extraction methods, depending on the aspect. Our goal is to provide such datasets as Linked Open Data that can be accessed easily by both humans and computers without registration. The Database Center for Life Science (DBCLS) provides a platform called TogoGenome where multiple biological aspects of organisms can easily be searched and browsed. The dataset for each aspect uses the Resource Description Framework and can be accessed by SPARQL in addition to the TogoGenome site. We therefore plan to provide our datasets through TogoGenome. We discuss what we have learnt and future work. |
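Once such aspects are exposed as RDF, they can be queried with SPARQL; the sketch below uses a toy in-memory graph with a placeholder namespace and predicate, not the actual TogoGenome/DBCLS vocabulary.

```python
# Toy RDF graph of organism habitat annotations queried with SPARQL.
# Namespace, predicate and triples are placeholders.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/ontology/")
g = Graph()
g.add((URIRef("http://example.org/organism/1"), EX.livesIn, Literal("hot spring")))
g.add((URIRef("http://example.org/organism/2"), EX.livesIn, Literal("deep-sea sediment")))

results = g.query("""
    PREFIX ex: <http://example.org/ontology/>
    SELECT ?organism ?habitat WHERE { ?organism ex:livesIn ?habitat . }
""")
for organism, habitat in results:
    print(organism, habitat)
```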
Yasunori Yamamoto, Shinobu Okamoto, Shuichi Kawashima, Toshiaki Katayama, Yuka Nakahira-Yanaka, Hiroko Maita and Sumiko Yamamoto | |
72 |
With the ever-increasing number of published scientific literature, the necessity of automatic
extraction of biological knowledge to build biological networks has become indispensable. In order to
approach the task of automatic extraction and network generation, the improvement of already available
methods as well as the development of new methods and systems is crucial. Track 4 at BioCreative V
offered a challenge to develop, adapt, and evaluate text mining systems using material of manually
curated biological networks, represented in Biological Expression Language (BEL). BEL is a standard
knowledge representation language for systems biology that allows molecular and cellular relationships
to be expressed in the form of nanopublications, also named BEL statements. The language has been specifically
designed to be both human-readable and machine-computable.
BioCreative V track 4 included two specific and independent tasks, evaluating two complementary aspects of the problem. Task 1 evaluated text mining systems that are capable of automated BEL statement construction from a given text snippet. For task 2, the systems needed to suggest 10 additional text snippets for a given BEL statement. A BEL statement consists of multiple biological terms, functions, and relations. For task 1, our chosen evaluation methodology considered these fragments of information expressed by BEL statements and evaluated the systems at each of these structural levels. The aim of this evaluation strategy was to help identify the key features and characteristics of each system. For the evaluation of task 2, the text snippets for a given BEL statement were manually assessed by experts on three different levels of increasing strictness. To perform the tasks, participating systems needed to be capable of high-quality recognition of biological terms, their normalization to database entries, the extraction of the relationships between terms, and the transformation of all this information into BEL syntax. The systems used state-of-the-art methods based on dictionary-lookup, rules derived from expert knowledge and advanced machine learning to perform well in the task. At term level, the best systems scored around 69%. For relation extraction, up to 72.7% F-score was reached. In contrast, the results on extracting protein function terms were relatively poor, around 30% F-score. False or missing function assignments were also one of the main reasons for the low score (18.2%) of full BEL statement extraction. The performance increased significantly to 25.6% when gold standard entities were provided by the organizers. For task 2, F-scores between 39.2% and 61.5% were reached depending on the strictness of the applied criterion. In summary, track 4 at BioCreative V showed that manually curated BEL networks can be used as training data to develop new text mining methods and systems. The training and test data as well as the evaluation environment is available for further development of these systems, and future extensions of the annotated data are planned. The resulting systems can hopefully be deployed to assist biocuration for network generation in the area of systems biology. |
Juliane Fluck, Sumit Madan, Tilia Ellendorff, Theo Mevissen, Simon Clematide, Adrian van der Lek and Fabio Rinaldi | |
73 |
BioCreative is a community-wide effort for evaluating text mining systems applied to the biological
domain. BioCreative has been running since 2004, providing the premier evaluation forum for this domain.
Previous editions of BioCreative have included tasks such as recognition of gene mentions and their
normalization to database identifiers, identification of protein-protein interactions, function
annotation from text using GO terms, and extraction of chemicals, drugs and diseases and relations among
them. BioCreative has spearheaded several innovations in the field, including the development of
corpora, evaluation methodologies and interoperability standards (BioC).
In addition to addressing suggestions from the biocuration community, BioCreative V included several novel tasks. The following specific tasks were evaluated:
- Track 1 (BioC track - Collaborative Biocurator Assistant Track): Interoperability of components assembled for a curation task. Developers were invited to provide complementary text mining modules that could be seamlessly integrated into a system capable of assisting BioGRID curators. The simple BioC format ensured interoperability of the different components.
- Track 2 (CHEMDNER-patents track): Processing of chemical entities in patent data, a resource type currently underrepresented in public annotation databases.
- Track 3 (CDR track): Extraction of chemical-disease relations from the literature, using the Comparative Toxicogenomics Database (CTD) as a potential curation target.
- Track 4 (BEL track): Extraction of fragments of pathway networks, in particular causal networks, in a formal language known as Biological Expression Language (BEL), and the extraction of evidence sentences from the literature for given BEL statements.
- Track 5 (IAT track): The curator-centric evaluation of interactive web-based text mining technologies.
These tasks have resulted in valuable resources for developing and evaluating biomedical text mining systems. Overall, 53 unique research teams participated across the tracks, representing more than 120 researchers, with some researchers taking part in multiple tracks (CDR: 24 teams, CHEMDNER-patents: 22 teams, BioC: 9 teams, IAT: 6 teams, BEL: 5 teams). The BioCreative V evaluation workshop (http://www.biocreative2015.org, 73 registered participants) provided a forum to discuss the results of each track and the participating text-mining systems. Additionally, there were three panel sessions on (a) text mining for literature curation, (b) crowdsourcing and curation and (c) disease annotation and medical literature. These sessions provided a forum to discuss current trends and limitations as well as future directions for biomedical text mining related to these research topics. The 63 workshop proceedings papers of BioCreative V, providing descriptions of the evaluation and the participating systems, are available at http://www.biocreative.org/resources/biocreative-v/proceedings-biocreative5; a special issue of the journal Database is in preparation. |
Cecilia Arighi, Kevin Cohen, Donald C. Comeau, Rezarta Islamaj Dogan, Juliane Fluck, Lynette Hirschman, Sun Kim, Martin Krallinger, Zhiyong Lu, Fabio Rinaldi, Alfonso Valencia, Thomas Wiegers, W. John Wilbur and Cathy Wu | |
74 |
The purpose of the BioC track in BioCreative V was to create a set of complementary modules that could
be seamlessly integrated into a system capable of assisting BioGRID curators. Specifically, the
resulting interactive system triaged sentences from full text articles in order to identify text
passages reporting mentions and experimental methods for protein-protein and genetic interactions. These
sentences were then highlighted in the curation annotation suite. The task required the identification
of passages or sentences describing genes/proteins/species involved in the interaction and mentions
and/or experimental methods for molecular interactions. Nine teams from all over the world developed one
or more modules independently, integrated via BioC, to ensure the interoperability of the different
systems.
The collaborative curation tool task provided several important achievements:
1. A fully operational system, achieved in three months of on-line collaboration between the teams.
2. Interoperability. Data was received, produced and exchanged in the BioC format. This simple format avoided many interoperability hurdles.
3. An easy-to-use system. The four participating curators gave positive feedback regarding the user-friendliness and the curation tool in general.
4. Annotated data. A corpus of 120 full-text articles is available for the community, containing curation-relevant annotations for mentions and experimental method evidence for protein-protein and genetic interactions.
Text mining based tools provide valid support to biocuration only if they are easy to learn and user-friendly. Reaching this goal requires the adoption of new evaluation metrics. The extended, direct interaction between text miners and curators permitted the identification of new questions, challenges, and opportunities for using text mining in manual annotation pipelines. |
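For illustration, a simplified reader for a BioC-style XML collection is sketched below using only the standard library; real BioC files carry more structure (infons, annotations, relations) and are usually handled with a dedicated BioC library.

```python
# Simplified reader for a BioC-style XML collection using the standard library.
import xml.etree.ElementTree as ET

def read_bioc_passages(path):
    """Yield (document id, passage offset, passage text) triples."""
    root = ET.parse(path).getroot()          # <collection> element
    for doc in root.findall("document"):
        doc_id = doc.findtext("id")
        for passage in doc.findall("passage"):
            offset = int(passage.findtext("offset", default="0"))
            text = passage.findtext("text", default="")
            yield doc_id, offset, text

# Example usage (the path is a placeholder):
# for doc_id, offset, text in read_bioc_passages("collection.xml"):
#     print(doc_id, offset, text[:60])
```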
Rezarta Islamaj Dogan, Sun Kim, Andrew Chatr-Aryamontri, W. John Wilbur and Donald C. Comeau | |
75 |
Life science databases play a crucial role in the organization of
scientific knowledge in the life science domain. The vastness and complexity of the life sciences require the presence of 'knowledge brokers' who identify, organize and structure information derived from experimental results and extracted from the literature. This role is played by database curators, who act as intermediaries between the producers of the knowledge (experimental scientists) and its consumers. This important role requires highly skilled individuals who have the biological expertise needed to recognize the crucial information that has to be inserted in a specific database. Given the complexity of the task and the specific competences required, curators cannot be fully replaced by automated systems if the aim is to obtain the same quality of results. Nevertheless, it is becoming increasingly clear that this traditional approach cannot possibly cope with the deluge of new information being created by experimental scientists. PubMed, the reference repository of biomedical literature, at present contains more than 25 million bibliographical entries, and it grows at a rate of about two publications per minute. Considering that only a very small subset of the information contained in a paper is actually required by a specific database, it appears to be a waste of time and resources that curators often have to read full papers in order to find the items that they need. We propose automated strategies that help curators quickly locate that crucial information, and provide tools that support them in transposing this information from the paper to the database. Through a combination of text mining and a user-friendly interface, based on text filters and partially pre-filled forms, curators can markedly enhance their efficiency, giving them the opportunity to process a larger number of documents without losing the quality of the traditional manual curation approach. The ultimate goal is to obtain high-throughput curation with the same quality as manual curation or, when that level of quality cannot be reached, at least to provide a quantifiable measure of the difference in quality. The work presented here is part of an NIH-sponsored collaborative project aimed at improving the curation process of the RegulonDB database. RegulonDB is the primary database on transcriptional regulation in Escherichia coli K-12, containing knowledge manually curated from original scientific publications, complemented with high-throughput datasets and comprehensive computational predictions. In this paper we describe the integration of text mining technologies in the curation pipeline of the RegulonDB database, and discuss how the process can enhance the productivity of the curators. Among the innovations that we propose, we describe in particular:
- the integration of "text filters" in the curation interface, enabling curators to focus on the parts of the text most likely to yield the information that they are looking for;
- partially self-filling forms, compiled by the system using information from the paper, leaving to the curator the decision whether to accept or modify them;
- a novel semantic linking capability, which enables curators to explore related information in other papers. |
Fabio Rinaldi, Socorro Gama, Hilda Solano Lira, Alejandra Lopez-Fuentes, Oscar Lithgow and Julio Collado-Vides | |
76 |
The retrieval of biomedical literature is a critical task for scientific researchers and health care
practitioners. Open scientific literature databases contain a massive amount of data, which is
extensively used to support various research activities in the life sciences. Many research efforts have
been made towards improving the retrieval of biomedical literature, but the task is still challenging.
PubMed and PubMed Central (PMC) are scientific literature databases maintained by the U.S. National Library of Medicine. As of February 2016, PubMed holds over 25 million records, allowing users to search the content of article abstracts, while PMC holds over 3.7 million free full-text articles. When utilizing databases such as PubMed and PMC to retrieve relevant information, researchers generally need to express their search needs using a specific query language. This makes the task difficult for users not experienced with query languages, and can compromise the knowledge discovery process. In this work, we present an open source search engine that aims to address two different aspects related to the retrieval of biomedical literature: improve the content access offered by PubMed or PMC, and facilitate the query formulation for users by processing queries in natural language.

The system is composed of two modules: the indexation module and the complex query module. Based on the search platform Solr/Lucene, the indexation module generates the inverted index of the dataset, representing all documents using relevant content found in the article (titles, abstract, body, keywords, references, etc.). The complex query module handles complex user queries, which are processed according to different query types. For each type, a specific search strategy is applied to better meet the user needs. In addition, query terms can be expanded using UMLS concepts. Our search engine was created based on the open-access scientific literature made available by the PubMed Baseline Database (BD) and the PMC Open Access (OA) Subset repository. A total of 25,403,053 articles from these sources had been indexed as of February 2016.

Information retrieval systems are often evaluated using reference judgments or pseudo-judgments. Here we propose an evaluation method based on pseudo-judgments and sets of annotated queries. Our evaluation dataset is composed of query-document sets manually annotated by curators working on the mycoCLAP database. The dataset utilized for preliminary evaluation has 19 query-document relations. From the total, 9 queries have a correct response document mapped to a PMC OA entry (full-text article). The other 10 have a correct response document mapped to a PubMed BD entry (article abstract). For each query, we analyzed the first 20 ranked documents, and computed a Mean Reciprocal Rank (MRR) score for the correct response document, considering the position where it was found in the search result list. An MRR score over 0.5 indicates that the system retrieved the correct response document in the first or second position for more than half of the requests. Our work currently focuses on improving the retrieval of full-text documents. |
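The MRR evaluation described above can be sketched as follows, assuming one correct response document per query and a cutoff of 20 ranked results; the query and document identifiers are placeholders.

```python
# Mean Reciprocal Rank over a set of queries, each with one correct document.
def mean_reciprocal_rank(results_per_query, correct_doc_per_query, cutoff=20):
    total = 0.0
    for query, ranked_docs in results_per_query.items():
        correct = correct_doc_per_query[query]
        if correct in ranked_docs[:cutoff]:
            rank = ranked_docs.index(correct) + 1  # 1-based rank
            total += 1.0 / rank
        # queries whose correct document is missed contribute 0
    return total / len(results_per_query)

# Toy example with placeholder query and document identifiers.
results = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2"]}
correct = {"q1": "d7", "q2": "d5"}
print(mean_reciprocal_rank(results, correct))  # (1/2 + 0) / 2 = 0.25
```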
Hayda Almeida, Ludovic Jean-Louis and Marie-Jean Meurs | |
77 |
PubAnnotation is a literature annotation repository in which annotations made to scientific literature
are collected. In particular, its current primary target is life science literature: PubMed articles and
PMC Open Access articles. So far, annotation data sets produced by many different groups have been
collected and integrated. Examples include entity annotations and relation annotations. Entity types
vary from proteins to species or diseases. Some are linguistic annotations like syntactic parses or
coreferences. Some are automatic annotations and some are fully manual ones. Thanks to contributions from
many groups, those data sets are now accessible in an integrative way.
PubAnnotation also features a function to pull annotations from annotation servers. PubDictionaries is an example of such an annotation server. In fact, PubDictionaries is a repository for dictionaries. User-generated dictionaries, e.g., an Excel file with a collection of protein names and their UniProt IDs, are collected in PubDictionaries. Anyone who has such a dictionary can upload its CSV dump file to PubDictionaries. Then, a REST web service for text annotation based on the dictionary is immediately enabled. Using PubDictionaries, a user can quickly produce annotations based on dictionaries of his/her interest, and using PubAnnotation, he/she can easily check whether relevant annotations already exist. As an application case, we compiled text mining resources for glycobiology, which we call GlycoTM. The GlycoTM collection is currently only in a preliminary state, but it demonstrates how such a collection can be produced using PubDictionaries and PubAnnotation. We will keep developing the GlycoTM collection as an open resource.

For GlycoTM, 8 dictionaries were created for glycobiology from 4 databases and ontologies. The following are descriptions of the databases. (1) GlycoEpitope is a database that integrates carbohydrate antigens and their antibodies, as well as related information such as glycoproteins, glycolipids, enzymes, tissues and diseases. (2) PACDB (Pathogen Adherence to Carbohydrate Database) contains literature-reported and experimentally obtained information on glycans and pathogens (binding and unbinding) and offers ontology-systematized data. It is well known that pathogens recognize glycans. The registered data were reconstructed with an ontology implementation, named PAConto. (3) GDGDB (Glyco-Disease Genes DataBase) contains information on diseases induced by alteration of glycosyltransferase genes and glycosidase genes. Ontology-systematized data on glycan metabolism and clinical conditions are also maintained. (4) cGGDB (Caenorhabditis elegans GlycoGene Database) is a database for C. elegans glycogenes. This database was designed so that researchers and students in medical biology can easily understand how glycogenes related to human disease act in a model organism, C. elegans.

Based on the dictionaries, a text annotation collection is produced. Annotation is made to 2,931 PubMed articles from the journal Glycobiology (Oxford Journals). 5 annotation data sets are produced based on the 5 databases described above. In addition, annotations based on the GO, FMA and ICD10 lexicons or ontologies are also produced as supporting resources. Those annotations are all accessible through the REST API of PubAnnotation. An example excerpt can be checked through this URL: http://pubannotation.org/docs/sourcedb/PubMed/sourceid/22459802/spans/689-836/annotations/visualize |
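A toy sketch of the dictionary-based annotation that PubDictionaries enables is shown below; the dictionary entries and identifiers are placeholders, and the output only loosely follows PubAnnotation's span/object annotation format.

```python
# Toy dictionary-based annotator: match dictionary labels in text and return
# identifiers with character spans.
import re

dictionary = {
    "glycosyltransferase": "EX:0001",   # hypothetical identifiers
    "heparan sulfate": "EX:0002",
}

def annotate(text, dictionary):
    annotations = []
    for label, identifier in dictionary.items():
        for match in re.finditer(re.escape(label), text, flags=re.IGNORECASE):
            annotations.append({
                "span": {"begin": match.start(), "end": match.end()},
                "obj": identifier,
            })
    return annotations

sentence = "A glycosyltransferase acting on heparan sulfate was characterized."
print(annotate(sentence, dictionary))
```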
Jin-Dong Kim, Toshihide Shikanai, Shujiro Okuda and Shin Kawano | |
78 |
Relations between chemicals and diseases (Chemical-Disease Relations or CDRs) play critical roles in
drug discovery (toxicity), biocuration, pharmacovigilance, etc. Due to the high cost of manual curation
and rapid growth of the biomedical literature, several attempts have been made to extract CDRs using
automatic systems. In this study, we propose a kernel learning method to identify CDRs from PubMed.
Kernel-based learning algorithms have gained increasing popularity in the machine learning community
for their solid theoretical foundation and promising performance. Compared with a single kernel function,
combinations of multiple kernels have been shown to perform better; however, multiple kernel learning
still lacks systematic study in this setting. Here, we extracted semantic relations from text using a
machine learning method based on multiple kernel functions. First, we constructed different kernel
functions for text objects of different sizes, so as to reflect the different semantic features of the
corresponding text. We then built a multiple kernel learning framework that combines two or more single
kernel functions to achieve optimal extraction of semantic relations. Finally, we verified the
effectiveness of the proposed algorithm by applying it to the BioCreative V corpus released in 2015 and
evaluated it comprehensively in accordance with international standards. The results show that our
multiple-kernel algorithm achieves better efficiency and accuracy.
|
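As an illustration of the multiple-kernel idea described above, the sketch below combines two single kernels into one Gram matrix and trains an SVM on it. It is a toy example, not the authors' system: the feature matrices, labels and the fixed kernel weight are hypothetical stand-ins for kernels built over text objects of different sizes.

```python
# Toy sketch of combining single kernels into a multiple-kernel Gram matrix
# for an SVM; the data, kernel choices and fixed weight are hypothetical.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_word = rng.normal(size=(100, 50))       # word-level features (hypothetical)
X_sentence = rng.normal(size=(100, 20))   # sentence-level features (hypothetical)
y = rng.integers(0, 2, size=100)          # 1 = chemical-disease relation holds

K_word = linear_kernel(X_word)            # one kernel per "text object size"
K_sentence = rbf_kernel(X_sentence, gamma=0.1)

w = 0.6                                   # fixed convex combination; real MKL learns this
K_combined = w * K_word + (1 - w) * K_sentence

clf = SVC(kernel="precomputed").fit(K_combined, y)
print("training accuracy:", clf.score(K_combined, y))
```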
Yan Liu, Yueping Sun, Li Hou and Jiao Li | |
79 |
The research community is now flooded with scientific literature, with thousands of journals and over 20
million abstracts in PubMed. Somewhere in this information lie the answers to questions not only for
scientific research but also for business research. Often, a market researcher starts with a question,
then collects data to answer it. Now, however, we can start with public data and work out new, useful and
valuable questions to ask and answer. Customers want digestible information – everything relevant, not
hundreds of journal articles to read. In this talk, we will
present case studies on how we used the ontologies and disambiguation techniques to address the needs
for business analytics in the pharmaceutical research. The results will be presented in the context of
identification of key opinion leaders.
|
Parthiban Srinivasan | |
80 |
With the ever-rising amount of biomedical literature, it is increasingly difficult for scientists to
keep up with the published work in their fields of research, much less related ones. The use of natural
language processing (NLP) tools can make the literature more accessible by aiding concept recognition
and information extraction. As NLP-based approaches have been increasingly used for biocuration, so too
have biomedical ontologies, whose use enables semantic integration across disparate curated resources,
and millions of biomedical entities have been annotated with them. Particularly important are the Open
Biomedical Ontologies (OBOs), a set of open, orthogonal, interoperable ontologies formally representing
knowledge over a wide range of biology, medicine, and related disciplines.
Manually annotated document corpora have become critical gold-standard resources for the training and testing of biomedical NLP systems. This was the motivation for the creation of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access journal articles from the biomedical literature. Within these articles, each mention of the concepts explicitly represented in eight prominent OBOs has been annotated, resulting in gold-standard markup of genes and gene products, chemicals and molecular entities, biomacromolecular sequence features, cells and cellular and extracellular components and locations, organisms, biological processes and molecular functionalities. With these ~100,000 concept annotations among the ~800,000 words in the 67 articles of the 1.0 release, it is one of the largest gold-standard biomedical semantically annotated corpora. In addition to this substantial conceptual markup, the corpus is fully annotated along a number of syntactic and other axes, notably by sentence segmentation, tokenization, part-of-speech tagging, syntactic parsing, text formatting, and document sectioning. In the several years since the initial release of the CRAFT Corpus, in addition to efforts within our group and in collaboration with others, including the first comprehensive gold-standard evaluation of current prominent concept-recognition systems, it has already been used in multiple external projects to drive development of higher-performing systems. Here we present our continuing work on the corpus along several fronts. First, to keep the corpus relevant, we are updating the concept annotations using newer versions of the ontologies already used to mark up the articles, removing annotations of obsoleted classes and editing previous annotations or creating new annotations of newly added classes. Additionally, to extend the domain of annotated concept types, we are also marking up mentions of concepts using the Molecular Process Ontology (for types of chemical processes) and the Uberon Anatomy Ontology (for anatomical components and life-cycle stages). Finally, to capture even more content, we are generating new annotations for roots of prefixed/suffixed words as well as annotations made with extension classes we have created. We will present updated annotation counts and interannotator agreement statistics for these continuing efforts as well as future plans. All of this work is designed to further increase the potential of the CRAFT Corpus to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. |
Michael Bada, Nicole Vasilevsky, Melissa Haendel and Lawrence Hunter | |
81 |
The BioHub Information and Knowledge Management System (IKMS)
aims to support the process of identifying chemical ingredients that can be sourced from sustainable biomass as alternatives to those from non-renewable resources such as fossil oils or earth minerals. The curation of chemical data in BioHub is performed in three stages: (i) a text mining stage whose aim is to mine facts about chemicals directly from the scientific literature (journal papers), patents and laboratory reports; (ii) an 'assertion generation' stage where 'assertions' (i.e. factual statements about feedstocks and chemicals) are selected as candidates for curation by querying the text mining results; (iii) a curation stage that allows curators to browse, edit, validate and store the assertions to the system's final data store.

The text analytics system of BioHub is developed within the GATE platform. It processes documents in order to extract relevant information about chemicals, such as the feedstock streams from which they are derived, their physical and chemical properties, possible transformations applied to them, etc. This information is exported to OWL data stores enriched with links to the parts of the texts from which it comes. Curation is not performed directly on the text mining results but on the output of semantic queries applied to them. Assertions are generated on-the-fly by querying the OWL output with SPARQL and are transformed to JSON objects ready to be parsed and presented in tabular format on the BioHub curation User Interface (UI). The conceptual structure of an assertion is pre-specified. An example of an assertion type is "feedstock-has-chemical-with-proportion". This is derived from two binary relations (triples) identified in the text mining stage, i.e. "feedstock-has-chemical" and "chemical-has-proportion", both of which may have been extracted from the same part of the text (usually a sentence) and have been linked together by querying the text mining data.

The main goal of the curation UI is to populate the BioHub data repository with validated assertions. Secondary goals are to aid performance evaluation and to provide curated data for system training and refinement. Its design favours an intuitive interface aimed at rapid curation over a full-fledged annotation editor, which would require much more user intervention and time for curating full-text documents. The UI runs as a Web-based client with client-side services for input data and server-side facilities for storing the final results. Its editing engine includes capabilities such as context-enabled curation (allowing access to the text sentences, paragraphs or full text), content-editable fields for entities, grouping/sorting of assertions by various facets, logging, etc. The results of the curation stage are stored in the BioHub data repository and are used to support subsequent stages in the IKMS, such as the selection of ingredients based on functional characteristics and the computational optimisation of chemical pathways. The curation architecture of the BioHub IKMS demonstrates how text mining and semantic web technologies can be integrated within a distributed, goal-oriented curation infrastructure to facilitate the semi-automated development of knowledge bases in the chemical domain. |
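The derivation of a compound assertion from two binary relations can be sketched as follows. This is an illustrative toy, not BioHub's actual schema: the namespace, property names and example triples are hypothetical, and rdflib stands in for whatever triple store the IKMS uses.

```python
# Toy sketch: joining "feedstock-has-chemical" and "chemical-has-proportion"
# triples via SPARQL into a "feedstock-has-chemical-with-proportion" assertion
# serialised as JSON. Namespace, properties and data are hypothetical.
import json
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/biohub/")
g = Graph()
g.add((EX.wheat_straw, EX.hasChemical, EX.xylose))
g.add((EX.xylose, EX.hasProportion, Literal("18%")))

query = """
PREFIX ex: <http://example.org/biohub/>
SELECT ?feedstock ?chemical ?proportion WHERE {
  ?feedstock ex:hasChemical ?chemical .
  ?chemical  ex:hasProportion ?proportion .
}
"""

assertions = [
    {"type": "feedstock-has-chemical-with-proportion",
     "feedstock": str(row.feedstock),
     "chemical": str(row.chemical),
     "proportion": str(row.proportion)}
    for row in g.query(query)
]
print(json.dumps(assertions, indent=2))
```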
George Demetriou, Warren Read, Martyn Fletcher, Noel Ruddock, Goran Nenadic, Tom Jackson, Robert Stevens and Jerry Winter | |
82 |
FlyBase has an active and diverse outreach program to engage with our user community. For example, the
FlyBase Community Advisory Group (FCAG) comprises over 550 FlyBase users from around the world and
provides essential feedback on new features and changes to FlyBase through regular surveys. We have also
implemented the online 'Fast-Track Your Paper' tool to facilitate community curation, with over 50% of
authors routinely associating genes with their publications in FlyBase and highlighting data types
requiring deeper curation.
More recently, we have started making video tutorials as a means to answer common queries and help people get the most out of FlyBase. Topics covered so far include 'How to find all data related to a gene', 'How to generate an Excel file of all alleles of a gene' and 'How to cite FlyBase'. To produce a video, a script is first written, a screen recording is captured, and a voiceover is added. The videos are available on the newly created FlyBase YouTube channel, FlyBase TV. Subsequent videos will focus on each of the various tools in FlyBase. |
Alix Rey, Laura Ponting, Gary Grumbling, Jim Thurmond, Jose-Maria Urbano, Gillian Millburn, Steven Marygold and Nick Brown | |
83 |
Enormous amounts of biomedical data have been and are being produced by investigators all over the
world. However, one crucial and limiting factor in data reuse is the accurate, structured and complete
description of the data – data about the data, defined as metadata. We propose a framework to predict
structured metadata terms from unstructured metadata in order to improve the quality and quantity of
metadata, using the GEO microarray database. Our framework consists of a Latent Dirichlet Allocation
model (LDA) to reduce the dimensionality of the unstructured data, in combination with a supervised
classifier. We compared support vector machines and decision trees with the majority classifier as
baseline. Our results on the GEO database show that structured metadata terms can be accurately
predicted. This is a promising approach for metadata prediction that is likely to be applicable to other
datasets and has implications for researchers and practitioners interested in biomedical metadata
curation and metadata prediction.
|
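A minimal sketch of the described pipeline is given below: LDA topic features are derived from free-text metadata and fed to a supervised classifier, with a majority-class baseline for comparison. The example records, labels and parameter choices are hypothetical and only illustrate the shape of the framework.

```python
# Minimal sketch: LDA topics over unstructured metadata + supervised classifier,
# compared against a majority-class baseline. Records and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.dummy import DummyClassifier

free_text = [
    "total RNA from mouse liver, treated with compound X",
    "human breast cancer cell line, untreated control",
    "mouse liver tissue, vehicle control sample",
    "human tumor biopsy, chemotherapy treated",
]
labels = ["Mus musculus", "Homo sapiens", "Mus musculus", "Homo sapiens"]

counts = CountVectorizer().fit_transform(free_text)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

clf = LinearSVC().fit(topics, labels)                             # structured-term predictor
baseline = DummyClassifier(strategy="most_frequent").fit(topics, labels)

print("SVM predictions:", list(clf.predict(topics)))
print("majority baseline accuracy:", baseline.score(topics, labels))
```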
Lisa Posch, Maryam Panahiazar, Michel Dumontier and Olivier Gevaert | |
85 |
The Conserved Domain Database (CDD) is a collection of multiple sequence alignments that represent
ancient conserved domains. One part of the CDD resource is a mirror of publicly available domain model
collections, including Pfam and TIGRFAMs, among others. These may be used as starting points for
manually-curated conserved domain models (accessions with a 'cd' prefix) arranged in a hierarchical
structure to reflect evolutionary diversification of ancient protein domain families. Most curated
models contain annotation of features that are conserved across the domain family, supported by evidence
obtained from 3D structures as well as the published literature. Curated domain family models are also
created de novo for previously uncharacterized families, often identified via novel 3D structures with
no conserved domain annotation. Hierarchical classification and curation of protein domains, using our
in-house tools CDTree (hierarchy viewer) and Cn3D (structure viewer and multiple alignment editor), have
been the focus of our manual curation efforts. In addition, we develop structural motif models
(accessions with an 'sd' prefix) to represent protein sequence segments such as short repeats, coiled
coils, and transmembrane regions. We also manually validate superfamily clusters (accessions with a 'cl'
prefix), formed by an automated clustering procedure as sets of conserved domain models that generate
overlapping annotation on the same protein sequences. Superfamily clustering allows the organization of
data within CDD in a non-redundant way, as each data source may have its own model for a specific
conserved domain. Cluster validation is aided by using Cytoscape as a visualization tool for the degree
of overlap between conserved domain models. More recently, our manual curation efforts have focused on
providing functional labels for domain architectures, using an in-house procedure called SPARCLE ("Specific
ARChitecture Labeling Engine"). While we are able to assign functional labels to a large fraction
of proteins, we have also identified areas of insufficient coverage and resolution of the current
protein domain models that comprise CDD. In this poster, we will discuss all aspects of manual curation
in CDD. The need for manual curation work always exceeds available resources and we hope to automate
hierarchical classifications to some degree in the near future.
Acknowledgement This research was supported [in part] by the Intramural Research Program of the National Library of Medicine, NIH. |
Noreen Gonzales, Farideh Chitsaz, Myra Derbyshire, Lewis Geer, Marc Gwadz, Lianyi Han, Jane He, David Hurwitz, Christopher Lanczycki, Fu Lu, Gabriele Marchler, James Song, Narmada Thanki, Josie Wang, Roxanne Yamashita, Chanjuan Zheng, Steve Bryant and Aron Marchler-Bauer | |
86 |
The genomes of seven malaria parasite species (Plasmodium spp) are currently being curated, including
those of the rodent-malaria parasites, P. chabaudi, P. yoelii and P. berghei; the human-infective
species, P. falciparum, P. vivax and P. knowlesi; and the chimpanzee parasite P. reichenowi. Thousands
of additional genomes are being sequenced from clinically isolated parasites from across the globe to
study the evolving genetics of parasite populations. In addition, draft genomes of additional species
are being used to understand the structure and evolution of Plasmodium genomes. Manual curation of all
of these data would be impossible. Therefore we are focusing curation activities on one genotype per
reference species. This enables the malaria community to use these reference genomes to look for
manually curated GO terms and products and transfer them to other sequenced isolates that are not
curated.
We have established a workflow for manual curation. The annotation tool Artemis is used to read and write directly to a Chado relational database underlying GeneDB (http://www.genedb.org). An annotation-transfer tool has been implemented in Artemis to transfer annotation between features within the same Chado database. GeneDB houses the curated Plasmodium reference genomes and is updated daily. As part of a collaborative effort with PlasmoDB (http://www.plasmodb.org), every few months the annotated and curated genomes are sent from GeneDB to PlasmoDB to be integrated with a wide variety of functional genomics data sets. PlasmoDB also enables the community to search non-curated genomes that are not being updated. A banner with a direct link to the GeneDB gene record page has been implemented to inform the community of changes in the annotation. |
Ulrike Boehme, Thomas Dan Otto, Mandy Sanders, Chris Newbold and Matthew Berriman | |
87 |
PROSITE is a resource for the identification and annotation of conserved regions in protein sequences.
These regions are identified using two types of signatures: generalized profiles (weight matrices), which
describe protein families and modular protein domains, and patterns (regular expressions), which describe
short sequence motifs often corresponding to functionally or structurally important residues. PROSITE
signatures are linked to annotation rules, or ProRules, which define protein sequence annotations (such
as active site and ligand-binding residues) and the conditions under which they apply (for example
requiring specific amino acid residues). PROSITE signatures, together with ProRule, are used for the
annotation of domains and features of UniProtKB/Swiss-Prot entries. The latest version of PROSITE
(release 20.122, of 13 January 2016) contains 1309 patterns, 1145 profiles and 1145 ProRules and is
accessible at: http://prosite.expasy.org/prosite.html.
The ScanProsite tool (http://prosite.expasy.org/scanprosite/) allows users to search protein sequences against all PROSITE signatures, and to search for matches to defined PROSITE signatures in the UniProtKB and PDB databases. Individual protein sequences and whole proteomes can be subjected to repeated scans with the benefits of the PROSITE graphical view of the results and the application of ProRule for a more precise prediction. |
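To make the pattern-type signature concrete, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a peptide with it. It is an illustration only: the conversion handles just the core syntax (x, brackets, braces, repetition counts, termini), and the pattern and sequence shown are examples rather than material quoted from the PROSITE release described above.

```python
# Illustrative sketch: converting a PROSITE-style pattern into a regular
# expression and scanning a sequence. Pattern and peptide are examples only.
import re

def prosite_to_regex(pattern):
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {X} = any residue but X
        element = element.replace("(", "{").replace(")", "}")   # (n) / (n,m) = repetition
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence termini
        parts.append(element)
    return "".join(parts)

pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"   # C2H2 zinc-finger-like motif
sequence = "MKCPECGKSFSQSSNLQKHQRTH"                     # hypothetical peptide

for match in re.finditer(prosite_to_regex(pattern), sequence):
    print(match.start() + 1, match.end(), match.group())
```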
Christian J. A. Sigrist, Edouard de Castro, Beatrice A. Cuche, Delphine Baratin, Thierry Schuepbach, Sebastien Moretti, Marco Pagni, Sylvain Poux, Nicole Redaschi, Alan Bridge, Lydie Bougueleret and Ioannis Xenarios | |
88 |
Prediction of Metabolic Pathway Involvement in Prokaryotic UniProtKB Data by Association Rule Mining
The widening gap between known proteins and their functions has encouraged the development of methods to
automatically infer annotations. Functional annotation of proteins is expected to meet the conflicting
requirements of providing comprehensive information while avoiding erroneous functional assignments.
This trade-off poses a great challenge in designing intelligent automatic annotation systems.
In the scope of this work, we tackle the problem of automatic functional annotation of prokaryotic pathways in UniProtKB. We suggest that association rule mining can be used effectively as a computational method for pathway prediction. Here, we introduce ARBA, an Association-Rule-Based Annotator that can be used to enhance the quality of automatically generated annotations as well as to annotate proteins of unknown function. ARBA utilizes data from UniProtKB/Swiss-Prot and uses InterPro signatures and organism taxonomy as attributes to predict the metabolic pathways associated with each protein entry. With respect to defined quality measures, we find all rules that define significant relationships between attributes and pathway annotations in UniProtKB/Swiss-Prot entries. The set of extracted rules represents comprehensive knowledge that can explain protein pathway involvement. However, these rules contain redundant information, and their large number makes it infeasible to apply them to large data sets such as UniProtKB/TrEMBL. To address this issue, ARBA puts the rules through a fast competition process based on two concepts, dominance and comparability. The rules are thereby considerably reduced in number and aggregated with respect to the predicted pathways. The resulting knowledge represents concise prediction models that assign pathway involvement to UniProtKB entries. We carried out an evaluation of the system using semantic similarity and cross-validation on UniProtKB prokaryotic entries to demonstrate the performance, capability and robustness of our approach, achieving a very high accuracy of pathway identification with an F1-measure of 0.982 and an AUC of 0.987. Moreover, our prediction models were applied to 6.2 million UniProtKB/TrEMBL reference proteome entries of prokaryotes. As a result, 663,724 entries were covered, of which 436,510 lacked any previous pathway annotation. Comparing the annotation coverage of this set of entries with the other main automatic annotation systems in UniProtKB/TrEMBL, namely SAAS and UniRule (which includes Rule-base and HAMAP-Rule), we found that ARBA significantly surpassed the others in terms of the number of entries covered: ARBA annotated 663,724 entries, whereas HAMAP-Rule, SAAS and Rule-base annotated only 229,402, 205,097 and 93,613 entries, respectively. Analysing the annotations themselves, we found that ARBA generated 786,819 predictions, the majority of which (516,042) concerned entries with no previous pathway annotation; 237,784 predictions were identical to annotations proposed by other systems, which reinforces the reliability of our system's predictions. A Java Archive (JAR) package for applying the prediction models to UniProtKB/TrEMBL prokaryotic entries is available at: http://www.ebi.ac.uk/~rsaidi/arba/ The link also contains the list of prediction models and graphical reports illustrating the system's performance on some prokaryotic organisms in UniProtKB/TrEMBL. |
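The core idea of mining attribute-to-pathway rules can be sketched in a few lines. This toy example is not ARBA itself: the entries, thresholds and the simple support/confidence scoring stand in for the quality measures and the dominance/comparability competition described above.

```python
# Toy sketch: scoring candidate rules {InterPro signature, taxon} -> pathway
# by support and confidence over annotated entries. Data and thresholds are
# hypothetical; ARBA's actual measures and rule competition are richer.
from collections import Counter
from itertools import combinations

entries = [
    ({"IPR000031", "Bacteria"}, "Purine biosynthesis"),
    ({"IPR000031", "Bacteria"}, "Purine biosynthesis"),
    ({"IPR000031", "Archaea"},  "Purine biosynthesis"),
    ({"IPR001234", "Bacteria"}, "Glycolysis"),
]
MIN_SUPPORT, MIN_CONFIDENCE = 2, 0.8

rules = []
for size in (1, 2):
    antecedent_counts, rule_counts = Counter(), Counter()
    for attributes, pathway in entries:
        for antecedent in combinations(sorted(attributes), size):
            antecedent_counts[antecedent] += 1
            rule_counts[(antecedent, pathway)] += 1
    for (antecedent, pathway), support in rule_counts.items():
        confidence = support / antecedent_counts[antecedent]
        if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
            rules.append((antecedent, pathway, support, round(confidence, 2)))

for rule in rules:
    print(rule)
```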
Imane Boudellioua, Rabie Saidi, Robert Hoehndorf, Maria J. Martin and Victor Solovyev | |
89 |
A large part of post-genomic research focuses on the analysis of protein-protein interactions (PPIs),
which are central to all biological processes. Inferring PPIs involved in human transcriptional regulation
(TR) is of particular interest, as they are often deregulated in complex diseases and may represent
valuable pharmaceutical targets. We devised a method to analyze and predict these interactions based on
sequence information only, aiming to avoid the limitations imposed by dispersed auxiliary information such
as localization, structural and expression data. The new predictor incorporates information on the
pseudo-amino acid composition of features that dominate PPIs. Besides electrostatic and hydrophobic
features, it incorporates the electron-ion interaction potential (EIIP), a descriptor of long-range
interaction properties that contribute to protein binding specificity through long-range recognition
between partners. Based on a dataset compiled from HIPPIE (Human Integrated Protein-Protein Interaction
rEference), a random forest model was constructed with an average accuracy of 80.41% and an AUC of 0.88 on
independent test sets. Compared with previous studies, our approach outperformed other models in
predictive performance and algorithmic efficiency and will therefore facilitate the understanding of
complex cellular behaviors and the organization of large-scale data into models of cellular signaling and
regulatory machinery.
|
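A highly simplified sketch of such a sequence-only pair classifier is shown below; the concatenated amino acid composition vectors stand in for the pseudo-amino acid composition and EIIP descriptors actually used, and the training pairs are invented.

```python
# Hedged sketch: representing each protein pair by concatenated amino acid
# composition vectors and classifying with a random forest. The descriptors
# and the tiny training set are hypothetical stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=float)
    return counts / max(len(sequence), 1)

pairs = [  # (sequence A, sequence B, interacts?)
    ("MKTAYIAKQR", "GSHMDDLKLV", 1),
    ("MKTAYIAKQR", "PPPPGGGSSS", 0),
    ("AVLIMCFWYH", "GSHMDDLKLV", 1),
    ("AVLIMCFWYH", "PPPPGGGSSS", 0),
]
X = np.array([np.concatenate([composition(a), composition(b)]) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("P(interaction):", model.predict_proba(X)[:, 1])
```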
Neven Sumonja, Vladimir Perovic and Nevena Veljkovic | |
90 |
As sequencing technologies continue to drop in price and increase in throughput, new challenges
regarding processing the vast datasets for a huge number of genes and identifying an optimal analytical
methodology emerge. The informational spectrum method (ISM) is a sequence analysis approach that relies
on the Fast Fourier Transform (FFT) combined with decoding of the protein sequence via amino acid
physicochemical properties that transforms the sequence into an Informational Spectrum (IS). Starting
from the IS of the protein, we developed new protein distance measures and a novel phylogenetic
algorithm, ISTREE, which has been found to overcome some drawbacks of classical phylogenetic approaches,
particularly sensitivity to single mutations and deletions, as well as to the position of the mutation.
Given that ISTREE is based on the FFT and does not require multiple sequence
alignment (MSA), it is a fast method for evolutionary analyses of large sets of protein sequences,
compared to other standard phylogenetic algorithms. We used ISTREE to study the functional evolution of
the hemagglutinin subunit 1 protein (HA1), in an effort to better understand the viral determinants that
facilitate human infections of the highly pathogenic avian influenza (HPAI) A subtype H5N1 virus. The
mutations that increase HPAIV propensity for human-to-human transmission were identified and the
predictions were confirmed in vitro.
|
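The encoding-plus-FFT step at the heart of the ISM can be sketched as follows. The EIIP values in the dictionary are illustrative placeholders and the peptide is invented; the published EIIP table should be used for any real analysis.

```python
# Minimal sketch of the informational spectrum method: encode a sequence with a
# physicochemical scale (EIIP), remove the mean, and take the magnitude of the
# discrete Fourier transform. EIIP values below are placeholders, not the
# published table; the sequence is hypothetical.
import numpy as np

EIIP = {"A": 0.037, "C": 0.083, "D": 0.126, "E": 0.006, "G": 0.005,
        "H": 0.024, "K": 0.037, "L": 0.000, "S": 0.083, "V": 0.006}

def informational_spectrum(sequence):
    signal = np.array([EIIP[aa] for aa in sequence])
    signal = signal - signal.mean()              # drop the DC component
    spectrum = np.abs(np.fft.fft(signal))
    n = len(signal)
    return np.arange(n // 2) / n, spectrum[: n // 2]

freqs, spectrum = informational_spectrum("AVLKGDHSECCAGLVKDSHE")
peak = freqs[np.argmax(spectrum[1:]) + 1]        # skip the zero frequency
print("characteristic IS frequency:", round(float(peak), 3))
```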
Vladimir Perovic, Veljko Veljkovic, Sanja Glisic and Nevena Veljkovic | |
91 |
Rice is one of the most important staple foods for a large portion of the world's population and also a
key model organism for cereal crops due to its great agricultural importance. In order to provide a
precise reference genome in aid of extensive rice-related studies, it is desirable to keep annotating
the rice genome by integrating large quantities of high-throughput omics data. Here, we present a
re-annotation release of the Oryza sativa Japonica genome (BIGD-IC4R-1.0), for the first time, based on
more than 700 publicly available high-quality RNA-Seq datasets (~7.4 Terabyte) along with annotations
contributed from NCBI, EBI and UniProt, thereby providing substantial improvements over the previous
version MSU Rice Genome Annotation Project Release 7.0 (MSU7.0; released on Feb 6, 2013). Our near-final
release of BIGD-IC4R-1.0 consists of 57,905 protein-coding genes, among which 2,259 novel genes are
identified for the first time, and the structural annotations of a total of 20,682 genes have been
updated based on the previous version MSU7.0. Moreover, the number of genes in BIGD-IC4R-1.0 with splice
variants is significantly increased compared with MSU7.0. In addition, 11,841 long ncRNAs were
identified from 658,655 assembled transcripts. BIGD-IC4R-1.0, an updated version for rice genome
re-annotation that is, for the first time, based on large-scale RNA-Seq data analysis, has revised
hundreds of inaccurate gene models and provided a number of alternatively spliced isoforms as well as
long ncRNAs, which thus would be of critical importance for significantly promoting functional studies
in rice as well as other plants.
|
Lili Hao, Jian Sang, Lin Xia and Zhang Zhang | |
92 |
Annotation of proteins based on structure-based analyses is an integral component of the UniProt
Knowledgebase (UniProtKB). There are over 100,000 experimentally determined 3-dimensional structures of
proteins deposited in the Protein Data Bank. UniProt works closely with the Protein Data Bank in Europe
(PDBe) to map these 3D structural entries to the corresponding UniProtKB accessions accurately and
coherently based on comprehensive sequence and structure-based analyses, to ensure that there is a
UniProtKB record for each relevant PDB record and to import additional data such as ligand-binding sites
from PDB to UniProtKB.
SIFTS (Structure Integration with Function, Taxonomy and Sequences), a collaboration between the Protein Data Bank in Europe (PDBe) and UniProt, facilitates the link between the structural and sequence features of proteins by providing correspondences at the level of amino acid residues. A process combining manual and automated steps for maintaining up-to-date cross-reference information has been developed and is carried out for every weekly PDB release. Various criteria are considered when cross-referencing PDB and UniProtKB entries, such as (a) high sequence identity (>90%); (b) exact taxonomic match (at the level of species, subspecies and specific strains for lower organisms); (c) mapping to a curated Swiss-Prot entry (if one exists); (d) mapping to proteins from a reference/complete proteome; and (e) mapping to the longest protein sequence. Some cases are inspected manually by UniProt using a dedicated curation interface to ensure accurate cross-referencing; these cases include short peptides, chimeras, synthetic constructs and de novo designed polymers. The SIFTS initiative also provides up-to-date cross-referencing of structural entries to the literature (PubMed), taxonomy (NCBI), the enzyme database (IntEnz), Gene Ontology annotations (GO) and protein family classification databases (InterPro, Pfam, SCOP and CATH). In addition to maintaining accurate mappings between UniProtKB and PDB, a pipeline has been developed to automatically import data from the PDB to enhance the unreviewed records in UniProtKB/TrEMBL. This includes details of residues involved in the binding of biologically relevant molecules, including nucleotides, metals, drugs, carbohydrates and post-translational modifications, and greatly improves the biological content of these records. To date, UniProt has successfully completed the non-trivial and labour-intensive exercise of cross-referencing ~250,000 polypeptide chains and 102,417 PDB entries (out of 115,306 entries processed by PDBe). All this work enables non-expert users to see protein entries in the light of relevant biological context such as metabolic pathways, genetic information, molecular functions, conserved motifs and interactions. Protein structural information in UniProt serves as a vital dataset for various academic and biomedical research projects. |
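The mapping criteria listed above can be pictured as a simple ranking function over candidate UniProtKB entries for a given PDB chain. The sketch below is illustrative only: the candidate records are hypothetical, and the real SIFTS pipeline is considerably more involved, including manual inspection of difficult cases.

```python
# Illustrative sketch: choosing a UniProtKB cross-reference for a PDB chain by
# the criteria listed above (identity, taxon, Swiss-Prot status, reference
# proteome membership, sequence length). Candidate records are hypothetical.

def best_uniprot_match(candidates):
    eligible = [c for c in candidates if c["identity"] > 0.90 and c["same_taxon"]]
    if not eligible:
        return None
    return max(eligible, key=lambda c: (c["is_swissprot"],
                                        c["in_reference_proteome"],
                                        c["length"]))

candidates = [
    {"accession": "P12345", "identity": 0.99, "same_taxon": True,
     "is_swissprot": True, "in_reference_proteome": True, "length": 452},
    {"accession": "A0A024R161", "identity": 0.99, "same_taxon": True,
     "is_swissprot": False, "in_reference_proteome": True, "length": 460},
]
print(best_uniprot_match(candidates)["accession"])
```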
Nidhi Tyagi, Guoying Qi, Maria-Jesus Martin, Claire O'Donovan and Uniprot Consortium | |
93 |
The mission of UniProt is to provide a comprehensive and thoroughly annotated protein resource to the
scientific community, most notably through the UniProt Knowledgebase (UniProtKB). Within UniProtKB, the
reviewed section (Swiss-Prot) contains high quality, manually curated, richly-annotated protein records.
In contrast, the unreviewed section (TrEMBL), which makes up 99% of UniProtKB, depends for its annotation
on automatically extracted experimental data from 3D structures, links to other databases and rule-based
annotation. The use of rule-based annotation is necessary because there is no experimental data
available for the majority of the unreviewed protein sequences. This makes inference of function by
similarity/homology the only option for annotation.
UniRule is a rule-based annotation system leveraging the expert-curated data in reviewed UniProtKB to increase the depth of annotation in unreviewed entries. Currently the UniRule system contains over 4,500 rules, which provide annotation for approximately 28% of unreviewed entries. Rules are a formalized way of expressing an association between conditions, which have to be met, and annotations, which are then propagated. InterPro signatures, predictive models for the functional classification of protein sequences, and taxonomic constraints are the fundamental conditions but others are used, too. Annotation types used by UniRule allow the complete functional annotation of a protein sequence, including nomenclature, catalytic activity, Gene Ontology (GO) terms and sequence features such as transmembrane domains. Data provenance is documented using Evidence Ontology tags. A key feature of the UniRule curation tool is a statistical system which allows curators to evaluate their rules against the reviewed entries, to make sure rules are as accurate as possible. This quality control system also allows curators to re-evaluate and update old rules at every release, ensuring that the propagated annotation in the unreviewed entries is kept up to date. A dedicated space on the uniprot.org website has recently been created to allow users to view and explore UniRule. All aspects of the UniRule system and the latest developments will be explained at the conference. |
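A toy model of how a rule of this kind associates conditions with propagated annotations is sketched below. The rule content, entry and identifiers are invented for illustration and do not correspond to an actual UniRule record.

```python
# Toy sketch of a rule-based annotation step: if an entry satisfies a rule's
# conditions (an InterPro signature plus a taxonomic constraint), the rule's
# annotations are propagated. All identifiers and values are hypothetical.

rule = {
    "conditions": {"interpro": "IPR999999", "taxonomy": "Bacteria"},
    "annotations": {"protein_name": "Putative widget synthase",
                    "go_terms": ["GO:0003824"]},   # GO:0003824 = catalytic activity
}

entry = {
    "accession": "A0A000EXAMPLE",
    "interpro": {"IPR999999", "IPR888888"},
    "taxonomy_lineage": ["cellular organisms", "Bacteria", "Firmicutes"],
}

def apply_rule(rule, entry):
    conditions = rule["conditions"]
    if (conditions["interpro"] in entry["interpro"]
            and conditions["taxonomy"] in entry["taxonomy_lineage"]):
        return dict(rule["annotations"])
    return {}

print(apply_rule(rule, entry))
```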
Klemens Pichler, Ricardo Antunes, Mark Bingley, Emma Hatton-Ellis, Alistair MacDougall, Maria Martin, Diego Poggioli, Sangya Pundir, Alexandre Renaux, Vladimir Volynkin, Hermann Zellner, Cecilia Arighi, John S. Garavelli, Kati Laiho, C.R. Vinayaka, Qinghua Wang, Lai-Su Yeh, Delphine Baratin, Alan Bridge, Edouard de Castro, Ivo Pedruzzi, Nicole Redaschi, Catherine Rivoire, Claire O'Donovan and Uniprot Consortium | |
94 |
The HAVANA group at the Wellcome Trust Sanger Institute manually annotate the gene content of vertebrate
genomes. We annotate using the in-house Zmap viewer interface, which transfers models to the Otter
software for storage and processing and is publicly available. We aim to produce complete genesets for
human, mouse and zebrafish, working in a clone-by-clone or gene-targeted manner and annotating all
protein coding genes, non-coding RNAs and pseudogenes. We are also engaged in community annotation
projects chiefly for pig and rat, where collaborators request loci or areas of interest for HAVANA
annotation. HAVANA classify transcripts according to functional 'biotypes'. We have numerous coding and
non-coding biotypes, reflecting our confidence in the annotation of the sequences (known, putative,
nonsense mediated decay), as well as a sophisticated system of pseudogene classification. We have
incorporated next-generation datasets such as RNA-seq, CAGE and PolyA-seq data into our annotation workflow.
These datasets are particularly important for transcriptomes that lack coverage from traditional
datasets e.g. zebrafish. However, even our human geneset remains a work in progress, and we have
recently begun to use long-read PacBio RNA-seq data as well as the synthetic long reads produced by
Tilgner et al. These are used to identify additional alternatively spliced transcripts, to complete
existing partial models and even to find new loci. We are also using PhyloCSF to help identify
additional coding regions in human and mouse, especially those that may already have been missed during
our first-pass annotation. In addition, we are collaborating with the proteomics group at Sanger to
identify peptides produced from mass spectrometry that support the coding potential of transcripts
previously annotated as non-coding.
All of our data is publicly available from the VEGA website (www.vega.sanger.ac.uk). The GENCODE genes-sets for human and mouse can also be accessed from the UCSC and Ensembl genome browsers, as well as the GENCODE web portal (www.gencodegenes.org) |
Deepa Manthravadi, Ruth Bennett, Alexandra Bignell, Gloria Despacio-Reyes, Sarah Donaldson, Adam Frankish, James Gilbert, Michael Gray, Ed Griffiths, Gemma Guest, Matt Hardy, Toby Hunt, Mike Kay, Jane Loveland, Jonathan Mudge, Gaurab Mukherjee, Charles Steward, Marie-Marthe Suner, Mark Thomas and Jennifer Harrow | |
95 |
Orthology delineation is a cornerstone of comparative genomics that offers evolutionarily-informed
hypotheses on gene function by identifying 'equivalent' genes in different species. The OrthoDB catalog
of orthologs, www.orthodb.org [Kriventseva, et al. 2015], represents a comprehensive resource of
comparative genomics data to help researchers make the most of their newly-sequenced genomes. The rapid
accumulation of sequenced genomes means that such comparative approaches are becoming ever-more powerful
as tools to improve both genome-wide gene structural annotations and large-scale gene functional
inferences. Orthology delineation offers a solid foundation from which to begin to interpret
characteristic genome biology traits of a species or clade of species, highlighting shared and unique
genes that offer clues to understanding species diversity and providing the means to begin to
investigate key biological traits, for both large-scale evolutionary biology research and targeted gene
and gene family studies. The OrthoDB catalog collates available gene functional information from
UniProt, InterPro, GO, OMIM, model organism phenotypes and COG functional categories, as well as
providing evolutionary annotations including rates of ortholog sequence divergence, gene copy-number
profiles, homology-related sibling groups and gene architectures. These resources enable improved and
extended orthology-based gene functional inference in a comparative genomics framework that incorporates
the rapidly growing numbers of newly-sequenced genomes. Such approaches are well-established as
immensely valuable for gene discovery and characterization, helping to build resources to support
biological research. The success of such interpretative analyses relies on the comprehensiveness and
accuracy of the input data, making quality assessment an important part of the process of genome
sequencing, assembly, and annotation. OrthoDB's sets of Benchmarking Universal Single-Copy Orthologs,
BUSCO [Simão et al. 2015], provide a rich source of data to assess the quality and completeness of
these genome assemblies and their annotations. Orthology-based approaches therefore offer not only a
vital means by which to begin to interpret the increasing quantities of genomic data, but also to help
prioritize improvements, and to ensure that initial 'draft' genomes develop into high-quality resources
with evolutionarily-informed gene functional inferences that benefit the entire research community.
|
Robert Waterhouse, Felipe Simão, Panagiotis Ioannidis, Evgenia Kriventseva, Evgeny Zdobnov and Mirna Tenan | |
96 |
Proteins conserved widely among eukaryotes play fundamentally important roles in the shared, basic
mechanisms of life. The roles of many broadly conserved proteins remain unknown, however, despite almost
a century of genetic and biochemical investigation. Even the recent emergence of genome-wide techniques
and the availability of near-complete protein inventories for many intensively studied eukaryotic model
species have shed light on the functions of only a few previously uncharacterised conserved proteins.
Because the success of many endeavours in basic and translational research (drug discovery, metabolomics,
systems biology) depends critically on comprehensive representation of functions, a more complete
understanding of protein components conserved throughout eukaryotes would have far-reaching benefits for
biological research in many species.
To identify priority targets for experimental investigation, PomBase provides an inventory of fission yeast proteins that are conserved among eukaryotes but whose broad biological roles remain unknown. A broad functional classification of the known proteome using a selection of Gene Ontology biological process categories has revealed correlations with features such as subcellular localization and morphological phenotype. Combining available data from genome-wide phenotype and localization experiments with insights from the functional classification of known proteins facilitates prediction of biological roles, and thereby guides specific experimental characterisation of unknown proteins. |
Valerie Wood, Midori Harris and Antonia Lock | |
97 |
Computational methods for Gene Ontology (GO) annotation are essential to keep abreast of the unabated
growth of genomic data. In particular, phylogenetic methods provide a compelling framework to propagate
gene attributes across related sequences. The Orthologous Matrix (OMA) database propagates GO
annotations among orthologous relationships across ~2000 genomes, currently inferring ~80 million GO
annotations.
Here, two methods shall be presented. The first, currently implemented in OMA, propagates annotations within cliques of orthologous genes. The second propagates terms across Hierarchical Orthologous Groups (HOGs)—nested groups of genes that descend from single ancestral genes—which makes it possible to compare highly diverged and similar species in a consistent manner. Terms are propagated up the hierarchy towards the root, with the belief in them decaying exponentially over each step. At each node in the hierarchy, annotations from the child groups are mixed to decide the overall annotations at that level of the hierarchy. The merits and challenges of these and other functional annotation methods will be discussed, including complications associated with the open-world assumption. |
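The propagation scheme of the second method can be pictured with a toy hierarchy: beliefs decay by a fixed factor at each step, and child contributions are averaged at every internal node. The decay factor, tree and annotations below are hypothetical.

```python
# Toy sketch: propagating GO terms up a hierarchy of orthologous groups, with
# belief decaying exponentially at each step and child contributions mixed at
# each internal node. Decay factor, tree and annotations are hypothetical.
from collections import defaultdict

DECAY = 0.8  # hypothetical per-step decay of belief

# parent -> children of a tiny HOG hierarchy; leaf genes carry direct annotations
children = {"HOG:root": ["HOG:A", "HOG:B"],
            "HOG:A": ["gene1", "gene2"],
            "HOG:B": ["gene3"]}
leaf_annotations = {"gene1": {"GO:0005515": 1.0},
                    "gene2": {"GO:0005515": 1.0, "GO:0003677": 1.0},
                    "gene3": {"GO:0003677": 1.0}}

def propagate(node):
    if node in leaf_annotations:                       # leaf gene
        return dict(leaf_annotations[node])
    mixed = defaultdict(float)
    kids = children.get(node, [])
    for child in kids:
        for term, belief in propagate(child).items():
            mixed[term] += DECAY * belief / len(kids)  # decay, then average
    return dict(mixed)

print(propagate("HOG:root"))
```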
Alex Warwick Vesztrocy, Nives Škunca, Adrian Altenhoff and Christophe Dessimoz | |
98 |
SPARCLE, the SPecific ARChitecture Labeling Engine, is a curation interface developed for the Conserved
Domain Database team at the National Center for Biotechnology Information. SPARCLE is used to associate
conserved domain architectures with suggested protein names and brief functional descriptions or labels,
as well as corresponding evidence. Protein names and labels range from generic to very specific,
reflecting the status quo of the underlying protein and domain model collections. They may, however,
provide concise functional annotations that are easier to interpret than raw representations of domain
architecture.
This work was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine |
Aron Marchler-Bauer, Lianyi Han, Christopher Lanczycki, Jane He, Shennan Lu, Farideh Chitsaz, Myra Derbyshire, Noreen Gonzales, Marc Gwadz, Fu Lu, Gabriele Marchler, James Song, Narmada Thanki, Roxanne Yamashita, Chanjuan Zheng, Stephen Bryant and Lewis Geer | |
99 |
High-quality genome annotations are fundamental for the understanding of a species. Nowadays, automatic
annotation pipelines combining ab initio gene prediction algorithms with homology-based approaches for
functional annotation are standard practice. It is essential to review these electronically inferred
functional annotations. However, data management complications prevent existing pipelines from storing
the complete data provenance. This results in annotated genomic features with unknown origin. For
comparative genomics this information is essential as different algorithms and methodologies will yield
different results. To track annotation with the corresponding provenance and to allow for ease of
integration of different data sources we developed an extensible Semantic Annotation Platform for
Prokaryotes (SAPP).
Due to the rapidly increasing number of sequenced genomes, this platform has been designed to integrate large volumes of genomic data. The platform is modular and can be extended with new features. SAPP provides SPARQL support for querying and analysing computational results while aggregating heterogeneous data from alternative sources. Phenotypic characterisations were integrated through a collaborative effort fostered by WikiData, interconnecting multiple resources through semantic end-points. We performed an in-depth comparative analysis of nearly 4,000 publicly available bacterial genomes. We identified genotype-phenotype associations, pinpointing key features responsible for several bacterial phenotypic traits such as pathogenesis, cell wall characteristics and composition, and environmental requirements. Our results clearly show the potential of semantic technologies for performing large-scale comparative genomics. |
Jasper Koehorst, Jesse van Dam, Ruben van Heck, Edoardo Saccenti, Vitor A.P. Martins Dos Santos, Maria Suarez Diez and Peter Schaap | |
100 |
The identification of peptides and proteins in mass spectrometry (MS) based proteomics experiments
relies on searching protein sequence databases. The provision of up-to-date, stable and complete protein
sequence databases for a diversity of species is therefore of paramount importance.
UniProt provides a broad range of reference protein data sets for a large number of species, specifically tailored for effective coverage of sequence space while maintaining a high quality level of sequence annotations and mappings to genomics and proteomics information. With respect to publicly available bottom-up proteomics data, UniProt started providing mappings to its reference proteomes from release 2015_03, both within the protein entries and as files available on the ftp (ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomics_mapping/) and via the download section of the website (www.uniprot.org/downloads). The mappings are recalculated for every UniProt release, starting each time from a fresh data retrieval from the collaborating MS proteomics repositories, and they contain isoform-specific information which will soon also be displayed graphically on the website through a dedicated feature viewer interface. In addition, the collaborating MS proteomics repositories have been cross-referenced from within the UniProt data and website. Since then the mappings have been expanded both in terms of covered species and of collaborating MS proteomics repositories. Ongoing collaborations have been established to add other MS proteomics repositories as data providers for the mappings, also in order to further expand the range of covered species. Special cases of these collaborations are those aimed at global reprocessing of the content of PRIDE (the PRoteomics IDEntifications database) and those which will provide data with a specific focus on post-translational modification (PTM) related studies and datasets. Another very promising collaboration is the one with the Consortium for Top Down Proteomics (CTDP, http://www.topdownproteomics.org/), which has been cross-referenced from within the UniProt data and website since release 2016_03. The top-down proteomics data available through the CTDP repository are currently used by UniProt to develop a dedicated pipeline to annotate the corresponding UniProt entries and publicly provide the mappings on the ftp. CTDP data include isoform-specific and variant-specific information for whole proteoforms, including those bearing PTMs. |
Emanuele Alpi, Guoying Qi, Alan Da Silva, Benoit Bely, Jie Luo, Maria Martin and Uniprot Consortium | |
101 |
As genome sequencing technology develops towards significantly lower cost, shorter turnaround time and
higher throughput, genomic data have grown explosively in recent years. As required by journals and
funding agencies, sequencing data must be submitted to public databases for accessibility. Comprising
NCBI/SRA, EBI/SRA and DDBJ/DRA, the International Nucleotide Sequence Database Collaboration (INSDC) has
the capacity to store and publish genome data from all over the world, but network transfer speeds cannot
support moving such big data over long distances and between regions. To overcome these disadvantages, we
developed the Genome Sequence Archive (GSA; http://bigd.ac.cn/gsa) in China for archiving and sharing
genomic data and for accepting submissions from all over the world. Since China is a powerhouse in
generating big biological data, GSA can benefit users, especially those in China, by archiving sequencing
data and metadata and making them available in real time to the worldwide scientific community. GSA is
currently one of the database resources following the same international standards as INSDC and is located
in the BIG Data Center (BIGD; http://bigd.big.ac.cn) of the Beijing Institute of Genomics (BIG), Chinese
Academy of Sciences (CAS).
|
Gsa Project Consortium, Hongxing Lei, Xiangdong Fang and Wenming Zhao | |
102 |
The UniProt Knowledgebase (UniProtKB) endeavours to provide the scientific community with the most
comprehensive catalog possible of protein sequence and functional information. To achieve this, we have
put in place procedures that gather data accurately and consistently from source databases followed by
extensive annotation and cross-referencing to other resources. At the heart of each UniProtKB record is
a protein sequence – typically derived from the translation of a protein-coding gene on a
sequenced genome. As the cost of sequencing continues to fall, the number of organisms with completely
sequenced and annotated genomes is growing at an unprecedented rate. At UniProt we provide comprehensive
protein-centric views of such genomes through the Proteomes portal (http://www.uniprot.org/proteomes/).
The majority of currently available proteomes (45,162 proteomes, UniProt release 2016_01) are based on the translation of completely sequenced genomes submitted to the EMBL/GenBank/DDBJ databases of the International Nucleotide Sequence Database Collaboration (INSDC). Submitted genomes sometimes lack gene model predictions or have problems that prevent the generation of a non-redundant protein set. In the past, this has included important model organisms, such as Rattus norvegicus (Rat) and Zea mays (Maize). In other cases, previously sequenced genomes are reannotated by an expert community, for example Triticum aestivum (Wheat), and these datasets are not always available through the INSDC. UniProtKB overcomes these issues by generating these proteomes in collaboration with groups such as Ensembl and model organism databases (MODs). Over the last few years we have established complementary pipelines for import of protein sequences from three alternate sources: Ensembl (vertebrates), Ensembl Genomes (invertebrates) and more recently WormBase ParaSite (helminth genomes). Further analysis has revealed that extending this approach to other databases such as VectorBase, FlyBase and NCBI RefSeq would bring us closer to capturing the vast majority of sequenced genomes. Accurate identification and incorporation of new, publicly available, annotated genomes is a complex task and requires evaluation of many factors. Genome size and coverage, contig and scaffold N50 measures, and the availability of species-specific transcript and protein sequences are all used to evaluate candidate genomes for inclusion in UniProtKB. For particularly complex cases (such as parasite genomes), core genes and single-copy gene analysis measures are taken into account to verify completeness. At present all non-INSDC submissions are assessed by a curator and included manually. As of release 2016_01 (January 2016), UniProtKB contains the proteomes of 135 organisms from alternate sources. In addition to the import of new proteomes, maintenance and updating of existing proteomes to reflect improvements in genome assemblies and genebuild procedures is vital to the sustained growth of the proteomes project. This complex task is central to the functioning of UniProtKB and is overseen by a team of curators and programmers. We will present recent developments in this area and discuss ongoing work within the proteomes database aimed at maintaining the high standard and comprehensiveness of proteome data. |
Ramona Britto and UniProt Consortium | |
104 |
High-throughput sequencing technologies currently allow the production of amounts of data so vast that the
entire biomedical field is facing new challenges, often previously underestimated, such as: long term
storage and sharing of large datasets; capture and provision of accurate metadata; big data re-analyses
and integration in various contexts.
Datasets from 'big data' projects in the field of transcriptomics are increasingly available, such as the Genotype-Tissue Expression (GTEx) dataset. For the development of the gene expression database Bgee (http://bgee.org/), it is a necessity to be able to integrate such large datasets. The aim of the Bgee database is to provide a reference of normal gene expression in animals, comparable between species, currently comprising human and 16 other species. It was essential to integrate the GTEx data, composed of thousands of RNA-Seq libraries collected from more than 500 human subjects and 50 tissues. We will share here the lessons learnt from the integration of GTEx into Bgee, and the solutions adopted, related to: the curation and standardization of abundant metadata; the acquisition and storage of terabytes of raw data; the gene expression quantification steps, leveraging recent improvements in RNA-Seq analysis software and HPC experience; and the integration and comparison of these data with other datasets and other species. We have notably conducted a re-annotation effort, using the metadata provided under restricted access, that allowed us to refine the annotations of anatomical structures and to clarify some imprecisions; this also led us to request the inclusion of new terms into the Uberon anatomical ontology. Subjects and samples were also carefully filtered, in order to capture a high-quality dataset of 'normal' (healthy) gene expression in human that might be of interest to a large community of researchers. In order to process this dataset, we had to adopt new tools such as Aspera to acquire the data and Kallisto to map the RNA-seq reads in a scalable way. The results were then standardized and reduced to qualitative patterns of gene expression over anatomy, development and aging, to be integrated into Bgee. This approach allows us to produce a biologically meaningful summary of such a huge amount of data and to scale up for integrating future 'big data' projects in the area of transcriptomics. Bgee is available at http://bgee.org/. Our work of re-annotation of GTEx is available at https://github.com/BgeeDB/expression-annotations. |
Anne Niknejad, Amina Echchiki, Angelique Escoriza, Julien Roux, Sébastien Moretti, Marc Robinson-Rechavi and Frederic B. Bastian | |
105 |
The covalent attachment of ubiquitin to substrate proteins controls the stability, interactions,
activity and/or localization of much of the proteome. The canonical ubiquitination cascade proceeds by a
three-step process: activation of ubiquitin as a thioester by an E1 enzyme, transfer to an E2 enzyme as
a thioester intermediate, and conjugation to a lysine residue (or N-terminal amino group) on the
substrate or the extending polyubiquitin chain by an E3 enzyme. Conversely, the extent of substrate
ubiquitination is dynamically controlled by a host of deubiquitinating enzymes, which often act in
balanced concert with E3 enzymes. The fate of the ubiquitinated substrate is determined by interactions
with a host of ubiquitin binding domains, which can direct the substrate for degradation by the 26S
proteasome or alter substrate localization, interactions or activity.
The broad effects of the ubiquitin-proteasome system (UPS) on the proteome, and its connections to many disease states, have catapulted the UPS to the forefront of drug discovery in the pharmaceutical and biotechnology sectors. The therapeutic potential of the UPS has been largely underexplored due to insufficient understanding of the interdependence and redundancy between the different system components, the incomplete mapping of substrate-UPS system interactions and the largely unknown druggability of UPS enzymes. The goal of the UbiGRID curation project is to help fill this gap by comprehensively annotating the genetic and protein interactions of all UPS genes/proteins in humans, budding yeast and other model species. UbiGRID will serve as a centralized resource for three types of data: (i) an annotated reference set of UPS components organized into functional classes; (ii) comprehensive curation of genetic and protein interactions for all UPS genes; (iii) the annotation of ubiquitinated residues derived from mass spectrometry datasets. We have developed the most complete annotation of the core UPS machinery for human cells reported to date, which encompasses 1275 known and inferred system components. We recently completed the curation of 84,595 human protein interactions for these 1275 genes, as derived from 10,464 publications. Correspondingly, 31,886 yeast UPS protein interactions have been derived from 2,408 publications and 39,285 yeast genetic interactions from 2,420 publications. We have also captured 87,018 documented sites of ubiquitin modification for human proteins and 13,450 sites for yeast proteins. Collectively, this UPS interaction dataset should facilitate fundamental and applied discoveries in the UPS. |
Rose Oughtred, Bobby-Joe Breitkreutz, Lorrie Boucher, Christie Chang, Jennifer Rust, Nadine Kolas, Lara O'Donnell, Chris Stark, Kara Dolinski, Mike Tyers and Andrew Chatr-Aryamontri | |
106 |
Future energy demands and environmental challenges can be addressed by learning from biological
processes encoded in living organisms and microbial communities. Fungi are among the most powerful plant
pathogens and symbionts as well as biomass decomposers. The Fungal Genomics Program of the US Department
of Energy (DOE) Joint Genome Institute (JGI) is partnering with the international scientific community to
explore fungal diversity in several large-scale genomics initiatives.
One such initiative, the 1000 Fungal Genomes project, aims to explore diversity across the Fungal Tree of Life in order to understand fungal evolution, to build parts lists of genes, enzymes and pathways for biotechnological applications, and to provide references for environmental metagenomics. Its scale poses new challenges in data production, integration and analysis, and requires a unique balance between automated high-throughput analysis and manual curation techniques. This balance enables efficient integration of genomic and other omics data in the JGI fungal genomics resource MycoCosm (jgi.doe.gov/fungi), which currently contains over 600 fungal genomes and provides tools for comparative genomics and community-driven data curation. |
Igor Grigoriev | |
107 |
The Disease Portals, Disease-Gene
Annotation and the RGD Disease Ontology at the Rat Genome Database
The Rat Genome Database (RGD; http://rgd.mcw.edu/) provides critical datasets and software tools to a
diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose
research is disease oriented, RGD has created a series of Disease Portals and has prioritized its
curation efforts on the datasets important to understanding the mechanisms of various diseases.
Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers,
genetic associations, molecular mechanisms, and therapeutic targets. To generate gene-disease
annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary
from the Comparative Toxicogenomics Database and adapted it by expanding this framework with over 900
additional terms to create the RGD Disease Ontology (RDO). The RDO currently provides the foundation
for ten comprehensive disease-area dataset and analysis platforms at RGD, the
Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each
year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more
robust disease ontology are underway.
|
G. Thomas Hayman, Stanley J. F. Laulederkind, Jennifer R. Smith, Shur-Jen Wang, Victoria Petri, Rajni Nigam, Marek Tutaj, Jeff De Pons, Melinda R. Dwinell and Mary Shimoyama | |
108 |
Studying reciprocal regulations between cancer-related pathways is essential for understanding signaling
rewiring during cancer evolution and in response to treatments. With this aim we have constructed the
Atlas of Cancer Signaling Network (ACSN), a resource of cancer signaling maps and tools with an
interactive web-based environment for navigation, curation and data visualization. The content of ACSN is
represented as a seamless 'geographic-like' map browsable using the Google Maps engine and semantic
zooming. The associated blog provides a forum for commenting and curating the ACSN maps content. The
atlas contains multiple crosstalk and regulatory circuits between molecular processes implicated in
cancer (Kuperstein et al, 2015). The integrated NaviCell web-based toolbox allows users to import and
visualize heterogeneous omics data on top of the ACSN maps and to perform functional analysis of the
maps. The NaviCell toolbox is also suitable for computing aggregated values for sample groups and
protein families and for mapping these data onto the maps. The tool contains standard heatmaps, barplots
and glyphs, as well as a novel map staining technique for grasping large-scale trends in numerical values
projected onto a pathway map. The NaviCell web service provides a server mode that allows users to
automate visualization tasks and to retrieve data from maps via RESTful (standard HTTP) calls. Bindings
to several programming languages, including Python, R and Java, are also available (Bonnet et al, 2015). To
demonstrate applications of ACSN and NaviCell we show a study on drug sensitivity prediction using the
networks. We performed a structural analysis of Cell Cycle and DNA repair signaling network together
with omics data from ovary cancer patients resistant to genotoxic treatment. Following this study we
retrieved synthetic lethal gene sets and suggested intervention gene combinations to restore sensitivity
to the treatment. In an additional study we show how the epithelial-to-mesenchymal transition (EMT)
signaling network from the ACSN collection has been used to find metastasis inducers in colon cancer
through network analysis. We performed a structural analysis of the EMT signaling network that
highlighted the network organization principles and allowed complexity reduction down to the core
regulatory routes. Using the reduced network we modeled single and double mutants to achieve the
metastasis phenotype. We predicted that a combination of p53 knock-out and overexpression of Notch would
induce metastasis, and suggested the molecular mechanism. This prediction led to the generation of a
colon cancer mouse model with metastases in distant organs. We confirmed in invasive human colon cancer
samples that Notch and p53 gene expression was modulated in a similar manner to the mouse model,
supporting a synergy between these genes in permitting metastasis induction in the colon (Chanrion et al, 2014).
Kuperstein I et al, Atlas of Cancer Signaling Network: navigating cancer biology with Google Maps. Oncogenesis, doi: 10.1016/j.bbrc.2015.06.094 (2015); Bonnet E et al, NaviCell Web Service for network-based data visualization. Nucleic Acids Res, doi: 10.1093/nar/gkv450 (2015); Chanrion et al, Notch activation and p53 deletion induce EMT-like processes and metastasis in a novel mouse model of intestinal cancer. Nature Communications 5:5005, doi: 10.1038/ncomms6005 (2014) |
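To make the server-mode idea above concrete, here is a minimal sketch of driving a map-visualization service through plain RESTful HTTP calls from Python. The endpoint URL, command name and parameters are placeholders, not NaviCell's documented API; the published Python/R/Java bindings (Bonnet et al, 2015) provide a higher-level interface for the same purpose.

```python
# Illustrative only: generic RESTful calls against a hypothetical server mode.
import requests

SERVER_URL = "https://example.org/navicell/serverside"   # placeholder endpoint

def send_command(session_id: str, command: str, **params) -> dict:
    """POST one command to the hypothetical server and return its JSON reply."""
    payload = {"session": session_id, "command": command, **params}
    response = requests.post(SERVER_URL, data=payload, timeout=30)
    response.raise_for_status()
    return response.json()

# Example (illustrative only): display an expression heatmap on a loaded map.
# reply = send_command("my-session", "show_heatmap", datatable="expression_matrix")
```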
Inna Kuperstein, Eric Bonnet, Christophe Russo, Hien-Anh Nguyen, David Cohen, Laurence Calzone, Maria Kondratova, Eric Viara, Marie Dutreix, Emmanuel Barillot and Andrei Zinovyev | |
109 |
Identification and analysis of host-pathogen interactions (HPI) data has a huge impact on disease
treatment, management and prevention. HPIDB 2.0 (http://www.agbase.msstate.edu/hpi/main.html) provides a
unified query interface for HPI information, and contains 43,276 manually curated entries in the current
release. Since the first HPIDB release in 2010, multiple enhancements to HPIDB data and interface
services were made to facilitate both the identification and functional analysis of HPI data. Notably,
HPIDB 2.0 now provides targeted biocuration of HPI data. Annotations provided by HPIDB curators meet
International Molecular Exchange consortium standards to provide detailed contextual experimental
information and facilitate data sharing. In addition to curation, HPIDB 2.0 integrates HPI from existing
external sources and contains tools to infer additional HPI where annotated data is scarce. Our data
collection approach ensures that HPIDB 2.0 users can access comprehensive HPI data from a wide range of
pathogens and their hosts (567 pathogen and 68 host species, as of December 2015) and avoid the
time-consuming series of steps required to integrate, standardize, and annotate HPI data. The data
updates are accompanied by an enhanced web interface that allows users to search, visualize, analyze and download
HPI data. Perhaps most noticeably for our users, we have expanded the HPIDB 2.0 results to display
additional interaction information, associated host and/or pathogen Gene Ontology functions and network
visualization. All HPIDB 2.0 data are updated regularly, are publicly available for direct download, and
are disseminated to other molecular interaction (MI) resources. Our future goals for HPIDB include
broadening the number of pathogens for which experimentally derived, manually curated HPI data are
available and enabling the end user to evaluate the quality of transferred homologous HPI for improved computational HPI prediction.
|
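The inference of additional HPI mentioned above is commonly done by homology (interolog-style) transfer. The sketch below is a generic outline of that idea under assumed inputs (a curated interaction list and an ortholog mapping); it is not the actual HPIDB inference pipeline.

```python
# Generic interolog-style transfer: propose candidate host-pathogen
# interactions for a poorly annotated pathogen from curated interactions of a
# related species, given an ortholog mapping. Illustrative, not HPIDB's code.
from typing import Dict, List, Tuple

def transfer_interactions(
    curated_hpi: List[Tuple[str, str]],   # (host_protein, pathogen_protein) pairs
    pathogen_orthologs: Dict[str, str],   # curated pathogen protein -> ortholog in target pathogen
) -> List[Tuple[str, str]]:
    """Return candidate interactions where the pathogen partner has an ortholog."""
    inferred = []
    for host_protein, pathogen_protein in curated_hpi:
        ortholog = pathogen_orthologs.get(pathogen_protein)
        if ortholog is not None:
            inferred.append((host_protein, ortholog))
    return inferred

# Example with made-up identifiers:
# transfer_interactions([("HostP1", "PathogenA_gene1")], {"PathogenA_gene1": "PathogenB_gene7"})
```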
Mais Ammari, Cathy Gresham, Fiona McCarthy and Bindu Nanduri | |
110 |
UniProtKB/Swiss-Prot (http://www.uniprot.org) provides the scientific community with a collection of
information, expertly curated from the scientific literature, on protein variants. Priority is given to
single amino-acid polymorphisms (SAPs) found in human proteins, their functional consequences and
association with diseases. UniProt release 2016_01 includes over 74,000 human SAPs, 38,515 of which are
enriched by annotations in free-text describing involvement in disease and functional characteristics of
the variant. To ease access to this knowledge and to make it computer-readable, we are restructuring
these annotations using controlled vocabularies. By combining terms from the Variation Ontology (VariO) and
Gene Ontology (GO), we can describe the large spectrum of effects caused by SAPs on proteins. We use
VariO terms to indicate which protein property is affected, such as its structure, expression,
processing, and function. GO terms are used to specify which biological process, protein function, or
subcellular location is impacted. A limited number of controlled attributes complete the annotations,
defining how the protein property is affected, e.g. increased, decreased or missing.
Currently, most SAPs with at least one annotation have been reviewed, producing close to 7,000 structured functional annotations. We plan to provide this new structured format to our users as soon as possible. |
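The structured annotation model described above combines three controlled elements. The following sketch shows one possible way to represent it; the record layout and the example variant are illustrative, not the actual UniProt format, and the VariO term is given by label only.

```python
# Illustrative sketch of a structured variant annotation: a VariO term for the
# affected protein property, a GO term for the impacted aspect, and a
# controlled attribute stating the direction of the effect.
from dataclasses import dataclass

@dataclass
class StructuredVariantAnnotation:
    variant: str        # e.g. "p.Arg175His" (hypothetical example)
    vario_term: str     # affected protein property (VariO label; ID omitted in this sketch)
    go_term: str        # impacted process/function/location
    attribute: str      # "increased", "decreased" or "missing"

example = StructuredVariantAnnotation(
    variant="p.Arg175His",
    vario_term="effect on protein function",
    go_term="GO:0003700 (DNA-binding transcription factor activity)",
    attribute="decreased",
)
```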
Maria Livia Famiglietti, Lionel Breuza, Teresa Neto, Sebastien Gehant, Nicole Redaschi, Sylvain Poux, Lydie Bougueleret, Ioannis Xenarios and Uniprot Consortium | |
111 |
The IEDB has been describing immune epitope experiments, as presented in the scientific literature, for
more than 10 years and has accumulated a significant dataset, representing what is currently known in
the field of immune epitopes. The goal of this project was to accurately model the disease states
presented in this literature in a logical and consistent manner. To achieve this goal, we reviewed all
disease states described in our literature set, determining the method of exposure (e.g. natural
infection), type of disease (e.g. allergy), site of disease (e.g. respiratory tract) and the immunogen
(e.g. Plasmodium falciparum) in order to establish logical rules to define disease states. This work
generated clear logical definitions for disease states and enabled enforceable validation rules, as well
as the identification of errors within our dataset. In order to share the results of this work with overlapping
domains and to make the IEDB more interoperable with other resources, the resulting disease states and
their definitions were submitted to the Human Disease Ontology (DO). DO is the well-established
standardized ontology for human disease. Through collaboration between the IEDB and DO, these diseases
were logically modeled within the DO. Thus, the IEDB can now incorporate a subset of DO as an OWL file
to create a searchable disease tree for our curators and end users to easily view disease information in
a hierarchical tree format. Additionally, the reasoned ontology provides validation rules that are being
incorporated into the IEDB's curation interface to improve accuracy of curated data. Going forward, as
new diseases are encountered in the literature, the process will be repeated. Our hope is that other
resources will increasingly utilize these same diseases within DO and the richness of our data will
grow, identifying overlapping datasets and allowing new scientific conclusions.
|
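The logical definitions above enable mechanical validation of curated disease states. The sketch below illustrates the style of rule described (exposure method, disease type, site and immunogen must be mutually consistent); the field names and the specific rules are made up for illustration and are not the IEDB's actual curation-interface checks.

```python
# Illustrative validation rules over a disease-state record.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DiseaseStateRecord:
    exposure_process: str           # e.g. "natural infection", "allergen exposure"
    disease_type: str               # e.g. "infectious disease", "allergy"
    disease_site: Optional[str]     # e.g. "respiratory tract"
    immunogen_taxon: Optional[str]  # e.g. "Plasmodium falciparum"

def validate(record: DiseaseStateRecord) -> List[str]:
    """Return human-readable violations of these illustrative logical rules."""
    errors = []
    if record.exposure_process == "natural infection" and not record.immunogen_taxon:
        errors.append("natural infection requires an infectious agent as immunogen")
    if record.disease_type == "allergy" and record.exposure_process == "natural infection":
        errors.append("allergy should not be paired with natural infection exposure")
    return errors
```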
Randi Vita, Elvira Mitraka, James A. Overton, Lynn M. Schriml and Bjoern Peters | |
112 |
The Cancer Gene Census is an ongoing effort to catalogue those genes for which somatic mutations have
been causally implicated in cancer. Originally published in 2004 (Futreal et al, 2004), the Census has
been continued and maintained by COSMIC, the Catalogue Of Somatic Mutations In Cancer, which is the
world's largest and most comprehensive resource to explore the impact of somatic mutations in human
cancer. Currently, 3% of all human genes are implicated in cancer development via somatic mutations or
gene fusions. Out of 571 Census genes, missense mutations have been reported in 539, nonsense mutations
in 366 and inactivating frameshift mutations in 331 genes. Various other types of mutations were found
in 318 Census genes, and 353 genes were involved in gene fusions. For a gene to be included in the
Census, at least two independent publications need to exist showing mutations in this gene in primary
patient material, and these should have evidence of the somatic origin of at least a subset of mutations
based on analysis of normal tissue from the same individuals. As germline fusions are relatively
uncommon, cancer genes involved in fusions may be included without definite evidence of somatic origin.
Also, single reports of novel fusions in rare tumours are included. Further inclusion and exclusion
rules are applied when considering new genes for inclusion. The Census is updated continuously with new
genes and related information including the tumour types in which mutations are found, classes of
mutation that contribute to oncogenesis, molecular mechanism of the gene in cancer and other genetic
properties. The Census is a manually curated summary of the most relevant information on cancer-driving
genes and their somatic mutations gathered in the COSMIC database, and brings together the expertise of
a dedicated curation team, in-house and external cancer scientists, and a wide user community. Since 2015
it has also become a part of the Centre for Therapeutic Target Validation (CTTV) platform
(https://www.targetvalidation.org) which brings together information on the relationships between
potential drug targets and diseases using evidence from multiple data types that can be relevant to
target identification and prioritisation. The Census is available from the COSMIC website for online use
or download at http://cancer.sanger.ac.uk/census.
|
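The headline inclusion criteria described above can be read as a simple decision rule. The sketch below expresses that rule; the field names are illustrative and do not correspond to COSMIC's internal schema, and curators apply further inclusion and exclusion rules beyond this check.

```python
# Sketch of the Census inclusion criteria as a simple check (illustrative).
from dataclasses import dataclass

@dataclass
class GeneEvidence:
    independent_publications: int     # reports of mutations in primary patient material
    somatic_origin_shown: bool        # matched-normal evidence for at least some mutations
    reported_as_fusion_partner: bool  # gene implicated through a gene fusion

def meets_census_criteria(ev: GeneEvidence) -> bool:
    """Apply the headline rules; additional curator rules are not modelled here."""
    if ev.reported_as_fusion_partner:
        # Fusion genes may be included without definite somatic-origin evidence,
        # and single reports of novel fusions in rare tumours are accepted.
        return True
    return ev.independent_publications >= 2 and ev.somatic_origin_shown
```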
Sari Ward, Zbyslaw Sondka, Sally Bamford, Charlotte G. Cole, David M. Beare, Nidhi Bindal, Tisham De, Simon A. Forbes, John Gamble, Mingming Jia, Chai Yin Kok, Kenric Leung and Peter J. Campbell | |
113 |
Phenotypes and diseases are emergent properties of whole organisms. At Mouse Genome Informatics (MGI,
www.informatics.jax.org), we curate models of human disease to the mice used in experiments,
specifically using the key allele pairs and strain backgrounds that define the full genetic makeup of
the mice. In order to derive gene-to-phenotype and gene-to-human disease model relationships from
annotated mouse models, we need to identify models that contain mutations in only a single causative
gene. These derived gene annotations can then be used to provide users with a high-level summary of gene
function and be used in candidate disease gene analysis. Filtering the various kinds of genotypes to
determine which phenotypes are caused by a mutation in a particular gene can be a laborious and
time-consuming process. At MGI we have developed a gene annotation derivation algorithm that computes
gene-to-phenotype and gene-to-disease annotations from our existing corpus of annotations to genotypes
(allele pairs and strain background). This algorithm differentiates between simple genotypes with
causative mutations in a single gene and more complex genotypes where mutations in multiple genes may
contribute to the phenotype. The process identifies alleles functioning as tools (e.g., reporters,
recombinases) and filters these out. Several improvements in allele descriptions in MGI have been used
to refine the accuracy of this algorithm. These include 1) creation of allele attributes that are used
to identify tools, 2) introduction of relationships between alleles and the genes expressed by the
allele, 3) introduction of relationships between multi-genic alleles and all genes in the mutation
region. Using this algorithm, derived gene-to-phenotype and gene-to-disease annotations were created for
16,000 and 2,100 mouse markers, respectively, starting from over 57,900 and 4,800 genotypes with at
least one phenotype and disease annotation, respectively. Implementation of this algorithm provides
consistent and accurate gene annotations across MGI and yields vital time savings relative to manual
annotation by curators. Supported by NIH grant HG000330.
|
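The derivation logic described above (filter out tool alleles, keep only genotypes whose causative mutations involve a single gene, then roll annotations up to that gene) can be sketched as follows. The data structures are illustrative, not MGI's actual schema.

```python
# Minimal sketch of gene-annotation derivation from genotype annotations.
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Allele:
    gene: str
    attributes: Set[str]    # e.g. {"reporter"}, {"recombinase"}, or set()

@dataclass
class Genotype:
    alleles: List[Allele]
    annotations: List[str]  # phenotype or disease term IDs

TOOL_ATTRIBUTES = {"reporter", "recombinase"}

def derive_gene_annotations(genotypes: List[Genotype]) -> Dict[str, Set[str]]:
    """Roll genotype annotations up to genes for single-causative-gene genotypes."""
    derived: Dict[str, Set[str]] = {}
    for genotype in genotypes:
        # Drop alleles that function only as tools, then count remaining genes.
        causative = [a for a in genotype.alleles if not (a.attributes & TOOL_ATTRIBUTES)]
        genes = {a.gene for a in causative}
        if len(genes) == 1:                       # simple genotype: single causative gene
            gene = genes.pop()
            derived.setdefault(gene, set()).update(genotype.annotations)
    return derived
```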
Susan Bello, Janan Eppig and The Mgi Software Group | |
114 |
The Deciphering the Mechanisms of Developmental Disorders (DMDD) consortium is a research programme
characterising mouse lines carrying a targeted mutation that show embryonic and perinatal lethality when
the mutation is homozygous. One of the goals of the project is to identify lines useful to developmental
biologists and clinicians as animal models for investigating the basis of human developmental disorders.
Approximately a third of all mouse strains that carry a null mutation show homozygous recessive
embryonic or perinatal lethality, and among this group at least 60% show structural defects in one or
more organ systems that can be identified in histological sections by conventional microscopy.
The DMDD project studies embryos that survive to at least mid gestation using a combination of comprehensive high resolution episcopic microscopy (HREM) for 3D imaging, and tissue histology to identify abnormalities in developing organ and placental structures. The images we collect are screened systematically for morphological defects by a team of developmental biologists and anatomists. The mutant phenotypes observed are recorded by using terms from the Mammalian Phenotype Ontology, or our own controlled vocabulary that enables us to document phenotypes in a systematic fashion prior to representation of the phenotype in the ontology. For 3D image datasets we capture the location at which the phenotype was observed through a plugin we developed for the open source image processing and visualisation software package Osirix. This plugin allows curators to export, import and merge sets of ontology terms and comments associated with points in 3D space, which makes it possible to carry out phenotype annotation at several sites. The image data and the phenotypes we have scored are available through the project website (http://dmdd.org.uk). The search function of the website enables end users to navigate directly to the 3D location within the image data that is the basis of the curated phenotype statement, and the stackviewer interface that displays the image also allows this section to be compared to similar ones from other embryos. |
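The annotation records handled by the plugin described above pair an ontology term (or controlled-vocabulary phrase) with a 3D location, and sets of such records made at different sites can be merged. The sketch below illustrates that idea; the JSON layout and identifiers are placeholders, not the plugin's actual export format.

```python
# Illustrative merge of 3D phenotype annotations exported from several sites.
import json
from typing import Dict, List

def merge_annotation_sets(sets: List[List[Dict]]) -> List[Dict]:
    """Combine annotation lists from several curators, dropping exact duplicates."""
    seen, merged = set(), []
    for annotation_set in sets:
        for ann in annotation_set:
            key = (ann["term_id"], tuple(ann["xyz"]), ann.get("comment", ""))
            if key not in seen:
                seen.add(key)
                merged.append(ann)
    return merged

# Placeholder identifiers and coordinates for illustration only.
site_a = [{"term_id": "MP:0000000", "label": "example phenotype term",
           "xyz": [412, 230, 87], "comment": "scored in HREM stack"}]
site_b = [{"term_id": "MP:0000000", "label": "example phenotype term",
           "xyz": [412, 230, 87], "comment": "scored in HREM stack"}]
print(json.dumps(merge_annotation_sets([site_a, site_b]), indent=2))
```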
Robert Wilson, Julia Rose, Stefan Geyer, Lukas Reissig, Dorota Szumska, Andrew Cook, Wolfgang Weninger, Cecilia Mazzeo, Jacqui White, Fabrice Prin and Tim Mohun | |
115 |
FlyBase uses an 'allele class' controlled vocabulary to type classical mutant alleles according to their
function, recording for example whether they are 'hypomorphic' or 'amorphic' loss of function alleles,
or 'neomorphic' or 'antimorphic' gain of function alleles. Using this controlled vocabulary allows users
to easily search for particular types of classical mutant alleles. However, we currently have no
equivalent controlled vocabulary for alleles that represent transgenic constructs introduced into flies,
with information describing the nature of the transgenic construct only being captured as free text. We
describe our initial attempts at formulating a 'class' controlled vocabulary for transgenic construct
alleles and assess how this fits onto our existing set of alleles. We also discuss how we hope to use
this new controlled vocabulary to improve summarisation of phenotype and genetic interaction data on the
FlyBase website. Finally we examine how we can use this new controlled vocabulary to computationally
derive new types of information from our existing curation of phenotype and genetic interaction data,
for example to derive functional complementation statements.
|
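As one hedged interpretation of how such a controlled vocabulary could support computational derivation of functional complementation statements (this is a sketch of the general idea, not FlyBase's actual procedure): if a genotype combines an amorphic classical allele of one gene with a transgenic-construct allele recorded as expressing another gene, and the curated phenotype is rescued, a candidate complementation statement can be proposed.

```python
# Hedged sketch: deriving a candidate functional complementation statement
# from allele-class and expressed-gene information. Illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlleleRecord:
    symbol: str
    allele_class: str              # e.g. "amorphic", or a transgenic-construct class term
    expressed_gene: Optional[str]  # gene expressed by a transgenic construct, if any

def propose_complementation(classical: AlleleRecord, construct: AlleleRecord,
                            phenotype_rescued: bool) -> Optional[str]:
    if (classical.allele_class == "amorphic"
            and construct.expressed_gene is not None
            and phenotype_rescued):
        return f"{construct.expressed_gene} functionally complements {classical.symbol}"
    return None
```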
Gillian Millburn | |
116 |
The use of model organisms to study the mechanisms of human disease is growing rapidly. Concomitant with
this growth is the need for a disease ontology to facilitate comparisons of research findings and
disease profiles between human and model organisms and to aid in identifying the underlying genetic,
genomic and physiological mechanisms of disease. The Mouse Genome Database (MGD,
http://www.informatics.jax.org) and the Rat Genome Database (RGD, http://rgd.mcw.edu) are teaming up
with the Disease Ontology (DO, http://www.disease-ontology.org) project to harmonize disease annotation
through collaborative review of MGD (OMIM), RGD (MEDIC) and DO disease terms and to update and enhance
the structure and content of the DO to improve its capacity to support cross-organism representation of
disease. Our major goal is to foster the adoption of a shared, robust DO for MGD and RGD through the
enhancement of DO to support MGD and RGD disease annotations. As an added benefit, DO will become more
comprehensive and useful to other projects annotating various types of data generated from a wide
variety of experimental and clinical investigations. The ability to consistently represent disease
associations across data types, from the cellular level to the whole organism, generated from clinical
and model organism studies, will facilitate data integration, data mining and comparative analysis. The
progressive enrichment of the DO and successful adoption of DO for disease annotation by MGD and RGD
will demonstrate its potential use across organisms and will encourage other groups to look at DO as a
standard for their disease annotation needs. In addition, use of DO will greatly increase the potential
for interoperability between MGD and RGD systems at the disease annotation level and provide the human
genetics/genomics community with a consistent way to query for rodent disease associations.
Disease Ontology Database URL: http://www.disease-ontology.org; Mouse Genome Database URL: http://www.informatics.jax.org; Rat Genome Database URL: http://rgd.mcw.edu |
Lynn Schriml, Mary Shimoyama, Elvira Mitraka, Susan Bello, Stanley Laulederkind, Cynthia Smith and Janan Eppig | |
117 |
Many diseases present with distinct phenotypes, making descriptions of phenotypes valuable for
identifying and diagnosing human diseases. The Human Phenotype Ontology (HPO) was developed to provide a
structured vocabulary containing textual and logical descriptions of human phenotypes. The HPO is used
for phenotype-genotype alignment in systems like the Monarch Initiative to provide disorder prediction,
variant prioritization, and patient matching between known diseases and model organisms. Here we
describe recent work to extend the utility of the HPO through the systematic addition of approximately
6,000 synonyms.
Until now, most of the HPO synonyms were composed of clinical terms unfamiliar to patients. For example, a patient may know they are 'color-blind', but may not be familiar with the official phenotype term 'Dyschromatopsia'. Therefore, our goal is to add synonyms in 'layperson-ese' so that HPO can be used by patients as well as basic research scientists and clinicians to help improve disease characterization and diagnosis. We systematically reviewed current HPO classes (approximately 12,000) and assigned layperson synonyms to each class where applicable. The layperson synonyms refer to colloquial terms used to describe phenotypic features associated with medical conditions. Each layperson synonym was annotated to indicate its special status, then classified as either exact (precise); broad (more general); narrow (more specific); or related (associated). The review process included various methods of identifying and validating possible layperson synonyms. We first queried the HPO to avoid duplicate terms. We then batched similar kinds of terms together, such as those related to bone abnormalities, to maintain consistent synonym terminology. For example, the phenotypes of the femur were assigned the layperson synonym 'of thigh bone' and morphological abnormalities were described as 'abnormal shape of '. We consulted online resources (e.g., Wikipedia, Mayo Clinic) as well as specialized resources (e.g., Uberon, Gene Ontology) to find additional synonyms. As a quality control measure, we reviewed each other's work, consulted with clinical experts when necessary, and queried Google for the assigned layperson term to verify that it retrieved the appropriate medical term and was in use. Some challenges of assigning layperson synonyms involved reconciling lay terms with the logic and structure of the HPO and determining the best mechanism to validate the lay synonyms. Additionally, not every term has a lay synonym, or the lay term may already exist in the HPO, as with 'widow's peak' or 'hitchhiker's thumb'. Finally, some terms have complicated medical terminology, like 'short distal phalanx of first finger', for which a single layperson term is difficult to establish without using the definition of the term. The addition of layperson synonyms increases the usability of the HPO, making it useful for data interoperability across clinicians and patients. Additionally, this work will enable crowdsourcing by citizen scientists. The layperson synonyms will be available as a modular import file for the HPO and are due to be released in the Spring of 2016. |
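The synonym-assignment step described above (check for duplicates, then attach a scoped synonym marked as layperson) can be sketched as follows. The data structures and the placeholder class ID are illustrative, not the actual HPO editing workflow or file format.

```python
# Illustrative sketch of adding a scoped layperson synonym to an ontology class.
from typing import Dict, List, Set

def add_layperson_synonym(
    ontology: Dict[str, Dict],   # class ID -> {"label": str, "synonyms": List[Dict]}
    class_id: str,
    synonym: str,
    scope: str,                  # "exact", "broad", "narrow" or "related"
) -> bool:
    # Collect every existing label and synonym to avoid introducing duplicates.
    existing: Set[str] = set()
    for cls in ontology.values():
        existing.add(cls["label"].lower())
        existing.update(s["text"].lower() for s in cls["synonyms"])
    if synonym.lower() in existing:
        return False             # already present somewhere in the ontology
    ontology[class_id]["synonyms"].append(
        {"text": synonym, "scope": scope, "type": "layperson"}
    )
    return True

# Placeholder class ID for illustration only.
hpo = {"HP:0000000": {"label": "Dyschromatopsia", "synonyms": []}}
add_layperson_synonym(hpo, "HP:0000000", "Color blindness", "exact")
```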
Nicole Vasilevsky, Mark Engelstad, Erin Foster, Sebastian Kohler, Chris Mungall, Peter Robinson and Melissa Haendel | |
118 |
At its core, UniProt (the Universal Protein Resource) is a collection of protein sequences and
protein-related information. One of the main components of UniProt is the UniProt Knowledgebase. The
goal of UniProtKB is to organise and annotate information about protein function and sequence, providing
a comprehensive overview of available information. Annotation efforts are both automatic and manual.
Information is computationally added to uncharacterised sequences in the unreviewed TrEMBL section of
UniProtKB while experimentally characterised proteins undergo a process of manual expert curation before
entering the reviewed Swiss-Prot section. Manual curation consists of a critical review of experimental
and predicted data for each protein as well as manual verification of each protein sequence. The range
of captured information covers protein function, interactions, catalytic activity, patterns of
expression and disease associations, to name just a few. All curation efforts are presented to the
scientific community in the form of a comprehensive, high-quality and freely accessible resource.
Launched to the public in December 2015, the Centre for Therapeutic Target Validation (CTTV) platform is a pioneering public-private research initiative between GlaxoSmithKline, EMBL-EBI and the Wellcome Trust Sanger Institute. CTTV aims to make use of recent progress in genome sequencing and the potential of 'big data' to improve the success rate for new drug discovery. UniProt is an integral part of this platform and supplies valuable information about target function to the protein profile section of the website. It also provides a graphical overview of protein features in the form of an interactive viewer. Data integration across multiple resources contributing to CTTV content and the creation of the map of complex relationships between diseases are possible as a result of mapping diseases to terms in the Experimental Factor Ontology (EFO). UniProt curators have contributed to the creation of the disease associations by manually mapping over a thousand rare diseases that could not be mapped automatically. The mapping is an ongoing process and UniProt curators continue to map novel diseases in UniProt to common disease phenotype terms in the EFO. CTTV has many current areas of interest ranging from cancer to auto-immune diseases. UniProt curators are currently involved in updating target entries for proteins associated with two inflammatory bowel disease (IBD) conditions, Crohn's disease and ulcerative colitis. Having a collection of the most up-to-date, cutting-edge experimental data presented in a well-organised target- or disease-centric manner will contribute to the effort to decipher disease-causing factors and help in the search for new treatments. |
Barbara Palka, Daniel Gonzalez, Edd Turner, Xavier Watkins, Maria Martin and Claire O'Donovan | |
119 |
Characterization of the phenotypic effects of mutations provides evidence with which variants of unknown
significance (VUS) can be more precisely evaluated. In this context we have annotated the phenotypes
caused by mutations in BRCA1, BRCA2, PALB2, MLH1, MSH2, MSH6, PMS2, APC and MUTYH, genes associated with
increased susceptibility to the most prevalent hereditary breast and colorectal cancers.
Using the information derived from publications, the functional impact of variants was captured, resulting in thousands of different annotations. Annotations are organized in triplets, consistent with the RDF model. The annotations are composed of terms from ontologies and controlled vocabularies to ensure consistency in descriptions and support computational analysis. Each annotation is supported by detailed experimental evidence. Well-characterized and assessed functions of the target proteins are captured. For instance, for BRCA1 and BRCA2 this includes their ubiquitin-protein ligase and transcriptional regulation activities and their role in DNA repair in response to DNA damage. Important in terms of phenotypic outcome are direct interactions with UBE2D1, BARD1 and BRIP1. For genes involved in DNA mismatch repair, this includes their overall mismatch repair ability, mismatch complex formation, as well as the individual steps of mismatch binding, ATPase activity, ADP/ATP exchange, etc. These annotations provide a comprehensive resource on the phenotypic outcome of cancer-predisposing genes at a level of detail and structure superior to what is found in other databases. |
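Since the annotations are organized as triples consistent with the RDF model, one such annotation could be expressed as shown below using rdflib. The namespace, predicate names and identifiers are placeholders chosen for illustration; they are not the project's actual data model.

```python
# Illustrative RDF triples for one variant-phenotype annotation with evidence.
from rdflib import Graph, Literal, Namespace

EX = Namespace("https://example.org/variant-annotation/")  # placeholder namespace
g = Graph()

variant = EX["BRCA1_p.Cys61Gly"]      # hypothetical variant node
annotation = EX["annotation/1"]       # hypothetical annotation node

g.add((annotation, EX.describesVariant, variant))
g.add((annotation, EX.affectedFunction, Literal("ubiquitin-protein ligase activity")))
g.add((annotation, EX.effect, Literal("decreased")))
g.add((annotation, EX.evidence, Literal("in vitro ubiquitination assay, PMID omitted")))

print(g.serialize(format="turtle"))
```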
Isabelle Cusin, Monique Zahn, Valerie Hinard, Pierre Hutter, Amos Bairoch and Pascale Gaudet |