BEL Curation Procedures and Guidelines¶
This folder contains the BEL curation procedures and guidelines developed and used during the Human Brain Pharmacome project.
Style Guide for BEL¶
This document describes style guidelines for BEL. It was inspired by the pragmatism, and the very existence, of the PEP8 guidelines.
Division of Content into Documents¶
Each statement in BEL is an atomic piece of knowledge and, combined with annotations and provenance information, makes a nanopublication. This section addresses the issue of how to organize that information into several .bel files.
Simply, each BEL document should represent the contents of one article. There may be reasons to include multiple articles in a single BEL document if there is crucial supporting information, but the task of assembling BEL for analysis is not the task of the curator.
One example where curation intervention was helpful in defining criteria was the use of the “Subgraph” annotation in NeuroMMSig, which sliced a large knowledgebase related to Alzheimer’s disease into several discrete subgraphs corresponding to biological pathways/mechanisms.
As an added benefit, the one-to-one correspondence of BEL scripts to citations makes the management of curators much easier since files will generally not conflict. This also encourages the externalization of list annotations for reuse.
Document Metadata¶
Versioning¶
BEL documents that are manually generated (as opposed to dumps of databases such as DrugBank) should use version numbers following Semantic Versioning, written in the <MAJOR>.<MINOR>.<PATCH> format.
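As a rough illustration, a version string can be checked against the three-part format with a few lines of Python. This is only a sketch: the pattern below is a simplified assumption that ignores the pre-release and build-metadata suffixes allowed by the full Semantic Versioning grammar, and `is_valid_version` is a hypothetical helper name.

```python
import re

# Simplified MAJOR.MINOR.PATCH pattern (assumption: no pre-release or
# build metadata, and no leading zeros in any of the three parts).
SEMVER = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def is_valid_version(version: str) -> bool:
    """Return True if the whole string looks like MAJOR.MINOR.PATCH."""
    return SEMVER.fullmatch(version) is not None


print(is_valid_version("1.0.0"))     # three numeric parts
print(is_valid_version("1.0"))       # PATCH is missing
print(is_valid_version("20180101"))  # a date stamp, not a semantic version
```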
Authorship¶
Authors should be listed comma-separated in alphabetical order by last name using the SET DOCUMENT Authors field.
In the description, the contributions of each author can be listed. Some suggested roles are “curation”, “supervision”, “quality control”.
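The alphabetical ordering can be produced mechanically. The Python sketch below assumes that each name is written as "First Last" and that the last whitespace-separated token is the family name; `format_authors` and the example names are hypothetical.

```python
def format_authors(names):
    """Sort 'First Last' names by last name and join them with commas."""
    ordered = sorted(names, key=lambda name: name.split()[-1].lower())
    return ", ".join(ordered)


# Hypothetical curators, used only for illustration
authors = ["Jane Smith", "Alex Doe", "Sam Miller"]
print('SET DOCUMENT Authors = "{}"'.format(format_authors(authors)))
```

Names with multi-word family names would need a more careful sort key than the one used here.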
Contact Info¶
Consider that the authors of a BEL document and the person responsible for its integrity and correctness might not be the same. For example, this could be due to people moving to new projects. Only the person responsible for a given BEL document should list their contact information in the SET DOCUMENT ContactInfo field.
Organization of Terminologies¶
The term “terminologies” is used to refer to both BEL namespaces and BEL annotations in this section.
Terminologies’ keywords should use an uppercased version of their corresponding entry in Identifiers.org, when possible. Dots and dashes in resource names are removed for BEL, since they are not considered valid characters for use in keywords. Example: ec-code becomes ECCODE.
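The keyword rule above is simple enough to express as a one-line transformation. A minimal Python sketch follows; `to_bel_keyword` is a hypothetical helper name, not part of any BEL tooling.

```python
def to_bel_keyword(prefix: str) -> str:
    """Uppercase an Identifiers.org prefix and drop dots and dashes."""
    return prefix.replace(".", "").replace("-", "").upper()


print(to_bel_keyword("ec-code"))
```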
Namespaces should be listed first (interspersed URL and PATTERN definitions), then annotations (interspersed URL and PATTERN definitions), then annotations defined by lists. Within each group, all terminologies should be listed in alphabetical order by the keyword used.
Terminologies with multiple parts, like MeSH and GO, should NOT be split into multiple namespaces (e.g. MESHD, MESHCS, MESHC, GOBP, GOCC, GOMF). Updated versions of these namespaces can be found at https://github.com/pharmacome/terminology/tree/master/external and are versioned using the git commit hashes. The following namespaces are already available:
Note: while GFAM is used for hgnc.genefamily for brevity, this isn’t really recommended.
Usage of Short vs. Long Form¶
All BEL functions (e.g., proteinAbundance(), abundance(), pathology(), etc.) should be abbreviated to the short forms (e.g., p(), a(), path(), etc.).
All BEL transformations (i.e., activity(), translocation(), and reaction()), as well as their specific arguments (e.g., molecularActivity(), fromLocation(), etc.) should be abbreviated to the short forms (e.g., act(), tloc(), and rxn()).
All BEL relationships should be abbreviated with their short forms.
BEL is quite verbose - the theme is to always abbreviate when possible.
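The abbreviation rule for functions can be sketched as a small Python rewrite step. The mapping below covers only the functions named in this section plus the short forms ma() and fromLoc() for the two arguments mentioned; it does not touch relationships, and it would also rewrite matching text inside quoted names, so it should be read as an illustration and not as a complete tool. The example statement is made up.

```python
import re

# Long-form to short-form names, limited to those mentioned in this guide
SHORT_FORMS = {
    "proteinAbundance": "p",
    "abundance": "a",
    "pathology": "path",
    "activity": "act",
    "translocation": "tloc",
    "reaction": "rxn",
    "molecularActivity": "ma",
    "fromLocation": "fromLoc",
}

# Match a long-form name only when it is a whole word followed by "("
_LONG = re.compile(r"\b(" + "|".join(SHORT_FORMS) + r")\(")


def abbreviate(statement: str) -> str:
    """Replace long-form BEL function names with their short forms."""
    return _LONG.sub(lambda m: SHORT_FORMS[m.group(1)] + "(", statement)


long_form = "abundance(CHEBI:lovastatin) decreases activity(proteinAbundance(HGNC:HMGCR))"
print(abbreviate(long_form))
```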
Usage of SET STATEMENT_GROUP¶
STATEMENT_GROUP is listed in the BEL specification as a privileged annotation - it does not need to be defined, and it can be set to anything without semantic validation.
Because it has neither inherent meaning nor community practices ascribed to it, using this annotation is explicitly discouraged.
Some curators use STATEMENT_GROUP to give information about who the curator was, or a certain “sprint” of curation, but these should already be addressed by the earlier point on the organization of BEL documents.
Proper Spacing¶
Ensure proper spacing. Without it, BEL is difficult to read and assess.
TODO develop a linter for continuous integration checking!
Spacing in BEL Terms¶
The following protein with a post-translational modification is difficult to read because there is no space between the comma following the identifier and the pmod() function:
The same, with proper spacing applied:
The same applies for all other variants (sub(), frag(), loc(), etc.) and other functions in which commas are applied. The following is another example in which the spacing between the comma following the identifier is correct, but the contents of the pmod() are not:
The same, with proper spacing applied:
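As a rough illustration of the comma rule, the Python sketch below puts exactly one space after every comma that sits outside double quotes. The function name and the example terms are assumptions made for this sketch, and escaped quotes inside names are not handled.

```python
def space_commas(term: str) -> str:
    """Put exactly one space after each comma that is outside double quotes."""
    out = []
    in_quotes = False
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == '"':
            in_quotes = not in_quotes
        out.append(ch)
        i += 1
        if ch == "," and not in_quotes:
            # Skip any existing spaces, then emit a single one
            while i < len(term) and term[i] == " ":
                i += 1
            out.append(" ")
    return "".join(out)


# Hypothetical term with missing spaces after the commas
print(space_commas("p(HGNC:AKT1,pmod(Ph,Ser,473))"))
```

A check like this could be one building block of the linter mentioned in the TODO above.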
Spacing in Annotations¶
The following single annotation is difficult to read because there are no spaces between 1) the annotation and the equals sign and 2) the equals sign and the value:
The same, with proper spacing applied:
The following multiple annotation is difficult to read because there are no spaces between 1) the annotation and the equals sign, 2) the equals sign and the open bracket, and 3) the entries within the brackets:
The same, with proper spacing applied:
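The annotation spacing rules can be sketched in Python as follows. The annotation names and values in the example are hypothetical, `space_annotation` is an illustrative helper, and the list handling assumes that no quoted value contains a comma.

```python
import re


def space_annotation(line: str) -> str:
    """Normalize spacing in a SET line with a single value or a {list}."""
    match = re.fullmatch(r"\s*SET\s+(\w+)\s*=\s*(.+?)\s*", line)
    if match is None:
        return line  # not a plain annotation line; leave it untouched
    key, value = match.groups()
    if value.startswith("{") and value.endswith("}"):
        # Assumption: entries never contain commas inside their quotes
        entries = [entry.strip() for entry in value[1:-1].split(",")]
        value = "{" + ", ".join(entries) + "}"
    return "SET {} = {}".format(key, value)


print(space_annotation('SET Species="9606"'))
print(space_annotation('SET CellLine={"HEK293","HeLa"}'))
```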
Citation Information¶
Citations should be written succinctly when referring to databases like PubMed, PubMed Central and DOI. The remaining citation information can be looked up programmatically afterwards.
The same, with proper terseness:
Chemical Biology Curation Guidelines¶
This document contains an initial (and as yet incomplete) set of guidelines for representing quantitative information in BEL. They do not require any special extensions to its syntax, and should be compatible with any of the available parsers/compilers.
There is currently a BEL enhancement proposal for BEL 2.0.0+ for native support of numeric annotations that would benefit from public input, so that’s highly encouraged!
Inhibitors¶
This example will focus on the ability for lovastatin (CHEMBL1487) to inhibit human HMG-CoA reductase (CHEMBL402, UniProt:P04035, HGNC:HMGCR, HGNC:5006).
Simple Representation¶
Medium-Granular Representation with BEL Default Namespace¶
Specific Representation¶
Using BEL 2.0, the molecular function of HMGCR that lovastatin inhibits can be stated explicitly: hydroxymethylglutaryl-CoA reductase (NADPH) activity (GO:0004420).
In general, it might not be obvious how specific a GO term to choose. Additionally, a protein may have multiple functions. Complementary to GO is the ExPASy enzyme classification, which is also encoded in ChEBI and can be automatically added to BEL.
Assay Metadata¶
Taking inspiration from the ChEMBL schema, several pieces of metadata can make inhibition experiments more useful:
- Target Type (e.g., cell line, organism, single protein, complex, etc.)
- Measurement Type (e.g., IC50, pIC50, EC50, pEC50, Ki)
- Measurement Units (e.g., pM, nM, μM, mM, M)
- Measurement Relation (=, >, <, >=, <=, ~)
- Measurement Value (floating point matching the regular expression: ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$)
- Assay Type (see below)
- Cell line, target organism, and/or species
In some cases, the measurement value may be reported as a range. In these situations, use two complementary annotations for Measurement Range Lower and Measurement Range Upper.
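The measurement relation and value constraints listed above can be checked with a short Python sketch. The regular expression and the set of relations are copied from the list; the function name `check_measurement` is an assumption made for illustration.

```python
import re

# Regular expression quoted in the list above
VALUE = re.compile(r"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$")
# Measurement relations from the list above
RELATIONS = {"=", ">", "<", ">=", "<=", "~"}


def check_measurement(relation: str, value: str) -> bool:
    """Return True if both the relation and the value are well formed."""
    return relation in RELATIONS and VALUE.match(value) is not None


print(check_measurement("=", "3.4"))
print(check_measurement("<", "1e-9"))
print(check_measurement("=", "3."))
```

Note that the expression requires at least one digit after an optional decimal point, so a value such as "3." is rejected while ".5" is accepted.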
Assay Type (adapted from ChEMBL)¶
- Binding (B) - Data measuring binding of compound to a molecular target, e.g. Ki, IC50, Kd.
- Functional (F) - Data measuring the biological effect of a compound, e.g. %cell death in a cell line, rat weight.
- ADMET (A) - ADME data e.g. t1/2, oral bioavailability.
- Toxicity (T) - Data measuring toxicity of a compound, e.g., cytotoxicity.
- Physicochemical (P) - Assays measuring physicochemical properties of the compounds in the absence of biological material e.g., chemical stability, solubility.
- Unclassified (U) - A small proportion of assays cannot be classified into one of the above categories e.g., ratio of binding vs efficacy.
After adding this metadata, we get:
Provenance¶
Using a valid citation that points to the original source of the information is preferred to using a reference to the database from which the relation comes. Example: it’s better to use PMID:2153213 in these examples referring to its original publication rather than citing ChEMBL.
However, data often comes in a table, and won’t have a real evidence text. It’s not exactly clear whether BEL requires evidences for each statement, so for now a placeholder string saying “Retrieved from X” and an additional annotation called Database set to X will allow forward-compatibility. Use an identifiers.org namespace whenever possible for X.
Finally, with both the assay metadata and provenance, we get:
Receptor Binding¶
This example will focus on the binding of zolpidem (CHEMBL911) to the GABA receptor alpha-5 subunit (CHEMBL5112, UniProt:A8K338, HGNC:GABRA5).
The binding of a chemical to a receptor is represented by the chemical causing a complex with the protein. Binding is typically measured with Ki.
Zolpidem is not a very strong binder to the GABA receptor alpha-5 subunit, so it is unlikely we’ll find an annotation as to its binding type.
Binding Type¶
- Full agonists are able to activate the receptor and result in a strong biological response. The natural endogenous ligand with the greatest efficacy for a given receptor is by definition a full agonist (100% efficacy).
- Partial agonists do not activate receptors with maximal efficacy, even with maximal binding, causing partial responses compared to those of full agonists (efficacy between 0 and 100%).
- Antagonists bind to receptors but do not activate them. This results in a receptor blockade, inhibiting the binding of agonists and inverse agonists. Receptor antagonists can be competitive (or reversible), and compete with the agonist for the receptor, or they can be irreversible antagonists that form covalent bonds (or extremely high affinity non-covalent bonds) with the receptor and completely block it. The proton pump inhibitor omeprazole is an example of an irreversible antagonist. The effects of irreversible antagonism can only be reversed by synthesis of new receptors.
- Inverse agonists reduce the activity of receptors by inhibiting their constitutive activity (negative efficacy).
To Do:
- add full agonist example
- add partial agonist example
- add antagonist example
Allostery¶
In general, if allostery is not set, then it is assumed to be None.
Allosteric modulators do not bind to the agonist-binding site of the receptor but instead on specific allosteric binding sites, through which they modify the effect of the agonist. For example, benzodiazepines (BZDs) bind to the BZD site on the GABAA receptor and potentiate the effect of endogenous GABA.
- Positive allosteric modulator
- Negative allosteric modulator
Source: https://en.wikipedia.org/wiki/Receptor_(biochemistry)
When the binding type is set, we can also write a second statement with how the binding affects the activity of the receptor.
Basmisanil (CHEMBL3681419) is an inverse agonist of the GABA receptor alpha-5 subunit (UNIPROT:A8K338).
Encoding of Genetic Information¶
BEP6 introduces a proper syntax for representing epigenetic modifications, such as methylation for BEL 2.0.0+. The syntax follows the same style as the protein modification syntax, and can follow the identifier for a gene. A reference implementation has been included in PyBEL.
Epigenetics¶
The gene modification function, gmod(), serves as a syntax for encoding epigenetic modifications. Its usage mirrors the pmod() function for proteins and includes arguments for methylation.
The options for gmod() are currently:
- M, Me, and methylation all refer to methylation on the given gene
- A, Ac, and acetylation all refer to acetylation on the given gene
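Since several spellings refer to the same modification, a curation tool may want to collapse them to one form before comparing statements. The Python sketch below picks Me and Ac as the canonical codes, which is an arbitrary choice made for this example; `normalize_gmod` and the example gene term are likewise hypothetical.

```python
# Accepted gmod() arguments mapped to one canonical code each
# (assumption: "Me" and "Ac" are chosen as canonical for this sketch)
GMOD_CODES = {
    "M": "Me", "Me": "Me", "methylation": "Me",
    "A": "Ac", "Ac": "Ac", "acetylation": "Ac",
}


def normalize_gmod(code: str) -> str:
    """Map any accepted gmod() argument to a single short code."""
    try:
        return GMOD_CODES[code]
    except KeyError:
        raise ValueError("unknown gmod() argument: {!r}".format(code))


print("g(HGNC:MGMT, gmod({}))".format(normalize_gmod("methylation")))
```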
Single Nucleotide Polymorphisms (SNPs)¶
In general, a single nucleotide polymorphism (SNP) refers to a variant in a genetic sequence. The de facto identifiers for these variations are the RS numbers from dbSNP. A given identifier can point to two types of information: intragenic SNPs and intergenic SNPs.
Intragenic SNPs¶
A variation in the sequence of a protein-coding gene can have the consequence of an amino acid substitution or differential expression of the gene. In these cases, it’s important to explicitly code a BEL statement linking a SNP to its gene with the variation to which it refers. Since BEL 2.0, variants can be encoded according to the HGVS nomenclature.
A variant can be named multiple ways depending on the “reference sequence” used. While it is possible to refer to a variation by the amino acid sequence or the chromosomal sequence, it is much easier to interpret biologically when using the gene reference sequence. Given an RS identifier, dbSNP lists the reference sequence identifier and the HGVS string based on many of these sequences. More specifically, reference sequence identifiers starting with NM or XM refer to the genetic sequence (and should all have the same HGVS string), while the NC identifiers refer to the chromosome sequence and should be disregarded.
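The selection rule above can be sketched in Python. The input format ("accession:hgvs") and the accession numbers are made up for this example and `gene_level_hgvs` is a hypothetical helper; only the NM/XM/NC prefix rule comes from the text.

```python
def gene_level_hgvs(entries):
    """Return the distinct HGVS strings whose reference starts with NM or XM."""
    kept = set()
    for entry in entries:
        accession, _, change = entry.partition(":")
        if accession.startswith(("NM", "XM")):
            kept.add(change)
    return sorted(kept)


# Placeholder records, not real dbSNP content
records = [
    "NC_000001.1:g.1000A>G",
    "NM_000001.1:c.100A>G",
    "XM_000001.1:c.100A>G",
]
print(gene_level_hgvs(records))
```

If the result contains more than one string, the NM and XM entries disagree and the record deserves a manual look.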
In the future, there might be a way to automate this procedure, but for now, a curator who wants to encode intragenic SNPs should also make this equivalence explicit. These statements can be grouped together in a citation to dbSNP, the evidence can be dummy text, and the confidence level can be set with SET Confidence = "Axiomatic".
First, dbSNP can be included using a regular expression definition of a namespace, since there are potentially billions of enumerated SNPs. This is a BEL 2.0.0+ feature that was proposed in BEP5. Identifiers.org lists the database information at https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000161, and includes a regular expression that all accession numbers follow. This can be included as
The following is an example from https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1235 that uses the equivalentTo relationship proposed by BEP4 for BEL 2.0.0+ to link a dbSNP entry to the HGVS nomenclature applied to a gene. A reference implementation is provided by PyBEL, in which the hasVariant relationship g(HGNC:MDGA2) hasVariant g(dbSNP:rs1235) is automatically added by the compiler.
An example from a genome-wide association study:
Intergenic SNPs¶
Some SNPs are not directly part of coding genes’ regions. For these SNPs, it is not necessary to encode a relationship to a gene.
However, this also means that their functional consequences do not follow directly from the causal relationships connected to a particular gene. It must be kept in mind that these SNPs will need further qualifications to become useful, such as associations from LD-Block analysis or other studies from eQTL, etc.
LD-Block Information¶
Linkage disequilibrium (LD) block analysis finds SNPs that co-occur. These relationships can be inferred from data-driven approaches.
TODO
eQTL Information¶
Expression quantitative trait loci (eQTLs) connect variants to gene expression patterns.
TODO
Re-curation Guidelines¶
These guidelines were originally written for the re-curation of several NeuroMMSig subgraphs in the Alzheimer’s Disease Knowledge Assembly during the Human Brain Pharmacome project, but may be generally applicable to other BEL scripts as well.
Normalizing Entities¶
Chemicals¶
- Normalize chemical entities to preferred namespaces (ChEBI, ChEMBL, PubChem) whenever possible. MeSH is explicitly discouraged because it is difficult to look up the structures of its entries as SMILES or InChI, even with resolving services like UniChem
- Formalize knowledge about chemicals that have not yet been encoded in ChEBI (such as Selventa chemicals [SCHEM], the BELIEF chemical namespaces, etc.), drawing from other public resources (PubChem, MeSH, CAS, ChEMBL, UniChem, BridgeDB, ChemSpider, etc.) whenever possible
Protein Complexes and Families¶
FamPlex has emerged as a resource that maps families and protein complexes (including the Selventa mappings SFAM and SCOMP as well as other widely used namespaces like PFAM and InterPro).
- Normalize all entities to FamPlex
- Formalize knowledge about new families by making a pull request to FamPlex. See: https://github.com/sorgerlab/famplex#contributing
Other Entities¶
We are also building a terminology at https://github.com/pharmacome/terminology. This should not be done lightly, so see its contribution guidelines and rules before making a pull request.
Checking Correctness¶
- If the statement can be asserted from the given evidence, add the annotation SET Confidence = "High"
- If the statement is wrong, fix it and add the annotation SET Confidence = "Medium"
- If it’s not clear what BEL should represent the biology, add SET Confidence = "Low" for later discussion
- If the evidence string contains no reasonable biological knowledge/is nonsense, delete it and the related statements entirely. It’s okay to remove BEL statements that are not supported.
Finalization¶
After all relevant statements have been checked for correctness, the curation leader will check all statements with SET Confidence = "High" or SET Confidence = "Medium" and change to SET Confidence = "Very High" if they agree. If they do not agree, they will fix it themselves.
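The confidence workflow can be summarized as a small state mapping. In the Python sketch below, the outcome labels ("asserted", "fixed", "unclear") are hypothetical names for the first three cases under Checking Correctness, and leaving the level unchanged when the leader disagrees is an assumption, since the text does not say which level a leader-fixed statement receives.

```python
# Hypothetical outcome labels for the first three checking cases
INITIAL = {"asserted": "High", "fixed": "Medium", "unclear": "Low"}


def initial_confidence(outcome: str) -> str:
    """Confidence level a curator assigns after checking a statement."""
    return INITIAL[outcome]


def finalize(confidence: str, leader_agrees: bool) -> str:
    """Promote High/Medium to Very High once the curation leader agrees."""
    if confidence in {"High", "Medium"} and leader_agrees:
        return "Very High"
    return confidence  # assumption: otherwise the level is left as it was


level = finalize(initial_confidence("fixed"), leader_agrees=True)
print('SET Confidence = "{}"'.format(level))
```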
Curation using INDRA¶
This document describes a procedure for using INDRA for acquiring automatically extracted relations in BEL for curation.
1. Identification of target entities¶
Curation begins from a seed list of entities. Often, this will be genes of interest to a project or based on a rational assessment of information density in a network.
2. Gather statements from INDRA¶
After acquiring statements from the INDRA REST Database, several filters are applied to the resulting statements:
- Statements containing chemicals are filtered. They are more difficult to curate because of the heterogeneity of chemical nomenclatures and identifiers, and many times can be replaced by the use of databases like ChEMBL, CTD, PubChem, etc.
- Confidences are calculated for each statement based on INDRA’s BELIEF engine, which takes into account high-quality material that has been previously curated in Pathway Commons, Selventa’s Large BEL Corpus, and other sources. Statements of different levels of granularity are linked, and confidences can be calculated.
- When several text-based evidences are available for a given statement, one is chosen randomly so as to avoid curator fatigue.
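Two of the filters above can be sketched without touching the INDRA API at all. The Python example below works on plain dictionaries whose keys ("has_chemical", "evidence") are invented for this illustration; real INDRA statements are objects with a different structure, and the confidence calculation is not covered.

```python
import random


def prepare(statements, seed=0):
    """Drop statements that mention chemicals and keep one random evidence each."""
    rng = random.Random(seed)  # seeded so that a sheet can be regenerated
    prepared = []
    for stmt in statements:
        if stmt["has_chemical"]:
            continue
        chosen = rng.choice(stmt["evidence"])
        prepared.append({**stmt, "evidence": [chosen]})
    return prepared


# Toy input with made-up content
toy = [
    {"id": "a", "has_chemical": False, "evidence": ["text 1", "text 2", "text 3"]},
    {"id": "b", "has_chemical": True, "evidence": ["text 4"]},
]
for stmt in prepare(toy):
    print(stmt["id"], stmt["evidence"])
```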
3. Preparation of Curation Sheets¶
The filtered content is exported to a CSV file containing the following columns:
- INDRA UUID. A unique identifier assigned to each statement in the INDRA database
- INDRA Confidence. The estimated probability (0.0 - 1.0) that a statement is correct
- Text Reference. The PubMed identifier of the article from which the statement was automatically extracted
- Text. The text from which the statement was extracted
- INDRA API. The text mining system used by INDRA to extract the statement
- BEL Subject
- BEL Predicate
- BEL Object
- Checked
- Correct
- Changed
- Annotations
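A sheet with these columns can be written with the csv module from the Python standard library. In the sketch below the column names are taken from the list above, while the example row, its values, and the helper name `write_sheet` are made up.

```python
import csv
import io

# Column names from the list above, in the same order
COLUMNS = [
    "INDRA UUID", "INDRA Confidence", "Text Reference", "Text", "INDRA API",
    "BEL Subject", "BEL Predicate", "BEL Object",
    "Checked", "Correct", "Changed", "Annotations",
]


def write_sheet(rows, handle):
    """Write curation rows as CSV, leaving unspecified columns empty."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Made-up row; the curator columns are left blank for later
example_row = {
    "INDRA UUID": "example-uuid",
    "INDRA Confidence": 0.87,
    "Text Reference": "12345",
    "Text": "A is phosphorylated by B.",
    "INDRA API": "reach",
    "BEL Subject": "p(HGNC:B)",
    "BEL Predicate": "increases",
    "BEL Object": "p(HGNC:A, pmod(Ph))",
}
buffer = io.StringIO()
write_sheet([example_row], buffer)
print(buffer.getvalue())
```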
4. Curation¶
Each statement should be read and assessed by the curator, then an “x” should be placed in the “Checked” column.
- If the statement was correct, an “x” should be placed in the “Correct” column.
- Otherwise, the statement should be fixed (assignment of entity types, relation, etc.) and an “x” should be placed in the “Changed” column.
- If the statement is total nonsense, then no checks should be placed in either the “Correct” or “Changed” columns. See the guidelines for curating information about errors on https://github.com/pharmacome/curation/blob/master/indra-errors.rst
If other BEL statements can be extracted from the same evidence, make a new line with all of the same provenance information (uuid, reference, evidence, etc.) and just place an “x” in the “Changed” column.
If there are any annotations (cell type, species, cell line, tissue, experimental context, etc.) that are obvious from the evidence, then they can be denoted in the “Annotations” column using the BEL idiom of SET ANNOTATION … statements.
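The marking rules in this step can be turned into a quick consistency check on each row. The Python sketch below is one possible reading of those rules; the status labels and the decision to flag a row marked both "Correct" and "Changed" as inconsistent are assumptions.

```python
def row_status(row):
    """Classify a curation sheet row from its three marker columns."""
    checked = row.get("Checked") == "x"
    correct = row.get("Correct") == "x"
    changed = row.get("Changed") == "x"
    if correct and changed:
        return "inconsistent"  # a statement cannot be both
    if changed:
        return "changed"  # fixed statement, or a newly added line
    if correct:
        return "correct" if checked else "inconsistent"
    if checked:
        return "nonsense"  # read, but neither correct nor fixable
    return "unchecked"


print(row_status({"Checked": "x", "Correct": "x"}))
print(row_status({"Checked": "x"}))
print(row_status({"Changed": "x"}))
```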
5. Re-Curation¶
Finally, the statements should be exported to BEL and checked using the re-curation procedure.
Curation Guidelines for INDRA errors¶
This document outlines the most common types of errors identified in the BEL statements extracted from INDRA.
When they’re encountered:
- Add a column with the label ‘Error Type’ to your document
- Whenever you find a type of error mentioned in the table, please put the type of error in the column
- If the error does not correspond to any of those categories, add a new error type to this table and give an example
Gene-centric¶
- Knock Down. A knock down of the gene is not labelled as such.
- Gene skipping. Exon/intron regulation is not labelled as such.
- Promoter activity. Missing promoter activity labels.
Relationship¶
- Target. Labelled as target but not clear from the evidence.
- Modulate. Labelled as modulation but not clear from the evidence.
- Regulate. Labelled as regulation but not clear from the evidence.
- Mediate. Labelled as mediation but not clear from the evidence.
Identification¶
The named-entity recognition system labels an entity incorrectly, for example by confusing an abbreviation with a different meaning (e.g., FIP means USF2 but is also a recombinant protein, FasL Interfering Protein).
Negation¶
- But not. False positive negation.
- Negative mediator.
Subject/object¶
Subject and object are correctly labelled but there is no relationship between them.
Not Evidence¶
The evidence information is not sufficient to code the BEL statement.
Site of Modification¶
The entity was labelled with the wrong modification site.
Physical Contact Missing¶
The relationship was labelled as directed but the evidence does not show that.
Rational Enrichment¶
This document describes an approach to enrich knowledge assemblies based on topological novelty.
1. Assemble BEL from Relevant Sources¶
- Assemble BEL from relevant sources
- Re-curate content with questionable quality following the re-curation procedure
- Optional: choose annotations that are relevant and filter the resulting network. During the initial stages of the Human Brain Pharmacome project, we used the “Subgraph” annotation from NeuroMMSig to select our ten highest priority candidate mechanisms of pathogenesis of Alzheimer’s disease.
2. Identify Low-Information Nodes¶
The information density of each node is assessed to provide an un-biased prioritization list of genes/proteins.
- Chemicals and higher-order processes (e.g., biological processes, phenotypes, pathologies, clinical measurements, etc.) are filtered from the resulting network. From a molecular-biology point of view, these are the logical inputs and outputs to a biological system.
- Proteins, mRNAs, miRNAs, and all variants are collapsed to their corresponding genes’ nodes.
- The information density of each node is defined as the degree: the sum of the in-degree and out-degree.
Nodes with a degree of 0 necessarily had links only to chemicals or higher-order processes, which were removed by the first filter above. They are interesting because there is not yet mechanistic information linking them to other genes and proteins that have been curated in the context of the network. Nodes with a degree of 1 or more have an increasing amount of information, and can be ranked such that the lowest degrees are prioritized for curation.
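The degree-based prioritization can be sketched in a few lines of Python. The example assumes that all nodes have already been collapsed to genes and represents the network as a plain list of (source, target) pairs; the edges are a toy example and not curated content.

```python
def information_density(genes, edges):
    """Return {gene: in-degree + out-degree}, counting gene-to-gene edges only."""
    degree = {gene: 0 for gene in genes}
    for source, target in edges:
        # Edges touching chemicals or processes are ignored (first filter)
        if source in degree and target in degree:
            degree[source] += 1  # contributes to the out-degree of source
            degree[target] += 1  # contributes to the in-degree of target
    return degree


def prioritize(genes, edges):
    """Order genes from lowest to highest degree (ties broken by name)."""
    density = information_density(genes, edges)
    return sorted(density, key=lambda gene: (density[gene], gene))


genes = {"APP", "BACE1", "MAPT", "TREM2"}
edges = [("BACE1", "APP"), ("APP", "MAPT"), ("lovastatin", "TREM2")]  # toy edges
print(prioritize(genes, edges))
```

In this toy network TREM2 only touches a chemical, so it ends up with degree 0 and is ranked first for curation.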
3. Filter Low-Information Nodes¶
Some of the nodes with low information have already been quantified mechanistically, but just not in the given knowledge assembly. In this step, several sources (Pathway Commons, Bio2BEL, etc.) are used to enrich these nodes to determine if they have already been curated. Those that are found in other sources are also excluded from curation.
4. Curation¶
Several steps can be taken to curate content after prioritizing genes.
- Curation with INDRA
- Manual search of literature for full-text. Eventually, storing the resulting PDFs in services like Mendeley will result in good automatic recommendations. Curation and re-curation can be done on prioritized papers.
Curation with Git¶
This document describes a workflow for doing BEL curation using git for version control. It’s inspired by the Git Flow philosophy.
The purpose of this document is not to explain what git is or how to use it. Lots of fantastic resources exist on the internet for this - here’s a good place to start: https://guides.github.com/activities/hello-world/.
1. Make a Branch¶
The main/master branch of the git repository is protected in order to encourage assessment by multiple users.
- Check out the master branch
- Fetch and pull from the origin
- Make a new branch with a descriptive (but succinct) name of what you’ll be working on. If this branch corresponds to adding a new article, title the branch with the author’s last name and publication year prefixed with curation-. Example: curation-olsen2016
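The naming convention for article branches can be generated instead of typed by hand. In this Python sketch, dropping every character other than lowercase ASCII letters and digits from the last name is an assumption made to keep branch names simple; `curation_branch` is a hypothetical helper.

```python
import re


def curation_branch(last_name: str, year: int) -> str:
    """Build a branch name such as curation-olsen2016."""
    # Assumption: keep only lowercase ASCII letters and digits from the name
    slug = re.sub(r"[^a-z0-9]", "", last_name.lower())
    return "curation-{}{}".format(slug, year)


print(curation_branch("Olsen", 2016))
```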
2. Commits with Incremental Progress¶
Commit early and often. Each commit should describe what’s being worked on, and follow best community practices for writing the commit. See: https://chris.beams.io/posts/git-commit/. In the future, pre-commit hooks could be used to enforce this policy.
Between commits, locally compiling the BEL document with a compiler like PyBEL to find errors is helpful as opposed to waiting until the whole curation task is finished. Other solutions using continuous integration to take the hassle out of installing and running a BEL compiler are also publicly available. For example, see: https://github.com/cthoyt/pybel-git.
3. Merge¶
If curation is being done on GitHub, make a pull request. On GitLab, make a merge request. This is a place where the final results can be checked one more time for syntactic/semantic correctness with PyBEL, and discussion between curators and managers can occur.
Each merge/pull request needs a name suitable for the public commit log, since the multiple commits will be combined (through a process known as squashing) before merging onto the master branch.