Encoding of Genetic Information

BEP6 introduces a proper syntax for representing epigenetic modifications, such as methylation for BEL 2.0.0+. The syntax follows the same style as the protein modification syntax, and can follow the identifier for a gene. A reference implementation has been included in PyBEL.

Epigenetics

The gene modification function, gmod(), as a syntax for encoding epigenetic modifications. Its usage mirrors the pmod() function for proteins and includes arguments for methylation.

The options for gmod() are currently:

  • M, Me, and methylation all refer to methylation on the given gene
  • A, Ac, and acetylation all refer to acetylation on the given gene

Single Nucleotide Polymorphisms (SNPs)

In general, a single nucleotide polymorphism (SNP) refers to a variant in a genetic sequence. The de facto identifiers for these variations are the RS numbers from dbSNP. A given identifier can point to two types of information: intrageneic SNPs and intergenic SNPs.

Intragenic SNPs

A variation in the sequence of a protein-coding gene can have the consequence of an amino acid substitution or differential expression of the gene. In these cases, it’s important to explicitly code a BEL statement linking a SNP to its gene with the variation to which it refers. Since BEL 2.0, variants can be encoded according to the HGVS nomenclature.

A variant can be named multiple ways depending on the “reference sequence” used. While it is possible to refer to a variation by the amino acid sequence or the chromosomal sequence, it is much easier to interpret biologically when using the gene reference sequence. Given an RS identifier, dbSNP lists the reference sequence identifier and the HGVS string based on many of these sequences. More specifically, reference sequence identifiers starting with NM or XM refer to the genetic sequence (and should all have the same HGVS string), while the NC identifiers refer to the chromosome sequence and should be disregarded.

In the future, there might be a way to automate this procedure, but as a curator wants to encode intragenic SNPs, they should also make this equivalence explicit. These statements can be grouped together in a citation to dbSNP, the evidence can be dummy text, and the confidence level can be set with SET Confidence = "Axiomatic".

First, dbSNP can be included using a regular expression definition of a namespace, since there are potentially billions of enumerated SNPs. This is a BEL 2.0.0+ feature that was proposed in BEP5. Identifiers.org lists the database information at https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000161, and includes a regular expression that all accession numbers follow. This can be included as

The following is an example from https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1235 that uses the equivalentTo relationship proposed by BEP4 for BEL 2.0.0+ to link a dbSNP entry to the HGVS nomenclature applied to a gene.

A reference implementation is provided by PyBEL, in which the hasVariant relationship g(HGNC:MDGA2) hasVariant g(dbSNP:rs1235) is automatically added by the compiler.

An example from a genome-wide association study:

Intergenic SNPs

Some SNPs are not directly part of coding genes’ regions. For these SNPs, it is not necessary to encode a relationship to a gene.

However, this also means that their functional consequences do not follow directly from the causal relationships connected to a particular gene. It must be kept in mind that these SNPs will need further qualifications to become useful, such as associations from LD-Block analysis or other studies from eQTL, etc.

LD-Block Information

Linkage disequilibrium (LD) block analysis find SNPs that co-occur together. These relationships can be inferred from data-driven approaches.

TODO

eQTL Information

Expression quantitative trait loci (eQTLs) connect variants to gene expression patterns.

TODO