Adding Predictions
The Biomappings workflow is generic and allows for any mapping prediction workflow to be used. You can see several examples in the scripts/ directory of the canonical Biomappings repository.
Most are built using a lexical matching workflow based on Gilda. This approach is fast, simple, and interpretable since it relies on the labels and synonyms for concepts appearing in ontologies and other controlled vocabularies (e.g., MeSH). It takes the following steps:
Index labels and synonyms from entities in various controlled vocabularies (e.g., MeSH, ChEBI)
Filter out concepts that already have mappings in the primary source or in third-party mappings curated in Biomappings
Perform all-by-all comparison of indexes for each controlled vocabulary
Filter out mappings that have previously been marked as incorrect (i.e., to avoid "zombie" mappings that reappear after being rejected)
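The steps above can be sketched in plain Python. This is a minimal illustration, not the actual Biomappings implementation: it indexes normalized labels for two hypothetical toy vocabularies and intersects the indexes, standing in for Gilda's indexing and scoring.

```python
# Minimal sketch of the lexical matching steps above (illustrative only;
# the real workflow uses Gilda's indexing and scoring).

def normalize(label: str) -> str:
    """Normalize a label for comparison (lowercase, strip, collapse dashes)."""
    return label.lower().strip().replace("-", " ")

def build_index(vocabulary: dict[str, list[str]]) -> dict[str, str]:
    """Step 1: index labels and synonyms, mapping normalized text -> identifier."""
    index = {}
    for identifier, labels in vocabulary.items():
        for label in labels:
            index[normalize(label)] = identifier
    return index

def match(source, target, known: set, incorrect: set):
    """Steps 2-4: skip already-mapped concepts, compare the two indexes,
    and drop pairs previously marked incorrect ("zombie" mappings)."""
    source_index = build_index(source)
    target_index = build_index(target)
    predictions = []
    for text, source_id in source_index.items():
        target_id = target_index.get(text)
        if target_id is None:
            continue
        pair = (source_id, target_id)
        if source_id in known or pair in incorrect:
            continue
        predictions.append(pair)
    return predictions

# Hypothetical toy vocabularies with CURIE-style identifiers
chebi = {"CHEBI:15377": ["water", "H2O"]}
mesh = {"mesh:D014867": ["Water"]}
print(match(chebi, mesh, known=set(), incorrect=set()))
# [('CHEBI:15377', 'mesh:D014867')]
```

The real workflow additionally scores each match, but the filtering logic follows this shape.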
Examples
The following examples have already been used to predict mappings:
generate_mondo_mappings.py uses a pre-configured workflow based on the gilda and pyobo integrations
generate_ccle_mappings.py uses the same pre-configured workflow with a custom filter to include additional mappings
generate_cl_mesh_mappings.py implements a fully custom matching workflow, also built on Gilda (in case you want to roll your own)
generate_mesh_uniprot_mappings.py uses rule-based matching between MeSH and UniProt proteins that relies on the fact that the MeSH terms were generated from UniProt names
generate_wikipathways_orthologs.py uses a rule-based method for matching orthologous pathways in WikiPathways that relies on the fact that the names are procedurally generated with a certain template
We also have a work-in-progress example using pykeen for generating mappings based on knowledge graph embeddings (in both the transductive and inductive settings).
Warning
For people coming from the machine learning domain, there may be a desire to over-engineer matching methods. In practice, lexical matching gets 80-90% of the job done most of the time when reasonable lexicalizations are available. Resist the urge to overcomplicate matching workflows!
Clone the Repository
Fork the upstream Biomappings repository
Clone your fork, make a branch, and install it. Note that we're including the web and gilda extras, so we can run the curation interface locally as well as get all the tools we need for generating predictions.
git clone https://github.com/<your namespace>/biomappings
cd biomappings
git checkout -b tutorial
python -m pip install -e .[web,gilda]
Go into the scripts/ directory
cd scripts/
Make a Python file for predictions. In this example, we’re going to generate mappings between the ChEBI ontology and Medical Subject Headings (MeSH).
touch generate_chebi_mesh_example.py
Preparing the Mapping Script
Biomappings has a lot of first-party support for Gilda prediction workflows, so generating mappings
can be quite easy using a pre-defined workflow. Open your newly created generate_chebi_mesh_example.py
in your favorite editor and add the following four lines:
# generate_chebi_mesh_example.py
from biomappings.gilda_utils import append_gilda_predictions
from biomappings.utils import get_script_url
provenance = get_script_url(__file__)
append_gilda_predictions("chebi", "mesh", provenance=provenance)
All generated mappings in Biomappings should point to the script that generated
them. biomappings.utils.get_script_url() is called with __file__ to generate a
URI string pointing at the current script, assuming that it lives in the
scripts/ directory of the Biomappings repository.
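To make the idea concrete, here is a rough sketch of what such a provenance helper could look like. The URL template and repository path below are assumptions for illustration, not the actual implementation of biomappings.utils.get_script_url():

```python
from pathlib import Path

# Illustrative sketch only: build a provenance URI for a script assumed to
# live in the scripts/ directory of the upstream Biomappings repository.
# This URL template is an assumption, not the real implementation.
REPO_SCRIPTS = "https://github.com/biopragmatics/biomappings/blob/master/scripts"

def get_script_url_sketch(fname: str) -> str:
    # Keep only the file name, so the URI is stable regardless of where
    # the repository is checked out locally.
    return f"{REPO_SCRIPTS}/{Path(fname).name}"

print(get_script_url_sketch("/home/user/biomappings/scripts/generate_chebi_mesh_example.py"))
# https://github.com/biopragmatics/biomappings/blob/master/scripts/generate_chebi_mesh_example.py
```

This is why passing __file__ is enough: only the script's file name matters for constructing the URI.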
The hard work is done by biomappings.gilda_utils.append_gilda_predictions()
when called with ChEBI as the source prefix and MeSH as the target prefix, along
with the previously generated provenance URI string. Under the hood, this does the following:
Looks up the names and synonyms for concepts in ChEBI and MeSH using pyobo, a unified interface for accessing ontologies and non-ontology controlled vocabularies (such as MeSH)
Runs the matching algorithm described above
Appends the predictions to the local predictions TSV file
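As a rough illustration of the final step, appending rows to a TSV file might look like the following. The column layout here is simplified and hypothetical; see the actual predictions.tsv in the repository for the real schema.

```python
import csv
from pathlib import Path

# Simplified, hypothetical column layout for illustration only; the real
# predictions.tsv in the Biomappings repository has its own schema.
COLUMNS = ["source_curie", "relation", "target_curie", "confidence", "source"]

def append_predictions(path: Path, rows: list[dict]) -> None:
    """Append prediction rows to a TSV file, writing a header if the file is new."""
    is_new = not path.exists()
    with path.open("a", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=COLUMNS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)

append_predictions(
    Path("predictions_example.tsv"),
    [{
        "source_curie": "chebi:15377",
        "relation": "skos:exactMatch",
        "target_curie": "mesh:D014867",
        "confidence": "0.95",
        "source": "https://example.org/generate_chebi_mesh_example.py",
    }],
)
```

Appending rather than overwriting is what lets multiple prediction scripts contribute to the same shared file.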
Finishing Up
Execute the script from your command line and the predictions will be added to your local Biomappings cache.
python generate_chebi_mesh_example.py
This is a good time to review the changes, then commit and push them:
git add src/biomappings/resources/predictions.tsv
git commit -m "Add predictions from ChEBI to MeSH"
git push
Finally, you can run the web curation interface like normal and search for your new predictions to curate!
biomappings web
Custom Predictions File
While it’s preferred that predictions generated using the Biomappings workflow are contributed back to the upstream repository, custom instances can be deployed, e.g., within a company that wants to curate mappings to its own internal controlled vocabulary.
In order to accomplish this, you can use the path argument to
biomappings.gilda_utils.append_gilda_predictions(). By modifying the previous example,
we can store the predictions in a file called predictions.tsv in the same directory as the script.
# generate_chebi_mesh_example.py
from pathlib import Path
from biomappings.gilda_utils import append_gilda_predictions
from biomappings.utils import get_script_url
HERE = Path(__file__).parent.resolve()
PREDICTIONS_PATH = HERE.joinpath("predictions.tsv")
provenance = get_script_url(__file__)
append_gilda_predictions("chebi", "mesh", provenance=provenance, path=PREDICTIONS_PATH)