Adding Predictions
Biomappings is built on top of sssom_curator and can directly use the SSSOM
Curator lexical matching workflow to predict semantic mappings between any two
resources exposed via PyOBO. Use the biomappings predict lexical CLI command, for
example:
$ biomappings predict lexical mesh maxo
This workflow is fast, simple, and interpretable since it relies on the labels and synonyms for concepts appearing in ontologies and other controlled vocabularies (e.g., MeSH). It takes the following steps:
Index labels and synonyms from entities in various controlled vocabularies (e.g., MeSH, MaXO)
Filter out concepts that already have mappings in the primary source or third-party mappings in Biomappings
Perform all-by-all comparison of indexes for each controlled vocabulary
Filter out mappings that have been previously marked as incorrect (e.g., avoid zombie mappings)
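The four steps above can be sketched in plain Python. This is only an illustration of the idea, not the SSSOM Curator implementation: the normalize, build_index, and predict helpers, the toy vocabularies, and the src:/tgt: identifiers are all made up for this example, and the real workflow gets its labels and synonyms from PyOBO.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def build_index(vocabulary: dict[str, list[str]]) -> dict[str, set[str]]:
    """Step 1: map each normalized label or synonym to the identifiers using it."""
    index: dict[str, set[str]] = defaultdict(set)
    for identifier, names in vocabulary.items():
        for name in names:
            index[normalize(name)].add(identifier)
    return dict(index)


def predict(
    source: dict[str, list[str]],
    target: dict[str, list[str]],
    known: set[tuple[str, str]],
    rejected: set[tuple[str, str]],
) -> list[tuple[str, str]]:
    # Step 2: drop source concepts that already have a mapping
    already_mapped = {subject for subject, _ in known}
    source = {k: v for k, v in source.items() if k not in already_mapped}

    source_index = build_index(source)
    target_index = build_index(target)

    # Step 3: compare the two indexes; every shared string yields candidate pairs
    candidates = {
        (subject, obj)
        for text in source_index.keys() & target_index.keys()
        for subject in source_index[text]
        for obj in target_index[text]
    }

    # Step 4: remove pairs that a curator previously marked as incorrect
    return sorted(candidates - rejected)


# Toy vocabularies with hypothetical identifiers
source = {
    "src:1": ["Aspirin", "acetylsalicylic acid"],
    "src:2": ["Caffeine"],
    "src:3": ["Glucose", "dextrose"],
}
target = {
    "tgt:A": ["aspirin"],
    "tgt:B": ["caffeine"],
    "tgt:C": ["Dextrose"],
    "tgt:D": ["water"],
}
known = {("src:2", "tgt:B")}      # already mapped, so src:2 is skipped
rejected = {("src:3", "tgt:C")}   # previously curated as incorrect

predictions = predict(source, target, known, rejected)
print(predictions)
```

In this toy run, only the aspirin pair survives: the caffeine concept is skipped because it is already mapped, and the glucose/dextrose pair is dropped because it was previously rejected.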
Custom Prediction Workflows
Biomappings can also be extended with custom semantic mapping prediction workflows. Several examples can be found in the scripts/ directory of the upstream Biomappings repository.
The following examples have already been used to predict mappings:
generate_mondo_mappings.py uses a pre-configured lexical matching workflow exposed through sssom_curator.Repository.lexical_prediction_cli()
generate_mesh_uniprot_mappings implements a fully custom lexical matching workflow that uses rule-based matching between MeSH and UniProt proteins that relies on the fact that the MeSH terms were generated from UniProt names. Use this as inspiration for rolling your own workflow.
generate_wikipathways_orthologs uses a rule-based method for matching orthologous pathways in WikiPathways that relies on the fact that the names are procedurally generated with a certain template
We also have a work-in-progress example using pykeen for generating mappings
based on knowledge graph embeddings (both in the transductive and inductive setting).
Warning
For people coming from the machine learning domain, there may be a desire to over-engineer matching methods. In practice, lexical matching gets 80-90% of the job done most of the time when reasonable lexicalizations are available. Resist the urge to make matching workflows overcomplicated!
Clone the Repository
Fork the upstream Biomappings repository
Clone your fork, make a branch, and install it. Note that we’re including the
web and predict-lexical extras, so we can run the curation interface locally as well as get all the tools we need for generating predictions.
$ git clone https://github.com/<your namespace>/biomappings
$ cd biomappings
$ git checkout -b tutorial
$ python -m pip install -e .[web,predict-lexical]
Go into the scripts/ directory
$ cd scripts/
Make a Python file for predictions. In this example, we’re going to generate mappings between the ChEBI ontology and Medical Subject Headings (MeSH).
$ touch generate_chebi_mesh_example.py
Preparing the Mapping Script
Biomappings has a lot of first-party support for lexical prediction workflows, so
generating mappings can be quite easy using a pre-defined workflow. Open your newly
created generate_chebi_mesh_example.py in your favorite editor and add the following
four lines:
# generate_chebi_mesh_example.py
from biomappings import append_lexical_predictions, get_script_url
provenance = get_script_url(__file__)
append_lexical_predictions("chebi", "mesh", provenance=provenance)
All generated mappings in Biomappings should point to the script that generated them.
biomappings.get_script_url() is called in a sneaky way with __file__ to get
the name of the script and generate a URI string, assuming that the script is in
the scripts/ directory of the Biomappings repository.
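To make the idea concrete, here is a rough sketch of how a file path can be turned into a provenance URI. It is not the implementation of biomappings.get_script_url(): the sketch_script_url helper, the base URL, the master reference, and the example path are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed base URL for illustration; the real function may build its URI differently
BASE_URL = "https://github.com/biomappings/biomappings/blob"


def sketch_script_url(path: str, ref: str = "master") -> str:
    """Build a URI pointing at a script, assuming it lives in scripts/."""
    # Only the file name is kept, so the local checkout location does not matter
    name = Path(path).name
    return f"{BASE_URL}/{ref}/scripts/{name}"


# Inside a real script, you would pass __file__ instead of a literal path
url = sketch_script_url("/home/user/biomappings/scripts/generate_chebi_mesh_example.py")
print(url)
```

Because only the file name survives, a script stored outside of scripts/ would get a URI that points at the wrong location, which is why the location assumption matters.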
The hard work is done by biomappings.append_lexical_predictions() when called with
ChEBI as the source prefix and MeSH as the target prefix along with the previously
generated provenance URI string. Under the hood, this does the following:
Looks up the names and synonyms for concepts in ChEBI and MeSH using pyobo, a unified interface for accessing ontologies and non-ontology controlled vocabularies (such as MeSH)
Runs the algorithm described above
Appends the predictions on to the local predictions TSV file
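The last step, appending to a TSV file, can be sketched with the standard library. This is a simplified illustration, not what biomappings.append_lexical_predictions() does internally: the append_rows helper, the column names, and the example rows are assumptions, and the real predictions file has its own column layout.

```python
import csv
import tempfile
from pathlib import Path

# Illustrative columns only; the real predictions file defines its own layout
COLUMNS = ["subject_id", "predicate_id", "object_id", "confidence"]


def append_rows(path: Path, rows: list[dict[str, str]]) -> None:
    """Append rows to a TSV file, writing the header only if the file is new or empty."""
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as file:
        writer = csv.DictWriter(
            file, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Work in a temporary directory so the example does not touch real files
with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "predictions.tsv"
    first = {"subject_id": "src:1", "predicate_id": "skos:exactMatch",
             "object_id": "tgt:A", "confidence": "0.9"}
    second = {"subject_id": "src:3", "predicate_id": "skos:exactMatch",
              "object_id": "tgt:C", "confidence": "0.9"}
    append_rows(path, [first])
    append_rows(path, [second])  # second call appends without repeating the header
    lines = path.read_text().splitlines()

print(lines)
```

Appending rather than rewriting keeps earlier predictions in place, so each run of a prediction script only adds rows to the file.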
Finishing Up
Execute the script from your command line and the predictions will be added to your local Biomappings cache.
$ python generate_chebi_mesh_example.py
This is a good time to review the changes and make a commit using
$ git add src/biomappings/resources/predictions.tsv
$ git commit -m "Add predictions from ChEBI to MeSH"
$ git push
Finally, you can run the web curation interface like normal and search for your new predictions to curate!
$ biomappings web