Rxivist logo

Expanding a Database-derived Biomedical Knowledge Graph via Multi-relation Extraction from Biomedical Abstracts

By David N. Nicholson, Daniel S. Himmelstein, Casey S. Greene

Posted 08 Aug 2019
bioRxiv DOI: 10.1101/730085

Knowledge graphs support multiple research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via some form of manual curation, which is difficult to scale in the context of an increasing publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to automatically annotate textual data. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This makes populating a knowledge graph with multiple nodes and edge types practically infeasible. We sought to accelerate the label function creation process by evaluating the extent to which label functions could be re-used across multiple edge types. We used a subset of an existing knowledge graph centered on disease, compound, and gene entities to evaluate label function re-use. We determined the best label function combination by comparing a baseline database-only model with the same model but added edge-specific or edge-mismatch label functions. We confirmed that adding additional edge-specific rather than edge-mismatch label functions often improves text annotation and shows that this approach can incorporate novel edges into our source knowledge graph. We expect that continued development of this strategy has the potential to swiftly populate knowledge graphs with new discoveries, ensuring that these resources include cutting-edge results.

Download data

  • Downloaded 702 times
  • Download rankings, all-time:
    • Site-wide: 21,655 out of 100,360
    • In bioinformatics: 3,124 out of 9,219
  • Year to date:
    • Site-wide: 6,378 out of 100,360
  • Since beginning of last month:
    • Site-wide: None out of 100,360

Altmetric data

Downloads over time

Distribution of downloads per paper, site-wide


Sign up for the Rxivist weekly newsletter! (Click here for more details.)


  • 20 Oct 2020: Support for sorting preprints using Twitter activity has been removed, at least temporarily, until a new source of social media activity data becomes available.
  • 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
  • 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
  • 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
  • 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
  • 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
  • 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
  • 22 Jan 2019: Nature just published an article about Rxivist and our data.
  • 13 Jan 2019: The Rxivist preprint is live!