Using Machine Learning to Parse Breast Pathology Reports
Julliette M Buckley,
Suzanne B Coopey,
Judy E Garber,
Barbara L Smith,
Michele A Gadd,
Michelle C Specht,
Thomas M Gudewicz,
Kevin S Hughes
Posted 10 Oct 2016
bioRxiv DOI: 10.1101/079913 (published DOI: 10.1007/s10549-016-4035-1)
Purpose: Extracting information from electronic medical records is a time-consuming and expensive process when done manually. Rule-based and machine learning techniques are two approaches to solving this problem. In this study, we trained a machine learning model on pathology reports to extract pertinent tumor characteristics, which enabled us to create a large database of attribute-searchable pathology reports. This database can be used to identify cohorts of patients with characteristics of interest.

Methods: We collected a total of 91,505 breast pathology reports from three Partners hospitals: Massachusetts General Hospital (MGH), Brigham and Women's Hospital (BWH), and Newton-Wellesley Hospital (NWH), covering the period from 1978 to 2016. We trained our system with annotations from two datasets, consisting of 6,295 and 10,841 manually annotated reports. The system extracts 20 separate categories of information, including atypia types and various tumor characteristics such as receptors. We also report a learning curve analysis to show how much annotation our model needs to perform reasonably.

Results: The model's accuracy was tested on 500 reports that did not overlap with the training set. The model achieved an accuracy of 90% for correctly parsing all carcinoma and atypia categories for a given patient. The average accuracy for individual categories was 97%. Using this classifier, we created a database of 91,505 parsed pathology reports.

Conclusions: Our learning curve analysis shows that the model can achieve reasonable results even when trained on only a few annotations. We developed a user-friendly interface to the database that allows physicians to easily identify patients with target characteristics and export the matching cohort. This model has the potential to reduce the effort required for analyzing large amounts of data from medical records, and to minimize the cost and time required to glean scientific insight from these data.
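The abstract reports two different accuracy figures: one for parsing every category of a report correctly (90%) and one averaged over individual categories (97%). The sketch below illustrates how those two measures can diverge. It is not the authors' evaluation code; the category names, labels, and the four toy reports are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Dict, List

# One parsed report: category name -> extracted label.
# Category names and labels here are illustrative assumptions, not the paper's schema.
Report = Dict[str, str]


def per_category_accuracy(gold: List[Report], pred: List[Report]) -> Dict[str, float]:
    """Fraction of reports labeled correctly, computed separately per category."""
    n = len(gold)
    return {
        category: sum(g[category] == p[category] for g, p in zip(gold, pred)) / n
        for category in gold[0]
    }


def exact_match_accuracy(gold: List[Report], pred: List[Report]) -> float:
    """Fraction of reports for which every category is labeled correctly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical gold annotations for four reports.
gold = [
    {"carcinoma": "present", "atypia": "absent", "er": "positive"},
    {"carcinoma": "absent", "atypia": "present", "er": "n/a"},
    {"carcinoma": "present", "atypia": "absent", "er": "negative"},
    {"carcinoma": "absent", "atypia": "absent", "er": "n/a"},
]

# Hypothetical model output: identical to gold except one atypia label.
pred = [dict(report) for report in gold]
pred[1]["atypia"] = "absent"

per_cat = per_category_accuracy(gold, pred)
mean_cat = sum(per_cat.values()) / len(per_cat)
exact = exact_match_accuracy(gold, pred)

print(per_cat)
print(f"mean per-category accuracy: {mean_cat:.3f}")
print(f"exact-match accuracy:       {exact:.3f}")
```

A single wrong label lowers the exact-match score by a whole report but lowers the per-category average by only one cell, which is why the all-categories figure is expected to sit below the per-category average.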
- Downloaded 2,445 times
- Download rankings, all-time:
  - Site-wide: 4,511 out of 118,465
  - In pathology: 28 out of 695
- Year to date:
  - Site-wide: 15,534 out of 118,465
- Since beginning of last month:
  - Site-wide: 18,512 out of 118,465