Journal:Pathology report data extraction from relational database using R, with extraction from reports on melanoma of skin as an example

From LIMSWiki
Revision as of 22:07, 16 November 2016 by Shawndouglas (talk | contribs) (Created stub. Saving and adding more.)
Full article title Pathology report data extraction from relational database using R, with extraction from reports on melanoma of skin as an example
Journal Journal of Pathology Informatics
Author(s) Ye, Jay J.
Author affiliation(s) Dahl-Chase Pathology Associates
Year published 2016
Volume and issue 7
Page(s) 44
DOI 10.4103/2153-3539.192822
ISSN 2153-3539
Distribution license Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
Website http://www.jpathinformatics.org
Download http://www.jpathinformatics.org/temp/JPatholInform7144-5899807_162318.pdf (PDF)

Abstract

Background: Different methods have been described for data extraction from pathology reports, with varying degrees of success. Here, a technique for extracting data directly from a relational database is described.

Methods: Our department uses synoptic reports modified from the College of American Pathologists (CAP) Cancer Protocol Templates to report most of our cancer diagnoses. Choosing the melanoma of skin synoptic report as an example, the R scripting language, extended with the RODBC package, was used to query the pathology information system database. Reports from the past four and a half years containing the melanoma of skin synoptic report were retrieved, and individual data elements were extracted. Using the retrieved list of cases, the database was queried a second time to retrieve and extract the lymph node staging information in subsequent reports from the same patients.

Results: 426 synoptic reports corresponding to unique lesions of melanoma of skin were retrieved, and data elements of interest were extracted into an R data frame. The distribution of Breslow depth of melanomas grouped by year is used as an example of intra-report data extraction and analysis. When new pN staging information was present in the subsequent reports, 82% (77/94) was precisely retrieved (pN0, pN1, pN2, and pN3). An additional 15% (14/94) was retrieved with some ambiguity (either as positive, or only as an indication that there was an update). The specificity was 100% for both. The relationship between Breslow depth and lymph node status was graphed as an example of lesion-specific multi-report data extraction and analysis.

Conclusions: R extended with the RODBC package is a simple and versatile approach well suited for the above tasks. The success or failure of the retrieval and extraction depended largely on whether the reports were formatted and whether the contents of the elements were consistently phrased. This approach can be easily modified and adopted for other pathology information systems that use a relational database for data management.

Keywords: Pathology report data extraction, R, SQL database

Introduction

Reporting major cancers with checklists/synoptic reports is mandated by the College of American Pathologists and the American College of Surgeons Commission on Cancer.[1][2] The synoptic reporting format helps to ensure the completeness of the reports and lessens the chance of pathologists omitting relevant information; format consistency also makes it easier for the treating physicians to grasp all the relevant information.[3][4][5] The standardization can, therefore, improve the quality of patient care.

The underlying mechanisms for generating and storing information for synoptic reports vary: the information may be stored as a continuous string of text or as structured, individual data elements.[5][6] Since the information is communicated to the treating physician as a text-based report, this variability does not affect the treating physicians or the individual patients treated. However, it does have implications for how pathologists prepare the reports, whether there is added cost involved in generating the reports, how the data are reported to the cancer registrars, and how readily the underlying information can be retrieved and used for research and quality assurance.

Different approaches have been used to extract individual elements from pathology reports. Natural language processing (NLP) has been used to extract information from breast carcinoma pathology reports with variable degrees of success.[7][8] Recently, Boag described a simpler yet powerful approach: the R programming language was used to extract and analyze data from discrete synoptic pathology reports (from the reports of prostate needle core biopsies).[9] First, all the reports with synoptic reports of prostate needle core biopsies were retrieved using a built-in report retrieval mechanism of their pathology information system. These report texts were uniformly formatted and consistently phrased since they were generated by third-party software that captures individual data elements discretely (mTuitive xPert Cancer Reporting version 3 software, mTuitive Corporation, Centerville, MA, USA). Second, after file-type conversion, the texts were read into R, and the individual data elements were extracted and used for analysis.
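As a minimal illustration of why uniformly formatted text is easy to process, the sketch below splits "Label: value" lines into a data frame using base R only. It is not the code from the cited work; the sample lines, the label names, and the helper `parse_synoptic` are hypothetical and chosen only for demonstration.

```r
# Hypothetical, uniformly formatted synoptic lines (made up for this sketch).
report <- c(
  "Procedure: Needle core biopsy",
  "Gleason score: 3 + 4 = 7",
  "Number of cores positive: 2",
  "End of report"
)

# Split each "Label: value" line at the first colon; lines without a colon
# are dropped because they carry no labeled data element.
parse_synoptic <- function(lines) {
  lines <- lines[grepl(":", lines, fixed = TRUE)]
  label <- trimws(sub(":.*$", "", lines))      # text before the first colon
  value <- trimws(sub("^[^:]*:", "", lines))   # text after the first colon
  data.frame(label = label, value = value, stringsAsFactors = FALSE)
}

elements <- parse_synoptic(report)
print(elements)
```

When every report uses the same labels and separators, a fixed rule like this is sufficient; inconsistent phrasing calls for more tolerant patterns.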

Using melanoma of skin as an example, the above approach has been extended in the following ways: having the R script interact directly with the database (through the RODBC package), applying R to non-uniformly formatted and semi-consistently expressed report texts, and performing lesion-specific retrieval and analysis across multiple reports. The process is described in sufficient detail, including key portions of the R code, to enable readers with some R programming knowledge to test the approach in their own systems.
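The two ideas named above, querying the database from R and tolerating inconsistent phrasing, can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the article's code: the DSN `pathology_dsn`, the table `reports`, its columns, the sample texts, and both helper functions are hypothetical. Only the RODBC functions `odbcConnect`, `sqlQuery`, and `odbcClose` are real, and the database function is defined but not called, so the demonstration runs without a database.

```r
# Hypothetical database step: DSN, table, and column names are assumptions.
fetch_melanoma_reports <- function(dsn = "pathology_dsn") {
  channel <- RODBC::odbcConnect(dsn)      # requires the RODBC package
  on.exit(RODBC::odbcClose(channel))      # release the connection on exit
  RODBC::sqlQuery(
    channel,
    "SELECT case_id, report_text FROM reports
     WHERE report_text LIKE '%Breslow%'",
    stringsAsFactors = FALSE
  )
}

# Pull a Breslow depth in mm out of semi-consistently phrased text.
# The pattern ignores case and allows varying punctuation and spacing
# between the label and the number; unmatched reports yield NA.
extract_breslow_mm <- function(report_text) {
  pattern <- "breslow[^0-9\n]*([0-9]+(\\.[0-9]+)?)[[:space:]]*mm"
  m <- regmatches(report_text,
                  regexec(pattern, report_text, ignore.case = TRUE))
  vapply(m, function(x) if (length(x) >= 2) as.numeric(x[2]) else NA_real_,
         numeric(1))
}

# Demonstration on made-up report texts standing in for query results.
reports <- data.frame(
  case_id = c("A", "B", "C"),
  report_text = c(
    "Maximum tumor thickness (Breslow depth): 1.25 mm",
    "BRESLOW DEPTH - 0.4mm",
    "Breslow depth: cannot be determined"
  ),
  stringsAsFactors = FALSE
)
reports$breslow_mm <- extract_breslow_mm(reports$report_text)
print(reports[, c("case_id", "breslow_mm")])
```

Returning `NA` for reports that do not match keeps failed extractions visible, so they can be reviewed manually rather than silently dropped.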

References

  1. Amin, M.B. (2010). "The 2009 Version of the Cancer Protocols of the College of American Pathologists: A Continuing Journey From “Guidelines for Pathologists” to “Standards for Multidisciplinary Comprehensive Cancer Care”". Archives of Pathology & Laboratory Medicine 134 (3): 326-330. doi:10.1043/1543-2165-134.3.326. 
  2. Commission on Cancer, American College of Surgeons (2015). "Cancer Program Standards: Ensuring Patient-Centered Care, 2016 Edition". American College of Surgeons. pp. 83. https://www.facs.org/quality%20programs/cancer/coc/standards. Retrieved 12 October 2016. 
  3. Messenger, D.E.; McLeod, R.S.; Kirsch, R. (2011). "What impact has the introduction of a synoptic report for rectal cancer had on reporting outcomes for specialist gastrointestinal and nongastrointestinal pathologists?". Archives of Pathology & Laboratory Medicine 135 (11): 1471-5. doi:10.5858/arpa.2010-0558-OA. PMID 22032575.
  4. Lankshear, S.; Srigley, J.; McGowan, T. et al. (2013). "Standardized synoptic cancer pathology reports - so what and who cares? A population-based satisfaction survey of 970 pathologists, surgeons, and oncologists". Archives of Pathology & Laboratory Medicine 137 (11): 1599-602. doi:10.5858/arpa.2012-0656-OA. PMID 23432456. 
  5. Amin, W.; Sirintrapun, S.J.; Parwani, A.V. (2010). "Utility and applications of synoptic reporting in pathology". Open Access Bioinformatics 2010 (2): 105-112. doi:10.2147/OAB.S12295.
  6. Baskovich, B.W.; Allan, R.W. (2011). "Web-based synoptic reporting for cancer checklists". Journal of Pathology Informatics 2: 16. doi:10.4103/2153-3539.78039. PMC 3073063. PMID 21572504. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3073063.
  7. Wieneke, A.E.; Bowles, E.J.; Cronkite, D. et al. (2015). "Validation of natural language processing to extract breast cancer pathology procedures and results". Journal of Pathology Informatics 6: 38. doi:10.4103/2153-3539.159215. PMC 4485196. PMID 26167382. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4485196.
  8. Buckley, J.M.; Coopey, S.B.; Sharko, J. et al. (2012). "The feasibility of using natural language processing to extract clinical information from breast pathology reports". Journal of Pathology Informatics 3: 23. doi:10.4103/2153-3539.97788. PMC 3424662. PMID 22934236. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3424662.
  9. Boag, A. (2015). "Extraction and analysis of discrete synoptic pathology report data using R". Journal of Pathology Informatics 6: 62. doi:10.4103/2153-3539.170649. PMC 4687157. PMID 26730352. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4687157.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.