Journal:Pathology report data extraction from relational database using R, with extraction from reports on melanoma of skin as an example

Full article title	Pathology report data extraction from relational database using R,; with extraction from reports on melanoma of skin as an example
Journal	Journal of Pathology Informatics
Author(s)	Ye, Jay J.
Author affiliation(s)	Dahl-Chase Pathology Associates
Primary contact	Email: Log in to original site to view
Year published	2016
Volume and issue	7
Page(s)	44
DOI	10.4103/2153-3539.192822
ISSN	2153-3539
Distribution license	Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
Website	http://www.jpathinformatics.org
Download	http://www.jpathinformatics.org/temp/JPatholInform7144-5899807_162318.pdf (PDF)

This article should not be considered complete until this message box has been removed. This is a work in progress.

Abstract

Background: Different methods have been described for data extraction from pathology reports with varying degrees of success. Here a technique for directly extracting data from relational database is described.

Methods: Our department uses synoptic reports modified from College of American Pathologists (CAP) Cancer Protocol Templates to report most of our cancer diagnoses. Choosing the melanoma of skin synoptic report as an example, R scripting language extended with the RODBC package was used to query the pathology information system database. Reports containing melanoma of skin synoptic report in the past four and a half years were retrieved and individual data elements were extracted. Using the retrieved list of the cases, the database was queried a second time to retrieve/extract the lymph node staging information in the subsequent reports from the same patients.

Results: 426 synoptic reports corresponding to unique lesions of melanoma of skin were retrieved, and data elements of interest were extracted into an R data frame. The distribution of Breslow depth of melanomas grouped by year is used as an example of intra-report data extraction and analysis. When the new pN staging information was present in the subsequent reports, 82% (77/94) was precisely retrieved (pN0, pN1, pN2 and pN3). Additional 15% (14/94) was retrieved with certain ambiguity (positive or knowing there was an update). The specificity was 100% for both. The relationship between Breslow depth and lymph node status was graphed as an example of lesion-specific multi-report data extraction and analysis.

Conclusions: R extended with the RODBC package is a simple and versatile approach well-suited for the above tasks. The success or failure of the retrieval and extraction depended largely on whether the reports were formatted and whether the contents of the elements were consistently phrased. This approach can be easily modified and adopted for other pathology information systems that use relational database for data management.

Keywords: Pathology report data extraction, R, SQL database

Introduction

Reporting major cancers with checklists/synoptic reports has been a mandated requirement by the College of American Pathologists and the American College of Surgeons - Commission on Cancer.^[1]^[2] The use of the synoptic reporting format helps to ensure the completeness of the reports and lessen the chance of pathologists omitting relevant information; format consistency also makes it easier for the treating physicians to grasp all the relevant information.^[3]^[4]^[5] The standardization can, therefore, improve the quality of patient care.

The underlying mechanisms for generating and storing information for synoptic reports may vary, including both a continuous string of text and as structured individualized elements.^[5]^[6] Since the communication of the information to the treating physician is text-based reports, this variability does not affect the treating physicians or the individual patients treated. However, it does have implications on how pathologists prepare the reports, whether there is added cost involved in generating the reports, how the data are reported to the cancer registrars, and how readily the underlying information can be retrieved and used for the purpose of research and quality assurance.

Different approaches have been used to extract individual elements in the pathology reports. Natural language processing (NLP) has been used to extract information from breast carcinoma pathology reports with variable degrees of success.^[7]^[8] Recently, Boag described a simpler yet powerful approach: the R programing language was used to extract and analyze data from discrete synoptic pathology reports (from the reports of prostate needle core biopsies).^[9] First, all the reports with synoptic reports of prostate needle core biopsies were retrieved using a built-in report retrieving mechanism of their pathology information system. These report texts were uniformly formatted and consistently phrased since they were generated by a third-party software that captures individual data elements discretely (mTuitive xPert Cancer Reporting version 3 software, mTuitive Corporation, Centerville, MA, USA). Second, after file-type conversion, the texts were read into R, and the individual data elements were extracted and used for analysis.

Using melanoma of skin as an example, the above approach has been extended in the following ways: having R script directly interact with the database (through RODBC package), applying R to non-uniformly formatted and semi-consistently expressed report texts, and performing lesion-specific retrieval and analysis across multiple reports. The process is described in sufficient detail, including providing key portions of the R code, so as to enable readers with some R programing knowledge to test out the approach in their own systems.

Materials and methods

The computer workstation is a desktop PC with an Intel Core i5-3470 CPU @ 3.20GHz and 4GB RAM. The pathology information system is PowerPath 10.0.0.19 (Sunquest), with the Advanced Material Processing module. The backend database management system for PowerPath is Microsoft SQL server 2005. Open source programing language R version 3.3.1 (https://www.r-project.org) is used for interacting with the PowerPath database and for data extraction, tidying, and analysis. RStudio version 0.99.489 (https://www.rstudio.com) is the integrated development environment used for both running the R script and interactively analyzing the data.

Figure 1 shows the R script used to retrieve data from the PowerPath database and to extract the elements of interest from the retrieved data. The lines starting with a hash mark (#) are the comment lines for human readers only. The script first loads the required packages: RODBC, dplyr, stringr, gdata, splitstackshape, and ggplot2. It then makes a connection to the database, retrieves data, and closes the connection. The server IP address, UID, and PWD in the line to open a connection to database have been modified to protect the security of the system. The retrieved data are in the format of R data frame with four columns: accession number, patient ID, created date, and text containing the melanoma of skin synoptic report.

Figure 1. R code for retrieving data from database and for performing preliminary data element extraction. Lines start with hash mark are comment lines, not executed by R interpreter. The content of the connection string in line conn <- odbcDriverConnect("driver={SQL Server}; SERVER = 192.168.*.***,****; DATABASE = Pathology_ Database; UID = Jay_Ye; PWD = Open_Sesame") is modified for confidentiality. Twenty-nine (29) lines of code for selecting out synoptic reports corresponding to unique lesions are omitted. See Materials and Methods for additional description.

In our system, not every synoptic report represents a unique lesion. Reports of some lesions are copied and re-accessioned under a different case number for the subsequent molecular send out tests. These repetitions have the "acc_type" of 184 in our database and are excluded from the retrieval by adding the condition a.acc_type_id <>184 to the WHERE clause of the query.

Among the retrieved synoptic reports, while most patients have only one synoptic report, a minority of patients have multiple different synoptic reports that represent the following situations: (1) one synoptic report for biopsies, followed by one for excisions with or without concomitant lymph node biopsies, (2) metachronous multiple primaries, and (3) synchronous multiple primaries. The following rules are set to programmatically obtain a data frame with each synoptic report corresponding to a unique lesion of melanoma. For patients with two synoptic reports, if the dates of these two reports are within 120 days, the second synoptic report is treated as the synoptic report for the excision specimen and only the second synoptic report is kept for data extraction. If the two reports are more than 120 days apart, the synoptic reports are treated as two different primary melanomas and both reports are kept for data extraction. If the patient has more than two reports, only the last one is kept for data extraction. Since these are futures idiosyncratic to our system, the code to obtain a data frame containing synoptic reports for unique lesion is not discussed in detail.

Then, the script reads a. txt file (Figure 2) into memory so that it has the information as to what elements to extract. Subsequently, the script extracts the elements of interest from the retrieved data and adds each element as a new column of the data frame.

Figure 2. Content of text file (.txt) containing elements of interest that are to be extracted from the text of melanoma synoptic report. This file is read into memory by the line mycolnames<-read.table("./melcolnames.txt", sep="\n", stringsAsFactors = FALSE)$V1 in the R script. Each line of the file is used in line pattern <-paste0(mycolnames[i],"(.*)?\r\n") to form portion of a Regular Expression text pattern for data element extraction; the extraction is performed in the immediate following line element <-str_match (ms$finding_text, pattern)[,2].

Combining Breslow depth of melanoma obtained from two different formats of synoptic reports, extracting the numeric value and the subsequent summary analysis using these values are shown below.

The two columns of data "MaximumTumorThickness" and "DepthofInvasion" can be merged to form a new column "depth":

Depth<-coalesce (ms$MaximumTumorThickness, ms$DepthofInvasion)

Then, the merged data can be used to extract the numeric portion of the text, converted to the R numeric data type, and added to the original data frame:

pured<-str_match (depth,"([0-9.]{1, 5})(.*)?(mm ǀ millimeter)")

Breslowdepth<-as.numeric (pured[,2])

ms<-cbind (ms, 'BreslowDepth (mm)' = Breslowdepth)

Using the case number of retrieved list of cases, for each case, the script goes back to the PowerPath database to retrieve subsequent reports within 120 days of the synoptic report for the same patient. The patient ID associated with the accession number of the synoptic report is used to bridge the synoptic reports and the subsequent reports. The particular query used in our system is as follows ("myAccNum" is the variable containing the accession number of the synoptic report):

DECLARE @acc_num varchar (60) SET @acc_num myAccNum

SELECT accession_no, COALESCE (finding, finding_text) as report

FROM path_rpt_heading as heading JOIN acc_results r on heading.id=r.heading_id JOIN accession_2 a on r.acc_id=a.id

WHERE acc_id in (select id from accession_2 where patient_id=(select patient_id from accession_2 where accession_no = @acc_num)) AND a.accession_no > @acc_num AND a.created_date <= (select dateadd (day, 120, a2.created_date) from accession_2 a2 where a2.accession_no = myAccNum) AND a.acc_type_id <>184 AND accession_no LIKE 'S-%' AND (heading.name LIKE 'COMMENT%' OR heading.name LIKE 'DIAGNOSIS%' OR heading.name LIKE 'AMENDED DIAGNOSIS%')

PowerPath implemented the database table in such a way that each section of the report text, such as DIAGNOSIS, COMMENT, CLINICAL INFORMATION, GROSS DESCRIPTION, and so on, is stored in different rows, and they are stored in either column "finding" or column "finding_text" (but not both) depending on the length of the text for the given section. The column-wise concatenation was performed within a SQL query using the function COALESCE, and the row-wise concatenation for each report was performed within R. The resulting data frame "reportbycase" contained DIAGNOSIS, COMMENT, AND AMENDED DIAGNOSIS section of the report text for each retrieved report.

These reports are used for the extraction of potential new staging information. Many reports of lymph node biopsies contain a summary sentence in the comment such as "the updated melanoma stage is pT1a, pN0, and pMX." The staging information from such reports can be extracted using the following R code:

reportbycase$pT <-str_match (reportbycase$report,"(melanoma ǀ Melanoma)(.*)?(pT[a-oq-wyzA-MO-WYZ0-9]{1,3})")[,4]

reportbycase$pN <-str_match (reportbycase$report,"(melanoma ǀ Melanoma)(.*)?(pN[a-oq-wyzA-WYZ0-9]{1,3})(.sn.)?")[,4]

reportbycase$pM <-str_match (reportbycase$report,"(melanoma ǀ Melanoma)(.*)?(pM[a-oq-wyzA-WYZ0-9]{1,3})")[,4]

The following code is used for the report text when the extraction described above returns NA:

for (i in 1:numRpt){

if (is.na (reportbycase$pN[i])){

if (!is.na (str_match (reportbycase$report[i],"(Melanoma ǀ melanoma ǀ Melan)")) and !is.na (str_match (reportbycase$report[i],"(Sentinel ǀ sentinel ǀǀ node)"))){

if (!is.na (str_match (reportbycase$report[i], "\\([1-9]/"))){

reportbycase$pN[i] <- 'pNp'

} else if (!is.na (str_match (reportbycase$report[i], "\\([0]/"))){

reportbycase$pN[i] <- 'pN0'

} else {

reportbycase$pN[i] <- 'pN?'

}

The above code utilizes the observation that many lymph node reports, even without summarizing sentences, would include the ratio of positive nodes to the total number of nodes within parentheses in the diagnosis for each specimen such as 0/1 denoting one node being negative and 2/3 denoting two of three nodes being positive. Three potential results can be retrieved: pN0, pNp (at least one node is positive), and pN? (there is a lymph node biopsy, but the result is not automatically retrievable).

The highest stage obtained from all the subsequent reports is used as the updated staging information for that patient. The highest stage for pT, pN, and pM is selected out from all the reports using the following code:

pT_u <- as.character(max(reportbycase$pT, na.rm = TRUE))

pN_u <- as.character(max(reportbycase$pN, na.rm = TRUE))

pM_u <- as.character(max(reportbycase$pM, na.rm = TRUE))

This process is repeated for each synoptic report.

The relationship between the lymph node status and Breslow depth of melanoma was then explored to demonstrate that information dispersed among different reports can be collated and used for data analysis.

References

↑ Amin, M.B. (2010). "The 2009 Version of the Cancer Protocols of the College of American Pathologists: A Continuing Journey From “Guidelines for Pathologists” to “Standards for Multidisciplinary Comprehensive Cancer Care”". Archives of Pathology & Laboratory Medicine 134 (3): 326-330. doi:10.1043/1543-2165-134.3.326.
↑ Commission on Cancer, American College of Surgeons (2015). "Cancer Program Standards: Ensuring Patient-Centered Care, 2016 Edition". American College of Surgeons. pp. 83. https://www.facs.org/quality%20programs/cancer/coc/standards. Retrieved 12 October 2016.
↑ Messenger, D.E.; McLeaod, R.S.; Kirsch, R. (2011). "What impact has the introduction of a synoptic report for rectal cancer had on reporting outcomes for specialist gastrointestinal and nongastrointestinal pathologists?". Archives of Pathology & Laboratory Medicine 135 (11): 1471-5. doi:10.5858/arpa.2010-0558-OA. PMID 22032575.
↑ Lankshear, S.; Srigley, J.; McGowan, T. et al. (2013). "Standardized synoptic cancer pathology reports - so what and who cares? A population-based satisfaction survey of 970 pathologists, surgeons, and oncologists". Archives of Pathology & Laboratory Medicine 137 (11): 1599-602. doi:10.5858/arpa.2012-0656-OA. PMID 23432456.
↑ ^5.0 ^5.1 Amin, W.; Sirintrapun, S.J.; Parwani, A.V. (2010). "Utility and applications of synoptic reporting in pathology". Open Access Bioinformatics 2010 (2): 105—112. doi:10.2147/OAB.S12295.
↑ Baskovich, B.W.; Allan, R.W. (2011). "Web-based synoptic reporting for cancer checklists". Journal of Pathology Informatics 2: 16. doi:10.4103/2153-3539.78039. PMC PMC3073063. PMID 21572504. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3073063.
↑ Wieneke, A.E.; Bowles, E.J.; Cronkite, D. et al. (2015). "Validation of natural language processing to extract breast cancer pathology procedures and results". Journal of Pathology Informatics 6: 38. doi:10.4103/2153-3539.159215. PMC PMC4485196. PMID 26167382. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4485196.
↑ Buckley, J.M.; Coopey, S.B.; Sharko, J. et al. (2012). "The feasibility of using natural language processing to extract clinical information from breast pathology reports". Journal of Pathology Informatics 3: 23. doi:10.4103/2153-3539.97788. PMC PMC3424662. PMID 22934236. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3424662.
↑ Boag, A. (2015). "Extraction and analysis of discrete synoptic pathology report data using R". Journal of Pathology Informatics 6: 62. doi:10.4103/2153-3539.170649. PMC PMC4687157. PMID 26730352. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4687157.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.