Journal:Infrastructure tools to support an effective radiation oncology learning health system

Full article title: Infrastructure tools to support an effective radiation oncology learning health system
Journal: Journal of Applied Clinical Medical Physics
Author(s): Kapoor, Rishabh; Sleeman IV, William C.; Ghosh, Preetam; Palta, Jatinder
Author affiliation(s): Virginia Commonwealth University
Primary contact: rishabh dot kapoor at vcuhealth dot org
Year published: 2023
Volume and issue: 24(10)
Article #: e14127
DOI: 10.1002/acm2.14127
ISSN: 1526-9914
Distribution license: Creative Commons Attribution 4.0 International
Website: https://aapm.onlinelibrary.wiley.com/doi/10.1002/acm2.14127
Download: https://aapm.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/acm2.14127 (PDF)

Abstract

Purpose: The concept of the radiation oncology learning health system (RO‐LHS) represents a promising approach to improving the quality of care by integrating clinical, dosimetry, treatment delivery, and research data in real‐time. This paper describes a novel set of tools to support the development of an RO‐LHS and the current challenges they can address.

Methods: We present a knowledge graph-based approach to map radiotherapy data from clinical databases to an ontology-based data repository using FAIR principles. This strategy ensures that the data are easily discoverable and accessible, and can be used by other clinical decision support systems. It allows for visualization, presentation, and analysis of valuable data and information to identify trends and patterns in patient outcomes. We designed a search engine that utilizes ontology-based keyword searching and synonym-based term matching, leverages the hierarchical nature of ontologies to retrieve patient records based on parent and child classes, and connects to the BioPortal database for retrieval of relevant clinical attributes. To identify similar patients, a method involving text corpus creation and vector embedding models (Word2Vec, Doc2Vec, GloVe, and FastText) is employed, using cosine similarity and distance metrics.

Results: The data pipeline and tool were tested with 1,660 patient clinical and dosimetry records, resulting in 504,180 RDF (Resource Description Framework) tuples and visualized data relationships using graph‐based representations. Patient similarity analysis using embedding models showed that the Word2Vec model had the highest mean cosine similarity, while the GloVe model exhibited more compact embeddings with lower Euclidean and Manhattan distances.

Conclusions: The framework and tools described support the development of an RO‐LHS. By integrating diverse data sources and facilitating data discovery and analysis, they contribute to continuous learning and improvement in patient care. The tools enhance the quality of care by enabling the identification of cohorts, clinical decision support, and the development of clinical studies and machine learning (ML) programs in radiation oncology.

Keywords: FAIR, learning health system infrastructure, ontology, Radiation Oncology Ontology, Semantic Web

Background and significance

For the past three decades, there has been growing interest in building learning organizations to address the most pressing and complex business, social, and economic challenges facing society today. [1] For healthcare, the National Academy of Medicine has defined the concept of a learning health system (LHS) as an entity where science, incentives, culture, and informatics are aligned for continuous innovation, with new knowledge capture and discovery as an integral part of practicing evidence-based medicine. [2] The current dependency on randomized controlled clinical trials, which create scientific evidence in a controlled environment from only a small percentage (<3%) of patients, is inadequate now and may be irrelevant in the future, since these trials take too much time, are too expensive, and are fraught with questions of generalizability. The Agency for Healthcare Research and Quality has also been promoting the development of LHSs as part of a key strategy for healthcare organizations to make transformational changes that improve healthcare quality and value. Large-scale healthcare systems are now recognizing the need to build infrastructure capable of continuous learning and improvement in delivering care to patients and addressing critical population health issues. [3] In an LHS, data should be collected from various sources such as electronic health records (EHRs), treatment delivery records, imaging records, patient-generated data records, and administrative and claims data; the aggregated data can then be analyzed to generate new insights and knowledge that improve patient care and outcomes.

However, only a few attempts have been made at leveraging existing infrastructure tools used in routine clinical practice to transform the healthcare domain into an LHS. [5, 6] Some examples of actual implementation have emerged, but these concepts have largely been discussed in the literature as conceptual ideas and strategies. Several data organization and management challenges must be addressed in order to effectively implement a radiation oncology LHS:

1. Data integration: Radiation oncology data are generated from a variety of sources, including EHRs, imaging systems, treatment planning systems (TPSs), and clinical trials. Integrating these data into a single repository can be challenging due to differences in data formats, terminologies, and storage systems. There is often significant semantic heterogeneity in the way that different clinicians and researchers use terminology to describe radiation oncology data. For example, different institutions may use different codes or terms to describe the same condition or treatment.
2. Data stored in disparate database schemas: Presently, EHR, TPS, and treatment management system (TMS) data are housed in a series of relational database management systems (RDBMSs), which have rigid database structures and varying data schemas, and can include large amounts of uncoded textual data. Tumor registries also store data in their own defined schemas. Although the column names in the relational databases of two software products might be the same, the semantic meaning, depending on the application of use, may be completely different. Changing a database schema requires substantial programming effort and code changes because of the rigid structure of the stored data, and it is generally advisable to retire old tables and build new tables with the added column definitions.
3. Episodic linking of records: Episodic linking of records refers to the process of integrating patient data from multiple encounters or episodes of care into a single comprehensive record. This record includes information about the patient's medical history, diagnosis, treatment plan, and outcomes, which can be used to improve care delivery, research, and education. Linking multiple data sources based on the patient's episodic history of care is quite challenging because these heterogeneous data sources do not normally follow any common data storage standard.
4. Building data query tools based on the semantic meaning of the data: Since the data are currently stored in multiple RDBMSs specifically to serve the operational aspects of patient care, extracting common semantic meaning from these data is very challenging. Common semantic meaning in healthcare data is typically achieved through the use of standardized vocabularies and ontologies that define concepts and the relationships between them. Developing data query tools based on semantic meaning requires a high level of expertise in both the technical and domain-specific aspects of radiation oncology. Moreover, executing complex data queries, which include tree-based queries, recursive queries, and derived data queries, requires multiple table join operations in RDBMSs, which are costly.
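As a concrete illustration of the last point, retrieving a diagnosis code together with all of its descendants from a relational store requires a recursive query with repeated self-joins. The sketch below uses SQLite via Python for brevity; the table layout and the child codes are invented for illustration.

```python
# Hypothetical illustration: a tree-based query (a diagnosis code and all of
# its descendants) needs a recursive common table expression, i.e., repeated
# self-joins, in a relational database. Table and child codes are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE diagnosis (code TEXT PRIMARY KEY, parent TEXT)")
conn.executemany(
    "INSERT INTO diagnosis VALUES (?, ?)",
    [("C61", None),          # prostate cancer, root of this toy hierarchy
     ("C61.1", "C61"),       # invented child code
     ("C61.1.a", "C61.1")],  # invented grandchild code
)

rows = conn.execute("""
    WITH RECURSIVE subtree(code) AS (
        SELECT code FROM diagnosis WHERE code = 'C61'
        UNION ALL
        SELECT d.code FROM diagnosis d JOIN subtree s ON d.parent = s.code
    )
    SELECT code FROM subtree
""").fetchall()

print(sorted(r[0] for r in rows))  # every code in the C61 subtree
```

In a knowledge graph, the same traversal is a property-path query over subclass links rather than a hand-written recursion per table.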

While we are on the cusp of an artificial intelligence (AI) revolution in biomedicine, with the fast-growing development of advanced machine learning (ML) methods that can analyze complex datasets, there is an urgent need for a scalable, intelligent infrastructure that can support these methods. The radiation oncology domain is also one of the most technically advanced medical specialties, with a long history of electronic data generation (e.g., radiation treatment (RT) simulation, treatment planning, etc.) that is modeled for each individual patient. This large volume of patient-specific, real-world data captured during routine clinical practice, dosimetry, and treatment delivery makes the domain ideally suited for rapid learning. [4] Rapid learning concepts could be applied using an LHS, providing the potential to improve patient outcomes and care delivery, reduce costs, and generate new knowledge from real-world clinical and dosimetry data.

Several research groups in radiation oncology, including the University of Michigan, MD Anderson, and Johns Hopkins, have developed data gathering platforms with specific goals. [5] These platforms—such as the M-ROAR platform at the University of Michigan [6], the system-wide electronic data capture platform at MD Anderson [7], and the Oncospace program at Johns Hopkins [8]—have been deployed to collect and assess practice patterns, perform outcome analysis, and capture RT-specific data, including dose distributions, organ-at-risk (OAR) information, images, and outcome data. While these platforms serve specific purposes, they rely on relational database systems without standard ontology-based data definitions. Knowledge graph-based systems offer significant advantages over such systems: they provide a more integrated and comprehensive representation of data by capturing complex relationships, hierarchies, and semantic connections between entities, and they leverage ontologies, which define standardized and structured knowledge, enabling a holistic view of the data and supporting advanced querying and analysis capabilities. Furthermore, by adopting standard ontologies, knowledge graph-based systems promote data interoperability and integration, facilitating collaboration and data sharing across different research groups and institutions. As such, they help ensure that research data are findable, accessible, interoperable, and reusable (FAIR). [22]
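The subclass-aware retrieval that makes knowledge graphs attractive can be shown with a toy example. The sketch below is pure Python over invented subject-predicate-object triples, with class names loosely styled after the ROO; it is not the paper's actual RDF store.

```python
# Minimal sketch of subclass-aware retrieval over invented triples: walking
# "subClassOf" links lets a query for a parent class also find patients
# annotated with any of its descendant classes.
triples = [
    ("patient:1", "hasDiagnosis", "roo:ProstateAdenocarcinoma"),
    ("patient:2", "hasDiagnosis", "roo:LungCancer"),
    ("roo:ProstateAdenocarcinoma", "subClassOf", "roo:ProstateCancer"),
    ("roo:ProstateCancer", "subClassOf", "roo:Cancer"),
    ("roo:LungCancer", "subClassOf", "roo:Cancer"),
]

def descendants(cls):
    """Transitive closure over subClassOf, including the class itself."""
    found, frontier = {cls}, {cls}
    while frontier:
        frontier = {s for s, p, o in triples
                    if p == "subClassOf" and o in frontier} - found
        found |= frontier
    return found

def patients_with(cls):
    """Patients whose diagnosis falls anywhere under the given class."""
    classes = descendants(cls)
    return {s for s, p, o in triples if p == "hasDiagnosis" and o in classes}

print(sorted(patients_with("roo:ProstateCancer")))  # ['patient:1']
print(sorted(patients_with("roo:Cancer")))          # ['patient:1', 'patient:2']
```

A production system would express the same traversal as a SPARQL property path (e.g., `rdfs:subClassOf*`) against the RDF graph rather than hand-rolled Python.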

In this paper, we set out to contribute to the advancement of the science of LHSs by presenting a detailed description of the technical characteristics and infrastructure that were employed to design a radiation oncology LHS specifically with a knowledge graph approach. The paper also describes how we have addressed the challenges that arise when building such a system, particularly in the context of constructing a knowledge graph. The main contributions of our work are as follows:

1. Provides an overview of the sources of data within radiation oncology (EHRs, TPS, TMS) and the mechanism to gather data from these sources in a common database.

2. Maps the gathered data to a standardized terminology and data dictionary for consistency and interoperability. Here we describe the processing layer built for data cleaning, checking for consistency and formatting before the extract, transform, and load (ETL) procedure is performed in a common database.

3. Adds concepts, classes, and relationships from the existing NCI Thesaurus and SNOMED terminologies to the previously published Radiation Oncology Ontology (ROO) to fill gaps where critical LHS elements were missing.

4. Presents a knowledge graph visualization that demonstrates the usefulness of the data, with nodes and relationships for easy understanding by clinical researchers.

5. Develops an ontology-based keyword searching tool that utilizes semantic meaning and relationships to search the RDF knowledge graph for similar patients.

6. Provides a valuable contribution to the field of radiation oncology by describing an LHS infrastructure that facilitates data integration, standardization, and utilization to improve patient care and outcomes.

Material and methods

Gather data from multiple source systems in the radiation oncology domain

The adoption of EHRs in patients' clinical management is rapidly increasing in healthcare, but the use of data from EHRs in clinical research is lagging. The utilization of patient-specific clinical data available in EHRs has the potential to accelerate learning and bring value in several key topics of research, including comparative effectiveness research, cohort identification for clinical trial matching, and quality measure analysis. [9, 10] However, there is an inherent lack of interest in the use of data from the EHR for research purposes, since the EHR and its data were never designed for research. Modern EHR technology has been optimized for capturing health details for clinical record keeping, scheduling, and ordering; capturing data from external sources such as laboratories and diagnostic imaging; and capturing encounter information for billing purposes. [11] Many data elements collected in routine clinical care, which are critical for oncologic care, are neither collected as structured data elements nor with the same defined rigor as those in clinical trials. [12, 13]

Given all these challenges with using data from EHRs, we have designed and built clinical software called the Health Information Gateway Exchange (HINGE). HINGE is a web-based electronic structured data capture system with electronic data sharing interfaces that use the Fast Healthcare Interoperability Resources (FHIR) Health Level 7 (HL7) standard, with the specific goal of collecting accurate, comprehensive, and structured data from EHRs. [14] FHIR is an advanced interoperability standard introduced by the standards developing organization HL7. FHIR builds on the previous HL7 standards (versions 2 and 3) and provides a representational state transfer (REST) architecture, with an application programming interface (API) in Extensible Markup Language (XML) and JavaScript Object Notation (JSON) formats. Additionally, there have been recent regulatory and legislative changes promoting the use of FHIR standards for interoperability and interconnectivity of healthcare systems. [16] HINGE employs FHIR interfaces with EHRs to retrieve required patient details such as demographics; a list of allergies; prescribed active medications; vitals; lab results; surgery, radiology, and pathology reports; active diagnoses; referrals; encounters; and survival information. We have described the design and implementation of HINGE in our previous publication. [15] In summary, HINGE is designed to automatically capture and abstract clinical, treatment planning, and delivery data for cancer patients receiving radiotherapy. The system uses disease site-specific "smart" templates to facilitate the entry of relevant clinical information by physicians and clinical staff. The software processes the extracted data for quality and outcome assessment, using well-defined clinical and dosimetry quality measures defined by disease site experts in radiation oncology. The system connects seamlessly to the local IT/medical infrastructure via interfaces and cloud services, and provides tools to assess variations in radiation oncology practices and outcomes and to determine gaps in the radiotherapy quality delivered by each provider.
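For readers unfamiliar with FHIR's REST/JSON representation, the sketch below parses a minimal FHIR R4 Patient resource of the kind a read interaction (GET [base]/Patient/{id}) returns. The field names follow the FHIR specification, but the record and its values are invented.

```python
# Parse a minimal, invented FHIR R4 Patient resource; field names
# (resourceType, name.family, name.given, gender, birthDate) follow the
# FHIR specification.
import json

response_body = """
{
  "resourceType": "Patient",
  "id": "example",
  "name": [{"family": "Doe", "given": ["Jane"]}],
  "gender": "female",
  "birthDate": "1956-03-14"
}
"""

patient = json.loads(response_body)
assert patient["resourceType"] == "Patient"

# Flatten the pieces a data capture system might store for demographics.
name = patient["name"][0]
demographics = {
    "name": f'{name["given"][0]} {name["family"]}',
    "gender": patient["gender"],
    "birthDate": patient["birthDate"],
}
print(demographics)
```

Real FHIR interfaces return such resources (or Bundles of them) over authenticated HTTP; the parsing step is the same.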

We created a data pipeline from HINGE to export discrete data in a JSON-based format. These data are then fed to the extract, transform, and load (ETL) processor. An overview of the data pipeline is shown in Figure 1. ETL is a three-step process in which the data are first extracted, then transformed (i.e., cleaned and formatted), and finally loaded into an output radiation oncology clinical data warehouse (RO-CDW) repository. Since HINGE templates do not function as case report forms and are formatted around an operational data structure, data cleaning begins with basic preprocessing, including checking for redundancy in the dataset and ignoring null values while making sure each data element has its supporting data elements populated. As there are several types of datasets, each requiring a different type of cleaning, multiple data cleaning scripts have been prepared. The following outlines some of the checks performed by the cleaning scripts.

1. Data type validation: We verified whether the column values were in the correct data types (e.g., integer, string, float). For instance, the “Performance Status Value” column in a patient record should be an integer value.
2. Cross-field consistency check: Some fields require other column values to validate their content. For example, the “Radiotherapy Treatment Start Date” should not be earlier than the “Date of Diagnosis.” We conducted a cross-field validation check to ensure that such conditions were met.
3. Mandatory element check: Certain columns in the input data file cannot be empty, such as “Patient ID Number” and “RT Course ID” in the dataset. We performed a mandatory field check to ensure that these fields were properly filled.
4. Range validation: This check ensures that the values fall within an acceptable range. For example, the “Marital Status” column should contain values between 1 and 9.
5. Format check: We verified the format of data values to ensure that they were consistent with the expected year-month-day (YYYYMMDD) format.
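The five checks above can be sketched roughly as follows. The field names mirror the examples in the text, while the record and the validation logic are simplified illustrations, not the production cleaning scripts.

```python
# Simplified sketch of the five cleaning checks, applied to one invented record.
from datetime import datetime

record = {
    "Patient ID Number": "12345",
    "RT Course ID": "C1",
    "Performance Status Value": "1",
    "Marital Status": 2,
    "Date of Diagnosis": "20230105",
    "Radiotherapy Treatment Start Date": "20230220",
}

errors = []

# 1. Data type validation: performance status must parse as an integer.
if not str(record["Performance Status Value"]).lstrip("-").isdigit():
    errors.append("Performance Status Value is not an integer")

# 5. Format check (YYYYMMDD), reused by the cross-field check below.
def parse_date(field):
    try:
        return datetime.strptime(record[field], "%Y%m%d")
    except ValueError:
        errors.append(f"{field} is not in YYYYMMDD format")

# 2. Cross-field consistency check: treatment cannot start before diagnosis.
dx = parse_date("Date of Diagnosis")
start = parse_date("Radiotherapy Treatment Start Date")
if dx and start and start < dx:
    errors.append("Treatment start precedes date of diagnosis")

# 3. Mandatory element check.
for field in ("Patient ID Number", "RT Course ID"):
    if not record.get(field):
        errors.append(f"{field} is missing")

# 4. Range validation: marital status is coded 1-9.
if not 1 <= int(record["Marital Status"]) <= 9:
    errors.append("Marital Status out of range 1-9")

print(errors)  # [] for this clean record
```

In practice each dataset type gets its own script of such checks, and records with non-empty error lists are routed back for correction rather than loaded into the RO-CDW.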



Figure 1. Overview of the data pipeline to gather clinical data into the radiation oncology clinical data warehouse (RO-CDW). As part of this pipeline, we have built HL7/FHIR interfaces between the EHR system and HINGE database to gather pertinent information from the patient's chart. These data are stored in the HINGE database and used to auto-populate disease-site-specific smart templates that depict the clinical workflow from initial consultation to follow-up care. The providers record their clinical assessments in these templates as part of their routine clinical care. Once the templates are finalized and signed by the providers in HINGE, the data are exported in JSON format, and using an ETL process, we can load the data in our RO-CDW's relational SQL database. Additionally, we use SQL stored procedures to extract, transform, and load data from the Varian Aria data tables and extraction of dosimetry dose-volume histogram (DVH) curves to our RO-CDW.

The main purpose of this step is to ensure that the dataset is of high quality and fidelity when loaded into the RO-CDW. In the data loading process, we have written SQL- and .NET-based scripts to transform the data into an RO-CDW-compatible schema and load them into a Microsoft SQL Server 2016 database. When the data are populated, unique identifiers are assigned to each data table entry, and interrelationships are maintained within the tables so that investigators can use query tools to query and retrieve the data, identify patient cohorts, and analyze the data.

We have deployed a free, open-source, and lightweight DICOM server known as Orthanc [17] to collect DICOM-RT datasets from any commercial TPS. Orthanc is a simple yet powerful standalone DICOM server designed to support research and provide query/retrieve functionality for DICOM datasets. Orthanc provides a RESTful API that makes it possible, from any programming language, to download the DICOM tags stored in the datasets in JSON format. We used the Python plug-in to connect with the Orthanc database and extract the relevant tag data from the DICOM-RT files. Orthanc was able to seamlessly connect with the Varian Eclipse planning system via the DICOM DIMSE C-STORE protocol. [18] Since the TPS conforms to the specifications listed under the Integrating the Healthcare Enterprise—Radiation Oncology (IHE-RO) profile, the DICOM-RT datasets contained all the relevant tags required for data extraction.
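To illustrate the kind of payload involved, the sketch below parses a simplified-tags style JSON response such as Orthanc's REST API returns for a stored instance. The response body here is fabricated; in practice it would be fetched over HTTP from the Orthanc server (e.g., via `urllib.request`), and the tag names follow standard DICOM attributes.

```python
# Parse an Orthanc-style simplified-tags JSON payload (fabricated example)
# to pick out the attributes a dosimetry pipeline cares about.
import json

response_body = """
{
  "Modality": "RTDOSE",
  "PatientID": "PAT-0001",
  "SOPClassUID": "1.2.840.10008.5.1.4.1.1.481.2",
  "DoseUnits": "GY",
  "DoseSummationType": "PLAN"
}
"""

tags = json.loads(response_body)

# Route RT Dose instances to the DVH extraction step.
if tags["Modality"] == "RTDOSE":
    print(tags["PatientID"], tags["DoseUnits"], tags["DoseSummationType"])
```

The same pattern extends to RTSTRUCT and RTPLAN instances, with the tag set varying per modality.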

One of the major challenges with examining patients' DICOM-RT data is the lack of standardized organ-at-risk (OAR) and target names, as well as ambiguity regarding dose-volume histogram metrics and multiple prescriptions spread across several treatment techniques. To overcome these challenges, the AAPM TG-263 initiative has published its recommendations on OAR and target nomenclature. The ETL user interface deploys this standardized nomenclature and requires the importer of the data to match each structure with its corresponding standard OAR or target name. In addition, this program also suggests a matching name based on an automated relabeling process using our published techniques (OAR labels [19], radiomics features [20], and geometric information [21]). We find that these automated approaches provide acceptable accuracy for the standard prostate and lung structure types. To gather the dose-volume histogram data from the DICOM-RT dose and structure set files, we have deployed DICOM-RT dosimetry parser software. If the DICOM-RT dose file exported by the TPS contains dose-volume histogram (DVH) information, we utilize it; if the file lacks this information, we employ our dosimetry parser software to calculate the DVH values from the dose and structure set volume information.
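A cumulative DVH of the kind such a parser computes can be sketched as follows. The per-voxel doses are invented, and the real software works on DICOM-RT dose grids and structure set contours rather than a flat list of samples.

```python
# Simplified cumulative DVH: for each dose level, the fraction of the
# structure's voxels receiving at least that dose. Voxel doses (in Gy)
# are invented; real input comes from DICOM-RT dose and structure sets.
def cumulative_dvh(voxel_doses_gy, bin_width_gy=1.0):
    """Return (dose_bin_edge, volume_fraction_at_or_above) pairs."""
    n = len(voxel_doses_gy)
    max_dose = max(voxel_doses_gy)
    points = []
    d = 0.0
    while d <= max_dose:
        frac = sum(1 for v in voxel_doses_gy if v >= d) / n
        points.append((d, frac))
        d += bin_width_gy
    return points

doses = [0.5, 1.2, 2.7, 3.1, 3.3, 3.9]
for dose, frac in cumulative_dvh(doses):
    print(f"V{dose:.0f}Gy = {frac:.2f}")
```

Clinical metrics such as V20Gy or D95% are then read directly off this curve (or its inverse).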



Abbreviations, acronyms, and initialisms

Acknowledgements

References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases important information was missing from the references, and that information was added.