Journal:FAIR Health Informatics: A health informatics framework for verifiable and explainable data analysis

Full article title	FAIR Health Informatics: A health informatics framework for verifiable and explainable data analysis
Journal	Healthcare
Author(s)	Siddiqi, Muhammad H.; Idris, Muhammad; Alruwaili, Madallah
Author affiliation(s)	Jouf University, Universite Libre de Bruxelles
Primary contact	Email: mhsiddiqi at ju dot edu dot sa
Year published	2023
Volume and issue	11(12)
Article #	1713
DOI	10.3390/healthcare11121713
ISSN	2227-9032
Distribution license	Creative Commons Attribution 4.0 International
Website	https://www.mdpi.com/2227-9032/11/12/1713
Download	https://www.mdpi.com/2227-9032/11/12/1713/pdf?version=1686475395 (PDF)

This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.

Abstract

The recent COVID-19 pandemic has hit humanity very hard in ways rarely observed before. In this digitally connected world, the health informatics and clinical research domains (both public and private) lack a robust framework to enable rapid investigation and cures. Since data in the healthcare domain are highly confidential, any framework in the healthcare domain must work on real data, be verifiable, and support reproducibility for evidence purposes. In this paper, we propose a health informatics framework that supports data acquisition from various sources in real-time, correlates these data with each other and with domain-specific terminologies, and supports querying and analyses. These sources include sensory data from wearable sensors, clinical investigation (for trials and devices) data from private/public agencies, personal health records, academic publications in the healthcare domain, and semantic information such as clinical ontologies and the Medical Subject Headings (MeSH) ontology. Linking and correlating these sources includes mapping personal wearable data to health records, clinical oncology terms to clinical trials, and so on. The framework is designed such that the data are findable, accessible, interoperable, and reusable (FAIR) with proper identity and access management mechanisms. In practice, this means tracing and linking each step in the data management lifecycle through discovery, ease of access and exchange, and data reuse. We present a practical use case to correlate a variety of aspects of data relating to a certain medical subject heading from the MeSH ontology and academic publications with clinical investigation data. The proposed architecture supports streaming data acquisition as well as servicing and processing changes throughout the lifecycle of the data management process. This is necessary in certain situations, such as when the status of a clinical or other health-related investigation needs to be updated. In such cases, the outline of those events must be tracked and viewable to support the analysis and traceability of the clinical investigation and to define interventions if necessary.

Keywords: data correlation, data linking, verifiable data, data analysis, explainable decisions, clinical trials, COVID, clinical investigation, semantic mapping, smart health

Introduction

Pandemics are not new to this world or humanity. There have been pandemics in the past, and they may happen again in the future. The recent COVID-19 pandemic differs from previous ones in that the virus can spread before an infection is known, its symptoms are ambiguous, and the detection methods require a lot of time and resources. It has caused millions of deaths worldwide, and its grave impact on the world economy and on human lives (whether affected or not) raises questions about the future of diseases and pandemics. At the same time, as humans advance knowledge and technology, there is a need to investigate and put effort into overcoming the challenges posed by these kinds of serious threats. This requires us not only to deal with the current pandemic but also to look into the future, anticipate the possibilities, investigate, and develop solutions at a large scale so that the risk to human lives is reduced.

The majority of the existing software systems and solutions in the healthcare domain are proprietary and limited in their capacity to a specific domain, such as only processing clinical trial data without integrating state-of-the-art investigation and wearable sensor data. Hence, they cannot offer a scalable analytical and technical solution, nor can they trace the analysis and results back to the origin of the data. Because of these limitations, a robust and practical solution must not only account for clinical data but also present a practical and broader overview of these catastrophic events, drawing on up-to-date clinical investigations, their results, and treatments, and combining them with medical history records as well as academic and other related datasets. In other words, recent advancements and investigations in the clinical domain specific to a particular disease are published in research articles and journals, and they need to be correlated with the real human subjects who undergo clinical trials so that up-to-date analysis can be carried out and proper guidelines and interventions can be suggested.

Moreover, the majority of the record-keeping bodies maintain electronic health records (EHRs) for patients and, recently, records of COVID vaccinations. However, they lack the ability to link the data to individuals’ activities and clinical outcomes. There is also little fusion of data related to the investigation of a particular disease from various data providers, such as private clinical trials, public clinical trials [1,2,3], and public-private clinical trials. The lack of these services stems not only from the limited literature on fusing these multiple forms of data but also from security and privacy concerns related to the confidentiality of healthcare data. Yet the advanced and robust privacy and security infrastructure now available is more than enough to protect personal and organizational interests. On the one hand, there are governments and other organizations that publish their clinical research data to public repositories, using defined standards, so that the data are available for clinical research. On the other hand, there are pharma companies, which mostly keep the analytics derived from these data, along with the respective algorithms and methods, private. Furthermore, clinical trial data are not enough, since they only provide measurements for the subjects who underwent a trial, while other data, from sources such as EHRs, contain real investigative cases and the histories of patients. Therefore, fusing these data from a variety of sources helps to determine the effects of a particular drug or treatment plan in combination with other vaccines, treatments, etc.

As the brief overview above shows, most state-of-the-art systems handle either static or dynamic, relational or non-relational, and noisy or cleaned data, without fusing, integrating, or semantically linking them. This paper proposes a solution to the above problems by presenting a healthcare framework that supports the ability to acquire, manage, and process static and dynamic (real-time) data. Our proposed framework is a data format that focuses on fusing information from various sources. In a nutshell, the proposed framework ideally targets a strategy that makes data findable, accessible, interoperable, and reusable (FAIR).

Objectives and contributions

In particular, the objective and contribution of this research is a framework designed to provide the following basic and essential capabilities for healthcare data:

A clinical "data lake" that stores data in a unified format, while the raw data can be in any format and either loaded from raw storage or streamed.
Pipelines that support static and incremental data collection from raw storage to the clinical data lake and maintain a record of any structural change, at both the data level and the schema level, all the way back to the raw data and the raw data schema.
A schema repository that versions the data whenever their structure changes and enables forward and backward compatibility of the data throughout.
Universal clinical schemas that can incorporate any clinically related concepts (such as clinical trials from any provider) and support flexibility.
Processes to perform change data capture (CDC) as the data proceed down the pipeline towards the applications, i.e., incremental algorithms and methods that process incoming data incrementally and keep a log of only the data of interest.
Timelines of changing data for a specific clinical investigation, built from clinical trial data elements such as a trial, site, or investigator, including the ability to stream the data to the application while providing semantic linking and profiling of subjects in the data.
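
The change data capture and timeline capabilities listed above can be illustrated with a minimal sketch. It is not the framework's implementation: it assumes each snapshot of clinical trial data is a plain dictionary keyed by a record identifier, and names such as ChangeEvent, capture_changes, timeline, and the TRIAL-001 records are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

Snapshot = Dict[str, Dict[str, Any]]  # assumed shape: record ID -> record fields


@dataclass(frozen=True)
class ChangeEvent:
    """One logged change (hypothetical structure, not from the paper)."""
    key: str               # record identifier, e.g., a trial ID
    op: str                # "insert", "update", or "delete"
    field: Optional[str]   # changed field for updates, otherwise None
    old: Any
    new: Any
    version: int           # snapshot number in which the change was seen


def capture_changes(previous: Snapshot, current: Snapshot, version: int) -> List[ChangeEvent]:
    """Diff two snapshots and return only what changed between them."""
    events: List[ChangeEvent] = []
    for key in sorted(current.keys() - previous.keys()):
        events.append(ChangeEvent(key, "insert", None, None, current[key], version))
    for key in sorted(previous.keys() - current.keys()):
        events.append(ChangeEvent(key, "delete", None, previous[key], None, version))
    for key in sorted(previous.keys() & current.keys()):
        before, after = previous[key], current[key]
        for name in sorted(before.keys() | after.keys()):
            if before.get(name) != after.get(name):
                events.append(
                    ChangeEvent(key, "update", name, before.get(name), after.get(name), version)
                )
    return events


def timeline(events: List[ChangeEvent], key: str) -> List[ChangeEvent]:
    """Return the ordered change history of a single record."""
    return [event for event in events if event.key == key]


# Illustrative data only: two successive snapshots of trial records.
snapshot_1: Snapshot = {"TRIAL-001": {"status": "recruiting", "site": "Site A"}}
snapshot_2: Snapshot = {
    "TRIAL-001": {"status": "completed", "site": "Site A"},
    "TRIAL-002": {"status": "recruiting", "site": "Site B"},
}

log = capture_changes({}, snapshot_1, version=1)
log += capture_changes(snapshot_1, snapshot_2, version=2)

for event in timeline(log, "TRIAL-001"):
    print(event.version, event.op, event.field, event.old, "->", event.new)
```

Diffing full snapshots is the simplest way to show the idea. A streaming pipeline would more likely consume change events emitted by the source system, but the resulting per-record timeline would look the same.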

These capabilities are meant to provide evidence for data analysis (i.e., which data an analysis traces back to), traces of changes, and a holistic view of the various clinical investigations that run simultaneously at different places and relate to a specific clinical investigation. For example, when the COVID-19 vaccine was being developed, it was necessary to be able to trace the efforts of various independent bodies in a single place where one could see the phases of clinical trials of vaccines, their outcomes, treatments, and even the investigation sites and investigators. This could result not only in better decision-making but also in allocating resources to suitable places and in better planning for similar future scenarios.
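
Tracing an analysis back to its data can be sketched as a walk over a lineage graph. The example below is only an illustration under stated assumptions: every dataset or result is assumed to record the identifiers of its direct inputs, and the Node class, the trace_to_sources function, and all node identifiers are hypothetical rather than taken from the framework.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A dataset or result in the lineage graph (hypothetical structure)."""
    node_id: str
    parents: List[str] = field(default_factory=list)  # IDs of direct inputs


def trace_to_sources(nodes: Dict[str, Node], node_id: str) -> List[str]:
    """Return the sorted IDs of all origin nodes that node_id depends on.

    A node without parents is treated as raw, original data.
    """
    seen = set()
    sources = set()
    stack = [node_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        node = nodes[current]
        if not node.parents:
            sources.add(current)
        stack.extend(node.parents)
    return sorted(sources)


# Illustrative lineage: two raw sources feed the lake, which feeds one analysis.
nodes = {
    n.node_id: n
    for n in [
        Node("raw:trial-registry"),
        Node("raw:publications"),
        Node("lake:trials", parents=["raw:trial-registry"]),
        Node("lake:articles", parents=["raw:publications"]),
        Node("analysis:mesh-report", parents=["lake:trials", "lake:articles"]),
    ]
}

print(trace_to_sources(nodes, "analysis:mesh-report"))
```

Storing only direct parents keeps each record small, and the full chain of evidence can still be rebuilt on demand by following the links.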

State of the art

References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases, important information was missing from the references, and that information was added.