Difference between revisions of "Journal:Secure record linkage of large health data sets: Evaluation of a hybrid cloud model"

From LIMSWiki

Revision as of 14:53, 22 December 2020

Full article title: Secure record linkage of large health data sets: Evaluation of a hybrid cloud model
Journal: JMIR Medical Informatics
Author(s): Brown, Adrian P.; Randall, Sean M.
Author affiliation(s): Curtin University
Primary contact: Email: adrian dot brown at curtin dot edu dot au
Year published: 2020
Volume and issue: 8(9)
Article #: e18920
DOI: 10.2196/18920
ISSN: 2291-9694
Distribution license: Creative Commons Attribution 4.0 International
Website: https://medinform.jmir.org/2020/9/e18920/
Download: https://medinform.jmir.org/2020/9/e18920/pdf (PDF)

Abstract

Background: The linking of administrative data across agencies provides the capability to investigate many health and social issues, with the potential to deliver significant public benefit. Despite its advantages, the use of cloud computing resources for linkage purposes is scarce, with the storage of identifiable information on cloud infrastructure assessed as high-risk by data custodians.

Objective: This study aims to present a model for record linkage that utilizes cloud computing capabilities while assuring custodians that identifiable data sets remain secure and local.

Methods: A new hybrid cloud model was developed, including privacy-preserving record linkage techniques and container-based batch processing. An evaluation of this model was conducted with a prototype implementation using large synthetic data sets representative of administrative health data.

Results: The cloud model kept identifiers on-premises and used privacy-preserved identifiers to run all linkage computations on cloud infrastructure. Our prototype used a managed container cluster in Amazon Web Services to distribute the computation using existing linkage software. Although the cost of computation was relatively low, the use of existing software resulted in a processing overhead of 35.7% (149/417 minutes of execution time).

Conclusions: The result of our experimental evaluation shows the operational feasibility of such a model and the exciting opportunities for advancing the analysis of linkage outputs.

Keywords: cloud computing, medical record linkage, confidentiality, data science

Introduction

Background

In the last 10 years, innovative development of software applications, wearables, and the internet of things has changed the way we live. These technological advances have also changed the way we deliver health services and provide a rapidly expanding information resource, with the potential for data-driven breakthroughs in the understanding, treatment, and prevention of disease. Additional information from patient-related sources, such as mobile phone and Google search histories[1], wearable devices[2], and mobile phone apps[3], provides new opportunities for monitoring, managing, and improving health outcomes. The key to unlocking these data is in relating details at the individual patient level to provide an understanding of risk factors and appropriate interventions.[4] The linking, integration, and analysis of these data has recently been described as "population data science."[5]

Record linkage is a technique for finding records within and across one or more data sets thought to refer to the same person, family, place, or event.[6] Coined in 1946, the term describes the process of assembling the principal life events of an individual from birth to death.[7] In today’s digital age, the capacity of systems to match records has increased, yet the aim remains the same: linking records to construct individual chronological histories and undertake studies that deliver significant public benefit.

An important step in the evolution of data linkage is to develop infrastructure with the capacity to link data across agencies to create enduring integrated data sets. Such resources provide the capability to investigate many health and social issues. A number of collaborative groups have invested in a large-scale record linkage infrastructure to achieve national linkage objectives.[8][9] The establishment of research centers specializing in the analysis of big data also recognizes the issue of increasing data size and complexity.[10]

As the demand for data linkage increases, a core challenge will be to ensure that the systems are scalable. Record linkage is computationally expensive, with a potential comparison space equivalent to the Cartesian product of the record sets being linked, making linkage of large data sets (in the tens of millions or greater) a considerable challenge. Optimizing systems, removing manual operations, and increasing the level of autonomy for such processes are essential.
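To give a sense of scale, the following minimal Python sketch computes the size of the full comparison space for two data sets and the reduced space when only records sharing a blocking key are compared. The function names, the record counts, and the toy blocking keys are illustrative assumptions and are not taken from this study.

```python
from collections import Counter


def full_comparison_space(n_a: int, n_b: int) -> int:
    """Number of pairs when every record in A is compared with every record in B."""
    return n_a * n_b


def blocked_comparison_space(keys_a, keys_b) -> int:
    """Number of pairs when only records sharing a blocking key are compared."""
    counts_a = Counter(keys_a)
    counts_b = Counter(keys_b)
    return sum(counts_a[k] * counts_b[k] for k in counts_a if k in counts_b)


# Two hypothetical data sets of 10 million records each.
pairs = full_comparison_space(10_000_000, 10_000_000)
print(f"{pairs:,} candidate pairs")  # 100,000,000,000,000 candidate pairs

# Toy blocking keys (for example, a year of birth) for four records per data set.
keys_a = [1980, 1980, 1991, 2003]
keys_b = [1980, 1991, 1991, 1975]
print(full_comparison_space(len(keys_a), len(keys_b)))  # 16
print(blocked_comparison_space(keys_a, keys_b))         # 4
```

Even in this toy case, restricting comparisons to shared keys cuts the work from 16 pairs to 4; at tens of millions of records, some reduction of this kind is what makes linkage tractable at all.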

A wide range of software is currently used for record linkage. System deployments range from single desktop machines to multiple servers, with most being hosted internally to organizations. The functional scope of packages varies greatly, with manual processes and scripts required to help manage, clean, link, and extract data. Many packages struggle with large data set sizes.

Many industries have moved toward cloud computing as a solution for high computational workloads, data storage, and analytics.[11] An overview of cloud computing types and service models is shown in Table 1. The business benefits of cloud computing include usage-based costing, minimal upfront infrastructure investment, superior collaboration (both internally and externally), better management of data, and increased business agility.[12] Despite these advantages, uptake by the record linkage industry has been slow. One reason for this is that the storage of identifiable information on cloud infrastructure has been assessed as high-risk by data custodians. Although security in cloud computing systems has been shown to be more robust than some in-house systems[13], the media reporting of data breaches has created an impression of insecurity and vulnerability.[14] A culture of risk aversion leaves record linkage units with expensive, dedicated equipment and computing resources that require managing, maintaining, and regular upgrading or replacement.

Table 1. Overview of cloud computing types and service models.
Name | Description
Types of cloud computing
Public | All computing resources are located within a cloud service provider that is generally accessible via the internet.
Private | Computing resources for an organization that are located within the premises of the organization. Access is typically through local network connections.
Hybrid | Cloud services are composed of some combination of public and private cloud services. Public cloud services are typically leveraged in this situation for increasing capacity or capability.
Service models
Infrastructure as a service (IaaS) | The provider manages physical hardware, storage, servers, and virtualization, providing virtual machines to the consumer.
Platform as a service (PaaS) | In addition to the items managed for IaaS, the provider also manages operating systems, middleware, and platform runtimes. The consumer leverages these platform runtimes in their own apps.
Software as a service (SaaS) | The provider manages everything, including apps and data, exposing software endpoints (typically as a website) for the consumer.

To leverage the advantages of cloud computing, we need to explore operational cloud computing models for record linkage that consider the specific requirements of all stakeholders. In addition, linkage infrastructure requires the development and implementation of robust security and information governance frameworks as part of adopting a cloud solution.

Related work

Some research has been undertaken on algorithms that address the computational burden of the comparison and classification tasks in record linkage. Most work on distributed and parallel algorithms for record linkage is specific to the MapReduce paradigm[15], a programming model for processing large data sets in parallel on a cluster. Few sources detail the comparison and classification tasks themselves; the focus is instead on load balancing algorithms that address issues associated with data skew. These works attempt to optimize the workload distribution across nodes while removing as many true negatives from the comparison space as possible.[16][17][18][19] Load balancing algorithms typically use multiple MapReduce jobs and different indexing methods to tackle the data skew problem. Indexing methods include standard blocking[17][18], density-based blocking[16], and locality-sensitive hashing (LSH)[20], with varying success in optimizing the workload distribution.
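The data skew problem can be illustrated with a small single-machine Python sketch. It groups records by a blocking key (standard blocking), counts the pairwise comparisons each block requires, and assigns whole blocks to workers using a simple greedy rule (largest block first, to the least-loaded worker). The greedy rule, the function names, and the sample surnames are assumptions made for illustration; this is not the load balancing algorithm of any of the cited works, and it does not use MapReduce.

```python
import heapq
from collections import defaultdict


def standard_blocking(records, key_func):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_func(rec)].append(rec)
    return dict(blocks)


def block_workload(block):
    """Pairwise comparisons needed within one block (deduplication setting)."""
    n = len(block)
    return n * (n - 1) // 2


def assign_blocks(blocks, n_workers):
    """Greedy assignment: largest blocks first, each to the least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = defaultdict(list)
    ordered = sorted(blocks.items(), key=lambda kv: -block_workload(kv[1]))
    for key, block in ordered:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(key)
        heapq.heappush(heap, (load + block_workload(block), worker))
    loads = {
        w: sum(block_workload(blocks[k]) for k in keys)
        for w, keys in assignment.items()
    }
    return dict(assignment), loads


# Hypothetical records with a skewed surname distribution.
surnames = ["smith"] * 5 + ["jones"] * 3 + ["brown"] * 2 + ["lee"]
records = [{"id": i, "surname": s} for i, s in enumerate(surnames)]

blocks = standard_blocking(records, key_func=lambda r: r["surname"])
assignment, loads = assign_blocks(blocks, n_workers=2)
print(assignment)  # {0: ['smith'], 1: ['jones', 'brown', 'lee']}
print(loads)       # {0: 10, 1: 4}
```

Even with a sensible assignment, the single large "smith" block leaves one worker with more than twice the comparisons of the other. Whole-block assignment cannot fix this, which is why load balancing approaches generally need to split or re-index oversized blocks.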

Pita et al.[21] have built on the MapReduce-based work and demonstrated good performance and quality using a Spark-based workflow for probabilistic linkage. Spark was chosen for in-memory processing, ease of programming, scalability, and the new resilient distributed data set model. Like MapReduce, Spark continues to be used to address the issues with linkage and data skew on larger data sets. Spark solutions for full entity resolution are being developed, with different indexing techniques used to address workload distribution. The SparkER tool by Gagliardelli et al.[22] uses LSH, meta-blocking, and a block purging process to remove high-frequency blocking keys. Mestre et al.[23] presented a sorted neighborhood implementation with an adaptive window size, which uses three Spark transformation steps to distribute the data and minimize data skew.
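For readers unfamiliar with the sorted neighborhood method, the basic idea can be sketched in a few lines of plain Python. This sketch uses a fixed window on a single machine; it is not the adaptive, Spark-based approach of Mestre et al., and the function name, window size, and sample surnames are assumptions chosen for illustration.

```python
def sorted_neighborhood_pairs(records, sort_key, window=3):
    """Yield candidate pairs that fall within a sliding window after sorting.

    Each record is paired with the next (window - 1) records in sort order.
    """
    ordered = sorted(records, key=sort_key)
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            yield rec, other


# Hypothetical surnames; similar spellings end up adjacent after sorting.
names = ["smith", "smyth", "jones", "johns", "brown"]
pairs = list(sorted_neighborhood_pairs(names, sort_key=lambda s: s, window=3))

print(len(pairs))  # 7 candidate pairs instead of the 10 possible pairs
print(("johns", "jones") in pairs, ("smith", "smyth") in pairs)  # True True
```

Because every record contributes at most (window - 1) comparisons, the workload grows linearly with the number of records rather than quadratically, and no single sort key value can create an oversized block. A fixed window can still miss matches when many records share similar sort keys, which is the situation an adaptive window size is meant to handle.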


References

  1. Abebe, R.; Hill, S.; Vaughan, J.W. et al. (2019). "Using Search Queries to Understand Health Information Needs in Africa". Proceedings of the Thirteenth International AAAI Conference on Web and Social Media 13 (1): 3–14. https://ojs.aaai.org/index.php/ICWSM/article/view/3360. 
  2. Radin, J.M.; Wineinger, N.E.; Topol, E.J. et al. (2020). "Harnessing wearable device data to improve state-level real-time surveillance of influenza-like illness in the USA: A population-based study". The Lancet Digital Health 2 (2): e85–e93. doi:10.1016/S2589-7500(19)30222-5. 
  3. Lai, S.; Farnham, A.; Ruktanonchai, N.W. et al. (2019). "Measuring mobility, disease connectivity and individual risk: A review of using mobile phone data and mHealth for travel medicine". Journal of Travel Medicine 26 (3): taz019. doi:10.1093/jtm/taz019. PMC PMC6904325. PMID 30869148. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6904325. 
  4. Khoury, M.J.; Iademarco, M.F.; Riley, W.T. (2016). "Precision Public Health for the Era of Precision Medicine". American Journal of Preventive Medicine 50 (3): 398–401. doi:10.1016/j.amepre.2015.08.031. PMC PMC4915347. PMID 26547538. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4915347. 
  5. McGrail, K.; Jones, K. (2018). "Population Data Science: The science of data about people". Conference Proceedings for International Population Data Linkage Conference 2018 3 (4). doi:10.23889/ijpds.v3i4.918. 
  6. Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer-Verlag. doi:10.1007/978-3-642-31164-2. ISBN 9783642311642. 
  7. Dunn, H.L. (1946). "Record Linkage". American Journal of Public Health and the Nation's Health 36 (12): 1412–6. PMC PMC1624512. PMID 18016455. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1624512. 
  8. Casey, J.A.; Schwartz, B.S.; Stewart, W.F. et al. (2016). "Using Electronic Health Records for Population Health Research: A Review of Methods and Applications". Annual Review of Public Health 37: 61–81. doi:10.1146/annurev-publhealth-032315-021353. PMC PMC6724703. PMID 26667605. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6724703. 
  9. Population Health Research Network (17 April 2014). "Population Health Research Network 2013 Independent Panel Review: Findings and Recommendations" (PDF). https://www.phrn.org.au/media/80607/phrn-2013-independent-review-findings-and-recommendations-v2-_final-report-april-17-2014-2.pdf. Retrieved 15 September 2020. 
  10. UNSW Australia (2015). "Centre for Big Data Research in Health: Annual Report 2015" (PDF). https://cbdrh.med.unsw.edu.au/sites/default/files/CBDRH_Annual%20Report_2015_160609_Final.pdf. Retrieved 15 September 2020. 
  11. Liversidge, J.; Spencer, J.; Weinstein, E. et al. (21 December 2018). "Predicts 2019: Cloud Adoption and Increasing Regulation Will Drive Investment in IT Vendor Management". Gartner Research. https://www.gartner.com/en/documents/3896211/predicts-2019-cloud-adoption-and-increasing-regulation-w. Retrieved 15 September 2020. 
  12. Vasiljeva, T.; Shaikhulina, S.; Kreslins, K. (2017). "Cloud Computing: Business Perspectives, Benefits and Challenges for Small and Medium Enterprises (Case of Latvia)". Procedia Engineering 178: 443–51. doi:10.1016/j.proeng.2017.01.087. 
  13. Khalil, I.M.; Khreishah, A.; Bouktif, S. et al. (2013). "Security Concerns in Cloud Computing". Proceedings of the 10th International Conference on Information Technology: New Generations: 411–416. doi:10.1109/ITNG.2013.127. 
  14. John, J.; Norman, J. (2018). "Major Vulnerabilities and Their Prevention Methods in Cloud Computing". In Peter, J.; Alavi, A.; Javadi, B.. Advances in Big Data and Cloud Computing. Springer. pp. 11–26. doi:10.1007/978-981-13-1882-5_2. ISBN 9789811318825. 
  15. Abd El-Ghafar, R.M.; Gheith, M.H.; El-Bastawissy, A. et al. (2017). "Record linkage approaches in big data: A state of art study". Proceedings of the 13th International Computer Engineering Conference: 224–230. doi:10.1109/ICENCO.2017.8289792. 
  16. Dou, C.; Sun, D.; Chen, Y.-C. et al. (2016). "Probabilistic parallelisation of blocking non-matched records for big data". Proceedings of the 2016 IEEE International Conference on Big Data: 3465–3473. doi:10.1109/BigData.2016.7841009. 
  17. Chu, X.; Ilyas, I.F.; Koutris, P. (2016). "Distributed data deduplication". Proceedings of the VLDB Endowment 9 (11): 864–875. doi:10.14778/2983200.2983203. 
  18. Gazzari, L.; Herschel, M. (2020). "Towards task-based parallelization for entity resolution". SICS Software-Intensive Cyber-Physical Systems 35: 31–38. doi:10.1007/s00450-019-00409-6. 
  19. Papadakis, G.; Skoutas, D.; Thanos, E. et al. (2020). "Blocking and Filtering Techniques for Entity Resolution: A Survey". ACM Computing Surveys 53 (2): 31. doi:10.1145/3377455. 
  20. Karapiperis, D.; Verykios, V.S. (2016). "A fast and efficient Hamming LSH-based scheme for accurate linkage". Knowledge and Information Systems 49: 861–884. doi:10.1007/s10115-016-0919-y. 
  21. Pita, R.; Pinto, C.; Melo, P. et al. (2015). "A Spark-based workflow for probabilistic record linkage of healthcare data" (PDF). Workshop Proceedings of the EDBT/ICDT 2015 Joint Conference: 1–10. http://ceur-ws.org/Vol-1330/paper-04.pdf. 
  22. Gagliardelli, L.; Simonini, G.; Beneventano, D. et al. (2019). "SparkER: Scaling Entity Resolution in Spark". Proceedings of the 22nd International Conference on Extending Database Technology: 602–05. doi:10.5441/002/edbt.2019.66. 
  23. Mestre, D.G.; Pires, C.E.S.; Nascimento, D.C. et al. (2017). "An efficient spark-based adaptive windowing for entity matching". Journal of Systems and Software 128: 1–10. doi:10.1016/j.jss.2017.03.003. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. Grammar was cleaned up for smoother reading. In some cases important information was missing from the references, and that information was added. At the time of loading of this article, the links to the Additional File 1 and 2 were broken on the original site; a request to fix the errors has been sent to the journal.