Revision as of 19:52, 11 April 2021

Full article title Privacy-preserving healthcare informatics: A review
Journal ITM Web of Conferences
Author(s) Chong, Kah Meng
Author affiliation(s) Universiti Tunku Abdul Rahman
Primary contact kmchong at utar dot edu dot my
Year published 2021
Volume and issue 36
Article # 04005
DOI 10.1051/itmconf/20213604005
ISSN 2271-2097
Distribution license Creative Commons Attribution 4.0 International
Website https://www.itm-conferences.org/articles/itmconf/abs/2021/01/itmconf_icmsa2021_04005/
Download https://www.itm-conferences.org/articles/itmconf/pdf/2021/01/itmconf_icmsa2021_04005.pdf (PDF)

Abstract

The electronic health record (EHR) is the key to an efficient healthcare service delivery system. The publication of healthcare data is highly beneficial to healthcare industries and government institutions to support a variety of medical and census research. However, healthcare data contains sensitive information of patients, and the publication of such data could lead to unintended privacy disclosures. In this paper, we present a comprehensive survey of the state-of-the-art privacy-enhancing methods that ensure a secure healthcare data sharing environment. We focus on the recently proposed schemes based on data anonymization and differential privacy approaches in the protection of healthcare data privacy. We highlight the strengths and limitations of the two approaches and discuss some promising future research directions in this area.

Keywords: data privacy, data sharing, electronic health record, healthcare informatics

Introduction

Electronic health record (EHR) systems are increasingly adopted as an important paradigm in the healthcare industry to collect and store patient data, which includes sensitive information such as demographic data, medical history, diagnosis code, medications, treatment plans, hospitalization records, insurance information, immunization dates, allergies, and laboratory and test results. The availability of such big data has provided unprecedented opportunities to improve the efficiency and quality of healthcare services, particularly in improving patient care outcomes and reducing medical costs. EHR data have been published to allow useful analysis as required by the healthcare industry[1] and government institutions.[2][3] Some key examples may include large-scale statistical analytics (e.g., the study of correlation between diseases), clinical decision making, treatment optimization, clustering (e.g., epidemic control), and census surveys. Driven by the potential of EHR systems, a number of EHR repositories have been established, such as the National Database for Autism Research (NDAR), U.K. Data Service, ClinicalTrials.gov, and UNC Health Care (UNCHC).

Although the publication of EHR data is enormously beneficial, it could lead to unintended privacy disclosures. Many conventional cryptography and security methods have been deployed primarily to protect the security of EHR systems, including access control, authentication, and encryption. However, these technologies do not guarantee privacy preservation of sensitive data. That is, the sensitive information of a patient could still be inferred from the published data by an adversary. Various regulations and guidelines have been developed to restrict publishable data types, data usage, and data storage, including the Health Insurance Portability and Accountability Act (HIPAA)[4][5], General Data Protection Regulation (GDPR)[6][7], and Personal Data Protection Act.[8] However, there are several limitations to this regulatory approach. First, the data publisher must place a high level of trust in the data recipient to follow the rules and regulations it provides; yet there are adversaries who attempt to attack the published data to reidentify a target victim. Second, sensitive data might still be carelessly published due to human error and fall into the wrong hands, eventually leading to a breach of individual privacy. As such, regulations and guidelines alone do not provide a computational guarantee for preserving the privacy of a patient and thus cannot fully prevent such privacy violations. The need to protect individual data privacy in a hostile environment, while allowing accurate analysis of patient data, has driven the development of effective privacy models for protecting healthcare data.

In this paper, we present the privacy issues in healthcare data publication and elaborate on relevant adversarial attack models. With a focus on data anonymization and differential privacy, we discuss the limitations and strengths of these proposed approaches. Finally, we conclude the paper and highlight future research directions in this area.

Privacy threats

In this section, we first discuss privacy-preserving data publishing (PPDP) and the properties of healthcare data. Then, we present the major privacy disclosures in healthcare data publication and show the relevant attack models. Finally, we present the privacy and utility objective in PPDP.

Privacy-preserving data publishing

Privacy-preserving data publishing (PPDP) provides technical solutions that address the privacy and utility preservation challenges of data sharing scenarios. An overview of PPDP is shown in Figure 1, which includes a general data collection and data publishing scenario.



Figure 1. Overview of privacy-preserving data publishing (PPDP)

During the data collection phase, data of the record owner (patient) are collected by the data holder (hospital) and stored in an EHR. In the data publishing phase, the data holder releases the collected data to the data recipient (e.g., the public or a third party such as an insurance company or medical research center) for further analysis and data mining. However, some data recipients (adversaries) are not honest and attempt to obtain more information about the record owner than the published data provides, including the identity and sensitive data of the record owner. Hence, PPDP serves as a vital process that sanitizes sensitive information to avoid privacy violations of one or more individuals.

Healthcare data

Typically, stored healthcare data exists as relational data in tabular form. Each row (tuple) corresponds to one record owner, and each column corresponds to an attribute. The attributes can be grouped into the following four categories:

  • Explicit identifier (ID): a set of attributes such as name, social security number, national ID, mobile number, and driver's license number that uniquely identifies a record owner
  • Quasi-identifier (QID): a set of attributes such as date of birth, gender, address, zip code, and hobby that cannot uniquely identify a record owner but can potentially identify the target if combined with some auxiliary information
  • Sensitive attribute (SA): sensitive personal information such as diagnosis codes, genomic information, salary, health condition, insurance information, and relationship status that the record owner intends to keep private from unauthorized parties
  • Non-sensitive attribute (NSA): a set of attributes such as cookie IDs, hashed email addresses, and mobile advertising IDs generated from an EHR that do not violate the privacy of the record owner if they are disclosed (Note: all attributes that are not categorized as ID, QID, or SA are classified as NSA.)

Each attribute can be further classified as either a numerical attribute (e.g., age, zip code, and date of birth) or a non-numerical attribute (e.g., gender, job, and disease). Table 1 shows an example dataset in which the patients are naively anonymized (i.e., their names and social security numbers have been removed).

Table 1. An example of different types of attributes in a relational table (Age, Zip code, and Gender form the quasi-identifier; Disease is the sensitive attribute)
Name | Age | Zip code | Gender | Disease
1    | 23  | 96038    | Male   | Diabetes
2    | 28  | 96070    | Female | Diabetes
3    | 26  | 96073    | Male   | Diabetes
4    | 37  | 96328    | Male   | Cancer
5    | 33  | 96319    | Female | Mental Illness
6    | 33  | 96388    | Female | Diabetes
7    | 43  | 96583    | Male   | Diabetes
8    | 49  | 96512    | Female | Cancer
9    | 45  | 96590    | Male   | Cancer
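As a minimal illustration of the four attribute categories, the sketch below tags the attributes of Table 1 and shows that naive anonymization drops only the explicit identifier while leaving the QIDs intact. The schema dictionary and function names are illustrative assumptions, not part of any published scheme.

```python
# Attribute categories for a Table 1-style record.
# The categorization itself is an illustrative assumption.
RECORD_SCHEMA = {
    "name": "ID",        # explicit identifier -- removed before publishing
    "age": "QID",        # quasi-identifiers -- kept, hence still linkable
    "zip_code": "QID",
    "gender": "QID",
    "disease": "SA",     # sensitive attribute
}

def naively_anonymize(record):
    """Drop explicit identifiers only; QIDs and SAs are kept as-is."""
    return {k: v for k, v in record.items() if RECORD_SCHEMA[k] != "ID"}

record_1 = {"name": "B", "age": 23, "zip_code": "96038",
            "gender": "Male", "disease": "Diabetes"}
published = naively_anonymize(record_1)
# The QIDs survive naive anonymization, so record linkage remains possible.
print(published)  # {'age': 23, 'zip_code': '96038', 'gender': 'Male', 'disease': 'Diabetes'}
```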

Privacy disclosures

A privacy disclosure is defined as a disclosure of personal information that users intend to keep private from any entity not authorized to access or hold it. There are three types of privacy disclosures:

  • Identity disclosure: Identity disclosure, also known as reidentification, is the major privacy threat in publishing healthcare data. It occurs when the true identity of a targeted victim is revealed by an adversary from the published data. In other words, an individual is reidentified when an adversary is able to map a record in the published data to its corresponding patient with high probability (record linkage). For example, if an adversary possesses the information that A is 43 years old, then A is reidentified as record 7 in Table 1.
  • Attribute disclosure: This disclosure occurs when an adversary successfully links a victim to their SA information in the published data with high probability (attribute linkage). This SA information could be an SA value (e.g., "Disease" in Table 1) or a range that contains the SA value (e.g., medical cost range).
  • Membership disclosure: This disclosure occurs when an adversary successfully infers the existence of a targeted victim in the published data with high probability. For example, the inference of an individual in a COVID-19-positive database poses a privacy threat to the individual.
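The identity disclosure example above (an adversary who knows only that the target is 43 years old) can be sketched as record linkage over Table 1; the table encoding and helper function below are hypothetical.

```python
# Table 1 rows as (record number, age, zip code, gender, disease).
TABLE_1 = [
    (1, 23, "96038", "Male", "Diabetes"),
    (2, 28, "96070", "Female", "Diabetes"),
    (3, 26, "96073", "Male", "Diabetes"),
    (4, 37, "96328", "Male", "Cancer"),
    (5, 33, "96319", "Female", "Mental Illness"),
    (6, 33, "96388", "Female", "Diabetes"),
    (7, 43, "96583", "Male", "Diabetes"),
    (8, 49, "96512", "Female", "Cancer"),
    (9, 45, "96590", "Male", "Cancer"),
]

def link_by_age(table, age):
    """Record linkage: return every record matching the known QID value."""
    return [row for row in table if row[1] == age]

# Age 43 is unique, so the victim is reidentified as record 7 (identity
# disclosure) and linked to "Diabetes" (attribute disclosure).
matches = link_by_age(TABLE_1, 43)
assert len(matches) == 1 and matches[0][0] == 7

# Age 33 is shared by records 5 and 6, so that QID value alone does not
# single out one record.
assert len(link_by_age(TABLE_1, 33)) == 2
```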

Attack models

Privacy attacks could be launched by matching a published table containing sensitive information about the target victim with some external resources modelling the background knowledge of the attacker. For a successful attack, an adversary may require the following prior knowledge:

  • The published table, 𝑇′: An adversary has access to the published table 𝑇′ (which is often an open resource) and knows that 𝑇′ is an anonymized version of some table 𝑇.
  • QID of a targeted victim: An adversary possesses partial or complete QID values about a target from some external resource, and those values are accurate. This assumption is realistic, as QID information is easy to acquire from different sources, including real-life inspection data, external demographic data, and voter list data.
  • Knowledge about the distribution of the SA and NSA in table 𝑇: For example, an adversary may possess information such as P(disease = diabetes, age > 50) and may utilize this knowledge to make additional inferences about records in the published table 𝑇′.

Generally, privacy attacks could be launched due to the linkability properties of the QID. Now we discuss the relevant privacy attack models for identity and attribute disclosure.

  • Linkage attack: An adversary may reidentify a targeted record owner and discover their SA values by matching auxiliary QID values with the published table 𝑇′. For example, suppose Table 1 is published without modification and A knows that B lives in zip code 96038; A then infers that B corresponds to record 1 (identity disclosure) and has diabetes (attribute disclosure).[9][10][11][12][13]
  • Homogeneity attack: This attack discloses the SA values of a target when there is insufficient diversity in the SA; that is, a combination of QID values maps to a single SA value. For example, suppose A knows that B is 28 years old, which places B in the first equivalence class (an equivalence class is a cluster of records with the same QID values) in Table 2, below (records 1, 2, and 3). Since these records all have the same disease, A infers that B suffers from diabetes.[9][14]
  • Background knowledge attack: This attack uses logical reasoning and additional knowledge about a target to breach the SA values. For example, suppose A knows that C is 43 years old and lives in zip code 96583, which places C in the third equivalence class in Table 2, below (records 7, 8, and 9). The records show that C may have either diabetes or cancer; based on A’s background knowledge that C likes sweet foods, A infers that C is diabetic.[9][14]
  • Skewness attack: When the overall distribution of the SA in the original data is skewed, SA values can be inferred. SA values also have different degrees of sensitivity: for instance, a victim may not mind being known as diabetic, as diabetes is a common (majority) disease, but would mind being known to have mental illness. In Table 3, below, the probability of having mental illness within an equivalence class is 33.3%, much higher than in the real distribution (11.1% in Table 1). This imposes a privacy threat, since anyone in such an equivalence class can be inferred to have mental illness with 33.3% probability, compared with 11.1% under the overall distribution.[15]
  • Similarity attack: This attack discloses SA values when the distinct SA values in an equivalence class are semantically close. For example, suppose an adversary infers that the possible salaries of a target victim are 2K, 3K, and 4K. Although the numbers represent distinct salaries, they all fall in the range [2K, 4K]; hence, the adversary can infer that the target has a low salary because the SA values are semantically similar.[14][15]
Table 2. Published data of Table 1 (Age, Zip code, and Gender form the quasi-identifier; Disease is the sensitive attribute)
Number | Age  | Zip code | Gender | Disease
1      | < 30 | 960**    | *      | Diabetes
2      | < 30 | 960**    | *      | Diabetes
3      | < 30 | 960**    | *      | Diabetes
4      | < 40 | 963**    | *      | Cancer
5      | < 40 | 963**    | *      | Mental Illness
6      | < 40 | 963**    | *      | Diabetes
7      | < 50 | 965**    | *      | Diabetes
8      | < 50 | 965**    | *      | Cancer
9      | < 50 | 965**    | *      | Cancer
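Comparing Tables 1 and 2 suggests the kind of QID generalization used to produce the published version: ages are bucketed below the next decade bound, zip codes keep only their first three digits, and gender is suppressed. The sketch below is a reconstruction inferred from the two tables, not the actual algorithm used.

```python
def generalize(age, zip_code):
    """Generalize QIDs the way Table 2 does: bucket age below the next
    decade bound, keep the first three zip digits, suppress gender."""
    age_bucket = f"< {((age // 10) + 1) * 10}"
    zip_masked = zip_code[:3] + "**"
    gender_suppressed = "*"
    return (age_bucket, zip_masked, gender_suppressed)

# Record 1 (age 23, zip 96038) and record 7 (age 43, zip 96583) of Table 1.
print(generalize(23, "96038"))  # ('< 30', '960**', '*')
print(generalize(43, "96583"))  # ('< 50', '965**', '*')
```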
Table 3. Another version of the published data of Table 1 (Age, Zip code, and Gender form the quasi-identifier; Disease is the sensitive attribute)
Number | Age  | Zip code | Gender | Disease
1      | < 30 | 960**    | *      | Diabetes
2      | < 30 | 960**    | *      | Mental Illness
3      | < 30 | 960**    | *      | Cancer
4      | < 40 | 963**    | *      | Cancer
5      | < 40 | 963**    | *      | Mental Illness
6      | < 40 | 963**    | *      | Diabetes
7      | < 50 | 965**    | *      | Diabetes
8      | < 50 | 965**    | *      | Cancer
9      | < 50 | 965**    | *      | Mental Illness
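The homogeneity and skewness attacks described above can be checked mechanically by grouping the published rows into equivalence classes; the encoding below is an illustrative sketch of Tables 2 and 3.

```python
from collections import Counter, defaultdict

# (age bucket, masked zip, disease) for Tables 2 and 3; gender is "*" throughout.
TABLE_2 = [
    ("< 30", "960**", "Diabetes"), ("< 30", "960**", "Diabetes"),
    ("< 30", "960**", "Diabetes"), ("< 40", "963**", "Cancer"),
    ("< 40", "963**", "Mental Illness"), ("< 40", "963**", "Diabetes"),
    ("< 50", "965**", "Diabetes"), ("< 50", "965**", "Cancer"),
    ("< 50", "965**", "Cancer"),
]
TABLE_3 = [
    ("< 30", "960**", "Diabetes"), ("< 30", "960**", "Mental Illness"),
    ("< 30", "960**", "Cancer"), ("< 40", "963**", "Cancer"),
    ("< 40", "963**", "Mental Illness"), ("< 40", "963**", "Diabetes"),
    ("< 50", "965**", "Diabetes"), ("< 50", "965**", "Cancer"),
    ("< 50", "965**", "Mental Illness"),
]

def equivalence_classes(table):
    """Group records by their (identical) QID values."""
    classes = defaultdict(list)
    for age, zip_code, disease in table:
        classes[(age, zip_code)].append(disease)
    return classes

# Homogeneity attack on Table 2: the first class maps to a single SA value,
# so membership in the class reveals the disease outright.
assert set(equivalence_classes(TABLE_2)[("< 30", "960**")]) == {"Diabetes"}

# Skewness attack on Table 3: mental illness appears in 1 of 3 records per
# class (33.3%), versus 1 of 9 (11.1%) in the original Table 1 distribution.
for diseases in equivalence_classes(TABLE_3).values():
    assert Counter(diseases)["Mental Illness"] / len(diseases) == 1 / 3
```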

Privacy and utility objective of PPDP

References

  1. Senthilkumar, S.A.; Rai, B.K.; Meshram, A.A. et al. (2018). "Big Data in Healthcare Management: A Review of Literature". American Journal of Theoretical and Applied Business 4 (2): 57–69. doi:10.11648/j.ajtab.20180402.14. 
  2. Dudeck, M.A.; Horan, T.C.; Peterson, K.D. et al. (2011). "National Healthcare Safety Network (NHSN) Report, data summary for 2010, device-associated module". American Journal of Infection Control 39 (10): 798–816. doi:10.1016/j.ajic.2011.10.001. PMID 22133532. 
  3. Powell, K.M.; Li, Q.; Gross, C. et al. (2019). "Ventilator-Associated Events Reported by U.S. Hospitals to the National Healthcare Safety Network, 2015-2017". Proceedings of the American Thoracic Society 2019 International Conference. doi:10.1164/ajrccm-conference.2019.199.1_MeetingAbstracts.A3419. 
  4. Cohen, I.G.; Mello, M.M. (2018). "HIPAA and Protecting Health Information in the 21st Century". JAMA 320 (3): 231–32. doi:10.1001/jama.2018.5630. PMID 29800120. 
  5. Obeng, O.; Paul, S. (2019). "Understanding HIPAA Compliance Practice in Healthcare Organizations in a Cultural Context". AMCIS 2019 Proceedings: 1–5. https://aisel.aisnet.org/amcis2019/info_security_privacy/info_security_privacy/1/. 
  6. Voigt, P.; von dem Bussche, A. (2017). The EU General Data Protection Regulation (GDPR): A Practical Guide. Springer. ISBN 9783319579580. 
  7. Tikkinen-Piri, C.; Rohunen, A.; Markkula, J. (2018). "EU General Data Protection Regulation: Changes and implications for personal data collecting companies". Computer Law & Security Review 34 (1): 134–53. doi:10.1016/j.clsr.2017.05.015. 
  8. Carey, P. (2018). Data Protection: A Practical Guide to UK and EU Law. Oxford University Press. ISBN 9780198815419. 
  9. Machanavajjhala, A.; Gehrke, J.; Kifer, D. et al. (2006). "L-diversity: Privacy beyond k-anonymity". Proceedings of the 22nd International Conference on Data Engineering: 24–36. doi:10.1109/ICDE.2006.1. 
  10. Sweeney, L. (2002). "k-anonymity: A model for protecting privacy". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (5): 557–570. doi:10.1142/S0218488502001648. 
  11. Liu, F.; Li, T. (2018). "A Clustering K-Anonymity Privacy-Preserving Method for Wearable IoT Devices". Security and Communication Networks 2018: 4945152. doi:10.1155/2018/4945152. 
  12. Wei, D.; Ramamurthy, K.N.; Varshney, K.R. (2018). "Distribution‐preserving k‐anonymity". Statistical Analysis and Data Mining 11 (6): 253–270. doi:10.1002/sam.11374. 
  13. Liang, Y.; Samavi, R. (2020). "Optimization-based k-anonymity algorithms". Computers & Security 93: 101753. doi:10.1016/j.cose.2020.101753. 
  14. Khan, R.; Tao, X.; Anjum, A. et al. (2020). "θ-Sensitive k-Anonymity: An Anonymization Model for IoT based Electronic Health Records". Electronics 9 (5): 716. doi:10.3390/electronics9050716. 
  15. Li, N.; Li, T.; Venkatasubramanian, S. (2007). "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity". Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering: 106–15. doi:10.1109/ICDE.2007.367856. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation and grammar for readability. In some cases important information was missing from the references, and that information was added.