Journal:Privacy preservation techniques in big data analytics: A survey

From LIMSWiki
Full article title: Privacy preservation techniques in big data analytics: A survey
Journal: Journal of Big Data
Author(s): Rao, P. Ram Mohan; Krishna, S. Murali; Kumar, A.P. Siva
Author affiliation(s): MLR Institute of Technology, Sri Venkateswara College of Engineering, JNTU Anantapur
Primary contact: Email: rammohan04 at gmail dot com
Year published: 2018
Volume and issue: 5
Page(s): 33
DOI: 10.1186/s40537-018-0141-8
ISSN: 2196-1115
Distribution license: Creative Commons Attribution 4.0 International
Website: https://link.springer.com/article/10.1186/s40537-018-0141-8
Download: https://link.springer.com/content/pdf/10.1186%2Fs40537-018-0141-8.pdf (PDF)

Abstract

Organizations such as hospitals, banks, e-commerce sites, retailers, and supply chain operators generate enormous amounts of data by virtue of digital technology. Not only humans but also machines contribute to data streams, in the form of closed circuit television (CCTV) streaming, website logs, etc. Social media and smartphones generate large volumes of data every minute. The voluminous data generated from these sources can be processed and analyzed to support decision making. However, data analytics is prone to privacy violations. One application of data analytics is the recommendation system, which is widely used by e-commerce sites like Amazon and Flipkart to suggest products to customers based on their buying habits, and which can expose users to inference attacks. Although data analytics is useful in decision making, it can lead to serious privacy concerns. Hence, privacy-preserving data analytics has become very important. This paper examines various privacy threats, privacy preservation techniques, and models, along with their limitations. The authors then propose a data lake-based modernistic privacy preservation technique to handle privacy preservation in unstructured data.

Keywords: data, data analytics, privacy threats, privacy preservation

Introduction

The volume and variety of data are growing exponentially due to the diverse applications of computers in all domain areas. This growth has been made possible by the affordable availability of computer technology, storage, and network connectivity. This large-scale data, which also includes person-specific private and sensitive data like gender, zip code, disease, caste, shopping cart, religion, etc., is being stored in a variety of public and private domains. The data holder can then release this data to a third-party data analyst to gain deeper insights and identify hidden patterns that are useful in making important decisions that may help in improving businesses and providing value-added services to customers[1], as well as in activities such as prediction, forecasting, and recommendation.[2] One of the prominent applications of data analytics is the recommendation system, which is widely used by e-commerce sites like Amazon and Flipkart for suggesting products to customers based on their buying habits. Facebook does something similar by suggesting friends, places to visit, and even movies to watch based on our interests. However, releasing user activity data may lead to inference attacks, such as identifying gender based on user activity.[3] We have studied a number of privacy-preserving techniques that are being employed to protect against privacy threats. Each of these techniques has its own merits and demerits. This paper explores them and also describes the research challenges in the area of privacy preservation. There is always a trade-off between data utility and privacy. This paper also proposes a data lake-based modernistic privacy preservation technique to handle privacy preservation in unstructured data with maximum data utility.

Privacy threats in data analytics

Privacy is the ability of an individual to determine what data can be shared, and to employ access control. Data held in a public or private domain poses a threat to individual privacy, because it is held by a data holder. The data holder can be a social networking application, website, mobile app, e-commerce site, bank, hospital, etc. It is the responsibility of the data holder to ensure the privacy of the users' data. Apart from the data held in various domains, users may knowingly or unknowingly contribute to data leakage. For example, most mobile apps seek access to our contacts, files, camera, etc., and we agree to all their terms and conditions without reading the privacy statement, thereby contributing to data leakage.

Hence, there is a need to educate smartphone users regarding privacy and privacy threats. Some of the key privacy threats include (1) surveillance, (2) disclosure, (3) discrimination, and (4) personal embarrassment and abuse.

Surveillance

Many businesses, such as retail and e-commerce companies, study their customers' buying habits and try to come up with various offers and value-added services.[4] Based on opinion data and sentiment analysis, social media sites may recommend new friends, places to visit, people to follow, etc. This is possible only when they continuously monitor their customers' transactions. This is a serious privacy threat, as few individuals willingly accept surveillance.

Disclosure

Consider a hospital holding patient data, which often contains identifying or revealing information such as zip code, gender, age, and disease.[5][6][7] The data holder, the hospital, releases the data to a third party for analysis after anonymizing sensitive personal information so that the person cannot be identified. The third-party data analyst can map this information to freely available external data sources like census data and then identify the person suffering from a disorder. This is how the private information of a person can be disclosed, which is considered to be a serious privacy breach.
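The mapping step described above amounts to a join on the shared attributes. The following is a minimal sketch of that idea, not a method taken from the paper: every record, every name, and the `link` helper are hypothetical and exist only for illustration.

```python
# Hypothetical released records: names removed, quasi-identifiers kept.
released = [
    {"zip": "57677", "age": 29, "gender": "M", "disease": "Cardiac problem"},
    {"zip": "57905", "age": 43, "gender": "F", "disease": "Skin allergy"},
    {"zip": "57906", "age": 47, "gender": "M", "disease": "Cancer"},
]

# Hypothetical public data (for example, a census extract) that includes names.
census = [
    {"name": "Person A", "zip": "57677", "age": 29, "gender": "M"},
    {"name": "Person B", "zip": "57905", "age": 43, "gender": "F"},
    {"name": "Person C", "zip": "57999", "age": 61, "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "age", "gender")


def link(released_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, disease) pairs for people who match exactly one released row."""
    index = {}
    for row in released_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for person in public_rows:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the person
            matches.append((person["name"], candidates[0]["disease"]))
    return matches


reidentified = link(released, census)
print(reidentified)
```

In this toy data, two of the three public records match a single released row, so removing names alone does not prevent disclosure.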

Discrimination

Discrimination is the bias or inequality that can arise when some private information of a person is disclosed. For instance, suppose statistical analysis of electoral results shows that people of one community were completely against the party that formed the government. The government could then neglect that community or be biased against it.

Personal embarrassment and abuse

Whenever a person's private information is disclosed, it can lead to personal embarrassment or abuse. For example, suppose a person is privately taking medication for a specific problem and buys the medicine regularly from a medical shop. As part of its regular business model, the medical shop may send reminders and offers related to the medicine over the phone. If another family member notices this, it may lead to personal embarrassment and even abuse.[8]

Data analytics activity creates data privacy issues. Many countries are enacting and enforcing privacy preservation laws. Yet something as simple as a lack of awareness can still enable privacy attacks despite these mechanisms. For example, many smartphone users are not aware of the information that is stolen from their phones by many apps. Previous research shows that only 17 percent of smartphone users are aware of such privacy threats.[9]

Privacy preservation methods

Many privacy-preserving techniques have been developed, but most of them are based on anonymization of data. Privacy preservation techniques include:

  • K-anonymity
  • L-diversity
  • T-closeness
  • Randomization
  • Data distribution
  • Cryptographic techniques
  • Multidimensional sensitivity-based anonymization (MDSBA)

K-anonymity

Anonymization is the process of modifying data before it is given for data analytics[10], so that re-identification is not possible: an attempt to re-identify a person by mapping the anonymized data to external data sources leads to K indistinguishable records. K-anonymity[11] is prone to two attacks, namely the homogeneity attack and the background knowledge attack. Algorithms applied to ensure anonymization include Incognito[12] and Mondrian[13]. K-anonymity is applied to the patient data shown in Table 1, which shows the data before anonymization.

Table 1. Patient data, before anonymization
Sno  Zip    Age  Disease
1    57677  29   Cardiac problem
2    57602  22   Cardiac problem
3    57678  27   Cardiac problem
4    57905  43   Skin allergy
5    57909  52   Cardiac problem
6    57906  47   Cancer
7    57605  30   Cardiac problem
8    57673  36   Cancer
9    57607  32   Cancer

The K-anonymity algorithm is applied with a K value of 3, so that any attempt to identify a particular person's data yields at least three indistinguishable records. K-anonymity is applied to two attributes, zip and age, shown in Table 1. The result of anonymizing the zip and age attributes is shown in Table 2.

Table 2. After applying anonymization on zip and age
Sno  Zip    Age  Disease
1    576**  2*   Cardiac problem
2    576**  2*   Cardiac problem
3    576**  2*   Cardiac problem
4    5790*  >40  Skin allergy
5    5790*  >40  Cardiac problem
6    5790*  >40  Cancer
7    576**  3*   Cardiac problem
8    576**  3*   Cancer
9    576**  3*   Cancer
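The step from Table 1 to Table 2 can be reproduced with a short sketch. This is an illustration only, not the Incognito or Mondrian algorithm: the generalization rules in `generalize_zip` and `generalize_age` are assumptions hard-coded to mirror Table 2, and all function names are hypothetical.

```python
from collections import Counter

# Rows of Table 1: (sno, zip, age, disease)
table1 = [
    (1, "57677", 29, "Cardiac problem"),
    (2, "57602", 22, "Cardiac problem"),
    (3, "57678", 27, "Cardiac problem"),
    (4, "57905", 43, "Skin allergy"),
    (5, "57909", 52, "Cardiac problem"),
    (6, "57906", 47, "Cancer"),
    (7, "57605", 30, "Cardiac problem"),
    (8, "57673", 36, "Cancer"),
    (9, "57607", 32, "Cancer"),
]


def generalize_zip(zip_code):
    # Assumed rule mirroring Table 2: zips starting with 5790 keep four
    # digits, all other zips keep three.
    if zip_code.startswith("5790"):
        return zip_code[:4] + "*"
    return zip_code[:3] + "**"


def generalize_age(age):
    # Assumed rule mirroring Table 2: ages above 40 share one bucket,
    # other ages keep only their tens digit.
    return ">40" if age > 40 else f"{age // 10}*"


def anonymize(rows):
    return [(sno, generalize_zip(z), generalize_age(a), d) for sno, z, a, d in rows]


def is_k_anonymous(rows, k):
    """True if every (zip, age) combination occurs in at least k rows."""
    class_sizes = Counter((z, a) for _, z, a, _ in rows)
    return all(size >= k for size in class_sizes.values())


table2 = anonymize(table1)
for row in table2:
    print(row)
print(is_k_anonymous(table2, 3))
```

With these rules the nine records fall into three equivalence classes of three records each, so the check holds for K = 3 and fails for any larger K.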

The above technique used generalization[14] to achieve anonymization. Suppose we know that John is 27 years old and lives in the 57677 zip code. We can then still conclude that John has a cardiac problem even after anonymization, as shown in Table 2, because all three records in his equivalence class share that disease. This is called a homogeneity attack. Similarly, if John is 36 years old and we know that John does not have cancer, then John must have a cardiac problem. This is called a background knowledge attack. K-anonymity[15][16] can be achieved by using either generalization or suppression. K-anonymity can be optimized if minimal generalization can be done without significant data loss.[17] Identity disclosure is a major privacy threat, and protection against it cannot be guaranteed by K-anonymity.[18] Personalized privacy is the most important aspect of individual privacy.[19]
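Both attacks described above reduce to filtering the anonymized table by the victim's equivalence class. The sketch below is a hedged illustration using the rows of Table 2; the helper `candidate_diseases` is a hypothetical name introduced for this example.

```python
# Rows of Table 2: (zip, age, disease)
table2 = [
    ("576**", "2*", "Cardiac problem"),
    ("576**", "2*", "Cardiac problem"),
    ("576**", "2*", "Cardiac problem"),
    ("5790*", ">40", "Skin allergy"),
    ("5790*", ">40", "Cardiac problem"),
    ("5790*", ">40", "Cancer"),
    ("576**", "3*", "Cardiac problem"),
    ("576**", "3*", "Cancer"),
    ("576**", "3*", "Cancer"),
]


def candidate_diseases(rows, zip_code, age_group, ruled_out=()):
    """Diseases still possible for a person known to be in the given class."""
    return {
        disease
        for z, a, disease in rows
        if z == zip_code and a == age_group and disease not in ruled_out
    }


# Homogeneity attack: John is 27 and lives in 57677, so he is in (576**, 2*).
homogeneity = candidate_diseases(table2, "576**", "2*")

# Background knowledge attack: John is 36, so he is in (576**, 3*),
# and the attacker also knows he does not have cancer.
background = candidate_diseases(table2, "576**", "3*", ruled_out={"Cancer"})

print(homogeneity)
print(background)
```

In both cases only one candidate disease remains, even though each equivalence class contains three records.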

L-diversity

To address the homogeneity attack, another technique called L-diversity has been proposed. Under L-diversity, there must be L well-represented values for the sensitive attribute (disease) in each equivalence class.
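One simple reading of "well represented" is that each equivalence class must contain at least L distinct sensitive values. The sketch below checks that reading against the rows of Table 2. It is an assumption made for illustration, since stricter definitions of L-diversity also exist, and the function names are hypothetical.

```python
from collections import defaultdict

# Rows of Table 2: (zip, age, disease)
table2 = [
    ("576**", "2*", "Cardiac problem"),
    ("576**", "2*", "Cardiac problem"),
    ("576**", "2*", "Cardiac problem"),
    ("5790*", ">40", "Skin allergy"),
    ("5790*", ">40", "Cardiac problem"),
    ("5790*", ">40", "Cancer"),
    ("576**", "3*", "Cardiac problem"),
    ("576**", "3*", "Cancer"),
    ("576**", "3*", "Cancer"),
]


def distinct_values_per_class(rows):
    """Count the distinct sensitive values in each (zip, age) equivalence class."""
    classes = defaultdict(set)
    for z, a, disease in rows:
        classes[(z, a)].add(disease)
    return {key: len(values) for key, values in classes.items()}


def is_distinct_l_diverse(rows, l):
    return all(count >= l for count in distinct_values_per_class(rows).values())


counts = distinct_values_per_class(table2)
print(counts)
print(is_distinct_l_diverse(table2, 2))
```

The class (576**, 2*) holds a single disease value, so Table 2 satisfies 3-anonymity but not even 2-diversity under this reading, which is the situation the homogeneity attack exploits.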

Implementing L-diversity is not always possible because of the variety of data. L-diversity is also prone to the skewness attack: when the overall distribution of data is skewed into a few equivalence classes, protection against attribute disclosure cannot be ensured. For example, if all the records are distributed into only three equivalence classes, then the semantic closeness of the values may lead to attribute disclosure. L-diversity may also lead to a similarity attack. From Table 3, it can be seen that if we know that John is 27 years old and lives in the 57677 zip code, then John definitely falls in the low income group, because the salaries of all three persons in the 576** zip are low compared to others in the table. This is called a similarity attack.

Table 3. L-diversity privacy preservation technique
Sno  Zip    Age  Salary  Disease
1    576**  2*   5k      Cardiac problem
2    576**  2*   6k      Cardiac problem
3    576**  2*   7k      Cardiac problem
4    5790*  >40  20k     Skin allergy
5    5790*  >40  22k     Cardiac problem
6    5790*  >40  24k     Cancer
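The similarity attack on Table 3 can be illustrated by comparing the number of distinct salaries in each equivalence class with the range those salaries span. This is a minimal sketch under the assumption that salary is treated as the sensitive attribute; the function name is hypothetical.

```python
# Rows of Table 3: (zip, age, salary in thousands, disease)
table3 = [
    ("576**", "2*", 5, "Cardiac problem"),
    ("576**", "2*", 6, "Cardiac problem"),
    ("576**", "2*", 7, "Cardiac problem"),
    ("5790*", ">40", 20, "Skin allergy"),
    ("5790*", ">40", 22, "Cardiac problem"),
    ("5790*", ">40", 24, "Cancer"),
]


def salary_summary(rows):
    """Per equivalence class: (number of distinct salaries, lowest, highest)."""
    classes = {}
    for z, a, salary, _ in rows:
        classes.setdefault((z, a), []).append(salary)
    return {
        key: (len(set(values)), min(values), max(values))
        for key, values in classes.items()
    }


summary = salary_summary(table3)
john = summary[("576**", "2*")]  # John is 27 and lives in 57677
print(summary)
print(john)
```

John's class has three distinct salary values, yet all of them lie between 5k and 7k, so an attacker learns his income bracket without learning the exact figure.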

Abbreviations

  • CCTV: closed circuit television
  • MDSBA: multidimensional sensitivity-based anonymization

References

  1. Ducange, P.; Pecori, R.; Mezzina, P. (2018). "A glimpse on big data analytics in the framework of marketing strategies". Soft Computing 22 (1): 325–42. doi:10.1007/s00500-017-2536-4. 
  2. Chauhan, A.; Kummamuru, K.; Toshniwal, D. (2017). "Prediction of places of visit using tweets". Knowledge and Information Systems 50 (1): 145–66. doi:10.1007/s10115-016-0936-x. 
  3. Yang, D.; Qu, B.; Cudre-Mauroux, P. (2018). "Privacy-Preserving Social Media Data Publishing for Personalized Ranking-Based Recommendation". IEEE Transactions on Knowledge and Data Engineering. doi:10.1109/TKDE.2018.2840974. 
  4. Liu, Y.; Guo, W.; Fan, C.-I. et al. (2018). "A Practical Privacy-Preserving Data Aggregation (3PDA) Scheme for Smart Grid". IEEE Transactions on Industrial Informatics. doi:10.1109/TII.2018.2809672. 
  5. Duncan, G.T.; Fienberg, S.E.; Krishnan, R. et al. (2001). "Disclosure limitation methods and information loss for tabular data". In Doyle, P.; Lane, J.; Theeuwes, J. et al. Confidentiality, disclosure and data access: Theory and practical applications for statistical agencies. Elsevier. pp. 135–66. ISBN 9780444507617. 
  6. Duncan, G.T.; Lambert, D. (1986). "Disclosure-Limited Data Dissemination". Journal of the American Statistical Association 81 (393): 10–18. doi:10.1080/01621459.1986.10478229. 
  7. Lambert, D. (1993). "Measures of disclosure risk and harm". Journal of Official Statistics 9 (2): 313–31. 
  8. Spiller, K.; Ball, K.; Bandara, A. et al. (2017). "Data Privacy: Users’ Thoughts on Quantified Self Personal Data". In Ajana, B. Self-Tracking. Palgrave Macmillan, Cham. pp. 111–24. doi:10.1007/978-3-319-65379-2_8. ISBN 9783319653792. 
  9. Hettig, M.; Kiss, E.; Jassel, J.-F. et al. (2013). "Visualizing Risk by Example: Demonstrating Threats Arising From Android Apps". Symposium on Usable Privacy and Security (SOUPS) 2013: 1–2. https://cups.cs.cmu.edu/soups/2013/risk/paper.pdf. 
  10. Iyengar, V.S. (2002). "Transforming data to satisfy privacy constraints". Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 279–288. doi:10.1145/775047.775089. 
  11. Bayardo, R.J.; Agrawal, R. (2005). "Data privacy through optimal k-anonymization". Proceedings of the 21st International Conference on Data Engineering: 217–28. doi:10.1109/ICDE.2005.42. 
  12. LeFevre, K.; DeWitt, D.J.; Ramakrishnan, R. (2005). "Incognito: Efficient full-domain K-anonymity". Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data: 49–60. doi:10.1145/1066157.1066164. 
  13. LeFevre, K.; DeWitt, D.J.; Ramakrishnan, R. (2006). "Mondrian: Multidimensional K-Anonymity". Proceedings of the 22nd International Conference on Data Engineering: 25. doi:10.1109/ICDE.2006.101. 
  14. Samarati, P.; Sweeney, L. (1998). "Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement through Generalization and Suppression". Technical Report SRI-CSL-98-04. SRI International. http://www.csl.sri.com/papers/sritr-98-04/. 
  15. Sweeney, L. (2002). "Achieving k-anonymity privacy protection using generalization and suppression". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (5): 571–88. doi:10.1142/S021848850200165X. 
  16. Sweeney, L. (2002). "K-Anonymity: A model for protecting privacy". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (5): 557–70. doi:10.1142/S0218488502001648. 
  17. Meyerson, A.; Williams, R. (2004). "On the complexity of optimal K-anonymity". Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems: 223–28. doi:10.1145/1055558.1055591. 
  18. Machanavajjhala, A.; Gehrke, J.; Kifer, D. et al. (2006). "L-diversity: Privacy beyond K-anonymity". Proceedings of the 22nd International Conference on Data Engineering: 24. doi:10.1109/ICDE.2006.1. 
  19. Xiao, X.; Tao, Y. (2006). "Personalized privacy preservation". Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data: 229–40. doi:10.1145/1142473.1142500. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. Grammar was cleaned up for smoother reading. In some cases important information was missing from the references, and that information was added. The citations for 10 and 11 were flipped because the original applied the citation to the title, which we don't do here.