Protecting data privacy with anonymity: Quantifying instinctive measures and intelligent effective search for optimal anonymized data






Data privacy entails the ability of individuals to control their personal data. In this digital era, users can lose control of their personal data without knowing it, because advanced technology allows the data to be tracked, stored, and shared across multiple parties. Protecting online data privacy is a daunting task. Current consent-based privacy policies tend to be too elaborate and difficult to apply effectively. There is a need for new approaches to protecting data privacy that go beyond the user-consent model. As more data are shared and made publicly available, attackers can infer confidential data by combining multiple query sources and use it to commit malicious acts. This dissertation addresses how to better protect the privacy of such published structured data, in particular the fundamental issue of anonymity quantification and the practical issues of efficiency and optimality in automated anonymization.
Anonymity is among the most widely used properties for data privacy protection. It represents the state of indistinguishability. Thus, increasing users’ anonymity increases their indistinguishability, which makes it harder for them to be re-identified. Anonymization ensures that each set of "critical" data values belongs to more than one individual so that each individual's identity can be protected. Many privacy-preserving approaches to anonymizing structured data transform the original data into a more anonymous form (via generalization and suppression) while preserving data integrity. Although anonymization techniques have been studied extensively, most of them, surprisingly, do not measure anonymity directly but instead use a specified measure that indirectly indicates the quality of anonymity (e.g., the anonymity degree in k-anonymity). Most existing anonymity measures are indirect because they are based on entropy, which estimates information loss, only a partial consequence of anonymity. An anonymity measure is at the heart of anonymization, yet there is little research on quantifying anonymity directly. This dissertation models two direct anonymity measures: an information-based measure for the disclosure breach and an inference-based measure for the re-identification attack. The unique aspect of the formulation is its instinctive articulation of the opposing perspectives of victims (in concealing their identity) and attackers (in finding the disclosure or identity). Furthermore, unlike most other work, this study distinguishes the anonymity measure of an individual from that of the group. For large-scale data, this dissertation uses the data distribution to propose measures of uniformity, variety, and diversity as anonymity indicators that quickly assess degrees of data privacy.
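To make the indirect measure mentioned above concrete, the following is a minimal sketch of how the anonymity degree in k-anonymity is commonly computed: records are grouped by their quasi-identifier values, and k is the size of the smallest group. This illustrates only the standard k-anonymity degree, not the direct information-based or inference-based measures this dissertation proposes. The function names, column names, and toy records are hypothetical.

```python
from collections import Counter

def equivalence_classes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def k_anonymity_degree(records, quasi_identifiers):
    """Return the size of the smallest equivalence class (0 for an empty table)."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(classes.values()) if classes else 0

# Hypothetical, already generalized toy table.
records = [
    {"age": "30-39", "zip": "794**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "794**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "794**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "794**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "794**", "diagnosis": "diabetes"},
]

k = k_anonymity_degree(records, ["age", "zip"])
```

In this toy table the smallest group has two records, so the table is 2-anonymous with respect to age and zip. The degree says nothing about how the sensitive values are distributed inside each group, which is one reason such a measure is only an indirect indicator of anonymity.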
On the practical issues of anonymization, to improve efficiency, most general-purpose anonymization techniques aim to find an "optimal" k-anonymization (i.e., anonymized data satisfying the k-anonymity requirement), e.g., by minimizing data distortion or the number of generalization steps. However, the problem of finding a k-anonymization that maximizes preserved information is NP-hard. This has led to greedy anonymization and special-purpose techniques. Still, a common issue in anonymization is the trade-off between data privacy and informativeness. Generalization helps gain anonymity but can result in data that are not useful. Anonymization approaches are mostly designed to address specific goals (e.g., accurate classification, efficient algorithms), but none provides an integrated solution for efficiency, privacy, and preserved informativeness. This dissertation presents a general-purpose anonymization technique that applies generalization to secure privacy by satisfying user-specified anonymity requirements while optimizing information preservation. The proposed approach exploits the monotonicity property of generalization, along with a heuristic search, to efficiently find optimal generalized data that comply with the anonymity requirements. The approach is theoretically grounded, as the search can be mapped to a well-known efficient optimal search in Artificial Intelligence. In addition, the approach yields data of relatively good quality for classification, even though its intent is to keep the generalized data as close as possible to the original. Finally, this dissertation puts together a practical methodology for anonymity analytics and retention.
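The monotonicity property can be illustrated with a small sketch. It assumes k-anonymity as the anonymity requirement and uses a plain level-order enumeration of the generalization lattice in place of the heuristic search described above, so it is not the dissertation's algorithm. The hierarchies, function names, and data are hypothetical. Because generalizing further never breaks k-anonymity, any lattice node that is at least as general as a node already known to satisfy the requirement can be skipped without testing it.

```python
from collections import Counter
from itertools import product

def generalize_age(age, level):
    """Hypothetical hierarchy: 0 = exact age, 1 = decade range, 2 = fully suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = age // 10 * 10
        return f"{low}-{low + 9}"
    return "*"

def generalize_zip(zip_code, level):
    """Hypothetical hierarchy: mask the last `level` digits of the ZIP code."""
    level = min(level, len(zip_code))
    return zip_code[:len(zip_code) - level] + "*" * level

MAX_LEVELS = (2, 5)  # highest generalization level for (age, zip)

def is_k_anonymous(rows, levels, k):
    """Check whether every generalized (age, zip) combination occurs at least k times."""
    counts = Counter(
        (generalize_age(age, levels[0]), generalize_zip(zip_code, levels[1]))
        for age, zip_code in rows
    )
    return min(counts.values()) >= k

def minimal_satisfying_nodes(rows, k):
    """Return the least general lattice nodes whose generalization is k-anonymous."""
    found = []
    # Visit nodes in order of total generalization height, least general first.
    nodes = sorted(product(*(range(m + 1) for m in MAX_LEVELS)), key=sum)
    for node in nodes:
        # Monotonicity: a node at least as general as a satisfying node also
        # satisfies the requirement and is not minimal, so skip the check.
        if any(all(n >= f for n, f in zip(node, hit)) for hit in found):
            continue
        if is_k_anonymous(rows, node, k):
            found.append(node)
    return found

# Hypothetical toy data: (age, zip) pairs.
rows = [(34, "79401"), (36, "79409"), (41, "79413"), (47, "79415")]
result = minimal_satisfying_nodes(rows, 2)
```

For these four rows and k = 2, the only minimal node is (1, 1): ages generalized to decades and the last ZIP digit masked. The total generalization height stands in for information loss here; a real information-preservation objective would need its own cost function.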




Privacy, Anonymity, Intelligent Anonymization, Privacy Measures