• Starts: 1:30 pm on Wednesday, July 24, 2024
  • Ends: 3:30 pm on Wednesday, July 24, 2024

ECE PhD Dissertation Defense: Zeynep K

Title: Database Alignment: Fundamental Limits and Multiple Databases Setting

Presenter: Zeynep K

Date: Wednesday, July 24, 2024

Time: 1:30 PM to 3:30 PM

Location: 8 Saint Mary's Street, Room 339

Advisor: Professor Bobak Nazer

Chair: Professor Tianyu Wang

Committee: Professor Bobak Nazer, Professor Prakash Ishwar, Professor David A. Castanon, Professor Archana Venkataraman

Abstract: In modern data analysis, privacy is a critical concern when dealing with user-related databases. Ensuring user anonymity while extracting meaningful correlations from the data poses a significant challenge, especially when side information can potentially enable de-anonymization. This dissertation explores the standard information theoretic problems in the correlated databases model. We define a "database" as a simple probabilistic model that contains a random feature vector for each user, with user labels shuffled to ensure anonymity.

We first investigate correlation detection between two databases, formulating it as a composite binary hypothesis testing problem. Under the alternate hypothesis, there exists an unknown permutation that aligns users in the first database with those in the second, thereby matching correlated entries. The null hypothesis assumes that the databases are independent, with no such alignment. For the special case of Gaussian feature vectors, we derive both upper and lower bounds on the correlation required to achieve or fail to achieve this statistical problem. Our results are tight up to a constant factor when the feature length exceeds the number of users. Regarding our achievability boundary, we draw connections to the user labeling recovery problem, highlighting significant parallels and insights. By recognizing an equivalence relationship between correlated databases and graphical models, we validate our results for specific parameter regimes. Additionally, for the two databases model, we initially examine the potential gaps in the statistical analysis conducted thus far for the large number of users regime by drawing parallels with similar problems in the literature. Motivated by these comparisons, we propose a novel approach to address the detection problem, focusing on the hidden permutation structure and intricate dependencies characterizing these relationships.

Building on our research, we present a comprehensive model for handling multiple correlated databases. In this multiple-databases setting, we address another fundamental information theoretic problem: user label recovery. We evaluate the performance of the typicality matching estimator in relation to the asymptotic behavior of feature length, demonstrating a strong impossibility result that holds up to a multiplicative constant factor. This exploration into multiple databases not only broadens the scope of our study but also underscores the complexity and richness of correlation detection in a more generalized framework.

In conclusion, we summarize the statistical gaps identified in our findings, exploring their possible origins. We also discuss the limitations of our simple probabilistic model and propose strategies to address them, suggesting modifications and enhancements for future research. Finally, we outline potential future research directions, including the information theoretic problem of change detection, which remains an open area of significant interest.

Location:
PHO 339