The Algorithmic Foundations of Adaptive Data Analysis

Important Links

Lecture Notes
Homework
Logistics
Course Piazza Site (email us if you want to be added)

Synopsis

This class will take a mathematically rigorous approach to understanding how to mitigate overfitting and false discovery when doing data analysis in the common case in which data is repeatedly re-used, both to suggest which analyses should be performed, and to actually conduct those analyses. Despite being the de-facto way in which most data analysis is performed, your statistics 101 textbook will tell you not to do this: if your data analyses are themselves chosen from the data, most common methods of inference are invalidated.

We will start the class by demonstrating why this is the case: if you use standard empirical estimates, then adaptively chosen analyses really can overfit very quickly. The rest of the class will then be focused on mitigations: can we design more sophisticated statistical estimators that can prevent this problem?

Classes will start the week of September 4. (For Penn students: this is a week after the Penn semester starts. You get the first week off!). See the About page for times and locations.

Before class starts, please read Gelman and Loken’s excellent paper The Garden of the Forking Paths which describes the problem of adaptivity (using different terminology) in social science research, and provides interesting real-world examples of the kind of problem we want to build tools to solve.

The final project can have several forms:

Read and digest a paper (or pair of related papers) that we haven’t covered in class, and produce a set of lecture notes about it, presenting and proving the main results. The lecture notes should be of the form that we have been producing: a rigorous, and concise presentation of the main result of the paper. They should draw connections to material we have covered in class where appropriate.
A more open ended research investigation of your choosing. This can be entirely empirical, or entirely theoretical, or anywhere in between. The product of this should be a clear, precise writeup in the form of an academic paper, clearly framing your problem in the context of the academic literature, and describing your investigations and findings.

Students are invited to work in groups. In either case, the final product will be due on or before the date on which the final exam would have been scheduled (At Penn, this is December 18). Students who are interested in presenting their work in the last several class periods are invited to do so, for extra credit. Contact us!

Topics to turn into a lecture:

Proving Chernoff-like concentration bounds using the stability tools we have developed in class:
- Subgaussian Tail Bounds via Stability Arguments
- Concentration Bounds for High Sensitivity Functions via Differential Privacy
Other Differentially Private Estimators:
- Generalization for Adaptively Chosen Estimators via Stable Median
- Sublinear Time Adaptive Data Analysis
Adaptive Data Gathering
- Why Adaptively Collected Data have Negative Bias and How to Correct For It
Bayesian Adaptive Data Analysis
- Challenges in Bayesian Adaptive Data Analysis
- Bayesian Adaptive Data Analysis Guarantees from Subgaussianity
Hypothesis Testing
Approaches from Statistics (Selective Inference and Post Selection Inference)
- Exact Post-Selection Inference with Application to the Lasso
- Optimal Inference after Model Selection
- Selective Inference with a Randomized Response
- Valid Post-Selection Inference
- Inferactive Data Analysis (N. Bi, J. Markovic, L. Xia, J. Taylor, arxiv 2017)
Economic Approaches
- The Quality Preserving Database: A Computational Framework for Encouraging Collaboration, Enhancing Power and Controlling False Discovery
~~Information Theoretic Views of Generalization~~ (these will likely be covered in class)

Other project ideas:

More Theory: Can you improve the bounds we have presented in class? Even improvements just in the constant factors can be extremely significant in practice, and the difference between a theoretical curiosity and a useful technique. What about efficient algorithms that obtain statistically valid results for large numbers of queries from structured classes? (We know we can’t hope for efficient mechanisms for arbitrary queries that are substantially better than the Gaussian mechanism… )
More practice: How well can the techniques we’ve presented in class be made to work in practical settings? What about if you set the parameters heuristically (rather than as dictated by the theorems?) The GuessAndCheck mechanism can be used to answer large numbers of queries if you have a way of coming up with mostly accurate guesses. We know that methods that can provably do this can’t run in polynomial time in general — but there might be good methods for heuristically coming up with accurate guesses. The reusable holdout is an example of this. Can you find others?
Other kinds of adaptivity: We studied mitigations for the problem of adaptively selecting analyses. But adaptive data gathering can also introduce overfitting. (For example, I will continue gathering data until I obtain a p-value of < 0.05, at which point I will stop). See also the linked paper above. Are techniques from this class helpful in mitigating biases from other kinds of adaptivity?
Markets for Statistical Validity: We have been studying techniques to guarantee statistical validity in a single research study. But datasets are shared amongst many researchers, and false discovery and overfitting are therefore research-community scale problems. How should we reason about overfitting at a multi-study scale? Are there market based approaches to allocating access to finite data sets?
More: Don’t feel constrained by these suggestions — pick your own idea.

The Algorithmic Foundations of Adaptive Data Analysis

Adaptive Data Analysis

Important Links

Synopsis

Finalizing the class notes

Project Ideas

The Effect of the Reproducibility Movement in Social Psychology