This class will take a mathematically rigorous approach to understanding how to mitigate overfitting and false discovery when doing data analysis in the common case in which data is repeatedly re-used, both to suggest which analyses should be performed, and to actually conduct those analyses. Despite being the de-facto way in which most data analysis is performed, your statistics 101 textbook will tell you not to do this: if your data analyses are themselves chosen from the data, most common methods of inference are invalidated.
We will start the class by demonstrating why this is the case: if you use standard empirical estimates, then adaptively chosen analyses really can overfit very quickly. The rest of the class will then be focused on mitigations: can we design more sophisticated statistical estimators that can prevent this problem?
Classes will start the week of September 4. (For Penn students: this is a week after the Penn semester starts. You get the first week off!). See the About page for times and locations.
Before class starts, please read Gelman and Loken’s excellent paper The Garden of the Forking Paths which describes the problem of adaptivity (using different terminology) in social science research, and provides interesting real-world examples of the kind of problem we want to build tools to solve.
The final project can have several forms:
- Read and digest a paper (or pair of related papers) that we haven’t covered in class, and produce a set of lecture notes about it, presenting and proving the main results. The lecture notes should be of the form that we have been producing: a rigorous, and concise presentation of the main result of the paper. They should draw connections to material we have covered in class where appropriate.
- A more open ended research investigation of your choosing. This can be entirely empirical, or entirely theoretical, or anywhere in between. The product of this should be a clear, precise writeup in the form of an academic paper, clearly framing your problem in the context of the academic literature, and describing your investigations and findings.
Students are invited to work in groups. In either case, the final product will be due on or before the date on which the final exam would have been scheduled (At Penn, this is December 18). Students who are interested in presenting their work in the last several class periods are invited to do so, for extra credit. Contact us!
Topics to turn into a lecture:
- Proving Chernoff-like concentration bounds using the stability tools we have developed in class:
- Other Differentially Private Estimators:
- Adaptive Data Gathering
- Bayesian Adaptive Data Analysis
- Hypothesis Testing
- Approaches from Statistics (Selective Inference and Post Selection Inference)
- Economic Approaches
Information Theoretic Views of Generalization (these will likely be covered in class)
Other project ideas:
- More Theory: Can you improve the bounds we have presented in class? Even improvements just in the constant factors can be extremely significant in practice, and the difference between a theoretical curiosity and a useful technique. What about efficient algorithms that obtain statistically valid results for large numbers of queries from structured classes? (We know we can’t hope for efficient mechanisms for arbitrary queries that are substantially better than the Gaussian mechanism… )
- More practice: How well can the techniques we’ve presented in class be made to work in practical settings? What about if you set the parameters heuristically (rather than as dictated by the theorems?) The GuessAndCheck mechanism can be used to answer large numbers of queries if you have a way of coming up with mostly accurate guesses. We know that methods that can provably do this can’t run in polynomial time in general — but there might be good methods for heuristically coming up with accurate guesses. The reusable holdout is an example of this. Can you find others?
- Other kinds of adaptivity: We studied mitigations for the problem of adaptively selecting analyses. But adaptive data gathering can also introduce overfitting. (For example, I will continue gathering data until I obtain a p-value of < 0.05, at which point I will stop). See also the linked paper above. Are techniques from this class helpful in mitigating biases from other kinds of adaptivity?
- Markets for Statistical Validity: We have been studying techniques to guarantee statistical validity in a single research study. But datasets are shared amongst many researchers, and false discovery and overfitting are therefore research-community scale problems. How should we reason about overfitting at a multi-study scale? Are there market based approaches to allocating access to finite data sets?
- More: Don’t feel constrained by these suggestions — pick your own idea.
Andrew Gelman (whose paper on the “Garden of the Forking Paths” we read at the start of this course), together with Simmons and Simonsohn (here at Penn) have for years helped bring attention to the ability for adaptive data analysis (p-hacking) to lead to false discovery. Their blogs have often done this by picking apart the methodology of high profile papers. The New York Times has an in depth history of this movement, with a focus on the social upheaval it caused amongst the social psychology community itself, and particularly for the authors of flawed studies that were held up as examples.