Adaptive Data Analysis

This class will take a mathematically rigorous approach to understanding how to mitigate overfitting and false discovery in data analysis, in the common case in which data is repeatedly re-used both to suggest which analyses should be performed and to actually conduct those analyses. Although this is the de facto way in which most data analysis is performed, your statistics 101 textbook will tell you not to do it: if your analyses are themselves chosen based on the data, most common methods of inference are invalidated.

We will start the class by demonstrating why this is the case: if you use standard empirical estimates, then adaptively chosen analyses really can overfit very quickly. The rest of the class will then be focused on mitigations: can we design more sophisticated statistical estimators that can prevent this problem?
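To make the overfitting claim concrete, here is a minimal sketch of one way adaptivity can go wrong. It is an illustrative toy setup, not taken from the course materials: the features and labels are independent coin flips, so no classifier can truly beat 50% accuracy. The analyst first computes the empirical correlation of every feature with the label, then adaptively builds a classifier from the signs of those correlations and estimates its accuracy on the same data. The function names and the parameters `n`, `d`, `n_fresh`, and `seed` are all assumptions made for this example.

```python
import random


def sample(n, d, rng):
    """Draw n points with d random +/-1 features and an independent random +/-1 label."""
    xs = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(n)]
    ys = [rng.choice((-1, 1)) for _ in range(n)]
    return xs, ys


def accuracy(weights, xs, ys):
    """Fraction of points on which sign(<weights, x>) matches the label (ties predict +1)."""
    correct = 0
    for x, y in zip(xs, ys):
        score = sum(w * xi for w, xi in zip(weights, x))
        pred = 1 if score >= 0 else -1
        correct += (pred == y)
    return correct / len(ys)


def run_experiment(n=100, d=1000, n_fresh=1000, seed=0):
    """Return (accuracy on the re-used data, accuracy on fresh data)."""
    rng = random.Random(seed)
    xs, ys = sample(n, d, rng)

    # First round of analyses: the sign of each feature's empirical
    # correlation with the label, computed on the data set.
    weights = []
    for j in range(d):
        corr = sum(x[j] * y for x, y in zip(xs, ys))
        weights.append(1 if corr >= 0 else -1)

    # Adaptive analysis: the classifier was chosen using the data,
    # and is now evaluated on that same data.
    reused_acc = accuracy(weights, xs, ys)

    # Honest check: evaluate the same classifier on fresh samples.
    fresh_xs, fresh_ys = sample(n_fresh, d, rng)
    fresh_acc = accuracy(weights, fresh_xs, fresh_ys)
    return reused_acc, fresh_acc


if __name__ == "__main__":
    reused_acc, fresh_acc = run_experiment()
    print(f"accuracy on re-used data: {reused_acc:.2f}")
    print(f"accuracy on fresh data:   {fresh_acc:.2f}")
```

With these settings, the accuracy measured on the re-used data should come out far above 50%, while the accuracy on fresh data should stay near chance. The gap is pure overfitting: the labels are noise, and the apparent signal comes entirely from choosing the final analysis based on earlier looks at the same data.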

Classes will start the week of September 4. (For Penn students: this is a week after the Penn semester starts. You get the first week off!) See the About page for times and locations.

Before class starts, please read Gelman and Loken’s excellent paper The Garden of Forking Paths, which describes the problem of adaptivity (using different terminology) in social science research and provides interesting real-world examples of the kinds of problems we want to build tools to solve.