The final project can have several forms:
- Read and digest a paper (or pair of related papers) that we haven’t covered in class, and produce a set of lecture notes about it, presenting and proving the main results. The lecture notes should be of the form that we have been producing: a rigorous, and concise presentation of the main result of the paper. They should draw connections to material we have covered in class where appropriate.
- A more open ended research investigation of your choosing. This can be entirely empirical, or entirely theoretical, or anywhere in between. The product of this should be a clear, precise writeup in the form of an academic paper, clearly framing your problem in the context of the academic literature, and describing your investigations and findings.
Students are invited to work in groups. In either case, the final product will be due on or before the date on which the final exam would have been scheduled (At Penn, this is December 18). Students who are interested in presenting their work in the last several class periods are invited to do so, for extra credit. Contact us!
Topics to turn into a lecture:
- Proving Chernoff-like concentration bounds using the stability tools we have developed in class:
- Other Differentially Private Estimators:
- Adaptive Data Gathering
- Bayesian Adaptive Data Analysis
- Hypothesis Testing
- Approaches from Statistics (Selective Inference and Post Selection Inference)
- Exact Post-Selection Inference with Application to the Lasso
- Optimal Inference after Model Selection
- Selective Inference with a Randomized Response
- Valid Post-Selection Inference
- Inferactive Data Analysis (N. Bi, J. Markovic, L. Xia, J. Taylor, arxiv 2017)
- Economic Approaches
Information Theoretic Views of Generalization(these will likely be covered in class)
Other project ideas:
- More Theory: Can you improve the bounds we have presented in class? Even improvements just in the constant factors can be extremely significant in practice, and the difference between a theoretical curiosity and a useful technique. What about efficient algorithms that obtain statistically valid results for large numbers of queries from structured classes? (We know we can’t hope for efficient mechanisms for arbitrary queries that are substantially better than the Gaussian mechanism… )
- More practice: How well can the techniques we’ve presented in class be made to work in practical settings? What about if you set the parameters heuristically (rather than as dictated by the theorems?) The GuessAndCheck mechanism can be used to answer large numbers of queries if you have a way of coming up with mostly accurate guesses. We know that methods that can provably do this can’t run in polynomial time in general — but there might be good methods for heuristically coming up with accurate guesses. The reusable holdout is an example of this. Can you find others?
- Other kinds of adaptivity: We studied mitigations for the problem of adaptively selecting analyses. But adaptive data gathering can also introduce overfitting. (For example, I will continue gathering data until I obtain a p-value of < 0.05, at which point I will stop). See also the linked paper above. Are techniques from this class helpful in mitigating biases from other kinds of adaptivity?
- Markets for Statistical Validity: We have been studying techniques to guarantee statistical validity in a single research study. But datasets are shared amongst many researchers, and false discovery and overfitting are therefore research-community scale problems. How should we reason about overfitting at a multi-study scale? Are there market based approaches to allocating access to finite data sets?
- More: Don’t feel constrained by these suggestions — pick your own idea.