The original motivation behind the toolbox was to put an emphasis on biological study design and to provide a new perspective on analysing neural networks (inspired by Jon Frankle). This means replicable experiment pipelines (git hash, random seeds, configuration cloning, protocol logging, etc.). Most of this has been implemented by now.
But another super useful perspective comes from economics and the evaluation of treatments (Thx Nandan!). In ML, most result presentations do not report any significance levels, and often the learning curve confidence intervals even overlap. The entire process is not very scientific and can benefit from tools from econometrics, such as diff-in-diff style experiments and simply being rigorous about the effect you are trying to sell. I want to incorporate this into the toolbox and make it part of my daily research routine for testing scientific ML hypotheses!
In the near future:
Add a set of effect estimators (frequentist for now): linear models, different standard error formulations, and p-value extraction (see the sketch right after this list).
Incorporate these into the gridsearch/postprocessing routines for a specific metric and test.
Add corrections for multiple testing (Bonferroni, false discovery rate - look at Omiros' assignment).
Add a difference-in-differences estimation setup: compute effects at different timepoints.
Add plot utilities for visualizing significance over time/overall.
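A minimal sketch of what the frequentist estimation and correction utilities could look like, assuming results are collected in a tidy DataFrame with one row per run and hypothetical columns `score`, `treated` and `post`; all function names are placeholders:

```python
# Sketch only: assumes a tidy DataFrame `df` with one row per run and
# hypothetical columns "score", "treated" (0/1 treatment flag) and
# "post" (0/1 late-vs-early indicator for the diff-in-diff case).
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests


def estimate_effect(df, formula="score ~ treated", cov_type="HC1"):
    """Linear effect model: return coefficient, std. error and p-value."""
    fit = smf.ols(formula, data=df).fit(cov_type=cov_type)
    return fit.params["treated"], fit.bse["treated"], fit.pvalues["treated"]


def estimate_did(df, cov_type="HC1"):
    """Difference-in-differences: the interaction term is the effect."""
    fit = smf.ols("score ~ treated * post", data=df).fit(cov_type=cov_type)
    return fit.params["treated:post"], fit.pvalues["treated:post"]


def correct_pvalues(pvals, method="fdr_bh", alpha=0.05):
    """Multiple-testing correction (e.g. 'bonferroni' or 'fdr_bh')."""
    reject, corrected, _, _ = multipletests(pvals, alpha=alpha, method=method)
    return reject, corrected
```

Switching `cov_type` between "HC0"-"HC3" (or clustered standard errors via `cov_kwds`) should cover the different standard error formulations.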
Note I: I think it makes sense to add a causality subdirectory, which collects the different tests and can easily be extended. E.g., starting with a base class TreatmentTest, we can have different model, standard error, and correction formulations.
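To make this concrete, a rough sketch of the base class; only the TreatmentTest name comes from the note above, everything else is a placeholder suggestion:

```python
# Rough sketch for the causality subdirectory. Only the TreatmentTest name
# comes from the note above; everything else is a placeholder suggestion.
from abc import ABC, abstractmethod
from dataclasses import dataclass

import statsmodels.formula.api as smf


@dataclass
class TestResult:
    effect: float
    std_error: float
    p_value: float


class TreatmentTest(ABC):
    """Base class: subclasses plug in model/standard error/correction choices."""

    def __init__(self, cov_type: str = "HC1"):
        self.cov_type = cov_type

    @abstractmethod
    def fit(self, data) -> TestResult:
        """Estimate the treatment effect on the logged results."""


class LinearEffectTest(TreatmentTest):
    """Simplest concrete test: OLS with a treatment dummy."""

    def fit(self, data) -> TestResult:
        fit = smf.ols("score ~ treated", data=data).fit(cov_type=self.cov_type)
        return TestResult(
            effect=fit.params["treated"],
            std_error=fit.bse["treated"],
            p_value=fit.pvalues["treated"],
        )
```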
Note II: Many of these features will also be needed in the population-based training pipeline, e.g. Wald t-test for deciding whether two sampled population members are performing better/worse.
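A minimal sketch of that comparison, here written as a Welch two-sample t-test via scipy as a stand-in for the Wald test; the function name and the one-sided decision rule are assumptions:

```python
# Minimal sketch: compare two population members from their sampled
# evaluation scores with a Welch two-sample t-test (one-sided decision).
from scipy import stats


def member_a_is_better(scores_a, scores_b, alpha=0.05):
    """True if member A's scores are significantly higher than member B's."""
    t_stat, p_two_sided = stats.ttest_ind(scores_a, scores_b, equal_var=False)
    return bool(t_stat > 0 and p_two_sided / 2 < alpha)
```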
In the distant future:
Add an experiment type that performs an intervention (e.g. a learning rate change) at a specific timepoint and estimates the treatment effect over time.
Add support for Nandan's 'virtual-labs': automatic spawning of additional random seeds based on the variance estimate of the effect coefficient, with a fixed budget of total jobs and jobs per iteration (see the sketch after this list).
Add Bayesian effect modelling with credible intervals, i.e. a full posterior over the effect coefficient.
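A toy sketch of how the 'virtual-labs' loop could be orchestrated (referenced in the list above); `run_seed` is a hypothetical callable that launches one job and returns its score, and the stopping rule uses the standard error of the mean score as a simplified stand-in for the variance of the effect coefficient:

```python
# Toy sketch of the 'virtual-labs' idea: spawn seeds in batches until the
# estimate is precise enough or the total budget is exhausted.
# `run_seed` is a hypothetical callable returning one seed's score.
import numpy as np


def adaptive_seeding(run_seed, total_budget=50, jobs_per_iter=5,
                     target_se=0.05, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    std_error = np.inf
    while len(scores) < total_budget and std_error > target_se:
        n_new = min(jobs_per_iter, total_budget - len(scores))
        for s in rng.integers(0, 2**31 - 1, size=n_new):
            scores.append(run_seed(int(s)))
        std_error = np.std(scores, ddof=1) / np.sqrt(len(scores))
    return float(np.mean(scores)), float(std_error), len(scores)
```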
Things to Think About
Non-Gaussianity: bootstrap (parametric/non-parametric) standard errors, Fisher exact p-values, permutation tests (see the sketch after this list).
Instrumental variables
Different treatment timepoints and tracking their time series - e.g. when to adjust the learning rate?
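For the non-Gaussian case, a permutation test over the treatment labels is probably the simplest starting point; a minimal, distribution-free sketch:

```python
# Minimal permutation test: distribution-free p-value for a difference in
# mean scores between treated and control runs.
import numpy as np


def permutation_pvalue(treated, control, n_permutations=10_000, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([treated, control])
    observed = np.mean(treated) - np.mean(control)
    n_treated = len(treated)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(pooled)
        diff = np.mean(perm[:n_treated]) - np.mean(perm[n_treated:])
        exceed += abs(diff) >= abs(observed)
    return (exceed + 1) / (n_permutations + 1)
```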
I would love to add rliable here. Many of its comparison/testing techniques would be great to have for benchmarking. These include:
Stratified Bootstrap Confidence Intervals
Score Distributions
Interquartile Mean
I would recommend installing it and importing its metrics/bootstrap procedures. These could either be calculated automatically on the hyper_log of a search or only be provided offline afterwards. See what works best with the benchmark project.
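A minimal sketch of such a hook, following the usage documented in rliable's README; how the hyper_log maps onto the required score_dict (one array of shape runs x tasks per configuration) is left open:

```python
# Sketch based on rliable's documented interface: `score_dict` maps each
# configuration name to a score array of shape (num_runs, num_tasks).
import numpy as np
from rliable import library as rly
from rliable import metrics


def aggregate_with_cis(score_dict, reps=2000):
    """IQM, mean and median with stratified bootstrap confidence intervals."""
    aggregate_func = lambda scores: np.array([
        metrics.aggregate_iqm(scores),
        metrics.aggregate_mean(scores),
        metrics.aggregate_median(scores),
    ])
    point_estimates, interval_estimates = rly.get_interval_estimates(
        score_dict, aggregate_func, reps=reps)
    return point_estimates, interval_estimates
```

rliable's plot_utils could then cover the score distribution / performance profile plots.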