The original motivation behind the toolbox was to put an emphasis on biological study design and to provide a new perspective on analysing neural networks (inspired by Jon Frankle). This means replicable experiment pipelines (git hash, random seeds, configuration cloning, protocol logging, etc.). Most of this has been implemented by now.
But another super useful perspective comes from economics and the evaluation of treatments (Thx Nandan!). In ML, most result presentations do not report any significance levels, and often the learning curve confidence intervals even overlap. The entire process is not very scientific and can benefit from tools from econometrics, such as diff-in-diff style experiments and simply being rigorous about the effect you are trying to sell. I want to incorporate this into the toolbox and make it part of my daily research routine for testing scientific ML hypotheses!
In the near future:
Add a set of effect estimators (frequentist for now): linear models, different standard error formulations, and p-value extraction (see the sketch right after this list).
Incorporate these into the gridsearch/postprocessing routines for a specific metric and test.
Add corrections for multiple testing (Bonferroni, false discovery rate - look at Omiros' assignment).
Add a difference-in-differences estimation setup: compute effects at different timepoints.
Add plot utilities for visualizing significance over time/overall.
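A minimal sketch of what the frequentist estimation and correction utilities could look like, assuming results are collected in a tidy DataFrame with one row per run and hypothetical columns `score`, `treated` and `post`; all function names are placeholders:

```python
# Sketch only: assumes a tidy DataFrame `df` with one row per run and
# hypothetical columns "score", "treated" (0/1 treatment flag) and
# "post" (0/1 late-vs-early indicator for the diff-in-diff case).
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests


def estimate_effect(df, formula="score ~ treated", cov_type="HC1"):
    """Linear effect model: return coefficient, std. error and p-value."""
    fit = smf.ols(formula, data=df).fit(cov_type=cov_type)
    return fit.params["treated"], fit.bse["treated"], fit.pvalues["treated"]


def estimate_did(df, cov_type="HC1"):
    """Difference-in-differences: the interaction term is the effect."""
    fit = smf.ols("score ~ treated * post", data=df).fit(cov_type=cov_type)
    return fit.params["treated:post"], fit.pvalues["treated:post"]


def correct_pvalues(pvals, method="fdr_bh", alpha=0.05):
    """Multiple-testing correction (e.g. 'bonferroni' or 'fdr_bh')."""
    reject, corrected, _, _ = multipletests(pvals, alpha=alpha, method=method)
    return reject, corrected
```

Switching `cov_type` between "HC0"-"HC3" (or clustered standard errors via `cov_kwds`) should cover the different standard error formulations.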
Note I: I think it makes sense to add a causality subdirectory, which collects the different tests and can easily be extended. E.g., starting with a base class TreatmentTest, we can have different model, standard error, and correction formulations.
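To make this concrete, a rough sketch of the base class; only the TreatmentTest name comes from the note above, everything else is a placeholder suggestion:

```python
# Rough sketch for the causality subdirectory. Only the TreatmentTest name
# comes from the note above; everything else is a placeholder suggestion.
from abc import ABC, abstractmethod
from dataclasses import dataclass

import statsmodels.formula.api as smf


@dataclass
class TestResult:
    effect: float
    std_error: float
    p_value: float


class TreatmentTest(ABC):
    """Base class: subclasses plug in model/standard error/correction choices."""

    def __init__(self, cov_type: str = "HC1"):
        self.cov_type = cov_type

    @abstractmethod
    def fit(self, data) -> TestResult:
        """Estimate the treatment effect on the logged results."""


class LinearEffectTest(TreatmentTest):
    """Simplest concrete test: OLS with a treatment dummy."""

    def fit(self, data) -> TestResult:
        fit = smf.ols("score ~ treated", data=data).fit(cov_type=self.cov_type)
        return TestResult(
            effect=fit.params["treated"],
            std_error=fit.bse["treated"],
            p_value=fit.pvalues["treated"],
        )
```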
Note II: Many of these features will also be needed in the population-based training pipeline, e.g. Wald t-test for deciding whether two sampled population members are performing better/worse.
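A minimal sketch of that comparison, here written as a Welch two-sample t-test via scipy as a stand-in for the Wald test; the function name and the one-sided decision rule are assumptions:

```python
# Minimal sketch: compare two population members from their sampled
# evaluation scores with a Welch two-sample t-test (one-sided decision).
from scipy import stats


def member_a_is_better(scores_a, scores_b, alpha=0.05):
    """True if member A's scores are significantly higher than member B's."""
    t_stat, p_two_sided = stats.ttest_ind(scores_a, scores_b, equal_var=False)
    return bool(t_stat > 0 and p_two_sided / 2 < alpha)
```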
In the distant future:
Add an experiment type that performs an intervention (e.g. a learning rate change) at a specific timepoint and estimates the treatment effect over time.
Add support for Nandan's 'virtual-labs': automatic spawning of additional random seeds based on the variance estimate of the effect coefficient, with a fixed budget of total jobs and jobs per iteration (see the sketch after this list).
Add Bayesian effect modelling with credible intervals, i.e. a full posterior over the effect coefficient.
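A toy sketch of how the 'virtual-labs' loop could be orchestrated (referenced in the list above); `run_seed` is a hypothetical callable that launches one job and returns its score, and the stopping rule uses the standard error of the mean score as a simplified stand-in for the variance of the effect coefficient:

```python
# Toy sketch of the 'virtual-labs' idea: spawn seeds in batches until the
# estimate is precise enough or the total budget is exhausted.
# `run_seed` is a hypothetical callable returning one seed's score.
import numpy as np


def adaptive_seeding(run_seed, total_budget=50, jobs_per_iter=5,
                     target_se=0.05, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    std_error = np.inf
    while len(scores) < total_budget and std_error > target_se:
        n_new = min(jobs_per_iter, total_budget - len(scores))
        for s in rng.integers(0, 2**31 - 1, size=n_new):
            scores.append(run_seed(int(s)))
        std_error = np.std(scores, ddof=1) / np.sqrt(len(scores))
    return float(np.mean(scores)), float(std_error), len(scores)
```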
Things to Think About
Non-Gaussianity: bootstrap (parametric/non-parametric) standard errors, Fisher exact p-values, permutation tests (see the sketch after this list).
Instrumental variables
Different treatment timepoints and tracking their time series - e.g. when to adjust the learning rate?
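For the non-Gaussian case, a permutation test over the treatment labels is probably the simplest starting point; a minimal, distribution-free sketch:

```python
# Minimal permutation test: distribution-free p-value for a difference in
# mean scores between treated and control runs.
import numpy as np


def permutation_pvalue(treated, control, n_permutations=10_000, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([treated, control])
    observed = np.mean(treated) - np.mean(control)
    n_treated = len(treated)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(pooled)
        diff = np.mean(perm[:n_treated]) - np.mean(perm[n_treated:])
        exceed += abs(diff) >= abs(observed)
    return (exceed + 1) / (n_permutations + 1)
```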
I would love to add rliable here. Many of its comparison/testing techniques would be great to have for benchmarking. These include:
Stratified Bootstrap Confidence Intervals
Score Distributions
Interquartile Mean
I would recommend installing it and importing its metrics/bootstrap procedures. These could either be calculated automatically on the hyper_log of a search or only be provided offline afterwards. See what works best with the benchmark project.
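A minimal sketch of such a hook, following the usage documented in rliable's README; how the hyper_log maps onto the required score_dict (one array of shape runs x tasks per configuration) is left open:

```python
# Sketch based on rliable's documented interface: `score_dict` maps each
# configuration name to a score array of shape (num_runs, num_tasks).
import numpy as np
from rliable import library as rly
from rliable import metrics


def aggregate_with_cis(score_dict, reps=2000):
    """IQM, mean and median with stratified bootstrap confidence intervals."""
    aggregate_func = lambda scores: np.array([
        metrics.aggregate_iqm(scores),
        metrics.aggregate_mean(scores),
        metrics.aggregate_median(scores),
    ])
    point_estimates, interval_estimates = rly.get_interval_estimates(
        score_dict, aggregate_func, reps=reps)
    return point_estimates, interval_estimates
```

rliable's plot_utils could then cover the score distribution / performance profile plots.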