\section{MLCommons Science Working Group}
\label{sec:science-wg}
\subsection{About the Working Group}
The Science working group~\cite{mlcommons-science} was an early member of MLCommons Research. It was created by the international community working on AI for Science, including national laboratories, large-scale experimental facilities, universities, and commercial entities, to advance AI for Science alongside other national and international initiatives (e.g.,~\cite{ai4s-doe-report}). The overarching aim of the WG is to support the various scientific communities seeking to leverage AI to advance scientific discovery. Since its inception, the WG has grown to almost 120 members across a range of organizations. The group also collaborates with other working groups within MLCommons, such as the HPC~WG~\cite{hpc-wg-arxiv}, with which it shares a number of overlapping interests. The overall mission of the group entails collaborative engagement across different domains of science.
\subsection{Science Benchmarking}
Achieving the overall goals of the working group requires a number of sub-aspects to be covered by the WG: (a) identifying representative scientific problems where AI can make a difference, (b) engineering at least one ML solution to each problem to serve as a baseline implementation, (c) identifying relevant datasets on which the ML models can be trained or tested, (d) identifying a scientifically driven metric that helps recognize scientific advancement on the problem, (e) curating and publishing the relevant datasets, (f) publishing scientific results that help the communities develop and improve these solutions, and (g) fostering collaborations and scientific achievements across multidisciplinary communities. These activities are akin to conventional benchmarking, but with one major difference: the focus is on scientific merit rather than pure performance, hence the notion of science benchmarking. Since its formation, the WG has consulted a large number of scientific organizations and worked with scientists on several of the sub-aspects listed above. In particular, the WG has identified four science benchmarks drawn from different branches of science: (a) cloud masking ({\tt cloud-masking})~\cite{sciml-bench:2021}, from the atmospheric sciences; (b) space-group classification of solid-state materials from Scanning Transmission Electron Microscope (STEM) data using deep learning ({\tt stemdl})~\cite{laanait-scanning}, from solid-state physics; (c) learning a time-evolution operator ({\tt tevelop})~\cite{fox2022-jm}, exemplified by earthquake prediction, from the earth sciences; and (d) predicting tumor response to drugs ({\tt candle-uno}), from healthcare.
We discuss these benchmarks in detail in Section~\ref{sec:benchmarks}. The key point here is that a single benchmark is a combination of a baseline (or reference) implementation and one or more datasets. The scientific data requires special attention: although scientific datasets are widespread and common, curating, maintaining, and distributing large-scale scientific datasets for public consumption is a challenging process, covering aspects ranging from abiding by the FAIR principles~\cite{wilkinson2016fair} to distribution and versioning of the datasets. These benchmarks serve a multitude of purposes, which are discussed at length in~\cite{natrev:jeyan,roysoc:tonyhey}. It is worth highlighting, however, that these scientific benchmarks serve one important purpose for the wider AI community: they offer unprecedented pedagogical value across domain boundaries.
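To make this composition concrete, the following minimal Python sketch (purely illustrative; the names and structure are our assumptions, not part of any MLCommons code base) views a science benchmark as a bundle of a reference implementation, one or more datasets, and a scientific metric:
\begin{verbatim}
# Hypothetical sketch: a science benchmark as a bundle of a reference
# implementation, one or more datasets, and a scientific metric.
# (Names are illustrative assumptions, not MLCommons code.)
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ScienceBenchmark:
    name: str                    # e.g. "cloud-masking"
    datasets: List[str]          # dataset identifiers or URIs
    reference_impl: Callable     # baseline training/inference entry point
    scientific_metric: Callable  # domain-driven measure of success

def evaluate(benchmark: ScienceBenchmark):
    """Run the baseline and report the scientific metric."""
    predictions, targets = benchmark.reference_impl(benchmark.datasets)
    return benchmark.scientific_metric(predictions, targets)
\end{verbatim}
Under this view, a concrete benchmark such as {\tt stemdl} would pair its baseline model with the corresponding STEM dataset and its domain-specific metric.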
\subsection{Policies for Benchmarking}
Benchmarking is an art and can be very subjective. Without clear policies, benchmarking results can be interpreted subjectively and inconsistently, undermining the purpose of the whole initiative. Establishing a set of policies, rules, and guidelines for evaluating and reporting benchmark results is therefore an important step. The Science WG is in the process of drafting a detailed policy statement; for brevity, we mention only some of the key points here. The overarching policy will cover training and inference benchmarks, with sub-policies for each individual benchmark, as no two benchmarks are the same. In general, the policies will cover the evaluation of benchmarks under two divisions, the Open and Closed divisions. Evaluation under the Open division focuses on achieving better scientific results (as measured by the established scientific metric); the community therefore has considerable freedom to enhance the underlying ML models or the pre- or post-processing aspects of the benchmarks, including data augmentation, wherever that is possible or sensible. Evaluation under the Closed division, in contrast, limits this freedom and will typically list the permissible changes for each benchmark. In general, pre- and/or post-processing and data aspects are kept fixed, with flexibility to change or fine-tune the underlying ML model. Similarly, policies around the submission of results may vary across benchmarks: some benchmarks may require a certain set of measurements to be submitted, such as power or network performance, while others may rely on generic details along with the scientific metrics.
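As a purely illustrative sketch of how such per-benchmark rules might be recorded (the field names and values are our assumptions, not the WG's actual policy format), consider:
\begin{verbatim}
# Hypothetical sketch (illustrative assumptions only, not the WG's
# actual policy format): per-benchmark rules for the two divisions.
policy = {
    "benchmark": "cloud-masking",
    "open": {
        # Open division: improve the scientific metric with broad freedom.
        "model_changes": "unrestricted",
        "pre_post_processing": "may be modified",
        "data_augmentation": "allowed where sensible",
    },
    "closed": {
        # Closed division: data and pre-/post-processing kept fixed,
        # with flexibility to change or fine-tune the ML model.
        "model_changes": "change or fine-tune allowed",
        "pre_post_processing": "fixed",
        "data_augmentation": "fixed",
    },
    # Submission requirements may also vary per benchmark.
    "required_measurements": ["scientific_metric", "power", "network"],
}
\end{verbatim}
Recording the division rules alongside the submission requirements, as in this sketch, would make it straightforward to check a submitted result against the policy it was produced under.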