Reproducible, Reliable, Reusable Analyses with BinderHub on Cloud

Sarah Gibson, The Alan Turing Institute

UKRI Cloud Workshop, 12 February 2019

You can follow along with this demo at http://bit.ly/sgibson-ukri-demo-2019

What does "Reproducibility" mean?

  • It has a lot of different meanings across research fields
  • In this context, "reproducible" means that the same results (e.g. those published in a paper) are generated when the same input data are pushed through the same analysis pipeline

An independent person should be able to easily check my (our) work.
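As a rough illustration of "same input, same pipeline, same result", here is a minimal sketch that runs a toy analysis twice and compares fingerprints of the output. The `run_pipeline` and `fingerprint` functions and the input values are made up for this example; a real pipeline would be checked the same way, just with real data.

```python
import hashlib
import random


def run_pipeline(data, seed=0):
    """Toy analysis (hypothetical): resample the data, then average it."""
    rng = random.Random(seed)  # fixed seed so repeated runs draw the same sample
    sample = [rng.choice(data) for _ in data]
    return sum(sample) / len(sample)


def fingerprint(result):
    """Hash the result's repr so two runs can be compared exactly."""
    return hashlib.sha256(repr(result).encode("utf-8")).hexdigest()


input_data = [1.0, 2.5, 3.5, 4.0]  # made-up input values
first = fingerprint(run_pipeline(input_data))
second = fingerprint(run_pipeline(input_data))
print("reproducible:", first == second)
```

Publishing the fingerprint alongside the result gives an independent person something concrete to check their own run against.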

An Example of Not Being Reproducible

(AKA: I'm guilty of this too...)

[Image: graduation]

  • Astrophysics PhD, graduated January 2019
  • Researched a phenomenon known as Gamma-Ray Bursts (not your average supernova...) and the neutron stars that power them

[Figure: figure2]

  • This is a figure published in my first journal paper
  • It describes the evolution of the spin of a neutron star (middle panel) and the mass of its accretion disc (top panel) as it was being fed by fallback accretion
  • The bottom panel shows the property we were interested in: the point at which a derived parameter crosses a specific threshold (the dashed line)

What my research was about isn't necessarily important. What is important is whether other scientists in my field could verify my work.

[GIF: fig2_creation]

This is a GIF of my PhD laptop producing the figure, which turned out not to be reproducible for a number of reasons:

  • The code was not version controlled
  • The computing environment(s) were not documented
  • The computing environment no longer exists - the laptop has been returned and wiped 😱

Binder to the Rescue!

With a little bit of work, I've managed to reproduce my figure using Binder (https://mybinder.org).

  • My code is in a public GitHub repo - now version controlled ☑️
  • The computing environment has been documented in an environment.yml file ☑️ (other config file types are available)
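For anyone whose environment is still undocumented, here is a minimal sketch of one way to record it, assuming a Python environment. It lists every installed distribution as a `name==version` line, which is the format used by `requirements.txt` (one of the other config file types Binder accepts). The function name `pinned_requirements` is made up for this example, and this is not how my own `environment.yml` was produced.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that have no recorded name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    # Print the pins; redirect to a file to keep them next to the code.
    print("\n".join(pinned_requirements()))
```

Committing the output alongside the analysis code means the environment is version controlled too.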

My PhD repo:

Link to full workflow GIF: phd_demo.gif 🚫 Emergency back-up GIF

What is Binder doing?

[Image: mybinder]

Courtesy of Juliette Belin

Read the docs on making your own repo Binder-ready at https://mybinder.readthedocs.io
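Once a repo is Binder-ready, anyone can launch it from a URL. The sketch below builds that URL for a GitHub repo using the `/v2/gh/<owner>/<repo>/<ref>` pattern that mybinder.org uses. The helper name `binder_launch_url` and the owner/repo values are placeholders for illustration, and the default ref `HEAD` is an assumption; use whichever branch, tag or commit you want to share.

```python
from urllib.parse import quote


def binder_launch_url(owner, repo, ref="HEAD", host="https://mybinder.org"):
    """Build a launch URL for a GitHub repo on a Binder deployment."""
    # Percent-encode each part so refs containing "/" stay a single path segment.
    parts = [quote(owner, safe=""), quote(repo, safe=""), quote(ref, safe="")]
    return f"{host.rstrip('/')}/v2/gh/" + "/".join(parts)


# Placeholder owner and repo names, not a real project.
print(binder_launch_url("example-user", "example-repo"))
```

The `host` argument is there so the same helper can point at a different Binder deployment instead of the public one.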

Limitations of the public Binder instance

Because it costs the Binder team about 5000 USD per month to run, the public Binder instance is limited by design:

  • It only works for public repos, so it cannot host private code or sensitive data
  • Large datasets are discouraged
  • Computing resources are minimal

Solution: BinderHub 4 U

[Image: yourbinder]

  • The host institution/organisation/RSE group can choose whether to make repos public or private
  • This is an ongoing project at the Turing Institute

BinderHub is an umbrella for:

  • Building a docker image from a code repository
    • repo2docker
  • Launching an interactive browser displaying that code repository
    • JupyterHub
  • Distributing multiple instances of that code repository across the Cloud
    • Kubernetes with Microsoft Azure/Google Cloud/Amazon Web Services

Some useful links:

Why Tracking Dependencies is Important (demo)

Version updates to software packages can fundamentally change how your code behaves without raising a fatal error, so the change can slip by without you realising.

Here's a little demo repo to highlight this: binder-examples/matplotlib-versions

Link to full workflow GIF: ukri_demo.gif 🚫 Emergency back-up GIF

Ok, so you may not worry too much about reproducing "style" in this way, but imagine if the difference were numerical, or if a suite of interacting libraries were updated and became incompatible with each other.
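One way to catch this kind of silent change is to compare what is installed against what you recorded. Here is a minimal sketch; the helper names and the pinned version numbers are hypothetical and are not taken from the demo repo.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_drift(pinned, lookup=installed_version):
    """Map each package whose version differs to (expected, found)."""
    drift = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            drift[package] = (expected, found)
    return drift


# Hypothetical pins for illustration only.
pinned = {"matplotlib": "3.0.2", "numpy": "1.16.1"}

for package, (expected, found) in find_drift(pinned).items():
    print(f"{package}: expected {expected}, found {found}")
```

Running a check like this at the top of an analysis turns a silent version change into a visible warning. The `lookup` argument exists only so the comparison can be tried out without touching the real environment.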

Thank You!

Thanks to The Turing Way team!

  • Becky Arnold 💬 💻 📖 👀
  • Louise Bowler 💬 💻 📖 💡 📋 👀
  • Sarah Gibson 💬 💻 📖 🔧 👀 📢
  • Patricia Herterich 💬 📖 👀
  • Rosie Higman 💬 📋 👀
  • Anna Krystalli 💬 💡 📋 👀
  • Alexander Morley 💬 👀 ⚠️
  • Martin O'Reilly 💬 🔧
  • Kirstie Whitaker 💬 🎨 🔍 🤔 👀 ⚠️ 📢

The Turing Way is a lightly opinionated guide to reproducible data science. Our goal is to provide all the information that researchers need at the start of their projects to ensure that they are easy to reproduce at the end.

Please visit our repo and help us deliver our dream!

github.com/alan-turing-institute/the-turing-way

Also, thanks to the Binder team for sharing their knowledge!

  • Tim Head 💬 🤔
  • Chris Holdgraf 💬 🤔
  • Benjamin Ragan-Kelley 💬 🤔
  • and many others!

Binder/BinderHub Workshops

Emoji Contribution Key

Emoji Represents
💬 Answering Questions (on gitter, GitHub, or in person)
💻 Code
📖 Documentation and specification
🎨 Design
💡 Examples
📋 Event Organizers
🔍 Funding/Grant Finders
🤔 Ideas & Planning
👀 Reviewed Pull Requests
🔧 Tools
⚠️ Tests
📢 Talks