BrainHack Kraków

Tools for reproducible workflows

title image
Stephan Heunis
@fMRwhy jsheunis

Psychoinformatics lab
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7), Germany

Acknowledgements

A couple of things about me...

  • I'm a research software engineer at FZ Jülich. My background:
    • Studied engineering (robotics)
    • Worked for a large company and then a tiny startup
    • Moved to the Netherlands to do a PhD in neuroimage processing
    • Started my current job at FZJ in Germany
  • I passionately like: open and inclusive science, programming, learning, a South African "braai"
  • I passionately dislike: hierarchies, systemic misogyny/racism/ableism, any form of admin

A couple of things about me...

  • My biases:

A couple of things about YOU...

http://etc.ch/AHNN


So let's talk about reproducibility

Reproducibility: what does it mean?


CC-BY Scriberia and The Turing Way

Reproducibility: what is needed?


"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete ... set of instructions [and data] which generated the figures." (David Donoho, 1998)

http://statweb.stanford.edu/~wavelab/Wavelab_850/wavelab.pdf

Reproducibility: what is needed?

Transparency in methods

Reproducibility: why neuroscience?

Complex issues: Data

  • Neuroscientists acquire interesting data, but it has its peculiarities:
    • Depending on acquisition hardware and analysis software, some data are in proprietary formats (e.g., Neuromag, BrainVoyager, BrainVision)
    • Depending on the field, data can be sizeable (e.g., (f)MRI, CT, EEG, PET, MEG)
    • Heterogeneous data from complex acquisitions with multiple data channels and modalities
    • Datasets are getting bigger and bigger (Bzdok & Yeo, 2017), e.g. multi-modal imaging, behavioral + genetics data in the HCP (humanconnectome.org) or UK Biobank (ukbiobank.ac.uk)
    • Some data fall under the General Data Protection Regulation (GDPR)

This makes data harder to access, structure, and share

Complex issues: analyses

  • Much of neuroscientific research is computationally intensive, with complex workflows from raw data to result, and plenty of researcher degrees of freedom

Poline et al., 2011

Complex issues: analyses

    The analytic flexibility leads to sizable variations in conclusions.

    NARPS study (Botvinik-Nezer et al., 2020): 70 independent research groups investigated nine hypotheses on the same data, with consistent conclusions for only four hypotheses.

    The variety of methodological & analytical choices is not the enemy of computational reproducibility; the challenge lies in encoding those degrees of freedom in a standardized, ideally machine-readable way (Gilmore et al., 2017).

Complex issues: software/tools

  • Software is part of the digital provenance of your work: some analyses will only work in the desired way (or at all) with specific versions of a software package
  • But it goes beyond a single piece of software: modern data analysis software has an incredibly complex dependency stack
  • Example: scikit-learn. Direct dependencies: 38 packages, 153 dependency relations. Recursive dependencies: 485 packages, 10,715 dependency relations
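
To see such a dependency stack for yourself, standard pip tooling can help; this sketch is not from the original slides and assumes scikit-learn is installed:

        # List scikit-learn's direct dependencies:
        pip show scikit-learn | grep Requires
        # Inspect the recursive dependency tree with the third-party pipdeptree tool:
        pip install pipdeptree
        pipdeptree -p scikit-learn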

Complex issues: infrastructure

"Works on my machine"

derickbailey.com

Complex issues: changes over time


Based on Piled Higher and Deeper 1531

Complex issues: changes over time


Piled Higher and Deeper 1323

Complex issues: changes over time

"This used to work on my machine..."


And if the analysis does not fail, but produces different results,
how do you know which is the correct one?

So what do we end up with?


Based on xkcd.com/2347/ (CC-BY)

So what do we end up with?

"Shit, which version of which script produced these outputs from which version of what data?"
CC-BY Scriberia and The Turing Way

What should we do about it???

The pipeline needs to become transparent
CC-BY Scriberia & The Turing Way

Digital Provenance = A complete description of how a digital file came to be (FAIR principles)

What should we do about it???

The pipeline needs to become automated
CC-BY Scriberia & The Turing Way

Don't panic!

Large & growing ecosystem of Open Source software & data formats

... and many more!




  • Remove accessibility barriers
  • Transparent digital provenance

And finally... the point of my talk:

Tools for reproducible workflows in neuroscience

A PhD student walks into a lab...

You've just published a paper in which you calculated the cortical thickness for a group of research participants. You ran some statistical tests and visualized the results.

A PhD student walks into a lab...

Soon after, a researcher in your field sends you an email:

A PhD student walks into a lab...

How do we approach this challenge? Which skills should we have to go...

from reacting like this:

to reacting like this:

A PhD student walks into a lab...

We are going to take a step-wise, "Galaxy brain" approach:

Step 1: share data and code

Why don't we just send them the data and code "via" a download link?

Step 1: share data and code

💬 Why don't we just send them the data and code "via" a download link?

  • You have to create a link and send an email every time someone requests it
  • There are no instructions included for running the analysis
  • The person might not have the correct software or package versions installed
  • They could perhaps figure out from the code which packages are required, but installing them might interfere with their existing environment, which might break things down the line.

Step 1: share data and code

So they tried running the script on their machine, and then...


💬 What went wrong? What should we have done?

Step 1: share data and code

To prevent this issue (or similar issues), while still allowing others to run the code on their machines, we need to share:

1. The required packages → requirements.txt
2. The Python version → virtual environment
3. Instructions for how to use these to successfully run the script → README
4. Data and code and all of the above in an accessible location → GitHub
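
As a sketch (all names and the URL below are illustrative, not the actual repository), such a repository could look like this:

        # Hypothetical repository layout:
        #
        # my-analysis/
        # ├── README.md          # instructions: how to install requirements and run the script
        # ├── requirements.txt   # required Python packages (and versions)
        # ├── analysis.py        # the cortical thickness analysis
        # └── data/              # the data, or a pointer to where it is hosted
        #
        # A colleague can then obtain everything with one command:
        git clone https://github.com/<user>/my-analysis.git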

Example GitHub Repository

Step 1: share data and code

Open (and free) data repositories, e.g. OpenNeuro: Extended list of repositories

Step 1: share data and code

"Galaxy brain" update:

Step 2: software environment and requirements

After sharing the news about the public GitHub repo with our colleague, we get the following reply:



Now we'll focus on requirements and virtual environments

Step 2: software environment and requirements

Introducing requirements.txt

  • Single file to capture required Python packages
  • Makes installation straightforward with pip:

        pip install -r requirements.txt

  • In the file, you can specify the required packages (check the script) and their versions:

        matplotlib==3.2.2
        numpy>=1.16.5
        pandas
        nibabel
        nilearn>=0.7.1
        scikit-learn
        brainspace
NOTE: not all packages/tools required for an analysis are necessarily Python packages, and even if they are, they might not be available on the Python Package Index (PyPI). This means that installing everything with pip and a requirements.txt file might not be sufficient.
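
One way to generate such a file in the first place (a sketch using standard pip functionality, not shown on the original slides):

        # Capture the exact versions of everything installed in the current environment:
        pip freeze > requirements.txt
        # Best run inside a clean virtual environment that contains only the
        # packages the analysis actually needs, to avoid pinning unrelated packages.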

Step 2: software environment and requirements

Introducing requirements.txt

  • Check whether some packages/tools might have to be installed differently:
    • APT: for managing packages (retrieval, configuration and installation) on Unix-like computer systems
    • conda: package, dependency and environment management for any language, running on Windows, macOS and Linux
  • Add extra installation instructions to the README, e.g.:

        git clone https://github.com/MICA-MNI/BrainStat.git
        cd BrainStat
        python3 setup.py build
        python3 setup.py install

Step 2: software environment and requirements

Introducing Virtual Environments

  • requirements.txt helps a lot, but what if the colleague already has some of these packages installed? 💬
  • Installing new packages, or different versions of existing packages, can interfere with their local Python setup and cause errors.
XKCD

Step 2: software environment and requirements

Introducing Virtual Environments

It would be great if colleagues could install our package requirements in a clean and isolated environment ==> virtual environments!

  • virtualenv

        #!/bin/bash
        pip install virtualenv                    # install the package
        virtualenv --python=python3 mypythonenv   # create a new virtual environment
        source mypythonenv/bin/activate           # activate the virtual environment
        # now install your packages with pip and do the analysis
        deactivate                                # deactivate the virtual environment

  • miniconda

        #!/bin/bash
        # install miniconda using the install files via the link
        conda create -n mypythonenv python=3.6
        conda activate mypythonenv
        # now install your packages with conda and/or pip and do the analysis
        conda deactivate                          # deactivate the virtual environment
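
As an aside (not on the original slides): conda can also capture a whole environment in a single file, which a colleague can then recreate in one step:

        # On our machine: export the environment specification
        conda env export --name mypythonenv > environment.yml
        # On the colleague's machine: recreate the environment from that file
        conda env create -f environment.yml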

Step 2: software environment and requirements

So we send the update to our colleague:

Step 2: software environment and requirements

"Galaxy brain" update:

Step 3: cloudy containers

So you decide you can finally chill out now that everything should be working, right? Unfortunately, our feeling of accomplishment is short-lived, because Professor Important Dude is back with another question...

Step 3: cloudy containers

And then you realise... many things. At the same time:

  • Damn, you forgot about the OpenGL/LibGL requirement
  • You really don't want to have another Zoom call
  • You absolutely don't want to have multiple Zoom calls forever


So you cry out in desperation: Why can't it just work on their machines?!

Step 3: cloudy containers


Reddit

Step 3: cloudy containers

Introducing containers

Step 3: cloudy containers

Introducing containers


Rachael Ainsworth
If we don't always want to have to explain how to install yet another tool, maybe we can run everything in the cloud...

Step 3: cloudy containers

Introducing Binder: mybinder.org


CC-BY Scriberia and The Turing Way
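
In practice, making our (hypothetical) repository Binder-ready requires little beyond the files we already share; the launch URL below follows Binder's standard scheme:

        # Binder builds a container image from configuration files in the repository
        # (e.g. requirements.txt at the repo root) and serves it in the cloud.
        # Anyone can launch it from mybinder.org, or via a direct link such as:
        #
        #     https://mybinder.org/v2/gh/<user>/<repo>/HEAD
        #
        # No installation needed on the colleague's machine.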

Step 3: cloudy containers


Example:

Step 3: cloudy containers

We can't hold our excitement, so we send a quick email to notify the colleague:

Step 3: cloudy containers

"Galaxy brain" update:

STEP 4: Interplanetary sci-comm

So you're busy (rightfully) thinking that you've done a great job of making your work more reproducible, but then... it looks like we spoke too soon. Our VIP friend has another request...

STEP 4: Interplanetary sci-comm

So firstly, it works for them!!! (Always try and celebrate the wins in academia, however big or small.)

Secondly, let's take a deeper look into this notebook thing.
And what's with Jupiter?


NASA SpacePlace

STEP 4: Interplanetary sci-comm

"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more."

Jupyter.org

STEP 4: Interplanetary sci-comm

Jupyter notebooks:

  • Can be installed (amongst other methods) with conda or pip:

        conda install -c conda-forge notebook
        # or:
        pip install notebook
  • Can work with kernels for Python, R, Octave, Julia and many more!
  • Form part of the bigger Jupyter ecosystem (including JupyterLab and JupyterHub)
  • If on a public repository, can be viewed statically with nbviewer
  • If on Binder, can be interacted with and edited in the cloud
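
For completeness, launching the notebook server locally looks like this (standard Jupyter commands):

        # Start the classic notebook server from the repository root;
        # it opens a browser tab where notebooks can be created and run:
        jupyter notebook
        # Or use the newer interface (requires the jupyterlab package):
        jupyter lab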

STEP 4: Interplanetary sci-comm

This is what you have achieved!

Juliette Taka, Logilab and the OpenDreamKit project

STEP 4: Interplanetary sci-comm

"Galaxy brain" update:

STEP 5: Reproducible data management

In many cases we will work with large datasets, and multiple analysis pipelines will form part of the full research workflow, each with its own list of software and environment requirements. Managing the full reproducibility of such workflows can be challenging (or impossible) without the right tools, which should allow:
  1. Data version control
  2. Provenance capture


STEP 5: Reproducible data management

Introducing DataLad

  • A command-line tool (and GUI!) with a Python API
  • Builds on top of Git and git-annex
  • Allows...
    ... version-controlling arbitrarily large content,
    ... easily sharing and obtaining data (note: no data hosting!),
    ... (computationally) reproducible data analysis,
    ... and much more
  • Completely domain-agnostic
  • Available for all major operating systems (Linux, macOS/OSX, Windows): installation instructions
  • Detailed documentation: the DataLad Handbook
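
A minimal sketch of what a DataLad workflow could look like (dataset, file, and commit-message names are illustrative):

        # Create a new dataset (a Git/git-annex repository under the hood):
        datalad create my-dataset
        cd my-dataset
        # ... add data and code, then record them in version control:
        datalad save -m "Add raw data and analysis script"
        # Run the analysis so that command, inputs and outputs are captured
        # as machine-readable provenance alongside the results:
        datalad run -m "Compute cortical thickness" "python code/analysis.py"
        # Later, anyone with the dataset can re-execute the recorded command:
        datalad rerun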

STEP 5: Reproducible data management

"Galaxy brain" update:

So...
now you might feel like this

via GIPHY

But that is why brainhacks exist!

  • Learn
  • Collaborate
  • Make new friends and colleagues
  • Form a community
  • Improve science!