Research Data Management with DataLad

Talk @ Einstein Center for Neuroscience, Berlin

Stephan Heunis
@jsheunis

Psychoinformatics lab, Institute of Neuroscience and Medicine, Brain & Behavior (INM-7)
Research Center Jülich, Germany

Slides: 
jsheunis.github.io/einstein-center-talk


Why research data management?

We need reproducibility


"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete ... set of instructions [and data] which generated the figures." David Donoho, 1998

http://statweb.stanford.edu/~wavelab/Wavelab_850/wavelab.pdf

Reproducibility in neuroscience

It's complicated...

Complex issues: Data

  • Neuroscientists acquire interesting data, but it has its peculiarities:
    • Depending on acquisition hardware and analysis software, some data are in proprietary formats (e.g., Neuromag, BrainVoyager, BrainVision)
    • Depending on field, data can be sizeable (e.g., (f)MRI, CT, EEG, PET, MEG, microscopy)
    • Heterogeneous data from complex acquisitions with multiple data channels and modalities
    • Datasets are getting bigger and bigger (Bzdok & Yeo, 2017), e.g. multi-modal imaging, behavioral + genetics data in the HCP (humanconnectome.org) or UK Biobank (ukbiobank.ac.uk)
    • Some data fall under General Data Protection Regulation (GDPR)

This makes data harder to access, structure, and share

Complex issues: analyses

  • Much of neuroscientific research is computationally intensive, with complex workflows from raw data to result, and plenty of researcher degrees of freedom

Complex issues: analyses

    Analytic flexibility leads to sizable variation in conclusions: in the NARPS study (Botvinik-Nezer et al., 2020), 70 independent research groups investigated 9 hypotheses on the same data and reached consistent conclusions for four of the hypotheses.

    The variety of methodological and analytical choices is not the enemy of computational reproducibility; the challenge lies in encoding those degrees of freedom in a standardized, ideally machine-readable way (Gilmore et al., 2017).

Complex issues: software/tools

  • Software is part of the digital provenance of your work: some analyses will only work in the desired way (or at all) with specific versions of a software package
  • But it goes beyond "one software": modern data analysis software has an incredibly complex dependency stack
  • Example: scikit-learn. Direct dependencies: 38 packages, 153 dependency relations. Recursive dependencies: 485 packages, 10,715 dependency relations

Complex issues: infrastructure

"Works on my machine"

derickbailey.com

Complex issues: changes over time


Based on Piled Higher and Deeper 1531

Complex issues: changes over time

"This used to work on my machine..."


And if the analysis does not fail, but produces different results,
how do you know which is the correct one?

So what do we end up with?


Based on xkcd.com/2347/ (CC-BY)

So what do we end up with?

"Shit, which version of which script produced these outputs from which version of what data?"
CC-BY Scriberia and The Turing Way

What should we do about it???

The pipeline needs to become transparent
CC-BY Scriberia & The Turing Way

Digital Provenance = A complete description of how a digital file came to be (FAIR principles)

What should we do about it???

The pipeline needs to become automated
CC-BY Scriberia & The Turing Way

Thus: everything should be FAIR...

  • Findable
  • Accessible
  • Interoperable
  • Reusable


https://www.go-fair.org/fair-principles Wilkinson et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

But what does FAIR really mean, practically?

  • Bench/bed/field-side researchers are an essential source of
    valid metadata, critical for FAIR data
  • Their resources are limited, and they need something in exchange, otherwise FAIR won't happen


Why not focus on enabling practical collaboration
(even if just with one's future self)?


Why not make the aspirational goal "FAIR data"
a by-product of enabling efficient research?

The DataLad approach

V.A.M.P. (practical) vs F.A.I.R. (aspirational)

Divebomb Records

Be FAIR and immediately benefit from it yourself...

...while still working towards the greater good of FAIR data
  • Version-controlled
  • Actionable metadata
  • Modular
  • Portable


An overview of DataLad

What is a DataLad?


  • A free and open source tool
  • for decentralized (research) data management
  • with a command line interface, Python API, and a GUI (pre-alpha)
  • allowing exhaustive tracking of the evolution of digital objects
  • and computational provenance tracking
  • to enhance modularity, portability and reproducibility.



Let's explore this...





Data publishing

Data consumption






Use cases

(Meta)data deposition (on Dataverses)

  • Register any dataset at any Dataverse site (e.g. Jülich DATA), receive citable DOI
  • No requirement to re-host data (avoids duplication of storage cost)
  • Data owner remains in full control over data access
  • DataLad extension: https://github.com/datalad/datalad-dataverse


DataLad contact and more information



Website + Demos http://datalad.org
Documentation http://handbook.datalad.org
Talks and tutorials https://youtube.com/datalad
Development http://github.com/datalad
Support https://matrix.to/#/#datalad:matrix.org
Open data http://datasets.datalad.org
Mastodon @datalad@fosstodon.org
Twitter @datalad



distribits 2024

The first distribits meeting will happen in 2024, and we are inviting all interested parties to join! The aim of this meeting is to bring together enthusiasts of tools and workflows in the domain of distributed data. It is organized by the people behind the git-annex and DataLad projects. The event will comprise a two-day conference and an additional hackathon day.

Code examples: Basics

Using DataLad in the Terminal

Check the installed version:
            
                datalad --version
            
            

For help on using DataLad from the command line:
                
                    datalad --help
                
            
For extensive info about the installed package, its dependencies, and extensions, use wtf:
                
                    datalad wtf
                
            

git identity

Check git identity:
            
                git config --get user.name
                git config --get user.email
            
        
Configure git identity:
                
                    git config --global user.name "Stephan Heunis"
                    git config --global user.email "s.heunis@fz-juelich.de"
                
            

Using DataLad via its Python API

Open a Python environment:
            
                ipython
            
        
Import and start using:
                
                    import datalad.api as dl
                    dl.create(path='mydataset')
                
            
Exit the Python environment:
                
                    exit
                
            

DataLad datasets...

...DataLad datasets

Create a dataset (here, with the text2git config):
            
                datalad create -c text2git bids-data
            
        
Let's have a look inside. Navigate using cd (change directory):
                
                    cd bids-data
                
            
List the directory content, including hidden files, with ls:
                
                    ls -la .
                
            

Version control...

...Version control

Let's add some Markdown text to a README file in the dataset:
            
                echo "# A BIDS structured dataset for my input data" > README.md
            
        
Now we can check the status of the dataset:
                
                    datalad status
                
            
We can save the state with save:
                
                    datalad save -m "Add a short README"
                
            
Further modifications:
                
                    echo "Contains functional task data of one subject" >> README.md
                
            
Save again:
                
                    datalad save -m "Add information on the dataset contents to the README"
                
            
Now, let's check the dataset history:
                
                    git log
                
            

Data consumption & transport...

...Data consumption & transport...

Install a dataset from remote URL (or local path) using clone:
            
                cd ../
                datalad clone \
                https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
            
        
We can now view the cloned dataset's file tree:
                
                    cd studyforrest-data-phase2
                    ls
                
            
Let's check the dataset size (i.e. git repository):
                
                    du -sh # this will print the size of the directory in human readable sizes
                
            
Let's check the actual dataset size (i.e. git repository + annexed content):
                
                    datalad status --annex
                
            
The DataLad dataset is just the git repository, i.e. the metadata of all files in the dataset, including the content of all files committed to git. The actual file content in the annex can be retrieved as needed.
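To see this for yourself (a quick sketch; the file path is simply the one used in the drop example on the next slide), list an annexed file before retrieving its content:

                    # an annexed file is a symlink into .git/annex/objects;
                    # before "datalad get", the link target does not exist yet
                    ls -l sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz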

...Data consumption & transport

We can retrieve actual file content with get (here, multiple files):
            
                # get all files of sub-01 for all functional runs of the localizer task
                datalad get \
                sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-*.nii.gz
            
        
If we don't need a file locally anymore, we can drop it:
                
                    # drop a specific file
                    datalad drop \
                    sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz
                
            
And it's no problem if you need that exact file again, just get it:
                
                    # get a specific file
                    datalad get \
                    sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz
                
            
Therefore: no need to store all files locally. Data just needs to be available from at least one location, then you can get what you want when you need it, and drop the rest.
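To check where a file's content is currently available from, you can ask git-annex (which DataLad builds on) directly — a sketch, using the same example file as above:

                    # list all known locations that hold this file's content
                    git annex whereis \
                    sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz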

Dataset nesting...

Datasets can be nested in superdataset-subdataset hierarchies:
  • Helps with scaling (see e.g. the Human Connectome Project dataset)
  • Version control tools struggle with >100k files
  • Modular units improve intuitive structure and reuse potential
  • Versioned linkage of inputs for reproducibility

...Dataset nesting

Let's make a nest! First we navigate into the top-level dataset:
            
                cd ../bids-data
            
        
Then we clone the input dataset into a specific location in the file tree of the existing dataset, making it a subdataset (using the -d/--dataset flag):
                
                    datalad clone --dataset . \
                    https://github.com/datalad/example-dicom-functional.git  \
                    inputs/rawdata
                
            
Similarly, we can clone the analysis container (actually, a set of containers from ReproNim) as a subdataset:
                
                    datalad clone -d . \
                    https://github.com/ReproNim/containers.git \
                    code/containers
                
            
Let's see what changed in the dataset, using the subdatasets command:
                
                    datalad subdatasets
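Under the hood, this linkage is plain git: subdataset locations are recorded in .gitmodules, and each subdataset is pinned to an exact commit. A quick way to inspect this with git itself (a sketch):

                    cat .gitmodules          # registered subdataset paths and URLs
                    git submodule status     # the exact commit each subdataset is pinned to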
                
            

Computationally reproducible execution...

  • which script/pipeline version
  • was run on which version of the data
  • to produce which version of the results?

...Computationally reproducible execution...

  • The datalad run command can run any command in a way that links the command or script to the results it produces and the data it was computed from (see the minimal sketch below)
  • The datalad rerun command can take this recorded provenance and recompute the command
  • The datalad containers-run command (from the datalad-container extension) captures software provenance, in the form of software containers, in addition to the provenance that datalad run captures
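A minimal sketch of the plain run form (the script name and paths below are hypothetical placeholders, not files from this tutorial):

                datalad run -m "Run my preprocessing script" \
                --input "raw/sub-01" \
                --output "derivatives/sub-01" \
                "python code/preprocess.py raw/sub-01 derivatives/sub-01"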


With the datalad-container extension, we can inspect the list of registered containers (recursively):
                
                    datalad containers-list --recursive
                
            
We'll use the repronim-reproin container for DICOM conversion.

...Computationally reproducible execution

Now, let's try out the containers-run command:
            
                datalad containers-run -m "Convert subject 02 to BIDS" \
                --container-name code/containers/repronim-reproin \
                --input inputs/rawdata/dicoms \
                --output sub-02 \
                "-f reproin -s 02 --bids -l '' --minmeta -o . --files inputs/rawdata/dicoms"
            
        
What changed after the containers-run command has completed?
We can use datalad diff (based on git diff):
                
                    datalad diff -f HEAD~1
                
            
We see that some files were added to the dataset!
And we have a complete provenance record as part of the git history:
                
                    git log -n 1
                
            

Publishing datasets...


For example: (see next section)
  1. OSF
  2. SURFdrive (webdav)
  3. Dataverse
  4. GitLab

Using published data...

Let's use our published data in a new analysis, to demonstrate reusability and the usefulness of modularity.

First, let's create a new dataset using the YODA principles:
            
                cd ../
                datalad create -c yoda myanalysis
            
        
Then we can clone our GIN-published dataset as a subdataset
(NB: use the browser URL without ".git" suffix):
                
                    cd myanalysis
                    datalad clone -d . \
                    https://gin.g-node.org/your-gin-username/bids-data \
                    input
                
            

...Using published data...

We have data, and now we need an analysis script. We will use DataLad's download-url which gets the content of a script and registers its source:
            
                datalad download-url -m "Download code for brain masking from Github" \
                -O code/get_brainmask.py \
                https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py
            
        
Now we have data and an analysis script, and we still need the correct software environment within which to run the analysis. We will again use the datalad-container extension to register a container in the new dataset:
                
                    datalad containers-add nilearn \
                    --url shub://adswa/nilearn-container:latest \
                    --call-fmt "singularity exec {img} {cmd}"
                
            

...Using published data

Finally, we can run the analysis:
            
                datalad containers-run -m "Compute brain mask" \
                -n nilearn \
                --input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
                --output figures/ \
                --output "sub-02*" \
                "python code/get_brainmask.py"
            
        
Afterwards, we can inspect how specific files came to be, e.g.:
                
                    git log sub-02_brain-mask.nii.gz
                
            
And since the run-record is part of the dataset's git history, we know the provenance. DataLad can use this machine-readable information to rerun the analysis without you having to specify any information again:
                
                    datalad rerun
                
            

Code examples: publishing data


  1. OSF
  2. SURFdrive/Sciebo (webdav)
  3. Dataverse
  4. GitLab

Publishing to OSF

https://osf.io/

datalad-osf-logo

create-sibling-osf

 (docs)
  1. Log into OSF
  2. Create personal access token
  3. Enter credentials using datalad osf-credentials:
            
                datalad osf-credentials
            
        
4. Create the sibling:
            
                datalad create-sibling-osf -d . -s my-osf-sibling \
                --title 'my-osf-project-title' --mode export --public
            
        
5. Push to the sibling:
            
                datalad push -d . --to my-osf-sibling
            
        
6. Clone from the sibling:
            
                cd ..
                datalad clone osf://my-osf-project-id my-osf-clone
            
        

Publishing to SURFdrive

https://www.surf.nl/en/surfdrive-store-and-share-your-files-securely-in-the-cloud

surfdrive-logo

create-sibling-webdav

 (docs)
  1. Log into SURFdrive
  2. Create a new folder, e.g., datalad-test
  3. Copy your WebDAV URL and add the folder name at the end: Menu > Files > Settings > WebDAV
    E.g.: https://surfdrive.surf.nl/files/remote.php/nonshib-webdav/datalad-test
  4. Create the sibling:
            
                cd midterm_project
                datalad create-sibling-webdav \
                -d . \
                -r \
                -s my-webdav-sibling \
                --mode filetree 'my-webdav-url'
            
        
At this point, DataLad should ask for credentials if you have not entered them before. Enter your WebDAV username and password.
5. Push to the sibling:
            
                datalad push -d . --recursive --to my-webdav-sibling
            
        
6. Clone from the sibling:
            
                cd ..
                datalad clone 'datalad-annex::?type=webdav&encryption=none\
                &exporttree=yes&url=my-webdav-url/dataset-name' my-webdav-clone
            
        

Publishing to Dataverse

https://dataverse.org/

dataladdataverse-logo

add-sibling-dataverse

 (docs)
  1. Create an account and log into demo.dataverse.org (or your instance)
  2. Find your API token (Username > API Token)
  3. Create a new Dataverse dataset
  4. Add required metadata and save dataset
  5. Retrieve the dataset DOI and the Dataverse instance URL
  6. Create the sibling:
            
                cd midterm_project
                datalad add-sibling-dataverse -d . -s my-dataverse-sibling \
                'my-dataverse-instance-url' doi:'my-dataset-doi'
            
        
for example:
            
                datalad add-sibling-dataverse -d . -s dataverse \
                https://demo.dataverse.org  doi:10.70122/FK2/3K9FOD
            
        
(DataLad asks for credentials (token) if you haven't entered them before)
7. Push to the sibling:
            
                datalad push -d . --to my-dataverse-sibling
            
        
8. Clone from the sibling:
            
                cd ..
                datalad clone 'datalad-annex::?type=external&externaltype=dataverse\
                &encryption=none&exporttree=no&url=my-dataverse-instance-url\
                &doi=my-dataset-doi' my-dataverse-clone
            
        

Publishing to GitLab

https://gitlab.com/

gitlab-logo

create-sibling-gitlab

 (docs)
  1. Log into GitLab
  2. Create personal access token
  3. Create a top-level group
  4. Create a gitlab config file (replace relevant items)
            
                cat << EOF > ~/.python-gitlab.cfg
                [my-site]
                url = https://gitlab.com/
                private_token = my-gitlab-token
                api_version = 4
                EOF
            
        
5. Configure create-sibling-gitlab in the midterm_project dataset:
            
                datalad configuration set datalad.gitlab-default-site='my-site'
                datalad configuration set datalad.gitlab-'my-site'-project='my-top-level-group'
            
        
6. Create the sibling:
            
                datalad create-sibling-gitlab -d . --recursive -s 'my-gitlab-sibling'
            
        
7. Push to the sibling:
            
                datalad push -d . --recursive --to 'my-gitlab-sibling'
            
        

Extras

Our dataset: Midterm YODA Data Analysis Project

  • DataLad dataset: https://github.com/datalad-handbook/midterm_project
  • Find out more: A Data Analysis Project with DataLad
  • All inputs (i.e. building blocks from other sources) are located in the input/ subdataset
  • Custom code is located in code/
  • Relevant software is included as a software container
  • Outcomes are generated with a provenance-tracked run command, and located in the root of the dataset:
    • prediction_report.csv contains the main classification metrics
    • pairwise_relationships.png is a plot of the relations between features.
            
                [DS~0] ~/midterm_project
                ├── CHANGELOG.md
                ├── README.md
                ├── code/
                │   ├── README.md
                │   └── script.py
                ├── [DS~1] input/
                │   └── iris.csv -> .git/annex/objects/...
                ├── pairwise_relationships.png -> .git/annex/objects/...
                └── prediction_report.csv -> .git/annex/objects/...
            
        

Our dataset: Midterm YODA Data Analysis Project

  • Let's explore the dataset briefly
  • Install the dataset:
                
                    datalad clone \
                    https://github.com/datalad-handbook/midterm_project.git
                
                
    Find out about its subdatasets
                
                    cd midterm_project
                    datalad subdatasets
                
                
    Get some contents
                
                    datalad get input
                
                
    Drop some contents
                
                    datalad drop input
                
                
    Find out about its history
                
                    tig
                
                
    Reproduce an analysis
                
                    datalad rerun HEAD~2