Dutch Open Science Festival 2023

Stephan Heunis
@jsheunis
@datalad

Psychoinformatics lab, Institute of Neuroscience and Medicine, Brain & Behavior (INM-7)
Research Center Jülich, Germany
SYNC lab, Department of Psychology, Education and Child Studies
Erasmus School of Social and Behavioral Sciences, EUR, The Netherlands

Slides: 
jsheunis.github.io/osfestival-nl-datalad

Agenda


14h00:   An overview of DataLad (30 min)
14h30:   Code-along demo: DataLad basics (15 min)
14h45:   break (10 min)
14h55:   Code-along demo: DataLad basics (10 min)
15h10:   Code-along demo: publishing datasets (30 min)

Acknowledgements

DataLad software
& ecosystem
  • Psychoinformatics Lab,
    Research Center Jülich
  • Center for Open
    Neuroscience,
    Dartmouth College
  • Joey Hess (git-annex)
  • >100 additional contributors
Funders
Collaborators

1 An overview of DataLad

What is DataLad?


  • A free and open source tool
  • for decentralized (research) data management
  • with a command line interface, Python API, and graphical user interface
  • allowing exhaustive tracking of the evolution of digital objects
  • and computational provenance tracking
  • to enhance modularity, portability and reproducibility.



Let's explore this...

Everything is to be made FAIR

  • Findable
  • Accessible
  • Interoperable
  • Reusable

https://www.go-fair.org/fair-principles
Wilkinson et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

But the "I" in FAIR is not you

  • F? I already have it, it's right here!
  • A? I am working with it already, I made it!
  • I? With what?
  • R? First let me finish this PhD and then we talk, OK?

Still, someone has to put in the work or nothing will ever be FAIR.

Be FAIR and immediately benefit from it yourself...

  • Version-controlled
  • Actionable metadata
  • Modular
  • Portable

EUDAT B2DROP: data deposition and retrieval

Similar support for Surfdrive, Dataverse, Open Science Framework, S3, ...
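
B2DROP exposes its files via WebDAV, so the same sibling workflow shown later for SURFdrive should apply. A hedged sketch (the URL is illustrative; check your B2DROP settings for the exact WebDAV address):

                datalad create-sibling-webdav -d . -s my-b2drop-sibling \
                --mode filetree 'https://b2drop.eudat.eu/remote.php/webdav/my-folder'
                datalad push -d . --to my-b2drop-sibling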

DataLad Gooey: Convenience for exploration and management

Companion (not competition) for the terminal
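
Assuming the datalad-gooey extension is installed (it is a separate package), the GUI starts from the same terminal used for all other commands:

                datalad gooey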

Metadata entry convenience

DataLad contact and more information

Website + Demos http://datalad.org
Documentation http://handbook.datalad.org
Talks and tutorials https://youtube.com/datalad
Development http://github.com/datalad
Support https://matrix.to/#/#datalad:matrix.org
Open data http://datasets.datalad.org
Mastodon @datalad@fosstodon.org
Twitter @datalad

!!Stay tuned!!

First ever DataLad+git-annex meeting open for all!

2 Code-along demonstration

Practical aspects

  • We'll work in the browser on a cloud server with JupyterHub
  • Cloud-computing environment:
       - jupyterhub.datalad.nl
  • We have pre-installed DataLad and other requirements
  • We will work via the terminal
  • Draw a username, and set a password of your choice when logging in for the first time; remember it!

Using DataLad in the Terminal

Check the installed version:
            
                datalad --version
            
            

For help on using DataLad from the command line:
                
                    datalad --help
                
            
For extensive info about the installed package, its dependencies, and extensions, use wtf:
                
                    datalad wtf
                
            

git identity

Check git identity:
            
                git config --get user.name
                git config --get user.email
            
        
Configure git identity:
                
                    git config --global user.name "Stephan Heunis"
                    git config --global user.email "s.heunis@fz-juelich.de"
                
            

Using datalad via its Python API

Open a Python environment:
            
                ipython
            
        
Import and start using:
                
                    import datalad.api as dl
                    dl.create(path='mydataset')
                
            
Exit the Python environment:
                
                    exit
                
            

2 Code-along demo

2.1 DataLad Basics

DataLad datasets...

...DataLad datasets

Create a dataset (here, with the text2git config):
            
                datalad create -c text2git bids-data
            
        
Let's have a look inside. Navigate using cd (change directory):
                
                    cd bids-data
                
            
List the directory content, including hidden files, with ls:
                
                    ls -la .
                
            

Version control...

...Version control

Let's add some Markdown text to a README file in the dataset
            
                echo "# A BIDS structured dataset for my input data" > README.md
            
        
Now we can check the status of the dataset:
                
                    datalad status
                
            
We can save the state with save
                
                    datalad save -m "Add a short README"
                
            
Further modifications:
                
                    echo "Contains functional task data of one subject" >> README.md
                
            
Save again:
                
                    datalad save -m "Add information on the dataset contents to the README"
                
            
Now, let's check the dataset history:
                
                    git log
                
            

Data consumption & transport...

...Data consumption & transport...

Install a dataset from remote URL (or local path) using clone:
            
                cd ../
                datalad clone \
                https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
            
        
We can now view the cloned dataset's file tree:
                
                    cd studyforrest-data-phase2
                    ls
                
            
Let's check the dataset size (i.e. git repository):
                
                    du -sh # this will print the size of the directory in human readable sizes
                
            
Let's check the actual dataset size (i.e. git repository + annexed content):
                
                    datalad status --annex
                
            
The DataLad dataset is just the git repository, i.e. the metadata of all files in the dataset, plus the content of all files committed directly to git. The actual file content in the annex can be retrieved as needed.
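
To see this concretely (a hedged aside, not part of the scripted demo): annexed files appear as symbolic links into the annex, and the links only resolve once the content has been retrieved:

                # annexed files are symlinks into .git/annex/objects; they dangle until "get" is run
                ls -l sub-01/ses-localizer/func/ | head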

...Data consumption & transport

We can retrieve actual file content with get (here, multiple files):
            
                # get all files of sub-01 for all functional runs of the localizer task
                datalad get \
                sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-*.nii.gz
            
        
If we don't need a file locally anymore, we can drop it:
                
                    # drop a specific file
                    datalad drop \
                    sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz
                
            
And it's no problem if you need that exact file again, just get it:
                
                    # get a specific file
                    datalad get \
                    sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz
                
            
Therefore: no need to store all files locally. Data just needs to be available from at least one location, then you can get what you want when you need it, and drop the rest.
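
A hedged extra: before getting a file, you can ask git-annex where its content is currently available (file name taken from the drop example above):

                git annex whereis \
                sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz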

Dataset nesting...

Datasets can be nested in superdataset-subdataset hierarchies:
  • Helps with scaling (see e.g. the Human Connectome Project dataset)
  • Version control tools struggle with >100k files
  • Modular units improve intuitive structure and reuse potential
  • Versioned linkage of inputs for reproducibility (see the sketch below)
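
Under the hood, nesting uses Git's submodule mechanism. Once subdatasets have been added (as we do on the next slides), the versioned linkage can be inspected with standard Git tooling; a minimal sketch:

                cat .gitmodules          # registered subdataset paths and source URLs
                git submodule status     # the exact commit each subdataset is pinned to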

...Dataset nesting

Let's make a nest! First we navigate into the top-level dataset:
            
                cd ../bids-data
            
        
Then we clone the input dataset into a specific location in the file tree of the existing dataset, making it a subdataset (using the -d/--dataset flag):
                
                    datalad clone --dataset . \
                    https://github.com/datalad/example-dicom-functional.git  \
                    inputs/rawdata
                
            
Similarly, we can clone the analysis container (actually, a set of containers from ReproNim) as a subdataset:
                
                    datalad clone -d . \
                    https://github.com/ReproNim/containers.git \
                    code/containers
                
            
Let's see what changed in the dataset, using the subdatasets command:
                
                    datalad subdatasets
                
            

Computationally reproducible execution...

  • which script/pipeline version
  • was run on which version of the data
  • to produce which version of the results?

...Computationally reproducible execution...

  • The datalad run command can run any command in a way that links the command or script to the results it produces and the data it was computed from (a minimal plain example follows after this list)
  • The datalad rerun command can take this recorded provenance and recompute the command
  • The datalad containers-run command (from the datalad-container extension) can capture software provenance in the form of software containers, in addition to the provenance that datalad run captures
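
A minimal sketch of plain (non-containerized) provenance tracking; the shell command and output file name are illustrative and not part of this demo:

                datalad run -m "Count raw files (illustrative)" \
                --input inputs/rawdata \
                --output file_count.txt \
                "ls inputs/rawdata | wc -l > file_count.txt"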


With the datalad-container extension, we can inspect the list of registered containers (recursively):
                
                    datalad containers-list --recursive
                
            
We'll use the repronim-reproin container for dicom conversion.

...Computationally reproducible execution

Now, let's try out the containers-run command:
            
                datalad containers-run -m "Convert subject 02 to BIDS" \
                --container-name code/containers/repronim-reproin \
                --input inputs/rawdata/dicoms \
                --output sub-02 \
                "-f reproin -s 02 --bids -l '' --minmeta -o . --files inputs/rawdata/dicoms"
            
        
What changed after the containers-run command has completed?
We can use datalad diff (based on git diff):
                
                    datalad diff -f HEAD~1
                
            
We see that some files were added to the dataset!
And we have a complete provenance record as part of the git history:
                
                    git log -n 1
                
            

Publishing datasets...


For example: (see next section)
  1. OSF
  2. SURFdrive (webdav)
  3. Dataverse
  4. GitHub

Using published data...

Let's use our published data in a new analysis, to demonstrate reusability and the usefulness of modularity.

First let's create a new dataset using the yoda principles:
            
                cd ../
                datalad create -c yoda myanalysis
            
        
Then we can clone our GIN-published dataset as a subdataset
(NB: use the browser URL without ".git" suffix):
                
                    cd myanalysis
                    datalad clone -d . \
                    https://gin.g-node.org/your-gin-username/bids-data \
                    input
                
            

...Using published data...

We have data, and now we need an analysis script. We will use DataLad's download-url which gets the content of a script and registers its source:
            
                datalad download-url -m "Download code for brain masking from Github" \
                -O code/get_brainmask.py \
                https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py
            
        
Now we have data and an analysis script, and we still need the correct software environment within which to run the analysis. We will again use the datalad-container extension to register a container with the new dataset:
                
                    datalad containers-add nilearn \
                    --url shub://adswa/nilearn-container:latest \
                    --call-fmt "singularity exec {img} {cmd}"
                
            

...Using published data

Finally, we can run the analysis:
            
                datalad containers-run -m "Compute brain mask" \
                -n nilearn \
                --input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
                --output figures/ \
                --output "sub-02*" \
                "python code/get_brainmask.py"
            
        
Afterwards, we can inspect how specific files came to be, e.g.:
                
                    git log sub-02_brain-mask.nii.gz
                
            
And since the run-record is part of the dataset's git history, we know the provenance. DataLad can use this machine-readable information to rerun the analysis without you having to specify any information again:
                
                    datalad rerun
                
            

2 Code-along demo

2.2 Publishing data


  1. OSF
  2. SURFdrive (webdav)
  3. Dataverse
  4. GitHub

Publishing to OSF

https://osf.io/


create-sibling-osf (docs)
  1. Log into OSF
  2. Create personal access token
  3. Enter credentials using datalad osf-credentials:
            
                datalad osf-credentials
            
        
4. Create the sibling:
            
                datalad create-sibling-osf -d . -s my-osf-sibling \
                --title 'my-osf-project-title' --mode export --public
            
        
5. Push to the sibling:
            
                datalad push -d . --to my-osf-sibling
            
        
6. Clone from the sibling:
            
                cd ..
                datalad clone osf://my-osf-project-id my-osf-clone
            
        

Publishing to SURFdrive

https://www.surf.nl/en/surfdrive-store-and-share-your-files-securely-in-the-cloud


create-sibling-webdav (docs)
  1. Log into SURFdrive
  2. Create a new folder, e.g., datalad-test
  3. Copy your WebDAV URL and add the folder name at the end: Menu > Files > Settings > WebDAV
    E.g.: https://surfdrive.surf.nl/files/remote.php/nonshib-webdav/datalad-test
  4. Create the sibling:
            
                cd midterm_project
                datalad create-sibling-webdav \
                -d . \
                -r \
                -s my-webdav-sibling \
                --mode filetree 'my-webdav-url'
            
        
At this point, DataLad should ask for credentials if you have not entered them before. Enter your SURFdrive username and password.
5. Push to the sibling:
            
                datalad push -d . --recursive --to my-webdav-sibling
            
        
6. Clone from the sibling:
            
                cd ..
                datalad clone \
                'datalad-annex::?type=webdav&encryption=none&exporttree=yes&url=my-webdav-url/dataset-name' \
                my-webdav-clone
            
        

Publishing to Dataverse

https://dataverse.org/


add-sibling-dataverse (docs)
  1. Create an account and log into demo.dataverse.org (or your instance)
  2. Find your API token (Username > API Token)
  3. Create a new Dataverse dataset
  4. Add required metadata and save dataset
  5. Retrieve the dataset DOI and the Dataverse instance URL
  6. Create the sibling:
            
                cd midterm_project
                datalad add-sibling-dataverse -d . -s my-dataverse-sibling \
                'my-dataverse-instance-url' doi:'my-dataset-doi'
            
        
for example:
            
                datalad add-sibling-dataverse -d . -s dataverse \
                https://demo.dataverse.org  doi:10.70122/FK2/3K9FOD
            
        
(DataLad asks for credentials (token) if you haven't entered them before)
7. Push to the sibling:
            
                datalad push -d . --to my-dataverse-sibling
            
        
8. Clone from the sibling:
            
                cd ..
                datalad clone \
                'datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=my-dataverse-instance-url&doi=my-dataset-doi' \
                my-dataverse-clone
            
        

Extras

Our dataset: Midterm YODA Data Analysis Project

  • DataLad dataset: https://github.com/datalad-handbook/midterm_project
  • Find out more: A Data Analysis Project with DataLad
  • All inputs (i.e. building blocks from other sources) are located in the input/ subdataset
  • Custom code is located in code/
  • Relevant software is included as a software container
  • Outcomes are generated with a provenance-tracked run command, and located in the root of the dataset:
    • prediction_report.csv contains the main classification metrics
    • pairwise_relationships.png is a plot of the relationships between features
            
                [DS~0] ~/midterm_project
                ├── CHANGELOG.md
                ├── README.md
                ├── code/
                │   ├── README.md
                │   └── script.py
                ├── [DS~1] input/
                │   └── iris.csv -> .git/annex/objects/...
                ├── pairwise_relationships.png -> .git/annex/objects/...
                └── prediction_report.csv -> .git/annex/objects/...
            
        

Our dataset: Midterm YODA Data Analysis Project

  • Let's explore the dataset briefly
  • Install the dataset:
                
                    datalad clone \
                    https://github.com/datalad-handbook/midterm_project.git
                
                
    Find out about its subdatasets
                
                    cd midterm_project
                    datalad subdatasets
                
                
    Get some contents
                
                    datalad get input
                
                
    Drop some contents
                
                    datalad drop input
                
                
    Find out about its history
                
                    tig
                
                
    Reproduce an analysis
                
                    datalad rerun HEAD~2
                
                

    2a Publishing to GitLab

    https://gitlab.com/


    create-sibling-gitlab (docs)
    1. Log into GitLab
    2. Create personal access token
    3. Create a top-level group
    4. Create a gitlab config file (replace relevant items)
                    
                        cat << EOF > ~/.python-gitlab.cfg
                        [my-site]
                        url = https://gitlab.com/
                        private_token = my-gitlab-token
                        api_version = 4
                        EOF
                    
                
    5. Configure create-sibling-gitlab in the midterm_project dataset:
                    
                        datalad configuration set datalad.gitlab-default-site='my-site'
                        datalad configuration set datalad.gitlab-'my-site'-project='my-top-level-group'
                    
                
    6. Create the sibling:
                    
                        datalad create-sibling-gitlab -d . --recursive -s 'my-gitlab-sibling'
                    
                
    7. Push to the sibling:
                    
                        datalad push -d . --recursive --to 'my-gitlab-sibling'
                    
                

    How do we publish data?

    "Share data like source code"

    • Datasets can be cloned, pushed, and updated from and to local and remote paths, repository hosting services, and external special remotes
    • Examples (see the clone sketch after this list):
      Local path: ../my-projects/experiment_data
      Remote path: myuser@myinstitutes.hcp.system:/home/myuser/my-projects/experiment_data
      Hosting service: git@github.com:myuser/experiment_data.git
      External special remote: osf://my-osf-project-id
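
    A hedged sketch of what cloning from each of these sources looks like (paths and names mirror the examples above):

          datalad clone ../my-projects/experiment_data                 # local path
          datalad clone myuser@myinstitutes.hcp.system:/home/myuser/my-projects/experiment_data
          datalad clone git@github.com:myuser/experiment_data.git      # hosting service
          datalad clone osf://my-osf-project-id                        # special remote (needs datalad-osf)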

    Interoperability

    • DataLad is built to maximize interoperability with a wide range of hosting and storage technologies
    • See the handbook chapter "Third party infrastructure" for walk-throughs of different services


    Publishing datasets

    I have a dataset on my computer. How can I share it, or collaborate on it?

    Glossary

    Sibling (remote)
    Linked clones of a dataset. You can usually update (from) siblings to keep all your siblings in sync (e.g., ongoing data acquisition stored on experiment compute and backed up on cluster and external hard-drive)
    Repository hosting service
    Webservices to host Git repositories, such as GitHub, GitLab, Bitbucket, Gin, ...
    Third-party storage
    Infrastructure (private/commercial/free/...) that can host data. A "special remote" protocol is used to publish or pull data to and from it
    Publishing datasets
    Pushing dataset contents (Git and/or annex) to a sibling using datalad push
    Updating datasets
    Pulling new changes from a sibling using datalad update --merge (see the sketch below)
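
    Putting the last two entries together, a hedged sketch of a simple sibling workflow (the sibling name and URL are placeholders):

          datalad siblings add --name mysibling \
          --url git@git.example.org:myuser/experiment_data.git
          datalad push --to mysibling           # publish local changes
          datalad update --merge -s mysibling   # later: pull changes back in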

    Publishing datasets

    • Most public datasets separate content in Git versus git-annex behind the scenes (see the sketch below)
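
    A hedged way to inspect that split in any clone:

          git annex find --include='*'   # files whose content is managed by git-annex
          git ls-files                   # everything tracked by Git, including annexed symlinks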

    Publishing datasets

    Typical case:
    • Datasets are exposed via a private or public repository on a repository hosting service
    • Data can't be stored in the repository hosting service, but can be kept in almost any third party storage
    • Publication dependencies automate pushing to the correct place, e.g.,
                          
          $ git config --local remote.github.datalad-publish-depends gdrive
          # or
          $ datalad siblings add --name origin --url git@git.jugit.fzj.de:adswa/experiment-data.git --publish-depends s3
                      
                      

    Publishing datasets

    Special case 1: repositories with annex support
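
    GIN is one example: it hosts both the Git history and the annexed data, so a single push publishes everything. A hedged sketch (the repository name is illustrative; requires a GIN account and an API token, which DataLad will prompt for):

          datalad create-sibling-gin bids-data -s gin
          datalad push --to gin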

    Publishing datasets

    Special case 2: Special remotes with repositories