Psychoinformatics lab,
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7)
Research Center Jülich, Germany
Why research data management?
"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete ... set of instructions [and data] which generated the figures." David Donoho, 1998
It's complicated...
This makes data harder to access, structure, and share
70 independent research groups, investigating 9 hypotheses, on the same data: Consistent conclusions for four hypotheses
The variety of methodological & analytical choices is not the enemy of computational reproducibility; the challenge lies in encoding those degrees of freedom in a standardized, ideally machine-readable way (Gilmore et al., 2017)
Digital Provenance = A complete description of how a digital file came to be (FAIR principles)
computational reproducibility
An overview of DataLad
Use cases
DataLad contact and more information
Website + Demos | http://datalad.org |
Documentation | http://handbook.datalad.org |
Talks and tutorials | https://youtube.com/datalad |
Development | http://github.com/datalad |
Support | https://matrix.to/#/#datalad:matrix.org |
Open data | http://datasets.datalad.org |
Mastodon | @datalad@fosstodon.org |
@datalad |
Code examples: Basics
datalad --version
datalad --help
wtf:
datalad wtf
git config --get user.name
git config --get user.email
git config --global user.name "Stephan Heunis"
git config --global user.email "s.heunis@fz-juelich.de"
ipython
import datalad.api as dl
dl.create(path='mydataset')
exit
Create a dataset (with the text2git config):
datalad create -c text2git bids-data
cd (change directory):
cd bids-data
ls:
ls -la .
echo "# A BIDS structured dataset for my input data" > README.md
status of the dataset:
datalad status
save
datalad save -m "Add a short README"
echo "Contains functional task data of one subject" >> README.md
datalad save -m "Add information on the dataset contents to the README"
git log
clone:
cd ../
datalad clone \
https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
cd studyforrest-data-phase2
ls
du -sh # this will print the size of the directory in human readable sizes
datalad status --annex
get (here, multiple files):
# get all files of sub-01 for all functional runs of the localizer task
datalad get \
sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-*.nii.gz
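The wildcard in the command above is expanded by the shell, not by DataLad. This works on a fresh clone because annexed files are already present in the working tree as symlinks, even though their content has not been retrieved. The following is a minimal sketch of that expansion under stated assumptions: it runs in a throwaway directory, the file names and symlink targets are made-up placeholders, and no DataLad command is called.

```shell
#!/bin/sh
set -eu

# Build a throwaway directory that mimics a freshly cloned dataset:
# every annexed file is a symlink whose target does not exist yet.
# All names below are hypothetical placeholders.
demo=$(mktemp -d)
func="$demo/sub-01/ses-localizer/func"
mkdir -p "$func"
for name in \
    sub-01_ses-localizer_task-objectcategories_run-1_bold.nii.gz \
    sub-01_ses-localizer_task-objectcategories_run-2_bold.nii.gz \
    sub-01_ses-localizer_task-objectcategories_run-3_bold.nii.gz \
    sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz \
    sub-01_ses-localizer_task-movielocalizer_run-1_bold.nii.gz
do
    ln -s "../../../.git/annex/objects/placeholder-$name" "$func/$name"
done

cd "$demo"
# The shell expands the wildcard before the command runs, so this is the
# argument list a command called with the same pattern would receive.
set -- sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-*.nii.gz
matched=$#
printf '%s\n' "$@"
echo "matched paths: $matched"
```

Only the four objectcategories runs match; the file from the other (hypothetical) task is left out, so a get call with this pattern would not retrieve it.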
drop it:
# drop a specific file
datalad drop \
sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz
get it:
# get a specific file
datalad get \
sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz
get what you want when you need it, and drop the rest.
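To see whether a file's content is currently present, one option is to test whether its symlink resolves. The following is a minimal sketch, assuming the default setup in which annexed files are symlinks into the annex (unlocked files or filesystems without symlink support behave differently). The helper name has_content and all file names are hypothetical, and the demo uses mock symlinks in a temporary directory instead of a real dataset.

```shell
#!/bin/sh
set -eu

# has_content PATH: succeed if PATH resolves to real file content.
# `test -e` follows symlinks, so a dangling symlink counts as "not present".
has_content() {
    [ -e "$1" ]
}

demo=$(mktemp -d)
cd "$demo"
mkdir -p store
echo "pretend image data" > store/object-a   # stands in for retrieved content
ln -s store/object-a run-1_bold.nii.gz       # resembles a file after a get
ln -s store/object-b run-4_bold.nii.gz       # resembles a file after a drop

for f in run-1_bold.nii.gz run-4_bold.nii.gz; do
    if has_content "$f"; then
        echo "$f: content present"
    else
        echo "$f: content not present (symlink only)"
    fi
done
```

In both cases the file name stays visible in the directory listing; only the availability of the content changes.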
cd ../bids-data
Clone as a subdataset (with the -d/--dataset flag):
datalad clone --dataset . \
https://github.com/datalad/example-dicom-functional.git \
inputs/rawdata
datalad clone -d . \
https://github.com/ReproNim/containers.git \
code/containers
subdatasets command:
datalad subdatasets
datalad run can run any command in a way that links the command or script to the results it produces and the data it was computed from
datalad rerun can take this recorded provenance and recompute the command
datalad containers-run (from the datalad-container extension) can capture software provenance in the form of software containers in addition to the provenance that datalad run captures
With the datalad-container extension, we can inspect the list of registered containers (recursively):
datalad containers-list --recursive
repronim-reproin container for DICOM conversion.
containers-run command:
datalad containers-run -m "Convert subject 02 to BIDS" \
--container-name code/containers/repronim-reproin \
--input inputs/rawdata/dicoms \
--output sub-02 \
"-f reproin -s 02 --bids -l '' --minmeta -o . --files inputs/rawdata/dicoms"
containers-run command has completed?
datalad diff (based on git diff):
datalad diff -f HEAD~1
git log -n 1
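The provenance recorded by datalad run and datalad containers-run is stored in the commit message, which is why the last log entry is worth inspecting here. The following is a minimal sketch, assuming the run record is a JSON block placed between two marker lines in the commit message. The sample message is made up and abbreviated, and the extraction uses plain sed rather than any DataLad command.

```shell
#!/bin/sh
set -eu

# A made-up, abbreviated commit message in the assumed run-record layout.
msg=$(cat <<'EOF'
[DATALAD RUNCMD] Convert subject 02 to BIDS

=== Do not change lines below ===
{
 "cmd": "example-command --flag",
 "exit": 0,
 "inputs": ["inputs/rawdata/dicoms"],
 "outputs": ["sub-02"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
EOF
)

# Keep only the lines from the first marker to the second marker,
# then drop the two marker lines themselves.
record=$(printf '%s\n' "$msg" \
    | sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' \
    | sed '1d;$d')

printf '%s\n' "$record"
```

In a real dataset, the same sed pipeline could be fed from `git log -n 1 --format=%B` instead of the sample text.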
cd ../
datalad create -c yoda myanalysis
cd myanalysis
datalad clone -d . \
https://gin.g-node.org/your-gin-username/bids-data \
input
download-url which gets the content of a script and registers its source:
datalad download-url -m "Download code for brain masking from Github" \
-O code/get_brainmask.py \
https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py
datalad-container extension to register a container to the new dataset:
datalad containers-add nilearn \
--url shub://adswa/nilearn-container:latest \
--call-fmt "singularity exec {img} {cmd}"
datalad containers-run -m "Compute brain mask" \
-n nilearn \
--input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
--output figures/ \
--output "sub-02*" \
"python code/get_brainmask.py"
git log sub-02_brain-mask.nii.gz
datalad rerun
Code examples: publishing data
Publishing to OSF
datalad osf-credentials:
datalad osf-credentials
datalad create-sibling-osf -d . -s my-osf-sibling \
--title 'my-osf-project-title' --mode export --public
datalad push -d . --to my-osf-sibling
cd ..
datalad clone osf://my-osf-project-id my-osf-clone
Publishing to SURFdrive
datalad-test
https://surfdrive.surf.nl/files/remote.php/nonshib-webdav/datalad-test
cd midterm_project
datalad create-sibling-webdav \
-d . \
-r \
-s my-webdav-sibling \
--mode filetree 'my-webdav-url'
At this point, DataLad should ask for credentials if you have not entered them before. Enter the username and password of your WebDAV account (here, SURFdrive).
datalad push -d . --recursive --to my-webdav-sibling
cd ..
datalad clone 'datalad-annex::?type=webdav&encryption=none&exporttree=yes&url=my-webdav-url/dataset-name' my-webdav-clone
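The datalad-annex:: clone URL is long, and it has to reach DataLad as one unbroken string. Inside single quotes the shell keeps a backslash and the newline after it literally, so a line break within the quoted URL would become part of the URL. One way to keep lines short is to assemble the URL in a variable first. The following is a minimal sketch using the placeholder values from the command above; the variable names are hypothetical.

```shell
#!/bin/sh
set -eu

# Placeholder copied from the clone command; replace it with a real value.
webdav_url='my-webdav-url/dataset-name'

# Append one query parameter per line, so that no line break
# can end up inside the final URL.
clone_url='datalad-annex::?type=webdav'
clone_url="${clone_url}&encryption=none"
clone_url="${clone_url}&exporttree=yes"
clone_url="${clone_url}&url=${webdav_url}"

echo "$clone_url"
# The clone call would then be:
#   datalad clone "$clone_url" my-webdav-clone
```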
Publishing to Dataverse
cd midterm_project
datalad add-sibling-dataverse -d . -s my-dataverse-sibling \
'my-dataverse-instance-url' doi:'my-dataset-doi'
for example:
datalad add-sibling-dataverse -d . -s dataverse \
https://demo.dataverse.org doi:10.70122/FK2/3K9FOD
(DataLad asks for credentials (token) if you haven't entered them before)
datalad push -d . --to my-dataverse-sibling
cd ..
datalad clone 'datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=my-dataverse-instance-url&doi=my-dataset-doi' my-dataverse-clone
Publishing to GitLab
cat << EOF > ~/.python-gitlab.cfg
[my-site]
url = https://gitlab.com/
private_token = my-gitlab-token
api_version = 4
EOF
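Because this configuration file contains a private token, it is worth creating it so that only its owner can read it. The following is a minimal sketch of one way to do that with a restrictive umask. It writes to a scratch location (a hypothetical path) rather than to ~/.python-gitlab.cfg, and the token value is the placeholder from above.

```shell
#!/bin/sh
set -eu

# Write to a scratch location here; the real file would be ~/.python-gitlab.cfg.
cfg_dir=$(mktemp -d)
cfg="$cfg_dir/python-gitlab.cfg"

# Inside the subshell, the umask makes newly created files
# readable and writable by their owner only.
(
    umask 077
    cat > "$cfg" <<'EOF'
[my-site]
url = https://gitlab.com/
private_token = my-gitlab-token
api_version = 4
EOF
)

# Show the resulting permissions.
ls -l "$cfg"
```

For a file that already exists, `chmod 600` on it has the same effect on its permissions.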
create-sibling-gitlab in the midterm_project dataset:
datalad configuration set datalad.gitlab-default-site='my-site'
datalad configuration set datalad.gitlab-'my-site'-project='my-top-level-group'
datalad create-sibling-gitlab -d . --recursive -s 'my-gitlab-sibling'
datalad push -d . --recursive --to 'my-gitlab-sibling'
Extras
input/ subdataset
code/
prediction_report.csv contains the main classification metrics
output/pairwise_relationships.png is a plot of the relations between features.
[DS~0] ~/midterm_project
├── CHANGELOG.md
├── README.md
├── code/
│ ├── README.md
│ └── script.py
├── [DS~1] input/
│ └── iris.csv -> .git/annex/objects/...
├── pairwise_relationships.png -> .git/annex/objects/...
└── prediction_report.csv -> .git/annex/objects/...
datalad clone \
https://github.com/datalad-handbook/midterm_project.git
cd midterm_project
datalad subdatasets
datalad get input
datalad drop input
tig
datalad rerun HEAD~2