98 changes: 98 additions & 0 deletions datalad/ssh_repo_elm.md
@@ -0,0 +1,98 @@
## 🧠 Datalad for Students: Minimal Reproducible Workflow

### 📦 1. Create a Datalad Dataset for Data (on `elm`)

**On your local machine:**

```bash
datalad create -c text2git image10k-zooniverse
cd image10k-zooniverse
```

**Annex the big files (e.g. CSVs):**

```bash
echo "*.csv annex.largefiles=anything" >> .gitattributes
datalad save -m "Set annex rules for CSVs"
```

**Push data to `elm`:**

```bash
datalad create-sibling \
--name elm \
--site datalad \
Review comment (Member):
Not sure what the `--site` option is. It is not in the docs, so it may be a GPT hallucination.

Reply (Member Author):
Indeed. I used ChatGPT to summarize the steps, and it got this one wrong. Apologies, I should have double-checked. I believe this is what I actually ran, according to my bash history: `datalad create-sibling -s elm ssh://elm/data/simexp/pbellec/image10k-zooniverse --existing=skip`

--sshurl ssh://elm/data/simexp/pbellec/image10k-zooniverse \
--shared all
Review comment (Member):
Do we want datasets to be writable by the group and readable by all?

Reply (Member Author):
Good point. I was thinking of letting people deal with permissions in their own space. Then, if we want to publish the dataset, we could either upload a version of it to Zooniverse (for open data) or create a new sibling on S3. Are you suggesting we use a single lab folder that hosts all datalad datasets?


datalad push --to elm --data anything
```
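
Following the review thread on the `create-sibling` call, a corrected version of this step might look like the sketch below. It reuses the command the author reports from their bash history (`-s` for the sibling name, the SSH URL as a positional argument, `--existing=skip`) and wraps it in a hypothetical `publish_to_elm` function so the URL is defined once. It assumes `elm` is a reachable SSH host alias. No `--shared` option is set, because the permissions question raised in the review is still open.

```bash
# Sketch only: assumes datalad is installed and `elm` resolves as an SSH host.
ELM_URL="ssh://elm/data/simexp/pbellec/image10k-zooniverse"

# Hypothetical helper: run it from the root of the dataset.
publish_to_elm() {
    # Create the sibling on elm; --existing=skip leaves an existing one untouched.
    datalad create-sibling -s elm "$ELM_URL" --existing=skip || return 1
    # Push Git history plus all annexed file content to elm.
    datalad push --to elm --data anything
}
```

Calling `publish_to_elm` inside `image10k-zooniverse` runs both steps in order and stops if the sibling cannot be created.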

**Push Git-only metadata to GitHub (optional):**

```bash
datalad create-sibling-github courtois-neuromod image10k-zooniverse \
--github-organization courtois-neuromod \
--access-protocol ssh
Review comment (Member):
Suggested change
- datalad create-sibling-github courtois-neuromod image10k-zooniverse \
-   --github-organization courtois-neuromod \
-   --access-protocol ssh
+ datalad create-sibling-github courtois-neuromod/image10k-zooniverse \
+   --access-protocol ssh

This fixes the deprecation.

It requires a personal access token with adequate permissions to create repos for that org/user.

Reply (Member Author):
Indeed. I'm going to describe the method where the repo is created manually on GitHub and then added as a sibling.

Reply (Member Author):
Also, I tried the syntax you suggested for the org name, but it does not seem to work with the datalad version I get through pip on my machine, so I had to use the soon-to-be-obsolete `--github-organization` flag.


datalad push --to origin
```
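
The author's reply in the thread above mentions an alternative route in which the GitHub repository is created by hand and then registered as a sibling. A minimal sketch of that route follows. It assumes an empty repository `courtois-neuromod/image10k-zooniverse` already exists on GitHub and that SSH access to GitHub is configured; the function name `add_github_sibling` is hypothetical.

```bash
# Sketch only: assumes the GitHub repository was already created manually.
GITHUB_URL="git@github.com:courtois-neuromod/image10k-zooniverse.git"

# Hypothetical helper: run it from the root of the dataset.
add_github_sibling() {
    # Register the existing GitHub repository as the sibling named `origin`.
    datalad siblings add --name origin --url "$GITHUB_URL" || return 1
    # Push Git history only; annexed file content stays on elm.
    datalad push --to origin --data nothing
}
```

This avoids `create-sibling-github` entirely, so no personal access token with repository-creation rights is needed.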

---

### 👩‍💻 2. For Students: Install and Use

**Clone the dataset from GitHub or `elm`:**

```bash
# Option A: from GitHub (metadata only)
datalad install git@github.com:courtois-neuromod/image10k-zooniverse.git

# Option B: from elm (knows about the data)
datalad install ssh://elm/data/simexp/pbellec/image10k-zooniverse
```

**Navigate and get data:**

```bash
cd image10k-zooniverse
datalad get Zooniverse_Results_2022_01_28.csv
```

---

### 🖼 3. Managing Outputs (Optional)

**Create a separate dataset for outputs:**

```bash
datalad create image10k-zooniverse.plots
cd image10k-zooniverse.plots

echo "*.png annex.largefiles=anything" >> .gitattributes
datalad save -m "Track plots in annex"
```

**Link it back into the analysis repo:**

```bash
cd ../image10k-zooniverse
datalad install -d . -s ../image10k-zooniverse.plots plots
```

---

### ⚠️ Tips & Troubleshooting

* If `datalad get` fails with `annex-ignore`, you likely cloned from GitHub only. Clone once from `elm` to propagate sibling config.
Review comment (Member):
Using `--as-common-datasrc NAME` (see above) would fix that. Alternatively, set the created sibling to autoenable afterward: `git-annex configremote elm autoenable=true`.

Reply (Member Author):
This is a point I'm still struggling with! I could not get it to work so that installing from GitHub would download from elm. So if I add `--as-common-datasrc` when I create the elm sibling, should that fix it? Or does that configuration stay local?

Reply (Member Author):
OK, I experimented a bit and could not get it to work. I tried removing the elm sibling and then adding it back with `datalad siblings add --name elm --url ssh://elm/data/simexp/pbellec/image10k-zooniverse --as-common-datasrc origin`, and got this error:
add-sibling(impossible): . (sibling) [cannot configure as a common data source, URL protocol is not http or https] .: elm(+) [ssh://elm/data/simexp/pbellec/image10k-zooniverse (git)]

* To inspect siblings:

```bash
datalad siblings
```

* To pull subdataset updates:

```bash
datalad update --merge --recursive
```
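
For the first tip, one workaround consistent with the review thread is to register `elm` by hand in a clone that came from GitHub, so that `datalad get` has a location to fetch file content from. The sketch below assumes the reader has SSH access to `elm` and that the dataset lives at the path used earlier in this file. `add_elm_to_clone` is a hypothetical helper, and this approach has not been verified against the lab setup.

```bash
# Sketch only: assumes SSH access to `elm` and the dataset path used above.
ELM_URL="ssh://elm/data/simexp/pbellec/image10k-zooniverse"

# Hypothetical helper: run it inside a clone obtained from GitHub.
add_elm_to_clone() {
    # Add elm as a sibling of this clone and fetch from it right away.
    datalad siblings add --name elm --url "$ELM_URL" --fetch || return 1
    # List all siblings to confirm that elm is now known to the clone.
    datalad siblings
}
```

After `add_elm_to_clone` succeeds, retrying `datalad get Zooniverse_Results_2022_01_28.csv` should be able to look for the file on `elm`.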