Add Seqera NIO filesystem for datasets and refactor TowerClient/TowerObserver split #6946
Open. jorgee wants to merge 17 commits into master from 260310-seqera-dataset-fs
- bbad0e0 first seqera fs implementation (jorgee)
- b1d0620 removed unnecessary logs (jorgee)
- ca9eb97 fixing incorrect paths when filename and empty seqera path (jorgee)
- 931960e fix issues (jorgee)
- ba46420 Refactor tower observer client (jorgee)
- 38420e1 add integration test (jorgee)
- 76cf74c Merge remote-tracking branch 'origin/master' into 260310-seqera-datas… (jorgee)
- a44e197 Merge remote-tracking branch 'origin/master' into 260310-seqera-datas… (jorgee)
- 6360d84 Merge remote-tracking branch 'origin/master' into 260310-seqera-datas… (jorgee)
- 164251e add missing merge chnages (jorgee)
- 9932883 merge TowerCommonAPI in TowerConfig (jorgee)
- c75686c Merge branch 'master' into 260310-seqera-dataset-fs (jorgee)
- 3520d66 add adr (jorgee)
- 9b5d99a address review comments (jorgee)
- a59f2d7 Apply suggestions from code review (jorgee)
- 3ee78aa Merge branch 'master' into 260310-seqera-dataset-fs (jorgee)
- d978811 review changes (jorgee)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,136 @@ | ||
| # NIO Filesystem for Seqera Platform Datasets | ||
|
|
||
| - Authors: Jorge Ejarque | ||
| - Status: draft | ||
| - Date: 2026-03-10 | ||
| - Tags: nio, filesystem, seqera, datasets, nf-tower | ||
|
|
||
| Technical Story: Enable Nextflow pipelines to read Seqera Platform datasets as ordinary file paths using `seqera://` URIs. | ||
|
|
||
| ## Summary | ||
|
|
||
| Add a Java NIO `FileSystemProvider` to the `nf-tower` plugin that registers the `seqera://` scheme, allowing pipelines to reference Seqera Platform datasets (CSV/TSV) as standard file paths without manual download steps. The implementation reuses the existing `TowerClient` for all HTTP communication, inheriting authentication and retry behaviour. | ||
|
|
||
| ## Problem Statement | ||
|
|
||
| Nextflow users managing datasets on the Seqera Platform must currently download dataset files manually or through custom scripts before referencing them in pipelines. There is no native integration between Nextflow's file abstraction and the Seqera Platform dataset API. This creates friction in workflows where datasets are the primary input and forces users to handle authentication, versioning, and file staging outside the pipeline definition. | ||
|
|
||
| ## Goals or Decision Drivers | ||
|
|
||
| - Transparent access to Seqera Platform datasets using standard Nextflow file path syntax | ||
| - Reuse of existing nf-tower plugin infrastructure (authentication, HTTP client, retry/backoff) | ||
| - Hierarchical path browsing matching the platform's org/workspace/dataset structure | ||
| - Extensible architecture that can support future Seqera-managed resource types (e.g. data-links) | ||
| - No new plugin or module — feature lives within nf-tower | ||
|
|
||
| ## Non-goals | ||
|
|
||
| - Streaming large datasets — the Platform API does not support streaming; content is fully buffered on download | ||
| - Implementing resource types beyond `datasets` — only the extensible architecture is required | ||
| - Local caching across pipeline runs — Nextflow's standard task staging handles caching | ||
| - Dataset management operations (delete, rename) — the filesystem is read-only in the initial implementation | ||
|
|
||
| ## Considered Options | ||
|
|
||
| ### Option 1: Standalone plugin with dedicated HTTP client | ||
|
|
||
| A new `nf-seqera-fs` plugin with its own HTTP client configuration and authentication setup. | ||
|
|
||
| - Good, because it isolates the filesystem code from the nf-tower plugin | ||
| - Bad, because it duplicates authentication configuration and HTTP client setup | ||
| - Bad, because two separate HTTP clients sharing a refresh token would corrupt each other's auth state | ||
|
|
||
| ### Option 2: NIO filesystem within nf-tower using TowerClient delegation | ||
|
|
||
| Add the filesystem to nf-tower, delegating all HTTP through the existing `TowerClient` singleton via a typed `SeqeraDatasetClient` wrapper. | ||
|
|
||
| - Good, because it shares authentication and token refresh with TowerClient | ||
| - Good, because it reuses existing retry/backoff configuration | ||
| - Good, because no new dependencies are needed | ||
|
|
||
| ### Option 3: Direct HxClient usage within nf-tower | ||
|
|
||
| Add the filesystem to nf-tower but use `HxClient` directly rather than going through TowerClient. | ||
|
|
||
| - Good, because it gives full control over request construction | ||
| - Bad, because exposing HxClient internals couples the filesystem to implementation details | ||
| - Bad, because token refresh coordination with TowerClient becomes manual | ||
|
|
||
| ## Solution or decision outcome | ||
|
|
||
| Option 2 — NIO filesystem within nf-tower using TowerClient delegation. All HTTP calls go through `TowerClient.sendApiRequest()`, ensuring a single point of authentication and retry logic. | ||
|
|
||
| ## Rationale & discussion | ||
|
|
||
| ### Path Hierarchy | ||
|
|
||
| The `seqera://` path encodes the Platform's organizational structure directly: | ||
|
|
||
| ``` | ||
| seqera:// → ROOT (directory, depth 0) | ||
| └── <org>/ → ORGANIZATION (directory, depth 1) | ||
| └── <workspace>/ → WORKSPACE (directory, depth 2) | ||
| └── datasets/ → RESOURCE TYPE (directory, depth 3) | ||
| └── <name>[@<version>] → DATASET (file, depth 4) | ||
| ``` | ||
|
|
||
| Each level is a directory except the leaf dataset, which is a file. Version pinning uses an `@version` suffix on the dataset name segment (e.g. `seqera://acme/research/datasets/samples@2`). Without it, the latest non-disabled version is resolved. | ||
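To illustrate the hierarchy described above, here is a minimal sketch of how a `seqera://` URI could be split into its segments and an optional version. `SeqeraPathSketch` and its methods are hypothetical names, not the plugin's actual `SeqeraPath`. Treating the last `@` in the leaf segment as the version separator is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for seqera:// URIs; not the plugin's SeqeraPath. */
final class SeqeraPathSketch {
    static final String PREFIX = "seqera://";

    final List<String> segments; // up to: org, workspace, resource type, dataset name
    final String version;        // null when no @version suffix is present

    private SeqeraPathSketch(List<String> segments, String version) {
        this.segments = segments;
        this.version = version;
    }

    static SeqeraPathSketch parse(String uri) {
        if (!uri.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a seqera:// URI: " + uri);
        }
        List<String> parts = new ArrayList<>();
        for (String s : uri.substring(PREFIX.length()).split("/")) {
            if (!s.isEmpty()) {
                parts.add(s); // ignore empty segments such as a trailing slash
            }
        }
        if (parts.size() > 4) {
            throw new IllegalArgumentException("Too many segments: " + uri);
        }
        String version = null;
        if (parts.size() == 4) {
            // Assumption: the last '@' in the leaf separates name and version.
            String leaf = parts.get(3);
            int at = leaf.lastIndexOf('@');
            if (at > 0 && at < leaf.length() - 1) {
                version = leaf.substring(at + 1);
                parts.set(3, leaf.substring(0, at));
            }
        }
        return new SeqeraPathSketch(parts, version);
    }

    /** Depth 0 is the root; depth 4 is a dataset. */
    int depth() {
        return segments.size();
    }

    /** Every level above the dataset leaf is a directory. */
    boolean isDirectory() {
        return depth() < 4;
    }
}
```

With this sketch, `SeqeraPathSketch.parse("seqera://acme/research/datasets/samples@2")` yields depth 4, dataset name `samples`, and version `2`, while `seqera://acme/research` is a directory at depth 2.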
|
|
||
| ### Name-to-ID Resolution | ||
|
|
||
| The path uses human-readable names but the Platform API requires numeric IDs. Resolution is built from two API calls at filesystem initialization: | ||
|
|
||
| 1. `GET /user-info` → obtain `userId` | ||
| 2. `GET /user/{userId}/workspaces` → returns all accessible org/workspace pairs | ||
|
|
||
| The workspace listing from the second call provides both the directory listing content and the name→ID mapping. Results are cached in `SeqeraFileSystem` with invalidation on write operations. `GET /orgs` is intentionally not used because it returns every organization on the Platform rather than only those the user belongs to. | ||
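The name-to-ID resolution described above could be sketched as an in-memory index built once from the workspace listing. `WorkspaceRow` and `NameIndexSketch` are hypothetical stand-ins; the field names are assumptions and do not mirror `OrgAndWorkspaceDto`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical row of the workspace listing; field names are assumptions. */
final class WorkspaceRow {
    final String orgName;
    final long orgId;
    final String workspaceName;
    final long workspaceId;

    WorkspaceRow(String orgName, long orgId, String workspaceName, long workspaceId) {
        this.orgName = orgName;
        this.orgId = orgId;
        this.workspaceName = workspaceName;
        this.workspaceId = workspaceId;
    }
}

/** Hypothetical index serving both directory listings and name-to-ID lookups. */
final class NameIndexSketch {
    // org name -> org ID, in listing order
    private final Map<String, Long> orgIds = new LinkedHashMap<>();
    // org name -> (workspace name -> workspace ID)
    private final Map<String, Map<String, Long>> workspaceIds = new LinkedHashMap<>();

    NameIndexSketch(List<WorkspaceRow> rows) {
        for (WorkspaceRow r : rows) {
            orgIds.put(r.orgName, r.orgId);
            workspaceIds.computeIfAbsent(r.orgName, k -> new LinkedHashMap<>())
                        .put(r.workspaceName, r.workspaceId);
        }
    }

    /** Directory listing of the root: one entry per organization. */
    List<String> listOrgs() {
        return new ArrayList<>(orgIds.keySet());
    }

    /** Directory listing of an organization: one entry per workspace. */
    List<String> listWorkspaces(String org) {
        Map<String, Long> ws = workspaceIds.get(org);
        if (ws == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(ws.keySet());
    }

    /** Returns the workspace ID, or null when the pair is unknown. */
    Long workspaceId(String org, String workspace) {
        Map<String, Long> ws = workspaceIds.get(org);
        return ws == null ? null : ws.get(workspace);
    }
}
```

A single pass over the listing fills both maps, so listing a directory and resolving a name never need a second API call until the cache is invalidated.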
|
|
||
| ### Component Structure | ||
|
|
||
| ``` | ||
| plugins/nf-tower/src/main/io/seqera/tower/plugin/ | ||
| ├── fs/ ← NIO layer | ||
| │ ├── SeqeraFileSystemProvider ← FileSystemProvider (scheme: "seqera") | ||
| │ ├── SeqeraFileSystem ← FileSystem with org/workspace/dataset caches | ||
| │ ├── SeqeraPath ← Path implementation (depth 0–4) | ||
| │ ├── SeqeraFileAttributes ← BasicFileAttributes | ||
| │ ├── SeqeraPathFactory ← PF4J FileSystemPathFactory extension | ||
| │ └── DatasetInputStream ← SeekableByteChannel over InputStream | ||
| ├── dataset/ ← API client layer | ||
| │ ├── SeqeraDatasetClient ← Typed HTTP client wrapping TowerClient | ||
| │ ├── DatasetDto ← Dataset API response model | ||
| │ ├── DatasetVersionDto ← Version API response model | ||
| │ ├── OrgAndWorkspaceDto ← Org/workspace list model | ||
| │ └── WorkspaceOrgDto ← Workspace/org mapping model | ||
| └── resources/META-INF/services/ | ||
| └── java.nio.file.spi.FileSystemProvider | ||
| ``` | ||
|
|
||
| ### Key Design Decisions | ||
|
|
||
| 1. **TowerClient delegation**: `SeqeraDatasetClient` delegates all HTTP through `TowerFactory.client()` → `TowerClient.sendApiRequest()`. This ensures shared authentication state and avoids the token refresh corruption that would occur with separate HTTP client instances. | ||
|
|
||
| 2. **One filesystem per JVM**: `SeqeraFileSystemProvider` maintains a single `SeqeraFileSystem` keyed by scheme. This matches the `TowerClient` singleton-per-session pattern. | ||
|
|
||
| 3. **Read-only initial scope**: The filesystem reports `isReadOnly()=true`. Write support (dataset upload via multipart POST) is deferred to a future iteration. | ||
|
|
||
| 4. **Download filename constraint**: The Platform API's download endpoint (`GET /datasets/{id}/v/{version}/n/{fileName}`) requires the exact filename from upload time. The implementation always resolves `DatasetVersionDto.fileName` from `GET /datasets/{id}/versions` before constructing the download URL. | ||
|
|
||
| 5. **Extensible resource types**: The path hierarchy reserves depth 3 for a resource type segment (currently only `datasets`). Adding support for data-links or other resource types requires only a new handler at the directory listing and I/O layers, with no changes to path resolution or authentication. | ||
|
|
||
| 6. **Thread safety**: `SeqeraFileSystem` cache methods and `SeqeraFileSystemProvider` lifecycle methods are `synchronized`. The filesystem map uses `LinkedHashMap` with external synchronization rather than `ConcurrentHashMap`, matching the low-contention access pattern. | ||
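As a rough illustration of decision 4 combined with the version rule from the path hierarchy, the sketch below picks a version and builds the download path from the endpoint template quoted above. `VersionInfo` and `DownloadPathSketch` are hypothetical; only `fileName` is confirmed as a field of `DatasetVersionDto`, so the `version` and `disabled` fields are assumptions. Honouring a pinned version even when it is disabled and percent-encoding the filename are also assumptions.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for a dataset version record; fields are assumptions. */
final class VersionInfo {
    final long version;
    final String fileName;
    final boolean disabled;

    VersionInfo(long version, String fileName, boolean disabled) {
        this.version = version;
        this.fileName = fileName;
        this.disabled = disabled;
    }
}

final class DownloadPathSketch {
    /**
     * Picks the pinned version when one is given, otherwise the highest
     * non-disabled version. Returns empty when nothing matches.
     */
    static Optional<VersionInfo> resolve(List<VersionInfo> versions, Long pinned) {
        return versions.stream()
                .filter(v -> pinned != null ? v.version == pinned : !v.disabled)
                .max(Comparator.comparingLong(v -> v.version));
    }

    /** Builds /datasets/{id}/v/{version}/n/{fileName} with the upload-time filename. */
    static String downloadPath(String datasetId, VersionInfo v) {
        // URLEncoder targets form encoding, so '+' is rewritten to '%20' for use in a path.
        String encoded = URLEncoder.encode(v.fileName, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return "/datasets/" + datasetId + "/v/" + v.version + "/n/" + encoded;
    }
}
```

The point of the sketch is the ordering: the version record is resolved first, and only then is the URL built, because the filename segment cannot be derived from the dataset name in the `seqera://` path.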
|
|
||
| ### Limitations | ||
|
|
||
| - **No size metadata**: `SeqeraFileAttributes.size()` returns 0 for all paths because the Platform API does not expose content length in dataset metadata. | ||
| - **Single endpoint per JVM**: The filesystem key is scheme-only; concurrent access to different Platform endpoints in the same JVM is not supported. | ||
|
|
||
| ### Streaming Downloads | ||
|
|
||
| Dataset downloads use `TowerClient.sendStreamingRequest()` which calls `HxClient.sendAsStream()` — the response body is returned as an `InputStream` streamed directly from the HTTP connection. This avoids the triple-buffering problem (`String` → `getBytes()` → `ByteArrayInputStream`) that would otherwise consume ~40 MB heap per 10 MB dataset. The `HxClient.sendAsStream()` method goes through the same `sendWithRetry()` path as `sendAsString()`, so retry logic and token refresh are preserved. | ||
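The difference from the buffered approach can be sketched with plain JDK classes: the consumer reads the response `InputStream` through a small fixed-size buffer, so heap use does not grow with the dataset size. `StreamingSketch` is a hypothetical helper and does not reproduce `DatasetInputStream` or any `HxClient` API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

final class StreamingSketch {
    /**
     * Drains the stream through a fixed-size buffer and returns the number
     * of bytes read. Memory use is bounded by bufferSize, not by the body size.
     */
    static long drain(InputStream body, int bufferSize) {
        ByteBuffer buf = ByteBuffer.allocate(bufferSize);
        long total = 0;
        try (ReadableByteChannel ch = Channels.newChannel(body)) {
            int n;
            while ((n = ch.read(buf)) != -1) {
                total += n;
                buf.clear(); // reuse the same buffer for the next chunk
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

A real consumer would process each chunk before calling `buf.clear()`; the sketch only counts bytes to keep the bounded-memory pattern visible.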
|
|
||
| ## Links | ||
|
|
||
| - [Spec](../specs/260310-seqera-dataset-fs/spec.md) | ||
| - [Implementation plan](../specs/260310-seqera-dataset-fs/plan.md) | ||
| - [Data model](../specs/260310-seqera-dataset-fs/data-model.md) |
Feel this can be improved by adding a `TowerClient` constructor or a factory method.
Making this change would require a significant refactor of the auth and launch commands. This method is called in several places, and its arguments vary by call site. For instance, in the auth command it is first called with the login token and later with the configured token. I would prefer to leave it out of the scope of this PR.