Bug report
I am documenting this problem as a known issue, with discussion here in Slack.
There is currently no plan to fix this issue (per the Slack discussion, the proposed solution is "too clunky"), but I would like to document it so that others encountering it can chime in and/or save time debugging why their Google Batch pipelines take so much longer to complete than on other executors.
Expected behavior and actual behavior
Expected behavior:
File staging and unstaging using the default, built-in support for the Google Batch executor should complete in a reasonable amount of time. "Reasonable" means transfer speeds comparable to standard GCP file transfer achieved by using, e.g., gcloud storage or gsutil.
Actual behavior:
File unstaging/delocalization using the built-in support in the Google Batch executor is extremely slow, e.g. taking ~11 hours for 600 GB of data. This is because it uses gcsfuse, which is significantly slower than the gsutil-based transfer used by the Google Life Sciences executor (via the nxf_gs_upload function in .command.run).
The officially recommended solution is to use Wave/Fusion; however, this solution is not appropriate for all users, especially those who cannot support the injection of 3rd party services into their workflows.
This is especially problematic because (1) poor performance is effectively the default behavior for the executor (Wave/Fusion are the recommended solution but are not enabled by default), and (2) it is a regression relative to the Life Sciences executor, which GCP has deprecated and which will no longer be available after July 8, 2025.
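For users who cannot enable Wave/Fusion, the practical workaround is to bypass gcsfuse and upload task outputs with gsutil directly, as the Life Sciences executor does. The helper below is a hypothetical sketch (the function name, threshold value, and DRY_RUN switch are my own, not Nextflow's actual nxf_gs_upload implementation); the point it illustrates is gsutil's -m flag plus parallel composite uploads, which the gcsfuse write path does not use.

```shell
# Hypothetical gsutil-based unstage helper (name and flags chosen for
# illustration; not Nextflow's built-in implementation). DRY_RUN=1 prints
# the command instead of running it, so the sketch can be exercised
# without GCP credentials.
nxf_gsutil_upload() {
    local src=$1 dst=$2
    # -m parallelizes transfers; the composite-upload threshold splits
    # large files into chunks that upload concurrently.
    local cmd="gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M cp -r $src $dst"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}
```

For example, `DRY_RUN=1 nxf_gsutil_upload dummy_dir gs://my-bucket/work/` prints the command that would run (the bucket path is a placeholder).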
Steps to reproduce the problem
- Config:
  process.executor = 'google-batch'
  fusion.enabled = false (or omit from config)
  wave.enabled = false (or omit from config)
- Run any process that generates at least a few GB of data.
- See the example process below. Running it with 20 GB of data takes ~27 minutes with the defaults and ~12 minutes with a gsutil-enabled solution I wrote in a fork (~3-4 minutes of each run is spent generating the file).
Here's an example process for testing purposes that just writes a file of the given size.
process DUMMY_WRITE {
    label 'process_single'

    // Reserve twice the file size in disk
    disk { (2 * file_size_gb).GB }

    publishDir path: "${params.publish_dir}/", mode: 'copy'

    input:
    val file_size_gb

    output:
    path 'dummy_dir/dummy.txt', emit: ch_dummy

    script:
    """
    # Write a file of size file_size_gb
    echo "Writing a file of size ${file_size_gb}GB."
    mkdir dummy_dir/
    dd if=/dev/zero of=dummy_dir/dummy.txt bs=1G count=${file_size_gb}
    echo "Done writing file."
    """
}
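The script block above can be smoke-tested outside Nextflow; the standalone version below scales the write down to 10 MB so it runs anywhere (the report's actual runs use bs=1G with counts of 20 or more):

```shell
# Scaled-down version of the process's write step (10 MB instead of 20 GB).
mkdir -p dummy_dir
dd if=/dev/zero of=dummy_dir/dummy.txt bs=1M count=10 status=none
wc -c < dummy_dir/dummy.txt   # expect 10485760 bytes
```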
Program output
Here are the performance characteristics that led me to look into this issue. I have a workflow that performs the following steps:
- Localizes data:
  - Method: manually staged using gsutil due to the known bucket-underscore issue. See related issues: #3619, #1069, #1527.
  - Size: ~300 GB.
  - Duration: ~8 minutes.
- Runs some code.
- Delocalizes data:
  - Method: Nextflow's built-in gcsfuse support. Files are moved to the workdir only (no publishing).
  - Size: ~600 GB.
  - Duration: ~11 hours. A comparable upload using gsutil takes ~15 minutes.
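The reported sizes and durations imply roughly a 45x sustained-throughput gap between the two paths; the back-of-envelope arithmetic:

```shell
# Implied sustained throughput from the timings reported above
# (integer division, so values are approximate).
size_mb=$((600 * 1024))      # ~600 GB delocalized
gcsfuse_s=$((11 * 3600))     # ~11 hours via gcsfuse
gsutil_s=$((15 * 60))        # ~15 minutes via gsutil
echo "gcsfuse: ~$((size_mb / gcsfuse_s)) MB/s"   # ~15 MB/s
echo "gsutil:  ~$((size_mb / gsutil_s)) MB/s"    # ~682 MB/s
```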
Environment
- Nextflow version: 24.10.0
- Java version: openjdk 11.0.25 2024-10-15
- Operating system: Linux
- Bash version: zsh 5.8.1 (x86_64-ubuntu-linux-gnu)
Additional context
See this slack discussion for additional context.