
Poor File Staging/Unstaging Performance with Google Batch Executor #5653

@KatieAtGordian

Description


Bug report

I am documenting this problem as a known issue, with discussion in Slack.

There is no plan to fix this issue (per the Slack discussion, the proposed solution was deemed "too clunky"), but I would like to document it so that others who encounter it can chime in and/or save time debugging why their Google Batch pipelines take so much longer to complete than on other executors.

Expected behavior and actual behavior

Expected behavior:
File staging and unstaging using the default, built-in support for the Google Batch executor should complete in a reasonable amount of time. "Reasonable" means transfer speeds comparable to standard GCP file transfer achieved by using, e.g., gcloud storage or gsutil.

Actual behavior:
File unstaging/delocalization using the built-in support in the Google Batch executor is extremely slow, e.g., taking ~11 hours for 600 GB of data. This is because it uses gcsfuse, which is significantly slower than the gsutil-based transfer used by the Google Life Sciences executor (via the nxf_gs_upload function in .command.run).
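For comparison, the Life Sciences-era approach boils down to a parallel gsutil copy. The sketch below builds such a command as a string; the bucket path is a placeholder, and the composite-upload threshold shown is an illustrative tuning knob, not necessarily the exact flags nxf_gs_upload uses:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a parallel gsutil upload command for a task's outputs.
# "$1" = local directory, "$2" = destination bucket prefix (placeholder).
build_upload_cmd() {
    local src="$1" dst="$2"
    # -m runs transfers in parallel; the -o option enables parallel
    # composite uploads for files larger than 150 MB.
    printf 'gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M cp -r %s %s/' \
        "$src" "$dst"
}

build_upload_cmd dummy_dir "gs://my-bucket/work/ab/cdef12"
```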

The officially recommended solution is to use Wave/Fusion; however, this solution is not appropriate for all users, especially those who cannot permit the injection of third-party services into their workflows.

This is especially problematic because (1) the poor performance is effectively the default behavior of the executor (Wave/Fusion are the recommended solution but are not enabled by default), and (2) it reflects a performance regression relative to the Life Sciences executor, which has been deprecated by GCP and will not be available after July 8, 2025.

Steps to reproduce the problem

  • Config
    • process.executor = 'google-batch'
    • fusion.enabled = false or omit from config
    • wave.enabled = false or omit from config
  • Run any process that generates at least a few GB of data.
    • See the example process below. Running it with 20 GB of data takes ~27 minutes with the defaults and ~12 minutes with a gsutil-enabled solution I wrote in a fork (~3-4 minutes of which is spent generating the file).
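The reproduction settings above can be collected into a minimal nextflow.config; the bucket, project ID, and region below are placeholders:

```groovy
// nextflow.config (sketch; substitute your own bucket, project, and region)
process.executor = 'google-batch'
workDir          = 'gs://my-bucket/work'

google.project  = 'my-gcp-project'   // placeholder project ID
google.location = 'us-central1'      // placeholder region

// Explicitly disable Wave/Fusion (they are off by default anyway).
fusion.enabled = false
wave.enabled   = false
```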

Here's an example process for testing purposes that just writes a file of the given size.

process DUMMY_WRITE {
    label 'process_single'
    // Request twice the file size in disk to leave headroom for staging.
    disk { (2 * file_size_gb).GB }

    publishDir(
        path: "${params.publish_dir}/",
        mode: 'copy',
    )

    input:
    val file_size_gb

    output:
    path 'dummy_dir/dummy.txt', emit: ch_dummy

    script:
    """
    # Write a file of size file_size_gb
    echo "Writing a file of size ${file_size_gb}GB."
    mkdir dummy_dir/
    dd if=/dev/zero of=dummy_dir/dummy.txt bs=1G count=${file_size_gb}
    echo "Done writing file."
    """
}

Program output

Here are the performance characteristics that caused me to look into this issue. I have a workflow that performs the following steps:

  1. Localizes data:
     • Method: manually staged using gsutil due to the known bucket-underscore issue. See related issues: #3619, #1069, #1527.
     • Size: ~300 GB.
     • Duration: ~8 minutes.
  2. Runs some code:
     • Duration: ~2 hours.
  3. Delocalizes data:
     • Method: Nextflow's built-in gcsfuse support. Files are moved to the workdir only (no publishing).
     • Size: ~600 GB.
     • Duration: ~11 hours. A comparable upload using gsutil takes ~15 minutes.
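A back-of-the-envelope conversion of those numbers into throughput (using 1 GB = 1000 MB and integer division for simplicity) makes the gap concrete:

```shell
#!/usr/bin/env bash
set -euo pipefail

# 600 GB over ~11 hours via gcsfuse vs. ~15 minutes via gsutil.
gcsfuse_mb_s=$(( 600 * 1000 / (11 * 3600) ))
gsutil_mb_s=$((  600 * 1000 / (15 * 60)   ))

echo "gcsfuse: ~${gcsfuse_mb_s} MB/s"
echo "gsutil:  ~${gsutil_mb_s} MB/s"
```

That works out to roughly 15 MB/s through gcsfuse versus roughly 666 MB/s through gsutil, a ~40x difference.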

Environment

  • Nextflow version: 24.10.0
  • Java version: openjdk 11.0.25 2024-10-15
  • Operating system: Linux
  • Bash version: zsh 5.8.1 (x86_64-ubuntu-linux-gnu)

Additional context

See this Slack discussion for additional context.
