
speed up ftw download data command #113

Open

geospatial-jeff wants to merge 4 commits into fieldsoftheworld:main from geospatial-jeff:faster-data-download

Conversation

Collaborator

@geospatial-jeff geospatial-jeff commented Jul 3, 2025

This introduces some performance optimizations to the ftw download data command:

  • Download files in parallel across multiple threads.
  • Use the smart_open package, which is specifically designed to stream large files from blob stores. This keeps the memory footprint low and helps prevent memory usage from ballooning when using multiple threads.
  • Download directly from the S3 bucket, instead of going through the source.coop proxy.

There is also one small naming change: "private" functions are now prefixed with an underscore (_).

There are no changes to how ftw download data is called, and I've left all the logging, print statements, and exception handling the same. I think these could be improved (e.g. we don't need prints AND logs), but that is outside the scope of this PR.

Closes #112

@geospatial-jeff
Collaborator Author

will fix tests later, moving into draft until then

@geospatial-jeff geospatial-jeff marked this pull request as draft July 3, 2025 16:28
Collaborator

m-mohr commented Jul 3, 2025

Download directly from the S3 bucket, instead of going through the source.coop proxy.

We may need to be a bit careful about this. Did you chat with Jed regarding it? I guess one goal was always also to promote Source, so bypassing it seems counter-productive ;-) I'm fine with the change, but I just want to make sure we don't step on anyone's toes.

Collaborator Author

geospatial-jeff commented Jul 3, 2025

I have spoken with Jed about this (generally speaking, not specific to FTW) and he said they have publicly exposed the S3 bucket for use cases like this that require downloading lots of data. I've seen him mention this publicly in slack channels as well.

The current iteration of the source.coop proxy just isn't reliable, I'm totally down to use it once it has been refactored and works more reliably. The proxy refactor is an ongoing project, although I'm not sure when it is planned to be completed.

smart_open works with HTTP as well so it's a simple change to switch back over to the proxy. Maybe we should expose a CLI flag --proxy/--no-proxy so users can pick which to use?
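The suggested flag could be sketched with the standard library like this. The real ftw CLI may be built on a different framework, so the parser and option names here are purely illustrative:

```python
import argparse

def build_parser():
    # Hypothetical sketch: BooleanOptionalAction generates the exact
    # --proxy/--no-proxy pair suggested above from a single argument.
    parser = argparse.ArgumentParser(prog="ftw-download-sketch")
    parser.add_argument(
        "--proxy",
        action=argparse.BooleanOptionalAction,
        default=False,  # default to direct S3 access, as this PR proposes
        help="route downloads through the source.coop proxy instead of S3",
    )
    return parser
```

With this shape, `ftw download data --proxy` would opt back into the proxy while the default stays on direct S3 access.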

Collaborator

m-mohr commented Jul 3, 2025

Ok cool, thanks for confirming. No worries, if Radiant is fine with it we can just move forward with this PR and without a dedicated CLI flag.

Edit: Ah, sorry, I misclicked on the Copilot review; I hope it doesn't bother you.

@m-mohr m-mohr requested review from Copilot and m-mohr July 3, 2025 17:17
Contributor

Copilot AI left a comment


Pull Request Overview

This PR speeds up the ftw download data command by introducing parallel downloads, streaming through smart_open, and refactoring helper functions to be private.

  • Switch from wget-based downloads to multithreaded S3 streaming via smart_open.
  • Download directly from S3 (unsigned) to reduce proxy overhead and memory usage.
  • Rename internal helpers with a leading underscore.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File                        | Description
src/ftw_cli/download_ftw.py | Replaced single-threaded download logic with ThreadPoolExecutor and smart_open; refactored helpers
pyproject.toml              | Added smart_open[s3] dependency and formatted the dask[distributed] entry
Comments suppressed due to low confidence (1)

src/ftw_cli/download_ftw.py:143

  • There are no tests covering the parallel download path. Consider adding tests to validate concurrency handling and error scenarios when using ThreadPoolExecutor.
        exec.map(func, country_names)

Comment thread src/ftw_cli/download_ftw.py Outdated
Comment on lines +16 to +17
client = boto3.client(
"s3", config=Config(signature_version=UNSIGNED), region_name="us-west-1"

Copilot AI Jul 3, 2025


The S3 client is configured for us-west-1 but the download URL uses us-west-2. Align the region_name with the actual bucket region or make the bucket endpoint configurable.

Suggested change:

  - client = boto3.client(
  -     "s3", config=Config(signature_version=UNSIGNED), region_name="us-west-1"
  + bucket_region = os.getenv("BUCKET_REGION", "us-west-2")
  + bucket_endpoint = os.getenv("BUCKET_ENDPOINT", "us-west-2.opendata.source.coop")
  + client = boto3.client(
  +     "s3", config=Config(signature_version=UNSIGNED), region_name=bucket_region

@jedsundwall

Yes, we encourage people to bypass the proxy for use cases like this. Need to improve the documentation on this asap. Thanks for the nudge.

Member

cholmes commented Jul 6, 2025

This is awesome @geospatial-jeff - thanks! It's the slowest part of the process, so it's great to get the times on it down. And agreed on going direct to the S3 bucket. We should aim to be the first to switch back once the proxy is reliable, but for now it makes sense to go straight to S3.

Collaborator

m-mohr commented Jul 6, 2025

@cholmes Might be a misunderstanding or not, but this has nothing to do with the inference app. This is not for downloading imagery to run inference on, this is for downloading the FTW baseline from Source/AWS, which is not used in the inference app.

@geospatial-jeff geospatial-jeff marked this pull request as ready for review July 7, 2025 22:29
Collaborator

m-mohr commented Jul 8, 2025

I just tried it and got a throughput of about 200 MBit/s with the new code and about 220 MBit/s with the old code.
(The 20 MBit/s difference is not significant for me. My connection would allow for 550 MBit/s, so I'm not sure what exactly limits me to 2xx MBit/s here.)
So for me there's no significant difference with regards to speed, but if others see an improvement, I'm happy to merge.

Collaborator Author

Will do some more testing later today 👍

@m-mohr m-mohr added this to the v2.0 milestone Aug 19, 2025


Development

Successfully merging this pull request may close these issues.

ftw data download is painfully slow

5 participants