speed up ftw download data command #113
geospatial-jeff wants to merge 4 commits into fieldsoftheworld:main
Will fix tests later; moving into draft until then.

We may need to be a bit careful about this. Did you chat with Jed regarding this? I guess one goal was always also to promote Source, so avoiding it seems counter-productive ;-) I'm fine with the change, but just want to make sure we don't step on someone's toes.

I have spoken with Jed about this (generally speaking, not specific to FTW) and he said they have publicly exposed the S3 bucket for use cases like this that require downloading lots of data. I've seen him mention this publicly in Slack channels as well. The current iteration of the source.coop proxy just isn't reliable; I'm totally down to use it once it has been refactored and works more reliably. The proxy refactor is an ongoing project, although I'm not sure when it is planned to be completed.

OK cool, thanks for confirming. No worries, if Radiant is fine with it we can just move forward with this PR and without a dedicated CLI flag. Edit: Ah, sorry, misclicked on the Copilot review, I hope it doesn't bother you.

Pull Request Overview
This PR speeds up the ftw download data command by introducing parallel downloads, streaming through smart_open, and refactoring helper functions to be private.
- Switch from wget-based downloads to multithreaded S3 streaming via smart_open.
- Download directly from S3 (unsigned) to reduce proxy overhead and memory usage.
- Rename internal helpers with a leading underscore.
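As a rough illustration of the multithreaded pattern summarized above (not the code in this PR), the sketch below runs one download per country in a thread pool and collects failures instead of stopping at the first one. `download_all`, `download_one`, and `max_workers` are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(country_names, download_one, max_workers=4):
    """Call download_one(name) for each country in a thread pool.

    download_one is a hypothetical callable supplied by the caller.
    Returns (sorted list of successes, dict mapping failed name to exception).
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its country so errors can be attributed.
        futures = {pool.submit(download_one, name): name for name in country_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(name)
            except Exception as exc:
                failed[name] = exc
    return sorted(succeeded), failed
```

Collecting failures per country lets one failed download be reported without discarding the results of the others.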
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/ftw_cli/download_ftw.py | Replaced single-threaded download logic with ThreadPoolExecutor and smart_open; refactored helpers |
| pyproject.toml | Added smart_open[s3] dependency and formatted the dask[distributed] entry |
Comments suppressed due to low confidence (1)
src/ftw_cli/download_ftw.py:143
- There are no tests covering the parallel download path. Consider adding tests to validate concurrency handling and error scenarios when using ThreadPoolExecutor.

    exec.map(func, country_names)
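One error scenario worth covering: `Executor.map` returns a lazy iterator, and an exception raised in a worker only surfaces when the corresponding result is retrieved. A minimal sketch of that behavior, using a hypothetical `fake_download` stand-in rather than the real download function:

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(country):
    """Hypothetical stand-in for the real per-country download."""
    if country == "missing":
        raise FileNotFoundError(country)
    return f"{country}.zip"


def run(countries, consume):
    """Map fake_download over countries; optionally consume the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = pool.map(fake_download, countries)
        if consume:
            # Iterating the results re-raises the first worker exception.
            return list(results)
    # Results were never iterated, so any worker error went unnoticed.
    return None
```

If the real code calls `map` without iterating the result, a test along these lines would show whether failed downloads are silently dropped.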
    client = boto3.client(
        "s3", config=Config(signature_version=UNSIGNED), region_name="us-west-1"
The S3 client is configured for us-west-1 but the download URL uses us-west-2. Align the region_name with the actual bucket region or make the bucket endpoint configurable.
Suggested change:

    - client = boto3.client(
    -     "s3", config=Config(signature_version=UNSIGNED), region_name="us-west-1"
    + bucket_region = os.getenv("BUCKET_REGION", "us-west-2")
    + bucket_endpoint = os.getenv("BUCKET_ENDPOINT", "us-west-2.opendata.source.coop")
    + client = boto3.client(
    +     "s3", config=Config(signature_version=UNSIGNED), region_name=bucket_region
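A self-contained version of this suggestion might look like the sketch below. It keeps the region and endpoint defaults in one place so they cannot drift apart. The helper names are hypothetical, and the `BUCKET_REGION` and `BUCKET_ENDPOINT` variables are taken from the suggestion above rather than from the existing code.

```python
import os


def resolve_bucket_settings(env=None):
    """Return (region, endpoint), allowing environment overrides.

    Defaults mirror the values in the suggestion above.
    """
    env = os.environ if env is None else env
    region = env.get("BUCKET_REGION", "us-west-2")
    endpoint = env.get("BUCKET_ENDPOINT", "us-west-2.opendata.source.coop")
    return region, endpoint


def make_unsigned_client(region):
    """Build an anonymous (unsigned) S3 client for a public bucket."""
    # Imported lazily so the settings helper works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client(
        "s3", config=Config(signature_version=UNSIGNED), region_name=region
    )
```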

Yes, we encourage people to bypass the proxy for use cases like this. Need to improve the documentation on this ASAP. Thanks for the nudge.

This is awesome @geospatial-jeff - thanks! It's the slowest part of the process, so it's great to get the times on it down. And agreed on going direct to the S3 bucket. We should aim to be the first to try to switch when the proxy is reliable, but for now it makes sense to go to the S3 bucket.

@cholmes Might be a misunderstanding or not, but this has nothing to do with the inference app. This is not for downloading imagery to run inference on, this is for downloading the FTW baseline from Source/AWS, which is not used in the inference app.

I just tried it and I got a throughput of about 200 Mbit/s with the new code and about 220 Mbit/s for the old code.

Will do some more testing later today 👍

This introduces some performance optimizations to the `ftw download data` command:

- Uses the `smart_open` package, which is specifically designed to stream large files from blob stores. This keeps the memory footprint low, which helps prevent memory usage from exploding when using multiple threads.

And one small syntax change to prefix "private" functions with a `_`.

There are no changes to how `ftw download data` is called, and I've left all the logging, print statements, and exception handling the same. I think these could be improved (e.g. we don't need prints AND logs), but that is outside the scope of this PR.

Closes #112
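
To illustrate why streaming keeps memory low, here is a minimal sketch (not the exact code in this PR) that copies a remote object to disk in fixed-size chunks, so each thread holds at most one chunk in memory at a time. `stream_copy`, `download_s3_object`, and the 1 MiB chunk size are assumptions made for this example; passing a boto3 client through `transport_params` follows smart_open's documented S3 usage.

```python
def stream_copy(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks; return the number of bytes written."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def download_s3_object(uri, local_path, client):
    """Stream an s3:// URI to local_path using a caller-supplied boto3 client."""
    # Imported lazily so stream_copy can be used without smart_open installed.
    from smart_open import open as remote_open

    with remote_open(uri, "rb", transport_params={"client": client}) as src:
        with open(local_path, "wb") as dst:
            return stream_copy(src, dst)
```

With this shape, peak memory per thread is bounded by `chunk_size` rather than by the size of the file being downloaded.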