
speed up ftw download data command #113

Open

geospatial-jeff wants to merge 4 commits into fieldsoftheworld:main from geospatial-jeff:faster-data-download

Conversation

Collaborator

@geospatial-jeff geospatial-jeff commented Jul 3, 2025

This introduces some performance optimizations to the ftw download data command:

  • Download files in parallel across multiple threads.
  • Use the smart_open package, which is specifically designed to stream large files from blob stores. This keeps the memory footprint low and helps prevent memory usage from ballooning when using multiple threads.
  • Download directly from the S3 bucket, instead of going through the source.coop proxy.

There is also one small naming change: "private" functions are now prefixed with an underscore (_).

There are no changes to how ftw download data is called, and I've left all the logging, print statements, and exception handling the same. I think these could be improved (e.g. we don't need prints AND logs), but that is outside the scope of this PR.

Closes #112

@geospatial-jeff
Collaborator Author

will fix tests later, moving into draft until then

@geospatial-jeff geospatial-jeff marked this pull request as draft July 3, 2025 16:28
Collaborator

m-mohr commented Jul 3, 2025

Download directly from the S3 bucket, instead of going through the source.coop proxy.

We may need to be a bit careful about this. Did you chat with Jed regarding it? I guess one goal was always also to promote Source, so bypassing it seems counter-productive ;-) I'm fine with the change, but I just want to make sure we don't step on anyone's toes.

Collaborator Author

geospatial-jeff commented Jul 3, 2025

I have spoken with Jed about this (generally speaking, not specific to FTW) and he said they have publicly exposed the S3 bucket for use cases like this that require downloading lots of data. I've seen him mention this publicly in slack channels as well.

The current iteration of the source.coop proxy just isn't reliable, I'm totally down to use it once it has been refactored and works more reliably. The proxy refactor is an ongoing project, although I'm not sure when it is planned to be completed.

smart_open works with HTTP as well so it's a simple change to switch back over to the proxy. Maybe we should expose a CLI flag --proxy/--no-proxy so users can pick which to use?
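The suggested flag could be sketched with the standard library like this. The real ftw CLI may be built on a different framework, so the parser and option names here are purely illustrative:

```python
import argparse

def build_parser():
    # Hypothetical sketch: BooleanOptionalAction generates the exact
    # --proxy/--no-proxy pair suggested above from a single argument.
    parser = argparse.ArgumentParser(prog="ftw-download-sketch")
    parser.add_argument(
        "--proxy",
        action=argparse.BooleanOptionalAction,
        default=False,  # default to direct S3 access, as this PR proposes
        help="route downloads through the source.coop proxy instead of S3",
    )
    return parser
```

With this shape, `ftw download data --proxy` would opt back into the proxy while the default stays on direct S3 access.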

Collaborator

m-mohr commented Jul 3, 2025

Ok cool, thanks for confirming. No worries, if Radiant is fine with it we can just move forward with this PR and without a dedicated CLI flag.

Edit: Ah, sorry, I misclicked on the Copilot review; I hope it doesn't bother you.

@m-mohr m-mohr requested review from Copilot and m-mohr July 3, 2025 17:17
Contributor

Copilot AI left a comment


Pull Request Overview

This PR speeds up the ftw download data command by introducing parallel downloads, streaming through smart_open, and refactoring helper functions to be private.

  • Switch from wget-based downloads to multithreaded S3 streaming via smart_open.
  • Download directly from S3 (unsigned) to reduce proxy overhead and memory usage.
  • Rename internal helpers with a leading underscore.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File                        | Description
src/ftw_cli/download_ftw.py | Replaced single-threaded download logic with ThreadPoolExecutor and smart_open; refactored helpers
pyproject.toml              | Added smart_open[s3] dependency and formatted the dask[distributed] entry
Comments suppressed due to low confidence (1)

src/ftw_cli/download_ftw.py:143

  • There are no tests covering the parallel download path. Consider adding tests to validate concurrency handling and error scenarios when using ThreadPoolExecutor.
        exec.map(func, country_names)

Comment thread src/ftw_cli/download_ftw.py Outdated
Comment on lines +16 to +17
client = boto3.client(
"s3", config=Config(signature_version=UNSIGNED), region_name="us-west-1"

Copilot AI Jul 3, 2025


The S3 client is configured for us-west-1 but the download URL uses us-west-2. Align the region_name with the actual bucket region or make the bucket endpoint configurable.

Suggested change:

  - client = boto3.client(
  -     "s3", config=Config(signature_version=UNSIGNED), region_name="us-west-1"
  + bucket_region = os.getenv("BUCKET_REGION", "us-west-2")
  + bucket_endpoint = os.getenv("BUCKET_ENDPOINT", "us-west-2.opendata.source.coop")
  + client = boto3.client(
  +     "s3", config=Config(signature_version=UNSIGNED), region_name=bucket_region

@jedsundwall

Yes, we encourage people to bypass the proxy for use cases like this. Need to improve the documentation on this asap. Thanks for the nudge.

Member

cholmes commented Jul 6, 2025

This is awesome @geospatial-jeff - thanks! It's the slowest part of the process, so it's great to get the times on it down. And agreed on going direct to the S3 bucket. We should aim to be the first to switch back once the proxy is reliable, but for now it makes sense to go straight to S3.

Collaborator

m-mohr commented Jul 6, 2025

@cholmes Might be a misunderstanding or not, but this has nothing to do with the inference app. This is not for downloading imagery to run inference on, this is for downloading the FTW baseline from Source/AWS, which is not used in the inference app.

@geospatial-jeff geospatial-jeff marked this pull request as ready for review July 7, 2025 22:29
Collaborator

m-mohr commented Jul 8, 2025

I just tried it and got a throughput of about 200 MBit/s with the new code and about 220 MBit/s with the old code.
(The 20 MBit/s difference is not significant for me. My connection would allow for 550 MBit/s, so I'm not sure what exactly limits me to 2xx MBit/s here.)
So for me there's no significant difference with regards to speed, but if others see an improvement, I'm happy to merge.

Collaborator Author

Will do some more testing later today 👍

@m-mohr m-mohr added this to the v2.0 milestone Aug 19, 2025


Development

Successfully merging this pull request may close these issues.

ftw data download is painfully slow

5 participants