Skip to content

navigate_connected_paths performance and ability to submit pairs (or a data.frame)#77

Open
apsoras wants to merge 7 commits intoDOI-USGS:mainfrom
apsoras:navigate_connected_paths_performance
Open

navigate_connected_paths performance and ability to submit pairs (or a data.frame)#77
apsoras wants to merge 7 commits intoDOI-USGS:mainfrom
apsoras:navigate_connected_paths_performance

Conversation

@apsoras
Copy link
Copy Markdown

@apsoras apsoras commented Apr 2, 2026

Very useful package!
My use case for navigate_connected_paths was 45,000 pairs of nhdplus flowlines which I knew to be up/downstream of each other. Because navigate_connected_paths does not accept pairs, I had to do this in a loop. The preprocessing of the input hy object was at least 50% of the time of execution by my checks in profvis, which means the time it took to process at least doubled.

In addition, there were several instances where I ran into recursion errors in get_dwn.

Overall:

  • changed navigate_connected_paths to accept a list of pairs or 2-col data.frame (executed 1:1), or lopsided list of pairs (executed with expansion) in addition to a vector
  • changed get_dwn to not use recursion (got Error: evaluation nested too deeply: infinite recursion / options(expressions=)? at times with microbenchmark) - R doesn't optimize recursion in any way so a while loop avoids this risk in big graphs.
  • changed get_dwn to use lists - avoiding c() calls in each iteration was just shy of a 100X speedup, adding to a list avoids memory shuffling associated with resizing vectors/lists.
  • changed get_path to use x[length(x) rather than tail - >100X speed up
  • cleaned up final code bits where lengths were merged in and summaries were calculated

Overall, the function can do my 45k pair list now in about 8 seconds when before the loop would not complete in <1 h.

The below performance is different to 8 s mentioned above because the length of the paths range widely in my dataset and each one in my set is guaranteed to be a matched pair, but we can approximate it with a vector of outlets that is of length 310, (45150 pairs via combn prior to deduplicating)

Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
     current 5085.205 5096.812 5234.333 5214.840 5304.968 5469.838     5
 this_branch  980.914 1071.897 1253.559 1231.648 1263.860 1719.477     5

The ability to accept pairs is the main goal as that is added functionality, but ~5X improvement in the base case of outlets being a vector doesn't hurt!

apsoras added 4 commits April 2, 2026 16:22
…rocessing in once function call. Also removed recursion from get_dwn and removed repeated c() call to speed up get_dwn by a factor of 10. Overall perf improvement is a reduction of 50% for 100 outlets
@dblodgett-usgs
Copy link
Copy Markdown
Collaborator

Good deal. Thanks for the contribution! If you want to be recognized as a contributor, please add your contact info to the package description. Do you have a test case to contribute? I'd like to add test coverage for this case if possible.

@apsoras
Copy link
Copy Markdown
Author

apsoras commented Apr 4, 2026

Great! I will add my info and tests and ping back when that is set up.

@apsoras
Copy link
Copy Markdown
Author

apsoras commented Apr 5, 2026

Alright, I think I have covered everything test-wise and I added myself as a ctb which was hopefully correct!

I also changed the return value back to data.frame() since rbindlist returns a data.table.

The one failing test that occurred in the workflow earlier was due to my removing the message about calculating distances was removed since that ended up being done in the same function call as get_path, so it now only returns 2 messages when status = TRUE rather than 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants