navigate_connected_paths performance and ability to submit pairs (or a data.frame) #77
apsoras wants to merge 7 commits into DOI-USGS:main
…rocessing in one function call. Also removed recursion from `get_dwn` and removed the repeated `c()` call to speed up `get_dwn` by a factor of 10. The overall performance improvement is a 50% reduction for 100 outlets.

Good deal. Thanks for the contribution! If you want to be recognized as a contributor, please add your contact info to the package description. Do you have a test case to contribute? I'd like to add test coverage for this case if possible.

Great! I will add my info and tests and ping back when that is set up.

Alright, I think I have covered everything test-wise, and I added myself as a ctb, which was hopefully correct! I also changed the return value back to `data.frame()` since … The one failing test that occurred in the workflow earlier happened because I removed the message about calculating distances; that step now runs in the same function call as `get_path`, so it now only returns 2 messages when …

Very useful package!
My use case for `navigate_connected_paths` was 45,000 pairs of NHDPlus flowlines which I knew to be up/downstream of each other. Because `navigate_connected_paths` does not accept pairs, I had to do this in a loop. By my checks in `profvis`, preprocessing the input `hy` object took at least 50% of the execution time, which means the total processing time at least doubled.

In addition, there were several instances where I ran into recursion errors in `get_dwn`.

Overall, this PR updates:

- `navigate_connected_paths` to accept a list of pairs or a 2-column data.frame (executed 1:1), or a lopsided list of pairs (executed with expansion), in addition to a vector.
- `get_dwn` to not use recursion (I got `Error: evaluation nested too deeply: infinite recursion / options(expressions=)?` at times with microbenchmark). R doesn't optimize recursion in any way, so a while loop avoids this risk in big graphs.
- `get_dwn` to use lists. Avoiding `c()` calls in each iteration gave just shy of a 100X speedup; adding to a list avoids the memory shuffling associated with resizing vectors/lists.
- `get_path` to use `x[length(x)]` rather than `tail` (>100X speedup).

Overall, the function can now process my 45k pair list in about 8 seconds, whereas before the loop would not complete in <1 h.
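
The `get_dwn` and `get_path` changes listed above can be illustrated with a minimal sketch on a toy network. The names `toid`, `walk_recursive`, and `walk_iterative` are hypothetical and are not hydroloom's internals; the sketch assumes an acyclic network in which each id has at most one downstream id.

```r
# Hypothetical toy network: id i drains to toid[i]; NA marks the outlet.
# Here the chain is 1 -> 2 -> ... -> n.
n <- 5000L
toid <- c(2:n, NA)

# Recursive walk: one call frame per step, growing the result with c().
# On long chains this style can hit R's expression nesting limit.
walk_recursive <- function(id, toid) {
  nxt <- toid[id]
  if (is.na(nxt)) return(id)
  c(id, walk_recursive(nxt, toid))
}

# Iterative walk: a while loop writing into a preallocated list,
# so no call stack growth and no repeated c() reallocation.
walk_iterative <- function(id, toid) {
  out <- vector("list", length(toid))  # upper bound on path length (acyclic)
  i <- 0L
  while (!is.na(id)) {
    i <- i + 1L
    out[[i]] <- id
    id <- toid[id]
  }
  unlist(out[seq_len(i)])
}

# Both approaches agree on a short chain.
small <- c(2:10, NA)
same_on_small <- identical(walk_recursive(1L, small), walk_iterative(1L, small))

# The iterative version handles the long chain without deep recursion.
path <- walk_iterative(1L, toid)

# Last element by direct indexing rather than tail().
last_id <- path[length(path)]
```

Direct indexing with `path[length(path)]` returns the same value as `tail(path, 1)` for a plain vector while skipping the generic dispatch and argument handling inside `tail`.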
The performance below differs from the 8 s mentioned above because the path lengths range widely in my dataset and each pair in my set is guaranteed to be a matched pair, but we can approximate it with a vector of outlets of length 310 (45150 pairs via `combn` prior to deduplicating).

The ability to accept pairs is the main goal, as that is added functionality, but a ~5X improvement in the base case of `outlets` being a vector doesn't hurt!
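
For reference, here is a small sketch of how a vector of outlets expands into pairs with base R's `combn`, and how those pairs can be held in a 2-column data.frame. The ids and the column names `id_1` and `id_2` are made up for illustration; this does not show the package's actual argument format.

```r
# Hypothetical outlet ids, for illustration only.
outlets <- c(101L, 102L, 103L, 104L)

# combn() returns a 2 x choose(n, 2) matrix; transpose to one pair per row.
pairs <- as.data.frame(t(combn(outlets, 2)))
names(pairs) <- c("id_1", "id_2")

# The number of pairs grows quadratically with the number of outlets.
n_pairs <- nrow(pairs)
```

Because the pair count is `choose(n, 2)`, a few hundred outlets already produce tens of thousands of pairs, which is why preprocessing once per call rather than once per pair matters.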