perf: fn rav1d_prepare_intra_edges: pre-slice#1415
perf: fn rav1d_prepare_intra_edges: pre-slice#1415ERobsham wants to merge 2 commits intomemorysafety:mainfrom
fn rav1d_prepare_intra_edges: pre-slice#1415Conversation
rework `rav1d_prepare_intra_edges` to precalculate and take a slice of the data to be iterated over so we only incur the overhead of taking the `DisjointImmutGuard` once, which also allows us to safely use the unsafe `unsafe fn get_unchecked(..)` within the loop avoiding the overhead of any runtime bounds checks, since the `slice()` opertaion already implcitly bounds checked the slice we're asking for. inital hyperfine results show a overall improvement (albeit fairly minisucle): ``` Benchmark 1: ./target/release/dav1d -q -i test-data/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null Time (mean ± σ): 9.384 s ± 0.057 s [User: 108.379 s, System: 6.500 s] Range (min … max): 9.342 s … 9.516 s 10 runs Benchmark 1: ./target.main/release/dav1d -q -i test-data/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null Time (mean ± σ): 9.490 s ± 0.076 s [User: 110.062 s, System: 6.521 s] Range (min … max): 9.383 s … 9.606 s 10 runs ``` (changed version took ~98.88% of the original runtime)
|
Okay, looks like I missed running the test script with some flags that CI uses, so I'll take a look at that... However, this errror seems fairly concerning for the general concept of hoisting the borrow out of the loop: It seems like maybe there is no getting around taking a Although that makes me wonder if there is an intentional mechanism keeping those Otherwise, couldn't we still run into the same issue for the |
fix tests when running with `-n` flag run `cargo fmt`
|
Okay, updated to ensure tests pass while running the following locally: cargo build --release
rm -rf build
./.github/workflows/test.sh \
-r target/release/dav1d \
-s target/release/seek_stress
./.github/workflows/test.sh \
-r target/release/dav1d \
-s target/release/seek_stress \
-n
./.github/workflows/test.sh \
-r target/release/dav1d \
-s target/release/seek_stress \
-f 1
./.github/workflows/test.sh \
-r target/release/dav1d \
-s target/release/seek_stress \
-f 2However, Im not seeing the overlap issue locally, so that one is still rather concerning. |
kkysen
left a comment
There was a problem hiding this comment.
cargo build --release rm -rf build ./.github/workflows/test.sh \ -r target/release/dav1d \ -s target/release/seek_stress ./.github/workflows/test.sh \ -r target/release/dav1d \ -s target/release/seek_stress \ -n ./.github/workflows/test.sh \ -r target/release/dav1d \ -s target/release/seek_stress \ -f 1 ./.github/workflows/test.sh \ -r target/release/dav1d \ -s target/release/seek_stress \ -f 2However, Im not seeing the overlap issue locally, so that one is still rather concerning.
The overlap checks are only run with debug_assertions on (they should really be on always, but they are very bad for perf, and we don't know a better way to implement it), so you need to test with the opt-dev or checked-release profiles.
| 0 | ||
| }; | ||
| let dst_slice = | ||
| &*(dst + slice_offset - 1isize).slice::<BD>((px_have - 1) * abs_stride + 1); |
There was a problem hiding this comment.
The slice here overlaps ones from other threads, as the individual indices accessed are interleaved. That's why .index::<BD>() works, but not a single .slice::<BD>(). It's unfortunately a very tricky problem.
fn rav1d_prepare_intra_edges: pre-slice
fn prepare_intra_edgesis slower in rav1d than dav1d #1394.rework
rav1d_prepare_intra_edgesto precalculate and take a slice of the data to be iterated over so we only incur the overhead of taking theDisjointImmutGuardonce, which also allows us to safely use the unsafeunsafe fn get_unchecked(..)within the loop avoiding the overhead of any runtime bounds checks, since theslice()opertaion already implcitly bounds checked the slice we're asking for.inital hyperfine results show a overall improvement (albeit fairly minisucle):
(changed version took ~98.88% of the original runtime)