fn create_lf_mask_inter: pull level_cache slice out of inner loop
#1401
base: main
Changes from all commits
```diff
@@ -14,6 +14,7 @@ use crate::src::levels::SegmentId;
 use crate::src::levels::TxfmSize;
 use crate::src::relaxed_atomic::RelaxedAtomic;
 use crate::src::tables::dav1d_txfm_dimensions;
+use bytemuck;
 use libc::ptrdiff_t;
 use parking_lot::RwLock;
 use std::cmp;
```
```diff
@@ -149,7 +150,7 @@ fn decomp_tx(
     };
 }

-#[inline]
+#[inline(always)]
 fn mask_edges_inter(
     masks: &[[[[RelaxedAtomic<u16>; 2]; 3]; 32]; 2],
     by4: usize,
```
```diff
@@ -261,7 +262,7 @@ fn mask_edges_inter(
     a[..w4].copy_from_slice(txa_slice);
 }

-#[inline]
+#[inline(always)]
 fn mask_edges_intra(
     masks: &[[[[RelaxedAtomic<u16>; 2]; 3]; 32]; 2],
     by4: usize,
```
```diff
@@ -452,12 +453,13 @@ pub(crate) fn rav1d_create_lf_mask_intra(
     if bw4 != 0 && bh4 != 0 {
         let mut level_cache_off = by * b4_stride + bx;
         for _y in 0..bh4 {
+            let idx = 4 * level_cache_off;
+            let lvl = &mut *level_cache.index_mut((idx + 0.., ..4 * bw4));
+            let lvl: &mut [[u8; 4]] = bytemuck::cast_slice_mut(lvl);
             for x in 0..bw4 {
-                let idx = 4 * (level_cache_off + x);
-                // `0.., ..2` is for Y
-                let lvl = &mut *level_cache.index_mut((idx + 0.., ..2));
-                lvl[0] = filter_level[0][0][0];
-                lvl[1] = filter_level[1][0][0];
+                lvl[x][0] = filter_level[0][0][0];
+                lvl[x][1] = filter_level[1][0][0];
```
**iximeow** commented on lines +461 to +462:

i was also looking at this function yesterday. you might want to keep an eye out for ways to convince llvm/rustc that this store can be one 16-bit store instead of a pair of 8-bit stores (i took a "get out and push" approach while proving it out, myself: iximeow@e133a89). the factor by which the loop is unrolled is partially a question of the size of the loop: where this change produces pairs of 5-byte stores for each iteration, i'd seen slightly better throughput still with the iteration expressed as one 4-byte store. that said, i'd initially taken an approach of re-slicing myself. i also had a hard time precisely measuring this one on a 9950x, but my change was an improvement on the order of 0.5% for the 10-bit video. on the 8-bit video, where i have better numbers: ~1% fewer loads/stores dispatched (presumably mostly because of the loop unrolling), 0.6% fewer instructions in total, and 0.25% fewer cycles in total. i would expect your change to have similar results, maybe with a smaller change overall because of the slightly less aggressive loop unrolling.
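A minimal sketch of the 16-bit-store idea described in the comment above (the helper name is hypothetical, not code from the PR): packing the two level bytes into a `u16` first gives the optimizer an obvious single 2-byte store for bytes 0..2 of a block. Whether LLVM actually fuses it is worth confirming in the generated assembly.

```rust
/// Write two level bytes into the front of a 4-byte block as one pair.
fn store_pair(block: &mut [u8; 4], a: u8, b: u8) {
    // Round-trips through native byte order, so the result is exactly
    // `block[0] = a; block[1] = b;` regardless of endianness.
    let pair = u16::from_ne_bytes([a, b]);
    block[..2].copy_from_slice(&pair.to_ne_bytes());
}

fn main() {
    let mut block = [0u8; 4];
    store_pair(&mut block, 7, 9);
    assert_eq!(block, [7, 9, 0, 0]);
}
```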
**Author**:

Thanks for the help! I hadn't noticed that; let me try to convince LLVM to fuse the store, similar to what you're doing. Your numbers sound right: they're exactly what I would estimate based on the performance improvement to this function alone. Let me look at instruction/cycle counts; they may be more useful than wall clock, since there's just too much noise to measure sub-1% improvements consistently. If you'd like to share thoughts or work together on completing this fix, let me know!
**Author**:

On 10-bit Chimera over 4000 frames, I'm now getting 1.8% fewer instructions and 0.5-1.2% better wall clock compared to baseline. The former seems reproducible and consistent; the latter is still affected by noise. That seems better than your result, which is suspicious, so I'll take a closer look.

**iximeow**:

i think those observations make sense for the 10-bit video, and in particular on a 16-core processor. i noticed this weekend that dav1d and rav1d both struggle to keep threads busy past ~12 threads, and for some inputs more threads past 18 start to harm throughput. i haven't looked too much into that, since the overall behavior is very consistent between the two, but it definitely seems like the overall work-scheduling situation could be... much better.
**Collaborator**:

I think I found a way to do this simpler and safer. I'll open a PR for just that part. Thanks for finding out the 2
```diff
             }
             level_cache_off += b4_stride;
         }
```
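The rewrite above hoists one `index_mut` slice per row and reinterprets it as 4-byte blocks. A dependency-free sketch of the same access pattern (hypothetical sizes and level values; std's `chunks_exact_mut` stands in for `bytemuck::cast_slice_mut`) shows why hoisting helps: one bounds-checked slice per row instead of one per block:

```rust
/// Fill `rows` rows of a flat byte-level cache, 4 bytes per block,
/// writing `levels` into bytes 0 and 1 of each block (the Y slots).
fn fill_y_levels(rows: usize, bw4: usize, b4_stride: usize, levels: [u8; 2]) -> Vec<u8> {
    let mut level_cache = vec![0u8; rows * b4_stride * 4];
    let mut level_cache_off = 0;
    for _y in 0..rows {
        let idx = 4 * level_cache_off;
        // One slice (one bounds check) per row, mirroring the hoisted
        // `level_cache.index_mut(..)` in the diff above.
        let row = &mut level_cache[idx..idx + 4 * bw4];
        // View the row as 4-byte blocks, like `cast_slice_mut` does.
        for lvl in row.chunks_exact_mut(4) {
            lvl[0] = levels[0];
            lvl[1] = levels[1];
        }
        level_cache_off += b4_stride;
    }
    level_cache
}

fn main() {
    let cache = fill_y_levels(2, 3, 3, [7, 9]);
    assert_eq!(&cache[..4], &[7, 9, 0, 0]);
    assert_eq!(&cache[12..16], &[7, 9, 0, 0]);
}
```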
```diff
@@ -490,12 +492,13 @@ pub(crate) fn rav1d_create_lf_mask_intra(
         let mut level_cache_off = (by >> ss_ver) * b4_stride + (bx >> ss_hor);
         for _y in 0..cbh4 {
+            let idx = 4 * level_cache_off;
+            let lvl = &mut *level_cache.index_mut((idx + 0.., ..4 * cbw4));
+            let lvl: &mut [[u8; 4]] = bytemuck::cast_slice_mut(lvl);
             for x in 0..cbw4 {
-                let idx = 4 * (level_cache_off + x);
-                // `2.., ..2` is for UV
-                let lvl = &mut *level_cache.index_mut((idx + 2.., ..2));
-                lvl[0] = filter_level[2][0][0];
-                lvl[1] = filter_level[3][0][0];
+                lvl[x][2] = filter_level[2][0][0];
+                lvl[x][3] = filter_level[3][0][0];
             }
             level_cache_off += b4_stride;
         }
```
```diff
@@ -550,12 +553,13 @@ pub(crate) fn rav1d_create_lf_mask_inter(
     if bw4 != 0 && bh4 != 0 {
         let mut level_cache_off = by * b4_stride + bx;
         for _y in 0..bh4 {
+            let idx = 4 * level_cache_off;
+            let lvl = &mut *level_cache.index_mut((idx + 0.., ..4 * bw4));
+            let lvl: &mut [[u8; 4]] = bytemuck::cast_slice_mut(lvl);
             for x in 0..bw4 {
-                let idx = 4 * (level_cache_off + x);
-                // `0.., ..2` is for Y
-                let lvl = &mut *level_cache.index_mut((idx + 0.., ..2));
-                lvl[0] = filter_level[0][r#ref][is_gmv];
-                lvl[1] = filter_level[1][r#ref][is_gmv];
+                // 0, 1 is for Y
+                lvl[x][0] = filter_level[0][r#ref][is_gmv];
+                lvl[x][1] = filter_level[1][r#ref][is_gmv];
             }
             level_cache_off += b4_stride;
         }
```
```diff
@@ -599,12 +603,13 @@ pub(crate) fn rav1d_create_lf_mask_inter(
         let mut level_cache_off = (by >> ss_ver) * b4_stride + (bx >> ss_hor);
         for _y in 0..cbh4 {
+            let idx = 4 * level_cache_off;
+            let lvl = &mut *level_cache.index_mut((idx + 0.., ..4 * cbw4));
+            let lvl: &mut [[u8; 4]] = bytemuck::cast_slice_mut(lvl);
             for x in 0..cbw4 {
-                let idx = 4 * (level_cache_off + x);
-                // `2.., ..2` is for UV
-                let lvl = &mut *level_cache.index_mut((idx + 2.., ..2));
-                lvl[0] = filter_level[2][r#ref][is_gmv];
-                lvl[1] = filter_level[3][r#ref][is_gmv];
+                // 2, 3 is for UV
+                lvl[x][2] = filter_level[2][r#ref][is_gmv];
+                lvl[x][3] = filter_level[3][r#ref][is_gmv];
             }
             level_cache_off += b4_stride;
         }
```
**Collaborator**:

We're already using `zerocopy`, so if you can do these operations with `zerocopy` instead of `bytemuck`, that would be much preferred.

**Author**:

Let me see if `zerocopy` can do the same thing. If not, it's also a "relatively safe" unsafe `mem::transmute` call.
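One caveat on the fallback mentioned above: directly `mem::transmute`-ing a `&mut [u8]` to `&mut [[u8; 4]]` would keep the original element count in the fat pointer, so the "relatively safe" cast is usually written by reassembling the slice from raw parts instead. A hypothetical sketch (the helper name is illustrative, not from the PR):

```rust
/// Reinterpret a byte slice (length divisible by 4) as 4-byte blocks.
fn as_blocks(bytes: &mut [u8]) -> &mut [[u8; 4]] {
    assert_eq!(bytes.len() % 4, 0);
    let len = bytes.len() / 4;
    // SAFETY: `[u8; 4]` has alignment 1 and no padding, and the first
    // `len * 4` bytes of `bytes` exactly back the returned slice.
    unsafe { std::slice::from_raw_parts_mut(bytes.as_mut_ptr().cast::<[u8; 4]>(), len) }
}

fn main() {
    let mut bytes = vec![1u8, 2, 3, 4, 5, 6, 7, 8];
    let blocks = as_blocks(&mut bytes);
    assert_eq!(blocks.len(), 2);
    blocks[1][0] = 9;
    assert_eq!(bytes[4], 9);
}
```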