Skip to content

fix: ensure read_10x_mtx returns CSR format#4017

Open
ernestprovo23 wants to merge 2 commits intoscverse:mainfrom
ernestprovo23:fix/issue-4016-read-10x-mtx-csr-format
Open

fix: ensure read_10x_mtx returns CSR format#4017
ernestprovo23 wants to merge 2 commits intoscverse:mainfrom
ernestprovo23:fix/issue-4016-read-10x-mtx-csr-format

Conversation

@ernestprovo23
Copy link
Copy Markdown

Summary

Ensure read_10x_mtx() returns an AnnData object with a CSR sparse matrix instead of CSC.

Motivation

Fixes #4016

When _read_10x_mtx() transposes the matrix read by anndata.read_mtx(), scipy automatically converts the CSR matrix to CSC format (this is how scipy sparse transpose works — CSR.T → CSC). Most downstream scanpy operations expect CSR format.

Changes

  • Add adata.X = adata.X.tocsr() after the .T transpose in _read_10x_mtx() (src/scanpy/readwrite.py)
  • Add CSR format assertion to test_read_10x (tests/test_read_10x.py)

Testing

  • All 4 test_read_10x parametrized cases pass locally (v1.2.0 + v3.0.0, with and without prefix)
  • New assertion verifies the matrix is CSR after reading

Notes for reviewers

  • First contribution to scanpy
  • Single-line fix + test assertion, minimal scope

Transposing a CSR matrix in scipy produces CSC format. Add explicit
tocsr() conversion after the .T transpose in _read_10x_mtx() and
add a format assertion to the test suite.

Fixes scverse#4016
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 57.69231% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.27%. Comparing base (2ddfcc7) to head (d3f4030).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/scanpy/readwrite.py 57.69% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4017      +/-   ##
==========================================
- Coverage   78.34%   78.27%   -0.07%     
==========================================
  Files         117      117              
  Lines       12782    12807      +25     
==========================================
+ Hits        10014    10025      +11     
- Misses       2768     2782      +14     
Flag Coverage Δ
hatch-test.low-vers 77.57% <57.69%> (-0.07%) ⬇️
hatch-test.pre 77.23% <57.69%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/scanpy/readwrite.py 76.69% <57.69%> (-1.97%) ⬇️

@flying-sheep
Copy link
Copy Markdown
Member

flying-sheep commented Mar 30, 2026

Hi, thank you!

So currently scipy.io.mmread returns a coo_matrix, which anndata then converts into a csr_matrix. Scanpy then transposes it (knowing that 10x matrices are stored in a var×obs shape instead of anndata’s obs×var). And a transposed csr_matrix is a csc_matrix in order to not waste cycles reshaping it.

Anndata’s read_mtx is trivial and therefore doesn’t need to be reused. If we want to change this, we should just

  1. copy this code: https://github.com/scverse/anndata/blob/e49e1272f122360a41b710e0cfd495b0b77662c1/src/anndata/_io/read.py#L321-L339
  2. maybe fix that TODO about dtypes,
  3. add a parameter controlling the format the coo_matrix is converted to. It should default to 'csc' (current behavior) and should allow Literal['csr', 'csc', 'coo'].
  4. convert the matrix 0–1 times, not 2 times

Per reviewer feedback, replace the read() + tocsr() approach with a new
_read_mtx() helper that converts COO→CSC directly (one conversion).
Transposing CSC then yields CSR with no extra work.

- Add _read_mtx() with sparse_format parameter (Literal['csr','csc','coo'])
- Inline cache handling in _read_10x_mtx (preserves existing cache behavior)
- Use CSRBase from _compat for type assertions per project conventions
- Pre-commit clean (ruff, formatting)

Fixes scverse#4016
@ernestprovo23
Copy link
Copy Markdown
Author

Thanks for the detailed feedback @flying-sheep — updated to follow your suggested approach:

  1. New _read_mtx() helper — inlines the mmread → sparse conversion from anndata's read_mtx, with a sparse_format parameter (Literal['csr', 'csc', 'coo'], defaults to 'csc')
  2. Dtype handling — skips .astype() if dtype already matches (the TODO you mentioned)
  3. One conversionCOO → CSC directly, then .T yields CSR with no extra work
  4. Cache preserved — inlined the cache read/write logic from _read() so cache=True still works
  5. Pre-commit clean — uses lowercase vars per ruff N806, CSRBase for type assertions

Let me know if the cache handling approach looks right or if you'd prefer it structured differently.

@flying-sheep
Copy link
Copy Markdown
Member

Hi, please don’t repeat all the caching code. As I said, read_mtx should be replaced, not the call to read. Also please limit your LLM use to a minimum, I don’t like reading comments or code created by a machine unless I asked the machine myself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

read_10x_mtx saves count data in CSC format

2 participants