[python] Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch#355
Conversation
force-pushed from ee41473 to c846356
force-pushed from c846356 to 304a644
bkmartinjr
left a comment
Comments:
- SOMA*DataFrame: need unit tests for the full read/write API (e.g., there are no Pandas unit tests for SOMADataFrame.read_as_pandas, etc.).
- Remove util_arrow.concat_tables and use pyarrow.concat_tables directly. They are semantically equivalent, but the pyarrow implementation is potentially more efficient for very large queries, as it does not require pre-allocating all tables before concatenating.
Also, I'd like to re-review after main has been merged/reconciled with this branch, as there are a lot of pending changes to the NdArray code.
force-pushed from f2a63bf to 0a7bc2b
Converting to draft while I rebase -- lots of great activity to merge in from the last day or two of commits :)
force-pushed from 4fa2285 to 8193c21
rebased
@bkmartinjr 100% agreed but this PR (Arrow
force-pushed from 5ccef0e to 9d2a8f4
force-pushed from 9d2a8f4 to 5e7f336
@bkmartinjr ready for round two :)
thetorpedodog
left a comment
This is pretty straightforward, and everything looks good style-wise, so I leave it to others to judge the content of the change. LGTM but not going to hit the "approve" button so that it doesn't spuriously show up as submittable. Please bump me if you want any more feedback!
```diff
     # TODO: partition,
     # TODO: platform_config,
-) -> Iterator[pa.RecordBatch]:
+) -> Iterator[pa.Table]:
```
why not type the ids (aka coordinates) parameter to something more specific? See for example SOMADenseNdArray.
I think you can literally use the SOMADenseCoordinates type (in types.py) as this is a dense dataframe, which only takes a scalar or slice.
Also, we need to decide on whether or not ids is optional. The behavior is inconsistent between DataFrame and NdArray - in the former, None is "all", in the latter, None is not accepted (only the explicit slice(None) is accepted).
CC: @thetorpedodog - if you want to propose a standard convention, we can pull it through the entire package.
If we want to make it not an Optional value, we can easily preserve the "get everything if unspecified" behavior by doing

```python
ids: Union[Sequence[int], slice] = slice(None)
```

I like this solution because it ends up with an explicit default with already-specified semantics.
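A toy sketch of that convention in action. The `read` function and its data here are invented for illustration; only the signature pattern comes from the suggestion above:

```python
from typing import Sequence, Union

def read(ids: Union[Sequence[int], slice] = slice(None)):
    # ids is never Optional: the default slice(None) explicitly
    # means "read everything", with standard slice semantics.
    data = [10, 20, 30, 40]  # stand-in for the stored rows
    if isinstance(ids, slice):
        return data[ids]
    return [data[i] for i in ids]

print(read())        # [10, 20, 30, 40]  (unspecified -> all)
print(read([1, 3]))  # [20, 40]
```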
@bkmartinjr @thetorpedodog thanks!
Optionality of ids is orthogonal to this PR, which is solely pa.RecordBatch -> pa.Table, and has been open long enough and rebased enough times -- I created #374 to track that
```python
# Context: https://github.com/single-cell-data/TileDB-SOMA/issues/99.
#
# Also: don't materialize these on read
# TODO: get the arrow syntax for drop
```
I'm not sure I understand this last TODO comment - we aren't dropping columns, but rather transcoding them to unicode. Maybe clarify the comment?
yeah it's old & i'll remove it
not sure how it's showing up as "new", must be a copypasta
```diff
 ) -> pa.Table:
     """
-    This is a convenience method around ``read``. It iterates the return value from ``read`` and returns a concatenation of all the record batches found. Its nominal use is to simply unit-test cases.
+    This is a convenience method around ``read``. It iterates the return value from ``read`` and returns a concatenation of all the table-pieces found. Its nominal use is to simply unit-test cases.
```
is this comment still valid? The PyArrow documentation describes concat_tables, so I'm not sure we need to recapitulate it here.
Or perhaps reword to describe the intent, eg, "Concatenate all partial read results into a single Table"?
```diff
     # TODO: result_order,
     # TODO: platform_config,
-) -> pa.RecordBatch:
+) -> pa.Table:
```
ids parameter needs a type. See comment on read
```diff
-        old_field = record_batch[name]
-        if isinstance(old_field, pa.LargeBinaryArray):
+        old_field = table[name]
+        if len(old_field) > 0 and isinstance(old_field[0], pa.LargeBinaryScalar):
```
I think the preferred pyarrow approach is to use the pyarrow.types helper functions rather than isinstance. in this case, I believe that would be pyarrow.types.is_large_binary
@bkmartinjr will do, but recall this is going away on the very next PR in this stack (#359)
bkmartinjr
left a comment
Comments:
- print calls in test_soma_indexed_dataframe.py that should be removed (eg, lines 108-113)
- the two dataframe objects would benefit from having some tests that check whether their write methods raise the expected error when the `value` param is unexpected (eg, not a Table, or does not match the DataFrame schema).
bkmartinjr
left a comment
The conversion from RecordBatch -> Table looks fine. Other refinement suggestions left inline which I believe would help the code. I leave it to you if you want to fix in this PR, in another PR, or file bugs.
force-pushed from dcbcb90 to deb30ca
…355)

* Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch
* remove util_arrow.concat_tables
* read/write Table
* code-review feedback
* fix unit tests
This is the first in a group of three related PRs:
* This PR: have `read` return `pyarrow.Table`, not `pyarrow.RecordBatch` (as in an outdated version of that spec). We read/write `pyarrow.Table`, and with the first PR our unit tests will be ready to go.
* A follow-up on `main-old` which will truly have ASCII columns, obviating the need for our `util_arrow.ascii_to_unicode_pyarrow_readback` in the `read` methods, which will go in cleanly now.
* Currently reads come back as `pyarrow.LargeBinaryArray` (needing decode), but when we are properly writing ASCII cells via the Python write path, the C++ code will read ASCII cells and return them as strings (no longer needing decoding).

These three changesets could be done in a single PR, but that would be unmerciful to the reviewers.