[python] Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch#355
Conversation
force-pushed from ee41473 to c846356
force-pushed from c846356 to 304a644
bkmartinjr
left a comment
Comments:
- SOMA*DataFrame: need unit tests for the full read/write API (e.g., there are no Pandas unit tests for SOMADataFrame.read_as_pandas, etc.).
- Remove util_arrow.concat_tables and use pyarrow.concat_tables directly. They are semantically equivalent, but the pyarrow implementation is potentially more efficient for very large queries, as it does not require pre-allocating all tables before concatenating.
Also, I'd like to re-review after main has been merged/reconciled with this branch, as there are a lot of pending changes to the NdArray code.
force-pushed from f2a63bf to 0a7bc2b
Converting to draft while I rebase -- lots of great activity to merge in from the last day or two of commits :)
force-pushed from 4fa2285 to 8193c21
rebased
@bkmartinjr 100% agreed but this PR (Arrow
force-pushed from 5ccef0e to 9d2a8f4
force-pushed from 9d2a8f4 to 5e7f336
@bkmartinjr ready for round two :)
thetorpedodog
left a comment
This is pretty straightforward, and everything looks good style-wise, so I leave it to others to judge the content of the change. LGTM but not going to hit the "approve" button so that it doesn't spuriously show up as submittable. Please bump me if you want any more feedback!
```diff
     # TODO: partition,
     # TODO: platform_config,
-) -> Iterator[pa.RecordBatch]:
+) -> Iterator[pa.Table]:
```
why not type the ids (aka coordinates) parameter to something more specific? See for example SOMADenseNdArray.
I think you can literally use the SOMADenseCoordinates type (in types.py) as this is a dense dataframe, which only takes a scalar or slice.
Also, we need to decide on whether or not ids is optional. The behavior is inconsistent between DataFrame and NdArray - in the former, None is "all", in the latter, None is not accepted (only the explicit slice(None) is accepted).
CC: @thetorpedodog - if you want to propose a standard convention, we can pull it through the entire package.
If we want to make it not an Optional value, we can easily preserve the "get everything if unspecified" behavior by doing

```python
ids: Union[Sequence[int], slice] = slice(None)
```

I like this solution because it ends up with an explicit default with already-specified semantics.
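A toy sketch of that convention in action. The `read` function and its data here are invented for illustration; only the signature pattern comes from the suggestion above:

```python
from typing import Sequence, Union

def read(ids: Union[Sequence[int], slice] = slice(None)):
    # ids is never Optional: the default slice(None) explicitly
    # means "read everything", with standard slice semantics.
    data = [10, 20, 30, 40]  # stand-in for the stored rows
    if isinstance(ids, slice):
        return data[ids]
    return [data[i] for i in ids]

print(read())        # [10, 20, 30, 40]  (unspecified -> all)
print(read([1, 3]))  # [20, 40]
```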
@bkmartinjr @thetorpedodog thanks!
Optionality of ids is orthogonal to this PR, which is solely pa.RecordBatch -> pa.Table, and has been open long enough and rebased enough times -- I created #374 to track that
```python
# Context: https://github.com/single-cell-data/TileDB-SOMA/issues/99.
#
# Also: don't materialize these on read
# TODO: get the arrow syntax for drop
```
I'm not sure I understand this last TODO comment - we aren't dropping columns, but rather transcoding them to unicode. Maybe clarify the comment?
yeah it's old & i'll remove it
not sure how it's showing up as "new", must be a copypasta
```diff
 ) -> pa.Table:
     """
-    This is a convenience method around ``read``. It iterates the return value from ``read`` and returns a concatenation of all the record batches found. Its nominal use is to simply unit-test cases.
+    This is a convenience method around ``read``. It iterates the return value from ``read`` and returns a concatenation of all the table-pieces found. Its nominal use is to simply unit-test cases.
```
is this comment still valid? The PyArrow documentation describes concat_tables, so I'm not sure we need to recapitulate it here.
Or perhaps reword to describe the intent, eg, "Concatenate all partial read results into a single Table"?
```diff
     # TODO: result_order,
     # TODO: platform_config,
-) -> pa.RecordBatch:
+) -> pa.Table:
```
ids parameter needs a type. See comment on read
```diff
-        old_field = record_batch[name]
-        if isinstance(old_field, pa.LargeBinaryArray):
+        old_field = table[name]
+        if len(old_field) > 0 and isinstance(old_field[0], pa.LargeBinaryScalar):
```
I think the preferred pyarrow approach is to use the pyarrow.types helper functions rather than isinstance. in this case, I believe that would be pyarrow.types.is_large_binary
@bkmartinjr will do, but recall this is going away on the very next PR in this stack (#359)
bkmartinjr
left a comment
Comments:
- print calls in test_soma_indexed_dataframe.py that should be removed (eg, lines 108-113)
- the two dataframe objects would benefit from having some tests that check whether their write methods raise the expected error when the `value` param is unexpected (eg, not a Table, or does not match the DataFrame schema).
bkmartinjr
left a comment
The conversion from RecordBatch -> Table looks fine. Other refinement suggestions left inline which I believe would help the code. I leave it to you if you want to fix in this PR, in another PR, or file bugs.
force-pushed from dcbcb90 to deb30ca
…355)

* Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch
* remove util_arrow.concat_tables
* read/write Table
* code-review feedback
* fix unit tests
This is the first in a group of three related PRs:
* This PR: have `read` return `pyarrow.Table`, not `pyarrow.RecordBatch` (as in an outdated version of that spec). We read/write `pyarrow.Table`, and with the first PR our unit tests will be ready to go.
* A follow-up on `main-old` which will truly have ASCII columns, obviating the need for our `util_arrow.ascii_to_unicode_pyarrow_readback` in the `read` methods, which will go in cleanly now.
* Currently reads come back as `pyarrow.LargeBinaryArray` (needing decode), but when we are properly writing ASCII cells via the Python write path, the C++ code will read ASCII cells and return them as strings (no longer needing decoding).

These three changesets could be done in a single PR, but that would be unmerciful to the reviewers.