[python] Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch#355

Merged: johnkerl merged 5 commits into main from kerl/pyarrow-table on Oct 5, 2022

Conversation

@johnkerl (Contributor) commented Oct 2, 2022

This is the first in a group of three related PRs.

These three changesets could be done in a single PR, but that would be unmerciful to the reviewers.

@johnkerl force-pushed the kerl/pyarrow-table branch from ee41473 to c846356 on October 2, 2022 21:27
@johnkerl marked this pull request as ready for review on October 2, 2022 21:28
@johnkerl force-pushed the kerl/pyarrow-table branch from c846356 to 304a644 on October 2, 2022 21:29
@johnkerl changed the title from "Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch" to "[WIP] Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch" on Oct 3, 2022
@bkmartinjr (Member) left a comment

Comments:

  • SOMA*DataFrame - need unit tests for the full read/write API (e.g., there are no Pandas unit tests for SOMADataFrame.read_as_pandas, etc.).
  • Remove util_arrow.concat_tables and use pyarrow.concat_tables directly. They are semantically equivalent, but the pyarrow implementation is potentially more efficient for very large queries, as it does not require pre-allocating all tables before concatenating.

Also, I'd like to re-review after main has been merged/reconciled with this branch, as there are a lot of pending changes to the NdArray code.

@johnkerl force-pushed the kerl/pyarrow-table branch from f2a63bf to 0a7bc2b on October 3, 2022 19:35
@johnkerl marked this pull request as draft on October 3, 2022 19:36
@johnkerl (Contributor, Author) commented Oct 3, 2022

Converting to draft while I rebase -- lots of great activity to merge in from the last day or two of commits :)

@johnkerl force-pushed the kerl/pyarrow-table branch 6 times, most recently from 4fa2285 to 8193c21 on October 4, 2022 03:10
@johnkerl marked this pull request as ready for review on October 4, 2022 04:00
@johnkerl requested a review from bkmartinjr on October 4, 2022 04:00
@johnkerl (Contributor, Author) commented Oct 4, 2022

rebased

@johnkerl (Contributor, Author) commented Oct 4, 2022

> SOMA*DataFrame - need unit tests for the full read/write API (eg, there are no Pandas unit tests for SOMADataFrame.read_as_pandas, etc).

@bkmartinjr 100% agreed but this PR (Arrow RecordBatch -> Table) isn't the right place for it -- I created #372 to track this.

@johnkerl force-pushed the kerl/pyarrow-table branch 3 times, most recently from 5ccef0e to 9d2a8f4 on October 4, 2022 15:19
@johnkerl force-pushed the kerl/pyarrow-table branch from 9d2a8f4 to 5e7f336 on October 4, 2022 15:51
@johnkerl (Contributor, Author) commented Oct 4, 2022

@bkmartinjr ready for round two :)

@thetorpedodog (Contributor) left a comment

This is pretty straightforward, and everything looks good style-wise, so I leave it to others to judge the content of the change. LGTM, but I'm not going to hit the "approve" button so that it doesn't spuriously show up as submittable. Please bump me if you want any more feedback!

  # TODO: partition,
  # TODO: platform_config,
- ) -> Iterator[pa.RecordBatch]:
+ ) -> Iterator[pa.Table]:
Member:

Why not type the ids (aka coordinates) parameter as something more specific? See, for example, SOMADenseNdArray.

I think you can literally use the SOMADenseCoordinates type (in types.py) as this is a dense dataframe, which only takes a scalar or slice.

Member:

Also, we need to decide whether ids is optional. The behavior is inconsistent between DataFrame and NdArray: in the former, None means "all"; in the latter, None is not accepted (only the explicit slice(None) is).

CC: @thetorpedodog - if you want to propose a standard convention, we can pull it through the entire package.

Contributor:

If we want to make it not an Optional value, we can easily preserve the "get everything if unspecified" behavior by doing

ids: Union[Sequence[int], slice] = slice(None)

I like this solution because it ends up with an explicit default with already-specified semantics.
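
A minimal sketch of that convention (the `read` function and its column data are hypothetical stand-ins, not the real TileDB-SOMA API): a non-Optional ids parameter whose default, `slice(None)`, already means "everything" under standard Python slicing semantics.

```python
from typing import List, Sequence, Union

Ids = Union[Sequence[int], slice]

def read(data: Sequence[str], ids: Ids = slice(None)) -> List[str]:
    """Return the rows selected by ids; the default selects everything."""
    if isinstance(ids, slice):
        # slice(None) is equivalent to [:], i.e., all rows.
        return list(data[ids])
    return [data[i] for i in ids]

rows = ["r0", "r1", "r2", "r3"]
assert read(rows) == rows                        # default: read everything
assert read(rows, slice(1, 3)) == ["r1", "r2"]   # explicit slice
assert read(rows, [0, 3]) == ["r0", "r3"]        # explicit id list
```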

@johnkerl (Contributor, Author):

@bkmartinjr @thetorpedodog thanks!

Optionality of ids is orthogonal to this PR, which is solely pa.RecordBatch -> pa.Table and has been open long enough (and been rebased enough times). I created #374 to track that.

# Context: https://github.com/single-cell-data/TileDB-SOMA/issues/99.
#
# Also: don't materialize these on read
# TODO: get the arrow syntax for drop
Member:

I'm not sure I understand this last TODO comment - we aren't dropping columns, but rather transcoding them to unicode. Maybe clarify the comment?

@johnkerl (Contributor, Author):

Yeah, it's old and I'll remove it.

Not sure how it's showing up as "new"; must be a copy-paste artifact.

  ) -> pa.Table:
      """
-     This is a convenience method around ``read``. It iterates the return value from ``read`` and returns a concatenation of all the record batches found. Its nominal use is to simply unit-test cases.
+     This is a convenience method around ``read``. It iterates the return value from ``read`` and returns a concatenation of all the table-pieces found. Its nominal use is to simply unit-test cases.
Member:

Is this comment still valid? The PyArrow documentation describes concat_tables, so I'm not sure we need to recapitulate it here.

Or perhaps reword to describe the intent, e.g., "Concatenate all partial read results into a single Table"?

  # TODO: result_order,
  # TODO: platform_config,
- ) -> pa.RecordBatch:
+ ) -> pa.Table:
Member:

The ids parameter needs a type. See the comment on read.

@johnkerl (Contributor, Author):

#374 also

- old_field = record_batch[name]
- if isinstance(old_field, pa.LargeBinaryArray):
+ old_field = table[name]
+ if len(old_field) > 0 and isinstance(old_field[0], pa.LargeBinaryScalar):
Member:

I think the preferred pyarrow approach is to use the pyarrow.types helper functions rather than isinstance. In this case, I believe that would be pyarrow.types.is_large_binary.

@johnkerl (Contributor, Author):

@bkmartinjr will do, but recall this is going away in the very next PR in this stack (#359)

@bkmartinjr (Member) left a comment

Comments:

  • There are print statements in test_soma_indexed_dataframe.py that should be removed (e.g., lines 108-113).
  • The two dataframe objects would benefit from having some tests that check whether their write methods raise the expected error when the value param is unexpected (e.g., not a Table, or not matching the DataFrame schema).

@bkmartinjr (Member) left a comment

The conversion from RecordBatch -> Table looks fine. Other refinement suggestions, which I believe would help the code, are left inline. I leave it to you whether to fix them in this PR, in another PR, or to file bugs.

@johnkerl force-pushed the kerl/pyarrow-table branch from dcbcb90 to deb30ca on October 5, 2022 03:40
@johnkerl merged commit d2e8c31 into main on Oct 5, 2022
@johnkerl deleted the kerl/pyarrow-table branch on October 5, 2022 04:43
@aaronwolen (Member) left a comment

🚀

aaronwolen pushed a commit that referenced this pull request Oct 6, 2022
…355)

* Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch

* remove util_arrow.concat_tables

* read/write Table

* code-review feedback

* fix unit tests
@johnkerl changed the title from "Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch" to "[python] Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch" on Oct 26, 2022