Will order be preserved when writing/reading a parquet file with ordered dictionaries? #49508
If I write a table to a Parquet file that has a dictionary column with ordered=True, and the dictionary does not exceed the size limit, will all of the categories always be read back in the same order? If the Parquet file is written in multiple row groups, will the dictionary be the same in every row group, or are values that are not present in a given row group removed from that row group's dictionary? Thank you.
Replies: 1 comment 1 reply
This kind of question makes me think about how much behavior we rely on from the underlying format versus how the data is packaged. As I read it, there are two questions here:
Will categories always be read back in the same order?
If the file has multiple row groups, will each row group's dictionary be the same?
If you need a rock-solid guarantee of category order across files and readers, try storing the category list explicitly in metadata or a companion schema artifact.
One thing I've run into is that once data leaves formats like Parquet and gets bundled into archives (zip, tar, etc.), we lose a lot of these guarantees around selective reads and ordering. You often end up decompressing everything just to access one piece. Are people just avoiding archives entirely in these workflows, or is there a pattern for preserving efficient access once data is packaged?
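For the suggestion above about keeping the category list in a companion schema artifact, here is a minimal sketch of one way it could look. It assumes a JSON sidecar file written next to the data file; the helper names (`write_category_sidecar`, `read_category_sidecar`, `recode`), the sidecar file name, and the `size` column are all hypothetical and not part of Arrow or Parquet. It uses only the Python standard library and does not touch the Parquet file itself.

```python
import json
import tempfile
from pathlib import Path


def write_category_sidecar(path, column_categories):
    """Persist the ordered category list of each column as JSON (hypothetical helper)."""
    payload = {
        column: {"ordered": True, "categories": list(categories)}
        for column, categories in column_categories.items()
    }
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def read_category_sidecar(path):
    """Return {column: [categories in canonical order]} from the sidecar."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    return {column: spec["categories"] for column, spec in payload.items()}


def recode(codes, file_categories, canonical_categories):
    """Map codes that index file_categories onto codes that index canonical_categories."""
    position = {category: i for i, category in enumerate(canonical_categories)}
    return [position[file_categories[code]] for code in codes]


# Write the canonical order once, alongside the data file (assumed layout).
with tempfile.TemporaryDirectory() as tmp:
    sidecar = Path(tmp) / "data.categories.json"
    write_category_sidecar(sidecar, {"size": ["small", "medium", "large"]})
    canonical = read_category_sidecar(sidecar)["size"]

# Suppose a reader hands back the dictionary in a different order.
file_categories = ["large", "small", "medium"]
codes = [1, 2, 0, 1]  # small, medium, large, small
canonical_codes = recode(codes, file_categories, canonical)
```

With the order stored outside the data file, a reader does not have to trust whatever dictionary order it gets back: it can re-map the codes to the canonical order, as `recode` does here.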