Summary
Follow-up to #502. The data conversion layer now supports LargeListArray (64-bit offsets) via ProjectRecordBatch, but the Parquet reader's schema validation still rejects LARGE_LIST types. Additionally, the reader needs to expose Arrow's list_type property to allow users to request LargeListArray output.
Problem
- ValidateParquetSchemaEvolution in parquet_schema_util.cc:177-180 only accepts ::arrow::Type::LIST:
      case TypeId::kList:
        if (arrow_type->id() == ::arrow::Type::LIST) {
          return {};
        }
        break;
- Arrow's Parquet reader defaults to Type::LIST output. Without exposing ArrowReaderProperties::set_list_type(), users cannot request LargeListArray output.
Proposed Solution
1. Update schema validation to accept both list types
      case TypeId::kList:
        if (arrow_type->id() == ::arrow::Type::LIST ||
            arrow_type->id() == ::arrow::Type::LARGE_LIST) {
          return {};
        }
        break;
2. Add kListType to ReaderProperties
Expose a property to configure the Arrow list type preference.
3. Pass through to Arrow reader
In ParquetReader::Impl::Open(), call arrow_reader_properties.set_list_type() with the configured value.
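The three steps above could fit together roughly as follows. This is a self-contained sketch, not the actual iceberg-cpp or Arrow code: ArrowTypeId, IcebergTypeId, ReaderOptions, FakeArrowReaderProperties, IsCompatible and MakeArrowProperties are hypothetical stand-ins so the example compiles without either library. In the real change, the value would presumably come from the proposed kListType property and be forwarded to ArrowReaderProperties::set_list_type().

```cpp
#include <cassert>

namespace sketch {

// Hypothetical stand-ins for ::arrow::Type::type and iceberg's TypeId.
enum class ArrowTypeId { INT32, LIST, LARGE_LIST };
enum class IcebergTypeId { kInt, kList };

// Step 1: an Iceberg list field accepts either Arrow list flavor.
inline bool IsCompatible(IcebergTypeId expected, ArrowTypeId actual) {
  switch (expected) {
    case IcebergTypeId::kList:
      return actual == ArrowTypeId::LIST || actual == ArrowTypeId::LARGE_LIST;
    case IcebergTypeId::kInt:
      return actual == ArrowTypeId::INT32;
  }
  return false;
}

// Step 2: hypothetical reader-side option mirroring the proposed kListType.
struct ReaderOptions {
  ArrowTypeId list_type = ArrowTypeId::LIST;  // same default as Arrow's reader
};

// Stand-in for ArrowReaderProperties; only the list-type knob is modeled.
class FakeArrowReaderProperties {
 public:
  void set_list_type(ArrowTypeId type) { list_type_ = type; }
  ArrowTypeId list_type() const { return list_type_; }

 private:
  ArrowTypeId list_type_ = ArrowTypeId::LIST;
};

// Step 3: forward the configured preference when opening the reader.
inline FakeArrowReaderProperties MakeArrowProperties(const ReaderOptions& options) {
  FakeArrowReaderProperties props;
  props.set_list_type(options.list_type);
  return props;
}

// Usage: request 64-bit list offsets and confirm validation accepts them.
inline void Example() {
  ReaderOptions options;
  options.list_type = ArrowTypeId::LARGE_LIST;
  const FakeArrowReaderProperties props = MakeArrowProperties(options);
  assert(props.list_type() == ArrowTypeId::LARGE_LIST);
  assert(IsCompatible(IcebergTypeId::kList, props.list_type()));
}

}  // namespace sketch
```

Keeping LIST as the default in the option means existing callers see no behavior change unless they opt in.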
Why This Is Safe
- Iceberg's ListType doesn't distinguish between LIST and LARGE_LIST
- The projection layer (ProjectRecordBatch) already handles both via templated ProjectListArrayImpl<>
- Both represent the same logical "list" concept, just with different offset sizes
Files to Change
- src/iceberg/parquet/parquet_schema_util.cc - Update ValidateParquetSchemaEvolution
- src/iceberg/parquet/parquet_reader.cc - Pass list_type to ArrowReaderProperties
- src/iceberg/reader.h - Add kListType to ReaderProperties
- src/iceberg/test/parquet_test.cc - Add integration tests
Related
- Closes the remaining work from #502 (Add support for Arrow LargeListArray in Parquet data projection)