-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add tests and fixes for schema resolution bug #9237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ay items and unions.
mzabaluev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works! The changes look good, I have few comments on code reuse.
arrow-avro/src/codec.rs
Outdated
| let null_position = reader_variants | ||
| .iter() | ||
| .position(|x| x == &Schema::TypeName(TypeName::Primitive(PrimitiveType::Null))); | ||
| if let (2, Some(null_idx)) = (reader_variants.len(), null_position) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be good to reuse the nullable_union_variants helper here?
| let manifest_dir = env!("CARGO_MANIFEST_DIR"); | ||
| let gz_path = format!("{manifest_dir}/test/data/bad-varint-bug.avro.gz"); | ||
| let gz_file = File::open(&gz_path).expect("test file should exist"); | ||
| let mut decoder = GzDecoder::new(gz_file); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually had the file unpacked in my test suite branch, it's quite small. I only gzipped it to be able to post to GitHub. But since there's already a dependency on flate2, might as well use it.
arrow-avro/src/codec.rs
Outdated
| let non_null_idx = 1 - null_idx; | ||
| let non_null_branch = &reader_variants[non_null_idx]; | ||
| let mut dt = | ||
| self.make_data_type(writer_non_union, Some(non_null_branch), namespace)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd call resolve_type directly here because that's what it always amounts to.
Co-authored-by: Mikhail Zabaluev <mikhail.zabaluev@flarion.io>
|
@mzabaluev Thank you for the review! I just pushed up a commit that addresses your comments. Let me know what you think when you get a chance. |
6f13808 to
b2f475d
Compare
b2f475d to
919bdf4
Compare
mzabaluev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Which issue does this PR close?
ParseError("bad varint")on reading array field with nullable elements in the reader schema #9231.Rationale for this change
Avro schema resolution allows a reader schema to represent “nullable” values using a two-branch union (
["null", T]or[T, "null"]) while still reading data written with the non-union schemaT(i.e. without union discriminants in the encoded data).In
arrow-avro, resolving a non-union writer type against a reader union (notably for array/list item schemas likeitems: ["null", "int"]) could incorrectly treat the encoded stream as a union and attempt to decode a union discriminant. This would misalign decoding and could surface asParseError("bad varint")for certain files (see #9231).What changes are included in this PR?
Fix schema resolution when the writer schema is non-union and the reader schema is a union:
nullto be treated as “nullable” (capturing whethernullis first or second), and resolve against the non-null branch.Add regression coverage for the bug and the fixed behavior:
test_resolve_array_writer_nonunion_items_reader_nullable_items(schema resolution / codec)test_array_decoding_writer_nonunion_items_reader_nullable_items(record decoding; ensures correct byte consumption and decoded values)test_bad_varint_bug_nullable_array_items(end-to-end reader regression using a small Avro fixture)Add a small compressed Avro fixture under
arrow-avro/test/data/bad-varint-bug.avro.gzused by the regression test.Are these changes tested?
Yes. This PR adds targeted unit/integration tests that reproduce the prior failure mode and validate correct schema resolution and decoding for nullable-union array items.
Are there any user-facing changes?
Yes (bug fix): reading Avro files with arrays whose element type is represented as a nullable union in the reader schema (e.g.
items: ["null", "int"]) now succeeds instead of failing withParseError("bad varint"). No public API changes are intended.