Skip to content

Conversation

@jecsand838
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Avro schema resolution allows a reader schema to represent “nullable” values using a two-branch union (["null", T] or [T, "null"]) while still reading data written with the non-union schema T (i.e. without union discriminants in the encoded data).

In arrow-avro, resolving a non-union writer type against a reader union (notably for array/list item schemas like items: ["null", "int"]) could incorrectly treat the encoded stream as a union and attempt to decode a union discriminant. This would misalign decoding and could surface as ParseError("bad varint") for certain files (see #9231).

What changes are included in this PR?

  • Fix schema resolution when the writer schema is non-union and the reader schema is a union:

    • Special-case two-branch unions containing null to be treated as “nullable” (capturing whether null is first or second), and resolve against the non-null branch.
    • Improve matching for general reader unions by attempting to resolve against each union variant, preferring a direct match, and constructing the appropriate union resolution mapping for the selected branch.
    • Ensure promotions are represented at the union-resolution level (avoiding nested promotion resolution on the selected union child).
  • Add regression coverage for the bug and the fixed behavior:

    • test_resolve_array_writer_nonunion_items_reader_nullable_items (schema resolution / codec)
    • test_array_decoding_writer_nonunion_items_reader_nullable_items (record decoding; ensures correct byte consumption and decoded values)
    • test_bad_varint_bug_nullable_array_items (end-to-end reader regression using a small Avro fixture)
  • Add a small compressed Avro fixture under arrow-avro/test/data/bad-varint-bug.avro.gz used by the regression test.

Are these changes tested?

Yes. This PR adds targeted unit/integration tests that reproduce the prior failure mode and validate correct schema resolution and decoding for nullable-union array items.

Are there any user-facing changes?

Yes (bug fix): reading Avro files with arrays whose element type is represented as a nullable union in the reader schema (e.g. items: ["null", "int"]) now succeeds instead of failing with ParseError("bad varint"). No public API changes are intended.

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Jan 21, 2026
Copy link
Contributor

@mzabaluev mzabaluev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works! The changes look good, I have few comments on code reuse.

Comment on lines 1532 to 1535
let null_position = reader_variants
.iter()
.position(|x| x == &Schema::TypeName(TypeName::Primitive(PrimitiveType::Null)));
if let (2, Some(null_idx)) = (reader_variants.len(), null_position) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be good to reuse the nullable_union_variants helper here?

let manifest_dir = env!("CARGO_MANIFEST_DIR");
let gz_path = format!("{manifest_dir}/test/data/bad-varint-bug.avro.gz");
let gz_file = File::open(&gz_path).expect("test file should exist");
let mut decoder = GzDecoder::new(gz_file);
Copy link
Contributor

@mzabaluev mzabaluev Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually had the file unpacked in my test suite branch, it's quite small. I only gzipped it to be able to post to GitHub. But since there's already a dependency on flate2, might as well use it.

let non_null_idx = 1 - null_idx;
let non_null_branch = &reader_variants[non_null_idx];
let mut dt =
self.make_data_type(writer_non_union, Some(non_null_branch), namespace)?;
Copy link
Contributor

@mzabaluev mzabaluev Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call resolve_type directly here because that's what it always amounts to.

Co-authored-by: Mikhail Zabaluev <mikhail.zabaluev@flarion.io>
@jecsand838
Copy link
Contributor Author

@mzabaluev Thank you for the review! I just pushed up a commit that addresses your comments. Let me know what you think when you get a chance.

Copy link
Contributor

@mzabaluev mzabaluev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-avro arrow-avro crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ParseError("bad varint") on reading array field with nullable elements in the reader schema

3 participants