@mdboom
Contributor

@mdboom mdboom commented Jan 29, 2026

See background discussion here.

It turns out that instead of

cdef cydriver.CUtensorMapInterleave cyinterleave = interleave.value

doing this:

cdef cydriver.CUtensorMapInterleave cyinterleave = int(interleave)

reduces the function call overhead of cuTensorMapEncodeTiled by about 25%. Not all API functions have that many enum arguments (5), so I wouldn't expect such a large win across the board, but it seems well worth doing nonetheless.

This works because IntEnum inherits from int, so there is a fast path in C to grab its value. By contrast, enum.value has to do both a dictionary lookup for value and run the Python-written value property code.
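
To illustrate the point above, here is a minimal Python sketch. The `Interleave` class and its members are hypothetical stand-ins, not the real cuda-bindings enum. The timing runs in pure Python, so it only hints at the difference; the roughly 25% figure quoted above was measured on the Cython call path, not with this script.

```python
import enum
import timeit


class Interleave(enum.IntEnum):
    # Hypothetical stand-in for a cuda-bindings enum; names and values are illustrative.
    NONE = 0
    B16 = 1
    B32 = 2


member = Interleave.B32

# An IntEnum member is itself an int, so both spellings yield the same number.
via_value = member.value  # goes through the enum's `value` property
via_int = int(member)     # uses the int that the member already is

if __name__ == "__main__":
    n = 200_000
    t_value = timeit.timeit(lambda: member.value, number=n)
    t_int = timeit.timeit(lambda: int(member), number=n)
    # Results vary by Python version and machine; no particular ratio is implied.
    print(f".value: {t_value:.4f}s  int(): {t_int:.4f}s  ({n} iterations each)")
```

Because both expressions produce the same integer, swapping one for the other in the generated bindings should not change behavior for IntEnum arguments, only the cost of the conversion.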

@copy-pr-bot
Contributor

copy-pr-bot bot commented Jan 29, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@mdboom
Contributor Author

mdboom commented Jan 29, 2026

/ok to test

@kkraus14 kkraus14 added the to-be-backported label (Trigger the bot to raise a backport PR upon merge) Jan 29, 2026
@kkraus14
Collaborator

@mdboom great find. Is this something where the Cython-generated C++ code could be optimized in the currently slow case to avoid all of those Python calls?

@mdboom
Contributor Author

mdboom commented Jan 30, 2026

> @mdboom great find. Is this something where the Cython-generated C++ code could be optimized in the currently slow case to avoid all of those Python calls?

Currently, cuda-bindings' enums are based on the stdlib enum.IntEnum, which is implemented in Python, so we generate Python classes (not cdef classes) to define our enums. That means all uses of them in Cython must go through Python, and the stdlib enum is also notoriously slow. The enums could be replaced with a more efficient C extension type. For the common case of just accessing the values in the enum, this would be quite easy. However, stdlib enum has a number of other features (such as iterating over the values) that our users might rely on, which we would have to recreate (or methodically decide which to deprecate). I think we should definitely do that at some point, as it's likely to be a big win, especially for import time.
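
As a rough sketch of the stdlib enum features mentioned above that a replacement type would need to recreate or deprecate, consider the following. The `Interleave` class is a hypothetical example, not the actual cuda-bindings definition, and the list is not exhaustive.

```python
import enum


class Interleave(enum.IntEnum):
    # Hypothetical enum used only to demonstrate stdlib behavior.
    NONE = 0
    B16 = 1
    B32 = 2


# Iteration over members, in definition order.
names = [m.name for m in Interleave]

# Lookup by value and by name both return the singleton member.
by_value = Interleave(1)
by_name = Interleave["B32"]

# Members compare equal to plain ints and work in int contexts.
equals_int = Interleave.B16 == 1

# Membership tests and len() also work on the enum class itself.
count = len(Interleave)
has_member = Interleave.NONE in Interleave
```

Any C extension replacement would presumably have to decide, feature by feature, whether user code like this keeps working.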

@mdboom mdboom merged commit a7284ca into NVIDIA:main Jan 30, 2026
88 checks passed
@github-actions

Backport failed for 12.9.x, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 12.9.x
git worktree add -d .worktree/backport-1543-to-12.9.x origin/12.9.x
cd .worktree/backport-1543-to-12.9.x
git switch --create backport-1543-to-12.9.x
git cherry-pick -x a7284ca5f37abd2ee6267bbf73a83ba63317686e

@jakirkham
Collaborator

Cython does have a cpdef enum, which offers a good mix of C-level efficiency and Python Enum behavior. Maybe worth considering.

https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#structs-unions-enums

@leofang
Member

leofang commented Feb 1, 2026

@mdboom it'd be nice to regenerate and backport this PR to 12.9.x.

@jakirkham Thanks for the reminder. We avoided cpdef enum because of the namespace pollution, but that seems to have been fixed as of Cython 3.1.0. It would indeed be a good idea to revisit this approach. We've spent too much time hunting down Python Enum/IntEnum-related perf issues (last occurrence: #439 (comment)).

@mdboom
Contributor Author

mdboom commented Feb 2, 2026

> Cython does have a cpdef enum, which offers a good mix of C-level efficiency and Python Enum behavior. Maybe worth considering.
>
> https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#structs-unions-enums

Unfortunately, cpdef enum is just syntactic sugar for defining a Python stdlib enum.Enum and generating a C-level enum at the same time. We already have a C-level enum in our cy* layer, and the Python side is where all of the performance issues are, which cpdef enum doesn't address. Measurement of a prototype shows that it actually regresses import times.

More info here: #1557
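
One rough way to see why enum construction matters for import time is to time how long it takes to build a large stdlib enum class, since that work happens whenever a module defining one is imported. The sketch below is an assumption-laden illustration, not the prototype measurement mentioned above: the member count, the names, and the plain-class baseline are all hypothetical, and the baseline is not a proposed replacement.

```python
import enum
import timeit

# Hypothetical size; real bindings define many enums of varying sizes.
N_MEMBERS = 200
members = {f"VALUE_{i}": i for i in range(N_MEMBERS)}


def build_int_enum():
    # The functional API accepts a mapping of member names to values.
    return enum.IntEnum("BigEnum", members)


def build_plain_namespace():
    # Baseline: a plain class holding bare int constants, with no enum features.
    return type("BigConstants", (), dict(members))


BigEnum = build_int_enum()
BigConstants = build_plain_namespace()

if __name__ == "__main__":
    reps = 20
    t_enum = timeit.timeit(build_int_enum, number=reps)
    t_plain = timeit.timeit(build_plain_namespace, number=reps)
    # Absolute numbers depend on the Python version and machine.
    print(f"IntEnum: {t_enum / reps * 1e3:.3f} ms  plain class: {t_plain / reps * 1e3:.3f} ms per build")
```

The comparison only shows the cost of constructing the Python-level enum class, which is the part that, per the comment above, cpdef enum still pays.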

mdboom added a commit to mdboom/cuda-python that referenced this pull request Feb 2, 2026
@jakirkham
Collaborator

Thanks Michael and Leo! 🙏

Michael, have you shared these issues with the Cython team?
