@taalexander
Collaborator

@taalexander taalexander commented Jan 4, 2026

This PR adds macOS support for CUDA-Q, addressing platform-specific differences in linking, symbol visibility, shell compatibility, and library handling. The changes should enable building and running CUDA-Q on macOS with Apple Silicon (arm64) and Intel (x86_64) architectures with the test suite passing (outside of several minor limitations noted below).

This is a large PR and full Python support also requires Python wheels and CI enablement. I have structured this PR such that it should not impact existing Linux builds. I recommend we treat this as phase one and merge after review/passing CI. We will then follow up with Python wheel and CI PRs to complete support.

Update: This PR is now based on #3693 to prepare for its imminent merge.

PRs:

I've tried my best to summarize the contents of the PR below:

1. Build System (CMake)

Platform Detection & Configuration

  • Added macOS sysroot detection via xcrun --show-sdk-path for C++ stdlib headers
  • Configured platform-appropriate linker flags (--no-as-needed for Linux, alternatives for macOS)
  • Added CUDAQ_LIBCXX_PATH and CUDAQ_SYSROOT_PATH configuration for cross-platform header discovery
  • Updated rpath handling: @executable_path on macOS vs $ORIGIN on Linux
  • Changed CMAKE_INSTALL_RPATH to use semicolon separators on macOS

Library Linking Changes

  • Moved MLIR/LLVM dependencies from PUBLIC to PRIVATE in cudaq-mlir-runtime to reduce symbol visibility issues due to two-level namespace
  • Updated plugin extension handling: .dylib on macOS vs .so on Linux
  • Added platform-conditional linking for circular dependencies (--start-group not available on Apple ld)
  • Force two-level namespace linking for cudaq-common to prevent symbol collisions with OpenSSL

New cudaq-utils Library

Created a new low-level utilities library, which resolves a circular dependency: cudaq-operator needs complex_matrix functions but is built before libcudaq.


2. Two-Level Namespace Workarounds

macOS uses two-level namespace linking by default, where symbols are bound to specific libraries. This causes issues with LLVM/MLIR's static initializer pattern (PassRegistry, TargetRegistry, cl::Options).

Workarounds Implemented

| Workaround | Location | Purpose |
| --- | --- | --- |
| flat_namespace linker flag | CMakeLists.txt | Global symbol visibility |
| force_load for LLVM CodeGen | cmake/BuildHelpers.cmake | Ensures static initializers run in the correct library context |
| add_lib_loading_macos_workaround | cmake/BuildHelpers.cmake | Helper for Python extension targets to ensure proper library loading order |
| Symbol unexport list | lib/Support/Config/CMakeLists.txt | Hides LLVM/MLIR symbols from CUDAQTargetConfigUtil to prevent symbol collisions |
| Explicit InitializeNativeTarget | CUDAQuantumExtension.cpp | macOS-only target registration for the Python extension, to work around issues where the targets were registered to the wrong registry copy |
| Execution manager override API | execution_manager.h | Allows the manager to be set explicitly across library boundaries, instead of relying on default symbol resolution for the execution manager |
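To illustrate the idea behind the execution manager override in the last row, here is a minimal sketch. It is not the code in execution_manager.h: `ExecutionManager`, `DefaultManager`, `CustomManager`, `setExecutionManagerOverride` and `getExecutionManager` are all hypothetical names chosen for this example. The point is only that an explicitly set pointer takes precedence over whatever default each library would otherwise resolve on its own.

```cpp
#include <atomic>
#include <string_view>

// Hypothetical stand-in for an execution manager interface.
struct ExecutionManager {
  virtual ~ExecutionManager() = default;
  virtual std::string_view name() const = 0;
};

// Hypothetical default implementation used when nothing is overridden.
struct DefaultManager final : ExecutionManager {
  std::string_view name() const override { return "default"; }
};

// Hypothetical alternative a client library might install.
struct CustomManager final : ExecutionManager {
  std::string_view name() const override { return "custom"; }
};

// In a real build this slot would have to be defined in exactly one shared
// library, otherwise each library could end up with its own copy.
static std::atomic<ExecutionManager *> overrideSlot{nullptr};

// Install (or clear, with nullptr) an explicit manager.
void setExecutionManagerOverride(ExecutionManager *manager) {
  overrideSlot.store(manager);
}

// Return the override if one was set, otherwise fall back to the default.
ExecutionManager *getExecutionManager() {
  if (ExecutionManager *manager = overrideSlot.load())
    return manager;
  static DefaultManager fallback;
  return &fallback;
}
```

With this shape, a caller that already holds the right manager instance can hand it over explicitly instead of depending on which library's default symbol happens to be bound.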

Future Removal Pathway

In later versions of clang, the dylib linking issues have been fixed so that all MLIR library links are rerouted to the dylibs that are built. Once we are on such a version, we should consider moving to these single MLIR/LLVM dylibs to avoid multiple-linkage issues.


3. Platform Portability Fixes

Type Size Differences

  • unsigned long is 8 bytes on macOS/arm64 but 4 bytes on some Linux systems, so we switch usages to std::uint64_t
  • Updated CCTypes.cpp to use explicit size types where needed
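As a small illustration of why explicit size types help (this is a sketch, not the change made in CCTypes.cpp), the width of a fixed-width type can be checked at compile time, while the width of `unsigned long` is left to the platform. `bitWidth` and `BufferHeader` are hypothetical names used only here.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Number of bits occupied by a type on the current platform.
template <typename T>
constexpr std::size_t bitWidth() {
  return sizeof(T) * CHAR_BIT;
}

// Hypothetical record whose layout should not depend on the platform's
// choice for `unsigned long`.
struct BufferHeader {
  std::uint64_t numElements;
  std::uint64_t byteOffset;
};

// These hold wherever std::uint64_t exists; the same cannot be promised
// for `unsigned long`, which the language only requires to be >= 32 bits.
static_assert(bitWidth<std::uint64_t>() == 64, "uint64_t must be 64 bits");
static_assert(sizeof(BufferHeader) * CHAR_BIT == 128,
              "two 64-bit fields, no padding expected");
```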

Library Path Handling

  • TargetConfig.cpp: Handle .dylib vs .so extensions
  • fixup-linkage.cpp: Added handling for define weak linkage (macOS clang emits some functions with weak linkage)
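The extension handling can be pictured roughly as below. This is a sketch under the assumption that the choice is made at compile time with `__APPLE__`; `pluginExtension` and `pluginFileName` are hypothetical helpers, not functions from TargetConfig.cpp.

```cpp
#include <string>
#include <string_view>

// Shared-library suffix for the platform this code is compiled on.
constexpr std::string_view pluginExtension() {
#if defined(__APPLE__)
  return ".dylib";
#else
  return ".so";
#endif
}

// Build a file name such as "libfoo.dylib" (macOS) or "libfoo.so" (Linux).
std::string pluginFileName(std::string_view stem) {
  std::string name = "lib";
  name += stem;
  name += pluginExtension();
  return name;
}
```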

Shell Compatibility (POSIX)

Replaced bash commands with POSIX-compatible alternatives:

  • |& → 2>&1 | for stderr piping
  • Fixed mktemp template usage (macOS requires XXXXXX suffix)
  • Updated shebang and array handling in shell scripts

Standard Library Differences

  • std::vector<bool> has different internal layout between libc++ and libstdc++
  • Added explicit extern char **environ declaration in MQPUUtils.cpp (POSIX requires explicit declaration on macOS)
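One common way to sidestep the `std::vector<bool>` layout difference is to copy the bits into a plain byte buffer before the data crosses a boundary where the layout matters. The sketch below shows that general technique; it is an assumption for illustration, not necessarily what this PR does, and `toByteBuffer` is a hypothetical name.

```cpp
#include <cstdint>
#include <vector>

// Copy each bit into its own byte so the result has the same layout under
// both libc++ and libstdc++ (one std::uint8_t per element, contiguous).
std::vector<std::uint8_t> toByteBuffer(const std::vector<bool> &bits) {
  std::vector<std::uint8_t> bytes;
  bytes.reserve(bits.size());
  for (bool bit : bits)
    bytes.push_back(bit ? 1 : 0);
  return bytes;
}
```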

Other

  • Updated Stim CMakeLists.txt to use platform-appropriate symbol hiding syntax

4. Third-Party Patches

Added patches in tpls/customizations/ for compatibility:

| Patch | File | Purpose |
| --- | --- | --- |
| LLVM idempotent option category | llvm/idempotent_option_category.diff | Makes cl::OptionCategory registration idempotent to handle multiple LLVM copies registering the same category (avoids assertion failures) |
| Pybind11 LTO flag fix | pybind11/pybind11Common.cmake.diff | Fixes pybind/pybind11#5098 - incorrect -flto= flag generation for Clang |

Xtensor xio.hpp Workaround

  • Removed #include <xtensor/xio.hpp> in molecule.cpp and replaced with manual printing
  • Workaround for clang 17-18 template ambiguity with svector's rebind_container (LLVM #91504)

5. (Pre-existing) Bug Fixes

Note that most of these would have been caught by static code analysis. The majority of the bugs likely surfaced because of the more aggressive allocator on macOS.

| File | Fix |
| --- | --- |
| RegToMem.cpp | Added proper WalkResult return after op->erase() to prevent iterator invalidation |
| LoopUnrollPatterns.inc | Fixed iterator handling in loop unrolling pattern |
| ResetBeforeReuse.cpp | Fixed stale pointer bug caused by canonicalization in Quantinuum pass |
| CombineMeasurements.cpp | Clarified success return for erased unused measurements |
| QuakeToLLVM.cpp | Fixed variadic argument duplication in controlled rotation codegen - rotation parameters were passed twice to invokeRotationWithControlQubits, causing crashes on ARM64 Darwin (masked on x86_64 due to ABI differences in variadic float handling) |
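Several of the fixes above come down to not touching an element after it has been erased during a traversal. The following is a standard-library analogue of that pattern, not MLIR code: it uses `std::list`, and `eraseNegatives` is a hypothetical example function. The key step is continuing from the iterator returned by `erase` rather than from the now-invalid one.

```cpp
#include <cstddef>
#include <list>

// Remove all negative values while iterating, returning how many were erased.
std::size_t eraseNegatives(std::list<int> &values) {
  std::size_t erased = 0;
  for (auto it = values.begin(); it != values.end();) {
    if (*it < 0) {
      // erase() invalidates `it`; resume from the iterator it returns.
      it = values.erase(it);
      ++erased;
    } else {
      ++it;
    }
  }
  return erased;
}
```

A use-after-erase in code like this can appear to work when the freed memory is left intact and then fail once an allocator reuses or poisons it, which fits the observation that these bugs only showed up on macOS.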

6. Python Bindings

  • Added #ifdef __APPLE__ conditional for InitializeNativeTarget calls
  • Updated CMake to use add_lib_loading_macos_workaround for Python extension targets
  • Fixed Python virtual environment stdlib availability on macOS

7. Documentation & Developer Setup

  • Updated Dev_Setup.md for developer environment setup
  • Added requirements-dev.txt for Python development dependencies
  • Updated Building.md with platform-specific notes
  • Updated scripts/install_toolchain.sh and scripts/install_prerequisites.sh for macOS build instructions
  • Updated scripts/build_cudaq.sh with improved macOS build support
  • Changed OpenSSL build to use CMake on macOS to avoid pkg-config resolution issues with flat namespace

8. Test Updates

  • Updated test RUN lines to use platform-appropriate flags (calling_convention.cpp, infinite_loop.cpp, kernel exec transform tests)
  • Replaced .so with %cudaq_plugin_ext substitution for cross-platform tests
  • Added DISCOVERY_TIMEOUT 120 to backend unit tests (likely just required for my slow machine)
  • Split qvector_init_from_vector.cpp to separate large array test (qvector_init_large_array.cpp) which is skipped on macOS due to stack size
  • Fixed cudaq-qpud linking to enforce two-level namespace for braket backend tests

Known Limitations

Stack Size

macOS has a smaller default stack size (8MB) compared to Linux. Some tests with large stack allocations (e.g., large array initializations) may fail. The qvector_init_large_array.cpp test is currently skipped on macOS for this reason. Future work may address this via ulimit -s or code refactoring.
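The "code refactoring" option usually means moving large buffers off the stack. A minimal sketch of that idea, assuming the large allocation can be expressed as a `std::vector` (the function name `sumLargeBuffer` is hypothetical and unrelated to the skipped test):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Allocate the buffer on the heap via std::vector instead of as a large
// local array, so its size is not limited by the thread's stack size.
double sumLargeBuffer(std::size_t numElements) {
  std::vector<double> buffer(numElements, 1.0);
  return std::accumulate(buffer.begin(), buffer.end(), 0.0);
}
```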

flat_namespace

The flat_namespace linker flag can cause symbol collisions with system libraries. We should work toward removing this in a follow-up PR.

LLVM cl::OptionCategory Duplicate Registration (Requires LLVM Patch)

Both libcudaq and cudaq-mlir-runtime link LLVM and therefore each contain their own copy of LLVM's cl::OptionCategory static globals. When both libraries are loaded, LLVM's default behavior asserts on duplicate category names.

The idempotent_option_category.diff patch makes registration idempotent, allowing the same category to be registered multiple times without assertion failures. This is a workaround—the proper fix would be to restructure the libraries so only one contains LLVM command-line infrastructure, but that requires more significant refactoring.
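The difference between asserting on duplicates and idempotent registration can be sketched as follows. This is not LLVM's implementation or the contents of the patch; `CategoryRegistry` and `registerCategory` are hypothetical names that only model the behavior described above.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Toy registry: registering a name that is already present is a no-op
// instead of an assertion failure.
class CategoryRegistry {
public:
  // Returns true if the name was newly added, false if it already existed.
  bool registerCategory(const std::string &name) {
    return names.insert(name).second;
  }

  std::size_t size() const { return names.size(); }

private:
  std::unordered_set<std::string> names;
};
```

With this behavior, two libraries that each carry their own copy of the same category can both run their static initializers against one registry without tripping an assertion.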

C++ Exception Handling in JIT-compiled Code (macOS ARM64)

On macOS ARM64 (Apple Silicon), C++ exceptions thrown from JIT-compiled code
cannot be caught by user code. The exception will terminate the program instead
of unwinding to the catch block. This could be due to improper exception handling,
and for now we have xfailed targettests/execution/estimate_resources_sample_in_choice.cpp
which explicitly tests this capability.


Testing

All tests pass on x86 with ctest --output-on-failure, except one related to exception handling, which is detailed above and has been marked XFAIL.

schweitzpgi and others added 30 commits December 8, 2025 09:08
end is Python.

Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
The kernel builder implementation is still assuming it can just call
some function whenever there is an apply_call, which is incorrect.
As the apply_call could be calling a decorator, all the preconditions
of a decorator call *must* be met, which entails resolving any lambda
lifted arguments in the immediate context.

Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
…ls.decorator

[python redesign] Teach kernel builder how to call kernel decorators.
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Bettina Heim <heimb@outlook.com>
…omain-update

[features/python.redesign] Refactor `cudaq.vqe` in `test_chemistry`
…match

[features/python.redesign] Port some of the old logic to handle simulator precision
For some reason, the test directories were split into two separate directory
structures. This makes it confusing for maintenance and is just plain silly.
This PR merges the two redundant subtrees.

In the future, any PR that introduces new redundant subdirectories should be
met with a "changes requested".

Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
This patch fixed a codegen issue for when a closure contained a single
value when lowering to QIR as the transport layer.

Add regression test.

Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
a downstream cascade of side-effects that results in a loop analysis
failure and apply specialization being pessimistic.

Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
…tion test.

Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Bettina Heim <heimb@outlook.com>
…zation-fix

[features/python.redesign] Fixes for subtle segfaults
Signed-off-by: Bettina Heim <heimb@outlook.com>
Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
Signed-off-by: Eric Schweitz <eschweitz@nvidia.com>
C++ exceptions thrown from JIT-compiled code cannot be caught on macOS
ARM64 (Apple Silicon) due to a known upstream LLVM bug in libunwind.

This affects features like estimate_resources when used with callbacks
that invoke JIT-compiled kernels. Tests are marked XFAIL/UNSUPPORTED
until the upstream issue is resolved.

Upstream issue: llvm/llvm-project#49036

Added documentation of this limitation in Building.md.
The rotation parameter was being passed twice to invokeRotationWithControlQubits
and invokeU3RotationWithControlQubits: once as a fixed argument, then again
as part of the variadic arguments via funcArgs.append(instOperands.begin(), ...).
The symptom that ultimately gave this away was the encoded value of PI being observed as a pointer location.

This caused crashes on ARM64 Darwin where all variadic args go on the stack -
the extra parameter shifted every subsequent argument, causing va_arg to read
the parameter's raw bits as pointers.

On x86_64, this bug was masked because the ABI stores floating-point and
integer variadic args in separate areas, so the extra double didn't affect
pointer argument retrieval.

The fix is to skip the already-added parameter(s) when appending variadic operands.

A regression test has been added to ensure the parameter arguments are not
added twice to the function call.
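The fix described in this commit message (skipping the already-added parameters when appending the variadic operands) can be sketched with plain strings standing in for MLIR values. This is a simplified model, not the code in QuakeToLLVM.cpp: `buildRotationCallArgs` is a hypothetical function, and the assumed operand order (parameters first, then controls, then one target) and the control-count argument are illustration-only assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// instOperands is assumed to hold the rotation parameter(s) first, followed
// by the control qubits and finally one target qubit.
std::vector<std::string>
buildRotationCallArgs(const std::vector<std::string> &instOperands,
                      std::size_t numParams) {
  assert(numParams < instOperands.size() && "need at least a target qubit");
  const auto firstQubit =
      instOperands.begin() + static_cast<std::ptrdiff_t>(numParams);

  // Fixed arguments: the rotation parameter(s), then the control count.
  std::vector<std::string> funcArgs(instOperands.begin(), firstQubit);
  const std::size_t numControls = instOperands.size() - numParams - 1;
  funcArgs.push_back(std::to_string(numControls));

  // Variadic arguments: start after the parameters. Appending from
  // instOperands.begin() here would pass the parameters a second time.
  funcArgs.insert(funcArgs.end(), firstQubit, instOperands.end());
  return funcArgs;
}
```

For operands `{"theta", "ctrl0", "target"}` with one parameter, this yields `theta, 1, ctrl0, target`, with `theta` appearing exactly once.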
The Python redesign branch refactored pipeline APIs:
- Removed createStatePreparation()
- Renamed createLambdaLiftingPass() to createLambdaLifting()
- Renamed createPreDeviceCodeLoaderPipeline() to createPythonAOTPipeline()

Using the upstream version to match the new APIs.
- Update SimulationState.h: cudaq/utils/matrix.h -> cudaq/operators/matrix.h
- Use upstream kernel_utils.h with common/DeviceCodeRegistry.h include
DeleteStates, ReplaceStateWithKernel, and StatePreparation passes were
removed in the Python redesign branch. Remove their test files as well.
@taalexander taalexander force-pushed the osx-cuda-quantum-support branch from 755e37f to 7b99d9a on January 13, 2026 14:16
The Python redesign fixed the threading issue that caused pthread
exhaustion on macOS. Dynamics tests now pass without skip markers.

Signed-off-by: Thomas Alexander <talexander@nvidia.com>
@taalexander
Collaborator Author

I have pulled out the bugs that were identified and fixed into separate PRs #3748 #3752 #3755 #3761.

@schweitzpgi
Collaborator

schweitzpgi commented Jan 15, 2026 via email

Development

Successfully merging this pull request may close these issues.

[BUG]: FLTO command line typo results in clang error

5 participants