-
Notifications
You must be signed in to change notification settings - Fork 575
Description
Problem Description
Hi HIP maintainers,
In ROCm 7.1.0 (hipcc reports HIP 7.1.25424, Clang 20.0.0git) the combination of
-fgpu-rdc(relocatable device code)-flto(ThinLTO)- a static library containing device functions in separate translation units
still fails to link, even with --whole-archive.
What works
# Direct source linking (single hipcc invocation) → works with full LTO inlining
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a device_func1.hip device_func2.hip kernel.hip -o worksWhat fails (separate compilation + static lib)
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -c device_func1.hip -o device_func1.o
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -c device_func2.hip -o device_func2.o
llvm-ar cru libDeviceFuncs.a device_func1.o device_func2.o
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -c kernel.hip -o kernel.o
# Variant 1 – normal -l
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -L. -lDeviceFuncs kernel.o -o fail1
→ undefined hidden symbol: __hip_gpubin_handle_… (and __hip_fatbin_…)
# Variant 2 – --whole-archive
hipcc -O2 -fgpu-rdc -flto --offload-arch=gfx90a -Wl,--whole-archive libDeviceFuncs.a -Wl,--no-whole-archive kernel.o -o fail2
→ ld.lld: error: libDeviceFuncs.a(device_func1.o): not an ELF fileFull verbose logs for both failing cases are attached / shown below (they contain the offload-bundler and lld invocations).
This has been a long-standing problem (see #2203, LLVM #77018, LLVM #778018). ROCm 7.1 was expected to improve parallel LTO with -fgpu-rdc, but the static-library path is still unusable for modular device libraries with LTO.
Many large codebases (including ours) need separate compilation + static device libraries + full LTO inlining for performance. The current workaround is either (a) drop LTO completely, or (b) concatenate all device code into one file. The latter is not an option for us.
Could you please confirm whether this is intended to be fixed in an upcoming 7.1.x point release or only in 7.2? Any known workarounds beyond the ones above would also be very welcome.
Thanks!
Bill Jones
NASA Langley Research Center
Operating System
Oracle Linux Server 8.6
CPU
AMD EPYC 7313 16-Core Processor
GPU
AMD Instinct MI210 (amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-)
ROCm Version
7.1.0 and previous
ROCm Component
llvm-project
Steps to Reproduce
Minimal reproducible example (any gfx target, here gfx90a)
// device_func1.hip
__device__ int add1(int x) { return x + 1; }
// device_func2.hip
__device__ int mul2(int x) { return x * 2; }
// kernel.hip
#include <hip/hip_runtime.h>
__global__ void mykernel(int *out, int in) {
*out = mul2(add1(in));
}
int main() {
int *d, h = 0;
hipMalloc(&d, sizeof(int));
hipLaunchKernelGGL(mykernel, 1, 1, 0, 0, d, 5);
hipMemcpyDtoH(&h, d, sizeof(int));
printf("Result: %d\n", h); // expected 12
hipFree(d);
return 0;
}(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module is loaded
HSA System Attributes
Runtime Version: 1.18
Runtime Ext Version: 1.14
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: NO
VMM Support: NO
==========
HSA Agents
Agent 1
Name: AMD EPYC 7313 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7313 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3000
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131729152(0x7da0700) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: AMD EPYC 7313 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7313 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3000
BDFID: 0
Internal Node ID: 1
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 132113132(0x7dfe2ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 3
Name: gfx90a
Uuid: GPU-ed7d74702f8ea3a7
Marketing Name: AMD Instinct MI210
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29711(0x740f)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 50432
Internal Node ID: 2
Compute Unit: 104
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 82
SDMA engine uCode:: 8
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
Agent 4
Name: gfx90a
Uuid: GPU-02d164651de1b058
Marketing Name: AMD Instinct MI210
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 3
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29711(0x740f)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 34048
Internal Node ID: 3
Compute Unit: 104
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 82
SDMA engine uCode:: 8
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
log_works.txt
log_fail_normal.txt
log_fail_whole_archive.txt