@bedroge
Collaborator

@bedroge bedroge commented Sep 15, 2025

Attempt to make it possible to rerun the playbook on top of an existing compat layer. This will require some changes in the installation script: it currently bind mounts an empty host dir as /cvmfs/software.eessi.io, whereas it should probably just fuse-mount the CVMFS repo inside the container.

This PR includes the commits from PR #227 as a test case.

ocaisa and others added 3 commits September 9, 2025 11:45
Fixes EESSI#226 

Since this is pretty relevant to security, I am inclined to point these variant symlinks to `/dev/null` by default, but that does not actually address the problem being discussed in EESSI#226 (having to harass the admins to link the CUDA drivers). If we can have logic in our CVMFS configuration, then maybe we can address that.
@bedroge
Collaborator Author

bedroge commented Sep 15, 2025

Tried it with a bind mount from the host, but that doesn't work, as the files are then owned by the cvmfs user. This leads to all sorts of permission errors in the Prefix environment, e.g.

OSError: [Errno 95] Operation not supported: '/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/var/lib/portage'

@bedroge bedroge force-pushed the update_compat_layer branch from d9288eb to 3d15aa0 Compare September 15, 2025 18:56
@bedroge
Collaborator Author

bedroge commented Dec 16, 2025

bot: build repo:eessi.io-2025.06-compat instance:eessi-bot-mc-aws for:arch=x86_64/generic

@eessi-bot-aws

eessi-bot-aws bot commented Dec 16, 2025

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-compat
Building on: generic
Building for: x86_64/generic
Job dir: /project/def-users/SHARED/jobs/2025.12/pr_229/113486

date job status comment
Dec 16 13:23:07 UTC 2025 submitted job id 113486 awaits release by job manager
Dec 16 13:23:36 UTC 2025 released job awaits launch by Slurm scheduler
Dec 16 13:29:19 UTC 2025 running job 113486 is running
Dec 16 16:16:54 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-113486.out
❌ some task failed
✅ found tarball
Artefacts
eessi-2025.06-compat-linux-x86_64-1765901624.tar.gz
size: 1617 MiB (1696161570 bytes)
entries: 180916
Dec 16 16:16:54 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 23/23 test case(s) from 23 check(s) (6 failure(s), 0 expected failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-113486.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case
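The two log checks listed above can be sketched in Python as follows. This is only an illustration: it assumes the displayed pattern lost its backslash escapes when the page was rendered and was meant as `\[\s*FAILED\s*\].*Ran .* test case`, and the `check_log` helper is hypothetical, not the bot's actual code.

```python
import re

# Assumption: the brackets are meant literally, so they are escaped here.
ERROR_PATTERN = re.compile(r"ERROR:")
FAILED_PATTERN = re.compile(r"\[\s*FAILED\s*\].*Ran .* test case")


def check_log(text):
    """Report which of the two failure patterns occur anywhere in the log text."""
    return {
        "error": bool(ERROR_PATTERN.search(text)),
        "failed": bool(FAILED_PATTERN.search(text)),
    }


# The ReFrame summary line reported for job 113486.
summary = (
    "[ FAILED ] Ran 23/23 test case(s) from 23 check(s) "
    "(6 failure(s), 0 expected failure(s), 0 skipped, 0 aborted)"
)
result = check_log(summary)
print(result)
```

A log that trips neither pattern, as in the later job 125993, would need a different check to explain the failure.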

@bedroge
Collaborator Author

bedroge commented Jan 29, 2026

bot: build repo:eessi.io-2025.06-compat instance:eessi-bot-mc-aws for:arch=x86_64/generic

@eessi-bot-aws

eessi-bot-aws bot commented Jan 29, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-compat
Building on: generic
Building for: x86_64/generic
Job dir: /project/def-users/SHARED/jobs/2026.01/pr_229/125993

date job status comment
Jan 29 12:18:08 UTC 2026 submitted job id 125993 awaits release by job manager
Jan 29 12:18:18 UTC 2026 released job awaits launch by Slurm scheduler
Jan 29 12:24:51 UTC 2026 running job 125993 is running
Jan 29 12:48:57 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-125993.out
❌ some task failed
✅ found tarball
Artefacts
eessi-2025.06-compat-linux-x86_64-1769690443.tar.gz
size: 1837 MiB (1926953344 bytes)
entries: 196053
Jan 29 12:48:57 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
Failed for unknown reason
Details
✅ job output file slurm-125993.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

casparvl
casparvl previously approved these changes Feb 3, 2026
Contributor

@casparvl casparvl left a comment

We agreed Bob would add some comments for clarity, but otherwise this looks fine. Deploying!

@casparvl casparvl added bot:deploy Ask bot to deploy built tarballs for compat layer and removed bot:deploy Ask bot to deploy built tarballs for compat layer labels Feb 3, 2026
@casparvl casparvl merged commit 93df65c into EESSI:main Feb 3, 2026
1 check passed
@casparvl
Contributor

casparvl commented Feb 3, 2026

Just to summarize, the plan is to do the following to follow up on this PR:

  • This PR makes sure that
/cvmfs/{{ cvmfs_repository }}/versions/{{ eessi_version }}/compat/{{ eessi_host_os }}/{{ eessi_host_arch }}/lib/override
/cvmfs/{{ cvmfs_repository }}/versions/{{ eessi_version }}/compat/{{ eessi_host_os }}/{{ eessi_host_arch }}/lib/nvidia
/cvmfs/{{ cvmfs_repository }}/versions/{{ eessi_version }}/compat/{{ eessi_host_os }}/{{ eessi_host_arch }}/lib/amd

are on the runtime linker search path. We'll add those dirs as variant symlinks, which by default point to a single, version-independent location (this location has to be inside the /cvmfs/software.eessi.io repository, since we also want to make that a variant symlink):

/cvmfs/{{ cvmfs_repository }}/versions/{{ eessi_version }}/compat/{{ eessi_host_os }}/{{ eessi_host_arch }}/lib/override -> $(EESSI_{{ eessi_version }}_LIB_OVERRIDE:-/cvmfs/{{ cvmfs_repository }}/defaults/override)
/cvmfs/{{ cvmfs_repository }}/versions/{{ eessi_version }}/compat/{{ eessi_host_os }}/{{ eessi_host_arch }}/lib/nvidia -> $(EESSI_{{ eessi_version }}_NVIDIA_OVERRIDE:-/cvmfs/{{ cvmfs_repository }}/defaults/nvidia)
/cvmfs/{{ cvmfs_repository }}/versions/{{ eessi_version }}/compat/{{ eessi_host_os }}/{{ eessi_host_arch }}/lib/amd -> $(EESSI_{{ eessi_version }}_AMD_OVERRIDE:-/cvmfs/{{ cvmfs_repository }}/defaults/amd)

These default locations are again variant symlinks that point to /dev/null for security:

/cvmfs/{{ cvmfs_repository }}/defaults/override -> $(EESSI_LIB_OVERRIDE_DEFAULT:-/dev/null)
/cvmfs/{{ cvmfs_repository }}/defaults/nvidia -> $(EESSI_NVIDIA_OVERRIDE_DEFAULT:-/dev/null)
/cvmfs/{{ cvmfs_repository }}/defaults/amd -> $(EESSI_AMD_OVERRIDE_DEFAULT:-/dev/null)

With this, sites can choose whether they want their nvidia, amd and override symlinks to be version-specific (in which case they'd set EESSI_{{ eessi_version }}_LIB_OVERRIDE, EESSI_{{ eessi_version }}_NVIDIA_OVERRIDE or EESSI_{{ eessi_version }}_AMD_OVERRIDE in their CVMFS configuration), or whether they want them to point to the same directory in host-injections for all versions (in which case they'd only set EESSI_LIB_OVERRIDE_DEFAULT, EESSI_NVIDIA_OVERRIDE_DEFAULT and EESSI_AMD_OVERRIDE_DEFAULT).

The latter (i.e. only setting the defaults) is probably a good starting point - and then sites can set the version-specific symlinks only if they have a good reason for it. Using the defaults means you only have to symlink the nvidia drivers once, and not once per EESSI version.
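To make the two-level lookup concrete, here is a minimal Python sketch of how the resolution described above would behave. It only models the lookup and does not touch CVMFS. It assumes `$(VAR:-default)` falls back to the default when the variable is unset or empty, as in shell parameter expansion. The `resolve` helper, the symlink table, and the `/opt/site/...` paths are hypothetical and only there for illustration.

```python
import re

# Matches a variant-symlink target of the form $(NAME) or $(NAME:-default).
VARIANT = re.compile(r"^\$\((?P<name>[^:)]+)(?::-(?P<default>[^)]*))?\)$")


def resolve(path, symlinks, variables, max_depth=10):
    """Follow (variant) symlinks from `path` until a non-symlink path is reached."""
    for _ in range(max_depth):
        target = symlinks.get(path)
        if target is None:
            return path  # not a symlink in our table: resolution is done
        match = VARIANT.match(target)
        if match is None:
            path = target  # plain symlink, follow it as-is
            continue
        # Assumption: like the shell's ${VAR:-default}, unset OR empty uses the default.
        value = variables.get(match.group("name")) or match.group("default")
        if value is None:
            raise KeyError(f"{match.group('name')} is unset and has no default")
        path = value
    raise RuntimeError("too many levels of symbolic links")


VERSION = "2025.06"
REPO = "/cvmfs/software.eessi.io"
LINK = f"{REPO}/versions/{VERSION}/compat/linux/x86_64/lib/nvidia"

# Model of the symlinks from the plan above, for the nvidia directory only.
SYMLINKS = {
    LINK: f"$(EESSI_{VERSION}_NVIDIA_OVERRIDE:-{REPO}/defaults/nvidia)",
    f"{REPO}/defaults/nvidia": "$(EESSI_NVIDIA_OVERRIDE_DEFAULT:-/dev/null)",
}

# Nothing configured: falls through both levels to /dev/null.
print(resolve(LINK, SYMLINKS, {}))
# Only the default configured (hypothetical site path): used for every version.
print(resolve(LINK, SYMLINKS, {"EESSI_NVIDIA_OVERRIDE_DEFAULT": "/opt/site/nvidia"}))
```

In this model a version-specific variable, when set, takes precedence over the default, which matches the intent that sites only set it when they have a good reason.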

@boegel
Contributor

boegel commented Feb 3, 2026

> Just to summarize, the plan is to do the following to follow up on this PR:
>
> ...

@casparvl Please make that an issue, don't try to keep track of things in merged PRs (I need to mention that as bad advice in a future talk somewhere...)

@casparvl
Contributor

casparvl commented Feb 3, 2026

Well, it's basically just a clarification of #227 (comment) (which was made in a PR that got replaced by this PR).

Essentially, the work is done in this PR. It's just about providing context about why these new trusted dirs were introduced - and that's context I would like to have exactly in the PR that introduced them ;-) Honestly, I considered putting it in the opening post as motivation why we even made this PR.

The fact that this requires some follow-up to actually exploit this new feature is a detail - and no need to log it, I'm working on that already ;-)

@casparvl
Contributor

casparvl commented Feb 3, 2026

The more important place to put this is the documentation. That's also on my TODO list (asap), but it's only worth doing once the change to link_nvidia_host_injections.sh is done.
