Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

rootfs: make pivot_root(2) dance handle initramfs case#4434

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Draft
cyphar wants to merge6 commits intoopencontainers:main
base:main
Choose a base branch
Loading
fromcyphar:bind-pivot-root

Conversation

@cyphar
Copy link
Member

@cypharcyphar commentedOct 10, 2024
edited
Loading

Whilepivot_root(2) normally refuses to pivot a mount if you are running
with/ as initramfs (because initramfs doesn't have a parent mount), you
can create a bind-mount of/ and make that your new root to work around
this problem. This does usechroot(2), but this is only done temporarily
to setcurrent->fs->root to the new mount. Oncepivot_root(2) finishes,
thechroot(2) and/ are gone.

Variants of this hack are fairly well known and is used all over the
place (see1,2) but until now we have forced users to have a far less
secure configuration with--no-pivot. This is a slightly modified
version that uses the container rootfs as the temporary spot for the/
clone -- this allows runc to continue working with read-only image-based
OS images.

Signed-off-by: Aleksa Saraicyphar@cyphar.com

@cypharcyphar modified the milestones:1.2.1,1.3.0Oct 21, 2024
@cypharcyphar added the backport/1.2-todoA PR in main branch which needs to be backported to release-1.2 labelOct 23, 2024
@cypharcyphar modified the milestones:1.3.0,1.2.1Oct 25, 2024
@cyphar
Copy link
MemberAuthor

Okay, I managed to test this and this version definitely works. The setup is not too complicated, but I'm not sure if we could practically test this within our CI (what would be a nice way of verifying the container ran inside the VM?).

Script used to create the initramfs (openSUSE)
#!/bin/bashset -Eeuo pipefail#sudo zypper in -y busybox syslinux skopeo umocisudo rm -rf boot-img/[-e runc ]|| curl -sSL"https://github.com/opencontainers/runc/releases/download/v1.2.0/runc.amd64" -o runc[-d opensuse ]|| skopeo copy docker://opensuse/tumbleweed:latest oci:opensuse:tumbleweed[-d bundle ]|| sudo umoci unpack --image opensuse:tumbleweed ./bundlemkdir -vp boot-img/pushd boot-imgmkdir -p ./usr/binln -sv usr/bin ./bin# Copy runc.ln ../runc ./usr/bin/runc# Copy the rootfs bundle.mkdir -p runsudo cp -aR ../bundle ./run/bundle# openSUSE makes /usr/bin/busybox non-static, and you can't ask busybox.install# to install the static version. So install busybox using symlinks and then# replace busybox with busybox-static.busybox.install. --symlinkscp -v /usr/bin/busybox-static ./usr/bin/busybox# Boot into a shell.cat>./init<<EOF#!/bin/shecho "HELLO WORLD"mkdir -p /procmount -t proc proc /procmkdir -p /sysmount -t sysfs sysfs /sysmkdir -p /sys/fs/cgroupmount -t cgroup2 cgroup2 /sys/fs/cgroupmkdir -p /tmpmount -t tmpfs tmpfs /tmpmkdir -p /devmount -t devtmpfs devtmpfs /devmkdir -p /dev/ptsmount -t devpts -o newinstance devpts /dev/ptsmkdir -p /dev/shmmount --bind /tmp /dev/shm/bin/shEOFchmod +x ./init# Build our init.cpio.sudo find.| sudo cpio -o -H newc> ../init.cpiopopd

And you can then just doqemu-system-x86_64 -kernel /boot/vmlinuz -initrd ./init.cpio -m 2G -nographic -append console=ttyS0 to run a VM with this setup. You can verify this new version works by just doingrunc run -b /run/bundle ctr.

@cypharcyphar marked this pull request as ready for reviewOctober 25, 2024 07:08
@cypharcypharforce-pushed thebind-pivot-root branch 2 times, most recently from47dffa5 toe27cce8CompareOctober 25, 2024 07:24
@cyphar
Copy link
MemberAuthor

@kolyshkin@AkihiroSuda Do you want me to try to come up with a CI test for this case? I'm not really sure if there is a nice way of testing the output of qemu (doesn't GitHub block nested VMs as well?). Then again, maybe we could make the init script run runc and just parse the-nographic -append console=ttyS0 output?

@kolyshkin
Copy link
Contributor

kolyshkin commentedOct 28, 2024
edited
Loading

I'm afraid it's going to be tough. In case qemu is not working in GHA (last time I checked it was only working on Mac OS X instances, but it was ~2y ago), try cirrus-ci, it uses GCP and with this instance it's possible to use nested virt (which we do to run vagrant-libvirt):

runc/.cirrus.yml

Lines 22 to 29 in4ad9f7f

compute_engine_instance:
image_project:cirrus-images
image:family/docker-kvm
platform:linux
nested_virtualization:true
# CPU limit: `16 / NTASK`: see https://cirrus-ci.org/faq/#are-there-any-limits
cpu:4
# Memory limit: `4GB * NCPU`

@AkihiroSuda
Copy link
Member

AkihiroSuda commentedOct 28, 2024
edited
Loading

The default Linux instances of GHA now supports nested virt.
(Used in the CI of containerd, Lima, etc.)

@cypharcypharforce-pushed thebind-pivot-root branch 3 times, most recently from5a291f2 to0dfe91bCompareOctober 28, 2024 07:22
@cypharcypharforce-pushed thebind-pivot-root branch 3 times, most recently from3c366c2 to5cd283aCompareOctober 28, 2024 07:33
iferr!=nil {
// Always try to do pivot_root(2) because it's safe, and only fallback
// to the unsafe MS_MOVE+chroot(2) dance if pivot_root(2) fails.
logrus.Warnf("your container failed to start with pivot_root(2) (%v) -- please open a bug report to let us know about your usecase",err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

End-users may not understand who are "us", as they don't execute runc directly.

We may linkhttps://github.com/opencontainers/runc/issues for explicitness, but we may potentially get a report about some third-party product that we have never heard of.

Copy link
MemberAuthor

@cypharcypharOct 28, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We already provide a link to the tracking issue for this deprecation if you pass--no-pivot (regardless of whetherpivot_root(2) works), I don't think we need to provide the same link twice?

@cypharcypharforce-pushed thebind-pivot-root branch 5 times, most recently frome3acf8d tobeaf64aCompareOctober 28, 2024 08:34
kolyshkin
kolyshkin previously requested changesOct 31, 2024
run.go Outdated
cli.BoolFlag{
Name:"no-pivot",
Usage:"do not use pivot root to jail process inside rootfs. This should be used whenever the rootfsison top of a ramdisk",
Usage:"do not use pivot root to jail process inside rootfs to work around limitations when running in an initramfs (this optionisdeprecated, insecure, and unnecessary now that pivot_root works with initramfs -- see <https://github.com/opencontainers/runc/issues/4435>)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Need the same change forcreate.go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The (slightly more bold) alternative is to make the option hidden (in this case the description can be removed).

Copy link
MemberAuthor

@cypharcypharOct 31, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I was thinking of hiding it for1.3.0 but keeping it visible for1.2.z. I've added a new patch deprecating it, and I'll exclude that from the backport.

Despite the hardenings we've added to the MS_MOVE+chroot dance over theyears like commit28a697c ("rootfs: umount all procfs and sysfswith --no-pivot"), --no-pivot is fundamentally insecure and the primaryreason why people use it (to run containers from initramfs) can now bedone safely with pivot_root(2).So we should always try to pivot_root(2) and give a warning to the userthat their configuration is insecure if we have to use the --no-pivotfallback (users should not see this message in practice, because theprimary users that couldn't use pivot_root(2) now can and willtransparently use it if possible).Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This includes bats_pipe and some other nice features we can use in ourtests.Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The runc wrapper we have had for a long time is useful for debuggingerrors (because bats's built-in "run" doesn't provide the output of theprogram in case of an error). However, it might be necessary to runother programs in a similar wrapper.In addition, it might be needed to add timeouts or use the bats_pipefeature (bats v1.10.0) with similar wrapping. Adding support for this inthe sane_run wrapper makes it a little easier to use these things whenwriting tests.This is somewhat adapted from umoci's sane_run helper.Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The only way to run in the proper initramfs is to start a VM using acustom initrd that runs runc. This should be a fairly reasonablesmoke-test that matches what minikube and kata do.Unfortunately, running the right qemu for the native architecture onvarious distros is a little different, so we need a helper function toget it to work on both Debian and AlmaLinux.Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Existing users will still be able to use it, but new users won't betempted into using this flag (and their containers should be able to runwithout issue without this flag anyway).Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
@cypharcyphar dismissedkolyshkin’sstale reviewOctober 31, 2024 03:29

added new text to runc create

return&os.PathError{Op:"pivot_root",Path:".",Err:err}
pivotErr:=unix.PivotRoot(".",".")
iferrors.Is(pivotErr,unix.EINVAL) {
// If pivot_root(2) failed with -EINVAL, one of the possible reasons is
Copy link
Member

@lifubanglifubangOct 31, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

There are six reasons will cause-EINVAL:

       EINVAL new_root is not a mount point.       EINVAL put_old is not at or underneath new_root.       EINVAL The current root directory is not a mount point (because              of an earlier [chroot(2)](https://man7.org/linux/man-pages/man2/chroot.2.html)).       EINVAL The current root is on the rootfs (initial ramfs) mount;              see NOTES.       EINVAL Either the mount point at new_root, or the parent mount of              that mount point, has propagation type MS_SHARED.       EINVAL put_old is a mount point and has the propagation type              MS_SHARED.

This PR tries to deal with the forth one, does this pr would cause some unexpected behaviors for the fifth one? (Maybe need to check once have a time)
The user case:
For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed asMS_PRIVATE by runc, but the parent mount of rootfs may have propagation type MS_SHARED.
Before this patch, it will return an error immediately, but after this patch, it will do some magic operation to letpivot_root(2) work, I think we should be careful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The user case:
For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed asMS_PRIVATE by runc, but the parent mount of rootfs may have propagation type MS_SHARED.

For this case, this patch has the same security issue like--no-pivot, even this flag is not provided. I can escape from the container.
So, I think we should split this PR into two:

  1. Deprecate the flagno-pivot, this should be inmilestone 1.2.1;
  2. Find a new way to handle initramfs case, this one can be moved to the next version if we can't fix it in a short time.

Copy link
MemberAuthor

@cypharcypharOct 31, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

For this case, this patch has the same security issue like --no-pivot, even this flag is not provided. I can escape from the container.

How? Afterpivot_root(2) the whole host mount tree is gone, no?chroot_fs_refs should move thecurrent->fs->root to thepivot_root(2)'d path so the intermediatechroot(2) should not have any effect on the final mount setup AFAICS. Can you share a reproducer?

For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed as MS_PRIVATE by runc, but the parent mount of rootfs may have propagation type MS_SHARED.

We also set the parent toMS_PRVIATE inrootfsParentMountPrivate. (It already should not be the case that we get-EINVAL for that reason because we explicitly make sure that it works withrootfsParentMountPrivate.) Is that not sufficient?

The bind-mount of/ keeps the propagation settings the same as well. So if the propagation was actually wrong, the secondpivot_root(2) would still fail. I also don't see how you could use that to break out of the container -- AFAIK the reason the kernel requires the mount to be!shared is so that the mount tree changes duringpivot_root(2) don't propagate to the host (and break it).

Find a new way to handle initramfs case, this one can be moved to the next version if we can't fix it in a short time.

The "magic" done is just creating a bind-mount of the root that can be used forpivot_root(2) (in anMS_PRIVATE mount so the mount itself will beMS_PRIVATE, to answer your question).

This is the only way of doing it (I spoke to the VFS maintainers a few weeks ago, and this is the only "right" way of doing it -- FWIW Lennart implied that systemd does thisbut they don't it seems).

It's a bit hard to understand what the problem is without an actual reproducer.

Deprecate the flag no-pivot, this should be in milestone 1.2.1;

We can't deprecate it without having a solution for minikube and kata...

Copy link
Member

@lifubanglifubangNov 1, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We can't deprecate it without having a solution for minikube and kata...

Mark it deprecate not means we remove it right now, it's only to remind users that they should not use it except there is no way to run their case, and to inform users that we can't guarantee the container's security with--no-privot.
But yes, it's a little strange do it in the situation that we still have no solution for minikube and kata or other projects. I think we can find a solution to fix it. I suggest to split into two because maybe we need more time to find out this solution, but I think we should releasev1.2.1 ASAP, then we can let runcv1.2.* could be tested by more and more users, especially in some downstream rc versions.

We also set the parent toMS_PRVIATE inrootfsParentMountPrivate. (It already should not be the case that we get-EINVAL for that reason because we explicitly make sure that it works withrootfsParentMountPrivate.) Is that not sufficient?

No. I have concern about it for a long time after#4417 merged, the commit is:13a6f56. This commit has changed the behavior ofrootfsParentMountPrivate:
1. Before this commit, we make/ toMS_PRIVATE mount;
2. After this commit, if therootfs is a overlay mount, we will only makerootfs toMS_PRIVATE mount.
@kolyshkin I don't know this behavior change will cause some issues, but we will not set parent toMS_PRIVATE is a fact, but maybe makerootfsMS_PRIVATE is enough.

EDIT: Sorry, I'm wrong, the parent mount point has always been changed toMS_PRIVATE byrootfsParentMountPrivate; So this bug is not related to this, I have sent the email to@cyphar, if you don't receive it, please let me know.

It's a bit hard to understand what the problem is without an actual reproducer.

I'll send you a email ASAP.

@cypharcyphar marked this pull request as draftNovember 1, 2024 07:51
@rata
Copy link
Member

rata commentedNov 1, 2024

I'm assuming we won't include this in 1.2.1 as this is still a draft. We can include it later, though.

@AkihiroSudaAkihiroSuda modified the milestones:1.2.1,1.3.0Nov 1, 2024
@kolyshkin
Copy link
Contributor

I'm assuming we won't include this in 1.2.1 as this is still a draft. We can include it later, though.

This looks more like a "new feature" than a "bug fix" for me, and so it is probably 1.3 material.

@cyphar
Copy link
MemberAuthor

Yeah, 1.2.1 was the wrong thing to label it. (Especially since there are some subtleties I need to look into...)

@kolyshkinkolyshkin removed the backport/1.2-todoA PR in main branch which needs to be backported to release-1.2 labelNov 13, 2024
@kolyshkin
Copy link
Contributor

This looks more like a "new feature" than a "bug fix" for me, and so it is probably 1.3 material.

Removed thebackport/1.2-todo label.

@cyphar is this still a draft?

@cyphar
Copy link
MemberAuthor

@kolyshkin Yes, I still need to figure out how to fix the issue that@lifubang found. With the right setup you can end up with the wrong root being mounted in the container, and it's not obvious looking at the source ofpivot_root why that's happening...

@kolyshkinkolyshkin modified the milestones:1.3.0,1.3.0-rc.1Feb 25, 2025
@cyphar
Copy link
MemberAuthor

This won't be ready for 1.3.0. I didn't manage to debug the issue that@lifubang found yet.

@cypharcyphar modified the milestones:1.3.0-rc.1,1.4.0-rc.1Feb 26, 2025
@cypharcyphar modified the milestones:1.4.0-rc.1,1.5.0-rc.1Aug 27, 2025
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@lifubanglifubanglifubang left review comments

@AkihiroSudaAkihiroSudaAwaiting requested review from AkihiroSuda

@ratarataAwaiting requested review from rata

@kolyshkinkolyshkinAwaiting requested review from kolyshkin

At least 2 approving reviews are required to merge this pull request.

Assignees

No one assigned

Projects

None yet

Milestone

1.5.0-rc.1

Development

Successfully merging this pull request may close these issues.

5 participants

@cyphar@kolyshkin@AkihiroSuda@rata@lifubang

[8]ページ先頭

©2009-2025 Movatter.jp