# n7on/docker-internals
Docker is a way to isolate a process from the rest of the system using kernel features such as namespaces, cgroups, capabilities and pivot_root. When these features are used in conjunction to create an isolated environment, it's called a container.
When a container is spawned using Docker Engine, the docker client connects to the docker daemon, which pulls a Docker Image and connects the container to a network using Docker Networking. The image is added to the Docker Filesystem. The image configuration in the image is used by the Docker Runtime to create the process and the filesystem, and to isolate them from the host.
## Namespaces

Namespaces are used to isolate processes, so that users, hostnames, network interfaces, PIDs etc. are only visible from within their own namespaces. This is the main concept of containers. There are 8 different namespace types:
- `net`. Network interfaces namespace.
- `mnt`. Mount namespace.
- `uts`. Hostname namespace.
- `pid`. Process namespace.
- `user`. User namespace (doesn't require a privileged account).
- `time`. System time namespace.
- `ipc`. Inter-Process Communication namespace, e.g. shared memory, message queues and semaphores.
- `cgroup`. Cgroup namespace.
Each namespace is held by at least one process, and a process can only belong to one namespace of each type at a given time. By default, all processes belong to one namespace of each type in the default namespaces, which are held by the system's PID 1. So in that sense, the host is also a container. This can be visualized using `lsns`:
```shell
# list the init process namespaces
sudo lsns -p 1
> 4026531834 time   70 1 root /init
> 4026531835 cgroup 70 1 root /init
> 4026531837 user   70 1 root /init
> 4026531840 net    70 1 root /init
> 4026532266 ipc    70 1 root /init
> 4026532277 mnt    67 1 root /init
> 4026532278 uts    68 1 root /init
> 4026532279 pid    70 1 root /init
```
All other processes inherit namespaces from their parent process. So if we do the same thing for the current shell process, we get exactly the same namespace IDs as with `init`:
```shell
lsns -p $$ | awk '{print $1,$2}'
> 4026531834 time
> 4026531835 cgroup
> 4026531837 user
> 4026531840 net
> 4026532266 ipc
> 4026532277 mnt
> 4026532278 uts
> 4026532279 pid
```
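This inheritance can also be checked directly from `/proc`, without `lsns` and without root, by comparing the namespace IDs of the current shell with those of a child process it spawns (a sketch; the IDs themselves vary per system):

```shell
# namespace id of the current shell's uts namespace
parent_uts=$(readlink /proc/$$/ns/uts)
# spawn a child shell and read its own uts namespace id
child_uts=$(sh -c 'readlink /proc/$$/ns/uts')
# a plain child process shares all of its parent's namespaces
[ "$parent_uts" = "$child_uts" ] && echo "same uts namespace"
```

The same comparison works for any of the other namespace types under `/proc/<pid>/ns/`.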
We could start a new shell with a new `uts` namespace using the `unshare` command, which is a wrapper around the syscall of the same name. It is used to un-share a process from the default namespaces. So we could start bash using `unshare` (to un-share from the default `uts` namespace) and list its new `uts` namespace ID:
Note that `unshare` needs root to create all types of namespaces except `user`. This is also why Docker needs root.
```shell
sudo unshare --uts bash
# now running bash in a new uts namespace
lsns -p $$
# these namespaces are the same as for init
> 4026531834 time   71 1 root /sbin/init
> 4026531835 cgroup 71 1 root /sbin/init
> 4026531837 user   71 1 root /sbin/init
> 4026531840 net    71 1 root /sbin/init
> 4026532266 ipc    71 1 root /sbin/init
> 4026532277 mnt    68 1 root /sbin/init
> 4026532279 pid    71 1 root /sbin/init
# but the uts namespace has a new id
> 4026536218 uts     2 74238 root bash
```
Another way to view a process's namespaces is by exploring the `/proc` filesystem, which is a pseudo filesystem provided by the kernel. For each process we have `/proc/<pid>/ns/<namespace>`, so we could list the namespaces of `init` (PID 1) using the filesystem as well.
```shell
sudo ls -l /proc/1/ns
> lrwxrwxrwx 1 root root 0 Feb 4 16:26 cgroup -> 'cgroup:[4026531835]'
> lrwxrwxrwx 1 root root 0 Feb 4 16:26 ipc -> 'ipc:[4026532266]'
> lrwxrwxrwx 1 root root 0 Feb 4 16:26 mnt -> 'mnt:[4026532277]'
> lrwxrwxrwx 1 root root 0 Feb 4 16:26 net -> 'net:[4026531840]'
> lrwxrwxrwx 1 root root 0 Feb 4 16:26 pid -> 'pid:[4026532279]'
> lrwxrwxrwx 1 root root 0 Feb 7 18:56 pid_for_children -> 'pid:[4026532279]'
> lrwxrwxrwx 1 root root 0 Feb 4 16:26 time -> 'time:[4026531834]'
> lrwxrwxrwx 1 root root 0 Feb 7 18:56 time_for_children -> 'time:[4026531834]'
> lrwxrwxrwx 1 root root 0 Feb 4 16:26 user -> 'user:[4026531837]'
> lrwxrwxrwx 1 root root 0 Feb 4 16:26 uts -> 'uts:[4026532278]'
# or for a specific namespace, like "mnt"
sudo readlink /proc/1/ns/mnt
```
## Cgroups

Control Groups (also called resource controllers) are a way to manage resources like memory, disk, CPU and network, so that resource limits can be applied to a container and usage can be extracted. Cgroups are structured in multiple separate hierarchies under `/sys/fs/cgroup`, one for each subsystem. A cgroup is isolated from the host using its `cgroup` namespace together with cgroups mounted from the host. When a container is started, Docker Engine creates a new child group named `docker/<container id>` on the host under each subsystem. The host `cgroup` namespace is copied, and if a limit is added, it is changed in the namespace. Following are some of the cgroup subsystems:
Note that the `/sys` filesystem is, just like `/dev`, a pseudo filesystem provided by the kernel.
- `blkio`. Limits I/O on block devices.
- `cpu`. Limits CPU usage.
- `cpuacct`. Reports CPU usage.
- `cpuset`. Limits which individual CPUs can be used on multicore systems.
- `devices`. Allows or denies access to devices.
- `freezer`. Suspends or resumes processes.
- `memory`. Limits memory usage, and reports usage.
- `net_cls`. Tags network packets.
- `net_prio`. Sets priority on network traffic.
- `ns`. Limits access to namespaces.
- `perf_event`. Identifies cgroup membership of processes.
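Which cgroup a given process belongs to, in each of these hierarchies, can be read straight from `/proc`, without extra tools (a sketch; on a cgroup v2 system there is a single unified hierarchy, so the file typically contains just one line):

```shell
# each line has the form hierarchy-id:controller-list:cgroup-path
cat /proc/self/cgroup
# the cgroup filesystem mounts those paths are relative to
grep cgroup /proc/self/mounts
```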
We could list all the cgroups that can be managed using `lscgroup`, which corresponds to the directories inside `/sys/fs/cgroup`.
We could run a Docker container with some limit to explore:
```shell
# run sh in a docker container named alpine using the alpine image, with a 512mb memory limit
docker run --name alpine -it --rm --memory="512mb" alpine sh
# run docker stats to see its limit
docker stats
# and we have a limit
> CONTAINER ID   NAME     CPU %   MEM USAGE / LIMIT   MEM %   NET I/O       BLOCK I/O   PIDS
> 12cf5d22a4a2   alpine   0.00%   536KiB / 512MiB     0.10%   1.16kB / 0B   0B / 0B     1
# now, on the Docker host, check the container's max memory cgroup
cat /sys/fs/cgroup/memory/docker/12cf5d22a4a2c381ed23629a5da3f221f951695f699ce9d415623a8d39e5e335/memory.limit_in_bytes
> 536870912
# and from inside the container
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
> 536870912
```
## Capabilities

Capabilities are used by Docker Engine to restrict the permissions of a process running in a container. `containerd` runs as root with all capabilities (`=ep`). The capabilities a process currently has can be listed with `getpcaps`, so we can start up a new container and inspect it:
```shell
docker run -d --name nginx nginx
# from the host where containerd is running (PID is the second column of ps aux)
pid=$(ps aux | grep "nginx" | grep master | awk '{print $2}')
getpcaps $pid
> 1426: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
> cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
```
So for instance, it has `cap_sys_chroot`, which is needed by pivot_root to change the root filesystem. It also has `cap_mknod`, which is needed by some images to create special files in `/dev`. `cap_setuid` and `cap_setgid` are needed to map users and groups. In fact, many of its capabilities are needed by the Docker Runtime in order to initialize the container.
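`getpcaps` comes from the libcap tools and may not be installed everywhere; the raw capability sets are always available in `/proc/<pid>/status` as hex bitmasks (a sketch, inspecting the current process; in these masks, bit 18 for example corresponds to `cap_sys_chroot`):

```shell
# the current process's capability sets as hex bitmasks
grep ^Cap /proc/self/status
# extract just the effective set
capeff=$(awk '/^CapEff/ {print $2}' /proc/self/status)
echo "effective capability mask: $capeff"
```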
## pivot_root

`pivot_root` is used by the Docker Runtime to change the root filesystem to the image filesystem. This is done in the process that starts up the container (PID 1) during container initialization. Following is an example of how this is done:
```shell
# needed by pivot_root
mount --bind $fs_folder $fs_folder
# enter the new root filesystem
cd $fs_folder
# oldroot will be mounted by pivot_root
mkdir -p oldroot
# set new root
pivot_root . oldroot
# unmount oldroot, so it can be removed
umount -l oldroot
# remove oldroot
rmdir oldroot
```
Doing that would make the root `/` point to the filesystem inside `$fs_folder`.
## Docker Engine

Docker Engine runs a daemon (service) called `dockerd`, which is used by the Docker client executable `docker`. `dockerd` handles everything from creating networks to container management, but the actual containers run in another service called `containerd`. Normally the client connects to `dockerd` using the Docker UNIX socket `/var/run/docker.sock`. When a container is started, an executable path within the image is provided by the client, and `containerd` uses a Docker Runtime called `runc` to isolate the process using namespaces, mount the `/dev` and `/sys` filesystems, change the root of the filesystem using pivot_root, and so forth. For example, running `docker run -it nginx bash` will connect to `dockerd`, which will connect to `containerd` and send a command to run `bash` in the `nginx:latest` image filesystem. `containerd` will use `runc` to execute bash, and because of the `-it` flags, a shared TTY device will be created by `containerd` with the `STDIN`, `STDOUT` and `STDERR` of bash connected to it. This TTY will be redirected by `containerd` to the `docker` client, basically like a reverse shell.
## Docker Runtime

Docker containers are created by the Docker Runtime `runc`. A container is simply an isolated environment where processes can run, so `runc` is basically a way to initialize a process with all its namespaces, capabilities and cgroups, and to pivot_root. `runc` needs a filesystem and a runtime configuration in order to create a container. Or more correctly, `containerd` translates information from the OCI Image Manifest Specification to the OCI Runtime Specification and provides that to `runc`. So we could create a base OCI Runtime Specification using `runc spec`, edit it manually in a similar way as `containerd` does, and use `runc` to start up a container:
```shell
docker run --name ubuntu ubuntu
mkdir test; cd test
# export rootfs
docker export ubuntu > rootfs.tar
mkdir rootfs
tar -xf rootfs.tar -C ./rootfs
# create config.json
runc spec
# modify:
# add capabilities
# * CAP_SETUID
# * CAP_SETGID
# change root->readonly = false
# run container
runc run containerid
```
Note that the first process created inside a container is always PID 1. In a Linux system this is usually `systemd` or `SysV init`, so a container doesn't do any bootstrapping or management of user processes; all of this is handled by Docker Engine instead. And when PID 1 is terminated, so is the container.
## Docker Filesystem

Root filesystems are part of an image and contain the executables needed, together with all their dependencies (userland). When an executable on this root filesystem runs in the Docker Runtime, it's called a container. The default filesystem used by Docker is a union filesystem called OverlayFS, which, just like the image format, is based upon layers. The top layer is where the container can make changes, and the layers below belong to the image, which is immutable. So if the same image is used by multiple containers, they all share the layers that belong to the image. Both container and image layers are stored under `/var/lib/docker/overlay2`.
The OverlayFS filesystem is part of the kernel, and it consists of the following pieces:

- `LowerDir` - read-only layers.
- `UpperDir` - read/write layer.
- `MergedDir` - all layers merged.
- `WorkDir` - used by OverlayFS to create `MergedDir`.
Note, if you're using Docker Desktop and WSL2, use the following container to explore the Docker Filesystem: `docker run -it --privileged --rm --pid=host debian nsenter -t 1 -m -u -i sh`.
So we could inspect the layers in an image and compare them to the layers in a container.
```shell
# first, pull the nginx image
docker pull nginx:latest
# and inspect its layers
docker image inspect nginx | jq '.[0].GraphDriver.Data'
> {
>   "LowerDir": "/var/lib/docker/overlay2/9f8aa5926b47a7a07ba55cd2ce938ae1cfce32d08557bcd4a23086ef76560bef/diff:
>     /var/lib/docker/overlay2/49569d337c727a9d93a15b910c2a0fb5cb05996954a50a546002ca46231df3fd/diff:
>     /var/lib/docker/overlay2/8678c30b35e2393241ecb5288f0dbaab45e9e81213078793c05b62bf21ebfe97/diff:
>     /var/lib/docker/overlay2/856de74b0828e7523134b53f45de181a81e317e5eed3c6992ecd85fd281d0072/diff:
>     /var/lib/docker/overlay2/0c5253794034518627d1bce63c067171ef11c16767d5f5a77aa539a1b29d8f8f/diff:
>     /var/lib/docker/overlay2/a228042c51ce74cfbbae479fe7a7ceed26a45ba4a7dee392df059400202e92e6/diff",
>   "MergedDir": "/var/lib/docker/overlay2/5d6cb52f37dfbc060f91c708b38661558c22cbc522e232d087ef9009c9127f66/merged",
>   "UpperDir": "/var/lib/docker/overlay2/5d6cb52f37dfbc060f91c708b38661558c22cbc522e232d087ef9009c9127f66/diff",
>   "WorkDir": "/var/lib/docker/overlay2/5d6cb52f37dfbc060f91c708b38661558c22cbc522e232d087ef9009c9127f66/work"
> }
# create an nginx container
docker run --name nginx -d nginx:latest
# and inspect the container layers
docker container inspect nginx | jq '.[0].GraphDriver.Data'
> {
>   "LowerDir": "/var/lib/docker/overlay2/ca852f913a6c93a9dd97a1219804e73c4e55d3639ab5198a97ac541aed9a2e87-init/diff:
>     /var/lib/docker/overlay2/5d6cb52f37dfbc060f91c708b38661558c22cbc522e232d087ef9009c9127f66/diff:
>     /var/lib/docker/overlay2/9f8aa5926b47a7a07ba55cd2ce938ae1cfce32d08557bcd4a23086ef76560bef/diff:
>     /var/lib/docker/overlay2/49569d337c727a9d93a15b910c2a0fb5cb05996954a50a546002ca46231df3fd/diff:
>     /var/lib/docker/overlay2/8678c30b35e2393241ecb5288f0dbaab45e9e81213078793c05b62bf21ebfe97/diff:
>     /var/lib/docker/overlay2/856de74b0828e7523134b53f45de181a81e317e5eed3c6992ecd85fd281d0072/diff:
>     /var/lib/docker/overlay2/0c5253794034518627d1bce63c067171ef11c16767d5f5a77aa539a1b29d8f8f/diff:
>     /var/lib/docker/overlay2/a228042c51ce74cfbbae479fe7a7ceed26a45ba4a7dee392df059400202e92e6/diff",
>   "MergedDir": "/var/lib/docker/overlay2/ca852f913a6c93a9dd97a1219804e73c4e55d3639ab5198a97ac541aed9a2e87/merged",
>   "UpperDir": "/var/lib/docker/overlay2/ca852f913a6c93a9dd97a1219804e73c4e55d3639ab5198a97ac541aed9a2e87/diff",
>   "WorkDir": "/var/lib/docker/overlay2/ca852f913a6c93a9dd97a1219804e73c4e55d3639ab5198a97ac541aed9a2e87/work"
> }
```
OK, so if we look at `LowerDir` in the container, we see that it's the same as for the image, except that it has two more layers on top:

- `ca852f913a6c93a9dd97a1219804e73c4e55d3639ab5198a97ac541aed9a2e87-init` on top
- `5d6cb52f37dfbc060f91c708b38661558c22cbc522e232d087ef9009c9127f66` below, which is the same as `UpperDir` in the image.

We can also see that `ca852f913a6c93a9dd97a1219804e73c4e55d3639ab5198a97ac541aed9a2e87` (without `-init`) is `UpperDir` in the container. This all makes sense, given that the image is used by the container but is read-only. The `UpperDir` in the container is where all changes are made.
So how does the Docker Runtime make this behave like a normal filesystem? It mounts it all using the `overlay` mount type! So we could do the same thing as the Docker Runtime, but mount it somewhere else:
```shell
mkdir -p /mnt/testing
mount -t overlay -o lowerdir=/var/lib/docker/overlay2/ca852f913a6c93a9dd97a1219804e73c4e55d3639ab5198a97ac541aed9a2e87-init/diff:\
/var/lib/docker/overlay2/5d6cb52f37dfbc060f91c708b38661558c22cbc522e232d087ef9009c9127f66/diff:\
/var/lib/docker/overlay2/9f8aa5926b47a7a07ba55cd2ce938ae1cfce32d08557bcd4a23086ef76560bef/diff:\
/var/lib/docker/overlay2/49569d337c727a9d93a15b910c2a0fb5cb05996954a50a546002ca46231df3fd/diff:\
/var/lib/docker/overlay2/8678c30b35e2393241ecb5288f0dbaab45e9e81213078793c05b62bf21ebfe97/diff:\
/var/lib/docker/overlay2/856de74b0828e7523134b53f45de181a81e317e5eed3c6992ecd85fd281d0072/diff:\
/var/lib/docker/overlay2/0c5253794034518627d1bce63c067171ef11c16767d5f5a77aa539a1b29d8f8f/diff:\
/var/lib/docker/overlay2/a228042c51ce74cfbbae479fe7a7ceed26a45ba4a7dee392df059400202e92e6/diff,\
upperdir=/var/lib/docker/overlay2/ca852f913a6c93a9dd97a1219804e73c4e55d3639ab5198a97ac541aed9a2e87/diff,\
workdir=/var/lib/docker/overlay2/ca852f913a6c93a9dd97a1219804e73c4e55d3639ab5198a97ac541aed9a2e87/work \
overlay /mnt/testing
```
And this is the exact same filesystem that the nginx container uses, which we can confirm:
```shell
echo "Hello!" > /mnt/testing/hello
docker exec -it nginx bash
# in the container
cat /hello
> Hello!
# cleanup on the host
umount /mnt/testing
rmdir /mnt/testing/
```
## Docker Images

Docker images implement the OCI Image Manifest Specification, which basically is a manifest file that contains a list of layers bundled together with an image configuration file. Each layer is built upon the previous layers, and this fits together perfectly with the default Docker Filesystem, OverlayFS. The image layers are usually located under `/var/lib/docker/overlay2/` and each layer is represented as a folder.
Layers can be thought of as tarballs: if these tarballs are extracted to disk in the correct order, you get the image root filesystem.
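This extraction order can be sketched with two toy tarballs (hypothetical layer names, no Docker involved): a later layer can both add files and override files from an earlier one, which is exactly the union that the image root filesystem is built from:

```shell
# build two toy layers
mkdir -p layer1 layer2 rootfs
echo "from layer 1" > layer1/a.txt
echo "from layer 1" > layer1/b.txt
echo "from layer 2" > layer2/b.txt
tar -cf layer1.tar -C layer1 .
tar -cf layer2.tar -C layer2 .
# extract in order: files in later layers override earlier ones
tar -xf layer1.tar -C rootfs
tar -xf layer2.tar -C rootfs
cat rootfs/a.txt   # -> from layer 1
cat rootfs/b.txt   # -> from layer 2
```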
The image configuration holds information about exposed ports, environment variables, which executable to run by default, etc. This is later used by the Docker Engine to create the OCI Runtime Specification used by the Docker Runtime.
To view where all the layers in an image are located, along with all other image-related information, we can do the following:
```shell
# inspect the nginx image
docker image inspect nginx | jq
# layer information is found under `GraphDriver.Data`
```
## Docker Networking

Note, if you're using Docker Desktop and WSL2, use the following container to explore Docker Networking: `docker run -it --privileged --pid=host --rm ubuntu nsenter -t 1 -n bash`. Also, the following packages are needed: `apt update; apt -y install iproute2 tcpdump iptables bridge-utils`.
Docker Engine is responsible for the setup of networks, and Docker has four built-in network drivers:

- Bridge - the default network, with connectivity to a Docker bridge interface.
- Host - allows access to the same interfaces as the host.
- Macvlan - allows access to an interface on the host.
- Overlay - allows networks spanning different hosts running Docker, usually Docker Swarm clusters.
The default network bridge is `docker0`, and we can view some more information about it:
```shell
# first start a container
docker run --name nginx -p 80:80 -d nginx
brctl show docker0
> bridge name   bridge id           STP enabled   interfaces
> docker0       8000.02429f7dbd2f   no            veth0c35011
# we see that it has one veth interface attached to it.
# we can also see that this interface exists on the host
ip link show
> 20: veth0c35011@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
>     link/ether ce:d7:73:6c:90:31 brd ff:ff:ff:ff:ff:ff link-netnsid 1
# and if we run iptables to see its rules
iptables -L
# we'll see that it has this rule in the DOCKER chain
> Chain DOCKER (1 references)
> target   prot   opt   source     destination
> ACCEPT   tcp    --    anywhere   172.17.0.2   tcp dpt:http
# we can double-check that this is in fact the container ip
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' nginx
> 172.17.0.2
```
So what does this tell us? Well, the Docker daemons run on the host, and from this we can deduce that the Docker Engine creates the bridge `docker0`. When we run a container, a `veth` interface pair is created, with one end attached to the bridge on the host and the other end in the container's network namespace. And the network is opened up to the container IP using `iptables` rules in the `DOCKER` chain.