So far we have discussed runc and containerd in detail, along with their related components. We now need to take a look at the component called the shim. The shim is integral to implementing daemonless containers and to separating low-level container runtimes such as crun or runc from high-level container runtimes such as containerd. As discussed earlier, containerd uses runc to create new containers; in fact, it forks a new instance of runc for every container it creates. However, once each container is created, the parent runc process exits. This means we can run hundreds of containers without having to keep hundreds of runc instances running.
Coming back to the original architecture of the components laid out earlier:
Once a container’s parent runc process exits, the associated containerd-shim process becomes the container’s parent. Some of the responsibilities the shim performs as a container’s parent include:
- It keeps the STDIO and other file descriptors open for the container in case containerd and/or docker die. If the shim were not running, the parent side of the pipes or the TTY master would be closed and the container would exit.
- It allows the container’s exit status to be reported back to a higher-level tool like docker without that tool having to be the actual parent of the container’s process and do a wait on it.
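The reparenting behavior described above can be illustrated with plain `fork(2)`. The sketch below is hypothetical and does not use runc itself: a short-lived "runc" process forks a "container" process and then exits, and the orphaned container is adopted by a new parent (PID 1, or the nearest subreaper, which is precisely the role containerd-shim registers itself for):

```python
import os
import time

# Hypothetical illustration, not runc code: a short-lived "runc" process
# forks a "container" process and exits; the orphan is then reparented.

result_r, result_w = os.pipe()   # container -> observer: report both parents
ready_r, ready_w = os.pipe()     # container -> "runc": ok to exit now

runc_pid = os.fork()
if runc_pid == 0:
    # --- the "runc" process ---
    if os.fork() == 0:
        # --- the "container" process ---
        before = os.getppid()            # the runc-like process, still alive
        os.write(ready_w, b"x")          # tell "runc" it may exit now
        while os.getppid() == before:    # wait until we are reparented
            time.sleep(0.01)
        after = os.getppid()             # the adopted parent (init/subreaper)
        os.write(result_w, f"{before}:{after}".encode())
        os._exit(0)
    os.read(ready_r, 1)                  # wait for the container to see us
    os._exit(0)                          # then exit, orphaning the container
else:
    # --- observer (stands in for containerd here) ---
    os.close(result_w)
    os.close(ready_r)
    os.waitpid(runc_pid, 0)              # reap the short-lived "runc"
    before, after = os.read(result_r, 64).decode().split(":")
    print(f"container's parent before: {before}, after reparenting: {after}")
```

Without a shim registered as a subreaper, the container would be adopted by PID 1, and its stdio pipes and exit status would be lost to the engine; the shim stepping in as the adoptive parent is what keeps them available.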
Decoupling of Low-Level Container Runtimes
It may be easy to overlook that the shim allowed decoupling low-level container runtimes such as crun or runc from high-level container runtimes such as containerd. However, this played a very important part in long-term goals such as platform independence and resilience in the Docker engine. In the words of Docker Inc., this allowed them to envision other container runtimes, built with other platforms, philosophies, and goals in mind:
For lack of better naming, these were referred to as runX, runY, etc. at the time. Some of them eventually led to the development of other container runtimes such as Nabla, gVisor, Clear Containers, Kata Containers, and runV. A more detailed read on container runtimes is available at https://www.capitalone.com/tech/cloud/container-runtime/. Below is a TL;DR of the above low-level runtimes.
gVisor and Nabla are sandboxed runtimes, which provide further isolation of the host from the containerized process. Instead of sharing the host kernel, the containerized process runs on a unikernel or a kernel-proxy layer, which then interacts with the host kernel on the container’s behalf. Because of this increased isolation, these runtimes have a reduced attack surface and make it less likely that a containerized process can have a harmful effect on the host.
runV, Clear Containers, and Kata Containers are virtualized runtimes. They are implementations of the OCI Runtime spec that are backed by a virtual machine interface rather than the host kernel. runV and Clear Containers have been deprecated and their feature sets absorbed into Kata. They can all run standard OCI container images, although with stronger host isolation: they start a lightweight virtual machine with a standard Linux kernel image and run the containerized process inside that virtual machine.
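This pluggability is visible in Docker's daemon configuration, which lets an alternate OCI runtime be registered alongside the default runc. The fragment below is a hedged sketch: the runtime name `kata` and the binary path are illustrative assumptions, not a verified install layout:

```json
{
  "runtimes": {
    "kata": {
      "path": "/usr/bin/kata-runtime"
    }
  }
}
```

With such an entry in `daemon.json`, a container can be started with `docker run --runtime=kata ...` while other containers on the same host continue to use runc; the shim-based architecture is what makes the swap transparent to containerd.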
Do note that, in contrast to native runtimes, sandboxed and virtualized runtimes impose a performance cost throughout the entire life of a containerized process. In sandboxed containers, there is an extra layer of abstraction: the process runs on the sandbox unikernel/proxy, which relays instructions to the host kernel. In virtualized containers, there is a layer of virtualization: the process runs entirely inside a virtual machine, which is inherently slower than running natively. However, these runtimes are comparatively more secure due to the reduced attack surface and the additional layers of abstraction and isolation.