Selecting a container runtime for use with Kubernetes
Kubelet can be configured to use Docker, rkt (deprecated), or any CRI-compatible container API using the
- Kubernetes Interfaces
- OCI (Open Container Initiative) Compatible Runtimes
This article is written with the bias that virtual machines, though providing some additional isolation, are too resource restrictive and require too much overhead to be considered the path forward for production software deployments. Though mentioned in passing for completeness, hypervisor based solutions will not be completely described nor analyzed.
Docker is by far the most popular container engine, mostly because of its repository and community. Its runtime is a combination of dockerd, containerd, containerd-shim, and runc. As Kubernetes is moving away from directly supporting any non-CRI interface, this native support may not last much longer.
As of Kubernetes 1.10.0, direct rkt (or rktnetes) support has been deprecated.
The Container Runtime Interface (CRI) came about because CoreOS wanted to add rkt support to Kubernetes. Kubernetes originated with support for Docker only, and the patch to add rkt support was rather ugly, with lots of “if this do rkt else do docker” branches. Supporting any additional runtimes in the same way would have been impossible to maintain, so the CRI was born.
The CRI is a protobuf API that includes two gRPC services, ImageService and RuntimeService. The ImageService provides RPCs to pull an image from a repository, inspect an image, and remove an image. The RuntimeService contains RPCs to manage the lifecycle of pods and containers, as well as calls to interact with containers (exec/attach/port-forward).
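As a rough map of that API surface, the sketch below groups a handful of RPC names from the CRI protobuf definition under the two services. The list is partial, and the bucket names ("images", "pod", "container", "streaming") and the helper `service_for` are my own labels for illustration, not part of the CRI.

```python
# Partial map of the CRI gRPC surface. The RPC names come from the CRI
# protobuf definition; the bucket names are illustrative labels only.
CRI_SERVICES = {
    "ImageService": {
        "images": ["PullImage", "ImageStatus", "ListImages", "RemoveImage"],
    },
    "RuntimeService": {
        "pod": ["RunPodSandbox", "StopPodSandbox", "RemovePodSandbox"],
        "container": ["CreateContainer", "StartContainer",
                      "StopContainer", "RemoveContainer"],
        "streaming": ["Exec", "Attach", "PortForward"],
    },
}


def service_for(rpc_name):
    """Return the name of the CRI service that owns the RPC, or None."""
    for service, buckets in CRI_SERVICES.items():
        for rpcs in buckets.values():
            if rpc_name in rpcs:
                return service
    return None
```

A CRI implementation has to answer every one of these calls, which is why each project described here ships its own daemon listening on a socket that kubelet connects to.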
cri-containerd is a service to add CRI support to containerd, which is the runtime manager and image service created by Docker and donated to the CNCF. It was split off from Docker to decouple the runtime manager from the rest of the Docker tools in an effort to get the (at the time) growing ecosystem of container management tools to standardize on the Docker API. This helped grow the Docker community and helped create the wealth of containers in the Docker registry.
cri-containerd is in beta as of Kubernetes 1.9.
rktlet is a Kubernetes Container Runtime Interface implementation using rkt as the main container runtime.
When kubelet requests a pod to be created, rktlet will start a systemd service which will in turn create a new rkt sandbox by running the rkt binary. After a sandbox is created, kubelet can request containers to be added/removed/etc. to the pod. rktlet will then run the rkt binary with the corresponding rkt app commands (app add, app rm, app start, or app stop). The rest of the CRI methods are also serviced by executing the rkt binary.
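The dispatch described above can be pictured as a small lookup from CRI container calls to rkt app subcommands. This is only a sketch: the pairing of each CRI RPC with a subcommand is my assumption based on the names, the helper `rkt_app_command` is hypothetical, and rktlet's real code may be organized quite differently.

```python
# Assumed mapping from CRI container RPCs to the rkt app subcommands
# named above. rktlet's real dispatch code may differ.
CRI_TO_RKT_APP = {
    "CreateContainer": "add",
    "StartContainer": "start",
    "StopContainer": "stop",
    "RemoveContainer": "rm",
}


def rkt_app_command(cri_rpc):
    """Build the leading argv for the rkt invocation behind a CRI call.

    Only the `rkt app <subcommand>` prefix is produced; pod and app
    identifiers are deliberately left out of this sketch.
    """
    if cri_rpc not in CRI_TO_RKT_APP:
        raise ValueError("no rkt app subcommand for %r" % cri_rpc)
    return ["rkt", "app", CRI_TO_RKT_APP[cri_rpc]]
```

The point of the sketch is that every CRI request turns into a fresh execution of the rkt binary rather than a call into a long-running runtime daemon.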
rktlet is considered alpha as of Kubernetes 1.9.
cri-o is a CRI implementation designed to do only one thing: implement the Kubernetes CRI. It provides a minimal set of tools and interfaces to download, extract, and manage images, maintain the container lifecycle, and provide the monitoring and logging required to satisfy the CRI. It can use any runtime that implements the OCI runtime spec and defaults to runc.
cri-o uses the configured runtime (runc by default) to create a sandbox container (pod), then uses the runtime again to create any containers within that pod.
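The ordering matters more than the details, so here is a toy model of it. `RecordingRuntime` is a hypothetical stand-in for an OCI runtime such as runc that only records what it was asked to do, and the ID naming scheme is invented for the example; none of this is cri-o's actual code.

```python
class RecordingRuntime:
    """Hypothetical stand-in for an OCI runtime; it only records calls."""

    def __init__(self):
        self.calls = []

    def create(self, container_id):
        self.calls.append(("create", container_id))

    def start(self, container_id):
        self.calls.append(("start", container_id))


def run_pod(runtime, pod_name, container_names):
    """Create the pod sandbox first, then each container inside it."""
    sandbox_id = pod_name + "-sandbox"  # invented naming scheme
    runtime.create(sandbox_id)
    runtime.start(sandbox_id)
    for name in container_names:
        container_id = pod_name + "-" + name
        runtime.create(container_id)
        runtime.start(container_id)
    return sandbox_id


runtime = RecordingRuntime()
run_pod(runtime, "web", ["nginx", "sidecar"])
```

After this runs, `runtime.calls` starts with the sandbox and then lists each application container, which mirrors the two-phase use of the runtime described above.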
cri-o is considered stable as of Kubernetes 1.9.
Frakti lets Kubernetes run pods and containers directly inside hypervisors via runV.
OCI (Open Container Initiative) Compatible Runtimes
bwrap-oci is a mostly OCI-compatible wrapper around bubblewrap, an unprivileged container tool. Per Giuseppe Scrivano:
“bwrap-oci misses a lot of the features that are needed by the e2e Kubernetes tests, e.g. there is no way to specify the options for a bind mount, or to limit the resources via cgroups.”
crun is an OCI runtime spec implementation written in C. Combined with cri-o, Giuseppe Scrivano has used it to pass the complete e2e suite. It is smaller and lighter than any other tool in this space.
railcar is a Rust implementation of the OCI runtime spec. It is very similar to the reference implementation, runc, but some of the runc commands are not supported. As of publication the list of unsupported commands is: checkpoint, events, exec, init, list, pause, restore, resume, spec. Railcar always runs an init process separate from the container process. The development of railcar has uncovered some deficiencies in the OCI runtime spec. By writing railcar in Rust, the authors were able to eliminate the need for the C shims that are used in the Go implementation.
I asked the author of that article about the deficiencies he uncovered and whether any issues had been filed; he did not respond.
rkt is a CLI tool written in Go to run a container in Linux. It uses a multi-layered design that allows the runtime or runtimes to be swapped out as the implementer sees fit. By default, rkt uses systemd and systemd-nspawn in combination to create containers. systemd-nspawn is used to manage the namespace in which systemd is executed to manage the cgroups. The container applications are run as systemd units. By using different “stage 1” images, the tools used to run the application can be changed from the systemd tools to nearly anything else.
rkt includes the same functionality as runc but strays from the OCI standard by not using the OCI runtime spec, a standard spec file used to define a container. Instead, rkt exposes a similar set of functionality through its command line interface.
runc is a CLI tool written primarily in Go (with some C shims for things Go cannot do) to run a container in Linux according to the OCI specification. It is the most popular OCI runtime and is the default runtime for containerd and cri-o.
For the most part, all runc does is configure the namespace and cgroups while spawning a process. This is all a container really is: a namespace and cgroups.
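To make the "namespace and cgroups" point concrete, the sketch below builds a pared-down container definition in the shape of the OCI runtime spec's config.json. The field names follow the spec as I understand it; the function name, the chosen namespaces, and the memory limit are illustrative, and the result is not claimed to be a complete bundle that runc would accept as-is (a real config also carries mounts, a user, and more).

```python
import json


def minimal_oci_config(args, memory_limit_bytes):
    """Return a pared-down dict shaped like an OCI runtime config.json."""
    return {
        "ociVersion": "1.0.0",
        # The process to spawn inside the container.
        "process": {"args": list(args), "cwd": "/"},
        # Directory (relative to the bundle) holding the root filesystem.
        "root": {"path": "rootfs"},
        "linux": {
            # Namespaces the runtime should create for the process.
            "namespaces": [
                {"type": kind}
                for kind in ("pid", "mount", "network", "ipc", "uts")
            ],
            # Resource limits, enforced by the runtime through cgroups.
            "resources": {"memory": {"limit": memory_limit_bytes}},
        },
    }


config = minimal_oci_config(["/bin/sh"], 64 * 1024 * 1024)
config_json = json.dumps(config, indent=2)
```

Almost everything in the document is either "which process", "which namespaces", or "which cgroup limits", which is the whole of what the runtime sets up.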
runc depends on and tracks the runtime-spec repository, ensuring that runc and the OCI specification major versions stay in lockstep. This means that runc 1.0.0 implements the 1.0 version of the specification.
runlxc is Alibaba’s soon-to-be-open-sourced OCI-compatible runtime. As of this writing, it has not yet been released. It is used with pouch.
runv is a hypervisor-based OCI runtime, spawning the OCI image in KVM, Xen, or QEMU. It does not spawn containers.
containerd is the most complex of the CRI implementations, with three components needed to provide the image and runtime services. It is written in Go.
The components come from two separate projects: cri, which provides the CRI daemon in 19,000 lines of Go, and containerd, which provides the runtime daemon and shim in 112,000 lines. containerd defaults to using its own fork of runc, but any implementation compatible with the OCI runtime spec can be configured.
Docker/containerd CRI support is in beta.
I was not able to get cri to compile in Arch Linux. Due to the limited amount of time available, I did not spend a lot of time on this. The build instructions in the README.md would not successfully complete the make install.deps stage.
rkt requires two components, rkt and rktlet. It doesn’t follow standards, though, and instead does things its own way despite its GitHub description stating, “It is composable, secure, and built on standards.” It’s also written in Go.
rktlet CRI support is in alpha.
Sitting in the middle is cri-o. It doesn’t try to serve any container orchestrator other than Kubernetes. With its single focus, it has quickly gained “stable” status, and it does so with a mere 14,000 lines of Go. It interfaces with any OCI runtime spec implementation and defaults to the upstream runc.
cri-o CRI support is stable.
An obvious factor in choosing a CRI implementation is usability. You should be able to run a complete e2e test and be able to run any OCI container. If something goes wrong, you should be able to diagnose the problem quickly.
cri-o can be installed as a package in most distros. Using cri-o is very simple. The defaults can be used by just changing the kubelet flags to point at the cri-o socket. Further configuration can add features such as repo restrictions, SELinux, AppArmor, seccomp, image signing, etc.
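For reference, the flag change looks roughly like the list built below. The flag names are the remote-runtime kubelet flags as I understand them for the Kubernetes releases discussed here (later releases changed them), and the socket path is an assumption that varies by cri-o version and distro packaging; check your own install before copying it. The helper name is hypothetical.

```python
def kubelet_crio_flags(socket_path="/var/run/crio/crio.sock"):
    """Return kubelet flags pointing at a CRI socket.

    The default socket_path is an assumed location, not a guarantee.
    """
    endpoint = "unix://" + socket_path
    return [
        "--container-runtime=remote",
        "--container-runtime-endpoint=" + endpoint,
        "--image-service-endpoint=" + endpoint,
    ]


flags = kubelet_crio_flags()
```

Because kubelet only sees a socket, swapping cri-o for another CRI implementation is a matter of changing the endpoint rather than rebuilding anything.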
Debugging problems is simple as there are only two pieces, the cri-o daemon and the conmon console monitor.
Testing cri-o with Kubernetes 1.9 using runc, I was easily able to pass a full e2e test.
Since I was not able to get cri-containerd to compile, I was not able to test its usability.
rktlet can be installed as a package in most distros. Using rkt involves a bit of a learning curve. There are a number of valid ways to configure it, and learning which one is appropriate for your use case is not very straightforward. rkt works with “stages”, and there is very little documentation about which stage1 image is used for which purpose. You can create your own stage1 image, but proper documentation might remove the need to.
It has only three pieces, and diagnosing a problem is generally quite simple because it integrates with systemd, leaving all output in the journal.
In attempting to use rktlet as my CRI provider, I was unable to get some containers to work properly. Unfortunately, I had a time constraint so I was not able to diagnose this properly. I was hoping that using rkt would be as easy as just changing the kubelet flags.
As members of the open source community, we also consider the health of each project's community important. A good community should seek active participation, review contributions quickly, and have a well-documented contribution process. It should have an active and helpful user base in which to ask questions and get answers.
There have been 1,354 pull requests to cri-o since September 10, 2016, and 23 of those are open; the oldest was created on September 19, 2017 and was last commented on 9 days ago. There have been 75 contributors, 20 of them active in the last month. 75% of the commits in the last 30 days are from Red Hat employees. During the last 30 days, 5,141 lines of Go code were added and 694 deleted. The 5,141 added lines represent a change of 37% of the code. cri-o has an active contributor base, most of whom are Red Hat employees, but there is representation from outside of Red Hat. Pull requests take a median of 4 days to be merged. Issues are about the same. There is an IRC channel where developers answer user questions with about a 4-minute response time during North American business hours.
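The 37% figure is simply lines added in the period divided by the size of the code base. A minimal sketch of that arithmetic, using the roughly 14,000-line total quoted earlier for cri-o (so the result is approximate, and the function name is my own):

```python
def churn_percent(lines_added, total_lines):
    """Lines added in a period as a percentage of the code base size."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return 100.0 * lines_added / total_lines


# cri-o: 5,141 lines added against roughly 14,000 lines of Go.
crio_churn = churn_percent(5141, 14000)
print(round(crio_churn))  # 37
```

The same calculation applies to the other projects, using their own line counts.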
There have been 484 pull requests to containerd/cri since April 14, 2017; 6 of those are open. The oldest open PR was created on December 8, 2017 and was last commented on January 18, 2018. There have been 31 contributors, 8 of them active in the last month. There has been an even mix of contributions from IBM, Docker, ZTE Corporation, Google, and Intel in the last 30 days. During the last 30 days, 2,123 lines of Go code were added and 1,217 deleted. The 2,123 added lines represent a change of 11% of the code. The contributor base for containerd/cri is not very active, but what there is represents a healthier cross-section of the community.
There have been 1,662 pull requests to containerd/containerd since December 7, 2015; 13 of those are open. The oldest open PR was created on August 24, 2017 and was last commented on September 19, 2017. There have been 120 contributors, 12 of them active in the last month. 46% of the commits in the last 30 days are from Docker employees, with NTT and IBM splitting another 31%. During the last 30 days, 4,444 lines of Go code were added and 1,724 lines deleted. The 4,444 added lines represent a change of 3% of the code. The contributor base for containerd is not very active, but it is still relatively diverse.
Getting support for either of these requires registering with Docker as a community member.
There have been 2,406 pull requests to rkt since November 13, 2014; 49 of those are open. The oldest open PR was created on August 25, 2015 and was last commented on August 9, 2016. It is my opinion that the rkt maintainers could do a better job of curating their open pull requests. PRs that are accepted are usually reviewed and merged within a week. There have been 195 contributors, with no activity in the last month. rkt’s activity has fallen off dramatically. I’m not sure if this is a dead project.
After looking at rkt’s inactivity, I have chosen not to even look at the health of the rktlet community.
The use of standards and APIs leaves the greatest flexibility and helps prevent the lock-in that accumulates technical debt. By using the Open Container Initiative’s container and runtime standards and Kubernetes’ Container Runtime Interface API, any specific choice can be changed as the tools evolve without having to redesign the entire stack.
To that end, the Kubernetes native Docker support should be deprecated and only the CRI should be used.
As stated above, I chose not to evaluate the hypervisor-based solutions, and as of this writing runlxc has not yet been open sourced.
For the CRI interfaces, rktlet/rkt does not have the community support necessary for long-term project health, and that shows in the slow progress rktlet is making toward even beta status. cri-containerd doesn’t have a good build process and has no packaging for the distro I’m running on my personal cluster, so I was unable to test it. It’s also still only a beta, and getting support has barriers. cri-o is stable, builds quickly and easily, is packaged for every distro on our potential list of supported distros as well as for my own, and worked using the default configuration in my tests. My conclusion is that cri-o is the best technology, is positioned to be the most stable in production, has the best community, and is therefore my recommendation.
For the runtime, though railcar and crun may be technically superior, I did not have the time to test them. Since they are drop-in replacements, it’s not critical to choose the best technology for this tool up front. I recommend using the current default, runc, until such time as they can be more thoroughly evaluated or a use case exposes a need.