Best Practices for Writing Dockerfiles

Over the last few years, adoption of Docker and Kubernetes has grown by leaps and bounds, and the vast majority of developers now build microservices and deploy them as containers. One of the most important aspects that people do not realize is that containers need to be lightweight. While building containers, one also needs to account for aspects like reducing build time for incremental builds, producing images in a consistent way, performing clean builds, and maintaining them properly. To achieve all this, one needs to follow certain practices while writing Dockerfiles.

While anyone can write Dockerfiles, writing them in an efficient way requires some learning. In this blog post, we are going to discuss some of these practices.

Order instructions from least to most frequently changing

Docker has an inbuilt mechanism for caching layers while building images. Each step in the Dockerfile becomes a cached layer once it completes. The next step runs its instruction on top of the previously cached layer and, once completed, is cached in turn. These layers are not discarded unless the whole Docker cache is cleared. So if we arrange the steps in the Dockerfile from least frequently changing to most frequently changing, we can minimize the number of steps that have to be re-run when rebuilding Docker images.

Consider the Dockerfile below:

(image: order-matters-for-caching-wrong)
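Since the original screenshot is not reproduced here, a minimal sketch of what such a Dockerfile might look like follows; the base image, package name, and paths are illustrative assumptions, not the post's exact code:

```dockerfile
FROM ubuntu:18.04

# The source code is copied first, so any code change invalidates
# this layer and every layer below it, including the apt-get ones.
COPY . /app

RUN apt-get update
RUN apt-get install -y openjdk-8-jdk

CMD ["java", "-jar", "/app/target/app.jar"]
```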

If we run the apt-get commands after copying the source code, then every time the source code changes, the apt-get commands will have to be run again on a new layer. We can avoid this by running the apt-get commands first, since their output will not vary much. The COPY step can then run on top of the layers cached from the apt-get commands:

(image: order-matters-for-caching-right)
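A corrected sketch, under the same illustrative assumptions, simply moves the rarely changing apt-get steps above the COPY:

```dockerfile
FROM ubuntu:18.04

# These layers rarely change, so they stay cached across code changes.
RUN apt-get update
RUN apt-get install -y openjdk-8-jdk

# Only this layer and the ones after it are rebuilt when code changes.
COPY . /app

CMD ["java", "-jar", "/app/target/app.jar"]
```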

Avoid using wildcards with the COPY directive

Another common pitfall is using a wildcard to COPY files from the local directory into the image:

(image: avoid-wildcard-for-copy-files-wrong)
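A sketch of the pitfall; the destination path is an assumption:

```dockerfile
# Every file in the build context is matched, so a change to any of them
# (README, docs, configs) invalidates this layer and all later ones.
COPY * /app/
```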

This will cause the cached layer to be discarded every time there is a minor change to any of the files present in the repository. Generally, the source code is restricted to a specific directory, and documentation and the like reside in their own directories. So we should copy only the specific directories and files pertaining to the requirement:

(image: avoid-wildcard-for-copy-files-right)
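Sketched with the assumed layout of a typical Maven project:

```dockerfile
# Copy only what the build actually needs.
COPY pom.xml /app/
COPY src/ /app/src/
```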

Alternatively, one could also make use of a .dockerignore file, which excludes matching files and directories when the COPY directive is used.
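For illustration, a .dockerignore at the root of the build context might look like this; the entries are assumptions about what a typical repository contains:

```text
# .dockerignore
.git
docs/
*.md
target/
```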

Group RUN instructions together

Each RUN instruction can be seen as a cacheable unit of execution. Too many of them can be unnecessary, while chaining all commands into one RUN instruction can bust the cache easily, hurting the development cycle. Instead of creating multiple layers, we can group related RUN instructions together using && and place them on separate lines using the \ line-continuation character.

Below is an example of doing it correctly:

(image: group-run-instructions-right)
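The grouped form would look something like this (the package name is assumed):

```dockerfile
# One cacheable unit: the index update and the install always run together,
# so the install can never be applied against a stale cached index.
RUN apt-get update \
    && apt-get install -y openjdk-8-jdk
```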

In the above, we update the apt index and install the Java package in one single instruction. This also helps avoid attempting to install the Java package against an outdated index from a cached layer.

Again, care should be taken to group only related RUN instructions together.

Remove unnecessary dependencies and tools

While it is tempting to install dependencies and debugging tools so that they can be helpful in debugging the image, they increase the image size unnecessarily. Besides, we can always create a separate Docker image containing all the dependencies and tools required to debug issues with the code.

Below is a Dockerfile that gets this wrong, building on the previous example:

(image: avoid-installing-dependencies-and-tools-wrong)
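A sketch of the problem; vim and curl stand in for whatever debugging tools might be added:

```dockerfile
# Debugging tools are convenient but inflate the image,
# and apt also pulls in recommended packages by default.
RUN apt-get update \
    && apt-get install -y openjdk-8-jdk vim curl
```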

Apt has the --no-install-recommends flag, which ensures that dependencies that are not actually needed are not installed. If they are needed, add them explicitly:

(image: avoid-installing-dependencies-and-tools-right)
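The corrected sketch, with the same assumed package:

```dockerfile
# Install only the package itself, not its recommended extras.
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-8-jdk
```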

Remove package manager cache

This is another common issue seen in the majority of Dockerfiles. Not only do we need to make sure that we are not installing any unnecessary dependencies and packages, we also need to clean the package manager cache within the same RUN instruction. Otherwise this cache will keep occupying storage space and increase the size of the output image.

Building on our previous Dockerfile, we can use the command:

rm -rf /var/lib/apt/lists/*

to clear the package cache for apt. Below is the modified Dockerfile code:

(image: clean-package-manager-cache-right)
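Sketched, with the cleanup chained into the same RUN instruction:

```dockerfile
# Install and clean up in one layer, so the apt cache is never
# baked into the image.
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-8-jdk \
    && rm -rf /var/lib/apt/lists/*
```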

Note that removing it in a separate RUN instruction would not reduce the image size: the cache would still be present in the earlier layer, and the later layer would merely mask it.

Use official images whenever possible

Instead of going through the pain of installing Java onto a plain Ubuntu container and keeping track of all the best practices yourself, one can simply use the officially available image for it. Official images can save a lot of time spent on maintenance, because all the installation steps are done and best practices are applied. If you have multiple projects, they can share those layers because they use exactly the same base image.

So we can modify our Dockerfile to look like the one below:

(image: use-official-images-right)
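A sketch using the official OpenJDK image; the jar path is an assumption:

```dockerfile
FROM openjdk:8

COPY target/app.jar /app.jar

CMD ["java", "-jar", "/app.jar"]
```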

Official images are already available for most of the software packages that you’ll need.

Use more specific tags for base images

Avoid using the latest tag as much as possible. It has the convenience of always being available for official images on Docker Hub or any other official repository, but there can be breaking changes over time. Depending on how far apart in time you rebuild the Dockerfile without cache, you may have failing builds.

Instead, use more specific tags for your base images. In our previous Dockerfile, instead of using openjdk:latest, we switched to using openjdk:8. You can always check the documentation provided by the image vendor for the various available tags.

This is also true for the images that you generate yourself. If you deploy based on the latest tag, every deployment will pick up whatever image currently carries that tag, and in case of a rollback you would not be able to switch to the previous version unless you do it manually.

Look for minimal flavors of the Image

Some image tags contain variants such as slim or alpine. While a slim image is based on a stripped-down version of the distribution and has a smaller size than the usual one, alpine images are smaller still, because the Alpine variant is based on the even smaller Alpine Linux distribution image.

In most cases, an output image based on the slim or alpine variant should work for you, but in some cases it may create compatibility issues. If the alpine or slim images are not causing any issues, go for those flavors first.

Also, prefer an image which contains the runtime, not the development kit. So instead of using the openjdk:8 image, we can use the openjdk:8-jre image, which is smaller still. And instead of using openjdk:8-jre as the starting point, we can use the openjdk:8-jre-alpine image.

Build inside Docker container for clean builds

So far, the Dockerfiles above have assumed that your build artifact was built on the host. This is not ideal, because the host machine may contain libraries from other builds, and those inconsistencies might get reflected in your build. A great way to get clean builds is to build inside Docker containers. This also provides consistent build environments.

We need to start by identifying everything that’s needed to build our application. Our simple Java application requires Maven and the JDK, so we will base our Dockerfile on a specific minimal official maven image from Docker Hub that includes the JDK. If you needed to install more dependencies, you could do so in a RUN step.

Below is the modified Dockerfile:

(image: build-inside-docker-container)
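A sketch of such a build-inside-the-container Dockerfile; the exact maven tag and artifact name are assumptions:

```dockerfile
# maven:3.6-jdk-8-alpine bundles Maven and the JDK, so the whole
# build runs inside the container, independent of the host.
FROM maven:3.6-jdk-8-alpine

WORKDIR /app
COPY pom.xml .
COPY src ./src

RUN mvn package

CMD ["java", "-jar", "target/app.jar"]
```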

Use Multi-Stage Dockerfiles

Although we were able to generate clean builds and create an image, this created two problems. First, every time the pom.xml file changes, the build will fetch all the dependencies again. Second, all the build-time dependencies are still present in the generated image, while we need only the jar file for our microservice to work. So we can create a multi-stage Dockerfile to tackle both problems:

(image: use-multi-stage-dockerfiles-right)
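A sketch of the multi-stage version; mvn dependency:go-offline is one common way to cache Maven dependencies against pom.xml alone, and the artifact name is assumed:

```dockerfile
# Stage 1: the build stage, named BUILD for later reference.
FROM maven:3.6-jdk-8-alpine AS BUILD

WORKDIR /app
# Copy pom.xml on its own first, so dependencies are re-fetched
# only when pom.xml itself changes.
COPY pom.xml .
RUN mvn dependency:go-offline

COPY src ./src
RUN mvn package

# Stage 2: the final image; only the jar is carried over,
# and the BUILD stage is not shipped.
FROM openjdk:8-jre-alpine
COPY --from=BUILD /app/target/app.jar /app.jar

CMD ["java", "-jar", "/app.jar"]
```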

Multi-stage builds can be identified by the use of multiple FROM statements. Each FROM starts a new stage. Stages can be named with the AS keyword, which we use to name our first stage BUILD so it can be referenced later. It will include all our build dependencies in a consistent environment.

The second stage is our final stage, which results in the final image. It includes only what is strictly necessary for the runtime, in this case a minimal JRE (Java Runtime Environment) based on Alpine. The intermediary builder stage is cached but not present in the final image. In order to get build artifacts into our final image, use COPY --from=STAGE_NAME. So we have used --from=BUILD in our Dockerfile.

Summary and Notes

There are multiple ways to write Dockerfiles, but writing them efficiently requires following certain practices. In this post, we have discussed some of those practices.

The source code used in this blog post can be found here on GitHub under master and blog/8460 branches.
