Docker cheatsheet | 2018-08-15

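Image-layering framework

At the bottom is the boot filesystem, bootfs. Once a container has booted, it is moved into memory and bootfs is unmounted to free the RAM used by the initrd disk image.

The second layer is the root filesystem, rootfs.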

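In a traditional Linux boot, rootfs is first mounted read-only, then switched to read-write once boot has finished and the integrity check has passed.
In Docker, rootfs always stays read-only; Docker uses union mounts to add more read-only (RO) filesystems on top of it.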

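Union mount: mounts several filesystems at once so that from the outside they appear as a single filesystem. The layers are overlaid, and the resulting filesystem contains the files and directories of all of them.

Docker calls each of these filesystems an image. (Because images stack, there are the notions of a parent image and a base image.)

Finally, when a container is started from an image, Docker mounts a read-write (RW) filesystem on top of the image; the processes we want to run execute in this RW layer. When the filesystem changes (for example, a file is modified), the file is first copied from the RO layer below into the RW layer. This is copy on write.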
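
The lookup and copy-on-write rules can be imitated with plain directories. This is a toy sketch only: the directory names (base, parent, rw) and both helper functions are made up for illustration, and it says nothing about how Docker's storage drivers are actually implemented.

```shell
#!/bin/sh
# Toy model of image layers: each layer is a plain directory.

# layer_lookup FILE LAYER...   (layers listed top-most first)
# Prints the path of the first layer that contains FILE.
layer_lookup() {
    file=$1
    shift
    for layer in "$@"; do
        if [ -e "$layer/$file" ]; then
            printf '%s\n' "$layer/$file"
            return 0
        fi
    done
    return 1
}

# cow_append FILE TEXT RW_LAYER RO_LAYER...
# Copies FILE up into the read-write layer if it only exists in a
# read-only layer, then appends TEXT to the read-write copy.
cow_append() {
    file=$1
    text=$2
    rw=$3
    shift 3
    if [ ! -e "$rw/$file" ]; then
        src=$(layer_lookup "$file" "$@") || return 1
        cp "$src" "$rw/$file"
    fi
    printf '%s\n' "$text" >> "$rw/$file"
}

demo=$(mktemp -d)
mkdir "$demo/base" "$demo/parent" "$demo/rw"
echo "from base" > "$demo/base/motd"
echo "nginx config" > "$demo/parent/nginx.conf"

# Before any write, motd is served from the read-only base layer.
layer_lookup motd "$demo/rw" "$demo/parent" "$demo/base"

# Writing triggers the copy-up; the base layer stays untouched.
cow_append motd "edited in container" "$demo/rw" "$demo/parent" "$demo/base"
layer_lookup motd "$demo/rw" "$demo/parent" "$demo/base"
```

After the write, the lookup resolves to the copy in rw, while base/motd still holds its original single line.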

Dockerfile

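The directory containing the Dockerfile is the build context. At build time Docker uploads the files and directories in it to the Docker daemon, so use .dockerignore to cut down on unnecessary copying. Instructions such as COPY cannot access files outside the context.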
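
A starter .dockerignore might look like the one written below. The entries are only examples (assumed, not taken from any particular project), and the scratch directory stands in for a real build context.

```shell
#!/bin/sh
# Write an example .dockerignore into a scratch build context.
context=$(mktemp -d)
cat > "$context/.dockerignore" <<'EOF'
# Version control metadata is not needed inside the image
.git
# Dependencies are reinstalled during the build
node_modules/
# Local logs and editor leftovers
*.log
*.swp
EOF

# Show what would be excluded from the upload to the daemon
cat "$context/.dockerignore"
```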

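Instructions in a Dockerfile run from top to bottom. Each instruction creates a new image layer and commits it (returning an image ID), so every layer is also cached. The cache speeds up most builds, but when a step such as apt-get update has to be re-run, pass --no-cache to docker build to bypass it.

Changing an ENV value in the template invalidates the cache for all subsequent instructions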

FROM ubuntu:14.04
# Bump this date to invalidate the cache for every instruction below it
ENV REFRESHED_AT 2014-07-01
RUN apt-get -qq update

EXPOSE

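Tells Docker that the application in the container will use the given port. The port is not published automatically: when starting the container with docker run, use -P to publish every port declared with EXPOSE (each mapped to a random host port), or -p to map a specific host port. Check the mapping with docker port <container> <port>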

CMD

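The command to run when the container starts (unlike RUN, which runs while the image is being built). Only one CMD instruction takes effect; to start several processes or run several commands, consider a process manager such as Supervisor. A command given on the docker run command line overrides CMD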

CMD ["/bin/ls", "-l"]

ENTRYPOINT

Easily confused with CMD; arguments on the docker run command line are passed to the command specified by ENTRYPOINT

ENTRYPOINT ["/usr/sbin/nginx"]
# ENTRYPOINT ["/usr/sbin/nginx", "-g", "daemon off;"]
# docker run -it <image> -g "daemon off;"

Combining ENTRYPOINT and CMD gives an image that runs a default command and can still be overridden from docker run

ENTRYPOINT ["/usr/sbin/nginx"]
CMD ["-h"]
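
How the two instructions combine can be modelled with a small function. This is a simplified sketch: effective_command is a made-up helper that joins plain strings, whereas Docker itself works with argument arrays.

```shell
#!/bin/sh
# Model of how the container's command line is assembled.
# effective_command ENTRYPOINT DEFAULT_CMD [RUN_ARGS]
#   ENTRYPOINT   the fixed executable
#   DEFAULT_CMD  the CMD from the Dockerfile
#   RUN_ARGS     anything given after the image name on `docker run`
effective_command() {
    entrypoint=$1
    default_cmd=$2
    run_args=${3-}
    if [ -n "$run_args" ]; then
        # Arguments from docker run replace CMD entirely
        printf '%s %s\n' "$entrypoint" "$run_args"
    else
        # No arguments given: CMD supplies the defaults
        printf '%s %s\n' "$entrypoint" "$default_cmd"
    fi
}

# Equivalent of: docker run <image>
effective_command /usr/sbin/nginx -h
# Equivalent of: docker run <image> -g "daemon off;"
effective_command /usr/sbin/nginx -h '-g "daemon off;"'
```

The first call yields the default (nginx with -h); the second shows the run-time arguments taking the place of CMD while the entrypoint stays fixed.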

ENV

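Environment variables for the build, usable in subsequent instructions. They also persist into containers created from the image; at run time they can be overridden with the docker run -e flag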

VOLUME

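Adds volumes to containers created from the image. A volume is a designated directory that can exist in one or more containers; it bypasses the union filesystem and provides the following for persistent data:

Volumes can be shared and reused between containers (without being committed to an image)
Changes to a volume take effect immediately
Changes to a volume are not included when the image is updated

Specify volumes with docker run -v; docker run --volumes-from <container> adds the volumes of the named container to the newly created container. Use docker cp to copy files to or from a container.

A logging container: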

docker run -d --name logstash \
    --volumes-from redis \
    --volumes-from nodeapp \
    logstash
docker logs -f logstash

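Other instructions

WORKDIR: sets the working directory for subsequent instructions in the Dockerfile
USER: sets the user the container runs as (e.g. USER user:group, USER uid:gid); defaults to root
COPY / ADD: copy files and directories from the build context into the image. COPY does not extract archives; the COPY destination is a path inside the container, either absolute or relative to WORKDIR; files and directories created by COPY get UID and GID 0
LABEL: adds metadata (key-value pairs) to the image; view it with docker inspect
ARG: variables passed to the build at docker build time (via --build-arg); a default value can be set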

Dockerfile best practices

# Replace latest with a pinned version tag from: https://hub.docker.com/_/alpine
# We suggest using the <major>.<minor> tag, not <major>.<minor>.<patch>
FROM alpine:latest
# Non-root user for security purposes
# UIDs below 10,000 are a security risk, as a container breakout could result in
# the container being run as a more privileged user on the host kernel with the same UID.
#
# Static GID/UID is also useful for chown'ing files outside the container
# where such a user does not exist.
RUN addgroup -g 10001 -S nonroot && adduser -u 10000 -S -G nonroot -h /home/nonroot nonroot
# --- Install packages here with `apk add --no-cache`, copy your binaries into /sbin/, etc ---
# Use tini as entrypoint:
# 1. It protects you from software that accidentally created zombie processes
# 2. It ensures that the default signal handlers work for the software you run in your Docker image.
#     For example, with Tini, SIGTERM properly terminates your process even if
#     you didn't explicitly install a signal handler for it.
RUN apk add --no-cache tini
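# Only store arguments in `CMD`.
# Replace 'myapp' with your binary. Keep comments on their own line:
# Dockerfiles do not support trailing comments after an instruction.
ENTRYPOINT ["/sbin/tini", "--", "myapp"]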
# bind-tools is needed for DNS resolution to work in some Docker networks.
# This applies to nslookup, Go binaries, etc.
RUN apk add --no-cache bind-tools
USER nonroot
# Default arguments for your app (remove if you have none)
CMD ["--foo", "1", "--bar=2"]

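Container image best practices

Avoid the latest tag, so that incompatible upstream updates cannot affect you
Use exec to start the application in the entrypoint script, e.g. exec java $JAVA_OPTS -jar $JAR for a Java process, so that the process receives the signals sent by the container runtime (such as SIGTERM) and can handle them
Declare ports: declare them explicitly with EXPOSE in the Dockerfile; exposed ports show up in docker ps / docker inspect
Declare environment variables: this makes key information, such as the artifact version and key parameters, discoverable without reading the Dockerfile
Non-root: if the container is compromised, a root user that escapes the container is root on the host
Multi-stage builds: a temporary build image installs the dependencies needed for compilation, so the production-ready image can stay as slim as possible
Avoid putting files in /tmp: some applications write cache data or heartbeat files to /tmp, which demands fast reads and writes. Some Linux distributions keep /tmp in memory via tmpfs, but in a Docker container /tmp sits on the standard Docker OverlayFS by default (check with docker run --rm -it nginx df); /dev/shm (the shm filesystem, a shared-memory in-memory filesystem) can be used instead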

# multi-stage build
# Stage 1: build wheels for all requirements in a full-size image
FROM python:3.6 AS base
COPY requirements.txt /
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /wheels -r requirements.txt
# Stage 2: install the prebuilt wheels into a slim runtime image
FROM python:3.6-alpine
COPY --from=base /wheels /wheels
COPY --from=base /requirements.txt .
RUN pip install --no-cache-dir /wheels/*
WORKDIR /app
COPY . /app
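
The advice to start the application with exec in the entrypoint script can be checked without Docker. In the sketch below the inner sh -c is only a stand-in for the real application (java, node, and so on).

```shell
#!/bin/sh
# With exec, the wrapper shell is replaced by the application, so the
# application keeps the wrapper's PID (PID 1 inside a container) and
# receives signals directly instead of through an intermediate shell.
pids=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

# First line: PID of the wrapper. Second line: PID of the exec'd process.
first=$(printf '%s\n' "$pids" | sed -n 1p)
second=$(printf '%s\n' "$pids" | sed -n 2p)

if [ "$first" = "$second" ]; then
    echo "same PID: exec replaced the wrapper"
fi
```

Without exec, the application would normally run as a child of the wrapper shell, and a signal sent to the wrapper may never reach it.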

Java Dockerfile Cheatsheet

https://merikan.com/2019/04/jvm-in-a-container/

By default the JVM reads the available memory from /proc (until Java 8u131 and Java 9 the JVM did not recognize memory or CPU limits set by the container. The first implementation was an experimental feature and had its flaws, but in Java 10 memory limits are automatically recognized and enforced. This feature was then backported to Java 8u191)
So on Java 8u191 (Update 191) and later, use -XX:+UseContainerSupport (together with MaxRAMPercentage / MinRAMPercentage; note: not MaxRAMFraction) instead of -XX:+UseCGroupMemoryLimitForHeap. This tells the JVM to read the available memory from cgroup, at /sys/fs/cgroup/memory/memory.limit_in_bytes
-Xmx sets the maximum heap size, typically 50% to 80% of the k8s memory limit (besides the heap, the JVM has memory overhead for various off-heap structures, such as Java threads, metaspace, native memory, socket buffers, etc.)
Unlike the other container-related VM flags, -Xmx is supported by every JVM version; it is an explicit size rather than an estimate (VM settings: Estimated)
Check the effective settings with java -XshowSettings:vm -XX:+PrintFlagsFinal -version

# (Java8u131) JVM maxHeapSize=1.73G, which has no clue that it's running in container with 100MB available mem
➜ docker run -m 100MB openjdk:8u131 java -XshowSettings:vm -version
VM settings:
    Max. Heap Size (Estimated): 1.73G
    Ergonomics Machine Class: server
    Using VM: OpenJDK 64-Bit Server VM
# JVM can check the cgroup memory limit and calculate a maxHeapSize (44M heapSize /100M maxRAM)
➜ docker run -m 100MB openjdk:8u131 java \
  -XX:+UnlockExperimentalVMOptions \
  -XX:+UseCGroupMemoryLimitForHeap \
  -XshowSettings:vm -version
VM settings:
    Max. Heap Size (Estimated): 44.50M
    ...
➜ docker run -m 100MB openjdk:8u131 java \
      -XX:+UnlockExperimentalVMOptions \
      -XX:+PrintFlagsFinal -version \
      | grep -E "UnlockExperimentalVMOptions | UseCGroupMemoryLimitForHeap | MaxRAMFraction | InitialRAMPercentage | MaxRAMPercentage | MinRAMPercentage"
uintx MaxRAMFraction                            = 4                                   {product}
 bool UnlockExperimentalVMOptions              := true                                {experimental}
 bool UseCGroupMemoryLimitForHeap               = false                               {experimental}
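
The rule of thumb of giving -Xmx 50% to 80% of the memory limit is simple integer arithmetic. A minimal sketch, where xmx_flag is a made-up helper and the 75% figure is just one value from that range:

```shell
#!/bin/sh
# xmx_flag LIMIT_MB PERCENT
# Prints a -Xmx flag sized as PERCENT of the container memory limit.
xmx_flag() {
    limit_mb=$1
    percent=$2
    heap_mb=$(( limit_mb * percent / 100 ))
    printf '%s\n' "-Xmx${heap_mb}m"
}

# A 1024 MB limit with 75% given to the heap
xmx_flag 1024 75
# The 100 MB container from the runs above, with 50% given to the heap
xmx_flag 100 50
```

The output can be placed into JAVA_OPTS in an entrypoint script; the remainder of the limit is left for threads, metaspace and other off-heap memory.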

Other

Nginx: set the number of worker processes from the CPU limit (the default, auto, starts as many workers as the host has CPU cores)
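
One way to derive the worker count is from the CFS quota and period that back the CPU limit. A rough sketch: workers_for_quota is a made-up helper, and it assumes the values have already been read, for example from cpu.cfs_quota_us and cpu.cfs_period_us on a cgroup v1 host (an assumption; cgroup v2 uses a different layout).

```shell
#!/bin/sh
# workers_for_quota QUOTA_US PERIOD_US FALLBACK
# Rounds the CPU limit (quota / period) up to a whole worker count.
# A quota of -1 means "no limit" in cgroup v1, so FALLBACK is used.
workers_for_quota() {
    quota=$1
    period=$2
    fallback=$3
    if [ "$quota" -le 0 ] || [ "$period" -le 0 ]; then
        printf '%s\n' "$fallback"
        return 0
    fi
    # Integer ceiling of quota / period
    printf '%s\n' $(( (quota + period - 1) / period ))
}

# A limit of 1.5 CPUs: quota=150000 with period=100000
workers_for_quota 150000 100000 4
# No CPU limit: keep the fallback
workers_for_quota -1 100000 4
```

The result would go into the worker_processes directive in place of auto.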

Docker networking

docker network create <network-name>
docker network ls
docker network inspect <network-name>
docker network connect <network-name> <container>
docker run --add-host=docker:10.0.0.1 --name app ...
# Adds a host entry named docker with address 10.0.0.1 to /etc/hosts; run env inside the container to view its environment variables

Image management

docker images
docker history <image>
docker inspect <image>
# Install `dive` (brew info dive) to inspect an image's layers
dive <image>
# Remove an image
docker rmi <image>

Container management

Inspecting containers

docker stats
docker ps --no-trunc
docker top <container>

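Starting containers

# Start a container in privileged mode, with root access to the host
docker run --privileged ...
# For throwaway containers, --rm removes the container automatically on exit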
docker run --rm ...

Connecting to containers: use volumes or network interfaces rather than SSH

docker exec -it <container> /bin/sh
docker kill -s <signal> <container>

Docker Machine

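On Linux, Docker provides an additional software abstraction layer plus automated management of operating-system-level virtualization. Docker uses the resource isolation features of the Linux kernel, such as cgroups and kernel namespaces, to build independent software containers. Docker runs in a client/server (C/S) model; originally the client and the daemon were started from the same binary, but newer versions ship them separately.

Ways the client connects to the daemon: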

tcp://host:port
unix://path/to/sock
fd://socket_fd
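
These address forms are what the docker -H flag and the DOCKER_HOST environment variable accept. The three schemes can be told apart with a case statement; describe_endpoint is a made-up helper for illustration, and the example addresses are assumptions.

```shell
#!/bin/sh
# describe_endpoint URL
# Classifies a daemon address by its scheme and strips the prefix.
describe_endpoint() {
    case $1 in
        tcp://*)  printf 'TCP socket at %s\n' "${1#tcp://}" ;;
        unix://*) printf 'Unix socket at %s\n' "${1#unix://}" ;;
        fd://*)   printf 'inherited file descriptor %s\n' "${1#fd://}" ;;
        *)        printf 'unsupported endpoint: %s\n' "$1"; return 1 ;;
    esac
}

describe_endpoint unix:///var/run/docker.sock
describe_endpoint tcp://127.0.0.1:2375
```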

Docker Remote API

// TBD

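Container orchestration

Service discovery is the mechanism by which the components of a distributed application manage their relationships with each other. It lets a component automatically find another component it wants to interact with. Because these applications are distributed, the service discovery mechanism has to be distributed too. And as the “glue” between the components of a distributed application, it must itself be dynamic, reliable and resilient, and able to share data about these services quickly and consistently.

Docker, in turn, focuses on distributed applications and on service-oriented and microservice architectures, which makes it a good fit for integration with a service discovery tool. Each Docker container can register the services it runs with the tool. The registered information can be an IP address, a port, or both, so that services can interact with each other.

Consul is a specialized datastore that uses the Raft consensus algorithm. It exposes a key-value store and a service catalog, and it is highly available, fault tolerant and strongly consistent. Consul also offers an API-based service catalog, in place of the key-value store that most traditional service discovery tools rely on.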

Kubernetes

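In a container environment, orchestration usually covers three areas:

Resource orchestration: allocates resources, e.g. limiting the resources available to a Namespace, or the Scheduler's different scheduling policies for resources
Workload orchestration: shares the workload across resources, e.g. k8s uses different Controllers to place Pods on suitable Nodes and manage their lifecycle
Service orchestration: handles service discovery, high availability and the like, e.g. in k8s a Service exposes a service internally and an Ingress exposes it externally

In k8s there are five controllers we commonly use for container orchestration: Deployment, StatefulSet, DaemonSet, CronJob and Job. Deployment is usually the controller for stateless workloads; StatefulSet is the controller for stateful workloads; DaemonSet runs on selected Nodes, one replica per Node. One trait of DaemonSet, at least in older Kubernetes releases, is that its Pods bypass the scheduler and are bound to a NodeName when they are created. CronJob handles scheduled tasks; it is a higher-level controller, somewhat like Deployment: when a schedule fires it creates a Job, and the Job is what actually runs the task.

Kubernetes already abstracts and packages a large number of common basic resources, which we can combine flexibly to solve problems. It also provides a set of automated operations mechanisms, such as HPA, VPA, Rollback and Rolling Update, for elastic scaling and rolling updates, and all of these features can be deployed declaratively in YAML.

But these abstractions still sit at the container level. A large application has to combine many native Kubernetes resources (a great many Services, Deployments, StatefulSets and so on), which gets tedious, and the dependencies between services are left to the user to resolve; there is no unified dependency management mechanism.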

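Application orchestration

What is an application? An application that serves external clients first needs a network to communicate over, then a carrier to run the service (Pods), and, if it stores data, matching storage. So we can say: application unit = network + workload carrier + storage

From there it is easy to map Kubernetes resources onto this and divide them into four types of application:

Stateless application = Services + Volumes + Deployment
Stateful application = Services + Volumes + StatefulSet
Daemon application = Services + Volumes + DaemonSet
Batch application = Services + Volumes + CronJob/Job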

8 Docker use cases

Simplifying Configuration
Developer Productivity
Server Consolidation
Multi-tenancy
Code Pipeline Management
App Isolation
Debugging Capabilities
Rapid Deployment

Using Docker in production

Challenges

Ops plane integrations (logging, metrics, monitoring)
Optimized build and deploy time
Security (vulnerabilities and capabilities that can be exploited)
Handling secrets properly
Persisting data
Resource limits for collocated containers

Practice

Init an express app:

# Using express application generator:
npx express-generator
# Install dependencies and run application
npm i && npm start

Dockerfile v1:

FROM node:14
WORKDIR /usr/src/app
# using . instead of * to keep directory structure
COPY . ./
RUN npm install
EXPOSE 3000
CMD ["node", "bin/www"]

Test and run:

# add `node_modules/` to .dockerignore
docker build --tag node.local.v1 --file dockerfile-v1 .
docker run -p 3000:3000 node.local.v1
# dive node.local.v1

Dockerfile v1 Problems:

Big container image
- Large build contexts result in slow builds and bigger images
Problems with build caching
- Cache busting instruction combinations cause slow builds
Running as root
Only the major version of node is pinned
- Lack of version pinning turns into reproducibility problems
Not handling signals and not handling orphaned processes

Dockerfile v2

FROM node:14.16.1 as builder
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm ci
COPY . ./
FROM gcr.io/distroless/nodejs:14
WORKDIR /usr/src/app
COPY --from=builder /usr/src/app .
EXPOSE 3000
USER nonroot
CMD ["bin/www"]

Use docker-slim to shrink the image further

Summary

Container:

package & deploy services
allow for process isolation
immutability
efficient resource utilization
are lightweight in creation

Container orchestration:

integrate and orchestrate these modular parts
scale up & scale down
fault tolerant
provide communication across a cluster

k8s features:

Horizontal infra scaling
Auto-scaling: automatically change the number of running containers, based on CPU utilization or other app-provided metrics
Replication controller: RC makes sure the specified number of Pod replicas is running at all times. (terminates the extra pods if there are too many / starts more pods if there are too few)
Health checks and self-healing (auto-replacement)
Traffic routing and load balancing
Automated rollouts and rollbacks: handles rollouts for new version or updates without downtime while monitoring the containers’ health. In case the rollout doesn’t go well, it auto rolls back.
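
The replication controller behaviour in the list above boils down to comparing a desired count with the observed count. A toy sketch of that single decision, with reconcile as a made-up function; the real controller watches the API server and acts on Pods rather than printing text.

```shell
#!/bin/sh
# reconcile DESIRED RUNNING
# Prints the action needed to bring RUNNING pods in line with DESIRED.
reconcile() {
    desired=$1
    running=$2
    if [ "$running" -lt "$desired" ]; then
        printf 'start %d\n' $(( desired - running ))
    elif [ "$running" -gt "$desired" ]; then
        printf 'terminate %d\n' $(( running - desired ))
    else
        printf 'steady\n'
    fi
}

reconcile 3 1   # too few pods running
reconcile 3 5   # too many pods running
reconcile 3 3   # nothing to do
```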

Adoption:

Without k8s, large teams would have to manually script their own deployment workflows. Containers, combined with an orchestration tool, provide management of machines and services — improving the reliability of your application while reducing the amount of time and resources spent on DevOps.
k8s has built-in features like self-healing and automated rollouts/rollbacks, effectively managing the containers for you.
Use k8s only when your application uses a micro-service arch.