Docker

Introduction

Linux Namespace

Overview

Linux Namespace is a kernel-level environment isolation mechanism provided by Linux. Unix has a system call, chroot, which jails a process into a specific directory by changing its root directory. chroot provides a simple isolation model: processes inside the chroot cannot access files outside it. Linux Namespace builds on this concept, providing isolation for UTS, IPC, mount, PID, network, and user resources.

In Linux, the ancestor of all processes has PID 1. Similar to chroot, if we can jail a user's process space into one branch of the process tree and make the processes in that branch see their top-level ancestor as PID 1, we achieve resource isolation (processes in different PID namespaces cannot see each other).

Three main system calls:

  • clone() – Creates a new process with isolation by setting namespace parameters.

  • unshare() – Moves the calling process out of its current namespace and into a newly created one.

  • setns() – Attaches a process to an existing namespace.
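Each process's namespace membership is exposed as symbolic links under /proc/<pid>/ns, which gives a simple way to check what clone(), unshare(), or setns() changed. The following is a minimal sketch, assuming a Linux host; the helper name read_namespaces and its ns_dir parameter are illustrative and not part of any library.

```python
import os

def read_namespaces(ns_dir="/proc/self/ns"):
    """Return {namespace_name: link_target} for the symlinks in ns_dir.

    On Linux each entry is a symlink such as 'uts -> uts:[4026531838]'.
    Two processes share a namespace when their link targets are equal.
    """
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        path = os.path.join(ns_dir, name)
        if os.path.islink(path):
            result[name] = os.readlink(path)
    return result

if __name__ == "__main__":
    # Only meaningful on Linux, where /proc/self/ns exists.
    if os.path.isdir("/proc/self/ns"):
        for name, target in read_namespaces().items():
            print(f"{name:10s} {target}")
```

Comparing the output for a parent process and for a child created with a namespace flag should show a different target for the isolated namespace and identical targets for the rest.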

Linux Namespace Types

Type                System Call Flag    Kernel Version
Mount namespaces    CLONE_NEWNS         Linux 2.4.19
UTS namespaces      CLONE_NEWUTS        Linux 2.6.19
IPC namespaces      CLONE_NEWIPC        Linux 2.6.19
PID namespaces      CLONE_NEWPID        Linux 2.6.24

clone() System Call

Compile and run to verify

UTS Namespace

After running the program, the child process's hostname becomes container.

IPC Namespace

IPC (Inter-Process Communication) is how processes communicate in Unix/Linux, and includes shared memory, semaphores, and message queues. To achieve isolation, IPC must be isolated so that only processes within the same namespace can communicate with each other. IPC objects are identified by global IDs, so the namespace must keep these IDs separate from those of other namespaces.

To enable IPC isolation, add the CLONE_NEWIPC flag when calling clone
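Namespace flags are plain bit values that are OR-ed together in the flags argument of clone(). The sketch below shows only that arithmetic and does not call clone() itself. The constants are the values from <linux/sched.h>; the helper clone_flags, the NAMESPACE_FLAGS table, and the choice to include SIGCHLD (so the parent can wait for the child) are assumptions made for illustration.

```python
import signal

# Namespace flag values as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000    # mount
CLONE_NEWUTS = 0x04000000   # hostname / domain name
CLONE_NEWIPC = 0x08000000   # System V IPC, POSIX message queues
CLONE_NEWUSER = 0x10000000  # user and group IDs
CLONE_NEWPID = 0x20000000   # process IDs
CLONE_NEWNET = 0x40000000   # network stack

NAMESPACE_FLAGS = {
    "mount": CLONE_NEWNS,
    "uts": CLONE_NEWUTS,
    "ipc": CLONE_NEWIPC,
    "user": CLONE_NEWUSER,
    "pid": CLONE_NEWPID,
    "net": CLONE_NEWNET,
}

def clone_flags(*namespaces):
    """OR together SIGCHLD and the flags for the requested namespaces."""
    flags = int(signal.SIGCHLD)
    for name in namespaces:
        flags |= NAMESPACE_FLAGS[name]
    return flags

if __name__ == "__main__":
    # UTS isolation plus IPC isolation, as described in the text.
    print(hex(clone_flags("uts", "ipc")))
```

Adding IPC isolation to a program that already isolates UTS therefore amounts to setting one more bit in the same flags word.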

First create an IPC Queue with global Queue ID 0

Run the program to verify IPC Queue isolation

PID Namespace

Run the program to verify

Role of PID 1: The process with PID 1 is init, which has a special role. As the ancestor of all processes, it has special privileges (e.g., signal masking) and monitors the state of all processes. If a process is orphaned (its parent exits without waiting for it), init adopts it and reclaims its resources when it terminates. To isolate the process space, we therefore need a process with PID 1, ideally making the child process appear as PID 1 inside the container.

However, running ps, top, etc. in the child shell still shows all processes, indicating incomplete isolation. This is because commands like ps and top read from the /proc filesystem, which is shared between parent and child processes. Therefore, filesystem isolation is also needed.
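The reason ps and top still see every process can be illustrated by doing what they do: enumerating the numeric directories under /proc. This is a minimal sketch; the function name visible_pids and its proc_root parameter are illustrative.

```python
import os

def visible_pids(proc_root="/proc"):
    """List the PIDs that appear as numeric directories under proc_root."""
    pids = []
    for entry in os.listdir(proc_root):
        full_path = os.path.join(proc_root, entry)
        if entry.isdecimal() and os.path.isdir(full_path):
            pids.append(int(entry))
    return sorted(pids)

if __name__ == "__main__":
    if os.path.isdir("/proc"):
        pids = visible_pids()
        print(f"{len(pids)} processes visible through /proc")
```

As long as the child still reads the parent's /proc mount, this listing reflects the parent's PID namespace, whatever PID the child sees for itself.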

Mount Namespace

Enable mount namespace and remount /proc filesystem in the child process

Run the program to verify

User Namespace

User Namespace uses the CLONE_NEWUSER flag. Once it is enabled, the UID and GID seen inside the namespace differ from the external ones and default to 65534: with no mapping in place, the container cannot resolve its real UID and falls back to the overflow UID (defined in /proc/sys/kernel/overflowuid).

To map container UIDs to real system UIDs, write to /proc/<pid>/uid_map and /proc/<pid>/gid_map. Each line of these files has the format ID-inside-ns ID-outside-ns length.

Where:

  • The first field, ID-inside-ns, is the UID or GID displayed inside the container.

  • The second field, ID-outside-ns, is the real UID or GID it maps to outside the container.

  • The third field, length, is the size of the mapped range, typically 1 for a one-to-one mapping. For example, the line 0 1000 1 maps real uid=1000 to container uid=0.

Another example: the line 0 0 4294967295 maps UIDs starting from 0 inside the namespace to UIDs starting from 0 outside, covering the maximum range of an unsigned 32-bit integer.
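The three-field mapping described above can be sketched as a few small helpers that build a map line, parse the contents of a uid_map or gid_map file, and translate an inside ID to its outside ID. The function names are illustrative only, and nothing here writes to /proc.

```python
def format_id_map(inside, outside, length=1):
    """Build one map line: 'ID-inside-ns ID-outside-ns length'."""
    return f"{inside} {outside} {length}"

def parse_id_map(text):
    """Parse the contents of a uid_map/gid_map file into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries

def to_outside(entries, inside_id):
    """Translate an ID inside the namespace to the ID outside, or None if unmapped."""
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None

if __name__ == "__main__":
    line = format_id_map(0, 1000, 1)        # real uid 1000 appears as uid 0
    print(line)
    print(to_outside(parse_id_map(line), 0))
```

An ID with no matching range returns None here, which corresponds to the case where the kernel reports the overflow ID instead.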

Notes:

  • The process writing these files needs the CAP_SETUID (CAP_SETGID) capability in this namespace (see Capabilities).

  • The writing process must be in a parent or child user namespace of this user namespace.

  • Additionally, one of the following conditions must be met: 1) the writing process maps its own effective uid/gid in the parent user namespace into the child user namespace, or 2) the writing process has CAP_SETUID/CAP_SETGID in the parent user namespace, in which case it can map any uid/gid from the parent namespace.

The program above uses a pipe to synchronize the parent and child processes. This is necessary because the child process calls execv, which replaces the entire process image. The user namespace uid/gid mapping must be complete before execv runs, so that the /bin/bash it launches sees inside-uid 0 and shows the # prompt.
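The pipe-based handshake can be sketched on its own, without any namespace calls. In this illustration the child blocks on a pipe until the parent closes its write end; in the scenario described above, the parent would write uid_map and gid_map before closing, and the child would call execv after the read returns. The function name run_synchronized and the message strings are hypothetical.

```python
import os

def run_synchronized():
    """Fork a child that waits on a pipe until the parent finishes its setup."""
    ready_r, ready_w = os.pipe()     # parent -> child: "setup done" signal
    report_r, report_w = os.pipe()   # child -> parent: what the child did

    pid = os.fork()
    if pid == 0:                     # child process
        try:
            os.close(ready_w)
            os.close(report_r)
            # Blocks until every write end is closed, then returns b"" (EOF).
            os.read(ready_r, 1)
            # A real program would call execv("/bin/bash", ...) at this point.
            os.write(report_w, b"child resumed after parent setup")
        finally:
            os._exit(0)

    # parent process
    os.close(ready_r)
    os.close(report_w)
    events = ["parent setup done"]   # stand-in for writing uid_map / gid_map
    os.close(ready_w)                # EOF on the pipe releases the child
    message = os.read(report_r, 1024).decode()
    os.waitpid(pid, 0)
    os.close(report_r)
    events.append(message)
    return events

if __name__ == "__main__":
    for event in run_synchronized():
        print(event)
```

Closing the write end, rather than writing a byte, is one common way to signal the child; either approach keeps the child from proceeding before the parent's setup is complete.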

Run the program

Although the container shows root, the /bin/bash process actually runs as the regular ubuntu user on the host, which improves container security. A user namespace can be created by a regular user, but creating the other namespaces requires root privileges. To use multiple namespaces together, first create a user namespace as a regular user, map that user to root, then create the other namespaces as root inside the container.

Network Namespace

Network Namespaces are typically created using the ip command. Note: The host may be a VM, and the physical NIC may be a virtual NIC capable of routing IPs.

In a Docker container, use ip link show or ip addr show to view the host network

How to simulate the above scenario:
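One way to picture such a setup is as an ordered list of ip commands that create a network namespace, a bridge, and a veth pair connecting the two. The sketch below only builds and prints that list and executes nothing. It assumes a bridge-plus-veth topology, and every name and address in it (ns1, br0, veth-host, veth-ns, 192.168.10.11/24) is a hypothetical placeholder rather than a value from the original example.

```python
def netns_setup_commands(ns="ns1", bridge="br0", host_if="veth-host",
                         ns_if="veth-ns", ns_addr="192.168.10.11/24"):
    """Return ip(8) commands, as argv lists, that attach a namespace to a bridge."""
    in_ns = ["ip", "netns", "exec", ns]
    return [
        # Create the namespace and bring up its loopback interface.
        ["ip", "netns", "add", ns],
        in_ns + ["ip", "link", "set", "dev", "lo", "up"],
        # Create the bridge on the host side.
        ["ip", "link", "add", "name", bridge, "type", "bridge"],
        ["ip", "link", "set", "dev", bridge, "up"],
        # Create a veth pair and move one end into the namespace.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        ["ip", "link", "set", ns_if, "netns", ns],
        # Attach the host end to the bridge.
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", "dev", host_if, "up"],
        # Configure the end that lives inside the namespace.
        in_ns + ["ip", "addr", "add", ns_addr, "dev", ns_if],
        in_ns + ["ip", "link", "set", "dev", ns_if, "up"],
    ]

if __name__ == "__main__":
    for command in netns_setup_commands():
        print(" ".join(command))
```

Running the printed commands requires root privileges; keeping them as data makes the ordering easy to review before anything is applied.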

Docker networking differs from the above in two ways:

  • Docker resolv.conf uses Mount Namespace instead of the above method

  • Docker uses the process PID as the Network Namespace name.

Add a new NIC to a running Docker container, e.g., add an eth1 NIC with a static externally-accessible IP address.

The external "physical NIC" must be set to promiscuous mode so that eth1 can broadcast its MAC address via ARP. The external switch then forwards packets for this IP to the "physical NIC". In promiscuous mode, eth1 receives the relevant data, enabling Docker container network connectivity with the external network.

Linux Cgroup

Linux CGroup (Linux Control Group) is a Linux kernel feature for limiting, controlling, and isolating resource usage (CPU, memory, disk I/O, etc.) of process groups.

Linux CGroup allows you to allocate resources — such as CPU time, system memory, network bandwidth, or combinations thereof — to user-defined groups of tasks (processes). You can monitor configured cgroups, deny cgroups access to certain resources, and dynamically reconfigure cgroups on a running system. Main features:

  • Resource limitation: Limit resource usage, such as memory caps and filesystem cache limits.

  • Prioritization: Priority control for CPU utilization and disk I/O throughput.

  • Accounting: Auditing and statistics, primarily for billing purposes.

  • Control: Suspend and resume processes.

Using cgroups, system administrators can more precisely control the allocation, prioritization, denial, management, and monitoring of system resources, improving overall efficiency by better distributing hardware resources based on tasks and users.

  • Isolate a set of processes (e.g., all nginx processes) and limit their resource consumption, such as CPU core binding.

  • Allocate sufficient memory for the process group

  • Allocate appropriate network bandwidth and disk storage limits for the process group

  • Restrict access to certain devices (via device whitelisting)

View cgroup mount in Ubuntu

Or use the lssubsys command

If not available, mount manually

Create directories under /sys/fs/cgroup subdirectories

CPU Limit

Simulate a CPU-intensive program

Limit CPU for a custom cgroup
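With the cgroup v1 cpu controller, a CPU cap is expressed as a quota of microseconds per scheduling period, and a process is placed under the cap by adding its PID to the group's tasks file. The following is a sketch under those assumptions: the file names cpu.cfs_period_us, cpu.cfs_quota_us, and tasks belong to cgroup v1, while the helper names and any directory passed in (for example a group created under /sys/fs/cgroup/cpu) are illustrative.

```python
import os

def cpu_quota_us(percent, period_us=100000):
    """Quota in microseconds per period for a given share of one CPU."""
    return period_us * percent // 100

def limit_cpu(cgroup_dir, pid, percent, period_us=100000):
    """Write a CPU cap into cgroup_dir and add pid to the group."""
    quota = cpu_quota_us(percent, period_us)
    with open(os.path.join(cgroup_dir, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(cgroup_dir, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(quota))
    # Appending a PID to 'tasks' moves that process into the group.
    with open(os.path.join(cgroup_dir, "tasks"), "a") as f:
        f.write(f"{pid}\n")
    return quota

if __name__ == "__main__":
    # 20% of one CPU with the default 100 ms period.
    print(cpu_quota_us(20))
```

On a real system the directory must already exist inside the mounted cpu hierarchy, and writing to it requires root privileges.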

Thread code example

Memory Limit

Simulate a memory-intensive program (allocating 512 bytes each time, sleeping 1 second between allocations)

Limit memory
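With the cgroup v1 memory controller, the cap is a byte count written to memory.limit_in_bytes, and the process joins the group through the tasks file. This is a minimal sketch, assuming cgroup v1; the helpers parse_size and limit_memory, and the size suffixes they accept, are conveniences invented for this example.

```python
import os

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_size(text):
    """Convert '64K', '512M', '1G', or a plain number of bytes to an int."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

def limit_memory(cgroup_dir, pid, limit):
    """Write a memory cap into cgroup_dir and add pid to the group."""
    limit_bytes = parse_size(limit)
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    with open(os.path.join(cgroup_dir, "tasks"), "a") as f:
        f.write(f"{pid}\n")
    return limit_bytes

if __name__ == "__main__":
    print(parse_size("64K"))
```

A process that keeps allocating, like the test program described above, should be stopped by the kernel once its usage exceeds the configured limit.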

IO Limit

Test simulated I/O speed

Create a blkio (block device I/O) cgroup

Limit process I/O speed
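With the cgroup v1 blkio controller, a read-rate cap is one line of the form major:minor bytes_per_second written to blkio.throttle.read_bps_device, where major:minor identifies the block device. The sketch below assumes cgroup v1; the helper names are illustrative, and the device numbers in the example (8:0) are a placeholder, not a value from the original text.

```python
import os

def device_numbers(path):
    """Return (major, minor) for the device node at path, e.g. a disk under /dev."""
    st = os.stat(path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)

def throttle_line(major, minor, bytes_per_second):
    """Build the 'major:minor rate' line expected by the throttle files."""
    return f"{major}:{minor} {bytes_per_second}"

def limit_read_bps(cgroup_dir, major, minor, bytes_per_second):
    """Write a read-bandwidth cap for one device into cgroup_dir."""
    line = throttle_line(major, minor, bytes_per_second)
    with open(os.path.join(cgroup_dir, "blkio.throttle.read_bps_device"), "w") as f:
        f.write(line)
    return line

if __name__ == "__main__":
    # 1 MiB/s read cap on a hypothetical device 8:0.
    print(throttle_line(8, 0, 1024 * 1024))
```

As with the other controllers, the process being limited still has to be added to the group's tasks file for the cap to apply to it.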

Cgroup Subsystem

  • blkio — Sets I/O limits for block devices such as physical devices (disk, SSD, USB, etc.).

  • cpu — Uses the scheduler to provide cgroup task access to CPU.

  • cpuacct — Generates automatic reports on CPU usage by tasks in a cgroup.

  • cpuset — Assigns dedicated CPUs (on multi-core systems) and memory nodes to cgroup tasks.

  • devices — Allows or denies access to devices for tasks in a cgroup.

  • freezer — Suspends or resumes tasks in a cgroup.

  • memory — Sets memory limits for tasks in a cgroup and generates memory usage reports.

  • net_cls — Tags network packets with a class identifier (classid), allowing the Linux traffic controller (tc) to identify packets from specific cgroups.

  • net_prio — Sets network traffic priority

  • hugetlb — Limits HugeTLB (huge page filesystem) usage.

Cgroup Terminology

  • Task: A process in the system.

  • Control Group: A group of processes classified by certain criteria. Resource control in cgroups is implemented per control group. A process can join a control group, and resource limits are defined on the group. Simply put, a cgroup is represented as a directory with a set of configurable files.

  • Hierarchy: Control groups can be organized hierarchically as a tree (directory structure). Child nodes inherit attributes from parent nodes. Simply put, a hierarchy is a cgroups directory tree on one or more subsystems.

  • Subsystem: A resource controller, e.g., the CPU subsystem controls CPU time allocation. A subsystem must be attached to a hierarchy to take effect. Once attached, all control groups in that hierarchy are governed by the subsystem.
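These terms map directly onto /proc/<pid>/cgroup, where each line has the form hierarchy-ID:subsystem-list:cgroup-path and describes the control group the task belongs to in one hierarchy. A minimal parser sketch follows; the function name parse_proc_cgroup and the sample text are illustrative.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup contents into a list of dicts.

    Each line is 'hierarchy-ID:subsystem-list:cgroup-path'.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, subsystems, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hierarchy_id),
            "subsystems": subsystems.split(",") if subsystems else [],
            "path": path,
        })
    return entries

if __name__ == "__main__":
    sample = "11:cpu,cpuacct:/demo\n4:memory:/demo\n"
    for entry in parse_proc_cgroup(sample):
        print(entry)
```

In the sample, the cpu and cpuacct subsystems are attached to the same hierarchy, so the task has a single control group path for both.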

Docker Engine

Install

Storage

Overview

Volumes

Bind mounts

tmpfs mounts

Storage drivers

Btrfs

OverlayFS

ZFS

containerd snapshotters

Networking

Overview

Networking drivers

Bridge

Overlay

Host

IPvlan

Macvlan

None

Container

Custom Network Mode

Daemon

Docker Build

Build images

Multi-stage builds

Use multi-stage builds

Name build stages

Dockerfile

Example

Others

Docker Compose

Archery Docker Compose
