Docker
Introduction
Linux Namespace
Overview
Linux Namespace is a kernel-level environment isolation mechanism provided by Linux. Unix has a system call called chroot, which jails a process into a specific directory by changing its root directory. chroot provides a simple isolation model: a process inside the chroot cannot access files outside it. Linux Namespace builds on this concept, providing isolation mechanisms for UTS, IPC, mount, PID, network, and User.
In Linux, the ancestor of all processes has PID 1. Similar to chroot, if we can confine a group of user processes to one branch of the process tree and make the processes in that branch see its top process as PID 1, we achieve resource isolation (processes in different PID namespaces cannot see each other).
Three main system calls:
clone(): Creates a new process and, depending on the namespace flags passed to it, places that process in new namespaces.
unshare(): Moves the calling process out of its current namespace and into a new one.
setns(): Attaches the calling process to an existing namespace.
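Before creating namespaces with these calls, it can help to see which namespaces a process already belongs to. The following is a minimal sketch, assuming a Linux system where the kernel exposes each namespace as a symlink under /proc/<pid>/ns with a target of the form kind:[inode]. The helper names parse_ns_link and list_namespaces are illustrative and not part of any library.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "uts:[4026531838]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into (kind, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def list_namespaces(pid="self", proc_root="/proc"):
    """Return {entry_name: inode} for the namespaces of one process.

    Returns an empty dict when the directory is missing or unreadable
    (for example on non-Linux systems).
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return result
    for name in entries:
        try:
            target = os.readlink(os.path.join(ns_dir, name))
            _kind, inode = parse_ns_link(target)
        except (OSError, ValueError):
            continue  # skip entries that are not namespace links
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in list_namespaces().items():
        print(f"{name:20s} {inode}")
```

Two processes are in the same namespace of a given kind exactly when the inode numbers match, so comparing this output for a parent and a child started with a namespace flag shows which namespaces actually changed.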
clone() System Call
Compile and run to verify
UTS Namespace
After running the program, the child process hostname becomes container
IPC Namespace
IPC (Inter-Process Communication) is a communication method between processes in Unix/Linux, including shared memory, semaphores, and message queues. To achieve isolation, IPC must be isolated so that only processes within the same Namespace can communicate. Each IPC object is identified by a global ID, so the Namespace must keep these IDs separate from those in other Namespaces.
To enable IPC isolation, add the CLONE_NEWIPC flag when calling clone
First create an IPC Queue with global Queue ID 0
Run the program to verify IPC Queue isolation
PID Namespace
Run the program to verify
Role of PID 1: The process with PID 1 is init, which has a special role. As the ancestor of all processes, it has special privileges (e.g., signals for which it has not installed a handler are ignored) and it monitors the state of all processes. If a child process is orphaned (its parent exited without waiting for it), init adopts it and reclaims its resources when it terminates. To achieve process space isolation, we need to create a process with PID 1, ideally making the child process PID appear as 1 inside the container.
However, running ps, top, etc. in the child shell still shows all processes, indicating incomplete isolation. This is because commands like ps and top read from the /proc filesystem, which is shared between parent and child processes. Therefore, filesystem isolation is also needed.
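The dependence of ps and top on /proc can be seen by listing the numeric entries of a proc mount directly. This is a small sketch, not how ps is implemented in detail: it only assumes that every visible process appears as a directory named after its PID. The function name visible_pids and the proc_root parameter are illustrative.

```python
import os


def visible_pids(proc_root="/proc"):
    """Return the sorted PIDs that appear as numeric entries under proc_root."""
    try:
        names = os.listdir(proc_root)
    except OSError:
        return []  # no proc filesystem mounted at this path
    return sorted(int(name) for name in names if name.isdecimal())


if __name__ == "__main__":
    pids = visible_pids()
    print(f"{len(pids)} processes visible, lowest PIDs: {pids[:5]}")
```

Run inside a new PID namespace that still shares the parent's /proc mount, this would still report the host's processes, because the listing depends on which proc filesystem is mounted, not on the caller's PID namespace.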
Mount Namespace
Enable mount namespace and remount /proc filesystem in the child process
Run the program to verify
User Namespace
User Namespace uses the CLONE_NEWUSER flag. After enabling it, the internal UID and GID differ from the external ones and default to 65534, because no mapping to a real UID exists yet and the kernel falls back to the overflow UID (defined in /proc/sys/kernel/overflowuid).
To map container UIDs to real system UIDs, modify /proc/pid/uid_map and /proc/pid/gid_map. Each line in these files has three fields, in the form ID-inside-ns ID-outside-ns length.
Where:
The first field ID-inside-ns represents the UID or GID displayed inside the container,
The second field ID-outside-ns represents the real UID or GID mapped outside the container.
The third field represents the mapping range, typically 1 for one-to-one mapping. For example, mapping real uid=1000 to container uid=0
Another example: mapping uid starting from 0 inside the namespace to uid starting from 0 outside, with the maximum range of unsigned 32-bit integer
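The three-field format can be made concrete with a small parser. This is a sketch under the assumption that each map line is three whitespace-separated integers (inside ID, outside ID, length). The names IdMapEntry, parse_id_map, to_outside, and format_id_map are illustrative helpers, not kernel or library APIs.

```python
from typing import NamedTuple


class IdMapEntry(NamedTuple):
    inside: int   # first ID as seen inside the namespace
    outside: int  # first ID as seen outside the namespace
    length: int   # number of consecutive IDs covered by this entry


def parse_id_map(text):
    """Parse the contents of a uid_map or gid_map file into entries."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        entries.append(IdMapEntry(*(int(field) for field in fields)))
    return entries


def to_outside(entries, inside_id):
    """Translate an ID inside the namespace to the outside ID, or None if unmapped."""
    for entry in entries:
        offset = inside_id - entry.inside
        if 0 <= offset < entry.length:
            return entry.outside + offset
    return None


def format_id_map(entries):
    """Render entries in the form expected when writing a map file."""
    return "".join(f"{e.inside} {e.outside} {e.length}\n" for e in entries)


if __name__ == "__main__":
    one_to_one = parse_id_map("0 1000 1")         # real uid 1000 appears as uid 0
    print(to_outside(one_to_one, 0))              # 1000
    full_range = parse_id_map("0 0 4294967295")   # identity mapping over the full range
    print(to_outside(full_range, 65534))          # 65534
```

With the one-entry mapping, any inside UID other than 0 has no counterpart outside, which is the case where the kernel reports the overflow UID instead.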
Notes:
The process writing these files needs CAP_SETUID (CAP_SETGID) capability in this namespace (see Capabilities)
The writing process must be in this user namespace or in its parent user namespace.
Additionally, one of the following conditions must be met: 1) the writing process maps its own effective uid/gid in the parent user namespace into the child user namespace, or 2) the writing process has CAP_SETUID/CAP_SETGID in the parent user namespace, in which case it can map any uid/gid of the parent user namespace.
The program above uses a pipe to synchronize parent and child processes. This is necessary because the child process calls execv, which replaces the entire process space. We need to complete the user namespace uid/gid mapping before execv, so that /bin/bash launched by execv will show the # prompt due to the inside-uid being set to 0.
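The synchronization pattern itself does not depend on namespaces and can be tried without privileges. Below is a minimal sketch using os.fork and os.pipe: the child blocks on a read until the parent closes its write end, which is the point at which a real launcher would have finished writing uid_map and gid_map. The function name run_with_pipe_sync and the second "report" pipe are illustrative additions used only to observe the ordering.

```python
import os


def run_with_pipe_sync():
    """Fork a child that waits on a pipe until the parent finishes its setup."""
    sync_r, sync_w = os.pipe()      # parent -> child: "setup is done"
    report_r, report_w = os.pipe()  # child -> parent: only to observe ordering
    events = []

    pid = os.fork()
    if pid == 0:
        # Child: close the ends it does not use, then block until EOF.
        try:
            os.close(sync_w)
            os.close(report_r)
            os.read(sync_r, 1)  # returns b"" once the parent closes sync_w
            os.write(report_w, b"child-continued")
            # A real launcher would call os.execv("/bin/bash", ...) here.
        finally:
            os._exit(0)

    # Parent: this is where uid_map and gid_map would be written.
    os.close(sync_r)
    os.close(report_w)
    events.append("parent-setup-done")
    os.close(sync_w)  # EOF on the pipe releases the child

    events.append(os.read(report_r, 64).decode())
    os.close(report_r)
    os.waitpid(pid, 0)
    return events


if __name__ == "__main__":
    print(run_with_pipe_sync())
```

The child must close its own copy of the write end before reading. Otherwise the pipe never reaches end-of-file and the child waits forever.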
Run the program
Although the container shows root, the /bin/bash process actually runs as the regular ubuntu user on the host, which improves container security. A User Namespace can be created by a regular user, whereas creating the other Namespaces requires root privileges. To use multiple Namespaces simultaneously, first create a User Namespace as a regular user, map that user to root, then create the other Namespaces as root inside the container.
Network Namespace
Network Namespaces are typically created using the ip command. Note: The host may be a VM, and the physical NIC may be a virtual NIC capable of routing IPs.
On a Docker host, use ip link show or ip addr show to view the host network
How to simulate the above scenario:
Docker networking differs from the above in two ways:
Docker resolv.conf uses Mount Namespace instead of the above method
Docker uses the process PID as the Network Namespace name.
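A namespace connected to a bridge through a veth pair can be described as a short sequence of ip commands. The sketch below only builds the command lists and executes nothing, since running them requires root. All device, bridge, and namespace names are hypothetical parameters chosen by the caller, and the bridge is assumed to exist already. The second helper shows one known way to make a process's network namespace visible to ip netns under its PID, by symlinking it into /var/run/netns.

```python
def veth_setup_commands(ns, bridge, host_if, ns_if, address):
    """Return ip(8) command lists for one namespace; nothing is executed."""
    return [
        # Create the namespace and bring up its loopback device.
        ["ip", "netns", "add", ns],
        ["ip", "netns", "exec", ns, "ip", "link", "set", "dev", "lo", "up"],
        # Create a veth pair and move one end into the namespace.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        ["ip", "link", "set", ns_if, "netns", ns],
        # Configure the end inside the namespace.
        ["ip", "netns", "exec", ns, "ip", "addr", "add", address, "dev", ns_if],
        ["ip", "netns", "exec", ns, "ip", "link", "set", "dev", ns_if, "up"],
        # Attach the host end to the bridge and bring it up.
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", "dev", host_if, "up"],
    ]


def netns_link_for_pid(pid):
    """Return (source, destination) for naming a process's netns after its PID."""
    return f"/proc/{pid}/ns/net", f"/var/run/netns/{pid}"


if __name__ == "__main__":
    for cmd in veth_setup_commands("ns1", "br0", "veth-ns1", "eth0", "192.168.10.11/24"):
        print(" ".join(cmd))
```

Each list could be passed to subprocess.run(cmd, check=True) by a root process. Keeping construction separate from execution makes the sequence easy to review first.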
Add a new NIC to a running Docker container, e.g., add an eth1 NIC with a static externally-accessible IP address.
The external "physical NIC" must be set to promiscuous mode so that eth1 can broadcast its MAC address via ARP. The external switch then forwards packets for this IP to the "physical NIC". In promiscuous mode, eth1 receives the relevant data, enabling Docker container network connectivity with the external network.
Linux Cgroup
Linux CGroup (Linux Control Group) is a Linux kernel feature for limiting, controlling, and isolating resource usage (CPU, memory, disk I/O, etc.) of process groups.
Linux CGroup allows you to allocate resources — such as CPU time, system memory, network bandwidth, or combinations thereof — to user-defined groups of tasks (processes). You can monitor configured cgroups, deny cgroups access to certain resources, and dynamically reconfigure cgroups on a running system. Main features:
Resource limitation: Limit resource usage, such as memory caps and filesystem cache limits.
Prioritization: Priority control for CPU utilization and disk I/O throughput.
Accounting: Auditing and statistics, primarily for billing purposes.
Control: Suspend and resume processes.
Using cgroups, system administrators can more precisely control the allocation, prioritization, denial, management, and monitoring of system resources, improving overall efficiency by better distributing hardware resources based on tasks and users.
Isolate a set of processes (e.g., all nginx processes) and limit their resource consumption, such as CPU core binding.
Allocate sufficient memory for the process group
Allocate appropriate network bandwidth and disk storage limits for the process group
Restrict access to certain devices (via device whitelisting)
View cgroup mount in Ubuntu
Or use the lssubsys command
If not available, mount manually
Create directories under /sys/fs/cgroup subdirectories
CPU Limit
Simulate a CPU-intensive program
Limit CPU for a custom cgroup
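As a rough sketch of what such a limit involves, the code below assumes the cgroup v1 cpu controller, where a group's CPU bandwidth is set through the files cpu.cfs_period_us and cpu.cfs_quota_us and processes are added by writing their PID to tasks. The cgroup directory is a caller-supplied path, and the helper names cfs_quota_us and limit_cpu are illustrative. On a cgroup v2 system the file names differ.

```python
import os


def cfs_quota_us(cpu_percent, period_us=100_000):
    """CFS quota in microseconds for a share of one CPU (200 means two CPUs)."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return int(period_us * cpu_percent // 100)


def limit_cpu(cgroup_dir, cpu_percent, pid=None, period_us=100_000):
    """Write cgroup v1 CPU bandwidth settings into an existing cgroup directory."""
    quota = cfs_quota_us(cpu_percent, period_us)
    with open(os.path.join(cgroup_dir, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(cgroup_dir, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(quota))
    if pid is not None:
        # Writing a PID to "tasks" moves that process into the group.
        with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
            f.write(str(pid))
    return quota


if __name__ == "__main__":
    print(cfs_quota_us(20))  # quota for 20% of one CPU with the default period
```

With the default period of 100000 microseconds, a 20 percent limit corresponds to a quota of 20000 microseconds of CPU time per period.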
Thread code example
Memory Limit
Simulate a memory-intensive program (allocating 512 bytes each time, sleeping 1 second between allocations)
Limit memory
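A memory cap follows the same write-a-file pattern. This sketch assumes the cgroup v1 memory controller and its memory.limit_in_bytes file. The cgroup directory is supplied by the caller, and parse_size and limit_memory are illustrative helper names. The size suffixes are a convenience of this sketch and are converted to bytes before writing.

```python
import os

# Binary multipliers for the suffixes accepted by parse_size.
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert '64K', '512M', '1G' or a plain byte count into bytes."""
    text = text.strip().upper()
    suffix = text[-1] if text and text[-1] in "KMG" else ""
    number = text[: len(text) - len(suffix)]
    return int(number) * _UNITS[suffix]


def limit_memory(cgroup_dir, limit, pid=None):
    """Write a memory cap into an existing cgroup v1 memory cgroup directory."""
    limit_bytes = parse_size(limit)
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    if pid is not None:
        # Writing a PID to "tasks" moves that process into the group.
        with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
            f.write(str(pid))
    return limit_bytes


if __name__ == "__main__":
    print(parse_size("64K"))
```

A process in the group that keeps allocating past the cap, like the test program described above, is eventually stopped by the kernel's out-of-memory handling for that cgroup.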
IO Limit
Test simulated I/O speed
Create a blkio (block device I/O) cgroup
Limit process I/O speed
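Throttling rules in the cgroup v1 blkio controller are keyed by device number rather than by device path. The sketch below assumes the blkio.throttle.read_bps_device and blkio.throttle.write_bps_device files, which take lines of the form major:minor bytes_per_second. The helper names device_numbers and throttle_rule are illustrative, and the device path in the example is a placeholder.

```python
import os


def device_numbers(path):
    """Return (major, minor) for a device node such as a whole-disk device."""
    st = os.stat(path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def throttle_rule(major, minor, bytes_per_second):
    """Format one rule line for a blkio.throttle.*_bps_device file."""
    if bytes_per_second < 0:
        raise ValueError("bytes_per_second must not be negative")
    return f"{major}:{minor} {bytes_per_second}"


if __name__ == "__main__":
    # Placeholder device path; substitute the disk that backs the workload.
    device = "/dev/sda"
    if os.path.exists(device):
        major, minor = device_numbers(device)
        print(throttle_rule(major, minor, 1024 * 1024))  # 1 MiB/s
```

The resulting line is written to the throttle file inside the blkio cgroup directory, and the process is then added to the group in the same way as for the other controllers.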
Cgroup Subsystem
blkio — Sets I/O limits for block devices such as physical devices (disk, SSD, USB, etc.).
cpu — Uses the scheduler to provide cgroup task access to CPU.
cpuacct — Generates automatic reports on CPU usage by tasks in a cgroup.
cpuset — Assigns dedicated CPUs (on multi-core systems) and memory nodes to cgroup tasks.
devices — Allows or denies access to devices for tasks in a cgroup.
freezer — Suspends or resumes tasks in a cgroup.
memory — Sets memory limits for tasks in a cgroup and generates memory usage reports.
net_cls — Tags network packets with a class identifier (classid), allowing the Linux traffic controller (tc) to identify packets from specific cgroups.
net_prio — Sets network traffic priority
hugetlb — Limits HugeTLB (huge page filesystem) usage.
Cgroup Terminology
Task: A process in the system.
Control Group: A group of processes classified by certain criteria. Resource control in cgroups is implemented per control group. A process can join a control group, and resource limits are defined on the group. Simply put, a cgroup is represented as a directory with a set of configurable files.
Hierarchy: Control groups can be organized hierarchically as a tree (directory structure). Child nodes inherit attributes from parent nodes. Simply put, a hierarchy is a cgroups directory tree on one or more subsystems.
Subsystem: A resource controller, e.g., the CPU subsystem controls CPU time allocation. A subsystem must be attached to a hierarchy to take effect. Once attached, all control groups in that hierarchy are governed by the subsystem.
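How these terms fit together for a running process can be read from /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The parser below is a small sketch of that format. CgroupMembership and parse_proc_cgroup are illustrative names.

```python
from typing import List, NamedTuple


class CgroupMembership(NamedTuple):
    hierarchy_id: int       # which hierarchy this line describes
    controllers: List[str]  # subsystems attached to that hierarchy
    path: str               # the control group, relative to the hierarchy root


def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into memberships."""
    memberships = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        memberships.append(
            CgroupMembership(
                int(hierarchy_id),
                [name for name in controllers.split(",") if name],
                path,
            )
        )
    return memberships


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as f:
            for entry in parse_proc_cgroup(f.read()):
                print(entry)
    except OSError:
        print("no /proc/self/cgroup on this system")
```

A line such as 4:cpu,cpuacct:/user.slice says that the cpu and cpuacct subsystems share hierarchy 4 and that the task sits in the /user.slice control group of that hierarchy. An empty controller list indicates the unified cgroup v2 hierarchy.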
Docker Engine
Install
Storage
Overview
Volumes
Bind mounts
tmpfs mounts
Storage drivers
Btrfs
OverlayFS
ZFS
containerd snapshotters
Networking
Overview
Networking drivers
Bridge
Overlay
Host
IPvlan
Macvlan
None
Container
Custom Network Mode
Daemon
Docker Build
Build images
Multi-stage builds
Use multi-stage builds
Name build stages
Dockerfile
Example
others
Docker Compose
Archery Docker Compose
Reference: