
Containers (namespaces, cgroups)


What is a container?

  • Containers are processes with isolated views of system resources and constrained usage.
  • Two pillars: namespaces (isolation) and cgroups (resource control).
  • Images provide a root filesystem; runtime wires namespaces, mounts, network, and limits.

Namespaces

  • pid: container sees its own PIDs; init of the namespace is PID 1 for that view.
  • mount (mnt): separate mount table; pivot_root/chroot to set rootfs.
  • net: virtual network stack; veth pairs, bridges, overlay networks.
  • uts: hostname and NIS domain isolation.
  • ipc: System V IPC and POSIX message queues isolation.
  • user: map container root to unprivileged host UID/GID (UID 0 in container != host root).
  • time: per-namespace offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME (Linux 5.6+); wall-clock time (CLOCK_REALTIME) is not virtualized.
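The user-namespace mapping can be made concrete with the format of /proc/<pid>/uid_map, where each line reads "ID-inside ID-outside count". The following is a minimal sketch of that translation; `parseIdMap` and `toHostId` are illustrative names, and the range 100000 to 165535 is only an assumed example mapping.

```javascript
// Parse the content of /proc/<pid>/uid_map (or gid_map):
// each line is "inside outside count".
function parseIdMap(text) {
  return text.trim().split("\n").filter(Boolean).map((line) => {
    const [inside, outside, count] = line.trim().split(/\s+/).map(Number);
    return { inside, outside, count };
  });
}

// Translate an ID inside the namespace to the ID seen from the parent
// namespace; returns null when the ID is not mapped.
function toHostId(idMap, id) {
  for (const { inside, outside, count } of idMap) {
    if (id >= inside && id < inside + count) return outside + (id - inside);
  }
  return null;
}

// Assumed example mapping: container IDs 0..65535 -> host IDs 100000..165535.
const exampleMap = parseIdMap("0 100000 65536\n");
console.log(toHostId(exampleMap, 0));     // 100000
console.log(toHostId(exampleMap, 1000));  // 101000
console.log(toHostId(exampleMap, 70000)); // null
```

With such a mapping, UID 0 inside the container is an ordinary unprivileged UID on the host, which is the property rootless containers depend on.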

// Minimal containerization steps (conceptual: returns the ordered plan, performs no syscalls)
function startContainer(){
  return [
    "create namespaces (clone/unshare: pid, mnt, net, uts, ipc, user)",
    "place the process in a cgroup and apply limits",
    "set up rootfs mounts (bind mounts, pivot_root)",
    "configure network (veth, bridge, IP, routes)",
    "drop caps, set seccomp/LSM profiles",
    "exec init process in the new namespaces",
  ];
}
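For experimenting with the first and last steps by hand, the util-linux `unshare` command covers namespace creation and exec. The sketch below only builds the argument list and executes nothing; `unshareArgv` and `NS_FLAGS` are illustrative names, and it assumes a util-linux version that supports these long options.

```javascript
// Map the namespace names used in the list above to unshare(1) long options.
const NS_FLAGS = new Map([
  ["pid", "--pid"], ["mnt", "--mount"], ["net", "--net"],
  ["uts", "--uts"], ["ipc", "--ipc"], ["user", "--user"], ["time", "--time"],
]);

// Build an argv array for unshare(1); nothing is executed here.
function unshareArgv(namespaces, command) {
  const flags = namespaces.map((ns) => {
    if (!NS_FLAGS.has(ns)) throw new Error(`unknown namespace: ${ns}`);
    return NS_FLAGS.get(ns);
  });
  // A new PID namespace only applies to children, so fork before exec.
  if (namespaces.includes("pid")) flags.push("--fork");
  // Map the caller to root inside the new user namespace (rootless use).
  if (namespaces.includes("user")) flags.push("--map-root-user");
  return ["unshare", ...flags, "--", ...command];
}

console.log(unshareArgv(["user", "pid", "mnt"], ["sh"]).join(" "));
// unshare --user --pid --mount --fork --map-root-user -- sh
```

Rootfs setup, networking, cgroup limits, and privilege dropping are deliberately left out; a real runtime performs them between namespace creation and exec.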

cgroups v1 vs v2

  • Limit and account CPU, memory, I/O, pids; apply policies per group of processes.
  • v1: each controller can be mounted as its own hierarchy; v2: a single unified hierarchy with a delegation model and consistent interface files.
  • Key v2 knobs: cpu.max, memory.max, memory.swap.max, io.max, pids.max, cpuset.cpus, cpuset.mems.
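Which layout a process lives in can be read from /proc/<pid>/cgroup, where each line has the form "hierarchy-ID:controller-list:path" and the v2 unified hierarchy appears as "0::/path". A minimal sketch of that check follows; `parseProcCgroup` and `cgroupMode` are illustrative names, and the sample paths in the usage lines are made up.

```javascript
// Parse /proc/<pid>/cgroup content: "hierarchy-ID:controller-list:path" per line.
function parseProcCgroup(text) {
  return text.split("\n").filter(Boolean).map((line) => {
    const [id, controllers, ...rest] = line.split(":");
    return {
      id: Number(id),
      controllers: controllers ? controllers.split(",") : [],
      path: rest.join(":"), // the path itself may contain ':'
    };
  });
}

// "v2": only the unified entry; "v1": only numbered hierarchies; "hybrid": both.
function cgroupMode(entries) {
  const unified = entries.some((e) => e.id === 0 && e.controllers.length === 0);
  const legacy = entries.some((e) => e.id !== 0);
  if (unified && legacy) return "hybrid";
  if (unified) return "v2";
  return legacy ? "v1" : "unknown";
}

console.log(cgroupMode(parseProcCgroup("0::/user.slice/demo.scope\n")));   // v2
console.log(cgroupMode(parseProcCgroup("5:memory:/demo\n3:cpu,cpuacct:/demo\n"))); // v1
```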
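
// Estimate the effective CPU limit, in CPUs' worth of time per period
// (this is the quota/period ratio, unrelated to cpu.shares/cpu.weight)
function cpuShare(cores, quotaUs, periodUs){
  // "max" (v2) or a non-positive quota such as -1 (v1) means no quota
  if (quotaUs === "max" || quotaUs <= 0) return cores;
  return Math.min(cores, quotaUs/periodUs);
}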
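On cgroup v2 the quota and period live in one file, cpu.max, formatted as "<quota|max> <period>" in microseconds, and throttling counters are exposed in cpu.stat. The sketch below parses both; `parseCpuMax`, `parseCpuStat`, and `throttledRatio` are illustrative names, and the sample file contents are made up.

```javascript
// Parse cgroup v2 cpu.max content, e.g. "50000 100000" or "max 100000".
function parseCpuMax(text) {
  const [quota, period = "100000"] = text.trim().split(/\s+/);
  const quotaUs = quota === "max" ? Infinity : Number(quota);
  const periodUs = Number(period);
  return { quotaUs, periodUs, cpus: quotaUs / periodUs };
}

// Parse cgroup v2 cpu.stat content: one "key value" pair per line.
function parseCpuStat(text) {
  const stat = {};
  for (const line of text.trim().split("\n")) {
    const [key, value] = line.trim().split(/\s+/);
    if (key) stat[key] = Number(value);
  }
  return stat;
}

// Fraction of enforcement periods in which the group was throttled.
function throttledRatio(stat) {
  return stat.nr_periods > 0 ? stat.nr_throttled / stat.nr_periods : 0;
}

console.log(parseCpuMax("50000 100000").cpus); // 0.5
console.log(parseCpuMax("max 100000").cpus);   // Infinity
const sampleStat = parseCpuStat("usage_usec 1000000\nnr_periods 200\nnr_throttled 50\nthrottled_usec 400000\n");
console.log(throttledRatio(sampleStat));       // 0.25
```

Sampling cpu.stat before and after a load test and comparing the deltas gives a direct view of how often the quota was hit.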

Rootfs, mounts, and images

  • Rootfs is often layered: overlayfs (or another union filesystem) stacks read-only image layers under a writable layer.
  • Bind mounts project host paths into containers; read-only mounts for secrets/config.
  • tmpfs for ephemeral dirs; dedicated mounts for /proc and /sys.
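The layering can be illustrated with the options an overlayfs mount takes: `lowerdir` lists the read-only layers separated by colons (leftmost is the topmost layer), `upperdir` is the writable layer, and `workdir` is an empty scratch directory on the same filesystem as `upperdir`. The sketch below only builds the `mount` argument list; `overlayMountArgv` and all paths are hypothetical.

```javascript
// Build the argv for mounting an overlay filesystem; nothing is executed.
// lowerDirs must be ordered top-most layer first, base image layer last.
function overlayMountArgv({ lowerDirs, upperDir, workDir, target }) {
  if (!lowerDirs || lowerDirs.length === 0) {
    throw new Error("at least one lower (read-only) layer is required");
  }
  if (!upperDir || !workDir) {
    throw new Error("this sketch requires both upperDir and workDir");
  }
  const options = [
    `lowerdir=${lowerDirs.join(":")}`,
    `upperdir=${upperDir}`,
    `workdir=${workDir}`,
  ].join(",");
  return ["mount", "-t", "overlay", "overlay", "-o", options, target];
}

// Hypothetical paths for a two-layer image plus a writable layer.
console.log(overlayMountArgv({
  lowerDirs: ["/layers/app", "/layers/base"],
  upperDir: "/run/demo/upper",
  workDir: "/run/demo/work",
  target: "/run/demo/rootfs",
}).join(" "));
// mount -t overlay overlay -o lowerdir=/layers/app:/layers/base,upperdir=/run/demo/upper,workdir=/run/demo/work /run/demo/rootfs
```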

Container networking

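  • CNI plugins attach one end of a veth pair to a bridge (e.g. cni0; Docker's own networking uses docker0) or set up routed/overlay networks (Flannel, Calico, Cilium).
  • Port mapping via NAT; address assignment via IPAM; service discovery via DNS; policies with iptables/eBPF.
  • Host network mode shares the host's network namespace: no veth/NAT hop and less overhead, but no network isolation.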
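The address-assignment part of IPAM can be sketched as picking the next free address in the pod or bridge subnet. This is a simplified illustration, not how any particular CNI plugin is implemented; `allocateIp` is an illustrative name, and reserving the first host address for the gateway is an assumption.

```javascript
// Convert dotted-quad IPv4 to a number and back (plain arithmetic avoids
// the signed 32-bit behavior of JavaScript bitwise operators).
function ipToInt(ip) {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}
function intToIp(n) {
  return [24, 16, 8, 0].map((shift) => Math.floor(n / 2 ** shift) % 256).join(".");
}

// Return the first unused host address in the subnet, or null when full.
// Skips the network address, the assumed gateway (+1), and the broadcast address.
function allocateIp(cidr, used) {
  const [base, prefix] = cidr.split("/");
  const size = 2 ** (32 - Number(prefix));
  const network = Math.floor(ipToInt(base) / size) * size;
  for (let n = network + 2; n < network + size - 1; n++) {
    const ip = intToIp(n);
    if (!used.has(ip)) return ip;
  }
  return null;
}

console.log(allocateIp("10.88.0.0/16", new Set(["10.88.0.2"]))); // 10.88.0.3
```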

Security: capabilities, seccomp, LSMs

  • Capabilities: fine-grained privileges; drop CAP_SYS_ADMIN and others by default.
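  • seccomp-bpf: filters dangerous syscalls; Docker applies a default profile, while Kubernetes applies the runtime's default only when RuntimeDefault is requested or enabled as the kubelet default.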
  • LSMs: AppArmor/SELinux profiles confine access; rootless containers rely on user namespaces.
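To check which capabilities a process actually holds, the CapEff line in /proc/<pid>/status gives the effective set as a hexadecimal bitmask, one bit per capability. The sketch below decodes a subset of the bits; `CAP_BITS` lists only some capabilities from linux/capability.h, `decodeCaps` is an illustrative name, and the first sample mask is the value commonly reported for a default Docker container, used here as an assumption.

```javascript
// Bit positions for a subset of Linux capabilities (linux/capability.h).
const CAP_BITS = {
  CAP_CHOWN: 0, CAP_DAC_OVERRIDE: 1, CAP_KILL: 5, CAP_SETGID: 6,
  CAP_SETUID: 7, CAP_NET_BIND_SERVICE: 10, CAP_NET_ADMIN: 12,
  CAP_NET_RAW: 13, CAP_SYS_CHROOT: 18, CAP_SYS_PTRACE: 19, CAP_SYS_ADMIN: 21,
};

// Decode a CapEff-style hex mask into the names (from CAP_BITS) that are set.
function decodeCaps(hexMask) {
  const mask = BigInt("0x" + hexMask);
  return Object.entries(CAP_BITS)
    .filter(([, bit]) => ((mask >> BigInt(bit)) & 1n) === 1n)
    .map(([name]) => name);
}

const restricted = decodeCaps("00000000a80425fb"); // assumed sample mask
console.log(restricted.includes("CAP_SYS_ADMIN")); // false
console.log(restricted.includes("CAP_NET_BIND_SERVICE")); // true

const full = decodeCaps("000001ffffffffff"); // all low 41 bits set
console.log(full.includes("CAP_SYS_ADMIN")); // true
```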

Runtimes and orchestration

  • OCI defines the image and runtime specs; containerd and CRI-O call runc (or crun) to spawn containers.
  • Kubernetes orchestrates pods (groups of containers) with CNI, CSI, and CRI integrations.
  • Helm charts and Operators automate deployments; admission controllers enforce policies.
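What a low-level runtime such as runc consumes is an OCI bundle: a root filesystem plus a config.json that names the namespaces, limits, and privileges discussed above. The following is a minimal sketch of such a config as a JavaScript object; `buildOciConfig` is an illustrative helper, the ID mapping and the 100000 µs period are assumed values, and a real config carries more fields (mounts, full capability sets, and so on).

```javascript
// Build a pared-down OCI runtime config object (sketch only).
function buildOciConfig({ args, cpus, memoryBytes, pidsLimit }) {
  const periodUs = 100000; // assumed CFS period in microseconds
  const idMapping = [{ containerID: 0, hostID: 100000, size: 65536 }]; // assumed
  return {
    ociVersion: "1.0.2",
    process: {
      args,
      cwd: "/",
      user: { uid: 0, gid: 0 },
      noNewPrivileges: true,
      // Empty sets: the process keeps no capabilities.
      capabilities: { bounding: [], effective: [], permitted: [] },
    },
    root: { path: "rootfs", readonly: true },
    linux: {
      namespaces: ["pid", "mount", "network", "uts", "ipc", "user"]
        .map((type) => ({ type })),
      uidMappings: idMapping,
      gidMappings: idMapping,
      resources: {
        cpu: { quota: Math.round(cpus * periodUs), period: periodUs },
        memory: { limit: memoryBytes },
        pids: { limit: pidsLimit },
      },
    },
  };
}

const config = buildOciConfig({
  args: ["/bin/sh"], cpus: 0.5, memoryBytes: 256 * 1024 * 1024, pidsLimit: 64,
});
console.log(JSON.stringify(config, null, 2));
```

Higher layers (containerd, CRI-O, Kubernetes) generate this document from their own pod and container specs and then hand it to the runtime.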

Exercises

  1. Create a rootless container that runs a minimal web server with user+mount namespaces.
  2. Configure cgroup v2 CPU and memory limits; measure throttling behavior under load.
  3. Build a small overlayfs rootfs from two directories and run a chroot inside a new mount namespace.

Containers = namespaces + cgroups + a well-constructed rootfs, with security layers and orchestration on top.