Core Linux container technologies: namespaces

gerrylon007

Published 2022-01-30 21:13:30

What is a Linux namespace?

The official documentation, namespaces(7), describes Linux namespaces like this:

A namespace wraps a global system resource in an abstraction that
makes it appear to the processes within the namespace that they
have their own isolated instance of the global resource. Changes
to the global resource are visible to other processes that are
members of the namespace, but are invisible to other processes.
One use of namespaces is to implement containers.

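In short:
A namespace wraps a system resource so that the processes inside it appear to have their own independent copy of that resource.
Namespaces can be used to implement containers.

Summary: namespaces isolate the resources a process can see, and they can be used to build containers.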

What kinds of namespaces are there?

Again following https://man7.org/linux/man-pages/man7/namespaces.7.html , here is a table:

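Name	/proc entry	Entry available since	Description
cgroup	/proc/[pid]/ns/cgroup	Linux 4.6	Isolates the process's view of the cgroup hierarchy (the cgroup root directory); resource limits such as a memory cap are enforced by cgroups themselves, not by this namespace
IPC	/proc/[pid]/ns/ipc	Linux 3.0	Isolates inter-process communication resources (System V IPC, POSIX message queues)
Mount	/proc/[pid]/ns/mnt	Linux 3.8	Gives processes their own set of mount points, so each can appear to have its own filesystem tree; somewhat like chroot()
Network	/proc/[pid]/ns/net	Linux 3.0	Gives processes an independent network stack (devices, addresses, ports, routes)
PID	/proc/[pid]/ns/pid	Linux 3.8	Isolates process IDs
User	/proc/[pid]/ns/user	Linux 3.8	Isolates user and group IDs
UTS (UNIX Time-Sharing)	/proc/[pid]/ns/uts	Linux 3.0	Isolates the hostname (nodename) and NIS domain name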

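Besides the entries listed above, the official documentation also covers the time namespace and additional /proc/[pid]/ns entries such as pid_for_children; interested readers can look these up on their own.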

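Namespace-related APIs and commands

The unshare command

Several namespaces were listed above. There are system calls and commands for working with them; a few examples give a hands-on feel and show what namespaces are actually good for.

Start with one command, unshare. Its man page states its purpose very clearly: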

unshare - run program with some namespaces unshared from parent

The name gives it away: un-share means "do not share", in other words exclusive or isolated.

Some examples follow (they are run as root):

Isolating PIDs

$ unshare --fork --pid --mount-proc /bin/bash
$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 20:22 pts/0    00:00:00 /bin/bash
root        19     1  0 20:22 pts/0    00:00:00 ps -ef
$

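Inside the new namespace the bash process has PID 1 and no other processes are listed, which shows that this namespace's PIDs are isolated from the rest of the system.
One benefit: when experimenting, a PID namespace makes it easy to watch just the processes of interest, without noise from everything else.

Isolating the hostname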

$ hostname                        # check the current hostname first
myvm
$ unshare --fork --uts /bin/bash  # start bash in a new UTS namespace
$ hostname                        # the new namespace inherits the system hostname
myvm
$ hostname -b test                # change the hostname inside this namespace
$ hostname                        # the hostname in this namespace has changed
test
$ exit                            # leave the namespace (or check from another shell)
exit
$ hostname                        # the system hostname is still myvm
myvm

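The lsns and nsenter commands

lsns: lists the namespaces currently in use on the system
nsenter: enters an existing namespace
For the experiment, open two terminals:
Terminal 1: use unshare to create an isolated UTS namespace.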

$ unshare --fork --uts /bin/bash
$ hostname -b demo
$ hostname
demo
$

Terminal 2: use lsns to find the new namespace, then use nsenter to enter it

$ lsns
        NS TYPE NPROCS   PID USER    COMMAND
... (some output omitted)
4026532244 uts       4 23822 root    unshare --fork --uts /bin/bash
$ nsenter -t 23822 -u bash
$ hostname
demo
$

Using namespaces with clone(): an example

#define _GNU_SOURCE
#include <sys/wait.h>
#include <sys/utsname.h>
#include <sched.h>
#include <signal.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                                            } while (0)

/* Start function for cloned child */
static int childFunc(void *arg)
{
    struct utsname uts;

    /* Change hostname in UTS namespace of child */
    if (sethostname(arg, strlen(arg)) == -1)
            errExit("sethostname");

    /* Retrieve and display hostname */
    if (uname(&uts) == -1)
            errExit("uname");
    printf("uts.nodename in child:  %s\n", uts.nodename);

    /* Keep the namespace open for a while, by sleeping.
           This allows some experimentation--for example, another
           process might join the namespace. */

    sleep(200);

    return 0;           /* Child terminates now */
}

#define STACK_SIZE (1024 * 1024)    /* Stack size for cloned child */

int main(int argc, char *argv[])
{
    char *stack;                    /* Start of stack buffer */
    char *stackTop;                 /* End of stack buffer */
    pid_t pid;
    struct utsname uts;

    if (argc < 2) {
            fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
            exit(EXIT_FAILURE);
    }

    /* Allocate stack for child */
    stack = malloc(STACK_SIZE);
    if (stack == NULL)
            errExit("malloc");
    stackTop = stack + STACK_SIZE;  /* Assume stack grows downward */

    /* Create child that has its own UTS namespace;
           child commences execution in childFunc() */
    /* Note the third argument: CLONE_NEWUTS requests a new UTS namespace */
    pid = clone(childFunc, stackTop, CLONE_NEWUTS | SIGCHLD, argv[1]);
    if (pid == -1)
            errExit("clone");
    printf("clone() returned %ld\n", (long) pid);

    /* Parent falls through to here */
    sleep(1);           /* Give child time to change its hostname */

    /* Display hostname in parent's UTS namespace. This will be
           different from hostname in child's UTS namespace. */
    if (uname(&uts) == -1)
            errExit("uname");
    printf("uts.nodename in parent: %s\n", uts.nodename);

    if (waitpid(pid, NULL, 0) == -1)    /* Wait for child */
            errExit("waitpid");
    printf("child has terminated\n");

    exit(EXIT_SUCCESS);
}

Run it to verify (CLONE_NEWUTS requires root privileges):

$ gcc main.c
$ ./a.out xx
clone() returned 22122
uts.nodename in child:  xx # the child's hostname is isolated
uts.nodename in parent: myvm
^C
$

C is certainly verbose (low-level), but it gives very fine-grained control.

Isolating the hostname with Go

The example follows the book <<自己动手写Docker>>.

main.go:

package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sh")

	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS, // the same value as the flags argument of the clone() system call
	}

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}

Run it to verify:

$ go run main.go # from here on, the sh prompt runs inside a new UTS namespace
$ hostname
myvm
$ hostname -b xx
$ hostname
xx
$ exit
exit
$ hostname
myvm
$

What is uts:[4026532244]?

Many articles note that the namespaces a process belongs to can be inspected through /proc/[pid]/ns/[ns]. For example, for the current shell:

$ ls -al /proc/$$/ns/uts
lrwxrwxrwx 1 root root 0 Jan 31 09:18 /proc/25078/ns/uts -> uts:[4026532244]
$ ls -al /proc/$$/ns
lrwxrwxrwx 1 root root 0 Jan 31 09:40 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Jan 31 09:18 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Jan 31 09:12 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Jan 31 09:12 net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 Jan 31 09:12 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Jan 31 09:18 uts -> uts:[4026532244]
$ readlink /proc/$$/ns/uts
uts:[4026532244]

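In the proc filesystem (ls -al /proc/$$/ns/uts), each namespace a process belongs to appears as a symbolic link.
Taking uts as the example (readlink /proc/$$/ns/uts), what does uts:[4026532244] mean?
According to the referenced article, the 4026532244 in uts:[4026532244] is an inode number in nsfs (the namespace filesystem).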

These numbers are the inode numbers of files implemented by the nsfs filesystem, which could be opened and used with the setns(2) to associate a process with a namespace.

A process can join a UTS namespace by calling open() and then setns():

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                                            } while (0)

int main(int argc, char *argv[])
{
    int fd;

    if (argc < 3) {
            fprintf(stderr, "%s /proc/PID/ns/FILE cmd args...\n", argv[0]);
            exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDONLY);   /* Get descriptor for namespace */
    if (fd == -1)
            errExit("open");
            
    if (setns(fd, 0) == -1)         /* Join that namespace */
            errExit("setns");

    execvp(argv[2], &argv[2]);      /* Execute a command in namespace */
    errExit("execvp");
}

Compile and run:

$ gcc -Wall -o setns.out setns.c;
$ ./setns.out /proc/$$/ns/uts hostname
myvm

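Summary

Namespaces are one of the core technologies behind containers;
the namespace-related system calls include clone, setns and unshare;
the namespace-related commands include unshare, lsns and nsenter;
in Go, a child process can be placed in new namespaces through the Cloneflags field of syscall.SysProcAttr.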

(End)