The rkt Project

The rkt project was one of the earliest runtime components used with Kubernetes, and it was once accepted as a CNCF sandbox project, but it was ultimately abandoned: cri-o and containerd saw far wider adoption, rkt's community activity kept declining, and maintenance eventually stopped. Even though it is no longer maintained, rkt is still an excellent case study in container-runtime design. The main learning material is the architecture documentation provided on the official site.

An Overview of How rkt Works

rkt's primary interface is a command-line tool, rkt, which requires no long-running daemon. This architecture allows rkt to be upgraded in place without affecting application containers that are currently running. It also means privilege levels can be separated between different operations.

All state in rkt is communicated through the filesystem; facilities such as file locks ensure cooperation and mutual exclusion between concurrent invocations of rkt commands.

rkt implements its logic in separate layers (stages), which keeps the system stable through rapid iteration.

(rkt architecture diagram from the official documentation)

The official architecture diagram clearly shows how, once the rkt process has been invoked, the container is isolated and the application is eventually started. In the diagram, stage1 corresponds to the pod in the broad sense, while stage2 is the runtime container launched inside the pod.

Startup happens in three steps:

  1. Invoking process -> stage0: the invoking process uses its own mechanism to invoke the rkt binary (stage0). When launched from a regular shell or a supervisor, stage0 is typically forked and exec'ed, becoming a child of the invoking shell or supervisor.
  2. stage0 -> stage1: the stage0 process replaces itself with the stage1 entrypoint via a plain exec(3). The entrypoint is referenced by the coreos.com/rkt/stage1/run annotation in the stage1 image manifest.
  3. stage1 -> stage2: the stage1 entrypoint uses its own mechanism to invoke the stage2 application executables. Each application executable is referenced by the apps.app.exec setting in the stage2 image manifest.

These three steps split container creation into separate initialization phases. No long-running background process is needed to keep things alive; all runtime state is organized and managed through files on disk.

stage0

When the rkt binary is invoked to run a pod, rkt performs the following tasks:

  1. Fetch the specified ACIs (images), including their associated parameters.
  2. Generate a pod UUID.
  3. Generate a pod manifest.
  4. Create a filesystem for the pod.
  5. Create the stage1 and stage2 directories within that filesystem.
  6. Unpack the stage1 ACI into the pod filesystem.
  7. Unpack the app ACIs and copy each app into its stage2 directory.

The generated pod manifest conforms to the ACE (App Container Executor) specification. The filesystem created by stage0 looks like this:

/pod
/stage1
/stage1/manifest
/stage1/rootfs/init
/stage1/rootfs/opt
/stage1/rootfs/opt/stage2/${app1-name}
/stage1/rootfs/opt/stage2/${app2-name}

Where:

  • pod is the pod manifest file.
  • stage1 is a copy of the stage1 image that can safely be read and written.
  • stage1/manifest is the stage1 image manifest.
  • stage1/rootfs is the root filesystem of the stage1 image.
  • stage1/rootfs/init is an executable specified by the stage1 image's configuration.
  • stage1/rootfs/opt/stage2 contains the unpacked copies of the application images.

At this point, stage0 executes /stage1/rootfs/init with the current working directory set to the root of the new filesystem.

Take rkt run app.aci as an example: this command ultimately invokes the runRun wrapper function.

func runRun(cmd *cobra.Command, args []string) (exit int) {
	privateUsers := user.NewBlankUidRange() // blank uid range for user namespaces
	err := parseApps(&rktApps, args, cmd.Flags(), true) // parse the app image arguments
	if err != nil {
		stderr.PrintE("error parsing app image arguments", err)
		return 254
	}

	if flagStoreOnly && flagNoStore {
		stderr.Print("both --store-only and --no-store specified")
		return 254
	}
	if flagStoreOnly { // determine the image pull policy
		flagPullPolicy = image.PullPolicyNever
	}
	if flagNoStore {
		flagPullPolicy = image.PullPolicyUpdate
	}

	if flagPrivateUsers { // user namespace support requested
		if !common.SupportsUserNS() {
			stderr.Print("--private-users is not supported, kernel compiled without user namespace support")
			return 254
		}
		privateUsers.SetRandomUidRange(user.DefaultRangeCount)
	}

	if len(flagPorts) > 0 && flagNet.None() { // --port requires a real network
		stderr.Print("--port flag does not work with 'none' networking")
		return 254
	}
	if len(flagPorts) > 0 && flagNet.Host() {    
		stderr.Print("--port flag does not work with 'host' networking")
		return 254
	}

	if flagMDSRegister && flagNet.None() {
		stderr.Print("--mds-register flag does not work with --net=none. Please use 'host', 'default' or an equivalent network")
		return 254
	}

	if len(flagPodManifest) > 0 && (rktApps.Count() > 0 ||
		(*appsVolume)(&rktApps).String() != "" || (*appMount)(&rktApps).String() != "" ||
		len(flagPorts) > 0 || flagPullPolicy == image.PullPolicyNever ||
		flagPullPolicy == image.PullPolicyUpdate || flagInheritEnv ||
		!flagExplicitEnv.IsEmpty() || !flagEnvFromFile.IsEmpty()) {
		stderr.Print("conflicting flags set with --pod-manifest (see --help)")
		return 254
	}

	if flagInteractive && rktApps.Count() > 1 { // interactive mode supports only one app
		stderr.Print("interactive option only supports one image")
		return 254
	}

	if rktApps.Count() < 1 && len(flagPodManifest) == 0 {
		stderr.Print("must provide at least one image or specify the pod manifest")
		return 254
	}

	s, err := imagestore.NewStore(storeDir()) // open the on-disk image store
	if err != nil {
		stderr.PrintE("cannot open store", err)
		return 254
	}

	ts, err := treestore.NewStore(treeStoreDir(), s)
	if err != nil {
		stderr.PrintE("cannot open treestore", err)
		return 254
	}

	config, err := getConfig()
	if err != nil {
		stderr.PrintE("cannot get configuration", err)
		return 254
	}

	s1img, err := getStage1Hash(s, ts, config) // resolve the stage1 image hash
	if err != nil {
		stderr.Error(err)
		return 254
	}

	fn := &image.Finder{
		S:                  s,
		Ts:                 ts,
		Ks:                 getKeystore(),
		Headers:            config.AuthPerHost,
		DockerAuth:         config.DockerCredentialsPerRegistry,
		InsecureFlags:      globalFlags.InsecureFlags,
		Debug:              globalFlags.Debug,
		TrustKeysFromHTTPS: globalFlags.TrustKeysFromHTTPS,

		PullPolicy: flagPullPolicy,
		WithDeps:   true,
	}
	if err := fn.FindImages(&rktApps); err != nil { // fetch and resolve the app images
		stderr.Error(err)
		return 254
	}

	p, err := pkgPod.NewPod(getDataDir()) // create the pod working directory and set its state
	if err != nil {
		stderr.PrintE("error creating new pod", err)
		return 254
	}

	// if requested, write out pod UUID early so "rkt rm" can
	// clean it up even if something goes wrong
	if flagUUIDFileSave != "" {
		if err := pkgPod.WriteUUIDToFile(p.UUID, flagUUIDFileSave); err != nil { // save the pod UUID for later cleanup
			stderr.PrintE("error saving pod UUID to file", err)
			return 254
		}
	}

	processLabel, mountLabel, err := label.InitLabels([]string{}) // initialize SELinux labels
	if err != nil {
		stderr.PrintE("error initialising SELinux", err)
		return 254
	}
	p.MountLabel = mountLabel

	cfg := stage0.CommonConfig{
		DataDir:      getDataDir(),
		MountLabel:   mountLabel,
		ProcessLabel: processLabel,
		Store:        s,
		TreeStore:    ts,
		Stage1Image:  *s1img,
		UUID:         p.UUID,
		Debug:        globalFlags.Debug,
		Mutable:      false,
	} // common stage0 configuration

	ovlOk := true
	if err := common.PathSupportsOverlay(getDataDir()); err != nil { // check overlayfs support
		if oerr, ok := err.(common.ErrOverlayUnsupported); ok {
			stderr.Printf("disabling overlay support: %q", oerr.Error())
			ovlOk = false
		} else {
			stderr.PrintE("error determining overlay support", err)
			return 254
		}
	}

	useOverlay := !flagNoOverlay && ovlOk

	pcfg := stage0.PrepareConfig{
		CommonConfig: &cfg,
		UseOverlay:   useOverlay,
		PrivateUsers: privateUsers,
	} // configuration for the prepare phase

	if len(flagPodManifest) > 0 {
		pcfg.PodManifest = flagPodManifest
	} else {
		pcfg.Ports = []types.ExposedPort(flagPorts)
		pcfg.InheritEnv = flagInheritEnv
		pcfg.ExplicitEnv = flagExplicitEnv.Strings()
		pcfg.EnvFromFile = flagEnvFromFile.Strings()
		pcfg.Apps = &rktApps
	}

	if globalFlags.Debug {
		stage0.InitDebug()
	}

	keyLock, err := lock.SharedKeyLock(lockDir(), common.PrepareLock) // take the shared prepare lock
	if err != nil {
		stderr.PrintE("cannot get shared prepare lock", err)
		return 254
	}
	err = stage0.Prepare(pcfg, p.Path(), p.UUID) // run the stage0 prepare phase
	if err != nil {
		stderr.PrintE("error setting up stage0", err)
		keyLock.Close()
		return 254
	}
	keyLock.Close()

	// get the lock fd for run
	lfd, err := p.Fd() // pod lock file descriptor, not a pid
	if err != nil {
		stderr.PrintE("error getting pod lock fd", err)
		return 254
	}

	// skip prepared by jumping directly to run, we own this pod
	if err := p.ToRun(); err != nil { // transition the pod state directly to run
		stderr.PrintE("unable to transition to run", err)
		return 254
	}

	rktgid, err := common.LookupGid(common.RktGroup)
	if err != nil {
		stderr.Printf("group %q not found, will use default gid when rendering images", common.RktGroup)
		rktgid = -1
	}

	DNSConfMode, DNSConfig, HostsEntries, err := parseDNSFlags(flagHostsEntries, flagDNS, flagDNSSearch, flagDNSOpt, flagDNSDomain)
	if err != nil { // DNS flag parsing failed
		stderr.PrintE("error with dns flags", err)
		return 254
	}

	rcfg := stage0.RunConfig{
		CommonConfig:         &cfg,
		Net:                  flagNet,
		LockFd:               lfd,
		Interactive:          flagInteractive,
		DNSConfMode:          DNSConfMode,
		DNSConfig:            DNSConfig,
		MDSRegister:          flagMDSRegister,
		LocalConfig:          globalFlags.LocalConfigDir,
		RktGid:               rktgid,
		Hostname:             flagHostname,
		InsecureCapabilities: globalFlags.InsecureFlags.SkipCapabilities(),
		InsecurePaths:        globalFlags.InsecureFlags.SkipPaths(),
		InsecureSeccomp:      globalFlags.InsecureFlags.SkipSeccomp(),
		UseOverlay:           useOverlay,
		HostsEntries:         *HostsEntries,
		IPCMode:              flagIPCMode,
	} // configuration for the run phase

	_, manifest, err := p.PodManifest() // load the pod manifest
	if err != nil {
		stderr.PrintE("cannot get the pod manifest", err)
		return 254
	}

	if len(manifest.Apps) == 0 {
		stderr.Print("pod must contain at least one application")
		return 254
	}
	rcfg.Apps = manifest.Apps
	stage0.Run(rcfg, p.Path(), getDataDir()) // execs into stage1, never returns

	return 254
}


...


// Run mounts the right overlay filesystems and actually runs the prepared
// pod by exec()ing the stage1 init inside the pod filesystem.
func Run(cfg RunConfig, dir string, dataDir string) {
	privateUsers, err := preparedWithPrivateUsers(dir)
	if err != nil {
		log.FatalE("error preparing private users", err)
	}

	debug("Setting up stage1")
	if err := setupStage1Image(cfg, dir, cfg.UseOverlay); err != nil { // set up the stage1 filesystem
		log.FatalE("error setting up stage1", err)
	}
	debug("Wrote filesystem to %s\n", dir)

	for _, app := range cfg.Apps {
		if err := setupAppImage(cfg, app.Name, app.Image.ID, dir, cfg.UseOverlay); err != nil {
			log.FatalE("error setting up app image", err)
		}
	}

	destRootfs := common.Stage1RootfsPath(dir) // path to the stage1 root filesystem

	writeDnsConfig(&cfg, destRootfs) // write DNS configuration into the rootfs

	if err := os.Setenv(common.EnvLockFd, fmt.Sprintf("%v", cfg.LockFd)); err != nil {
		log.FatalE("setting lock fd environment", err)
	}

	if err := os.Setenv(common.EnvSELinuxContext, fmt.Sprintf("%v", cfg.ProcessLabel)); err != nil {
		log.FatalE("setting SELinux context environment", err)
	}

	if err := os.Setenv(common.EnvSELinuxMountContext, fmt.Sprintf("%v", cfg.MountLabel)); err != nil {
		log.FatalE("setting SELinux mount context environment", err)
	}

	debug("Pivoting to filesystem %s", dir) // switch into the pod's filesystem
	if err := os.Chdir(dir); err != nil {
		log.FatalE("failed changing to dir", err)
	}

	ep, err := getStage1Entrypoint(dir, runEntrypoint) // resolve the stage1 'run' entrypoint
	if err != nil {
		log.FatalE("error determining 'run' entrypoint", err)
	}
	args := []string{filepath.Join(destRootfs, ep)}

	if cfg.Debug {
		args = append(args, "--debug")
	}

	args = append(args, "--net="+cfg.Net.String()) // pass the networking configuration

	if cfg.Interactive {
		args = append(args, "--interactive")
	}
	if len(privateUsers) > 0 {
		args = append(args, "--private-users="+privateUsers)
	}
	if cfg.MDSRegister {
		mdsToken, err := registerPod(".", cfg.UUID, cfg.Apps)
		if err != nil {
			log.FatalE("failed to register the pod", err)
		}

		args = append(args, "--mds-token="+mdsToken)
	}

	if cfg.LocalConfig != "" {
		args = append(args, "--local-config="+cfg.LocalConfig)
	}

	s1v, err := getStage1InterfaceVersion(dir)  
	if err != nil {
		log.FatalE("error determining stage1 interface version", err)
	}

	if cfg.Hostname != "" {
		if interfaceVersionSupportsHostname(s1v) {
			args = append(args, "--hostname="+cfg.Hostname)
		} else {
			log.Printf("warning: --hostname option is not supported by stage1")
		}
	}

	if cfg.DNSConfMode.Hosts != "default" || cfg.DNSConfMode.Resolv != "default" {
		if interfaceVersionSupportsDNSConfMode(s1v) {
			args = append(args, fmt.Sprintf("--dns-conf-mode=resolv=%s,hosts=%s", cfg.DNSConfMode.Resolv, cfg.DNSConfMode.Hosts))
		} else {
			log.Printf("warning: --dns-conf-mode option not supported by stage1")
		}
	}

	if interfaceVersionSupportsInsecureOptions(s1v) {
		if cfg.InsecureCapabilities {
			args = append(args, "--disable-capabilities-restriction")
		}
		if cfg.InsecurePaths {
			args = append(args, "--disable-paths")
		}
		if cfg.InsecureSeccomp {
			args = append(args, "--disable-seccomp")
		}
	}

	if cfg.Mutable { // mutable pod support
		mutable, err := supportsMutableEnvironment(dir)

		switch {
		case err != nil:
			log.FatalE("error determining stage1 mutable support", err)
		case !mutable:
			log.Fatalln("stage1 does not support mutable pods")
		}

		args = append(args, "--mutable")
	}

	if cfg.IPCMode != "" {
		if interfaceVersionSupportsIPCMode(s1v) {
			args = append(args, "--ipc="+cfg.IPCMode)
		} else {
			log.Printf("warning: --ipc option is not supported by stage1")
		}
	}

	args = append(args, cfg.UUID.String())

	// make sure the lock fd stays open across exec
	if err := sys.CloseOnExec(cfg.LockFd, false); err != nil {
		log.Fatalf("error clearing FD_CLOEXEC on lock fd")
	}

	tpmEvent := fmt.Sprintf("rkt: Rootfs: %s Manifest: %s Stage1 args: %s", cfg.CommonConfig.RootHash, cfg.CommonConfig.ManifestData, strings.Join(args, " "))
	// If there's no TPM available or there's a failure for some other
	// reason, ignore it and continue anyway. Long term we'll want policy
	// that enforces TPM behaviour, but we don't have any infrastructure
	// around that yet.
	_ = tpm.Extend(tpmEvent)

	debug("Execing %s", args)
	if err := syscall.Exec(args[0], args, os.Environ()); err != nil { // exec into the stage1 init
		log.FatalE("error execing init", err)
	}
}

From the source flow: runRun first creates the pod, then builds its working directory according to the configuration, and finally execs straight into the init program. By default, init points at the stage1 entrypoint executable.

stage1

Once the init program takes over, it converts the pod manifest into systemd-nspawn service units. Its main job is to start the pod inside the isolated environment, with the networking and mounts prepared by stage0. The principal tasks are:

  1. Read the pod manifest, obtain each image's default execution entrypoint, and rewrite it according to the configuration.
  2. Set up and run the apps inside an isolated environment. There are currently three execution modes: fly, a minimal chroot-only environment; systemd/nspawn, running systemd inside an isolated environment; and kvm, a fully isolated KVM virtual machine.

By default, the systemd execution environment is used.
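Converting the pod manifest into systemd terms means that, inside the stage1 rootfs, each app ends up with a service unit roughly like the following. This fragment is illustrative only; the unit names, paths, and options rkt actually emits vary by version:

```ini
; <stage1-rootfs>/usr/lib/systemd/system/app1.service (illustrative)
[Unit]
Description=Application unit for app1

[Service]
; ExecStart corresponds to apps.app.exec in the image manifest,
; possibly rewritten by command-line configuration
ExecStart=/opt/stage2/app1/rootfs/usr/bin/app1
WorkingDirectory=/opt/stage2/app1/rootfs
```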

func stage1(rp *stage1commontypes.RuntimePod) int {
	uuid, err := types.NewUUID(flag.Arg(0)) // parse the pod UUID from the first argument
	if err != nil {
		log.FatalE("UUID is missing or malformed", err)
	}

	root := "."
	p, err := stage1commontypes.LoadPod(root, uuid, rp) // load the pod manifest and runtime information
	if err != nil {
		log.FatalE("failed to load pod", err)
	}

	if err := p.SaveRuntime(); err != nil { // persist runtime parameters
		log.FatalE("failed to save runtime parameters", err)
	}

	// set close-on-exec flag on RKT_LOCK_FD so it gets correctly closed when invoking
	// network plugins
	lfd, err := common.GetRktLockFD() // get the rkt lock fd
	if err != nil {
		log.FatalE("failed to get rkt lock fd", err)
	}

	if err := sys.CloseOnExec(lfd, true); err != nil {
		log.FatalE("failed to set FD_CLOEXEC on rkt lock", err)
	}

	mirrorLocalZoneInfo(p.Root)

	flavor, _, err := stage1initcommon.GetFlavor(p) // determine the stage1 flavor
	if err != nil {
		log.FatalE("failed to get stage1 flavor", err)
	}

	var n *networking.Networking // pod networking configuration
	if p.NetList.Contained() {
		fps, err := commonnet.ForwardedPorts(p.Manifest)
		if err != nil {
			log.FatalE("error initializing forwarding ports", err)
		}

		noDNS := p.ResolvConfMode != "default" // force ignore CNI DNS results
		n, err = networking.Setup(root, p.UUID, fps, p.NetList, localConfig, flavor, noDNS, debug)
		if err != nil {
			log.FatalE("failed to setup network", err)
		}

		if err = n.Save(); err != nil {
			log.PrintE("failed to save networking state", err)
			n.Teardown(flavor, debug)
			return 254
		}

		if len(p.MDSToken) > 0 {
			hostIP, err := n.GetForwardableNetHostIP()
			if err != nil {
				log.FatalE("failed to get default Host IP", err)
			}

			p.MetadataServiceURL = common.MetadataServicePublicURL(hostIP, p.MDSToken)
		}
	} else {
		if flavor == "kvm" {
			log.Fatal("flavor kvm requires private network configuration (try --net)")
		}
		if len(p.MDSToken) > 0 {
			p.MetadataServiceURL = common.MetadataServicePublicURL(localhostIP, p.MDSToken)
		}
	}

	mnt := fs.NewLoggingMounter(
		fs.MounterFunc(syscall.Mount),
		fs.UnmounterFunc(syscall.Unmount),
		diag.Printf,
	) 

	// set hostname inside pod
	// According to systemd manual (https://www.freedesktop.org/software/systemd/man/hostname.html) :
	// "The /etc/hostname file configures the name of the local system that is set
	// during boot using the sethostname system call"
	if p.Hostname == "" {
		p.Hostname = stage1initcommon.GetMachineID(p)
	} 
	hostnamePath := filepath.Join(common.Stage1RootfsPath(p.Root), "etc/hostname") // write the hostname file
	if err := ioutil.WriteFile(hostnamePath, []byte(p.Hostname), 0644); err != nil {
		log.PrintE("error writing "+hostnamePath, err)
		return 254
	}
	if err := user.ShiftFiles([]string{hostnamePath}, &p.UidRange); err != nil {
		log.PrintE("error shifting "+hostnamePath, err)
	}

	if p.ResolvConfMode == "host" {
		stage1initcommon.UseHostResolv(mnt, root)
	}

	// Set up the hosts file.
	// We write <stage1>/etc/rkt-hosts if we want to override each app's hosts,
	// and <stage1>/etc/hosts-fallback if we want to let the app "win"
	// Either way, we should add our hostname to it, unless the hosts's
	// /etc/hosts is bind-mounted in.
	if p.EtcHostsMode == "host" { // We should bind-mount the hosts's /etc/hosts
		stage1initcommon.UseHostHosts(mnt, root)
	} else if p.EtcHostsMode == "default" { // Create hosts-fallback
		hostsFile := filepath.Join(common.Stage1RootfsPath(p.Root), "etc", "hosts-fallback")
		if err := stage1initcommon.AddHostsEntry(hostsFile, "127.0.0.1", p.Hostname); err != nil {
			log.PrintE("Failed to write hostname to "+hostsFile, err)
			return 254
		}
	} else if p.EtcHostsMode == "stage0" { // The stage0 has created rkt-hosts
		hostsFile := filepath.Join(common.Stage1RootfsPath(p.Root), "etc", "rkt-hosts")
		if err := stage1initcommon.AddHostsEntry(hostsFile, "127.0.0.1", p.Hostname); err != nil {
			log.PrintE("Failed to write hostname to "+hostsFile, err)
			return 254
		}
	}

	if p.Mutable {
		if err = stage1initcommon.MutableEnv(p); err != nil { // write unit files for a mutable pod
			log.FatalE("cannot initialize mutable environment", err)
		}
	} else {
		if err = stage1initcommon.ImmutableEnv(p); err != nil { // write unit files for an immutable pod
			log.FatalE("cannot initialize immutable environment", err)
		}
	}

	if err := stage1initcommon.SetJournalPermissions(p); err != nil { // set journal ACLs
		log.PrintE("warning: error setting journal ACLs, you'll need root to read the pod journal", err)
	}

	if flavor == "kvm" {
		kvm.InitDebug(debug)
		if err := KvmNetworkingToSystemd(p, n); err != nil {
			log.FatalE("failed to configure systemd for kvm", err)
		}
	}

	canMachinedRegister := false
	if flavor != "kvm" {
		// kvm doesn't register with systemd right now, see #2664.
		canMachinedRegister = machinedRegister()
	}
	diag.Printf("canMachinedRegister %t", canMachinedRegister)

	// --ipc=[auto|private|parent]
	// default to private
	parentIPC := false // choose the IPC namespace mode
	switch p.IPCMode {
	case "parent":
		parentIPC = true
	case "private":
		parentIPC = false
	case "auto":
		fallthrough
	case "":
		parentIPC = false
	default:
		log.Fatalf("unknown value for --ipc parameter: %v", p.IPCMode)
	}
	if parentIPC && flavor == "kvm" {
		log.Fatal("flavor kvm requires private IPC namespace (try to remove --ipc)")
	}

	args, env, err := getArgsEnv(p, flavor, canMachinedRegister, debug, n, parentIPC) // build args and env for the next exec
	if err != nil {
		log.FatalE("cannot get environment", err)
	}
	diag.Printf("args %q", args)
	diag.Printf("env %q", env)

	// create a separate mount namespace so the cgroup filesystems
	// are unmounted when exiting the pod
	if err := syscall.Unshare(syscall.CLONE_NEWNS); err != nil { // enter a new mount namespace
		log.FatalE("error unsharing", err)
	}

	// we recursively make / a "shared and slave" so mount events from the
	// new namespace don't propagate to the host namespace but mount events
	// from the host propagate to the new namespace and are forwarded to
	// its peer group
	// See https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
	if err := mnt.Mount("", "/", "none", syscall.MS_REC|syscall.MS_SLAVE, ""); err != nil {
		log.FatalE("error making / a slave mount", err)
	}
	if err := mnt.Mount("", "/", "none", syscall.MS_REC|syscall.MS_SHARED, ""); err != nil {
		log.FatalE("error making / a shared and slave mount", err)
	}

	unifiedCgroup, err := cgroup.IsCgroupUnified("/")
	if err != nil {
		log.FatalE("error determining cgroup version", err)
	}
	diag.Printf("unifiedCgroup %t", unifiedCgroup)

	machineID := stage1initcommon.GetMachineID(p)

	subcgroup, err := getContainerSubCgroup(machineID, canMachinedRegister, unifiedCgroup) // determine the pod's subcgroup
	if err != nil {
		log.FatalE("error getting container subcgroup", err)
	}
	diag.Printf("subcgroup %q", subcgroup)

	if err := ioutil.WriteFile(filepath.Join(p.Root, "subcgroup"),
		[]byte(fmt.Sprintf("%s", subcgroup)), 0644); err != nil { // write the subcgroup file
		log.FatalE("cannot write subcgroup file", err)
	}

	if !unifiedCgroup {
		enabledCgroups, err := v1.GetEnabledCgroups()
		if err != nil {
			log.FatalE("error getting v1 cgroups", err)
		}
		diag.Printf("enabledCgroups %q", enabledCgroups)

		if err := mountHostV1Cgroups(mnt, enabledCgroups); err != nil {
			log.FatalE("couldn't mount the host v1 cgroups", err)
		}

		if !canMachinedRegister {
			if err := v1.JoinSubcgroup("systemd", subcgroup); err != nil {
				log.FatalE(fmt.Sprintf("error joining subcgroup %q", subcgroup), err)
			}
		}

		var serviceNames []string
		for _, app := range p.Manifest.Apps {
			serviceNames = append(serviceNames, stage1initcommon.ServiceUnitName(app.Name))
		}
		diag.Printf("serviceNames %q", serviceNames)

		if err := mountContainerV1Cgroups(mnt, p, enabledCgroups, subcgroup, serviceNames); err != nil {
			log.FatalE("couldn't mount the container v1 cgroups", err)
		}

	}

	// KVM flavor has a bit different logic in handling pid vs ppid, for details look into #2389
	// it doesn't require the existence of a "ppid", instead it registers the current pid (which
	// will be reused by lkvm binary) as a pod process pid used during entering
	pid_filename := "ppid"
	if flavor == "kvm" {
		pid_filename = "pid"
	}

	if err = stage1common.WritePid(os.Getpid(), pid_filename); err != nil { // write the pid/ppid file
		log.FatalE("error writing pid", err)
	}

	if flavor == "kvm" {
		if err := KvmPrepareMounts(p); err != nil {
			log.FatalE("error preparing mounts", err)
		}
	}

	err = stage1common.WithClearedCloExec(lfd, func() error {
		return syscall.Exec(args[0], args, env) // exec into systemd-nspawn (or lkvm)
	})

	if err != nil {
		log.FatalE(fmt.Sprintf("failed to execute %q", args[0]), err)
	}

	return 0
}

...


// getArgsEnv returns the nspawn or lkvm args and env according to the flavor
// as the first two return values respectively.
func getArgsEnv(p *stage1commontypes.Pod, flavor string, canMachinedRegister bool, debug bool, n *networking.Networking, parentIPC bool) ([]string, []string, error) {
	var args []string
	env := os.Environ()

	// We store the pod's flavor so we can later garbage collect it correctly
	if err := os.Symlink(flavor, filepath.Join(p.Root, stage1initcommon.FlavorFile)); err != nil {
		return nil, nil, errwrap.Wrap(errors.New("failed to create flavor symlink"), err)
	}

	// systemd-nspawn needs /etc/machine-id to link the container's journal
	// to the host. Since systemd-v230, /etc/machine-id is mandatory, see
	// https://github.com/systemd/systemd/commit/e01ff70a77e781734e1e73a2238af2e9bf7967a8
	mPath := filepath.Join(common.Stage1RootfsPath(p.Root), "etc", "machine-id")
	machineID := strings.Replace(p.UUID.String(), "-", "", -1) // derive the machine ID from the pod UUID

	switch flavor { // dispatch on the stage1 flavor; the default build uses coreos
	case "kvm":
		if p.PrivateUsers != "" {
			return nil, nil, fmt.Errorf("flag --private-users cannot be used with an lkvm stage1")
		}

		// kernel and hypervisor binaries are located relative to the working directory
		// of init (/var/lib/rkt/..../uuid)
		// TODO: move to path.go
		kernelPath := filepath.Join(common.Stage1RootfsPath(p.Root), "kernel_image")
		netDescriptions := kvm.GetNetworkDescriptions(n)

		cpu, mem := kvm.GetAppsResources(p.Manifest.Apps)

		// Parse hypervisor
		hv, err := KvmCheckHypervisor(common.Stage1RootfsPath(p.Root))
		if err != nil {
			return nil, nil, err
		}

		// Set start command for hypervisor
		StartCmd := hvlkvm.StartCmd
		switch hv {
		case "lkvm":
			StartCmd = hvlkvm.StartCmd
		case "qemu":
			StartCmd = hvqemu.StartCmd
		default:
			return nil, nil, fmt.Errorf("unrecognized hypervisor")
		}

		hvStartCmd := StartCmd(
			common.Stage1RootfsPath(p.Root),
			p.UUID.String(),
			kernelPath,
			netDescriptions,
			cpu,
			mem,
			debug,
		)

		if hvStartCmd == nil {
			return nil, nil, fmt.Errorf("no hypervisor")
		}

		args = append(args, hvStartCmd...)

		// lkvm requires $HOME to be defined,
		// see https://github.com/rkt/rkt/issues/1393
		if os.Getenv("HOME") == "" {
			env = append(env, "HOME=/root")
		}

		if err := linkJournal(common.Stage1RootfsPath(p.Root), machineID); err != nil {
			return nil, nil, errwrap.Wrap(errors.New("error linking pod's journal"), err)
		}

		// use only dynamic libraries provided in the image
		// from systemd v231 there's a new internal libsystemd-shared-v231.so
		// which is present in /usr/lib/systemd
		env = append(env, "LD_LIBRARY_PATH="+filepath.Join(common.Stage1RootfsPath(p.Root), "usr/lib/systemd"))

		return args, env, nil

	case "coreos":
		args = append(args, filepath.Join(common.Stage1RootfsPath(p.Root), interpBin))  // ELF interpreter shipped in the stage1 rootfs
		args = append(args, filepath.Join(common.Stage1RootfsPath(p.Root), nspawnBin)) // systemd-nspawn binary in the stage1 rootfs
		args = append(args, "--boot")             // Launch systemd in the pod
		args = append(args, "--notify-ready=yes") // From systemd v231

		if context := os.Getenv(common.EnvSELinuxContext); context != "" {
			args = append(args, fmt.Sprintf("-Z%s", context))
		}

		if context := os.Getenv(common.EnvSELinuxMountContext); context != "" {
			args = append(args, fmt.Sprintf("-L%s", context))
		}

		if canMachinedRegister {
			args = append(args, fmt.Sprintf("--register=true")) // register with systemd-machined
		} else {
			args = append(args, fmt.Sprintf("--register=false"))
		}

		kubernetesLogDir, ok := p.Manifest.Annotations.Get("coreos.com/rkt/experiment/kubernetes-log-dir")
		if ok {
			args = append(args, fmt.Sprintf("--bind=%s:/rkt/kubernetes/log", kubernetesLogDir))
		}

		// use only dynamic libraries provided in the image
		// from systemd v231 there's a new internal libsystemd-shared-v231.so
		// which is present in /usr/lib/systemd
		env = append(env, "LD_LIBRARY_PATH="+
			filepath.Join(common.Stage1RootfsPath(p.Root), "usr/lib")+":"+
			filepath.Join(common.Stage1RootfsPath(p.Root), "usr/lib/systemd"))

	case "src":
		args = append(args, filepath.Join(common.Stage1RootfsPath(p.Root), interpBin))
		args = append(args, filepath.Join(common.Stage1RootfsPath(p.Root), nspawnBin))
		args = append(args, "--boot")             // Launch systemd in the pod
		args = append(args, "--notify-ready=yes") // From systemd v231

		if context := os.Getenv(common.EnvSELinuxContext); context != "" {
			args = append(args, fmt.Sprintf("-Z%s", context))
		}

		if context := os.Getenv(common.EnvSELinuxMountContext); context != "" {
			args = append(args, fmt.Sprintf("-L%s", context))
		}

		if canMachinedRegister {
			args = append(args, fmt.Sprintf("--register=true"))
		} else {
			args = append(args, fmt.Sprintf("--register=false"))
		}

		kubernetesLogDir, ok := p.Manifest.Annotations.Get("coreos.com/rkt/experiment/kubernetes-log-dir")
		if ok {
			args = append(args, fmt.Sprintf("--bind=%s:/rkt/kubernetes/log", kubernetesLogDir))
		}

		// use only dynamic libraries provided in the image
		// from systemd v231 there's a new internal libsystemd-shared-v231.so
		// which is present in /usr/lib/systemd
		env = append(env, "LD_LIBRARY_PATH="+
			filepath.Join(common.Stage1RootfsPath(p.Root), "usr/lib")+":"+
			filepath.Join(common.Stage1RootfsPath(p.Root), "usr/lib/systemd"))

	case "host":
		hostNspawnBin, err := common.LookupPath("systemd-nspawn", os.Getenv("PATH"))
		if err != nil {
			return nil, nil, err
		}

		// Check dynamically which version is installed on the host
		// Support version >= 220
		versionBytes, err := exec.Command(hostNspawnBin, "--version").CombinedOutput()
		if err != nil {
			return nil, nil, errwrap.Wrap(fmt.Errorf("unable to probe %s version", hostNspawnBin), err)
		}
		versionStr := strings.SplitN(string(versionBytes), "\n", 2)[0]
		var version int
		n, err := fmt.Sscanf(versionStr, "systemd %d", &version)
		if err != nil {
			return nil, nil, fmt.Errorf("cannot parse version: %q", versionStr)
		}
		if n != 1 || version < 220 {
			return nil, nil, fmt.Errorf("rkt needs systemd-nspawn >= 220. %s version not supported: %v", hostNspawnBin, versionStr)
		}

		// Copy systemd, bash, etc. in stage1 at run-time
		if err := installAssets(version); err != nil {
			return nil, nil, errwrap.Wrap(errors.New("cannot install assets from the host"), err)
		}

		args = append(args, hostNspawnBin)
		args = append(args, "--boot") // Launch systemd in the pod
		args = append(args, fmt.Sprintf("--register=true"))

		if version >= 231 {
			args = append(args, "--notify-ready=yes") // From systemd v231
		}

		if context := os.Getenv(common.EnvSELinuxContext); context != "" {
			args = append(args, fmt.Sprintf("-Z%s", context))
		}

		if context := os.Getenv(common.EnvSELinuxMountContext); context != "" {
			args = append(args, fmt.Sprintf("-L%s", context))
		}

		kubernetesLogDir, ok := p.Manifest.Annotations.Get("coreos.com/rkt/experiment/kubernetes-log-dir")
		if ok {
			args = append(args, fmt.Sprintf("--bind=%s:/rkt/kubernetes/log", kubernetesLogDir))
		}

	default:
		return nil, nil, fmt.Errorf("unrecognized stage1 flavor: %q", flavor)
	}

	machineIDBytes := append([]byte(machineID), '\n') // persist the machine ID
	if err := ioutil.WriteFile(mPath, machineIDBytes, 0644); err != nil {
		return nil, nil, errwrap.Wrap(errors.New("error writing /etc/machine-id"), err)
	}
	if err := user.ShiftFiles([]string{mPath}, &p.UidRange); err != nil {
		return nil, nil, errwrap.Wrap(errors.New("error shifting /etc/machine-id"), err)
	}

	// link journal only if the host is running systemd
	if util.IsRunningSystemd() { // the host runs systemd
		args = append(args, "--link-journal=try-guest")

		keepUnit, err := util.RunningFromSystemService()
		if err != nil {
			if err == dlopen.ErrSoNotFound {
				log.Print("warning: libsystemd not found even though systemd is running. Cgroup limits set by the environment (e.g. a systemd service) won't be enforced.")
			} else {
				return nil, nil, errwrap.Wrap(errors.New("error determining if we're running from a system service"), err)
			}
		}

		if keepUnit {
			args = append(args, "--keep-unit")
		}
	} else {
		args = append(args, "--link-journal=no")
	}

	if !debug {
		args = append(args, "--quiet")             // silence most nspawn output (log_warning is currently not covered by this)
		env = append(env, "SYSTEMD_LOG_LEVEL=err") // silence log_warning too
	}

	if parentIPC {
		env = append(env, "SYSTEMD_NSPAWN_SHARE_NS_IPC=true")
	}

	env = append(env, "SYSTEMD_NSPAWN_CONTAINER_SERVICE=rkt")
	// TODO (alepuccetti) remove this line when rkt will use cgroup namespace
	// If the kernel has the cgroup namespace enabled, systemd v232 will use it by default.
	// This was introduced by https://github.com/systemd/systemd/pull/3809 and it will cause
	// problems in rkt when cgns is enabled and cgroup-v1 is used. For more information see
	// https://github.com/systemd/systemd/pull/3589#discussion_r70277625.
	// The following line tells systemd-nspawn not to use cgroup namespace using the environment variable
	// introduced by https://github.com/systemd/systemd/pull/3809.
	env = append(env, "SYSTEMD_NSPAWN_USE_CGNS=no")

	if p.InsecureOptions.DisablePaths {
		env = append(env, "SYSTEMD_NSPAWN_API_VFS_WRITABLE=yes")
	}

	if len(p.PrivateUsers) > 0 {
		args = append(args, "--private-users="+p.PrivateUsers) // user namespace mapping
	}

	nsargs, err := stage1initcommon.PodToNspawnArgs(p)
	if err != nil {
		return nil, nil, errwrap.Wrap(errors.New("failed to generate nspawn args"), err)
	}
	args = append(args, nsargs...)

	// Arguments to systemd
	args = append(args, "--")
	args = append(args, "--default-standard-output=tty") // redirect all service logs straight to tty
	if !debug {
		args = append(args, "--log-target=null") // silence systemd output inside pod
		args = append(args, "--show-status=0")   // silence systemd initialization status output
	}

	return args, env, nil // final args and env for launching the pod
}

From this flow, stage1's main job is to write the pod's configuration to disk inside an isolated environment and launch it, by default via systemd and systemd-nspawn.

stage2

Finally, the real application executables are launched inside the environment that has already been prepared.
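The executable that stage2 finally runs comes from the app.exec field of the image manifest, as mentioned in step 3 earlier. A trimmed appc image manifest illustrating this (the name, version, and paths are examples):

```json
{
  "acKind": "ImageManifest",
  "acVersion": "0.8.11",
  "name": "example.com/app1",
  "app": {
    "exec": ["/usr/bin/app1", "--serve"],
    "user": "0",
    "group": "0"
  }
}
```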

By organizing everything as layers of files and delegating to systemd and systemd-nspawn, rkt manages containers in a purely file-based way, and different stage1 flavors provide different isolation modes. In the simplest mode, apps can even be run using only basic Linux features; that flow lives in the stage1_fly code.

Summary

rkt uses the filesystem to manage every running pod, which safely decouples all running containers. Its layered architecture supports running isolated environments via different mechanisms such as kvm and systemd/nspawn, and it also provides the minimal fly environment for running apps directly. Over years of development, however, as the ecosystem converged on the OCI specification, containerd and cri-o emerged and took over. My knowledge is limited, so please point out any mistakes.
