A Brief Overview of rkt
The rkt Project
rkt was one of the earliest runtime components used alongside Kubernetes, and it was accepted as a CNCF sandbox project. In the end it was abandoned: cri-o and containerd gained far wider acceptance, rkt's community activity kept declining, and maintenance eventually stopped. Even so, it remains a good case study of rkt's design approach and ideas. The main learning material is the official documentation on how it runs and its architecture.
How rkt Works
rkt's primary interface is a command-line tool, rkt, which needs no long-running daemon. This architecture allows rkt to be upgraded in place without affecting currently running application containers, and it also means that privilege levels can be separated between different operations.
All state in rkt is communicated through the filesystem; facilities such as file locking ensure cooperation and mutual exclusion between concurrent invocations of rkt commands.
rkt implements each layer of its logic separately, which keeps the system stable through rapid iteration.
The diagram on the official site shows clearly how, after the invoking process calls rkt, the container gets isolated and the app is finally started: stage1 corresponds to the pod in the broad sense, and stage2 is the runtime container running inside that pod.
Startup happens in three steps:
- invoking process -> stage0: the invoking process uses its own mechanism to invoke the rkt binary (stage0). When launched from a regular shell or a supervisor, stage0 is typically forked and exec'ed, becoming a child of the invoking shell or supervisor.
- stage0 -> stage1: the stage0 process image is replaced by the stage1 entrypoint via a plain exec(3). The entrypoint is referenced by the coreos.com/rkt/stage1/run annotation in the stage1 image manifest.
- stage1 -> stage2: the stage1 entrypoint uses its own mechanism to invoke the stage2 application executables. The application executable is referenced by the apps.app.exec setting in the stage2 image manifest.
These three steps split container creation into distinct initialization phases, avoid any long-running background process, and keep all runtime state organized and managed through files.
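The hand-off pattern between the stages can be sketched as follows; the entrypoint path is hypothetical, and the point is only that each stage exec(3)s the next, so the same PID is reused and no supervising daemon remains:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// execStage replaces the current process image with the next stage's
// entrypoint, mirroring the stage0 -> stage1 -> stage2 hand-off: the
// same PID simply becomes the next stage.
func execStage(entrypoint string, extra []string) error {
	// On success Exec never returns; this program is gone.
	return syscall.Exec(entrypoint, append([]string{entrypoint}, extra...), os.Environ())
}

func main() {
	// Hypothetical entrypoint path, for illustration only; exec'ing a
	// path that does not exist simply reports an error.
	if err := execStage("/nonexistent/stage1/rootfs/init", nil); err != nil {
		fmt.Fprintln(os.Stderr, "exec failed:", err)
	}
}
```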
stage0
When the rkt binary is invoked to run a pod, it performs the following initialization tasks:
- Fetch the specified ACIs (images) along with their associated metadata.
- Generate a pod UUID.
- Generate a pod manifest.
- Create a filesystem for the pod.
- Create the stage1 and stage2 directories inside that filesystem.
- Unpack the stage1 ACI into the pod filesystem.
- Unpack the app ACIs and copy each app into the stage2 directories.
A pod manifest conforming to the ACE spec is generated, and the filesystem created by stage0 looks like this:
/pod
/stage1
/stage1/manifest
/stage1/rootfs/init
/stage1/rootfs/opt
/stage1/rootfs/opt/stage2/${app1-name}
/stage1/rootfs/opt/stage2/${app2-name}
where:
- pod is the pod manifest file.
- stage1 is a copy of the stage1 image that is safe to read and write.
- stage1/manifest is the manifest of the stage1 image.
- stage1/rootfs is the root filesystem of the stage1 image.
- stage1/rootfs/init is the executable configured in the stage1 image.
- stage1/rootfs/opt/stage2 holds the unpacked copies of the app images.
At this point stage0 executes /stage1/rootfs/init with the current working directory set to the root of the new filesystem.
Take rkt run app.aci as an example: this command ultimately calls the wrapped runRun function.
func runRun(cmd *cobra.Command, args []string) (exit int) {
privateUsers := user.NewBlankUidRange() // start with an empty uid range for user namespaces
err := parseApps(&rktApps, args, cmd.Flags(), true) // parse the app image arguments
if err != nil {
stderr.PrintE("error parsing app image arguments", err)
return 254
}
if flagStoreOnly && flagNoStore {
stderr.Print("both --store-only and --no-store specified")
return 254
}
if flagStoreOnly { // determine the image pull policy
flagPullPolicy = image.PullPolicyNever
}
if flagNoStore {
flagPullPolicy = image.PullPolicyUpdate
}
if flagPrivateUsers { // user namespace support
if !common.SupportsUserNS() {
stderr.Print("--private-users is not supported, kernel compiled without user namespace support")
return 254
}
privateUsers.SetRandomUidRange(user.DefaultRangeCount)
}
if len(flagPorts) > 0 && flagNet.None() { // validate --port against the networking mode
stderr.Print("--port flag does not work with 'none' networking")
return 254
}
if len(flagPorts) > 0 && flagNet.Host() {
stderr.Print("--port flag does not work with 'host' networking")
return 254
}
if flagMDSRegister && flagNet.None() {
stderr.Print("--mds-register flag does not work with --net=none. Please use 'host', 'default' or an equivalent network")
return 254
}
if len(flagPodManifest) > 0 && (rktApps.Count() > 0 ||
(*appsVolume)(&rktApps).String() != "" || (*appMount)(&rktApps).String() != "" ||
len(flagPorts) > 0 || flagPullPolicy == image.PullPolicyNever ||
flagPullPolicy == image.PullPolicyUpdate || flagInheritEnv ||
!flagExplicitEnv.IsEmpty() || !flagEnvFromFile.IsEmpty()) {
stderr.Print("conflicting flags set with --pod-manifest (see --help)")
return 254
}
if flagInteractive && rktApps.Count() > 1 { // interactive mode supports only a single app
stderr.Print("interactive option only supports one image")
return 254
}
if rktApps.Count() < 1 && len(flagPodManifest) == 0 {
stderr.Print("must provide at least one image or specify the pod manifest")
return 254
}
s, err := imagestore.NewStore(storeDir()) // open the image store
if err != nil {
stderr.PrintE("cannot open store", err)
return 254
}
ts, err := treestore.NewStore(treeStoreDir(), s)
if err != nil {
stderr.PrintE("cannot open treestore", err)
return 254
}
config, err := getConfig()
if err != nil {
stderr.PrintE("cannot get configuration", err)
return 254
}
s1img, err := getStage1Hash(s, ts, config) // resolve the stage1 image hash
if err != nil {
stderr.Error(err)
return 254
}
fn := &image.Finder{
S: s,
Ts: ts,
Ks: getKeystore(),
Headers: config.AuthPerHost,
DockerAuth: config.DockerCredentialsPerRegistry,
InsecureFlags: globalFlags.InsecureFlags,
Debug: globalFlags.Debug,
TrustKeysFromHTTPS: globalFlags.TrustKeysFromHTTPS,
PullPolicy: flagPullPolicy,
WithDeps: true,
}
if err := fn.FindImages(&rktApps); err != nil { // find and fetch the images
stderr.Error(err)
return 254
}
p, err := pkgPod.NewPod(getDataDir()) // create a new pod working directory and set its state
if err != nil {
stderr.PrintE("error creating new pod", err)
return 254
}
// if requested, write out pod UUID early so "rkt rm" can
// clean it up even if something goes wrong
if flagUUIDFileSave != "" {
if err := pkgPod.WriteUUIDToFile(p.UUID, flagUUIDFileSave); err != nil { // save the pod UUID for later management
stderr.PrintE("error saving pod UUID to file", err)
return 254
}
}
processLabel, mountLabel, err := label.InitLabels([]string{}) // initialize SELinux labels
if err != nil {
stderr.PrintE("error initialising SELinux", err)
return 254
}
p.MountLabel = mountLabel
cfg := stage0.CommonConfig{
DataDir: getDataDir(),
MountLabel: mountLabel,
ProcessLabel: processLabel,
Store: s,
TreeStore: ts,
Stage1Image: *s1img,
UUID: p.UUID,
Debug: globalFlags.Debug,
Mutable: false,
} // common stage0 configuration
ovlOk := true
if err := common.PathSupportsOverlay(getDataDir()); err != nil { // check whether the data dir supports overlayfs
if oerr, ok := err.(common.ErrOverlayUnsupported); ok {
stderr.Printf("disabling overlay support: %q", oerr.Error())
ovlOk = false
} else {
stderr.PrintE("error determining overlay support", err)
return 254
}
}
useOverlay := !flagNoOverlay && ovlOk
pcfg := stage0.PrepareConfig{
CommonConfig: &cfg,
UseOverlay: useOverlay,
PrivateUsers: privateUsers,
} // prepare-phase configuration
if len(flagPodManifest) > 0 {
pcfg.PodManifest = flagPodManifest
} else {
pcfg.Ports = []types.ExposedPort(flagPorts)
pcfg.InheritEnv = flagInheritEnv
pcfg.ExplicitEnv = flagExplicitEnv.Strings()
pcfg.EnvFromFile = flagEnvFromFile.Strings()
pcfg.Apps = &rktApps
}
if globalFlags.Debug {
stage0.InitDebug()
}
keyLock, err := lock.SharedKeyLock(lockDir(), common.PrepareLock) // take the shared prepare lock
if err != nil {
stderr.PrintE("cannot get shared prepare lock", err)
return 254
}
err = stage0.Prepare(pcfg, p.Path(), p.UUID) // run the prepare phase
if err != nil {
stderr.PrintE("error setting up stage0", err)
keyLock.Close()
return 254
}
keyLock.Close()
// get the lock fd for run
lfd, err := p.Fd()
if err != nil {
stderr.PrintE("error getting pod lock fd", err)
return 254
}
// skip prepared by jumping directly to run, we own this pod
if err := p.ToRun(); err != nil { // transition the pod state straight to run
stderr.PrintE("unable to transition to run", err)
return 254
}
rktgid, err := common.LookupGid(common.RktGroup)
if err != nil {
stderr.Printf("group %q not found, will use default gid when rendering images", common.RktGroup)
rktgid = -1
}
DNSConfMode, DNSConfig, HostsEntries, err := parseDNSFlags(flagHostsEntries, flagDNS, flagDNSSearch, flagDNSOpt, flagDNSDomain)
if err != nil { // parse the DNS-related flags
stderr.PrintE("error with dns flags", err)
return 254
}
rcfg := stage0.RunConfig{
CommonConfig: &cfg,
Net: flagNet,
LockFd: lfd,
Interactive: flagInteractive,
DNSConfMode: DNSConfMode,
DNSConfig: DNSConfig,
MDSRegister: flagMDSRegister,
LocalConfig: globalFlags.LocalConfigDir,
RktGid: rktgid,
Hostname: flagHostname,
InsecureCapabilities: globalFlags.InsecureFlags.SkipCapabilities(),
InsecurePaths: globalFlags.InsecureFlags.SkipPaths(),
InsecureSeccomp: globalFlags.InsecureFlags.SkipSeccomp(),
UseOverlay: useOverlay,
HostsEntries: *HostsEntries,
IPCMode: flagIPCMode,
} // run-phase configuration
_, manifest, err := p.PodManifest() // load the pod manifest
if err != nil {
stderr.PrintE("cannot get the pod manifest", err)
return 254
}
if len(manifest.Apps) == 0 {
stderr.Print("pod must contain at least one application")
return 254
}
rcfg.Apps = manifest.Apps
stage0.Run(rcfg, p.Path(), getDataDir()) // execs, never returns: hands off from stage0 to stage1
return 254
}
...
// Run mounts the right overlay filesystems and actually runs the prepared
// pod by exec()ing the stage1 init inside the pod filesystem.
func Run(cfg RunConfig, dir string, dataDir string) {
privateUsers, err := preparedWithPrivateUsers(dir)
if err != nil {
log.FatalE("error preparing private users", err)
}
debug("Setting up stage1")
if err := setupStage1Image(cfg, dir, cfg.UseOverlay); err != nil { // set up the stage1 image filesystem
log.FatalE("error setting up stage1", err)
}
debug("Wrote filesystem to %s\n", dir)
for _, app := range cfg.Apps {
if err := setupAppImage(cfg, app.Name, app.Image.ID, dir, cfg.UseOverlay); err != nil {
log.FatalE("error setting up app image", err)
}
}
destRootfs := common.Stage1RootfsPath(dir) // path to the stage1 rootfs
writeDnsConfig(&cfg, destRootfs) // write the DNS configuration
if err := os.Setenv(common.EnvLockFd, fmt.Sprintf("%v", cfg.LockFd)); err != nil {
log.FatalE("setting lock fd environment", err)
}
if err := os.Setenv(common.EnvSELinuxContext, fmt.Sprintf("%v", cfg.ProcessLabel)); err != nil {
log.FatalE("setting SELinux context environment", err)
}
if err := os.Setenv(common.EnvSELinuxMountContext, fmt.Sprintf("%v", cfg.MountLabel)); err != nil {
log.FatalE("setting SELinux mount context environment", err)
}
debug("Pivoting to filesystem %s", dir)
if err := os.Chdir(dir); err != nil {
log.FatalE("failed changing to dir", err)
}
ep, err := getStage1Entrypoint(dir, runEntrypoint) // resolve the next stage's 'run' entrypoint
if err != nil {
log.FatalE("error determining 'run' entrypoint", err)
}
args := []string{filepath.Join(destRootfs, ep)}
if cfg.Debug {
args = append(args, "--debug")
}
args = append(args, "--net="+cfg.Net.String()) // pass the networking configuration
if cfg.Interactive {
args = append(args, "--interactive")
}
if len(privateUsers) > 0 {
args = append(args, "--private-users="+privateUsers)
}
if cfg.MDSRegister {
mdsToken, err := registerPod(".", cfg.UUID, cfg.Apps)
if err != nil {
log.FatalE("failed to register the pod", err)
}
args = append(args, "--mds-token="+mdsToken)
}
if cfg.LocalConfig != "" {
args = append(args, "--local-config="+cfg.LocalConfig)
}
s1v, err := getStage1InterfaceVersion(dir)
if err != nil {
log.FatalE("error determining stage1 interface version", err)
}
if cfg.Hostname != "" {
if interfaceVersionSupportsHostname(s1v) {
args = append(args, "--hostname="+cfg.Hostname)
} else {
log.Printf("warning: --hostname option is not supported by stage1")
}
}
if cfg.DNSConfMode.Hosts != "default" || cfg.DNSConfMode.Resolv != "default" {
if interfaceVersionSupportsDNSConfMode(s1v) {
args = append(args, fmt.Sprintf("--dns-conf-mode=resolv=%s,hosts=%s", cfg.DNSConfMode.Resolv, cfg.DNSConfMode.Hosts))
} else {
log.Printf("warning: --dns-conf-mode option not supported by stage1")
}
}
if interfaceVersionSupportsInsecureOptions(s1v) {
if cfg.InsecureCapabilities {
args = append(args, "--disable-capabilities-restriction")
}
if cfg.InsecurePaths {
args = append(args, "--disable-paths")
}
if cfg.InsecureSeccomp {
args = append(args, "--disable-seccomp")
}
}
if cfg.Mutable { // mutable pod support
mutable, err := supportsMutableEnvironment(dir)
switch {
case err != nil:
log.FatalE("error determining stage1 mutable support", err)
case !mutable:
log.Fatalln("stage1 does not support mutable pods")
}
args = append(args, "--mutable")
}
if cfg.IPCMode != "" {
if interfaceVersionSupportsIPCMode(s1v) {
args = append(args, "--ipc="+cfg.IPCMode)
} else {
log.Printf("warning: --ipc option is not supported by stage1")
}
}
args = append(args, cfg.UUID.String())
// make sure the lock fd stays open across exec
if err := sys.CloseOnExec(cfg.LockFd, false); err != nil {
log.Fatalf("error clearing FD_CLOEXEC on lock fd")
}
tpmEvent := fmt.Sprintf("rkt: Rootfs: %s Manifest: %s Stage1 args: %s", cfg.CommonConfig.RootHash, cfg.CommonConfig.ManifestData, strings.Join(args, " "))
// If there's no TPM available or there's a failure for some other
// reason, ignore it and continue anyway. Long term we'll want policy
// that enforces TPM behaviour, but we don't have any infrastructure
// around that yet.
_ = tpm.Extend(tpmEvent)
debug("Execing %s", args)
if err := syscall.Exec(args[0], args, os.Environ()); err != nil { // exec into the stage1 init
log.FatalE("error execing init", err)
}
}
As the source flow shows, the pod is created first, the working directory is then set up from the configuration, and finally control jumps straight into the init program; in the default case init points at the next stage's executable.
stage1
Once the init program runs, it converts the pod's manifest into systemd-nspawn service units. Its main job is to start the pod using the isolation environment, network, and mounts that stage0 created. The principal tasks are:
- Read the pod manifest, obtain each image's default execution entrypoint, and rewrite it according to the configuration.
- Set up and execute the apps inside an isolated environment. There are currently three execution flavors: fly, a minimal chroot-only environment; systemd/nspawn, execution via systemd inside an isolated environment; and kvm, a fully isolated KVM environment.
By default the systemd execution environment is used.
func stage1(rp *stage1commontypes.RuntimePod) int {
uuid, err := types.NewUUID(flag.Arg(0)) // parse the pod UUID from the arguments
if err != nil {
log.FatalE("UUID is missing or malformed", err)
}
root := "."
p, err := stage1commontypes.LoadPod(root, uuid, rp) // load the pod manifest information
if err != nil {
log.FatalE("failed to load pod", err)
}
if err := p.SaveRuntime(); err != nil { // save the runtime parameters
log.FatalE("failed to save runtime parameters", err)
}
// set close-on-exec flag on RKT_LOCK_FD so it gets correctly closed when invoking
// network plugins
lfd, err := common.GetRktLockFD() // get the rkt lock fd
if err != nil {
log.FatalE("failed to get rkt lock fd", err)
}
if err := sys.CloseOnExec(lfd, true); err != nil {
log.FatalE("failed to set FD_CLOEXEC on rkt lock", err)
}
mirrorLocalZoneInfo(p.Root)
flavor, _, err := stage1initcommon.GetFlavor(p) // determine the stage1 flavor
if err != nil {
log.FatalE("failed to get stage1 flavor", err)
}
var n *networking.Networking // networking state, if a private network is used
if p.NetList.Contained() {
fps, err := commonnet.ForwardedPorts(p.Manifest)
if err != nil {
log.FatalE("error initializing forwarding ports", err)
}
noDNS := p.ResolvConfMode != "default" // force ignore CNI DNS results
n, err = networking.Setup(root, p.UUID, fps, p.NetList, localConfig, flavor, noDNS, debug)
if err != nil {
log.FatalE("failed to setup network", err)
}
if err = n.Save(); err != nil {
log.PrintE("failed to save networking state", err)
n.Teardown(flavor, debug)
return 254
}
if len(p.MDSToken) > 0 {
hostIP, err := n.GetForwardableNetHostIP()
if err != nil {
log.FatalE("failed to get default Host IP", err)
}
p.MetadataServiceURL = common.MetadataServicePublicURL(hostIP, p.MDSToken)
}
} else {
if flavor == "kvm" {
log.Fatal("flavor kvm requires private network configuration (try --net)")
}
if len(p.MDSToken) > 0 {
p.MetadataServiceURL = common.MetadataServicePublicURL(localhostIP, p.MDSToken)
}
}
mnt := fs.NewLoggingMounter(
fs.MounterFunc(syscall.Mount),
fs.UnmounterFunc(syscall.Unmount),
diag.Printf,
)
// set hostname inside pod
// According to systemd manual (https://www.freedesktop.org/software/systemd/man/hostname.html) :
// "The /etc/hostname file configures the name of the local system that is set
// during boot using the sethostname system call"
if p.Hostname == "" {
p.Hostname = stage1initcommon.GetMachineID(p)
}
hostnamePath := filepath.Join(common.Stage1RootfsPath(p.Root), "etc/hostname") // path of the pod's hostname file
if err := ioutil.WriteFile(hostnamePath, []byte(p.Hostname), 0644); err != nil {
log.PrintE("error writing "+hostnamePath, err)
return 254
}
if err := user.ShiftFiles([]string{hostnamePath}, &p.UidRange); err != nil {
log.PrintE("error shifting "+hostnamePath, err)
}
if p.ResolvConfMode == "host" {
stage1initcommon.UseHostResolv(mnt, root)
}
// Set up the hosts file.
// We write <stage1>/etc/rkt-hosts if we want to override each app's hosts,
// and <stage1>/etc/hosts-fallback if we want to let the app "win"
// Either way, we should add our hostname to it, unless the hosts's
// /etc/hosts is bind-mounted in.
if p.EtcHostsMode == "host" { // We should bind-mount the hosts's /etc/hosts
stage1initcommon.UseHostHosts(mnt, root)
} else if p.EtcHostsMode == "default" { // Create hosts-fallback
hostsFile := filepath.Join(common.Stage1RootfsPath(p.Root), "etc", "hosts-fallback")
if err := stage1initcommon.AddHostsEntry(hostsFile, "127.0.0.1", p.Hostname); err != nil {
log.PrintE("Failed to write hostname to "+hostsFile, err)
return 254
}
} else if p.EtcHostsMode == "stage0" { // The stage0 has created rkt-hosts
hostsFile := filepath.Join(common.Stage1RootfsPath(p.Root), "etc", "rkt-hosts")
if err := stage1initcommon.AddHostsEntry(hostsFile, "127.0.0.1", p.Hostname); err != nil {
log.PrintE("Failed to write hostname to "+hostsFile, err)
return 254
}
}
if p.Mutable {
if err = stage1initcommon.MutableEnv(p); err != nil { // mutable pod: write the corresponding unit files
log.FatalE("cannot initialize mutable environment", err)
}
} else {
if err = stage1initcommon.ImmutableEnv(p); err != nil { // immutable pod: write the corresponding unit files
log.FatalE("cannot initialize immutable environment", err)
}
}
if err := stage1initcommon.SetJournalPermissions(p); err != nil { // set journal ACLs
log.PrintE("warning: error setting journal ACLs, you'll need root to read the pod journal", err)
}
if flavor == "kvm" {
kvm.InitDebug(debug)
if err := KvmNetworkingToSystemd(p, n); err != nil {
log.FatalE("failed to configure systemd for kvm", err)
}
}
canMachinedRegister := false
if flavor != "kvm" {
// kvm doesn't register with systemd right now, see #2664.
canMachinedRegister = machinedRegister()
}
diag.Printf("canMachinedRegister %t", canMachinedRegister)
// --ipc=[auto|private|parent]
// default to private
parentIPC := false
switch p.IPCMode {
case "parent":
parentIPC = true
case "private":
parentIPC = false
case "auto":
fallthrough
case "":
parentIPC = false
default:
log.Fatalf("unknown value for --ipc parameter: %v", p.IPCMode)
}
if parentIPC && flavor == "kvm" {
log.Fatal("flavor kvm requires private IPC namespace (try to remove --ipc)")
}
args, env, err := getArgsEnv(p, flavor, canMachinedRegister, debug, n, parentIPC) // build the exec args and environment
if err != nil {
log.FatalE("cannot get environment", err)
}
diag.Printf("args %q", args)
diag.Printf("env %q", env)
// create a separate mount namespace so the cgroup filesystems
// are unmounted when exiting the pod
if err := syscall.Unshare(syscall.CLONE_NEWNS); err != nil {
log.FatalE("error unsharing", err)
}
// we recursively make / a "shared and slave" so mount events from the
// new namespace don't propagate to the host namespace but mount events
// from the host propagate to the new namespace and are forwarded to
// its peer group
// See https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
if err := mnt.Mount("", "/", "none", syscall.MS_REC|syscall.MS_SLAVE, ""); err != nil {
log.FatalE("error making / a slave mount", err)
}
if err := mnt.Mount("", "/", "none", syscall.MS_REC|syscall.MS_SHARED, ""); err != nil {
log.FatalE("error making / a shared and slave mount", err)
}
unifiedCgroup, err := cgroup.IsCgroupUnified("/")
if err != nil {
log.FatalE("error determining cgroup version", err)
}
diag.Printf("unifiedCgroup %t", unifiedCgroup)
machineID := stage1initcommon.GetMachineID(p)
subcgroup, err := getContainerSubCgroup(machineID, canMachinedRegister, unifiedCgroup) // determine the container subcgroup
if err != nil {
log.FatalE("error getting container subcgroup", err)
}
diag.Printf("subcgroup %q", subcgroup)
if err := ioutil.WriteFile(filepath.Join(p.Root, "subcgroup"),
[]byte(fmt.Sprintf("%s", subcgroup)), 0644); err != nil { // write the subcgroup file
log.FatalE("cannot write subcgroup file", err)
}
if !unifiedCgroup {
enabledCgroups, err := v1.GetEnabledCgroups()
if err != nil {
log.FatalE("error getting v1 cgroups", err)
}
diag.Printf("enabledCgroups %q", enabledCgroups)
if err := mountHostV1Cgroups(mnt, enabledCgroups); err != nil {
log.FatalE("couldn't mount the host v1 cgroups", err)
}
if !canMachinedRegister {
if err := v1.JoinSubcgroup("systemd", subcgroup); err != nil {
log.FatalE(fmt.Sprintf("error joining subcgroup %q", subcgroup), err)
}
}
var serviceNames []string
for _, app := range p.Manifest.Apps {
serviceNames = append(serviceNames, stage1initcommon.ServiceUnitName(app.Name))
}
diag.Printf("serviceNames %q", serviceNames)
if err := mountContainerV1Cgroups(mnt, p, enabledCgroups, subcgroup, serviceNames); err != nil {
log.FatalE("couldn't mount the container v1 cgroups", err)
}
}
// KVM flavor has a bit different logic in handling pid vs ppid, for details look into #2389
// it doesn't require the existence of a "ppid", instead it registers the current pid (which
// will be reused by lkvm binary) as a pod process pid used during entering
pid_filename := "ppid"
if flavor == "kvm" {
pid_filename = "pid"
}
if err = stage1common.WritePid(os.Getpid(), pid_filename); err != nil { // write the pid file
log.FatalE("error writing pid", err)
}
if flavor == "kvm" {
if err := KvmPrepareMounts(p); err != nil {
log.FatalE("error preparing mounts", err)
}
}
err = stage1common.WithClearedCloExec(lfd, func() error {
return syscall.Exec(args[0], args, env) // exec into nspawn or the hypervisor
})
if err != nil {
log.FatalE(fmt.Sprintf("failed to execute %q", args[0]), err)
}
return 0
}
...
// getArgsEnv returns the nspawn or lkvm args and env according to the flavor
// as the first two return values respectively.
func getArgsEnv(p *stage1commontypes.Pod, flavor string, canMachinedRegister bool, debug bool, n *networking.Networking, parentIPC bool) ([]string, []string, error) {
var args []string
env := os.Environ()
// We store the pod's flavor so we can later garbage collect it correctly
if err := os.Symlink(flavor, filepath.Join(p.Root, stage1initcommon.FlavorFile)); err != nil {
return nil, nil, errwrap.Wrap(errors.New("failed to create flavor symlink"), err)
}
// systemd-nspawn needs /etc/machine-id to link the container's journal
// to the host. Since systemd-v230, /etc/machine-id is mandatory, see
// https://github.com/systemd/systemd/commit/e01ff70a77e781734e1e73a2238af2e9bf7967a8
mPath := filepath.Join(common.Stage1RootfsPath(p.Root), "etc", "machine-id")
machineID := strings.Replace(p.UUID.String(), "-", "", -1) // derive the machine id from the pod UUID
switch flavor { // dispatch on the stage1 flavor (default is coreos)
case "kvm":
if p.PrivateUsers != "" {
return nil, nil, fmt.Errorf("flag --private-users cannot be used with an lkvm stage1")
}
// kernel and hypervisor binaries are located relative to the working directory
// of init (/var/lib/rkt/..../uuid)
// TODO: move to path.go
kernelPath := filepath.Join(common.Stage1RootfsPath(p.Root), "kernel_image")
netDescriptions := kvm.GetNetworkDescriptions(n)
cpu, mem := kvm.GetAppsResources(p.Manifest.Apps)
// Parse hypervisor
hv, err := KvmCheckHypervisor(common.Stage1RootfsPath(p.Root))
if err != nil {
return nil, nil, err
}
// Set start command for hypervisor
StartCmd := hvlkvm.StartCmd
switch hv {
case "lkvm":
StartCmd = hvlkvm.StartCmd
case "qemu":
StartCmd = hvqemu.StartCmd
default:
return nil, nil, fmt.Errorf("unrecognized hypervisor")
}
hvStartCmd := StartCmd(
common.Stage1RootfsPath(p.Root),
p.UUID.String(),
kernelPath,
netDescriptions,
cpu,
mem,
debug,
)
if hvStartCmd == nil {
return nil, nil, fmt.Errorf("no hypervisor")
}
args = append(args, hvStartCmd...)
// lkvm requires $HOME to be defined,
// see https://github.com/rkt/rkt/issues/1393
if os.Getenv("HOME") == "" {
env = append(env, "HOME=/root")
}
if err := linkJournal(common.Stage1RootfsPath(p.Root), machineID); err != nil {
return nil, nil, errwrap.Wrap(errors.New("error linking pod's journal"), err)
}
// use only dynamic libraries provided in the image
// from systemd v231 there's a new internal libsystemd-shared-v231.so
// which is present in /usr/lib/systemd
env = append(env, "LD_LIBRARY_PATH="+filepath.Join(common.Stage1RootfsPath(p.Root), "usr/lib/systemd"))
return args, env, nil
case "coreos":
args = append(args, filepath.Join(common.Stage1RootfsPath(p.Root), interpBin)) // the ELF interpreter from the stage1 rootfs
args = append(args, filepath.Join(common.Stage1RootfsPath(p.Root), nspawnBin)) // the systemd-nspawn binary from the stage1 rootfs
args = append(args, "--boot") // Launch systemd in the pod
args = append(args, "--notify-ready=yes") // From systemd v231
if context := os.Getenv(common.EnvSELinuxContext); context != "" {
args = append(args, fmt.Sprintf("-Z%s", context))
}
if context := os.Getenv(common.EnvSELinuxMountContext); context != "" {
args = append(args, fmt.Sprintf("-L%s", context))
}
if canMachinedRegister {
args = append(args, fmt.Sprintf("--register=true")) // register with systemd-machined
} else {
args = append(args, fmt.Sprintf("--register=false"))
}
kubernetesLogDir, ok := p.Manifest.Annotations.Get("coreos.com/rkt/experiment/kubernetes-log-dir")
if ok {
args = append(args, fmt.Sprintf("--bind=%s:/rkt/kubernetes/log", kubernetesLogDir))
}
// use only dynamic libraries provided in the image
// from systemd v231 there's a new internal libsystemd-shared-v231.so
// which is present in /usr/lib/systemd
env = append(env, "LD_LIBRARY_PATH="+
filepath.Join(common.Stage1RootfsPath(p.Root), "usr/lib")+":"+
filepath.Join(common.Stage1RootfsPath(p.Root), "usr/lib/systemd"))
case "src":
args = append(args, filepath.Join(common.Stage1RootfsPath(p.Root), interpBin))
args = append(args, filepath.Join(common.Stage1RootfsPath(p.Root), nspawnBin))
args = append(args, "--boot") // Launch systemd in the pod
args = append(args, "--notify-ready=yes") // From systemd v231
if context := os.Getenv(common.EnvSELinuxContext); context != "" {
args = append(args, fmt.Sprintf("-Z%s", context))
}
if context := os.Getenv(common.EnvSELinuxMountContext); context != "" {
args = append(args, fmt.Sprintf("-L%s", context))
}
if canMachinedRegister {
args = append(args, fmt.Sprintf("--register=true"))
} else {
args = append(args, fmt.Sprintf("--register=false"))
}
kubernetesLogDir, ok := p.Manifest.Annotations.Get("coreos.com/rkt/experiment/kubernetes-log-dir")
if ok {
args = append(args, fmt.Sprintf("--bind=%s:/rkt/kubernetes/log", kubernetesLogDir))
}
// use only dynamic libraries provided in the image
// from systemd v231 there's a new internal libsystemd-shared-v231.so
// which is present in /usr/lib/systemd
env = append(env, "LD_LIBRARY_PATH="+
filepath.Join(common.Stage1RootfsPath(p.Root), "usr/lib")+":"+
filepath.Join(common.Stage1RootfsPath(p.Root), "usr/lib/systemd"))
case "host":
hostNspawnBin, err := common.LookupPath("systemd-nspawn", os.Getenv("PATH"))
if err != nil {
return nil, nil, err
}
// Check dynamically which version is installed on the host
// Support version >= 220
versionBytes, err := exec.Command(hostNspawnBin, "--version").CombinedOutput()
if err != nil {
return nil, nil, errwrap.Wrap(fmt.Errorf("unable to probe %s version", hostNspawnBin), err)
}
versionStr := strings.SplitN(string(versionBytes), "\n", 2)[0]
var version int
n, err := fmt.Sscanf(versionStr, "systemd %d", &version)
if err != nil {
return nil, nil, fmt.Errorf("cannot parse version: %q", versionStr)
}
if n != 1 || version < 220 {
return nil, nil, fmt.Errorf("rkt needs systemd-nspawn >= 220. %s version not supported: %v", hostNspawnBin, versionStr)
}
// Copy systemd, bash, etc. in stage1 at run-time
if err := installAssets(version); err != nil {
return nil, nil, errwrap.Wrap(errors.New("cannot install assets from the host"), err)
}
args = append(args, hostNspawnBin)
args = append(args, "--boot") // Launch systemd in the pod
args = append(args, fmt.Sprintf("--register=true"))
if version >= 231 {
args = append(args, "--notify-ready=yes") // From systemd v231
}
if context := os.Getenv(common.EnvSELinuxContext); context != "" {
args = append(args, fmt.Sprintf("-Z%s", context))
}
if context := os.Getenv(common.EnvSELinuxMountContext); context != "" {
args = append(args, fmt.Sprintf("-L%s", context))
}
kubernetesLogDir, ok := p.Manifest.Annotations.Get("coreos.com/rkt/experiment/kubernetes-log-dir")
if ok {
args = append(args, fmt.Sprintf("--bind=%s:/rkt/kubernetes/log", kubernetesLogDir))
}
default:
return nil, nil, fmt.Errorf("unrecognized stage1 flavor: %q", flavor)
}
machineIDBytes := append([]byte(machineID), '\n') // persist the machine id
if err := ioutil.WriteFile(mPath, machineIDBytes, 0644); err != nil {
return nil, nil, errwrap.Wrap(errors.New("error writing /etc/machine-id"), err)
}
if err := user.ShiftFiles([]string{mPath}, &p.UidRange); err != nil {
return nil, nil, errwrap.Wrap(errors.New("error shifting /etc/machine-id"), err)
}
// link journal only if the host is running systemd
if util.IsRunningSystemd() {
args = append(args, "--link-journal=try-guest")
keepUnit, err := util.RunningFromSystemService()
if err != nil {
if err == dlopen.ErrSoNotFound {
log.Print("warning: libsystemd not found even though systemd is running. Cgroup limits set by the environment (e.g. a systemd service) won't be enforced.")
} else {
return nil, nil, errwrap.Wrap(errors.New("error determining if we're running from a system service"), err)
}
}
if keepUnit {
args = append(args, "--keep-unit")
}
} else {
args = append(args, "--link-journal=no")
}
if !debug {
args = append(args, "--quiet") // silence most nspawn output (log_warning is currently not covered by this)
env = append(env, "SYSTEMD_LOG_LEVEL=err") // silence log_warning too
}
if parentIPC {
env = append(env, "SYSTEMD_NSPAWN_SHARE_NS_IPC=true")
}
env = append(env, "SYSTEMD_NSPAWN_CONTAINER_SERVICE=rkt")
// TODO (alepuccetti) remove this line when rkt will use cgroup namespace
// If the kernel has the cgroup namespace enabled, systemd v232 will use it by default.
// This was introduced by https://github.com/systemd/systemd/pull/3809 and it will cause
// problems in rkt when cgns is enabled and cgroup-v1 is used. For more information see
// https://github.com/systemd/systemd/pull/3589#discussion_r70277625.
// The following line tells systemd-nspawn not to use cgroup namespace using the environment variable
// introduced by https://github.com/systemd/systemd/pull/3809.
env = append(env, "SYSTEMD_NSPAWN_USE_CGNS=no")
if p.InsecureOptions.DisablePaths {
env = append(env, "SYSTEMD_NSPAWN_API_VFS_WRITABLE=yes")
}
if len(p.PrivateUsers) > 0 {
args = append(args, "--private-users="+p.PrivateUsers) // pass through the user-namespace mapping
}
nsargs, err := stage1initcommon.PodToNspawnArgs(p)
if err != nil {
return nil, nil, errwrap.Wrap(errors.New("failed to generate nspawn args"), err)
}
args = append(args, nsargs...)
// Arguments to systemd
args = append(args, "--")
args = append(args, "--default-standard-output=tty") // redirect all service logs straight to tty
if !debug {
args = append(args, "--log-target=null") // silence systemd output inside pod
args = append(args, "--show-status=0") // silence systemd initialization status output
}
return args, env, nil // the fully assembled launch args and env
}
From this flow, stage1's main job is to write the corresponding configuration to disk inside an isolated environment and start it, by default via systemd and systemd-nspawn.
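The manifest-to-unit conversion can be approximated like this; the unit fields are a simplified sketch of the kind of service file stage1 writes per app, not rkt's exact output:

```go
package main

import "fmt"

// appUnit renders a minimal systemd service unit for one pod app,
// approximating the manifest-to-unit conversion stage1 performs.
// Field values here are illustrative.
func appUnit(name, exec string) string {
	return fmt.Sprintf(`[Unit]
Description=Application=%s

[Service]
ExecStart=%s
Restart=no
`, name, exec)
}

func main() {
	// stage1 writes one such unit per app listed in the pod manifest.
	fmt.Print(appUnit("app1", "/opt/stage2/app1/rootfs/bin/app1"))
}
```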
stage2
Finally, the real application executables are started, running inside the environment that has already been prepared.
Through layer upon layer of file organization, and by leaning on systemd and systemd-nspawn, rkt manages containers in a file-based way, with different stage1 flavors offering different isolation modes; in the simplest mode apps can run using only basic Linux features, a flow found in the stage1_fly code.
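That minimal fly-style flow can be sketched as chroot-plus-exec; the paths are illustrative, and the real stage1_fly additionally prepares mounts, uid shifting, and the environment (chroot itself requires root):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// runInChroot sketches the fly flavor: no systemd, no nspawn, just
// chroot(2) into the app rootfs and exec the binary.
func runInChroot(rootfs, bin string) error {
	if err := syscall.Chroot(rootfs); err != nil {
		return err
	}
	if err := os.Chdir("/"); err != nil {
		return err
	}
	// On success this never returns; the app takes over the process.
	return syscall.Exec(bin, []string{bin}, os.Environ())
}

func main() {
	// Illustrative pod path; a real stage1-fly computes it from the pod UUID.
	if err := runInChroot("/var/lib/rkt/pods/run/UUID/stage1/rootfs", "/bin/sh"); err != nil {
		fmt.Fprintln(os.Stderr, "fly-style run failed:", err)
	}
}
```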
Summary
rkt uses the filesystem to manage each running pod, safely decoupling all running containers. Its layered architecture lets it run isolation environments backed by different mechanisms such as kvm and systemd/nspawn, and it also offers the minimal fly environment for running an app. Over years of development, however, as the OCI spec took hold, containerd, cri-o, and others emerged and won out. My knowledge here is limited, so corrections are welcome.