
Background

I've recently been working on an object storage service based on the Rook project. In short, it uses the Kubernetes operator provided by Rook to deploy Ceph on top of a K8s cluster.

Deployment

Clone the Rook project locally and work from the cluster/examples/kubernetes/ceph directory inside it.

1. Deploy the CRDs

kubectl apply -f common.yaml

2. Deploy the operator

kubectl apply -f operator.yaml

3. Deploy the Ceph cluster

The contents of cluster.yaml need to be edited to match your own hardware (a sketch of the section you typically adjust follows the command below). Once edited:

kubectl apply -f cluster.yaml
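
For reference, a minimal sketch of the storage section of cluster.yaml that usually needs adapting. The node names here match the osd-prepare pods shown later in this article, but the device names are hypothetical; alternatively, set useAllNodes/useAllDevices to true to consume every empty disk on every node:

# excerpt of the CephCluster spec in cluster.yaml -- adjust to your hardware
storage:
  useAllNodes: false
  useAllDevices: false
  nodes:
  - name: "ai02"        # must match the Kubernetes node name
    devices:
    - name: "sdb"       # hypothetical raw disk dedicated to Ceph on this node
    - name: "sdc"
  - name: "ai03"
    devices:
    - name: "sdb"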

4. Create the object store

kubectl apply -f object.yaml

# ... and an object store user
kubectl apply -f object-user.yaml
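
As a rough sketch, object.yaml and object-user.yaml declare resources along the following lines (the names, pool sizes and gateway settings here are illustrative rather than copied from the repo):

apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPool:
    replicated:
      size: 3
  gateway:
    type: s3
    port: 80
    instances: 1
---
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: my-user
  namespace: rook-ceph
spec:
  store: my-store       # references the CephObjectStore above
  displayName: "my-user"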

5. Create a Ceph pool (block storage pool)

The kind used here is CephBlockPool; a sketch follows the command below.

kubectl apply -f pool.yaml
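
A minimal sketch of pool.yaml (the pool name and replica size are whatever you choose):

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host    # spread replicas across different hosts
  replicated:
    size: 3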

6. Create the file storage (CephFS)

For this step the host kernel version must be >= 5.3 (see the end of this article for details):

kubectl apply -f filesystem.yaml
kubectl apply -f storageclass.yaml
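
filesystem.yaml declares a CephFilesystem roughly like the sketch below; the name myfs and the derived data pool myfs-data0 are what the StorageClass shown later in this article references:

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
  - replicated:
      size: 3
  metadataServer:
    activeCount: 1       # number of active MDS daemons
    activeStandby: true  # keep a hot standby MDS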

7. Create PVCs

After that you can create whatever PVCs you need.
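
A minimal PVC sketch, assuming the csi-cephfs StorageClass created in step 6 (the claim name is hypothetical):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gemfield-pvc
spec:
  accessModes:
  - ReadWriteMany              # CephFS volumes can be shared by multiple Pods
  storageClassName: csi-cephfs
  resources:
    requests:
      storage: 10Gi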

Getting started

1. How to tell whether the Ceph installation succeeded

First check the pods. You should see operator, mgr, agent, discover, mon, osd, and tools pods, with the osd-prepare pods in Completed state and everything else Running:

gemfield@master:~/rook-ceph/rook/cluster/examples/kubernetes/ceph$ kubectl -n rook-ceph get pods
NAME                                           READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-4f2wl                         3/3     Running     0          3h47m
csi-cephfsplugin-5k6z8                         3/3     Running     0          3h47m
csi-cephfsplugin-d5rhh                         3/3     Running     0          3h47m
csi-cephfsplugin-provisioner-c6b8b6dff-ds56t   4/4     Running     0          3h47m
csi-cephfsplugin-provisioner-c6b8b6dff-zjx6x   4/4     Running     0          3h47m
csi-cephfsplugin-sw98v                         3/3     Running     0          3h47m
csi-cephfsplugin-vsgwq                         3/3     Running     0          3h47m
csi-rbdplugin-f8cdm                            3/3     Running     0          3h47m
csi-rbdplugin-g7xcf                            3/3     Running     0          3h47m
csi-rbdplugin-kvr4m                            3/3     Running     0          3h47m
csi-rbdplugin-lc2lp                            3/3     Running     0          3h47m
csi-rbdplugin-provisioner-68f5664455-4pjm2     5/5     Running     0          3h47m
csi-rbdplugin-provisioner-68f5664455-p9n9r     5/5     Running     0          3h47m
csi-rbdplugin-zsdk9                            3/3     Running     0          3h47m
rook-ceph-agent-6dczj                          1/1     Running     0          3h47m
rook-ceph-agent-rvvtr                          1/1     Running     0          3h47m
rook-ceph-agent-t79hz                          1/1     Running     0          3h47m
rook-ceph-agent-w65x2                          1/1     Running     0          3h47m
rook-ceph-agent-wbxv4                          1/1     Running     0          3h47m
rook-ceph-mgr-a-77b8674677-zph5s               1/1     Running     0          3h44m
rook-ceph-mon-a-578bcdc486-4nfxn               1/1     Running     0          3h45m
rook-ceph-mon-b-7c5fdd6794-29frc               1/1     Running     0          3h45m
rook-ceph-mon-c-7d79cf845c-tbdbf               1/1     Running     0          3h45m
rook-ceph-operator-6dbdbc6b94-dsd6z            1/1     Running     0          3h48m
rook-ceph-osd-0-7dd859cc77-42gj9               1/1     Running     0          3h43m
rook-ceph-osd-1-7b48479788-8vq4m               1/1     Running     0          3h43m
rook-ceph-osd-2-d95cb9479-fpbhd                1/1     Running     0          3h42m
rook-ceph-osd-3-779699c4fd-82zrl               1/1     Running     0          3h43m
rook-ceph-osd-4-5c87768ddf-wcsw4               1/1     Running     0          3h43m
rook-ceph-osd-5-689d78db84-t28qk               1/1     Running     0          3h42m
rook-ceph-osd-prepare-ai02-r4lg9               0/1     Completed   0          3h44m
rook-ceph-osd-prepare-ai03-vx6nj               0/1     Completed   0          3h44m
rook-ceph-osd-prepare-ai04-h69zf               0/1     Completed   0          3h44m
rook-ceph-tools-7cf4cc7568-6lcgv               1/1     Running     0          6m23s
rook-discover-flxm2                            1/1     Running     0          3h48m
rook-discover-h7j2f                            1/1     Running     0          3h48m
rook-discover-nwsm5                            1/1     Running     0          3h48m
rook-discover-t6fl7                            1/1     Running     0          3h48m
rook-discover-t6xgz                            1/1     Running     0          3h48m

Next, check ceph status from the toolbox pod; the health field should be HEALTH_OK (in the output below it is HEALTH_WARN only because mon b is low on available space):

gemfield@master:~/rook-ceph/rook/cluster/examples/kubernetes/ceph$ kubectl -n rook-ceph exec -it rook-ceph-tools-7cf4cc7568-6lcgv bash
[root@AI03 /]# ceph status
  cluster:
    id:     0cf36812-094c-4920-84aa-0c0099b01fe0
    health: HEALTH_WARN
            mon b is low on available space
 
  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: a(active, since 3h)
    osd: 6 osds: 6 up (since 3h), 6 in (since 3h)
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 21 TiB / 21 TiB avail
    pgs:

After that, create the Block, Object, and Shared File System resources, and finally the Dashboard.

2. How to restart the Dashboard?

After enabling certain features in the Dashboard, you usually have to restart the Dashboard service before they can be used. Restart it with the following commands:

# run inside the toolbox container
[root@AI03 /]# ceph mgr module disable dashboard
[root@AI03 /]# ceph mgr module enable dashboard

Fundamentals

1. You need the osd, mon, mds, rbd, rgw, and CephFS services

The last three provide the block, object/gateway, and file (shared file system) services respectively.

2. Use Rook for Kubernetes-based deployments

In other scenarios you can use the official ceph-deploy tool or a Docker-based deployment.

3. One OSD per disk

4. The Ceph Dashboard is quite usable

However, viewing performance data in the Dashboard requires installing and enabling Prometheus and Grafana.

5. Run performance tests immediately after deployment

Confirm basic health first, then put application workloads on top. On my spinning disks, a 30-OSD setup measured a write speed of roughly 70 MB/s.
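
A quick way to get such a baseline is rados bench from the toolbox pod (the pool name gemfield-pool is a placeholder; use one of your own pools):

# run inside the toolbox container
rados bench -p gemfield-pool 10 write --no-cleanup   # 10-second write benchmark, keep the objects
rados bench -p gemfield-pool 10 seq                  # sequential read benchmark against the objects just written
rados -p gemfield-pool cleanup                       # remove the benchmark objects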

6. Check whether any Ceph hosts or OSDs are down

Sometimes this is caused by disk device names changing on the host.
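
From the toolbox, a couple of standard commands make this easy to spot:

# run inside the toolbox container
ceph health detail     # explains any warnings, including down OSDs
ceph osd tree          # shows each OSD's up/down status and the host it belongs to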

7. Ceph pools (multi-tenancy)

The pool is the unit of granularity; if you don't create or specify one, data lands in the default pool. Creating a pool requires choosing the number of PGs, roughly 100 PGs per OSD as a rule of thumb, or following the rules below (an example of creating a pool by hand follows the list):

Fewer than 5 OSDs: set pg_num to 128.
5 to 10 OSDs: set pg_num to 512.
10 to 50 OSDs: set pg_num to 4096.
More than 50 OSDs: use pgcalc to work it out.
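
For instance, creating a replicated pool by hand from the toolbox (the pool name gemfield-pool and the PG count 512 are illustrative; in a Rook cluster you would more often declare the pool through a CRD, as in pool.yaml above):

# run inside the toolbox container
ceph osd pool create gemfield-pool 512 512 replicated
ceph osd pool application enable gemfield-pool rbd    # tag the pool with the application that will use it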

A pool also needs a CRUSH rule, which defines how its data is distributed across the cluster.

In addition, you can adjust a pool's replica count, delete a pool, set pool quotas, rename a pool, and view pool status information.
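
Sketched as toolbox commands (pool names are placeholders):

# run inside the toolbox container
ceph osd pool set gemfield-pool size 2                           # change the replica count
ceph osd pool set-quota gemfield-pool max_bytes 1099511627776    # quota of 1 TiB
ceph osd pool rename gemfield-pool gemfield-pool-new             # rename the pool
ceph osd pool ls detail                                          # status and settings of all pools
ceph osd pool delete gemfield-pool-new gemfield-pool-new --yes-i-really-really-mean-it   # delete (mon_allow_pool_delete must be enabled)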

8. Ceph backup & restore

Pools support snapshots.
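
A sketch of pool snapshot commands from the toolbox (pool and snapshot names are placeholders):

# run inside the toolbox container
ceph osd pool mksnap gemfield-pool snap-1    # create a pool snapshot
rados -p gemfield-pool lssnap                # list the pool's snapshots
ceph osd pool rmsnap gemfield-pool snap-1    # remove the snapshot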

9. How do you increase Ceph's write speed?

As more and more services come online, the data that has to be written per second will creep up from tens of MB towards 1 GB, 10 GB... how should the cluster scale?

10. Ceph pools come in two flavors

replicated and erasure; the default is replicated. Replicated means keeping multiple copies of the data (the official recommendation is 3): the upside is better data safety and fuller feature support, the downside is higher disk consumption and cost. Erasure coding is roughly comparable to RAID 5: it saves space, but supports a more limited set of operations.

11. How many replicas?

The official recommendation is 3, but in my view 2 is also acceptable; replica=2 saves space and is faster.

With 25 OSDs, each backed by a 4 TB disk, the usable pool capacity works out to:

Raw size: 25*4 = 100TB
replica=2 : 100/2 = 50TB
replica=3 : 100/3 = 33.33TB

12. Ceph can provide NFS support

This relies on NFS-Ganesha, which has to be enabled in Ceph.
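
With Rook this is typically exposed through a CephNFS resource; the sketch below is an assumption of roughly what that manifest looks like (names and counts are illustrative, and the RADOS pool reuses myfs-data0 from the CephFS setup above):

apiVersion: ceph.rook.io/v1
kind: CephNFS
metadata:
  name: my-nfs
  namespace: rook-ceph
spec:
  rados:
    pool: myfs-data0     # pool that stores the NFS-Ganesha configuration objects
    namespace: nfs-ns
  server:
    active: 1            # number of active NFS-Ganesha server instances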

Advanced issues

1. CephFS is painfully slow

Why use CephFS in the first place? Because a CephFS volume can be mounted and shared by multiple Pods. With CephFS backing the StorageClass and a K8s PVC mounting the storage into a Pod directory, run a dd write test inside the mounted directory:

root@media-76d54565f5-dhc2w:/app/gemfield# dd if=/dev/zero of=gemfieldtest bs=500M count=1             
1+0 records in
1+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 14.1455 s, 37.1 MB/s

The speed is only 37.1 MB/s. Below are the PVC and StorageClass currently in use on this Ceph cluster:

gemfield@deepvac:~$ kubectl get pvc
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
media            Bound    pvc-4828fa4e-de02-40a7-a972-64fef09f411c   18627Gi    RWX            csi-cephfs        24d

gemfield@deepvac:~$ kubectl get sc
NAME              PROVISIONER                     AGE
csi-cephfs        rook-ceph.cephfs.csi.ceph.com   30d

gemfield@deepvac:~$ kubectl describe sc csi-cephfs
Name:            csi-cephfs
IsDefaultClass:  No
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"csi-cephfs"},"mountOptions":null,"parameters":{"clusterID":"rook-ceph","csi.storage.k8s.io/node-stage-secret-name":"rook-ceph-csi","csi.storage.k8s.io/node-stage-secret-namespace":"rook-ceph","csi.storage.k8s.io/provisioner-secret-name":"rook-ceph-csi","csi.storage.k8s.io/provisioner-secret-namespace":"rook-ceph","fsName":"myfs","pool":"myfs-data0"},"provisioner":"rook-ceph.cephfs.csi.ceph.com","reclaimPolicy":"Retain"}

Provisioner:           rook-ceph.cephfs.csi.ceph.com
Parameters:            clusterID=rook-ceph,csi.storage.k8s.io/node-stage-secret-name=rook-ceph-csi,csi.storage.k8s.io/node-stage-secret-namespace=rook-ceph,csi.storage.k8s.io/provisioner-secret-name=rook-ceph-csi,csi.storage.k8s.io/provisioner-secret-namespace=rook-ceph,fsName=myfs,pool=myfs-data0
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Retain
VolumeBindingMode:     Immediate
Events:                <none>

gemfield@deepvac:~$ mount | grep kubelet | grep pvc | awk '{print $1}'
ceph-fuse
ceph-fuse
ceph-fuse
ceph-fuse
ceph-fuse
ceph-fuse

The cause of the slowness is described in https://github.com/rook/rook/issues/3315 and https://github.com/rook/rook/issues/3619. In short:

1. If the Linux kernel version is < 4.17, only the ceph-fuse client can be used, and the only way to keep disk IO reasonable is to use flexVolume:

  volumes:
  - name: image-store
    flexVolume:
      driver: ceph.rook.io/rook
      fsType: ceph
      options:
        fsName: myfs # name of the filesystem specified in the filesystem CRD.
        clusterNamespace: rook-ceph # namespace where the Rook cluster is deployed
        # by default the path is /, but you can override and mount a specific path of the filesystem by using the path attribute
        # the path must exist on the filesystem, otherwise mounting the filesystem at that path will fail
        # path: /some/path/inside/cephfs

2. If the Linux kernel version is >= 4.17, the kernel client can be used and disk IO improves dramatically (on Ubuntu 18.04 LTS the kernel can be upgraded with the following command):

sudo apt install linux-generic-hwe-18.04

After upgrading the kernel, change https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/csi/cephfs/storageclass.yaml#L31 to:

mounter: kernel
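
For context, this mounter parameter sits alongside the other StorageClass parameters; the sketch below mirrors the parameters already shown in the kubectl describe sc output above:

parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-data0
  csi.storage.k8s.io/provisioner-secret-name: rook-ceph-csi
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-ceph-csi
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  mounter: kernel    # use the CephFS kernel client instead of ceph-fuse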

Recreate the SC and PVC, mount them into the Pod, and run the dd test again:

root@media-759f9dd87-bbhhw:/app/gemfield# dd if=/dev/zero of=deleteme bs=10G count=1
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB, 2.0 GiB) copied, 2.24164 s, 958 MB/s


root@media-759f9dd87-bbhhw:/app/gemfield# time cp deleteme deleteme2
real    0m22.985s
user    0m0.005s
sys     0m2.725s

As you can see, the speed improved dramatically.

2. Ceph crashes some time after a successful installation

If some hosts on a subnet can reach the public internet and others cannot, their clocks drift apart and time synchronization problems appear. Installing an NTP service fixes this:

1. Install the NTP service on a machine that can reach the public internet

sudo apt install ntp

2. Edit the NTP service configuration file (/etc/ntp.conf)

# /etc/ntp.conf, configuration for ntpd; see ntp.conf(5) for help

driftfile /var/lib/ntp/ntp.drift

# Leap seconds definition provided by tzdata
leapfile /usr/share/zoneinfo/leap-seconds.list

# Enable this if you want statistics to be logged.
#statsdir /var/log/ntpstats/

statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable

# Specify one or more NTP servers.
server 0.cn.pool.ntp.org
server 1.cn.pool.ntp.org
server 2.cn.pool.ntp.org
server 3.cn.pool.ntp.org

# Use servers from the NTP Pool Project. Approved by Ubuntu Technical Board
# on 2011-02-08 (LP: #104525). See http://www.pool.ntp.org/join.html for
# more information.
pool 0.cn.pool.ntp.org iburst
pool 1.cn.pool.ntp.org iburst
pool 2.cn.pool.ntp.org iburst
pool 3.cn.pool.ntp.org iburst

# Use Ubuntu's ntp server as a fallback.
pool ntp.ubuntu.com

# Access control configuration; see /usr/share/doc/ntp-doc/html/accopt.html for
# details.  The web page <http://support.ntp.org/bin/view/Support/AccessRestrictions>
# might also be helpful.
#
# Note that "restrict" applies to both servers and clients, so a configuration
# that might be intended to block requests from certain clients could also end
# up blocking replies from your own upstream servers.

# By default, exchange time with everybody, but don't allow configuration.
restrict -4 default kod notrap nomodify nopeer noquery limited
restrict -6 default kod notrap nomodify nopeer noquery limited

# Local users may interrogate the ntp server more closely.
restrict 127.0.0.1
restrict ::1

# Needed for adding pool entries
restrict source notrap nomodify noquery

# Clients from this (example!) subnet have unlimited access, but only if
# cryptographically authenticated.
#restrict 192.168.1.0 mask 255.255.255.0 trust


# If you want to provide time to your local subnet, change the next line.
# (Again, the address is an example only.)
broadcast 192.168.1.255

# If you want to listen to time broadcasts on your local subnet, de-comment the
# next lines.  Please do this only if you trust everybody on the network!
#disable auth
#broadcastclient

#Changes recquired to use pps synchonisation as explained in documentation:
#http://www.ntp.org/ntpfaq/NTP-s-config-adv.htm#AEN3918

#server 127.127.8.1 mode 135 prefer    # Meinberg GPS167 with PPS
#fudge 127.127.8.1 time1 0.0042        # relative to PPS for my hardware

#server 127.127.22.1                   # ATOM(PPS)
#fudge 127.127.22.1 flag3 1            # enable PPS API

3. Configure time synchronization on the other hosts

The file to edit is timesyncd.conf, which belongs to systemd.

sudo vi /etc/systemd/timesyncd.conf

# change it to the following; the IP is that of the host configured as the NTP server above
[Time]
NTP=192.168.1.93

4. Apply the time synchronization configuration

sudo systemctl restart systemd-timesyncd.service 
sudo systemctl status systemd-timesyncd.service