Ceph Troubleshooting Notes, Part 1

1. Problem Description
A Ceph cluster made up of three servers: k8s01, k8s02, and k8s03.
The cluster status is as follows:

[root@k8s01 ops]# ceph -s
  cluster:
    id:     b5f36dec-8faa-4efa-b08d-cbcd8305ae63
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            clock skew detected on mon.k8s03, mon.k8s04
            1 monitors have not enabled msgr2
            Reduced data availability: 81 pgs stale
            Degraded data redundancy: 11395/34185 objects degraded (33.333%), 81 pgs degraded, 81 pgs undersized
 
  services:
    mon: 3 daemons, quorum k8s01,k8s03,k8s04 (age 7h)
    mgr: k8s01(active, since 7h)
    mds: cephfs:1 {0=k8s01=up:active}
    osd: 3 osds: 3 up (since 2m), 3 in (since 9h)
 
  task status:
    scrub status:
        mds.k8s01: idle
 
  data:
    pools:   3 pools, 81 pgs
    objects: 11.39k objects, 1.8 GiB
    usage:   6.1 GiB used, 154 GiB / 160 GiB avail
    pgs:     11395/34185 objects degraded (33.333%)
             81 stale+active+undersized+degraded

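The main problem: CephFS clients cannot mount!
Initial analysis: there is only one MDS, on k8s01, and k8s01 happened to be the node that was rebooted. The reboot took a long time, possibly 30 minutes to an hour, which may have left the data inconsistent. The main question is how to repair the data.
MDS high availability needs to be considered later!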
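
Before planning MDS high availability, it helps to confirm whether any standby MDS exists at all. The following is a minimal sketch, not the procedure used in this incident: `mds_standby_count` is a hypothetical helper name, and it assumes that `ceph mds stat` reports standby daemons as `N up:standby`, which may differ between Ceph releases.

```shell
# Hypothetical helper: print the number of standby MDS daemons.
# Assumes `ceph mds stat` output looks like:
#   cephfs:1 {0=k8s01=up:active} 1 up:standby
mds_standby_count() {
  ceph mds stat 2>/dev/null \
    | grep -oE '[0-9]+ up:standby' \
    | awk '{ sum += $1 } END { print sum + 0 }'
}

# Usage (on a node with an admin keyring):
#   mds_standby_count
# Once an additional MDS daemon has been deployed on another node, the number
# of standbys the file system expects can be set explicitly:
#   ceph fs set cephfs standby_count_wanted 1
```

A result of 0 on this cluster would match the `ceph -s` output above, where k8s01 is the only MDS listed.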

2. Troubleshooting Approach
(1) 81 PGs reported as stale
The `ceph -s` output shows:
Reduced data availability: 81 pgs stale
The PG state is: pgs stale

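Stuck PGs

After a failure, it is normal for PGs to enter states such as "degraded" or "peering". These states usually mean that normal failure recovery is in progress. However, if a PG stays in one of these states for a long time, it points to a larger problem. For this reason, the monitor raises a warning when PGs get stuck in a non-optimal state. Specifically, it checks for:

inactive: the PG has not been active for too long (that is, it cannot serve reads and writes);
unclean: the PG has not been clean for too long (for example, it failed to recover completely from an earlier failure);
stale: the PG status has not been updated by a ceph-osd, which indicates that all nodes storing this PG may be down.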
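
The three stuck states above can be queried directly with `ceph pg dump_stuck`. The following is a minimal sketch that counts the PGs in each state; `count_stuck_pgs` is a hypothetical helper name, and the parsing assumes that data rows begin with a PG id such as `1.0` or `2.1f`.

```shell
# Hypothetical helper: count PGs stuck in a given state
# (inactive, unclean, or stale), as listed by `ceph pg dump_stuck`.
count_stuck_pgs() {
  local state="$1"
  # Data rows start with a PG id such as "1.0" or "2.1f"; the header is skipped.
  ceph pg dump_stuck "$state" 2>/dev/null \
    | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { n++ } END { print n + 0 }'
}

# Usage:
#   for s in inactive unclean stale; do
#     echo "$s: $(count_stuck_pgs "$s")"
#   done
#   ceph health detail   # lists the affected PGs in more detail
```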

Solution:
The description above suggests that restarting the OSD services should restore the PG state.
Methods tried:

systemctl restart ceph-osd.target
systemctl restart ceph-osd@1.service

The services were restarted on each of the three cluster nodes in turn, but the state did not change!

[root@k8s03 ~]#  systemctl restart ceph-osd.target
[root@k8s03 ~]# ceph pg stat  # hangs; the PG status cannot be queried
^CInterrupted
[root@k8s03 ~]# systemctl restart ceph-osd@1.service
[root@k8s03 ~]# ceph -s 
  cluster:
    id:     b5f36dec-8faa-4efa-b08d-cbcd8305ae63
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            clock skew detected on mon.k8s03, mon.k8s04
            1 monitors have not enabled msgr2
            1 osds down
            1 host (1 osds) down
            Reduced data availability: 81 pgs stale
            Degraded data redundancy: 11395/34185 objects degraded (33.333%), 81 pgs degraded, 81 pgs undersized
 
  services:
    mon: 3 daemons, quorum k8s01,k8s03,k8s04 (age 7h)
    mgr: k8s01(active, since 7h)
    mds: cephfs:1 {0=k8s01=up:active}
    osd: 3 osds: 2 up (since 1.90458s), 3 in (since 9h)
 
  task status:
    scrub status:
        mds.k8s01: idle
 
  data:
    pools:   3 pools, 81 pgs
    objects: 11.39k objects, 1.8 GiB
    usage:   6.1 GiB used, 154 GiB / 160 GiB avail
    pgs:     11395/34185 objects degraded (33.333%)
             81 stale+active+undersized+degraded
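
The output above reports `1 osds down` right after the restart. A minimal sketch for checking whether every OSD has come back up is shown below; `all_osds_up` is a hypothetical helper name, and the parsing assumes the `ceph osd stat` line format shown in the comment.

```shell
# Hypothetical helper: succeed only if every OSD is reported up.
# Assumes `ceph osd stat` prints something like:
#   3 osds: 2 up (since 2s), 3 in (since 9h); epoch: e120
all_osds_up() {
  local line total up
  line="$(ceph osd stat 2>/dev/null)"
  total="$(printf '%s\n' "$line" | sed -nE 's/^([0-9]+) osds:.*/\1/p')"
  up="$(printf '%s\n' "$line" | sed -nE 's/^[0-9]+ osds: ([0-9]+) up.*/\1/p')"
  [ -n "$total" ] && [ "$total" = "$up" ]
}

# Usage:
#   all_osds_up || ceph osd tree            # shows which OSD and host are down
#   systemctl status ceph-osd@1.service     # unit name taken from the restart above
#   journalctl -u ceph-osd@1.service -n 50  # recent log lines for that OSD
```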

In the end, rebooting the server restored the cluster state.
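
To confirm recovery after such a reboot without reading the full `ceph -s` output, the PG summary line can be checked. This is a minimal sketch under stated assumptions: `all_pgs_clean` is a hypothetical helper name, and it assumes `ceph pg stat` prints a line of the form `81 pgs: 81 active+clean; ...`.

```shell
# Hypothetical helper: succeed only if all PGs are active+clean.
# Assumes `ceph pg stat` prints something like:
#   81 pgs: 81 active+clean; 1.7 GiB data, 8.9 GiB used, 231 GiB / 240 GiB avail
all_pgs_clean() {
  local line total clean
  line="$(ceph pg stat 2>/dev/null)"
  total="$(printf '%s\n' "$line" | sed -nE 's/^([0-9]+) pgs:.*/\1/p')"
  clean="$(printf '%s\n' "$line" \
    | sed -nE 's/.*[:,] ([0-9]+) active\+clean[;,].*/\1/p')"
  [ -n "$total" ] && [ "$total" = "$clean" ]
}

# Usage: poll until the cluster has settled.
#   until all_pgs_clean; do sleep 10; done
```

PGs in a combined state such as `active+clean+scrubbing` are deliberately not counted, so the check can report "not clean" during a scrub.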

[root@k8s04 ~]# ceph -s
  cluster:
    id:     b5f36dec-8faa-4efa-b08d-cbcd8305ae63
    health: HEALTH_WARN
            mon k8s01 is low on available space
            1 monitors have not enabled msgr2
 
  services:
    mon: 3 daemons, quorum k8s01,k8s03,k8s04 (age 8m)
    mgr: k8s01(active, since 7h)
    mds: cephfs:1 {0=k8s01=up:active}
    osd: 3 osds: 3 up (since 8m), 3 in (since 10h)
 
  task status:
    scrub status:
        mds.k8s01: idle
 
  data:
    pools:   3 pools, 81 pgs
    objects: 11.42k objects, 1.7 GiB
    usage:   8.9 GiB used, 231 GiB / 240 GiB avail
    pgs:     81 active+clean
 
  io:
    client:   5.9 KiB/s wr, 0 op/s rd, 0 op/s wr

Reference: https://www.jianshu.com/p/9d740d025034
Reference: https://blog.csdn.net/pansaky/article/details/86700301
