Fixing Ceph PGs stuck in degraded + undersized

Running ceph -s shows that a lot of PGs are stuck and not making any progress:

cluster 1a1d374a-c6e9-48cb-9b45-525a6fdaa91e
 health HEALTH_WARN
        64 pgs degraded
        64 pgs stale
        64 pgs stuck degraded
        64 pgs stuck stale
        64 pgs stuck unclean
        64 pgs stuck undersized
        64 pgs undersized
 monmap e1: 1 mons at {twin-storage-01=172.16.91.1:6789/0}
        election epoch 2, quorum 0 twin-storage-01
 mdsmap e5: 1/1/1 up {0=twin-storage-01=up:active}
 osdmap e92: 7 osds: 7 up, 7 in
  pgmap v685: 832 pgs, 7 pools, 43573 kB data, 38 objects
        7491 MB used, 14889 GB / 14896 GB avail
             769 active+clean
              37 stale+active+undersized+degraded+remapped
              26 stale+active+undersized+degraded

Many PGs are stuck in the degraded + undersized state.

Running ceph health detail gives a bit more information:

pg 0.2c is stuck stale for 59834.916633, current state stale+active+undersized+degraded, last acting [0]
pg 0.2b is stuck stale for 59834.916635, current state stale+active+undersized+degraded, last acting [0]
pg 0.2a is stuck stale for 59834.916637, current state stale+active+undersized+degraded+remapped, last acting [0]
pg 0.29 is stuck stale for 59834.916638, current state stale+active+undersized+degraded, last acting [0]
pg 0.28 is stuck stale for 59834.916640, current state stale+active+undersized+degraded, last acting [0]
pg 0.27 is stuck stale for 59834.916642, current state stale+active+undersized+degraded+remapped, last acting [0]
pg 0.26 is stuck stale for 59834.916644, current state stale+active+undersized+degraded+remapped, last acting [0]
pg 0.25 is stuck stale for 59834.916645, current state stale+active+undersized+degraded+remapped, last acting [0]
pg 0.24 is stuck stale for 59834.916647, current state stale+active+undersized+degraded+remapped, last acting [0]

I wanted to use ceph pg <pgid> query to inspect one of these PGs, but it returned an error:

$ ceph pg 0.24 query
Error ENOENT: i don't have pgid 0.24

My guess is that the problem goes back to when I built this Ceph cluster: at one point I removed and rebuilt every OSD. Perhaps the data was not wiped cleanly during the removal, or the removal procedure itself was wrong, and so on.
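
For reference, the usual OSD removal steps look roughly like the sketch below. The OSD id 3 is just a placeholder, and how the daemon is stopped depends on your init system:

# Hypothetical example: cleanly removing osd.3 from the cluster.
ceph osd out 3                     # stop mapping new data to this OSD
sudo service ceph stop osd.3       # or: sudo systemctl stop ceph-osd@3
ceph osd crush remove osd.3        # remove it from the CRUSH map
ceph auth del osd.3                # delete its cephx key
ceph osd rm 3                      # finally remove it from the osdmap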


Solution

The fix is to use ceph pg force_create_pg <pgid> to overwrite the problematic PG:

$ ceph pg force_create_pg 0.24
pg 0.24 now creating, ok

The PG then switches to creating. After a while, once creation has finished, the PG can be queried again:

$ ceph pg 0.24 query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "epoch": 94,
    "up": [
        2,
        4
    ],

    ... (omitted) ...

    "agent_state": {}
}
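
Instead of re-running the query by hand, a small polling loop can wait for the recreated PG to settle. This is just a sketch; the grep pattern assumes the JSON "state" field shown above:

pgid=0.24
# Poll until the recreated PG reports active+clean in its query output.
until ceph pg "$pgid" query 2>/dev/null | grep -q '"state": "active+clean"'; do
  echo "pg $pgid not clean yet, waiting ..."
  sleep 10
done
echo "pg $pgid is active+clean"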

At the same time, ceph -s shows one fewer problematic PG:

cluster 1a1d374a-c6e9-48cb-9b45-525a6fdaa91e
 health HEALTH_WARN
        63 pgs degraded
        63 pgs stale
        63 pgs stuck degraded
        63 pgs stuck stale
        63 pgs stuck unclean
        63 pgs stuck undersized
        63 pgs undersized
 monmap e1: 1 mons at {twin-storage-01=172.16.91.1:6789/0}
        election epoch 2, quorum 0 twin-storage-01
 mdsmap e5: 1/1/1 up {0=twin-storage-01=up:active}
 osdmap e92: 7 osds: 7 up, 7 in
  pgmap v685: 832 pgs, 7 pools, 43573 kB data, 38 objects
        7491 MB used, 14889 GB / 14896 GB avail
             769 active+clean
              37 stale+active+undersized+degraded+remapped
              26 stale+active+undersized+degraded

If there are a lot of problematic PGs, you can run a for loop over them:

# Recreate every PG that ceph health detail reports as stale+active+undersized+degraded.
for pg in $(ceph health detail | grep "stale+active+undersized+degraded" | awk '{print $2}' | sort -u);
do
  ceph pg force_create_pg "$pg"
done
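
Alternatively, ceph pg dump_stuck can list the stuck PGs directly instead of grepping ceph health detail. A sketch, assuming the pgid is the first column of its plain-text output:

# dump_stuck prints one stuck PG per line; the first column is assumed to be the pgid.
for pg in $(ceph pg dump_stuck stale 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {print $1}'); do
  ceph pg force_create_pg "$pg"
done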

Firing off the commands this quickly in a for loop may leave the cluster in a state like this:

cluster 1a1d374a-c6e9-48cb-9b45-525a6fdaa91e
 health HEALTH_WARN
        63 pgs stuck inactive
        63 pgs stuck unclean
 monmap e1: 1 mons at {twin-storage-01=172.16.91.1:6789/0}
        election epoch 2, quorum 0 twin-storage-01
 mdsmap e5: 1/1/1 up {0=twin-storage-01=up:active}
 osdmap e92: 7 osds: 7 up, 7 in
  pgmap v892: 832 pgs, 7 pools, 45412 kB data, 42 objects
        7496 MB used, 14889 GB / 14896 GB avail
             769 active+clean
              63 creating

and ceph health detail shows:

pg 0.31 is stuck inactive since forever, current state creating, last acting []

Don't panic at this point; just leave the cluster alone and give it some time to finish.
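
To watch the remaining PGs settle without re-running ceph -s by hand, a simple wait loop works. A minimal sketch, assuming "creating" only appears in the ceph -s output while PGs are still being created:

# Wait until no PGs are reported as creating any more, then print the final status.
while ceph -s | grep -q creating; do
  echo "PGs still creating, waiting ..."
  sleep 30
done
ceph -s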