Kill zombie dead RegionServers

I had a RegionServer, dn24, marked as dead in the HBase UI, even though that machine had been decommissioned and removed from the cluster months ago.

After some digging, it turned out the server was still listed because HBase still considered it “active”, and the reason was in HDFS:

[root@machine ~]# hdfs dfs -ls /apps/hbase/data/WALs/

drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn17.test.fr,60020,1446939183416
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn18.test.fr,60020,1446939179122
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn19.test.fr,60020,1446939182213
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn20.test.fr,60020,1446939182925
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn21.test.fr,60020,1446939185744
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn22.test.fr,60020,1446939173931
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn24.test.fr,60020,1409665198801-splitting
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn25.test.fr,60020,1446939185856
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn26.test.fr,60020,1446939178831
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn27.test.fr,60020,1446939183921
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn28.test.fr,60020,1446939179838
drwxrwx--- - hbase hdfs 0 2015-11-08 00:33 /apps/hbase/data/WALs/dn29.test.fr,60020,1446939178499

 

Spotted it? The WAL (Write-Ahead Log) directory for dn24 was still in HDFS in the “splitting” state, so from HBase's perspective the server was not completely gone.
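On a cluster with many RegionServers, the leftover directory is easy to miss by eye. Here is a minimal sketch of a filter for the listing above, assuming the path is the last field of each `hdfs dfs -ls` line; the function name `find_splitting_wals` is made up for illustration:

```shell
# Hypothetical helper: read `hdfs dfs -ls` output on stdin and print only
# the WAL directories whose name ends in "-splitting".
# The path is assumed to be the last whitespace-separated field.
find_splitting_wals() {
    awk '$NF ~ /-splitting$/ { print $NF }'
}

# Usage against a live cluster (WAL root taken from the listing above):
#   hdfs dfs -ls /apps/hbase/data/WALs/ | find_splitting_wals
```

A “-splitting” directory is not necessarily stale: the Master may be splitting that log right now. Compare the timestamp in the directory name with the servers you know are gone before touching anything.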

I removed the dn24 WAL directory from HDFS and restarted the HBase Master (restarting the Master causes no downtime for HBase), and the dead RegionServer entry went away.
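For reference, the removal step could look roughly like the sketch below. The helper `remove_stale_wal` and its `DRY_RUN` switch are hypothetical, added here as a safety net; the only real command involved is `hdfs dfs -rm -r`. By default it only prints what it would run:

```shell
# Hypothetical helper: remove one stale "-splitting" WAL directory.
# Prints the command by default; set DRY_RUN=0 to actually run it.
remove_stale_wal() {
    wal_dir="$1"
    # Refuse anything that is not a "-splitting" directory, so the WALs
    # of live RegionServers cannot be deleted by mistake.
    case "$wal_dir" in
        *-splitting) ;;
        *) echo "refusing: $wal_dir does not end in -splitting" >&2
           return 1 ;;
    esac
    if [ "${DRY_RUN:-1}" = "0" ]; then
        hdfs dfs -rm -r "$wal_dir"
    else
        echo "hdfs dfs -rm -r $wal_dir"
    fi
}

# Usage (run as a user allowed to write to the HBase directories):
#   remove_stale_wal '/apps/hbase/data/WALs/dn24.test.fr,60020,1409665198801-splitting'
#   DRY_RUN=0 remove_stale_wal '/apps/hbase/data/WALs/dn24.test.fr,60020,1409665198801-splitting'
```

How to restart the Master depends on how the cluster is managed: through the management console if there is one, or, on a plain install, probably something like `hbase-daemon.sh restart master` on the Master host.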

