Failures : what I learned

in my HA cluster, Namenodes failed to start with the following :

2015-03-16 15:11:44,724 ERROR namenode.EditLogInputStream (EditLogFileInputStream.java:nextOpImpl(198)) - caught exception initializing http://gw.example.com:8480/getJournal?jid=mycluster&segmentTxId=88798&storageInfo=-56%3A567324971%3A0%3ACID-7958b480-2d52-49dc-8d71-e0d14429dbce
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$LogHeaderCorruptException: Unexpected version of the file system log file: -620756992. Current version = -56.
{...}
015-03-16 15:11:45,057 FATAL namenode.NameNode (NameNode.java:main(1400)) - Exception in namenode join
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0. Expected transaction ID was 88798

at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:193)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:133)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:805)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:665)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:272)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:891)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:638)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:480)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:536)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:695)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:680)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1329)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1395)
Caused by: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException: got premature end-of-file at txid 88618; expected file to go up to 14352384
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:194)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:180)
... 12 more
2015-03-16 15:11:45,066 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2015-03-16 15:11:45,079 INFO namenode.NameNode (StringUtils.java:run(640)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at dn1.example.com/240.0.0.12
************************************************************/

procedure is to use the recover mode :

[root@dn1 ~]# kinit -kt /etc/security/keytabs/nn.service.keytab nn/dn1.example.com

[root@dn1 ~]# /usr/bin/hadoop namenode -recover

{...}
15/03/16 15:33:01 ERROR namenode.MetaRecoveryContext: We failed to read txId 88798
15/03/16 15:33:01 INFO namenode.MetaRecoveryContext:
Enter 'c' to continue, skipping the bad section in the log
Enter 's' to stop reading the edit log here, abandoning any later edits
Enter 'q' to quit without saving
Enter 'a' to always select the first choice in the future without prompting. (c/s/q/a)

c
15/03/16 15:33:13 INFO namenode.MetaRecoveryContext: Continuing
15/03/16 15:33:13 INFO namenode.FSImage: Edits file http://nn.example.com:8480/getJournal?jid=mycluster&segmentTxId=88798&storageInfo=-56%3A567324971%3A0%3ACID-7958b480-2d52-49dc-8d71-e0d14429dbce of size 1048576 edits # 0 loaded in 11 seconds
15/03/16 15:33:13 INFO namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)
15/03/16 15:33:13 INFO namenode.NameCache: initialized with 9 entries 212 lookups
15/03/16 15:33:13 INFO namenode.FSNamesystem: Finished loading FSImage in 18865 msecs
15/03/16 15:33:13 INFO namenode.FSImage: Save namespace ...
15/03/16 15:33:14 INFO namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 88618
15/03/16 15:33:14 INFO namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000000000088386, cpktTxId=0000000000000088386)
15/03/16 15:33:15 INFO namenode.MetaRecoveryContext: RECOVERY COMPLETE
15/03/16 15:33:15 INFO namenode.FSNamesystem: Stopping services started for active state
15/03/16 15:33:15 INFO namenode.FSNamesystem: Stopping services started for standby state
15/03/16 15:33:15 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at dn1.example.com/240.0.0.12
************************************************************/

Then we just have to start NameNode (you can expect some missing blocks though)


3 Comments

  • Reply Sarang |

    2015-08-31 11:50:45,511 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 11:51:45,526 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 11:52:45,544 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 11:53:45,560 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 11:54:45,576 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 11:55:45,594 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 11:56:45,611 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 11:57:45,627 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 11:58:45,649 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 11:59:45,664 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 12:00:45,679 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’
    2015-08-31 12:01:45,700 WARN org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet: Received an invalid request file transfer request from 10.7.1.73: This node has namespaceId ‘1137443819 and clusterId ‘CID-9a464540-a166-49c2-899d-1bf48a734a32′ but the requesting node expected ‘624146408’ and ‘CID-137f625b-804e-4f90-b29b-fbfec4bb1f6c’

  • Reply Sarang |

    When i disable active namenode by killing pid,standby namenode also shutdowns.How can i switch to standby namenode while the active namenode fails ????

  • Reply administrator |

    This is not a regular behaviour, standbyNameNode should take the lead.

    As I can see in your messages, there’s a problem with clusterID and namespace, as if you put a datanode from cluster B to cluster A ? or something related.

    I’m afraid this shouldn’t be easy to recover : from my point of view, you should kill your stdbyNN and bootstrap activeNN again to get back to a standard configuration; Note that you may loose some HDFS data in the process.

So, what do you think ?

  • Time limit is exhausted. Please reload CAPTCHA.