
Failures: what I learned

In my HA cluster, the NameNodes failed to start with the following error:

2015-03-16 15:11:44,724 ERROR namenode.EditLogInputStream (EditLogFileInputStream.java:nextOpImpl(198)) - caught exception initializing http://gw.example.com:8480/getJournal?jid=mycluster&segmentTxId=88798&storageInfo=-56%3A567324971%3A0%3ACID-7958b480-2d52-49dc-8d71-e0d14429dbce
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$LogHeaderCorruptException: Unexpected version of the file system log file: -620756992. Current version = -56.
{...}
2015-03-16 15:11:45,057 FATAL namenode.NameNode (NameNode.java:main(1400)) - Exception in namenode join
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0. Expected transaction ID was 88798

at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:193)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:133)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:805)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:665)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:272)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:891)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:638)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:480)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:536)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:695)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:680)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1329)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1395)
Caused by: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException: got premature end-of-file at txid 88618; expected file to go up to 14352384
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:194)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:180)
... 12 more
2015-03-16 15:11:45,066 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2015-03-16 15:11:45,079 INFO namenode.NameNode (StringUtils.java:run(640)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at dn1.example.com/240.0.0.12
************************************************************/
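The failing segment is the one served by a JournalNode (port 8480 is the JournalNode HTTP port). Before recovering anything, it can be worth looking at the raw segment files on each JournalNode; the directory below is only a guess based on my layout, the real one is whatever dfs.journalnode.edits.dir points to:

# on each JournalNode (gw.example.com in the log above), check the segment starting at txid 88798
[root@gw ~]# ls -l /hadoop/hdfs/journal/mycluster/current/ | grep 88798

A zero-length or obviously truncated edits_* / edits_inprogress_* file there is the usual sign that this JournalNode's copy of the segment is damaged.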

The NameNode cannot replay the corrupt edit log segment starting at transaction 88798, so the procedure is to use the NameNode's recover mode:

[root@dn1 ~]# kinit -kt /etc/security/keytabs/nn.service.keytab nn/dn1.example.com

[root@dn1 ~]# /usr/bin/hadoop namenode -recover

{...}
15/03/16 15:33:01 ERROR namenode.MetaRecoveryContext: We failed to read txId 88798
15/03/16 15:33:01 INFO namenode.MetaRecoveryContext:
Enter 'c' to continue, skipping the bad section in the log
Enter 's' to stop reading the edit log here, abandoning any later edits
Enter 'q' to quit without saving
Enter 'a' to always select the first choice in the future without prompting. (c/s/q/a)

c
15/03/16 15:33:13 INFO namenode.MetaRecoveryContext: Continuing
15/03/16 15:33:13 INFO namenode.FSImage: Edits file http://nn.example.com:8480/getJournal?jid=mycluster&segmentTxId=88798&storageInfo=-56%3A567324971%3A0%3ACID-7958b480-2d52-49dc-8d71-e0d14429dbce of size 1048576 edits # 0 loaded in 11 seconds
15/03/16 15:33:13 INFO namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)
15/03/16 15:33:13 INFO namenode.NameCache: initialized with 9 entries 212 lookups
15/03/16 15:33:13 INFO namenode.FSNamesystem: Finished loading FSImage in 18865 msecs
15/03/16 15:33:13 INFO namenode.FSImage: Save namespace ...
15/03/16 15:33:14 INFO namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 88618
15/03/16 15:33:14 INFO namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000000000088386, cpktTxId=0000000000000088386)
15/03/16 15:33:15 INFO namenode.MetaRecoveryContext: RECOVERY COMPLETE
15/03/16 15:33:15 INFO namenode.FSNamesystem: Stopping services started for active state
15/03/16 15:33:15 INFO namenode.FSNamesystem: Stopping services started for standby state
15/03/16 15:33:15 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at dn1.example.com/240.0.0.12
************************************************************/

Then we just have to start the NameNode again (you can expect some missing blocks, though).
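A minimal sketch of that last step, assuming the Kerberos ticket from the earlier kinit is still valid and that hadoop-daemon.sh is how the NameNode is started on this box (Ambari or your init scripts work just as well):

# start the recovered NameNode
[root@dn1 ~]# su -l hdfs -c "hadoop-daemon.sh start namenode"

# then measure the damage left by the skipped edits
[root@dn1 ~]# hdfs dfsadmin -report | grep -i 'missing blocks'
[root@dn1 ~]# hdfs fsck / -list-corruptfileblocks

fsck lists which files, if any, were hit by the edits that got skipped. If the second NameNode still refuses to start once this one is back up and active, running the same recovery there, or re-syncing it with hdfs namenode -bootstrapStandby, are the two usual options, depending on why it failed.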