
so your HBase is broken

HBase can be a little tricky to understand, especially when it comes to fixing it.

There are two basic ways to fix things in HBase:

hbase hbck

First try to run hbase hbck to see if there are inconsistencies.
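Run without any option, hbck only checks and reports; it doesn't modify anything, so it's safe to try first (here with the hbase service user, as elsewhere in this post):

[root@sandbox ~]# sudo -u hbase hbase hbck

Add -details if you want the full per-region report.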

If so, a simple

[root@sandbox ~]# sudo -u hbase hbase hbck -fix

will most of the time fix things up (region assignments).

There are a lot of options (see hbase hbck -help); useful ones include hbase hbck -repair (which bundles a whole set of repair options) and hbase hbck -fixTableLocks, for fixing tables that have been locked for a long time.
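As a rough sketch, a repair session could look like this (which options to run depends on what the check reports, so treat the order as a suggestion):

[root@sandbox ~]# sudo -u hbase hbase hbck -details
[root@sandbox ~]# sudo -u hbase hbase hbck -fixAssignments
[root@sandbox ~]# sudo -u hbase hbase hbck -fixTableLocks
[root@sandbox ~]# sudo -u hbase hbase hbck -repair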

Recovering .META

HBase ships with a tool that can help rebuild .META when it has been lost, using only what is left on the filesystem.

To do so:

[hbase@sandbox root]$ hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair -base /hadoop/hbase -details
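Note that OfflineMetaRepair is meant to be run while HBase is down. Once it has finished, a reasonable follow-up (a sketch, on the same sandbox) is to start HBase again and re-run the consistency check:

[root@sandbox ~]# sudo -u hbase hbase hbck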

HBase sample table

Let's create a simple HBase table from scratch!

There are many ways of creating an HBase table and populating it: bulk load, hbase shell, Hive with HBaseStorageHandler (see the sketch just below), etc.
Here we're going to use the ImportTsv class, which parses a .tsv file and inserts it into an existing HBase table.
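As an aside, the Hive option mentioned above roughly looks like the following. This is only a sketch: the Hive table and column names are made up, it maps onto the access_demo table we create below, and it assumes the HBase handler jars are available to Hive.

[root@sandbox ~]# hive
hive> CREATE EXTERNAL TABLE access_demo_hive (rowkey STRING, request_date STRING, refer_url STRING, http_code STRING)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:date,cf1:refer-url,cf1:http-code')
    > TBLPROPERTIES ('hbase.table.name' = 'access_demo');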

First, let's grab some data!

Download access.tsv to any machine of your cluster: it is a 2 GB gzipped file with sample tab-separated data, containing the columns rowkey, date, refer-url and http-code. Unzip it and put it on HDFS:

[root@sandbox ~]# gunzip access.tsv.gz
[root@sandbox ~]# hdfs dfs -copyFromLocal ./access.tsv /tmp/
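A quick sanity check that the file actually landed where we expect it:

[root@sandbox ~]# hdfs dfs -ls /tmp/access.tsv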

Now we have to create the table in the HBase shell; it will contain only one column family for this example:

[root@sandbox ~]# hbase shell
hbase(main):001:0> create 'access_demo','cf1'
0 row(s) in 14.2610 seconds
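If you want, you can double-check the layout from the same shell:

hbase(main):002:0> describe 'access_demo'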

Then start the import with the ad hoc class, specifying the columns (don't forget HBASE_ROW_KEY, which can be mapped to any of the columns; here it is the first one).
The syntax is hbase JAVA_CLASS -DPARAMETERS TABLE_NAME FILE.

Notice that you can specify the tsv separator, e.g. '-Dimporttsv.separator=,', and that you can of course spread fields across several column families, e.g. cf1:field1,cf1:field2,cf2:field3,cf2:field4 (a concrete variant is sketched after the job output below).

[root@sandbox ~]# hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:date,cf1:refer-url,cf1:http-code access_demo /tmp/access.tsv

2015-05-21 19:55:38,144 INFO [main] mapreduce.Job: Job job_1432235700898_0002 running in uber mode : false
2015-05-21 19:55:38,151 INFO [main] mapreduce.Job: map 0% reduce 0%
2015-05-21 19:56:00,718 INFO [main] mapreduce.Job: map 7% reduce 0%
2015-05-21 19:56:03,742 INFO [main] mapreduce.Job: map 21% reduce 0%
2015-05-21 19:56:06,785 INFO [main] mapreduce.Job: map 65% reduce 0%
2015-05-21 19:56:10,846 INFO [main] mapreduce.Job: map 95% reduce 0%
2015-05-21 19:56:11,855 INFO [main] mapreduce.Job: map 100% reduce 0%
2015-05-21 19:56:13,948 INFO [main] mapreduce.Job: Job job_1432235700898_0002 completed successfully
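Before checking the result, here is what the variant mentioned earlier could look like. A sketch only: cf2 and the field names are made up, cf2 would have to exist on the table, and /tmp/access.csv is a hypothetical comma-separated file.

[root@sandbox ~]# hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.columns=HBASE_ROW_KEY,cf1:field1,cf1:field2,cf2:field3,cf2:field4 access_demo /tmp/access.csv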

Let's check:

[root@sandbox ~]# hbase shell
hbase(main):001:0> list
TABLE
access_demo
iemployee
sales_data
3 row(s) in 9.7180 seconds

=> ["access_demo", "iemployee", "sales_data"]
hbase(main):002:0> scan 'access_demo'
ROW COLUMN+CELL
# rowkey column=cf1:date, timestamp=1432238079103, value=date
# rowkey column=cf1:http-code, timestamp=1432238079103, value=http-code
# rowkey column=cf1:refer-url, timestamp=1432238079103, value=refer-url
74.201.80.25/san-rafael-ca/events/sho column=cf1:date, timestamp=1432238079103, value=2008-01-25 16:20:50
w/80343522-eckhart-tolle
74.201.80.25/san-rafael-ca/events/sho column=cf1:http-code, timestamp=1432238079103, value=200
w/80343522-eckhart-tolle
74.201.80.25/san-rafael-ca/events/sho column=cf1:refer-url, timestamp=1432238079103, value=www.google.com/search
w/80343522-eckhart-tolle
calendar.boston.com/ column=cf1:date, timestamp=1432238079103, value=2008-01-25 19:35:50
calendar.boston.com/ column=cf1:http-code, timestamp=1432238079103, value=200
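Note that the first three cells above are the header line of the tsv, imported as a regular row. For a rough idea of the volume loaded, a simple count from the same shell does the job (it scans the whole table, so it can take a while on big tables):

hbase(main):003:0> count 'access_demo'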

This is it!


get metrics with Ambari API

Ambari exposes component metrics through its REST API. For instance, here is how to get the JVM metrics of the NameNode:

[vagrant@gw ~]$ curl -u admin:admin -X GET http://gw.example.com:8080/api/v1/clusters/hdp-cluster/hosts/nn.example.com/host_components/NAMENODE?fields=metrics/jvm
{
 "href" : "http://gw.example.com:8080/api/v1/clusters/hdp-cluster/hosts/nn.example.com/host_components/NAMENODE?fields=metrics/jvm",
 "HostRoles" : {
 "cluster_name" : "hdp-cluster",
 "component_name" : "NAMENODE",
 "host_name" : "nn.example.com"
 },
 "host" : {
 "href" : "http://gw.example.com:8080/api/v1/clusters/hdp-cluster/hosts/nn.example.com"
 },
 "metrics" : {
 "jvm" : {
 "HeapMemoryMax" : 1052770304,
 "HeapMemoryUsed" : 56104392,
 "NonHeapMemoryMax" : 318767104,
 "NonHeapMemoryUsed" : 49148216,
 "gcCount" : 190,
 "gcTimeMillis" : 4599,
 "logError" : 0,
 "logFatal" : 0,
 "logInfo" : 16574,
 "logWarn" : 2657,
 "memHeapCommittedM" : 1004.0,
 "memHeapUsedM" : 53.473206,
 "memMaxM" : 1004.0,
 "memNonHeapCommittedM" : 133.625,
 "memNonHeapUsedM" : 46.87139,
 "threadsBlocked" : 0,
 "threadsNew" : 0,
 "threadsRunnable" : 7,
 "threadsTerminated" : 0,
 "threadsTimedWaiting" : 54,
 "threadsWaiting" : 7
 }
 }
}

The metrics you may want to watch here are HeapMemoryUsed and HeapMemoryMax.
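To keep an eye on them without pulling the whole JVM block, you can restrict the fields parameter and post-process the JSON; a minimal sketch, assuming the same cluster and credentials as above and that jq is installed:

[vagrant@gw ~]$ curl -s -u admin:admin "http://gw.example.com:8080/api/v1/clusters/hdp-cluster/hosts/nn.example.com/host_components/NAMENODE?fields=metrics/jvm/HeapMemoryUsed,metrics/jvm/HeapMemoryMax" | jq '.metrics.jvm.HeapMemoryUsed / .metrics.jvm.HeapMemoryMax * 100'

That prints the heap usage as a percentage, easy to feed into whatever monitoring you already have.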