Monthly Archives: April 2016

Spark on HBase with Spark shell

Some minor adjustments are needed to access HBase tables from a Spark context.

Let’s first quickly create a “t1” HBase sample table with 40 rows:


[root@sandbox ~]# cat hbase_load.txt
create 't1', 'f1'
# 10 x 2 x 2 = 40 row keys of the form i-j-k, each with a random 64-letter value
for i in '1'..'10' do
  for j in '1'..'2' do
    for k in '1'..'2' do
      rnd = (0...64).map { (65 + rand(26)).chr }.join
      put 't1', "#{i}-#{j}-#{k}", "f1:#{j}#{k}", "#{rnd}"
    end
  end
end
[root@sandbox ~]# cat hbase_load.txt |hbase shell
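
As a quick sanity check on the "40 rows" figure, the three nested loops can be replayed in plain shell to enumerate the row keys the loader should produce. This is only a sketch that mirrors the loop bounds of the script above; it does not talk to HBase, and the variable names are made up for the example.

```shell
# Enumerate the same i-j-k combinations as the loader script and count them.
ROW_KEYS=$(for i in $(seq 1 10); do
  for j in 1 2; do
    for k in 1 2; do
      echo "$i-$j-$k"
    done
  done
done)

# tr strips the padding some wc implementations add around the number.
ROW_COUNT=$(printf '%s\n' "$ROW_KEYS" | wc -l | tr -d ' ')
echo "expected rows: $ROW_COUNT"

# On the cluster itself, the real count can be compared with:
#   echo "count 't1'" | hbase shell
```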

You need to adjust your Spark classpath (Guava 14 is needed, so I included the first one I found):

[root@sandbox ~]# export SPARK_CLASSPATH=/usr/hdp/current/spark-client/lib/hbase-common.jar:/usr/hdp/current/spark-client/lib/hbase-client.jar:/usr/hdp/current/spark-client/lib/hbase-protocol.jar:/usr/hdp/current/spark-client/lib/hbase-server.jar:/etc/hbase/conf:/usr/hdp/

[root@sandbox ~]# spark-shell --master yarn-client

As a side note, SPARK_CLASSPATH is deprecated in Spark 1.5.x+, so you should use --driver-class-path instead:
[root@sandbox ~]# spark-shell --master yarn-client --driver-class-path=/usr/hdp/current/spark-client/lib/hbase-common.jar:/usr/hdp/current/spark-client/lib/hbase-client.jar:/usr/hdp/current/spark-client/lib/hbase-protocol.jar:/usr/hdp/current/spark-client/lib/hbase-hadoop2-compat.jar:/usr/hdp/current/spark-client/lib/hbase-server.jar:/etc/hbase/conf:/usr/hdp/

I ran into a bug with this second form ([…]Caused by: java.lang.IllegalStateException: unread block data), so I went back to the first version (using SPARK_CLASSPATH).
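
Whichever form you use, the classpath is one long colon-separated string that is easy to mistype. A minimal sketch of building it from a list of entries follows; join_classpath and the LIB and HBASE_CP variables are hypothetical helpers introduced for this example, and the Guava jar is left out because you need to append whichever Guava 14 jar you have.

```shell
# Hypothetical helper: join all arguments with ':' to form a classpath.
join_classpath() {
  local IFS=':'
  printf '%s\n' "$*"
}

LIB=/usr/hdp/current/spark-client/lib

# Same HBase jars and conf directory as in the export above (Guava jar not included).
HBASE_CP=$(join_classpath \
  "$LIB/hbase-common.jar" \
  "$LIB/hbase-client.jar" \
  "$LIB/hbase-protocol.jar" \
  "$LIB/hbase-server.jar" \
  /etc/hbase/conf)

export SPARK_CLASSPATH="$HBASE_CP"
echo "$SPARK_CLASSPATH"
```

The same variable could be passed to --driver-class-path instead of being exported.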
Now it’s Scala time!

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val tableName = "t1"
// Reads the HBase settings found on the classpath (here /etc/hbase/conf)
val hconf = HBaseConfiguration.create()
hconf.set(TableInputFormat.INPUT_TABLE, tableName)

// Each record is a (row key, Result) pair read through TableInputFormat
val hBaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + hBaseRDD.count())

2016-04-07 18:44:40,553 INFO [main] scheduler.DAGScheduler: Job 0 finished: count at <console>:30, took 2.092481 s
Number of Records found : 40

If you want to use HBaseAdmin to list tables, take snapshots, or run any other admin-related operation, use:
scala> val admin = new HBaseAdmin(hconf)

And if you want to create a table, start from a table descriptor:
val tableDesc = new HTableDescriptor(tableName)

Set date and time in VirtualBox

If you set the date manually in your VirtualBox VM, it is reset to the host’s date and time.

This behaviour is caused by the VirtualBox Guest Additions, so you first need to stop that service:

sudo service vboxadd-service stop

You’ll then be able to change the date:

date --set="8 Apr 2016 18:00:00"
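
Before touching the clock, it can help to check how the date string will be parsed. The sketch below assumes GNU date (the --date option only prints the parsed value and changes nothing); the commented lines recap the full sequence, and restarting vboxadd-service at the end is an assumption about how you would get host synchronisation back afterwards.

```shell
# Preview how GNU date interprets the string; the system clock is left untouched.
PARSED=$(date --date="8 Apr 2016 18:00:00" "+%Y-%m-%d %H:%M:%S")
echo "date would be set to: $PARSED"

# Full sequence, to be run as root inside the VM (commented out on purpose):
#   sudo service vboxadd-service stop
#   sudo date --set="8 Apr 2016 18:00:00"
#   sudo service vboxadd-service start   # assumed way to re-enable host time sync later
```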

Hadoop log compression on the fly with log4j

Hadoop logs are as verbose and useful as they are heavy. Because of that weight, you may want to compress them so that the /var/log partition stays below its warning thresholds.

Thanks to log4j, you can achieve that in two ways:

1. use the log4j-extras package

2. use the log4j2 package, which includes compression (among other things)

Here I’ll use the first option, applied to Hive logging:

  • Download the log4j-extras package
  • Put the jar in the lib directory: either for Hadoop “globally” or, as here, just for Hive, so put it in /usr/hdp/
  • Now adjust the log4j properties to use rolling.RollingFileAppender instead of DRFA (Daily Rolling File Appender), using Ambari (in this example, under Advanced hive-log4j in the Hive service configs) or in Hive

log4j.appender.request.layout = org.apache.log4j.PatternLayout
log4j.appender.request.layout.ConversionPattern=%d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n

Remember to disable the existing DRFA lines by commenting them out or deleting them.
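
The two layout lines above do not show the appender and its rolling policy. As a hedged sketch, the missing part could look like the fragment below, written to a scratch file so it can be reviewed before being pasted into Ambari. It assumes the appender is named request (as in the layout lines), that hive.log.dir and hive.log.file are defined as in the stock Hive log4j properties, and that the log4j-extras TimeBasedRollingPolicy is used; adapt the names to your own configuration.

```shell
# Write a candidate fragment to a temporary file for review.
FRAGMENT=$(mktemp)

# The quoted 'EOF' keeps the shell from expanding the ${...} log4j placeholders.
cat > "$FRAGMENT" <<'EOF'
# Assumed appender name "request", matching the layout lines in this post
log4j.appender.request=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.request.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.request.rollingPolicy.ActiveFileName=${hive.log.dir}/${hive.log.file}
# A pattern ending in .gz asks the policy to compress each rolled file
log4j.appender.request.rollingPolicy.FileNamePattern=${hive.log.dir}/${hive.log.file}.%d{yyyyMMdd}.gz
EOF

cat "$FRAGMENT"
```

The yyyyMMdd date pattern is what makes the roll-over daily, and the .gz suffix is what triggers compression.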

Restart the components, and your logs are now rolled and zipped on a daily basis (yyyyMMdd).