Hadoop log compression on the fly with log4j

Hadoop logs are as verbose and useful as they are heavy. Because of that last point, you may want to compress them on the fly to keep your /var/log partition below its warning thresholds.

Thanks to log4j, you can achieve that in two ways:

1. use the log4j-extras package

2. use the log4j2 package, which supports compression out of the box

Here I'll use the first option, applying it to Hive logging:

  • Download the log4j-extras package
  • put the jar in a lib directory: either the "global" Hadoop lib, or, as here, just for Hive, in /usr/hdp/2.2.4.2-2/hive/lib/
  • now adjust the log4j properties to use rolling.RollingFileAppender instead of DRFA (Daily Rolling File Appender), either through Ambari (for this example, in Advanced hive-log4j of the Hive service configs) or directly in Hive's log4j.properties:
hive.root.logger=INFO,request

log4j.appender.request=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.request.File=${hive.log.dir}/${hive.log.file}
log4j.appender.request.RollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.request.RollingPolicy.ActiveFileName=${hive.log.dir}/${hive.log.file}
log4j.appender.request.RollingPolicy.FileNamePattern=${hive.log.dir}/${hive.log.file}.%d{yyyyMMdd}.log.gz
log4j.appender.request.layout = org.apache.log4j.PatternLayout
log4j.appender.request.layout.ConversionPattern=%d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n

Remember to comment out or delete the existing DRFA appender lines.

Restart the affected components, and you'll have daily (yyyyMMdd) log rolling with gzip compression.
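After a rollover or two you should see gzipped files next to the active log. A rough sketch of what to expect, assuming hive.log.dir is /var/log/hive and hive.log.file is hive.log (the usual HDP values); your paths may differ:

$ ls /var/log/hive/
hive.log
hive.log.20150309.log.gz
hive.log.20150310.log.gz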

 


Get Hadoop metrics with JMX

JMX is widely used in the Java world to expose metrics and KPIs.

In Hadoop, the most commonly used JMX endpoint is the NameNode's:

http://NAMENODE_FQDN:50070/jmx

This gives an overview of everything accessible through JMX.

You can get a list of all the metrics provided by JMX:

http://sandbox.hortonworks.com:50070/jmx?qry=Hadoop:*

Finally (and this is very useful if you're looking for something specific), you can get a single metric by specifying the bean name and the attribute:

http://NAMENODE_FQDN:50070/jmx?get=MXBeanName::AttributeName

For example, say we want to get the CapacityUsed attribute.

Then we’ll call JMX with MXBeanName = Hadoop:service=NameNode,name=FSNamesystemState and AttributeName set to CapacityUsed

So we call http://sandbox.hortonworks.com:50070/jmx?get=Hadoop:service=NameNode,name=FSNamesystemState::CapacityUsed
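If you want to script around this, note that the JMX servlet returns JSON with a top-level beans array, so something like the following works from the command line (jq is just one option for parsing the JSON, and the sandbox hostname is only an example):

$ curl -s "http://sandbox.hortonworks.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState" | jq '.beans[0].CapacityUsed'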


Hadoop HDFS commands

Leaving safe mode:

$ bin/hadoop dfsadmin -safemode leave
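As a side note, hdfs dfsadmin is the non-deprecated form of the same command, and -safemode get lets you check the current state; it simply prints whether safe mode is ON or OFF:

$ hdfs dfsadmin -safemode get
Safe mode is OFF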

 

Failover NameNode:

To see which NameNode is which, look at the NameNode HTTP addresses in the HDFS configuration:

dfs.namenode.http-address.mycluster.nn1=nn.example.com:50070
dfs.namenode.http-address.mycluster.nn2=dn1.example.com:50070

So nn.example.com is nn1

[root@nn ~]# hdfs haadmin -getServiceState nn1
standby

Force a transition of a NameNode to Active:

[root@nn ~]# hdfs haadmin -transitionToActive --forcemanual nn1
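You can then check that the transition actually happened (output shown assuming it succeeded):

[root@nn ~]# hdfs haadmin -getServiceState nn1
active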

 

Your virtual cluster on your machine

Playing with the HDP sandbox is cool: it's a quick way to see how things work, try some Hive queries, and so on.

At some point, you may want to build your own virtual cluster on your machine: several virtual machines forming a "real" cluster, to get something closer to reality.

Of course you can start from scratch, but there's a quicker and easier way: playing with structor.

Structor is a project hosted by Hortonworks, based on Vagrant (basically a tool that drives VMs on VirtualBox or VMware, handling OS images for you), and there are a lot of forks extending its functionality.
I based my setup on Andrew Miller's fork, with some minor modifications.

OK, let's start by getting structor:

~$ mkdir Repository && cd Repository
Repository~ $ git clone https://github.com/laurentedel/structor.git

This will create a structor directory in which you'll basically find a Vagrantfile containing everything needed to build your cluster!
You just have to specify an HDP version, an Ambari version and a profile name to build the corresponding cluster.

To build my 3-node cluster, I just type:

structor~ $ ./ambari-cluster hdp-2.1.3 ambari-1.6.1 3node-full
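Once it's finished, you can check which VMs were created and whether they're running with plain Vagrant commands (nothing structor-specific here):

structor~ $ vagrant status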

In the ./profiles directory you'll find some predefined profiles: 1node-min, 1node-full, 3node-min and 3node-full. Feel free to create and submit other profiles, but remember that your VMs will need some memory to be usable! I allocate 2 GB for each VM, so a 3-node cluster will take 6 GB (7 GB with a proxy VM).

I added some slight modifications so that the VirtualBox VMs are named after their configuration, which lets you keep multiple ambari-clusters in VirtualBox.

Things you have to know:

1. add the hosts to /etc/hosts on your Linux or Mac machine, or the equivalent on Windows (see the doc on my GitHub)

2. the private key is stored in each VM

3. to SSH into a machine, cd to your structor directory and run vagrant ssh HOSTNAME

structor~ $ vagrant ssh gw
Loading profile /Users/ledel/Repository/structor/profiles/3node.profile
Last login: Tue Mar 10 12:12:38 2015 from 10.0.2.2
Welcome to your Vagrant-built virtual machine.

[vagrant@gw ~]$

4. to suspend the cluster, just run:

structor~ $ vagrant suspend

 

and to wake the cluster up:

structor~ $ vagrant up

5. to copy files from/to your local machine, note that the Vagrant project folder (your structor directory) is shared with each VM at /vagrant, for example:
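A quick sketch (data.csv is just a placeholder file name; gw is the gateway VM from the 3node profile):

structor~ $ cp ~/data.csv .
structor~ $ vagrant ssh gw
[vagrant@gw ~]$ ls /vagrant/data.csv
/vagrant/data.csv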

Enjoy!


Hadoop CLI tips & tricks

Here are some Hadoop CLI tips & tricks.

To manually switch between the Active and Standby NameNodes, you have to take the ServiceIds into consideration; by default they are nn1 and nn2.

If nn1 is the Active and nn2 the Standby NameNode, switch nn2 to Active with

[vagrant@gw ~]$ sudo -u hdfs hdfs haadmin -failover nn1 nn2
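If you're not sure which ServiceIds your cluster uses, you can read them from the configuration; the nameservice name (mycluster below) is just an assumption, replace it with your own:

[vagrant@gw ~]$ hdfs getconf -confKey dfs.ha.namenodes.mycluster
nn1,nn2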

 


Containers, memory and sizing

In the Hadoop world, and especially in the YARN sub-world, it can be tricky to understand how to get memory settings right.

Based on this blogpost, here are the main concepts to understand:

– YARN is based on containers

– containers can be MAP containers or REDUCE containers

– total memory for YARN on each node: yarn.nodemanager.resource.memory-mb=40960

– min memory used by a container: yarn.scheduler.minimum-allocation-mb=2048

The combination of the two settings above gives the maximum number of containers: total memory / minimum container memory (40960 / 2048 = 20 here).

– memory used by a (map|reduce) container: mapreduce.(map|reduce).memory.mb=4096

Each (map|reduce) container will spawn a JVM for its map or reduce task, so we limit the heap size of that JVM:

mapreduce.(map|reduce).java.opts=-Xmx3072m

The last parameter is yarn.nodemanager.vmem-pmem-ratio (default 2.1), which is the maximum ratio of virtual memory to physical memory allowed for a container.

Here, for a Map task, we'll have:

  • 4 GB RAM allocated
  • 3 GB JVM heap space upper limit
  • 4 * 2.1 = 8.4 GB virtual memory upper limit

 

The maximum number of containers you get depends on whether you count Map or Reduce containers! So here we'll have 40 / 4 = 10 mappers per node.
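To recap the arithmetic with the example values above (a back-of-the-envelope sketch, not a recommendation):

max containers by minimum allocation = 40960 / 2048 = 20
max map containers per node          = 40960 / 4096 = 10
map JVM heap (-Xmx3072m)             ≈ 75% of the 4 GB map container
virtual memory limit per map task    = 4 GB * 2.1 = 8.4 GB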

source