Monthly Archives: February 2015

sample table in Hive

When you want to test some “real” stuff, it can be useful to have a Hive table with some “big” data in it.

Let's build a table based on Wikipedia traffic stats. It will contain page counts for every page since 2008, which also lets us play with partitions.

These stats are available here, so let's grab those files first.
That's a lot of files; here we're only getting the first days of 2008/01, each file being between 20 and 30 MB.

I installed axel, a download accelerator, available in the EPEL repo or directly as an RPM:

[vagrant@gw ~]$ sudo rpm -ivh http://pkgs.repoforge.org/axel/axel-2.4-1.el6.rf.x86_64.rpm
[vagrant@gw pagecounts]$ for YEAR in {2008..2008}; do for MONTH in {01..01}; do for DAY in {01..08}; do for HOUR in {00..23}; do axel -a http://dumps.wikimedia.org/other/pagecounts-raw/${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz; done; done; done; done

These files contain lines like:

fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624
fr.b Special:Recherche/Acteurs_et_actrices_N 1 739
fr.b Special:Recherche/Agrippa_d/%27Aubign%C3%A9 1 743

fr.b is the project name (French books), followed by the page title, the number of requests and finally the size of the content returned.

Let's copy that data into HDFS:

[vagrant@gw pagecounts]$ for DAY in {01..08}; do
sudo -u hdfs hadoop fs -mkdir -p /apps/hive/warehouse/pagecounts/dt=2008-01-${DAY};
sudo -u hdfs hadoop fs -put pagecounts-200801${DAY}-* /apps/hive/warehouse/pagecounts/dt=2008-01-${DAY};
done
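A quick listing should confirm the layout, one directory per day (just a sanity check on the paths created above):

[vagrant@gw pagecounts]$ sudo -u hdfs hadoop fs -ls /apps/hive/warehouse/pagecounts/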

Now we can create the Hive table

hive> CREATE EXTERNAL TABLE pagecounts (project String, title String, requests Int, size Int) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Then you can add the partitions manually:
hive> ALTER TABLE pagecounts ADD IF NOT EXISTS PARTITION(dt="2008-01-01") LOCATION "/apps/hive/warehouse/pagecounts/dt=2008-01-01";

or let Hive do it for you automagically:

hive> MSCK REPAIR TABLE pagecounts;
OK
Partitions not in metastore: pagecounts:dt=2008-01-02 pagecounts:dt=2008-01-03 pagecounts:dt=2008-01-04 pagecounts:dt=2008-01-05 pagecounts:dt=2008-01-06 pagecounts:dt=2008-01-07 pagecounts:dt=2008-01-08 pagecounts:dt=2008-01-09 pagecounts:dt=2008-01-10
Repair: Added partition to metastore pagecounts:dt=2008-01-02
Repair: Added partition to metastore pagecounts:dt=2008-01-03
Repair: Added partition to metastore pagecounts:dt=2008-01-04
Repair: Added partition to metastore pagecounts:dt=2008-01-05
Repair: Added partition to metastore pagecounts:dt=2008-01-06
Repair: Added partition to metastore pagecounts:dt=2008-01-07
Repair: Added partition to metastore pagecounts:dt=2008-01-08
Repair: Added partition to metastore pagecounts:dt=2008-01-09
Repair: Added partition to metastore pagecounts:dt=2008-01-10
Time taken: 0.762 seconds, Fetched: 9 row(s)
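Now you can run a first query against the table, for example summing the requests per partition (a simple sketch using the columns defined above; expect a full MapReduce job on this volume):

hive> SELECT dt, SUM(requests) AS total_requests FROM pagecounts GROUP BY dt;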

 


enabling Application Timeline Server on a kerberized cluster

When you enable security on your HDP cluster, the wizard removes the ATS (Application Timeline Server), which is useful for following YARN application history.

Re-enabling it is not very difficult :

1. Installation through the Ambari API

[vagrant@gw ~]# curl -u admin:admin -H "X-Requested-By:ambari" -i -X POST http://gw.example.com:8080/api/v1/clusters/hdp-cluster/hosts/gw.example.com/host_components/APP_TIMELINE_SERVER

HTTP/1.1 201 Created

2. Component activation

[vagrant@gw ~]# curl -u admin:admin -H "X-Requested-By:ambari" -i -X PUT -d '{"RequestInfo": {"context": "Install AppTimelineServer via REST","query":"HostRoles/component_name.in('APP_TIMELINE_SERVER')"}, "Body":{"HostRoles": {"state": "INSTALLED"}}}' http://gw.example.com:8080/api/v1/clusters/hdp-cluster/hosts/gw.example.com/host_components

HTTP/1.1 202 Accepted
Set-Cookie: AMBARISESSIONID=1dkx65ox0k0vx1vigxw7r0i4aw;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/plain
Content-Length: 170
Server: Jetty(7.6.7.v20120910)

{
"href" : "http://gw.example.com:8080/api/v1/clusters/hdp-cluster/requests/260",
"Requests" : {
"id" : 260,
"status" : "InProgress"
}
}
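You can follow the progress of the installation by polling the href returned above (the request id, 260 here, comes from that response):

[vagrant@gw ~]# curl -u admin:admin -H "X-Requested-By:ambari" http://gw.example.com:8080/api/v1/clusters/hdp-cluster/requests/260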

3. Create the ATS service principal

[vagrant@gw ~]# ipa service-add ats/gw.example.com@EXAMPLE.COM

4. Retrieve the keytab and set its permissions

[vagrant@gw ~]# ipa-getkeytab -s nn.example.com -p ats/gw.example.com@EXAMPLE.COM -k /etc/security/keytabs/ats.service.keytab

Keytab successfully retrieved and stored in: /etc/security/keytabs/ats.service.keytab
[vagrant@gw ~]# chown yarn:hadoop /etc/security/keytabs/ats.service.keytab && chmod 400 /etc/security/keytabs/ats.service.keytab
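You can verify the content of the keytab with klist:

[vagrant@gw ~]# klist -kt /etc/security/keytabs/ats.service.keytab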

5. Add these parameters to the YARN configuration

yarn.timeline-service.principal = ats/_HOST@EXAMPLE.COM
yarn.timeline-service.keytab = /etc/security/keytabs/ats.service.keytab
yarn.timeline-service.http-authentication.type = kerberos
yarn.timeline-service.http-authentication.kerberos.principal = HTTP/_HOST@EXAMPLE.COM
yarn.timeline-service.http-authentication.kerberos.keytab = /etc/security/keytabs/spnego.service.keytab

6. Restart YARN + ATS

The UI is enabled by default on port 8188 : http://gw.example.com:8188/applicationhistory
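You can also check that the timeline REST API answers through SPNEGO, since the cluster is kerberized (a quick sketch, assuming you already have a valid Kerberos ticket):

[vagrant@gw ~]$ curl -i --negotiate -u : http://gw.example.com:8188/ws/v1/timeline/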


containers, memory and sizing

In the Hadoop world, and especially in the YARN sub-world, it can be tricky to understand how to get the memory settings right.

Based on this blogpost, here are the main concepts to understand :

– YARN is based on containers

– containers can be MAP containers or REDUCE containers

– total memory for YARN on each node : yarn.nodemanager.resource.memory-mb=40960

– min memory used by a container : yarn.scheduler.minimum-allocation-mb=2048

The combination of the above gives us the maximum number of containers: total memory / memory per container = number of containers.

– memory used by a (map|reduce) container : mapreduce.(map|reduce).memory.mb=4096

each (map|reduce) container spawns a JVM for its task, so we limit the heap size of that JVM:

mapreduce.(map|reduce).java.opts=-Xmx3072m

The last parameter is yarn.nodemanager.vmem-pmem-ratio (default 2.1) which is the ratio between physical memory and virtual memory.

Here, for a map task, we'll have:

  • 4 GB RAM allocated
  • 3 GB JVM heap space upper limit
  • 4 × 2.1 = 8.4 GB virtual memory upper limit

 

The maximum number of containers is computed per container type (map or reduce): here we'll have 40 GB / 4 GB = 10 mappers per node.
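A quick shell arithmetic check of the numbers above (just a sketch with the values from this example):

[vagrant@gw ~]$ echo $(( 40960 / 4096 ))        # containers per node
10
[vagrant@gw ~]$ echo "scale=1; 4096 * 2.1 / 1024" | bc   # vmem upper limit in GB
8.4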

source


Hive tips & tricks

In the Hive CLI, to see the column headers when executing a query:

hive> select * from sample_07 limit 2;
OK
00-0000 All Occupations 134354250 40690
11-0000 Management occupations 6003930 96150
Time taken: 1.313 seconds, Fetched: 2 row(s)
hive> set hive.cli.print.header=true;
hive> select * from sample_07 limit 2;
OK
sample_07.code sample_07.description sample_07.total_emp sample_07.salary
00-0000 All Occupations 134354250 40690
11-0000 Management occupations 6003930 96150
Time taken: 1.199 seconds, Fetched: 2 row(s)

To include the current database in the Hive prompt:

hive> set hive.cli.print.current.db=true;
hive (default)>
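If you want those settings every time, you can put them in your ~/.hiverc, which the Hive CLI reads at startup (a small convenience, not mandatory):

[vagrant@gw ~]$ cat ~/.hiverc
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;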

 

When you want to merge small files in a Hive partition, you have two simple solutions:

* from Hive 0.14, use CONCATENATE

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;

* or you can use the ARCHIVE command to merge them into a HAR (Hadoop ARchive) file

hive> set hive.archive.enabled=true;
hive> set har.partfile.size=1099511627776;
hive> ALTER TABLE tabname ARCHIVE PARTITION (dt_partition='2015-03-10');
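If you later need the partition back as plain files, the reverse operation exists too:

hive> ALTER TABLE tabname UNARCHIVE PARTITION (dt_partition='2015-03-10');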


lock tables in Hive

To enable locking in Hive, you must first enable the lock manager by setting these two parameters in hive-site.xml:

<property>
<name>hive.zookeeper.quorum</name>
<value>sandbox.hortonworks.com:2181</value>
<description>The list of zookeeper servers to talk to.
This is only needed for read/write locks.</description>
</property>

<property>
<name>hive.support.concurrency</name>
<value>true</value>
<description>Whether Hive supports concurrency or not.
A Zookeeper instance must be up and running for the default
Hive lock manager to support read-write locks.</description>
</property>

After restarting Hive, here is how to use it:

hive> lock table my_table exclusive;
OK

You can check whether there's a lock on a table:

hive> show locks my_table;
OK
default@my_table EXCLUSIVE
Time taken: 0.952 seconds, Fetched: 1 row(s)

When trying to access the table while the lock is held:

hive> select count(*) from my_table;
conflicting lock present for default@my_table mode SHARED
conflicting lock present for default@my_table mode SHARED
...

and you can release the lock with

hive> unlock table my_table;
OK
Time taken: 1.126 seconds

then the pending query will finally be executed. By default, Hive retries acquiring the lock every 60 seconds (hive.lock.sleep.between.retries): lock attempts (hive.lock.numretries) are retried 100 times, unlocks (hive.unlock.numretries) 10 times.
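Those retry settings can be adjusted for the current session if the defaults don't suit you (the values below are arbitrary, just for illustration):

hive> set hive.lock.numretries=20;
hive> set hive.unlock.numretries=5;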


sandbox : AppTimelineServer webUI

When using the sandbox, you can run into a port forwarding issue.

For example, if you try to access the AppTimelineServer web UI on port 8188, you'll get a “Connection refused”.

This is weird since the process is listening :

[root@sandbox ~]# nc -nz 127.0.0.1 8188
Connection to 127.0.0.1 8188 port [tcp/*] succeeded!

You'll have to add this port to the VirtualBox port forwarding configuration, since it isn't forwarded by default:

[Screenshots: VirtualBox network settings for the Hortonworks Sandbox with HDP 2.2, adding the port forwarding rule for port 8188]
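Instead of the GUI, you can also add the rule from the command line with VBoxManage while the VM is running (the VM name below is an assumption, check yours with VBoxManage list vms):

VBoxManage controlvm "Hortonworks Sandbox with HDP 2.2" natpf1 "ats,tcp,127.0.0.1,8188,,8188"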


Create a sample table

Let’s quickly add a new table to be able to play with Hive.

We'll use the /etc/passwd file from our Linux system as a Hive table:

[vagrant@gw ~]$ hadoop fs -put /etc/passwd /tmp
[vagrant@gw ~]$ sudo su hdfs
[hdfs@gw vagrant]$ hive
hive> CREATE TABLE passwd (
user STRING,dummy STRING,uid INT,gid INT,name STRING,home STRING,shell STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ':'
STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/tmp/passwd' OVERWRITE INTO TABLE passwd;
hive> select * from passwd;
OK
root x 0 0 root /root /bin/bash
bin x 1 1 bin /bin /sbin/nologin
daemon x 2 2 daemon /sbin /sbin/nologin
adm x 3 4 adm /var/adm /sbin/nologin
lp x 4 7 lp /var/spool/lpd /sbin/nologin
...
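You can now play with it, for example counting accounts per shell (a trivial sketch using the columns defined above):

hive> SELECT shell, COUNT(*) AS nb FROM passwd GROUP BY shell;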

Voilà !


custom Nagios alert in Ambari

The exercise here is to write a very simple Nagios plugin and integrate it into the Ambari web UI.

We’ll check if the cluster is in safe mode or not, and put that alert into Ambari.

First let's write the plugin. In the same directory you'll find all the scripts used by Ambari, which you can duplicate and adapt.

[vagrant@gw ~]$ sudo vi /usr/lib64/nagios/plugins/check_safemode.sh
#!/bin/bash
# ask the NameNode for the current safe mode status
ret=$(hadoop dfsadmin -safemode get)
# "Safe mode is OFF" -> everything is fine
if [[ $ret == *OFF ]]; then
  echo "OK: $ret"
  exit 0
fi
# anything else (safe mode ON) -> raise an alert
echo "KO: $ret"
exit 1

Notice that you have to echo something before every exit in the plugin, otherwise Nagios will complain about missing plugin output.
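You can also run the plugin by hand to check its output and exit code before wiring it into Nagios (remember it doesn't handle Kerberos, so a valid ticket may be needed):

[vagrant@gw ~]$ bash /usr/lib64/nagios/plugins/check_safemode.sh ; echo "exit code: $?"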

Now define the command to execute the plugin :

[vagrant@gw ~]$ sudo vi /etc/nagios/objects/hadoop-commands.cfg

...
define command{
command_name check_safemode
command_line $USER1$/check_wrapper.sh $USER1$/check_safemode.sh -H $HOSTADDRESS$
}

Get the hostgroup name (in /etc/nagios/objects/hadoop-hostgroups.cfg) in which the plugin will be executed, for example nagios-server (a single server is enough, since it's a cluster-wide HDFS check!)

In /etc/nagios/objects/hadoop-servicegroups.cfg, get the servicegroup the alert will be attached to.
Here, we'll put this alert in the HDFS group.

Now the alert entry :

[vagrant@gw ~]$ sudo vi /etc/nagios/objects/hadoop-services.cfg
...
# NAGIOS SERVER HDFS Checks
...
define service {
hostgroup_name nagios-server
use hadoop-service
service_description HDFS::Is Cluster in Safe Mode
servicegroups HDFS
check_command check_safemode
normal_check_interval 2
retry_check_interval 1
max_check_attempts 1
}

Notice that normal_check_interval is the number of minutes between two checks.

Then restart Nagios :

[vagrant@gw ~]$ sudo service nagios restart

The alert will appear in Ambari:
[Screenshot: the safe mode alert shown as OK in Ambari]

To test, let's put the cluster in safe mode:

[vagrant@gw ~]$ sudo su hdfs
[hdfs@gw vagrant]$ kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
[hdfs@gw vagrant]$ hadoop dfsadmin -safemode enter
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Safe mode is ON

In about a minute, you'll see that the alert is raised:

[Screenshot: the safe mode alert raised in Ambari]

Then you can leave safe mode to get back to OK!

[hdfs@gw vagrant]$ hadoop dfsadmin -safemode leave
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Safe mode is OFF

Note that this is just for demonstration purposes: the plugin doesn't implement Kerberos authentication, for example, unlike the check_nodemanager_health plugin.

You may also note that Nagios writes its status to the /var/nagios/status.dat file, which is collected and read by Ambari to display this information.

Adapted from Hortonworks documentation

 


upload a file with WebHDFS

By default, WebHDFS is enabled on your cluster, allowing you to perform any HDFS operation through this REST API.

If you want to upload a file to HDFS, this has to be done in 2 steps :
1. create the resource

[hdfs@gw vagrant]$ curl -i --negotiate -u : -X PUT "http://nn.example.com:50070/webhdfs/v1/tmp/testfile?op=CREATE&overwrite=true"
HTTP/1.1 401 Authentication required
Date: Fri, 13 Feb 2015 11:29:54 GMT
Pragma: no-cache
Date: Fri, 13 Feb 2015 11:29:54 GMT
Pragma: no-cache
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=/; Expires=Thu, 01-Jan-1970 00:00:00 GMT; HttpOnly
Content-Length: 0
Server: Jetty(6.1.26)
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Fri, 13 Feb 2015 11:29:54 GMT
Date: Fri, 13 Feb 2015 11:29:54 GMT
Pragma: no-cache
Expires: Fri, 13 Feb 2015 11:29:54 GMT
Date: Fri, 13 Feb 2015 11:29:54 GMT
Pragma: no-cache
Content-Type: application/octet-stream
Set-Cookie: hadoop.auth=u=hdfs&p=hdfs@EXAMPLE.COM&t=kerberos&e=1423862994233&s=+gAWB/1q0QOKjK9Wf6W4Bl2B6BY=; Path=/; Expires=Fri, 13-Feb-2015 21:29:54 GMT; HttpOnly
Location: http://gw.example.com:1022/webhdfs/v1/tmp/testfile?op=CREATE&delegation=HAAEaGRmcwRoZGZzAIoBS4KzxDuKAUumwEg7CQgUs7isYeQ5F6u4cV-oSig--MQFgU8SV0VCSERGUyBkZWxlZ2F0aW9uDzI0MC4wLjAuMTE6ODAyMA&namenoderpcaddress=mycluster&overwrite=true
Content-Length: 0
Server: Jetty(6.1.26)

2. upload the file to that resource
Notice that the previous response contains a Location header pointing to the datanode where the resource will be created.
Now we upload our file to that URL.

[hdfs@gw vagrant]$ curl -i -X PUT -T MY_LOCAL_FILE "http://gw.example.com:1022/webhdfs/v1/tmp/testfile?op=CREATE&delegation=HAAEaGRmcwRoZGZzAIoBS4KzxDuKAUumwEg7CQgUs7isYeQ5F6u4cV-oSig--MQFgU8SV0VCSERGUyBkZWxlZ2F0aW9uDzI0MC4wLjAuMTE6ODAyMA&namenoderpcaddress=mycluster&overwrite=true"
HTTP/1.1 100 Continue
HTTP/1.1 201 Created
Cache-Control: no-cache
Expires: Fri, 13 Feb 2015 11:30:20 GMT
Date: Fri, 13 Feb 2015 11:30:20 GMT
Pragma: no-cache
Expires: Fri, 13 Feb 2015 11:30:20 GMT
Date: Fri, 13 Feb 2015 11:30:20 GMT
Pragma: no-cache
Content-Type: application/octet-stream
Location: webhdfs://mycluster/tmp/testfile
Content-Length: 0
Server: Jetty(6.1.26)

Test :

[hdfs@gw vagrant]$ hdfs dfs -cat /tmp/testfile
This is a test
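You can also check the result through WebHDFS itself, with the GETFILESTATUS operation:

[hdfs@gw vagrant]$ curl -i --negotiate -u : "http://nn.example.com:50070/webhdfs/v1/tmp/testfile?op=GETFILESTATUS"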

more info on the Hadoop WebHDFS page


My Ambari Notes

When running the Enable NameNode HA wizard, I encountered a weird exception at the Install JournalNodes step.

Looking in ambari-server.log, I noticed this exception:

07:57:09,617 ERROR [qtp1251571107-6336] AbstractResourceProvider:244 - Caught AmbariException when creating a resource
org.apache.ambari.server.ServiceComponentNotFoundException: ServiceComponent not found, clusterName=hdp-cluster, serviceName=HDFS, serviceComponentName=JOURNALNODE


So the solution was to “manually” install the ServiceComponent with the Ambari API :

[vagrant@gw ~]$ curl -u admin:admin -H "X-Requested-By:ambari" -i -X POST http://gw.example.com:8080/api/v1/clusters/hdp-cluster/services/HDFS/components/JOURNALNODE

HTTP/1.1 201 Created
Set-Cookie: AMBARISESSIONID=1frjfvu8yjb5fylf4w5m449c7;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/plain
Content-Length: 0
Server: Jetty(7.6.7.v20120910)
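A GET on the same URL then confirms that the component is known to Ambari, and the HA wizard can be resumed:

[vagrant@gw ~]$ curl -u admin:admin -H "X-Requested-By:ambari" http://gw.example.com:8080/api/v1/clusters/hdp-cluster/services/HDFS/components/JOURNALNODE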
