
accessing Hive tables with curl and WebHCat

For quick and easy access, you can consider using WebHCat, a REST interface to HCatalog and, through it, Hive.

Let’s assume we’re in a kerberized cluster (you cannot be in an unkerberized cluster, remember…)

First, we check which port is used (the default is 50111) in webhcat-site.xml, or in the Hive configuration within the Ambari interface.

[Screenshot: the templeton (WebHCat) port in webhcat-site.xml]

templeton is the former name for WebHCat.
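
If you prefer the command line, a quick sketch (assuming the usual HDP layout, where the file lives under /etc/hive-webhcat/conf; adjust the path to your distribution) is to grep the property directly:

[root@sandbox ~]# grep -A1 templeton.port /etc/hive-webhcat/conf/webhcat-site.xml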

Let’s try a curl against WebHCat to see the DDL of the default database:

[root@sandbox ~]# curl -i --negotiate -u: "http://sandbox.hortonworks.com:50111/templeton/v1/ddl/database/default"
HTTP/1.1 401 Authentication required
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=/; Expires=Thu, 01-Jan-1970 00:00:00 GMT; HttpOnly
Cache-Control: must-revalidate,no-cache,no-store
Content-Type: text/html;charset=ISO-8859-1
Content-Length: 1328
Server: Jetty(7.6.0.v20120127)

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 401 Authentication required</title>
</head>
<body>
<h2>HTTP ERROR: 401</h2>
<p>Problem accessing /templeton/v1/ddl/database/default. Reason:
<pre> Authentication required</pre></p>
<hr /><i><small>Powered by Jetty://</small></i>

Hmmm, obviously: we have to kinit ourselves before being able to access HCatalog.

[root@sandbox ~]# kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
[root@sandbox ~]# curl -i --negotiate -u: "http://sandbox.hortonworks.com:50111/templeton/v1/ddl/database/default"
HTTP/1.1 401 Authentication required
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=/; Expires=Thu, 01-Jan-1970 00:00:00 GMT; HttpOnly
Cache-Control: must-revalidate,no-cache,no-store
Content-Type: text/html;charset=ISO-8859-1
Content-Length: 1328
Server: Jetty(7.6.0.v20120127)

HTTP/1.1 500 Server Error
Set-Cookie: hadoop.auth="u=hdfs&p=hdfs@HORTONWORKS.COM&t=kerberos&e=1475885113041&s=p+38gIJagH2o1pTkoGK+af3a6Ks="; Path=/; Expires=Sat, 08-Oct-2016 00:05:13 GMT; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(7.6.0.v20120127)

{"error":"User: HTTP/sandbox.hortonworks.com@HORTONWORKS.COM is not allowed to impersonate hdfs"}

 

This is a fairly common message: as you’re calling a REST API, your request is authenticated with a so-called SPNEGO token, which you can think of as “Kerberos for HTTP”.

You must then be able to authenticate with your SPNEGO token, but the HTTP service principal must also be allowed to impersonate you (meaning HTTP will issue the request on behalf of your username).

Those proxyuser parameters can be found in the HDFS core-site.xml:

[Screenshot: HTTP proxyuser configuration in core-site.xml]

So here we can see that HTTP can only impersonate users belonging to the users group.
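
For reference, such a configuration typically looks like the snippet below in core-site.xml (the groups value matches what the screenshot shows on the sandbox; the hosts value is an assumption, restrict it if needed):

<property>
<name>hadoop.proxyuser.HTTP.groups</name>
<value>users</value>
</property>

<property>
<name>hadoop.proxyuser.HTTP.hosts</name>
<!-- * is an assumption here; restrict it to the WebHCat host(s) if needed -->
<value>*</value>
</property>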

[root@sandbox ~]# id hdfs
uid=505(hdfs) gid=501(hadoop) groups=501(hadoop),503(hdfs)
[root@sandbox ~]# id ambari-qa
uid=1001(ambari-qa) gid=501(hadoop) groups=501(hadoop),100(users)

That’s right, hdfs doesn’t belong to that group. However, ambari-qa does! Let’s kinit as ambari-qa.

[root@sandbox ~]# kinit -kt /etc/security/keytabs/smokeuser.headless.keytab ambari-qa
[root@sandbox ~]# curl -i --negotiate -u: "http://sandbox.hortonworks.com:50111/templeton/v1/ddl/database/default"
HTTP/1.1 401 Authentication required
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=/; Expires=Thu, 01-Jan-1970 00:00:00 GMT; HttpOnly
Cache-Control: must-revalidate,no-cache,no-store
Content-Type: text/html;charset=ISO-8859-1
Content-Length: 1328
Server: Jetty(7.6.0.v20120127)

HTTP/1.1 200 OK
Set-Cookie: hadoop.auth="u=ambari-qa&p=ambari-qa@HORTONWORKS.COM&t=kerberos&e=1475885666292&s=/WGJZIe4BRKBoI4UmxfHUv8r7MU="; Path=/; Expires=Sat, 08-Oct-2016 00:14:26 GMT; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(7.6.0.v20120127)

{"location":"hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse","ownerType":"ROLE","owner":"public","comment":"Default Hive database","database":"default"}

That’s it, you got your DDL!
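
From there you can explore the other WebHCat DDL resources the same way; for instance, a sketch listing the tables of the default database:

[root@sandbox ~]# curl -i --negotiate -u: "http://sandbox.hortonworks.com:50111/templeton/v1/ddl/database/default/table"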

 


make Hive console verbose

If you have a failing query in your Hive CLI with nothing more helpful than FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask, then you may want to add some verbosity… Simply launch hive with the logger redirected to the console:

hive -hiveconf hive.root.logger=INFO,console
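
If INFO is still not chatty enough, the same trick works with a more verbose log4j level:

hive -hiveconf hive.root.logger=DEBUG,console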


sample table in Hive

When you want to test some “real” stuff, it can be useful to have a Hive table with some “big” data in it.

Let’s build a table based on Wikipedia traffic stats. It will contain page counts for every page since 2008, which also gives us partitions to play with.

These stats are available here, so let’s grab those files first.
That’s a lot of files, so here we’re getting only the 2008/01 files, each between 20 and 30 MB.

I installed axel, a download accelerator, available in the EPEL repo or via the RPM below:

[vagrant@gw ~]$ sudo rpm -ivh http://pkgs.repoforge.org/axel/axel-2.4-1.el6.rf.x86_64.rpm
[vagrant@gw pagecounts]$ for YEAR in {2008..2008}; do for MONTH in {01..01}; do for DAY in {01..08}; do for HOUR in {00..23}; do axel -a http://dumps.wikimedia.org/other/pagecounts-raw/${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz; done; done; done; done

These files contain lines like:

fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624
fr.b Special:Recherche/Acteurs_et_actrices_N 1 739
fr.b Special:Recherche/Agrippa_d/%27Aubign%C3%A9 1 743

fr.b is the project name (French books), followed by the title of the page, the number of requests, and finally the size of the content returned.

Let’s copy that data into HDFS:

[vagrant@gw pagecounts]$ for DAY in {01..08}; do
sudo -u hdfs hadoop fs -mkdir -p /apps/hive/warehouse/pagecounts/dt=2008-01-${DAY};
sudo -u hdfs hadoop fs -put pagecounts-200801${DAY}-* /apps/hive/warehouse/pagecounts/dt=2008-01-${DAY};
done
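
A quick sanity check before creating the table, simply listing one of the partition directories we just created:

[vagrant@gw pagecounts]$ sudo -u hdfs hadoop fs -ls /apps/hive/warehouse/pagecounts/dt=2008-01-01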

Now we can create the Hive table

hive> CREATE EXTERNAL TABLE pagecounts (project String, title String, requests Int, size Int) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Then you can add the partitions manually:
hive> ALTER TABLE pagecounts ADD IF NOT EXISTS PARTITION(dt="2008-01-01") LOCATION "/apps/hive/warehouse/pagecounts/dt=2008-01-01";

or let Hive do it for you automagically:

hive> MSCK REPAIR TABLE pagecounts;
OK
Partitions not in metastore: pagecounts:dt=2008-01-02 pagecounts:dt=2008-01-03 pagecounts:dt=2008-01-04 pagecounts:dt=2008-01-05 pagecounts:dt=2008-01-06 pagecounts:dt=2008-01-07 pagecounts:dt=2008-01-08 pagecounts:dt=2008-01-09 pagecounts:dt=2008-01-10
Repair: Added partition to metastore pagecounts:dt=2008-01-02
Repair: Added partition to metastore pagecounts:dt=2008-01-03
Repair: Added partition to metastore pagecounts:dt=2008-01-04
Repair: Added partition to metastore pagecounts:dt=2008-01-05
Repair: Added partition to metastore pagecounts:dt=2008-01-06
Repair: Added partition to metastore pagecounts:dt=2008-01-07
Repair: Added partition to metastore pagecounts:dt=2008-01-08
Repair: Added partition to metastore pagecounts:dt=2008-01-09
Repair: Added partition to metastore pagecounts:dt=2008-01-10
Time taken: 0.762 seconds, Fetched: 9 row(s)
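
To check that the partitions are usable, here is a sample query sketch on that table (the project value is just an example, and the results will obviously depend on the files you downloaded):

hive> SELECT title, SUM(requests) AS total_requests FROM pagecounts WHERE dt='2008-01-01' AND project='fr' GROUP BY title ORDER BY total_requests DESC LIMIT 10;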

 


Hive tips & tricks

In the Hive CLI, see the column headers when executing a query:

hive> select * from sample_07 limit 2;
OK
00-0000 All Occupations 134354250 40690
11-0000 Management occupations 6003930 96150
Time taken: 1.313 seconds, Fetched: 2 row(s)
hive> set hive.cli.print.header=true;
hive> select * from sample_07 limit 2;
OK
sample_07.code sample_07.description sample_07.total_emp sample_07.salary
00-0000 All Occupations 134354250 40690
11-0000 Management occupations 6003930 96150
Time taken: 1.199 seconds, Fetched: 2 row(s)

Include the current database in the Hive prompt:

hive> set hive.cli.print.current.db=true;
hive (default)>
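
If you want those settings every time you open the CLI, you can put them in $HOME/.hiverc, which the Hive CLI sources at startup:

[vagrant@gw ~]$ cat ~/.hiverc
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;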

 

When you want to merge small files in a Hive partition, you have two simple solutions:

* from Hive 0.14, use CONCATENATE (this works on RCFile and ORC tables)

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;

* or you can use the ARCHIVE command to merge the partition’s files into a HAR (Hadoop ARchive) file

hive> set hive.archive.enabled=true;
hive> set har.partfile.size=1099511627776;
hive> ALTER TABLE tabname ARCHIVE PARTITION (dt_partition='2015-03-10');
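
The ARCHIVE operation is reversible: the same syntax with UNARCHIVE restores the original files:

hive> ALTER TABLE tabname UNARCHIVE PARTITION (dt_partition='2015-03-10');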


lock tables in Hive

To enable locking in Hive, you must first enable the lock manager by setting these two parameters in hive-site.xml:

<property>
<name>hive.zookeeper.quorum</name>
<value>sandbox.hortonworks.com:2181</value>
<description>The list of zookeeper servers to talk to.
This is only needed for read/write locks.</description>
</property>

<property>
<name>hive.support.concurrency</name>
<value>true</value>
<description>Whether Hive supports concurrency or not.
A Zookeeper instance must be up and running for the default
Hive lock manager to support read-write locks.</description>
</property>
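
After restarting Hive, a quick way to confirm the setting was picked up (set followed by just a property name prints its current value):

hive> set hive.support.concurrency;
hive.support.concurrency=true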

And here is how to use it:

hive> lock table my_table exclusive;
OK

You can see if there’s a lock on a table

hive> show locks my_table;
OK
default@my_table EXCLUSIVE
Time taken: 0.952 seconds, Fetched: 1 row(s)

When trying to access the table while the lock is held:

hive> select count(*) from my_table;
conflicting lock present for default@my_table mode SHARED
conflicting lock present for default@my_table mode SHARED
...

and you can release the lock with

hive> unlock table my_table;
OK
Time taken: 1.126 seconds

Then the previous query will be executed (by default the lock attempt is retried every 60 seconds, as set by hive.lock.sleep.between.retries).
Lock acquisitions (hive.lock.numretries) are retried 100 times, unlocks (hive.unlock.numretries) 10 times.
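
Those retry counts can be adjusted per session if the defaults don’t suit you (the values below are just an illustration):

hive> set hive.lock.numretries=25;
hive> set hive.unlock.numretries=5;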


Create a sample table

Let’s quickly add a new table to be able to play with Hive.

We’ll use the /etc/passwd file on our Linux system as a Hive table:

[vagrant@gw ~]$ hadoop fs -put /etc/passwd /tmp
[vagrant@gw ~]$ sudo su hdfs
[hdfs@gw vagrant]$ hive
hive> CREATE TABLE passwd (
user STRING,dummy STRING,uid INT,gid INT,name STRING,home STRING,shell STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ':'
STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/tmp/passwd' OVERWRITE INTO TABLE passwd;
hive> select * from passwd;
OK
root x 0 0 root /root /bin/bash
bin x 1 1 bin /bin /sbin/nologin
daemon x 2 2 daemon /sbin /sbin/nologin
adm x 3 4 adm /var/adm /sbin/nologin
lp x 4 7 lp /var/spool/lpd /sbin/nologin
...
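
And a small query sketch to play with it, for instance counting accounts per shell:

hive> SELECT shell, count(*) FROM passwd GROUP BY shell;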

Voilà !