
accessing Hive tables with curl and WebHCat

For quick and easy access, you can think about using WebHCat, a REST interface to HCatalog and, through it, Hive.

Let’s assume we’re on a kerberized cluster (you wouldn’t be on an unkerberized cluster, remember…)

First, we check which port is used (the default is 50111) in webhcat-site.xml, under the Hive configuration files (or in the Hive configuration within the Ambari interface).

[Screenshot: templeton (webHCat) port]

Templeton is the former name of WebHCat, hence the /templeton path you’ll see in the URLs below.
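If you want to check from the command line rather than from Ambari, you can grep the property directly; the path below is the usual HDP layout, so adjust it to your install:

# the WebHCat port lives in the templeton.port property
grep -A1 'templeton.port' /etc/hive-webhcat/conf/webhcat-site.xml
#   <name>templeton.port</name>
#   <value>50111</value>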

Let’s try a curl against WebHCat to see the DDL of the default database:

[root@sandbox ~]# curl -i --negotiate -u: "http://sandbox.hortonworks.com:50111/templeton/v1/ddl/database/default"
HTTP/1.1 401 Authentication required
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=/; Expires=Thu, 01-Jan-1970 00:00:00 GMT; HttpOnly
Cache-Control: must-revalidate,no-cache,no-store
Content-Type: text/html;charset=ISO-8859-1
Content-Length: 1328
Server: Jetty(7.6.0.v20120127)

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 401 Authentication required</title>
</head>
<body>
<h2>HTTP ERROR: 401</h2>
<p>Problem accessing /templeton/v1/ddl/database/default. Reason:
<pre> Authentication required</pre></p>
<hr /><i><small>Powered by Jetty://</small></i>

Hmmm, obviously: we have to kinit ourselves before being able to access HCatalog.

[root@sandbox ~]# kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
[root@sandbox ~]# curl -i --negotiate -u: "http://sandbox.hortonworks.com:50111/templeton/v1/ddl/database/default"
HTTP/1.1 401 Authentication required
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=/; Expires=Thu, 01-Jan-1970 00:00:00 GMT; HttpOnly
Cache-Control: must-revalidate,no-cache,no-store
Content-Type: text/html;charset=ISO-8859-1
Content-Length: 1328
Server: Jetty(7.6.0.v20120127)

HTTP/1.1 500 Server Error
Set-Cookie: hadoop.auth="u=hdfs&p=hdfs@HORTONWORKS.COM&t=kerberos&e=1475885113041&s=p+38gIJagH2o1pTkoGK+af3a6Ks="; Path=/; Expires=Sat, 08-Oct-2016 00:05:13 GMT; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(7.6.0.v20120127)

{"error":"User: HTTP/sandbox.hortonworks.com@HORTONWORKS.COM is not allowed to impersonate hdfs"}


This is a fairly common message: since you’re calling a REST API, your request is authenticated with a so-called SPNEGO token, which you can think of as “Kerberos for HTTP”.

You must therefore be able to authenticate with a SPNEGO token, but the HTTP principal must also be allowed to impersonate you (meaning HTTP will make the request on behalf of your username).
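A quick way to check which principal curl will present (and therefore which user HTTP will try to impersonate) is klist; you should see something like this:

[root@sandbox ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs@HORTONWORKS.COM
...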

Those proxyuser parameters can be found in the HDFS core-site.xml:

[Screenshot: HTTP proxyuser configuration]
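They typically look like the following (a sketch: /etc/hadoop/conf is the usual HDP location, and the hosts value shown here is only illustrative):

# which groups the HTTP principal may impersonate users from, and from which hosts
grep -A1 'hadoop.proxyuser.HTTP' /etc/hadoop/conf/core-site.xml
#   <name>hadoop.proxyuser.HTTP.groups</name>
#   <value>users</value>
#   <name>hadoop.proxyuser.HTTP.hosts</name>
#   <value>sandbox.hortonworks.com</value>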

So here, we can see that HTTP can only impersonate users belonging to the group users.

[root@sandbox ~]# id hdfs
uid=505(hdfs) gid=501(hadoop) groups=501(hadoop),503(hdfs)
[root@sandbox ~]# id ambari-qa
uid=1001(ambari-qa) gid=501(hadoop) groups=501(hadoop),100(users)

That’s right, hdfs doesn’t belong to that group. However, ambari-qa does! Let’s kinit as ambari-qa.

[root@sandbox ~]# kinit -kt /etc/security/keytabs/smokeuser.headless.keytab ambari-qa
[root@sandbox ~]# curl -i --negotiate -u: "http://sandbox.hortonworks.com:50111/templeton/v1/ddl/database/default"
HTTP/1.1 401 Authentication required
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=/; Expires=Thu, 01-Jan-1970 00:00:00 GMT; HttpOnly
Cache-Control: must-revalidate,no-cache,no-store
Content-Type: text/html;charset=ISO-8859-1
Content-Length: 1328
Server: Jetty(7.6.0.v20120127)

HTTP/1.1 200 OK
Set-Cookie: hadoop.auth="u=ambari-qa&p=ambari-qa@HORTONWORKS.COM&t=kerberos&e=1475885666292&s=/WGJZIe4BRKBoI4UmxfHUv8r7MU="; Path=/; Expires=Sat, 08-Oct-2016 00:14:26 GMT; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(7.6.0.v20120127)

{"location":"hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse","ownerType":"ROLE","owner":"public","comment":"Default Hive database","database":"default"}

That’s it, you got your DDL!
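The same pattern works for the other WebHCat DDL resources; for instance, listing the tables of the default database goes through the table resource (same call, one more path element):

[root@sandbox ~]# curl -i --negotiate -u: "http://sandbox.hortonworks.com:50111/templeton/v1/ddl/database/default/table"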



sample table in Hive

When you want to test some “real” stuff, it can be useful to have a Hive table with some “big” data in it.

Let’s build a table based on the Wikipedia traffic stats. It will contain page counts for every page since 2008, which also gives us partitions to play with.

These stats are available at http://dumps.wikimedia.org/other/pagecounts-raw/ so let’s grab those files first.
That’s a lot of files; here we’re only getting the 2008/01 files, which are each between 20 and 30 MB.

I installed axel, a download accelerator, available in the EPEL repo or directly as an RPM:

[vagrant@gw ~]$ sudo rpm -ivh http://pkgs.repoforge.org/axel/axel-2.4-1.el6.rf.x86_64.rpm
[vagrant@gw pagecounts]$ for YEAR in {2008..2008}; do
  for MONTH in {01..01}; do
    for DAY in {01..08}; do
      for HOUR in {00..23}; do
        axel -a http://dumps.wikimedia.org/other/pagecounts-raw/${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz;
      done;
    done;
  done;
done

Here is what these files contain:

fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624
fr.b Special:Recherche/Acteurs_et_actrices_N 1 739
fr.b Special:Recherche/Agrippa_d/%27Aubign%C3%A9 1 743

fr.b is the project name (French Wikibooks), followed by the title of the page, the number of requests and finally the size of the content returned.

Let’s copy that data into HDFS:

[vagrant@gw pagecounts]$ for DAY in {01..08}; do
sudo -u hdfs hadoop fs -mkdir -p /apps/hive/warehouse/pagecounts/dt=2008-01-${DAY};
sudo -u hdfs hadoop fs -put pagecounts-200801${DAY}-* /apps/hive/warehouse/pagecounts/dt=2008-01-${DAY};
done
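A quick sanity check that the files landed where Hive will look for them (just listing one of the partition directories created above):

sudo -u hdfs hadoop fs -ls /apps/hive/warehouse/pagecounts/dt=2008-01-01 | head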

Now we can create the Hive table:

hive> CREATE EXTERNAL TABLE pagecounts (project STRING, title STRING, requests INT, size INT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Then you can add the partitions manually:
hive> ALTER TABLE pagecounts ADD IF NOT EXISTS PARTITION(dt="2008-01-01") LOCATION "/apps/hive/warehouse/pagecounts/dt=2008-01-01";

or let Hive do it for you automagically:

hive> MSCK REPAIR TABLE pagecounts;
OK
Partitions not in metastore: pagecounts:dt=2008-01-02 pagecounts:dt=2008-01-03 pagecounts:dt=2008-01-04 pagecounts:dt=2008-01-05 pagecounts:dt=2008-01-06 pagecounts:dt=2008-01-07 pagecounts:dt=2008-01-08 pagecounts:dt=2008-01-09 pagecounts:dt=2008-01-10
Repair: Added partition to metastore pagecounts:dt=2008-01-02
Repair: Added partition to metastore pagecounts:dt=2008-01-03
Repair: Added partition to metastore pagecounts:dt=2008-01-04
Repair: Added partition to metastore pagecounts:dt=2008-01-05
Repair: Added partition to metastore pagecounts:dt=2008-01-06
Repair: Added partition to metastore pagecounts:dt=2008-01-07
Repair: Added partition to metastore pagecounts:dt=2008-01-08
Repair: Added partition to metastore pagecounts:dt=2008-01-09
Repair: Added partition to metastore pagecounts:dt=2008-01-10
Time taken: 0.762 seconds, Fetched: 9 row(s)
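With the partitions in place, we can run a first real query. For example (assuming the plain fr project code for the French Wikipedia, as opposed to the fr.b Wikibooks code seen above), the most requested pages on January 1st:

hive> SELECT title, SUM(requests) AS total_requests
FROM pagecounts
WHERE dt = '2008-01-01' AND project = 'fr'
GROUP BY title
ORDER BY total_requests DESC
LIMIT 10;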



Create a sample table

Let’s quickly add a new table to be able to play with Hive.

We’ll use the /etc/passwd file of our Linux system as a Hive table:

[vagrant@gw ~]$ hadoop fs -put /etc/passwd /tmp
[vagrant@gw ~]$ sudo su hdfs
[hdfs@gw vagrant]$ hive
hive> CREATE TABLE passwd (
user STRING,dummy STRING,uid INT,gid INT,name STRING,home STRING,shell STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ':'
STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/tmp/passwd' OVERWRITE INTO TABLE passwd;
hive> select * from passwd;
OK
root x 0 0 root /root /bin/bash
bin x 1 1 bin /bin /sbin/nologin
daemon x 2 2 daemon /sbin /sbin/nologin
adm x 3 4 adm /var/adm /sbin/nologin
lp x 4 7 lp /var/spool/lpd /sbin/nologin
...
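And to check it behaves like any other Hive table, here is a quick aggregation (a sketch, counting accounts per login shell):

hive> SELECT shell, COUNT(*) AS nb_accounts FROM passwd GROUP BY shell;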

Voilà!