Getting Started with Screwjack(Hive)¶

This is a show case for the usage of screwjack hive runtime. This section we will create a EMR hive module to get the topN hot tokens in search queries.

At first, thanks AOL share their query log online [1]. We download a piece of them from http://www.infochimps.com/datasets/aol-search-data as the test data in this demo. The data looks like below:

AnonID  Query   QueryTime   ItemRank    ClickURL
rentdirect.com  2006-03-01 07:17:12
www.prescriptionfortime.com 2006-03-12 12:31:06
staple.com  2006-03-17 21:19:29
staple.com  2006-03-17 21:19:45
www.newyorklawyersite.com   2006-03-18 08:02:58
www.newyorklawyersite.com   2006-03-18 08:03:09
westchester.gov 2006-03-20 03:55:57 1   http://www.westchestergov.com
space.comhttp   2006-03-24 20:51:24
dfdf    2006-03-24 22:23:07
dfdf    2006-03-24 22:23:14
vaniqa.comh 2006-03-25 23:27:12
www.collegeucla.edu 2006-04-03 21:12:14
www.elaorg  2006-04-03 21:25:20
207 ad2d 530    2006-04-08 01:31:04
207 ad2d 530    2006-04-08 01:31:14 1   http://www.courts.state.ny.us
broadway.vera.org   2006-04-08 08:38:23
broadway.vera.org   2006-04-08 08:38:31
vera.org    2006-04-08 08:38:42 1   http://www.vera.org
broadway.vera.org   2006-04-08 08:39:30
frankmellace.com    2006-04-09 02:19:24
ucs.ljx.com 2006-04-09 02:20:44
attornyleslie.com   2006-04-13 00:25:27
merit release appearance    2006-04-22 23:51:18
............................

Step 0: Initialize a hive module¶

screwjack init emr_hive -n hot_token_topN_on_emr -d "Get hotest token in search engine query log."

When prompt for Module Version and other options, you can press ENTER to use default options.

Module Version [0.1]:
Module Entry Command [/usr/bin/python main.py]:
Base Image [zetdata/ubuntu:trusty]:
init emr_hive
Sucessfully created 'hot_token_topn_on_emr'

After that, you will get a directory which its name is hot_token_topn_on_emr.

Step 1: Add the input/output and parameter to this module.¶

screwjack input_add query_log_s3_dir hive.s3.id_query_querytime
screwjack output_add hot_token_topN_s3_dir hive.s3.table.token_count
screwjack param_add topN string

Here query_log_dir is the hdfs dirctory which contain the raw data. The schema of the data is id,query and querytime. The hot_token_topN is the hive table name we gonna dump our result in.

Step 2: (optional) Make the UDF help explore the query into tokens¶

This step is optional. When you want a UDF in your module. Here is a example at: example-modules. After you build the jar of your UDF, you should put it into ./resource/udfs. Then, the HiveRuntime can automatically upload files s3.

Step 3: Write the hive script.¶

Then, we can open main.hql to fill our code like this:

set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;

CREATE TEMPORARY FUNCTION splitword AS 'com.your_company.hive.udtf.SplitWord';

--CREATE OUTPUT TABLE

DROP TABLE IF EXISTS hot_token_topN_table;
CREATE EXTERNAL  TABLE hot_token_topN_table
(
        token STRING,
        freq  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '${OUTPUT_hot_token_topN_s3_dir}';

--CREATE AN EXTERNAL TABLE TO LOAD THE QUERY DATA
DROP TABLE IF EXISTS query;
CREATE EXTERNAL TABLE query
(
        id STRING,
        site STRING,
        timestp TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '${INPUT_query_log_dir_s3_dir}';

INSERT OVERWRITE TABLE hot_token_topN_table
SELECT token, freq FROM
(
        SELECT token,count(1) AS freq FROM
        (
                SELECT splitword(site) AS token FROM query
        )token_table
        GROUP BY token
)token_frep
ORDER BY freq DESC LIMIT ${PARAM_topN};

You could reference the input parameter defined with screwjack by ${INPUT_inputname}, such as ${INPUT_query_log_s3_dir} in this case. Output parameter by ${OUTPUT_outputname}, such as ${OUTPUT_hot_token_topN_s3_dir} in this case. Parameter by ${PARAM_paramname}, such as ${PARAM_topN} in this case.

Step 4: Test locally¶

Before test, we need to upload the sample data to S3. Here we put them on s3://get-hot-token-kk/input/query. As module take input from its precursor, when we do test, we need to feed it by ourself. We touch the input parameter file and output parameter file to contain the input parameter and the output result. In this case, our input parameter is the s3 directory which contains the query logs. So create the file ./input.param and write s3://get-hot-token-kk/input/query into. Then create an output param to recive the output.

screwjack run local

Then type the corresponding parameter to run the test.

Param 'FILE_DIR' [./resources/files]:
Param 'UDF_DIR' [./resources/udfs]:
Param 'AWS_ACCESS_KEY_ID' []: YOUR_AWS_ACCESS_KEY
Param 'AWS_ACCESS_KEY_SECRET' []: YOUR_AWS_ACCESS_KEY_SECRET
Param 'S3_BUCKET' []: get-hot-token-kk
Param 'AWS_Region' []: us-east-1
Param 'EMR_jobFlowId' []: YOUR_EMR_JOB_FLOW_ID
Param 'topN' []: 10
Input 'query_log_s3_dir': input.param
Output 'hot_token_topN_s3_dir': output.param

During the test, if any error or bug emerge, you could modfiy your udtf and hive script according to the propted log. If everything is ok, the defined output parameter will be created and written into the output.param. If the test finished successfully, we could get the a s3 directory in output.param. Here we get s3://get-hot-token-kk/zetjob/your_username/job456/blk789/OUTPUT_hot_token_topN_s3_dir. Let us open the file on s3 and we get the hotest 10 tokens among the query.

of,110575
-,104052
in,91521
the,82961
for,70107
and,66675
to,45168
free,45149
a,36220
google,34970

Step 5: Test in docker¶

screwjack run docker

This step is to test whether the module could correctly run in a docker image. At first, screwjack will help to build a specific image with a hive runtime in. Then it will test your script and udfs in this image. If it turns out to be a success, we get this module ready to run online.

[1]	G. Pass, A. Chowdhury, C. Torgeson, “A Picture of Search” The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.