Monday, November 2, 2015

Cloud UberOS Data Fighter - ETL Radar System (Pig)

Introduction to Pig
The popularity of Hadoop and the steady growth of its ecosystem come as no surprise. One area in which Hadoop keeps improving is how MapReduce applications get written. Writing Map and Reduce programs is not terribly complex, but it does require some programming experience. Apache Pig changes this by providing a simpler, procedural language layered on top of MapReduce. Instead of writing a standalone MapReduce application, you write a script in the Pig Latin language; Pig automatically compiles the script into MapReduce jobs and submits them to the YARN distributed computing system for execution.
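
As a quick illustration of the difference, the classic word count, which takes dozens of lines of Java as a MapReduce application, fits in a few lines of Pig Latin. The sketch below is mine and not part of this lab; the input file wordcount.txt is hypothetical:

grunt> lines = LOAD 'wordcount.txt' AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grpd = GROUP words BY word;
grunt> counts = FOREACH grpd GENERATE group, COUNT(words);
grunt> dump counts;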

Preparing the Datasets
1. Log in to the Hadoop Client
$ ssh ds01@cla01
ds01@cla01's password:
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.16.0-46-generic x86_64)

* Documentation: https://help.ubuntu.com/

Last login: Tue Sep 1 20:10:04 2015 from 172.17.42.1


[Note] Before using the Pig analysis tool, the HDFS and YARN services must already be running.
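
On a typical Hadoop 2.x installation these are brought up with the stock sbin scripts, run on the appropriate management node (an assumption; your cluster's startup procedure may differ):

$ start-dfs.sh     # starts the NameNode and DataNodes
$ start-yarn.sh    # starts the ResourceManager and NodeManagers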

2. Download the Movie Dataset
$ wget https://raw.githubusercontent.com/rohitsden/pig-tutorial/master/movies_data.csv
$ head -n 6 movies_data.csv
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333

$ wget https://raw.githubusercontent.com/rohitsden/pig-tutorial/master/movies_with_duplicates.csv
$ head -n 6 movies_with_duplicates.csv
1,The Nightmare Before Christmas,1993,3.9,4568
1,The Nightmare Before Christmas,1993,3.9,4568
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150

3. Create Your Own Dataset
$ nano pigdata.txt
1234|emp_1234@company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]

4567|emp_4567@company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
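
This record layout exercises all three of Pig's complex data types: a tuple in parentheses, a bag in braces, and a map in brackets, with '|' as the field delimiter. A load statement along the following lines should parse it; the field names are illustrative choices of mine, not mandated by the lab:

grunt> emp = LOAD 'pigdata.txt' USING PigStorage('|') AS (id:int, email:chararray, name:tuple(first:chararray, middle:chararray, last:chararray), projects:bag{t:tuple(project:chararray)}, skills:map[]);
grunt> describe emp;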

Managing the HDFS Distributed File System with Pig
1. Start the Pig Analysis Tool
$ pig

15/11/02 11:11:15 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/11/02 11:11:15 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/11/02 11:11:15 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-11-02 11:11:15,780 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-11-02 11:11:15,780 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/bigred/pig_1446433875777.log
2015-11-02 11:11:15,808 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/bigred/.pigbootup not found
2015-11-02 11:11:16,478 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-11-02 11:11:16,479 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-11-02 11:11:16,479 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://nna:8020
2015-11-02 11:11:17,383 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2. View the Current HDFS Working Directory
grunt> pwd
hdfs://nna:8020/user/ds01
3. Basic Management Commands
grunt> mkdir test
grunt> fs -touchz test/abc
grunt> ls test
hdfs://nna:8020/user/ds01/test/abc<r 2> 0

grunt> rm test/abc
2015-11-03 00:29:01,998 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file

grunt> rm test
2015-11-03 00:29:10,095 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file

4. Upload the Datasets to the HDFS Distributed File System
grunt> copyfromlocal movies_data.csv   .
grunt> copyfromlocal movies_with_duplicates.csv   .
grunt> copyfromlocal pigdata.txt   .

grunt> ls
hdfs://nna:8020/user/ds01/movies_data.csv<r 2>  2893177
hdfs://nna:8020/user/ds01/movies_with_duplicates.csv<r 2>       539
hdfs://nna:8020/user/ds01/pigdata.txt<r 2>      323

5. Display the Contents of pigdata.txt
grunt> cat pigdata.txt
1234|emp_1234@company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|emp_4567@company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
grunt>

6. Inspect How movies_data.csv Is Stored
grunt> sh hdfs fsck movies_data.csv -files -blocks -locations
Connecting to namenode via http://nna:50070/fsck?ugi=ds01&files=1&blocks=1&locations=1&path=%2Fuser%2Fds01%2Fmovies_data.csv
FSCK started by ds01 (auth:SIMPLE) from /172.17.10.100 for path /user/ds01/movies_data.csv at Tue Nov 03 00:18:03 CST 2015
/user/ds01/movies_data.csv 2893177 bytes, 1 block(s):  OK
0. BP-1112556315-172.17.10.10-1441103394920:blk_1073743638_2814 len=2893177 repl=2 [DatanodeInfoWithStorage[172.17.10.21:50010,DS-1cd1a5f4-914c-4978-aefd-6d850ce4738b,DISK], DatanodeInfoWithStorage[172.17.10.20:50010,DS-0905c56a-c5fb-42fc-8065-b11496f6ff5b,DISK]]

Status: HEALTHY
 Total size:    2893177 B
 Total dirs:    0
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      1 (avg. block size 2893177 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     2.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          2
 Number of racks:               1

FSCK ended at Tue Nov 03 00:18:03 CST 2015 in 8 milliseconds

Pig Data Analysis (Simple Schema)
1. Load the Dataset as a Relation
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);

2. Inspect the Schema of the movies Relation

Because no data types were specified for the fields, the following command shows that each field defaults to bytearray:
grunt> describe movies;
movies: {id: bytearray,name: bytearray,year: bytearray,rating: bytearray,duration: bytearray}
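
If you want concrete types from the start, they can be declared in the LOAD statement instead of relying on the bytearray default; a sketch using the same file:

grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') AS (id:int, name:chararray, year:int, rating:float, duration:int);
grunt> describe movies;

With an explicit schema, the (float) cast used in step 4 below would no longer be needed.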

3. Display the First Five Records
grunt> five = limit movies 5;
grunt> dump five;
                                ::
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)

4. Display Movies with a Rating Greater Than 4
grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;
grunt> DUMP movies_greater_than_four;
                               ::
(49383,Stephen Hawking's Grand Design,2012,4.1,)
(49486,Max Steel: Season 1,2013,4.1,)
(49504,Lilyhammer: Season 2 (Trailer),2013,4.5,106)
(49505,Life With Boys,2011,4.1,)
(49546,Bo Burnham: what.,2013,4.1,3614)
(49549,Life With Boys: Season 1,2011,4.1,)
(49554,Max Steel,2013,4.1,)
(49556,Lilyhammer: Season 1 (Recap),2013,4.2,194)
(49571,The Short Game (Trailer),2013,4.1,156)
(49579,Transformers Prime Beast Hunters: Predacons Rising,2013,4.2,3950)
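
Before storing the result, it can also be aggregated; for example, counting how many of these highly rated titles fall in each year. The aliases by_year and counts below are mine, a sketch rather than part of the lab:

grunt> by_year = GROUP movies_greater_than_four BY year;
grunt> counts = FOREACH by_year GENERATE group AS year, COUNT(movies_greater_than_four) AS n;
grunt> dump counts;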

5. Store the Analysis Results
grunt> store movies_greater_than_four into 'movies_greater_than_four.csv';
                               ::
Input(s):
Successfully read 49590 records (2893545 bytes) from: "hdfs://nna:8020/user/ds01/movies_data.csv"
Output(s):
Successfully stored 897 records (35853 bytes) in: "hdfs://nna:8020/user/ds01/movies_greater_than_four.csv"
                                ::
grunt> ls
hdfs://nna:8020/user/ds01/movies_data.csv<r 2>  2893177
hdfs://nna:8020/user/ds01/movies_greater_than_four.csv  <dir>
hdfs://nna:8020/user/ds01/movies_with_duplicates.csv<r 2>       539

6. Display the Analysis Results
grunt> cat movies_greater_than_four.csv
                                   ::
49554   Max Steel       2013    4.1
49556   Lilyhammer: Season 1 (Recap)    2013    4.2     194
49571   The Short Game (Trailer)        2013    4.1     156
49579   Transformers Prime Beast Hunters: Predacons Rising      2013    4.2     3950

7. Exit Pig
grunt> quit;
2015-08-27 14:53:47,782 [main] INFO  org.apache.pig.Main - Pig script completed in 21 seconds and 322 milliseconds (21322 ms)

Download movies_greater_than_four.csv from HDFS to the data scientist's workstation
ds02@cla01:~$ hdfs dfs -getmerge movies_greater_than_four.csv movie4.csv
ds02@cla01:~$ head -n 3 movie4.csv
139     Pulp Fiction    1994    4.1     9265
288     Life Is Beautiful       1997    4.2     6973
303     Mulan: Special Edition  1998    4.2     5270


Pig Data Analysis (No Schema)

1. Obtain the Taiwan Place Name Dataset

$ wget http://data.moi.gov.tw/MoiOD/System/DownloadFile.aspx?DS=72BA3432-7B07-4FF4-86AA-FD9213006920 -O city.zip
$ ll -h city.zip
-rw-rw-r-- 1 ds02 ds02 9.0M 11月 13 13:40 city.zip

$ unzip city.zip
Archive: city.zip
inflating: жaжW╕ъо╞оw1031227.csv

The extracted file name displays as garbage in a UTF-8 terminal (it is likely Big5-encoded), and the file contents are UCS-2, so convert them to UTF-8 before use:
$ iconv -f UCS-2 -t utf8 жaжW╕ъо╞оw1031227.csv -o city.tmp
$ head -n 2 city.tmp
地名名稱 漢語拼音 通用拼音 地名別稱 所屬縣市 所屬鄉鎮市區 所屬村里 地名意義 地名年代時間(起) 地名年代時間(迄) 地名類型 語言別 命名族群 相關位置與面積描述 地名沿革與文獻歷史簡述 地名相關事項訪談內容 普查使用之地圖與文獻 X坐標 Y坐標

太陽埤, 大安埤(蟳管埤) "Taiyang Pond ,Da-an Pond(Xuenguan Pond)" "Taiyang Pond ,Da-an Pond(Syunguan Pond)" 宜蘭縣 員山鄉 內城村 堡圖上寫作大安陂, 今記為太陽埤, 當地人則俗稱蟳管埤, 乃因此湖形似蟳的大腳, 以形得名。自然地理實體 位於臺7線上聯勤工廠東側山坡上的湖泊。 "" 臺灣地名辭書(卷一)宜蘭縣,臺灣省文獻會

Strip the header row:
$ tail -n +2 city.tmp > city.txt


2. Extract and Transform the Taiwan Place Name Dataset

$ pig
grunt> copyfromlocal city.txt .
grunt> ls
hdfs://nna:8020/user/ds02/city.txt<r 2> 32259628

Loaded without a schema (and with the default tab delimiter), fields are referenced by position: $4 holds the county/city and $5 the township/district.

grunt> a = load 'city.txt';
grunt> b = foreach a generate $4,$5;
grunt> dump b;
                                ::
(澎湖縣,望安鄉)
(澎湖縣,望安鄉)
(澎湖縣,白沙鄉)
(澎湖縣,白沙鄉)

grunt> b_unique = distinct b;
grunt> dump b_unique;
                                 ::
(澎湖縣,望安鄉)
(澎湖縣,白沙鄉)
(澎湖縣,西嶼鄉)
(澎湖縣,馬公市)
(澎湖縣,)

grunt> c = filter b_unique by $1 is not null;
grunt> dump c;


3. Store the Taiwan Place Name Dataset to HDFS

grunt> rmf city.csv
2015-11-13 16:50:36,044 [main] INFO org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file

grunt> store c into 'city.csv' using PigStorage(',');
grunt> cat city.csv
                      ::
高雄市,茂林區
高雄市,茄萣區
高雄市,路竹區
高雄市,阿蓮區
高雄市,鳥松區
高雄市,鳳山區
高雄市,鹽埕區
高雄市,鼓山區
高雄市,那瑪夏區
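
As a further illustrative step (not part of the original lab), the cleaned relation can be aggregated to count townships/districts per county or city; the aliases are mine:

grunt> by_county = GROUP c BY $0;
grunt> n_dist = FOREACH by_county GENERATE group, COUNT(c);
grunt> dump n_dist;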

Pig Latin and XML
Using a PiggyBank User Defined Function (UDF) to process XML data files

1. Create and Upload the XML Data File
$ nano catalog.xml
<?xml version="1.0"?>
<catalog>
      <large-product>
         <name>foo1</name>
         <price>110</price>
      </large-product>
      <large-product>
         <name>foo2</name>
         <price>120</price>
      </large-product>
      <small-product>
         <name>bar1</name>
         <price>10</price>
      </small-product>
      <small-product>
         <name>bar2</name>
         <price>20</price>
      </small-product>
      <small-product>
         <name>bar3</name>
         <price>30</price>
      </small-product>
</catalog>

$ hdfs dfs -put catalog.xml

2. Write the Pig Script
$ nano catalog.pig
-- Register the PiggyBank jar so XMLLoader and the other UDFs are available
REGISTER /opt/pig-0.15.0/lib/piggybank.jar;
-- Load each <small-product> element as a single chararray record
A = LOAD 'catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product') AS (doc:chararray);
-- Pull the name and price out of the XML with a regular expression
clean = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<small-product>\\s*<name>(.*)</name>\\s*<price>(.*)</price>\\s*</small-product>')) AS (name:chararray,price:int);
-- Remove any previous output directory, then store the result
rmf alt_small_data.txt
store clean into 'alt_small_data.txt';

3. Run the Pig Script

$ pig -f catalog.pig
$ pig -e cat alt_small_data.txt
bar1    10
bar2    20
bar3    30
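
The same pattern extends directly to the <large-product> elements; an illustrative variant of mine, run from the Grunt shell:

grunt> REGISTER /opt/pig-0.15.0/lib/piggybank.jar;
grunt> A2 = LOAD 'catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('large-product') AS (doc:chararray);
grunt> large = FOREACH A2 GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<large-product>\\s*<name>(.*)</name>\\s*<price>(.*)</price>\\s*</large-product>')) AS (name:chararray,price:int);
grunt> dump large;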

Pig Latin and JSON
Using JsonLoader to process JSON datasets

Analyzing a Simple JSON Dataset
$ nano first_table.json
{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}

Upload it to HDFS
$ hdfs dfs -put first_table.json

Start Pig
$ pig
grunt> a = LOAD 'first_table.json' USING JsonLoader('food:chararray, person:chararray, amount:int');
grunt> dump a;
(Tacos,Alice,3)
(Tomato Soup,Sarah,2)
(Grilled Cheese,Alex,5)
(,,)
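
The trailing empty tuple (,,) typically comes from a blank line at the end of the input file; if it gets in the way, a simple filter drops it (a sketch):

grunt> a_clean = FILTER a BY food IS NOT NULL;
grunt> dump a_clean;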

Exit Pig
grunt> quit;

Analyzing a Nested JSON Dataset
$ cat second_table.json
{"recipe":"Tacos","ingredients":[{"name":"Beef"},{"name":"Lettuce"},{"name":"Cheese"}],"inventor":{"name":"Alex","age":25}}
{"recipe":"TomatoSoup","ingredients":[{"name":"Tomatoes"},{"name":"Milk"}],"inventor":{"name":"Steve","age":23}}

Upload it to HDFS
$ hdfs dfs -put second_table.json

Start Pig
$ pig
grunt> a = LOAD 'second_table.json' USING JsonLoader('recipe:chararray, ingredients: {(name:chararray)}, inventor: (name:chararray, age:int)');
grunt> dump a;
                       ::
(Tacos,{(Beef),(Lettuce),(Cheese)},(Alex,25))
(TomatoSoup,{(Tomatoes),(Milk)},(Steve,23))
grunt> b = foreach a generate $0,FLATTEN($1);
grunt> dump b;
                ::
(Tacos,Beef)
(Tacos,Lettuce)
(Tacos,Cheese)
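
The nested inventor tuple can be unpacked the same way, projecting its fields with dot notation; an illustrative follow-up (the alias c is mine):

grunt> c = foreach a generate recipe, inventor.name, inventor.age;
grunt> dump c;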

Exit Pig
grunt> quit;


Exit the Hadoop Client
$ exit
