Monday, November 2, 2015

Cloud UberOS Data Fighter - ETL Radar System (Pig)

Introduction to Pig
The popularity of Hadoop and the steady growth of its ecosystem come as no surprise. One area in which Hadoop keeps improving is how MapReduce applications get written. Writing Map and Reduce programs is not terribly complex, but it does require some programming experience. Apache Pig changes this by providing a simpler, procedural language layered on top of MapReduce. Instead of writing a standalone MapReduce application, you write a script in the Pig Latin language; Pig automatically compiles the script into MapReduce jobs and submits them to the YARN distributed computing system for execution.
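
As a quick illustration of the difference, the classic word count, which takes dozens of lines of Java as a MapReduce application, fits in a few lines of Pig Latin. The sketch below is mine and not part of this lab; the input file wordcount.txt is hypothetical:

grunt> lines = LOAD 'wordcount.txt' AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grpd = GROUP words BY word;
grunt> counts = FOREACH grpd GENERATE group, COUNT(words);
grunt> dump counts;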

Preparing the Datasets
1. Log in to the Hadoop Client
$ ssh ds01@cla01
ds01@cla01's password:
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.16.0-46-generic x86_64)

* Documentation: https://help.ubuntu.com/

Last login: Tue Sep 1 20:10:04 2015 from 172.17.42.1


[Note] Before using the Pig analysis tool, the HDFS and YARN services must already be running.
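
On a typical Hadoop 2.x installation these are brought up with the stock sbin scripts, run on the appropriate management node (an assumption; your cluster's startup procedure may differ):

$ start-dfs.sh     # starts the NameNode and DataNodes
$ start-yarn.sh    # starts the ResourceManager and NodeManagers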

2. Download the Movie Dataset
$ wget https://raw.githubusercontent.com/rohitsden/pig-tutorial/master/movies_data.csv
$ head -n 6 movies_data.csv
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333

$ wget https://raw.githubusercontent.com/rohitsden/pig-tutorial/master/movies_with_duplicates.csv
$ head -n 6 movies_with_duplicates.csv
1,The Nightmare Before Christmas,1993,3.9,4568
1,The Nightmare Before Christmas,1993,3.9,4568
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150

3. Create Your Own Dataset
$ nano pigdata.txt
1234|emp_1234@company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]

4567|emp_4567@company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
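
This record layout exercises all three of Pig's complex data types: a tuple in parentheses, a bag in braces, and a map in brackets, with '|' as the field delimiter. A load statement along the following lines should parse it; the field names are illustrative choices of mine, not mandated by the lab:

grunt> emp = LOAD 'pigdata.txt' USING PigStorage('|') AS (id:int, email:chararray, name:tuple(first:chararray, middle:chararray, last:chararray), projects:bag{t:tuple(project:chararray)}, skills:map[]);
grunt> describe emp;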

Managing the HDFS Distributed File System with Pig
1. Start the Pig Analysis Tool
$ pig

15/11/02 11:11:15 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/11/02 11:11:15 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/11/02 11:11:15 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-11-02 11:11:15,780 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-11-02 11:11:15,780 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/bigred/pig_1446433875777.log
2015-11-02 11:11:15,808 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/bigred/.pigbootup not found
2015-11-02 11:11:16,478 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-11-02 11:11:16,479 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-11-02 11:11:16,479 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://nna:8020
2015-11-02 11:11:17,383 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2. View the Current HDFS Working Directory
grunt> pwd
hdfs://nna:8020/user/ds01
3. Basic Management Commands
grunt> mkdir test
grunt> fs -touchz test/abc
grunt> ls test
hdfs://nna:8020/user/ds01/test/abc<r 2> 0

grunt> rm test/abc
2015-11-03 00:29:01,998 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file

grunt> rm test
2015-11-03 00:29:10,095 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file

4. Upload the Datasets to the HDFS Distributed File System
grunt> copyfromlocal movies_data.csv   .
grunt> copyfromlocal movies_with_duplicates.csv   .
grunt> copyfromlocal pigdata.txt   .

grunt> ls
hdfs://nna:8020/user/ds01/movies_data.csv<r 2>  2893177
hdfs://nna:8020/user/ds01/movies_with_duplicates.csv<r 2>       539
hdfs://nna:8020/user/ds01/pigdata.txt<r 2>      323

5. Display the Contents of pigdata.txt
grunt> cat pigdata.txt
1234|emp_1234@company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|emp_4567@company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
grunt>

6. Inspect How movies_data.csv Is Stored
grunt> sh hdfs fsck movies_data.csv -files -blocks -locations
Connecting to namenode via http://nna:50070/fsck?ugi=ds01&files=1&blocks=1&locations=1&path=%2Fuser%2Fds01%2Fmovies_data.csv
FSCK started by ds01 (auth:SIMPLE) from /172.17.10.100 for path /user/ds01/movies_data.csv at Tue Nov 03 00:18:03 CST 2015
/user/ds01/movies_data.csv 2893177 bytes, 1 block(s):  OK
0. BP-1112556315-172.17.10.10-1441103394920:blk_1073743638_2814 len=2893177 repl=2 [DatanodeInfoWithStorage[172.17.10.21:50010,DS-1cd1a5f4-914c-4978-aefd-6d850ce4738b,DISK], DatanodeInfoWithStorage[172.17.10.20:50010,DS-0905c56a-c5fb-42fc-8065-b11496f6ff5b,DISK]]

Status: HEALTHY
 Total size:    2893177 B
 Total dirs:    0
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      1 (avg. block size 2893177 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     2.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          2
 Number of racks:               1

FSCK ended at Tue Nov 03 00:18:03 CST 2015 in 8 milliseconds

Pig Data Analysis (Simple Schema)
1. Load the Dataset as a Relation
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);

2. Inspect the Schema of the movies Relation

Because no data types were specified for the fields, the following command shows that each field defaults to bytearray:
grunt> describe movies;
movies: {id: bytearray,name: bytearray,year: bytearray,rating: bytearray,duration: bytearray}
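
If you want concrete types from the start, they can be declared in the LOAD statement instead of relying on the bytearray default; a sketch using the same file:

grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') AS (id:int, name:chararray, year:int, rating:float, duration:int);
grunt> describe movies;

With an explicit schema, the (float) cast used in step 4 below would no longer be needed.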

3. Display the First Five Records
grunt> five = limit movies 5;
grunt> dump five;
                                ::
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)

4. Display Movies with a Rating Greater Than 4
grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;
grunt> DUMP movies_greater_than_four;
                               ::
(49383,Stephen Hawking's Grand Design,2012,4.1,)
(49486,Max Steel: Season 1,2013,4.1,)
(49504,Lilyhammer: Season 2 (Trailer),2013,4.5,106)
(49505,Life With Boys,2011,4.1,)
(49546,Bo Burnham: what.,2013,4.1,3614)
(49549,Life With Boys: Season 1,2011,4.1,)
(49554,Max Steel,2013,4.1,)
(49556,Lilyhammer: Season 1 (Recap),2013,4.2,194)
(49571,The Short Game (Trailer),2013,4.1,156)
(49579,Transformers Prime Beast Hunters: Predacons Rising,2013,4.2,3950)
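
Before storing the result, it can also be aggregated; for example, counting how many of these highly rated titles fall in each year. The aliases by_year and counts below are mine, a sketch rather than part of the lab:

grunt> by_year = GROUP movies_greater_than_four BY year;
grunt> counts = FOREACH by_year GENERATE group AS year, COUNT(movies_greater_than_four) AS n;
grunt> dump counts;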

5. Store the Analysis Results
grunt> store movies_greater_than_four into 'movies_greater_than_four.csv';
                               ::
Input(s):
Successfully read 49590 records (2893545 bytes) from: "hdfs://nna:8020/user/ds01/movies_data.csv"
Output(s):
Successfully stored 897 records (35853 bytes) in: "hdfs://nna:8020/user/ds01/movies_greater_than_four.csv"
                                ::
grunt> ls
hdfs://nna:8020/user/ds01/movies_data.csv<r 2>  2893177
hdfs://nna:8020/user/ds01/movies_greater_than_four.csv  <dir>
hdfs://nna:8020/user/ds01/movies_with_duplicates.csv<r 2>       539

6. Display the Analysis Results
grunt> cat movies_greater_than_four.csv
                                   ::
49554   Max Steel       2013    4.1
49556   Lilyhammer: Season 1 (Recap)    2013    4.2     194
49571   The Short Game (Trailer)        2013    4.1     156
49579   Transformers Prime Beast Hunters: Predacons Rising      2013    4.2     3950

7. Exit Pig
grunt> quit;
2015-08-27 14:53:47,782 [main] INFO  org.apache.pig.Main - Pig script completed in 21 seconds and 322 milliseconds (21322 ms)

Download movies_greater_than_four.csv from HDFS to the data scientist's workstation
ds02@cla01:~$ hdfs dfs -getmerge movies_greater_than_four.csv movie4.csv
ds02@cla01:~$ head -n 3 movie4.csv
139     Pulp Fiction    1994    4.1     9265
288     Life Is Beautiful       1997    4.2     6973
303     Mulan: Special Edition  1998    4.2     5270


Pig Data Analysis (No Schema)

1. Obtain the Taiwan Place Name Dataset

$ wget http://data.moi.gov.tw/MoiOD/System/DownloadFile.aspx?DS=72BA3432-7B07-4FF4-86AA-FD9213006920 -O city.zip
$ ll -h city.zip
-rw-rw-r-- 1 ds02 ds02 9.0M 11月 13 13:40 city.zip

$ unzip city.zip
Archive: city.zip
inflating: жaжW╕ъо╞оw1031227.csv

The extracted file name displays as garbage in a UTF-8 terminal (it is likely Big5-encoded), and the file contents are UCS-2, so convert them to UTF-8 before use:
$ iconv -f UCS-2 -t utf8 жaжW╕ъо╞оw1031227.csv -o city.tmp
$ head -n 2 city.tmp
地名名稱 漢語拼音 通用拼音 地名別稱 所屬縣市 所屬鄉鎮市區 所屬村里 地名意義 地名年代時間(起) 地名年代時間(迄) 地名類型 語言別 命名族群 相關位置與面積描述 地名沿革與文獻歷史簡述 地名相關事項訪談內容 普查使用之地圖與文獻 X坐標 Y坐標

太陽埤, 大安埤(蟳管埤) "Taiyang Pond ,Da-an Pond(Xuenguan Pond)" "Taiyang Pond ,Da-an Pond(Syunguan Pond)" 宜蘭縣 員山鄉 內城村 堡圖上寫作大安陂, 今記為太陽埤, 當地人則俗稱蟳管埤, 乃因此湖形似蟳的大腳, 以形得名。自然地理實體 位於臺7線上聯勤工廠東側山坡上的湖泊。 "" 臺灣地名辭書(卷一)宜蘭縣,臺灣省文獻會

Strip the header row:
$ tail -n +2 city.tmp > city.txt


2. Extract and Transform the Taiwan Place Name Dataset

$ pig
grunt> copyfromlocal city.txt .
grunt> ls
hdfs://nna:8020/user/ds02/city.txt<r 2> 32259628

Loaded without a schema (and with the default tab delimiter), fields are referenced by position: $4 holds the county/city and $5 the township/district.

grunt> a = load 'city.txt';
grunt> b = foreach a generate $4,$5;
grunt> dump b;
                                ::
(澎湖縣,望安鄉)
(澎湖縣,望安鄉)
(澎湖縣,白沙鄉)
(澎湖縣,白沙鄉)

grunt> b_unique = distinct b;
grunt> dump b_unique;
                                 ::
(澎湖縣,望安鄉)
(澎湖縣,白沙鄉)
(澎湖縣,西嶼鄉)
(澎湖縣,馬公市)
(澎湖縣,)

grunt> c = filter b_unique by $1 is not null;
grunt> dump c;


3. Store the Taiwan Place Name Dataset to HDFS

grunt> rmf city.csv
2015-11-13 16:50:36,044 [main] INFO org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file

grunt> store c into 'city.csv' using PigStorage(',');
grunt> cat city.csv
                      ::
高雄市,茂林區
高雄市,茄萣區
高雄市,路竹區
高雄市,阿蓮區
高雄市,鳥松區
高雄市,鳳山區
高雄市,鹽埕區
高雄市,鼓山區
高雄市,那瑪夏區
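
As a further illustrative step (not part of the original lab), the cleaned relation can be aggregated to count townships/districts per county or city; the aliases are mine:

grunt> by_county = GROUP c BY $0;
grunt> n_dist = FOREACH by_county GENERATE group, COUNT(c);
grunt> dump n_dist;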

Pig Latin and XML
Using a PiggyBank User Defined Function (UDF) to process XML data files

1. Create and Upload the XML Data File
$ nano catalog.xml
<?xml version="1.0"?>
<catalog>
      <large-product>
         <name>foo1</name>
         <price>110</price>
      </large-product>
      <large-product>
         <name>foo2</name>
         <price>120</price>
      </large-product>
      <small-product>
         <name>bar1</name>
         <price>10</price>
      </small-product>
      <small-product>
         <name>bar2</name>
         <price>20</price>
      </small-product>
      <small-product>
         <name>bar3</name>
         <price>30</price>
      </small-product>
</catalog>

$ hdfs dfs -put catalog.xml

2. Write the Pig Script
$ nano catalog.pig
-- Register the PiggyBank jar so XMLLoader and the other UDFs are available
REGISTER /opt/pig-0.15.0/lib/piggybank.jar;
-- Load each <small-product> element as a single chararray record
A = LOAD 'catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product') AS (doc:chararray);
-- Pull the name and price out of the XML with a regular expression
clean = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<small-product>\\s*<name>(.*)</name>\\s*<price>(.*)</price>\\s*</small-product>')) AS (name:chararray,price:int);
-- Remove any previous output directory, then store the result
rmf alt_small_data.txt
store clean into 'alt_small_data.txt';

3. Run the Pig Script

$ pig -f catalog.pig
$ pig -e cat alt_small_data.txt
bar1    10
bar2    20
bar3    30
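
The same pattern extends directly to the <large-product> elements; an illustrative variant of mine, run from the Grunt shell:

grunt> REGISTER /opt/pig-0.15.0/lib/piggybank.jar;
grunt> A2 = LOAD 'catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('large-product') AS (doc:chararray);
grunt> large = FOREACH A2 GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<large-product>\\s*<name>(.*)</name>\\s*<price>(.*)</price>\\s*</large-product>')) AS (name:chararray,price:int);
grunt> dump large;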

Pig Latin and JSON
Using JsonLoader to process JSON datasets

Analyzing a Simple JSON Dataset
$ nano first_table.json
{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}

Upload it to HDFS
$ hdfs dfs -put first_table.json

Start Pig
$ pig
grunt> a = LOAD 'first_table.json' USING JsonLoader('food:chararray, person:chararray, amount:int');
grunt> dump a;
(Tacos,Alice,3)
(Tomato Soup,Sarah,2)
(Grilled Cheese,Alex,5)
(,,)
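
The trailing empty tuple (,,) typically comes from a blank line at the end of the input file; if it gets in the way, a simple filter drops it (a sketch):

grunt> a_clean = FILTER a BY food IS NOT NULL;
grunt> dump a_clean;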

Exit Pig
grunt> quit;

Analyzing a Nested JSON Dataset
$ cat second_table.json
{"recipe":"Tacos","ingredients":[{"name":"Beef"},{"name":"Lettuce"},{"name":"Cheese"}],"inventor":{"name":"Alex","age":25}}
{"recipe":"TomatoSoup","ingredients":[{"name":"Tomatoes"},{"name":"Milk"}],"inventor":{"name":"Steve","age":23}}

Upload it to HDFS
$ hdfs dfs -put second_table.json

Start Pig
$ pig
grunt> a = LOAD 'second_table.json' USING JsonLoader('recipe:chararray, ingredients: {(name:chararray)}, inventor: (name:chararray, age:int)');
grunt> dump a;
                       ::
(Tacos,{(Beef),(Lettuce),(Cheese)},(Alex,25))
(TomatoSoup,{(Tomatoes),(Milk)},(Steve,23))
grunt> b = foreach a generate $0,FLATTEN($1);
grunt> dump b;
                ::
(Tacos,Beef)
(Tacos,Lettuce)
(Tacos,Cheese)
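
The nested inventor tuple can be unpacked the same way, projecting its fields with dot notation; an illustrative follow-up (the alias c is mine):

grunt> c = foreach a generate recipe, inventor.name, inventor.age;
grunt> dump c;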

Exit Pig
grunt> quit;


Exit the Hadoop Client
$ exit
