Big Data Lab: November 2015

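Wednesday, November 25, 2015

Big Data One-Day Hands-On Workshop

Thanks to the strong support of instructor 錫金 and 俐婷 of 中國生產力 (China Productivity Center) and of the 上上群 blade R&D team, we are able to offer a scenario-based Big Data workshop on real hardware (two participants share one Raspberry mini blade).

Course information: http://mymkc.com/app/edms/cpc2546/edm/2015/bigdata4.0-2/index.html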

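Monday, November 23, 2015

WH Array Radar System - Hive Data Warehouse Tool

Apache Hive mainly lets SQL practitioners who cannot write MapReduce programs analyze big data with the SQL syntax they already know. Although more than 90% of Hive's syntax is said to match SQL, its current design architecture means Hive cannot query data as quickly as a relational database. Even so, I believe future versions will overcome these limits one by one!

Figure: Apache Hive architecture

In the figure above, the schema of a Hive table (the MetaStore) is stored in a local or networked relational database (RDBMS), while the table data itself is stored in the HDFS distributed file system. For analysis, the HiveQL you write is first compiled into a MapReduce program, which is then handed to the YARN distributed computing system for execution.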

Start the Hadoop Electronic Warfare System (HDFS, YARN)

1. Start all Hadoop container hosts
$ dkstart a

2. Start the HDFS distributed file system
$ starthdfs a

3. Start the YARN distributed computing system
$ startyarn a

Data Scientist Starts Work
1. Log in to the Hadoop Client container host
$ ssh ds01@cla01

2. Check the Hive version
$ hive --version
Hive 1.2.1
Subversion git://localhost.localdomain/home/sush/dev/hive.git -r 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
From source with checksum ab480aca41b24a9c3751b8c023338231

3. Confirm which NameNode Hive connects to

$ hive -S -e 'set -v' | grep 'fs.defaultFS'
fs.defaultFS=hdfs://nna:8020
mapreduce.job.hdfs-servers=${fs.defaultFS}

Get the Student Dataset by College and University
1. Download the per-school college and university student count file
$ wget --no-check-certificate https://stats.moe.gov.tw/files/detail/103/103_student.txt

2. Convert the encoding and strip the double quotes
$ iconv -f UCS-2 -t utf8 103_student.txt -o temp.txt
$ sed 's/\"//g' < temp.txt >student.txt

3. Remove the ',' characters from the '總計' (total) column data (note that this sed removes every comma on each line)
$ sed 's/,//g' < student.txt >student1.txt

4. Inspect the data
$ head -n 6 student1.txt
大專校院校別學生數
103 學年度 SY2014-2015
學校代碼 學校名稱 日間∕進修別 等級別 總計 男生計 女生計 一年級男生 一年級女生 二年級男生 二年級女生 三年級男生 三年級女生 四年級男生 四年級女生 五年級男生 五年級女生 六年級男生 六年級女生 七年級男生 七年級女生 延修生男生 延修生女生 縣市名稱 體系別
0001 國立政治大學 D 日 D 博士 973 583 390 117 76 79 62 94 58 98 57 75 53 61 43 59 41 - - 30 臺北市 1 一般
0001 國立政治大學 D 日 M 碩士 3816 1750 2066 626 707 573 683 344 404 207 272 - - - - - - - - 30 臺北市 1 一般
0001 國立政治大學 D 日 B 學士 9639 3711 5928 859 1359 843 1423 857 1394 881 1350 - - - - - - 271 402 30 臺北市 1 一般
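The clean-up in steps 2 and 3 can also be sketched in Python. This is only an illustration under assumptions: `clean_line` and `convert` are hypothetical helper names, the sample line is made up in the shape of the source file, and the file is assumed to carry a byte order mark so that Python's `utf-16` codec can detect the byte order.

```python
# Sketch only: mirrors the iconv + sed steps in pure Python.
# clean_line and convert are hypothetical helpers, not part of the original workflow.


def clean_line(line: str) -> str:
    """Drop every double quote and every comma, as the two sed commands do."""
    return line.replace('"', "").replace(",", "")


def convert(src_path: str, dst_path: str) -> None:
    """Decode a UTF-16 file (assumed to start with a BOM) and write cleaned UTF-8."""
    with open(src_path, encoding="utf-16") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(clean_line(line))


# Hypothetical sample line: quoted fields and a thousands separator.
sample = '"0001"\t"國立政治大學"\t"9,639"'
print(clean_line(sample))
```

Without the comma removal, a value such as 9,639 would not parse as an integer when it is later loaded into an int column.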

Start Hive and Create the student Table
1. Start Hive

$ hive -S

2. Create the student table

hive> CREATE TABLE student (code string, name string, type string, class string, total int) row format delimited fields terminated by '\t' stored as textfile;
3. Load the data
hive> load data local inpath 'student1.txt' into table student;

4. Display the data
hive> select code,name,total from student limit 10;
大專校院校別學生數 NULL
101 學年度 SY2012-2013 NULL
學校代碼學校名稱 NULL
0001 國立政治大學 973
0001 國立政治大學 3816
0001 國立政治大學 9639
0001 國立政治大學 1625
0002 國立清華大學 1786
0002 國立清華大學 3920
0002 國立清華大學 6280

5. Print the total number of students per school
hive> select name,sum(total) from student group by name;
::
長庚科技大學 7595
長榮大學 10474
開南大學 9742
靜宜大學 12249
馬偕醫學院 530
馬偕醫護管理專科學校 4160
高美醫護管理專科學校 850
高苑科技大學 7723
高雄醫學大學 6981
黎明技術學院 4602
龍華科技大學 11254
6. Exit Hive
hive> quit;
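Two things in the steps above are worth a small sketch: the title and header lines show NULL in the total column because their text is not an integer, and the GROUP BY in step 5 is compiled into map, shuffle, and reduce phases. The Python below only imitates that flow in memory; the function names are hypothetical, the rows are a hand-picked sample, and it is not how Hive is implemented internally.

```python
from collections import defaultdict

# Sample rows shaped as (code, name, total-as-text). The first row stands in
# for a header line whose "total" is not a number.
rows = [
    ("學校代碼", "學校名稱", "總計"),
    ("0001", "國立政治大學", "973"),
    ("0001", "國立政治大學", "3816"),
    ("0002", "國立清華大學", "1786"),
    ("0002", "國立清華大學", "3920"),
]


def to_int_or_none(text):
    """Return an int, or None when the text is not a valid integer (shown as NULL)."""
    try:
        return int(text)
    except ValueError:
        return None


def map_phase(records):
    """Emit one (name, total) pair per input row."""
    for _code, name, total in records:
        yield name, to_int_or_none(total)


def shuffle(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum each group, skipping None; a group with no numbers sums to None."""
    result = {}
    for name, values in groups.items():
        numbers = [v for v in values if v is not None]
        result[name] = sum(numbers) if numbers else None
    return result


totals = reduce_phase(shuffle(map_phase(rows)))
print(totals)
```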

Inspect Hive's Runtime Architecture
1. Inspect the Hive metadata directory
$ tree -L 2 metastore_db
metastore_db
├── dbex.lck
├── db.lck
├── log
│ ├── log1.dat
│ ├── log.ctrl
│ └── logmirror.ctrl
├── seg0
│ ├── c101.dat
│ ├── c10.dat
│ ├── c111.dat
::
│ ├── cc0.dat
│ ├── cd1.dat
│ ├── ce1.dat
│ └── cf0.dat
├── service.properties
└── tmp

2. Inspect the Hive table data

$ hdfs dfs -ls /user/hive/warehouse

Found 1 items

drwxr-xr-x - ds01 biguser 0 2015-11-23 22:09 /user/hive/warehouse/student

Hive External Tables
1. Download the dataset
$ wget http://community.jaspersoft.com/sites/default/files/wiki_attachments/accounts.csv

$ head -n 1 accounts.csv
a69dae1f-b2ee-1257-3895-438dfb8ea964;2005-11-30 19:19:03;2005-11-30 19:19:03;1;beth_id;1;Alpha-Murraiin Communications, Inc;;Manufacturing;Communications;;;5423 Camby Rd.;La Mesa;CA;35890;USA;;;612-555-4878;;;;www.alpha-murraiincommunications,inc.com;;;;;5423 Camby Rd.;La Mesa;CA;35890;USA;0

2. Upload it to the Hive data warehouse directory
$ hdfs dfs -mkdir /user/hive/myacc
$ hdfs dfs -put accounts.csv /user/hive/myacc

3. Write the createtbl.q script, then run it to create the accounts external table

$ nano createtbl.q

CREATE EXTERNAL TABLE accounts (
  id STRING,
  date_entered STRING,
  date_modified STRING,
  modified_user_id STRING,
  assigned_user_id STRING,
  created_by STRING,
  name STRING,
  parent_id STRING,
  account_type STRING,
  industry STRING,
  annual_revenue STRING,
  phone_fax STRING,
  billing_address_street STRING,
  billing_address_city STRING,
  billing_address_state STRING,
  billing_address_postalcode STRING,
  billing_address_country STRING,
  description STRING,
  rating STRING,
  phone_office STRING,
  phone_alternate STRING,
  email1 STRING,
  email2 STRING,
  website STRING,
  ownership STRING,
  employees STRING,
  sic_code STRING,
  ticker_symbol STRING,
  shipping_address_street STRING,
  shipping_address_city STRING,
  shipping_address_state STRING,
  shipping_address_postalcode STRING,
  shipping_address_country STRING,
  deleted BOOLEAN
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE LOCATION '/user/hive/myacc';

[Note] /user/hive/myacc is a directory, not the name of a dataset file

$ hive -S -f createtbl.q

4. Display the first row of the accounts table
$ hive -S -e 'select * from accounts limit 1'
a69dae1f-b2ee-1257-3895-438dfb8ea964 2005-11-30 19:19:03 2005-11-30 19:19:03 1 beth_id 1 Alpha-Murraiin Communications, Inc Manufacturing Communications 5423 Camby Rd. La Mesa CA 35890 USA 612-555-4878 www.alpha-murraiincommunications,inc.com 5423 Camby Rd. La Mesa CA 35890 USA NULL

5. Display the total row count
$ hive -S -e 'select count(*) from accounts'
1201

6. Drop the accounts table
$ hive -S -e 'drop table accounts'

After the accounts table is dropped, the data file still exists (/user/hive/myacc/accounts.csv)
$ hdfs dfs -ls /user/hive/myacc
Found 1 items

-rw-r--r-- 2 ds01 biguser 357646 2015-11-24 00:37 /user/hive/myacc/accounts.csv
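Step 4 prints NULL in the last column even though the file holds 0 there. A likely explanation, assuming Hive's default text format only accepts the literals true and false for a BOOLEAN column, is sketched below in Python. `split_row` and `parse_boolean` are hypothetical helpers, and the sample line is a shortened stand-in for a row of accounts.csv.

```python
def split_row(line, delimiter=";"):
    """Split one text row on the field delimiter; commas inside fields are untouched."""
    return line.split(delimiter)


def parse_boolean(text):
    """Assumed rule: only 'true'/'false' (any case) parse; anything else becomes None (NULL)."""
    lowered = text.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return None


# Shortened, hypothetical row: id; name; an empty field; deleted flag.
sample = "a69dae1f;Alpha-Murraiin Communications, Inc;;0"
fields = split_row(sample)
print(fields)
print(parse_boolean(fields[-1]))
```

If the flag is needed for analysis, declaring the column as STRING or INT would be one way to keep the 0 and 1 values visible.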

Get the MLB Dataset and Upload It to HDFS
1. Download the Moneyball demo dataset
$ wget http://seanlahman.com/files/database/lahman2012-csv.zip

2. Unzip lahman2012-csv.zip
$ unzip lahman2012-csv.zip

3. Upload the MLB dataset
$ hdfs dfs -mkdir baseball
$ hdfs dfs -put -f *.csv baseball
$ hdfs dfs -ls baseball

Found 26 items
-rw-r--r-- 2 ds01 biguser 198529 2015-11-23 22:40 baseball/AllstarFull.csv
-rw-r--r-- 2 ds01 biguser 5730747 2015-11-23 22:40 baseball/Appearances.csv
-rw-r--r-- 2 ds01 biguser 7304 2015-11-23 22:40 baseball/AwardsManagers.csv
-rw-r--r-- 2 ds01 biguser 240867 2015-11-23 22:40 baseball/AwardsPlayers.csv
-rw-r--r-- 2 ds01 biguser 16719 2015-11-23 22:40 baseball/AwardsShareManagers.csv
-rw-r--r-- 2 ds01 biguser 220135 2015-11-23 22:40 baseball/AwardsSharePlayers.csv
-rw-r--r-- 2 ds01 biguser 6488747 2015-11-23 22:40 baseball/Batting.csv
-rw-r--r-- 2 ds01 biguser 644669 2015-11-23 22:40 baseball/BattingPost.csv
-rw-r--r-- 2 ds01 biguser 8171830 2015-11-23 22:40 baseball/Fielding.csv
-rw-r--r-- 2 ds01 biguser 298470 2015-11-23 22:40 baseball/FieldingOF.csv
-rw-r--r-- 2 ds01 biguser 573945 2015-11-23 22:40 baseball/FieldingPost.csv
-rw-r--r-- 2 ds01 biguser 175990 2015-11-23 22:40 baseball/HallOfFame.csv
-rw-r--r-- 2 ds01 biguser 130719 2015-11-23 22:40 baseball/Managers.csv
-rw-r--r-- 2 ds01 biguser 3662 2015-11-23 22:40 baseball/ManagersHalf.csv
-rw-r--r-- 2 ds01 biguser 3049250 2015-11-23 22:40 baseball/Master.csv
-rw-r--r-- 2 ds01 biguser 3602473 2015-11-23 22:40 baseball/Pitching.csv
-rw-r--r-- 2 ds01 biguser 381812 2015-11-23 22:40 baseball/PitchingPost.csv
-rw-r--r-- 2 ds01 biguser 700024 2015-11-23 22:40 baseball/Salaries.csv
-rw-r--r-- 2 ds01 biguser 42933 2015-11-23 22:40 baseball/Schools.csv
-rw-r--r-- 2 ds01 biguser 180758 2015-11-23 22:40 baseball/SchoolsPlayers.csv
-rw-r--r-- 2 ds01 biguser 8369 2015-11-23 22:40 baseball/SeriesPost.csv
-rw-r--r-- 2 ds01 biguser 550032 2015-11-23 22:40 baseball/Teams.csv
-rw-r--r-- 2 ds01 biguser 3238 2015-11-23 22:40 baseball/TeamsFranchises.csv
-rw-r--r-- 2 ds01 biguser 1609 2015-11-23 22:40 baseball/TeamsHalf.csv
-rw-r--r-- 2 ds01 biguser 2893177 2015-11-23 22:40 baseball/movies_data.csv
-rw-r--r-- 2 ds01 biguser 12 2015-11-23 22:40 baseball/topyaer.csv

Create the MLB Database and the Master Table
1. Create the MLB database
$ hive -S -e 'create database MLB'

2. Write the Hive script
$ nano c_mlb_master.q
create table MLB.Master
( lahmanID INT, playerID STRING, managerID INT, hofID STRING,
birthYear INT, birthMonth INT, birthDay INT, birthCountry STRING,
birthState STRING, birthCity STRING, deathYear INT, deathMonth INT,
deathDay INT, deathCountry STRING, deathState STRING, deathCity STRING,
nameFirst STRING, nameLast STRING, nameNote STRING, nameGiven STRING,
nameNick STRING, weight INT, height INT, bats STRING, throws STRING,
debut STRING, finalGame STRING, college STRING, lahman40ID STRING,
lahman45ID STRING, retroID STRING, holtzID STRING, bbrefID STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

3. Run the Hive script to create the MLB.Master table
$ hive -S -f c_mlb_master.q

Import Data into the MLB.Master Table

1. Start Hive
$ hive -S

2. Import the data
hive> LOAD DATA INPATH "baseball/Master.csv" OVERWRITE INTO TABLE MLB.Master;
3. Select the MLB database
hive> USE MLB;
4. Inspect the Master table contents
hive> SELECT * FROM Master limit 2;
NULL playerID NULL hofID NULL NULL NULL birthCountry birthState birthCity NULL NULL NULL deathCountry deathState deathCity nameFirst nameLast nameNote nameGiven nameNick NULL NULL bats throws debut finalGame college lahman40ID lahman45ID retroID holtzID bbrefID

1 aaronha01 NULL aaronha01h 1934 2 5 USA AL Mobile NULL NULL NULL Hank Aaron Henry Louis "Hammer NULL NULL 180 72 R R 4/13/1954 10/3/1976 aaronha01 aaronha01 aaroh101

5. Count the MLB players (subtracting 1 for the header row that was loaded as data)
hive> SELECT COUNT(*)-1 FROM Master;
18125
6. Exit Hive
hive> quit;
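In the Hank Aaron row above, the nickname column shows `"Hammer` and the columns after it are shifted. That is what a plain comma delimiter does to a quoted field that itself contains commas. The Python sketch below contrasts a naive split with a CSV-aware parser; the sample line is hypothetical and much shorter than a real Master.csv row.

```python
import csv
import io

# Hypothetical, shortened row with a quoted field that contains a comma.
line = 'aaronha01,Hank,Aaron,"Hammer,Bad Henry",180'

# Naive split: what FIELDS TERMINATED BY ',' effectively does.
naive = line.split(",")

# CSV-aware parsing: the quoted field stays in one piece.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

Hive also ships a CSV-aware SerDe (org.apache.hadoop.hive.serde2.OpenCSVSerde); whether it suits this dataset was not tested here.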

Create the MLB.Batting Table and Import Data
1. Create the MLB.Batting table
$ nano c_mlb_batting.q
create table MLB.Batting
( playerID STRING, yearID INT, stint INT, teamID STRING, lgID STRING,
G INT, G_batting INT, AB INT, R INT, H INT, twoB INT, threeB INT, HR INT,
RBI INT, SB INT, CS INT, BB INT, SO INT, IBB INT, HBP INT, SH INT,
SF INT, GIDP INT, G_old INT )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

$ hive -S -f c_mlb_batting.q

2. Import the batting data

$ hive -S -e 'LOAD DATA INPATH "baseball/Batting.csv" OVERWRITE INTO TABLE MLB.Batting'

3. Inspect the batting data
$ hive -S -e 'use MLB; select * from Batting limit 3'

playerID NULL NULL teamID lgID NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
aardsda01 2004 1 SFN NL 11 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11
aardsda01 2006 1 CHN NL 45 43 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 45

JOIN Cross-Table Query
$ hive -S -e 'use MLB; SELECT A.PlayerID, B.teamID, B.AB, B.R, B.H, B.twoB, B.threeB, B.HR, B.RBI FROM Master A JOIN BATTING B ON A.playerID = B.playerID'
::
zuverge01 DET 4 0 0 0 0 0 0
zuverge01 BAL 23 1 5 1 0 0 0
zuverge01 BAL 17 0 2 0 0 0 2
zuverge01 BAL 23 1 3 0 0 0 0
zuverge01 BAL 9 0 2 0 1 0 2
zuverge01 BAL 0 0 0 0 0 0 0
zwilldu01 CHA 87 7 16 5 0 0 5
zwilldu01 CHF 592 91 185 38 8 16 95
zwilldu01 CHF 548 65 157 32 7 13 94
zwilldu01 CHN 53 4 6 1 0 1 8
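The semantics of the inner JOIN on playerID can be sketched with an in-memory hash join in Python. This shows only what the query returns, not how Hive executes it on the cluster; `hash_join` is a hypothetical helper, the batting rows are trimmed to three columns, and `nobat01` is an invented player with no batting rows.

```python
from collections import defaultdict

# Trimmed sample rows. "nobat01" is invented to show that unmatched rows drop out.
master = [("zuverge01",), ("zwilldu01",), ("nobat01",)]
batting = [
    ("zuverge01", "DET", 4),
    ("zwilldu01", "CHA", 87),
    ("zwilldu01", "CHF", 592),
]


def hash_join(left, right):
    """Inner join on the first column: index the right side, then probe with the left."""
    index = defaultdict(list)
    for row in right:
        index[row[0]].append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[0], []):
            joined.append((left_row[0], right_row[1], right_row[2]))
    return joined


result = hash_join(master, batting)
for row in result:
    print(row)
```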

Data Scientist Finishes Work

$ exit

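Monday, November 2, 2015

Cloud UberOS Information Warfare Fighter - ETL Radar System (Pig)

Introduction to Pig
The popularity of Hadoop and the steady growth of its ecosystem come as no surprise. One area where Hadoop keeps advancing is the writing of MapReduce applications. Writing Map and Reduce applications is not especially complex, but it does require some software development experience. Apache Pig changes this by providing a simpler procedural language on top of MapReduce. Instead of writing a separate MapReduce application, you write a script in Pig Latin; Pig automatically generates the MapReduce program from the script and hands it to the YARN distributed computing system for execution.

Prepare the Datasets

1. Log in to the Hadoop Client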

$ ssh ds01@cla01
ds01@cla01's password:
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.16.0-46-generic x86_64)

* Documentation: https://help.ubuntu.com/

Last login: Tue Sep 1 20:10:04 2015 from 172.17.42.1

[Note] Start HDFS and YARN before using the Pig analysis tool

2. Download the movie dataset

$ wget https://raw.githubusercontent.com/rohitsden/pig-tutorial/master/movies_data.csv

$ head -n 6 movies_data.csv

1,The Nightmare Before Christmas,1993,3.9,4568

2,The Mummy,1932,3.5,4388

3,Orphans of the Storm,1921,3.2,9062

4,The Object of Beauty,1991,2.8,6150

5,Night Tide,1963,2.8,5126

6,One Magic Christmas,1985,3.8,5333

$ wget https://raw.githubusercontent.com/rohitsden/pig-tutorial/master/movies_with_duplicates.csv

$ head -n 6 movies_with_duplicates.csv

1,The Nightmare Before Christmas,1993,3.9,4568

2,The Mummy,1932,3.5,4388

3,Orphans of the Storm,1921,3.2,9062

4,The Object of Beauty,1991,2.8,6150

3. Create your own dataset

$ nano pigdata.txt

1234|emp_1234@company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]

4567|emp_4567@company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
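Each line of pigdata.txt combines Pig's three complex types: a tuple in parentheses, a bag in braces, and a map in square brackets with `#` between key and value. The Python sketch below parses one such line into native structures to make the notation concrete. `parse_record` is a hypothetical helper and the meaning given to each field (id, email, name, projects, skills) is an assumption, since the file itself has no header.

```python
import re


def parse_record(line):
    """Parse one '|'-delimited line into (int, str, tuple, list, dict)."""
    emp_id, email, name, projects, skills = line.split("|")
    # Tuple: (a,b,c)
    name_tuple = tuple(name.strip("()").split(","))
    # Bag: {(x),(y),(z)}, a collection of single-field tuples
    project_bag = re.findall(r"\(([^)]*)\)", projects)
    # Map: [key#value,key#value]
    skill_map = dict(pair.split("#", 1) for pair in skills.strip("[]").split(","))
    return int(emp_id), email, name_tuple, project_bag, skill_map


line = (
    "1234|emp_1234@company.com|"
    "(first_name_1234,middle_initial_1234,last_name_1234)|"
    "{(project_1234_1),(project_1234_2),(project_1234_3)}|"
    "[programming#SQL,rdbms#Oracle]"
)
record = parse_record(line)
print(record)
```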

Managing the HDFS Distributed File System with Pig

1. Start the Pig analysis tool
$ pig

15/11/02 11:11:15 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

15/11/02 11:11:15 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

15/11/02 11:11:15 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-11-02 11:11:15,780 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-11-02 11:11:15,780 [main] INFO org.apache.pig.Main - Logging error messages to: /home/bigred/pig_1446433875777.log

2015-11-02 11:11:15,808 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/bigred/.pigbootup not found

2015-11-02 11:11:16,478 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-11-02 11:11:16,479 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2015-11-02 11:11:16,479 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://nna:8020

2015-11-02 11:11:17,383 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2. Show the current HDFS working directory
grunt> pwd
hdfs://nna:8020/user/ds01

3. Basic management commands

grunt> mkdir test

grunt> fs -touchz test/abc

grunt> ls test

hdfs://nna:8020/user/ds01/test/abc<r 2> 0

grunt> rm test/abc

2015-11-03 00:29:01,998 [main] INFO org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file

grunt> rm test

2015-11-03 00:29:10,095 [main] INFO org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file

4. Upload the datasets to the HDFS distributed file system

grunt> copyfromlocal movies_data.csv .

grunt> copyfromlocal movies_with_duplicates.csv .

grunt> copyfromlocal pigdata.txt .

grunt> ls

hdfs://nna:8020/user/ds01/movies_data.csv<r 2> 2893177
hdfs://nna:8020/user/ds01/movies_with_duplicates.csv<r 2> 539
hdfs://nna:8020/user/ds01/pigdata.txt<r 2> 323

5. Display the contents of pigdata.txt

grunt> cat pigdata.txt

1234|emp_1234@company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]

4567|emp_4567@company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]

grunt>

6. Inspect the storage information for movies_data.csv
grunt> sh hdfs fsck movies_data.csv -files -blocks -locations
Connecting to namenode via http://nna:50070/fsck?ugi=ds01&files=1&blocks=1&locations=1&path=%2Fuser%2Fds01%2Fmovies_data.csv
FSCK started by ds01 (auth:SIMPLE) from /172.17.10.100 for path /user/ds01/movies_data.csv at Tue Nov 03 00:18:03 CST 2015
/user/ds01/movies_data.csv 2893177 bytes, 1 block(s): OK
0. BP-1112556315-172.17.10.10-1441103394920:blk_1073743638_2814 len=2893177 repl=2 [DatanodeInfoWithStorage[172.17.10.21:50010,DS-1cd1a5f4-914c-4978-aefd-6d850ce4738b,DISK], DatanodeInfoWithStorage[172.17.10.20:50010,DS-0905c56a-c5fb-42fc-8065-b11496f6ff5b,DISK]]

Status: HEALTHY
Total size: 2893177 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 2893177 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 2
Number of racks: 1

FSCK ended at Tue Nov 03 00:18:03 CST 2015 in 8 milliseconds

Pig Data Analysis (Simple Schema)
1. Load the dataset into a relation
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);

2. Inspect the schema of the movies relation
Because no field types were specified, the following command shows that every field defaults to bytearray
grunt> describe movies;
movies: {id: bytearray,name: bytearray,year: bytearray,rating: bytearray,duration: bytearray}

3. Display the first five rows
grunt> five = limit movies 5;
grunt> dump five;
::
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)

4. Show the movies rated higher than 4

grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;

grunt> DUMP movies_greater_than_four;

(49383,Stephen Hawking's Grand Design,2012,4.1,)

(49486,Max Steel: Season 1,2013,4.1,)

(49504,Lilyhammer: Season 2 (Trailer),2013,4.5,106)

(49505,Life With Boys,2011,4.1,)

(49546,Bo Burnham: what.,2013,4.1,3614)

(49549,Life With Boys: Season 1,2011,4.1,)

(49554,Max Steel,2013,4.1,)

(49556,Lilyhammer: Season 1 (Recap),2013,4.2,194)

(49571,The Short Game (Trailer),2013,4.1,156)

(49579,Transformers Prime Beast Hunters: Predacons Rising,2013,4.2,3950)

5. Store the analysis result
grunt> store movies_greater_than_four into 'movies_greater_than_four.csv';

Input(s):

Successfully read 49590 records (2893545 bytes) from: "hdfs://nna:8020/user/ds01/movies_data.csv"

Output(s):

Successfully stored 897 records (35853 bytes) in: "hdfs://nna:8020/user/ds01/movies_greater_than_four.csv"

grunt> ls

hdfs://nna:8020/user/ds01/movies_data.csv<r 2> 2893177

hdfs://nna:8020/user/ds01/movies_greater_than_four.csv <dir>

hdfs://nna:8020/user/ds01/movies_with_duplicates.csv<r 2> 539

6. Display the analysis result

grunt> cat movies_greater_than_four.csv

49554 Max Steel 2013 4.1

49556 Lilyhammer: Season 1 (Recap) 2013 4.2 194

49571 The Short Game (Trailer) 2013 4.1 156

49579 Transformers Prime Beast Hunters: Predacons Rising 2013 4.2 3950

7. Exit Pig

grunt> quit;

2015-08-27 14:53:47,782 [main] INFO org.apache.pig.Main - Pig script completed in 21 seconds and 322 milliseconds (21322 ms)

Download movies_greater_than_four.csv from HDFS to the data scientist's work host

ds02@cla01:~$ hdfs dfs -getmerge movies_greater_than_four.csv movie4.csv

ds02@cla01:~$ head -n 3 movie4.csv

139 Pulp Fiction 1994 4.1 9265

288 Life Is Beautiful 1997 4.2 6973

303 Mulan: Special Edition 1998 4.2 5270
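The FILTER in step 4 of this section keeps every field as raw text and casts rating to float only inside the predicate. A rough Python equivalent, using three rows taken from the output above, might look like this. `parse` and `rating_above` are hypothetical names, and the naive split assumes no movie title contains a comma.

```python
rows = [
    "1,The Nightmare Before Christmas,1993,3.9,4568",
    "49504,Lilyhammer: Season 2 (Trailer),2013,4.5,106",
    "49383,Stephen Hawking's Grand Design,2012,4.1,",
]


def parse(line):
    """Split on ',' and keep every field as text; an empty duration becomes None."""
    movie_id, name, year, rating, duration = line.split(",")
    return (movie_id, name, year, rating, duration or None)


def rating_above(records, threshold):
    """Cast rating to float only for the comparison, like (float)rating > 4.0."""
    return [r for r in records if float(r[3]) > threshold]


movies = [parse(line) for line in rows]
movies_greater_than_four = rating_above(movies, 4.0)
for movie in movies_greater_than_four:
    print(movie)
```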

Pig Data Analysis (No Schema)

1. Get the Taiwan place-name dataset

$ wget http://data.moi.gov.tw/MoiOD/System/DownloadFile.aspx?DS=72BA3432-7B07-4FF4-86AA-FD9213006920 -O city.zip
$ ll -h city.zip
-rw-rw-r-- 1 ds02 ds02 9.0M 11月 13 13:40 city.zip

$ unzip city.zip
Archive: city.zip
inflating: жaжW╕ъо╞оw1031227.csv

$ iconv -f UCS-2 -t utf8 жaжW╕ъо╞оw1031227.csv -o city.tmp
$ head -n 2 city.tmp
地名名稱漢語拼音通用拼音地名別稱所屬縣市所屬鄉鎮市區所屬村里地名意義地名年代時間(起) 地名年代時間(迄) 地名類型語言別命名族群相關位置與面積描述地名沿革與文獻歷史簡述地名相關事項訪談內容普查使用之地圖與文獻 X坐標 Y坐標

太陽埤，大安埤(蟳管埤) "Taiyang Pond ,Da-an Pond(Xuenguan Pond)" "Taiyang Pond ,Da-an Pond(Syunguan Pond)" 宜蘭縣員山鄉內城村堡圖上寫作大安陂，今記為太陽埤，當地人則俗稱蟳管埤，乃因此湖形似蟳的大腳，以形得名。自然地理實體位於臺7線上聯勤工廠東側山坡上的湖泊。 "" 臺灣地名辭書(卷一)宜蘭縣，臺灣省文獻會

$ tail -n +2 city.tmp > city.txt

2. Extract and transform the Taiwan place-name dataset

$ pig
grunt> copyfromlocal city.txt .
grunt> ls
hdfs://nna:8020/user/ds02/city.txt<r 2> 32259628

grunt> a = load 'city.txt';
grunt> b = foreach a generate $4,$5;
grunt> dump b;
::
(澎湖縣,望安鄉)
(澎湖縣,望安鄉)
(澎湖縣,白沙鄉)
(澎湖縣,白沙鄉)

grunt> b_unique = distinct b;
grunt> dump b_unique;
::
(澎湖縣,望安鄉)
(澎湖縣,白沙鄉)
(澎湖縣,西嶼鄉)
(澎湖縣,馬公市)
(澎湖縣,)

grunt> c = filter b_unique by $1 is not null;
grunt> dump c;

3. Store the Taiwan place-name dataset in HDFS

grunt> rmf city.csv
2015-11-13 16:50:36,044 [main] INFO org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file

grunt> store c into 'city.csv' using PigStorage(',');
grunt> cat city.csv
::
高雄市,茂林區
高雄市,茄萣區
高雄市,路竹區
高雄市,阿蓮區
高雄市,鳥松區
高雄市,鳳山區
高雄市,鹽埕區
高雄市,鼓山區
高雄市,那瑪夏區
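The three Pig steps used in this section (project columns $4 and $5, DISTINCT, then FILTER on a non-null second column) can be sketched in Python as follows. The helper names are hypothetical, the first four columns of the sample lines are placeholders, and the output order of a real Pig DISTINCT may differ from the insertion order kept here.

```python
# Tab-separated sample lines; columns 0-3 are placeholders, 4 and 5 are county and township.
lines = [
    "n1\tp\tt\ta\t澎湖縣\t望安鄉",
    "n2\tp\tt\ta\t澎湖縣\t望安鄉",
    "n3\tp\tt\ta\t澎湖縣\t白沙鄉",
    "n4\tp\tt\ta\t澎湖縣\t",
]


def project(line, indexes=(4, 5)):
    """Like 'foreach a generate $4,$5': pick columns by position, empty means None."""
    fields = line.split("\t")
    return tuple(
        fields[i] if i < len(fields) and fields[i] else None for i in indexes
    )


def distinct(rows):
    """Remove duplicate tuples while keeping first-seen order."""
    return list(dict.fromkeys(rows))


def drop_null_second(rows):
    """Like 'filter ... by $1 is not null'."""
    return [row for row in rows if row[1] is not None]


b = [project(line) for line in lines]
b_unique = distinct(b)
c = drop_null_second(b_unique)
print(c)
```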

Pig Latin and XML
Use a PiggyBank User Defined Function (UDF) to process an XML data file

1. Create and upload the XML data file

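$ nano catalog.xml

<?xml version="1.0"?>
<catalog>
<large-product>
</large-product>
<large-product>
</large-product>
<small-product>
<name>bar1</name>
<price>10</price>
</small-product>
<small-product>
<name>bar2</name>
<price>20</price>
</small-product>
<small-product>
<name>bar3</name>
<price>30</price>
</small-product>
</catalog>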

$ hdfs dfs -put catalog.xml

2. Write the Pig script

$ nano catalog.pig

A = LOAD 'catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product') AS (doc:chararray);

clean = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<small-product>\\s*<name>(.*)</name>\\s*<price>(.*)</price>\\s*</small-product>')) AS (name:chararray,price:int);

rmf alt_small_data.txt

store clean into 'alt_small_data.txt';

3. Run the Pig script

$ pig -f catalog.pig

$ pig -e cat alt_small_data.txt

bar1 10

bar2 20

bar3 30
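The regular expression passed to REGEX_EXTRACT_ALL can be tried out in Python. The sketch below first cuts the document into `small-product` elements (roughly what XMLLoader is used for here) and then applies the same pattern to each element as a whole. The sample catalog is an assumption reconstructed from the pattern and the output above, and the `large-product` content is invented just to show that it is skipped.

```python
import re

# Assumed sample catalog; the large-product values are invented.
CATALOG = """<?xml version="1.0"?>
<catalog>
  <large-product>
    <name>foo1</name>
    <price>110</price>
  </large-product>
  <small-product>
    <name>bar1</name>
    <price>10</price>
  </small-product>
  <small-product>
    <name>bar2</name>
    <price>20</price>
  </small-product>
  <small-product>
    <name>bar3</name>
    <price>30</price>
  </small-product>
</catalog>"""

# One match per <small-product> element, spanning newlines.
ELEMENT = re.compile(r"<small-product>.*?</small-product>", re.DOTALL)

# Same pattern as in catalog.pig (single backslashes in a Python raw string).
FIELDS = re.compile(
    r"<small-product>\s*<name>(.*)</name>\s*<price>(.*)</price>\s*</small-product>"
)


def extract(xml_text):
    """Return (name, price) for every small-product element that matches the pattern."""
    rows = []
    for element in ELEMENT.findall(xml_text):
        match = FIELDS.fullmatch(element)
        if match:
            rows.append((match.group(1), int(match.group(2))))
    return rows


clean = extract(CATALOG)
print(clean)
```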

Pig Latin and JSON
Use JsonLoader to process JSON datasets

Analyze a simple-format JSON dataset

$ nano first_table.json

{"food":"Tacos", "person":"Alice", "amount":3}

{"food":"Tomato Soup", "person":"Sarah", "amount":2}

{"food":"Grilled Cheese", "person":"Alex", "amount":5}

Upload to HDFS

$ hdfs dfs -put first_table.json

Start Pig

$ pig

grunt> a = LOAD 'first_table.json' USING JsonLoader('food:chararray, person:chararray, amount:int');

grunt> dump a;

(Tacos,Alice,3)

(Tomato Soup,Sarah,2)

(Grilled Cheese,Alex,5)

(,,)

Exit Pig
grunt> quit;
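What JsonLoader does with a flat schema can be approximated in Python: each line is one JSON object, and the schema decides which keys become fields and in what order. The last row of the dump above, `(,,)`, presumably comes from an empty line at the end of the file; the sketch models that as a row of None values. `load` and `SCHEMA` are hypothetical names.

```python
import json

# Field order taken from the schema string passed to JsonLoader.
SCHEMA = ("food", "person", "amount")


def load(lines):
    """Turn JSON lines into tuples in schema order; an empty line becomes all None."""
    rows = []
    for line in lines:
        if not line.strip():
            rows.append((None,) * len(SCHEMA))
            continue
        record = json.loads(line)
        rows.append(tuple(record.get(field) for field in SCHEMA))
    return rows


lines = [
    '{"food":"Tacos", "person":"Alice", "amount":3}',
    '{"food":"Tomato Soup", "person":"Sarah", "amount":2}',
    "",
]
a = load(lines)
print(a)
```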

Analyze a nested-format JSON dataset

$ cat second_table.json

{"recipe":"Tacos","ingredients":[{"name":"Beef"},{"name":"Lettuce"},{"name":"Cheese"}],"inventor":{"name":"Alex","age":25}}

{"recipe":"Tomato Soup","ingredients":[{"name":"Tomatoes"},{"name":"Milk"}],"inventor":{"name":"Steve","age":23}}

Upload to HDFS

$ hdfs dfs -put second_table.json

Start Pig

$ pig

grunt> a = LOAD 'second_table.json' USING JsonLoader('recipe:chararray, ingredients: {(name:chararray)}, inventor: (name:chararray, age:int)');

grunt> dump a;

(Tacos,{(Beef),(Lettuce),(Cheese)},(Alex,25))

(Tomato Soup,{(Tomatoes),(Milk)},(Steve,23))

grunt> b = foreach a generate $0,FLATTEN($1);

grunt> dump b;

(Tacos,Beef)

(Tacos,Lettuce)

(Tacos,Cheese)

Exit Pig
grunt> quit;
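FLATTEN on a bag produces one output row per element of the bag, repeating the other fields. A minimal Python sketch of that behaviour, using the two recipes from this section, is shown below; `flatten_bag` is a hypothetical helper and the records are written by hand in the shape Pig prints them.

```python
# (recipe, bag of single-field tuples, inventor tuple)
records = [
    ("Tacos", [("Beef",), ("Lettuce",), ("Cheese",)], ("Alex", 25)),
    ("Tomato Soup", [("Tomatoes",), ("Milk",)], ("Steve", 23)),
]


def flatten_bag(rows):
    """Like 'generate $0, FLATTEN($1)': one row per bag element, recipe repeated."""
    return [(recipe, *item) for recipe, bag, _inventor in rows for item in bag]


b = flatten_bag(records)
for row in b:
    print(row)
```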

Exit the Hadoop Client
$ exit

[References]

1. Apache Pig Tutorial - Part 1: http://www.rohitmenon.com/index.php/apache-pig-tutorial-part-1/
