Before I understood Hive, I always assumed that HadoopDB's use of databases underneath would bring limitations. For example, with a cluster made up of three machines A, B and C, if I created a database on each of A, B and C along with two tables T1 and T2, would a Join query issued through Hive fail because the data does not sit in the same database? Or can tables not be joined at all when they live in different databases? If you have already read through Hive's design, you know perfectly well that none of this is a worry when using Hive on Hadoop.
This experiment is very simple: just test whether HadoopDB really can do Joins, which is the question my boss handed me in the first place. At the time I did not yet understand Hive, so I had my doubts too. Once I understood Hive, I realized these problems are solved by Hive itself, so naturally HadoopDB does not run into them either. Still, I spent a little time finishing the experiment anyway!
Experiment design:
- Use a 3-node cluster: Cluster01, Cluster02 and Cluster03
- Prepare the test data d1 and d2, together with the column definitions for the two tables they will be loaded into
- d1_field_definition (t1, table 1)
  ID int,
  NAME varchar(300)
- d1
  1 A
  2 B
  3 C
  4 D
  5 E
  6 F
  7 G
  8 H
  9 I
- d2_field_definition (t2, table 2)
  ID int,
  ADDRESS varchar(300)
- d2
  9 I_address
  8 H_address
  7 G_address
  6 F_address
  5 E_address
  4 D_address
  3 C_address
  2 B_address
  1 A_address
- A helper script that makes setting up HadoopDB easier
- See the heavily revised script at the end of the post [Linux] 安裝 HadoopDB 於 Hadoop 0.19.2 @ Ubuntu 9.10; here it is saved as batch_setup.py
- Before running batch_setup.py, create a node.txt file that records the IPs of the Cluster01 ~ Cluster03 machines (see the example below)
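A minimal sketch of what node.txt might look like, assuming the script expects one IP per line in Cluster01 ~ Cluster03 order; the addresses here are made up, so substitute your own:
192.168.1.101
192.168.1.102
192.168.1.103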
Experiment steps and results
- All of the commands below are executed from the home directory on Cluster01
- Split d1 into 3 parts, load them into each machine, and create the Hive table
- $ python batch_setup.py -H d1 -t t1 -i t1 -g
- Back up HadoopDB.xml; we will need it later
- $ cp HadoopDB.xml HadoopDB_t1.xml
- Verify with Hive
- $ ./SMS_dist/bin/hive
hive> show tables;
OK
t1
Time taken: 8.772 seconds
hive> select id,name from t1;
Total MapReduce jobs = 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201001201134_0008, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0008
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0008
2010-01-20 02:12:42,545 map = 0%, reduce =0%
2010-01-20 02:12:50,659 map = 33%, reduce =0%
2010-01-20 02:12:56,828 map = 67%, reduce =0%
2010-01-20 02:12:57,855 map = 100%, reduce =0%
Ended Job = job_201001201134_0008
OK
3 C
6 F
9 I
2 B
5 E
8 H
1 A
4 D
7 G
Time taken: 22.447 seconds
hive> quit;
- Remember to quit once the check is done: by default only one client can query Hive at a time, so staying connected would interfere with batch_setup.py's subsequent setup
- Split d2 into 3 parts, load them into each machine, and create the Hive table
- $ python batch_setup.py -H d2 -t t2 -i t2 -g
- Back up HadoopDB.xml; we will need it later
- $ cp HadoopDB.xml HadoopDB_t2.xml
- Verify with Hive
- $ ./SMS_dist/bin/hive
hive> show tables;
OK
t1
t2
Time taken: 9.313 seconds
hive> select id,address from t2;
Total MapReduce jobs = 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201001201134_0010, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0010
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0010
2010-01-20 02:16:45,971 map = 0%, reduce =0%
2010-01-20 02:16:57,115 map = 33%, reduce =0%
2010-01-20 02:17:03,240 map = 67%, reduce =0%
2010-01-20 02:17:04,283 map = 100%, reduce =0%
Ended Job = job_201001201134_0010
OK
9 I_address
8 H_address
5 E_address
4 D_address
3 C_address
7 G_address
6 F_address
2 B_address
1 A_address
Time taken: 25.357 seconds
hive> quit;
- Remember to quit once the check is done, since by default only one client can query Hive at a time
- Fix up HadoopDB.xml
- Because batch_setup.py was only designed for importing a single dataset, this part has to be done by hand: merge the table 1 and table 2 entries into a single file, as shown below (a rough merge-script sketch follows the XML)
- <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DBClusterConfiguration xmlns="http://edu.yale.cs.db.hadoop/DBConfigurationSchema">
<Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location="Cluster01">
<Relations id="t1">
<Partitions url="jdbc:postgresql://Cluster01:5432/udb_t1_0" id="0"/>
</Relations>
<Relations id="t2">
<Partitions url="jdbc:postgresql://Cluster01:5432/udb_t2_0" id="0"/>
</Relations>
<Relations id="no_use"/>
</Nodes>
<Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location="Cluster02">
<Relations id="t1">
<Partitions url="jdbc:postgresql://Cluster02:5432/udb_t1_1" id="1"/>
</Relations>
<Relations id="t2">
<Partitions url="jdbc:postgresql://Cluster02:5432/udb_t2_1" id="1"/>
</Relations>
<Relations id="no_use"/>
</Nodes>
<Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location="Cluster03">
<Relations id="t1">
<Partitions url="jdbc:postgresql://Cluster03:5432/udb_t1_2" id="2"/>
</Relations>
<Relations id="t2">
<Partitions url="jdbc:postgresql://Cluster03:5432/udb_t2_2" id="2"/>
</Relations>
<Relations id="no_use"/>
</Nodes>
</DBClusterConfiguration>
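Doing the merge by hand is fine for two tables, but it can also be scripted. Below is a rough Python sketch, not part of batch_setup.py, that folds the <Relations> entries of HadoopDB_t2.xml into HadoopDB_t1.xml by matching the Location attribute of each <Nodes> element; the file names come from the backups made earlier, everything else is an assumption:

import xml.etree.ElementTree as ET

NS = "http://edu.yale.cs.db.hadoop/DBConfigurationSchema"
ET.register_namespace("", NS)  # keep the default namespace when writing the result

t1 = ET.parse("HadoopDB_t1.xml")
t2 = ET.parse("HadoopDB_t2.xml")

# index the t2 <Nodes> elements by machine name
t2_nodes = dict((n.get("Location"), n) for n in t2.getroot().findall("{%s}Nodes" % NS))

for node in t1.getroot().findall("{%s}Nodes" % NS):
    match = t2_nodes.get(node.get("Location"))
    if match is None:
        continue
    for rel in match.findall("{%s}Relations" % NS):
        if rel.get("id") != "no_use":  # skip the placeholder entry
            node.append(rel)

t1.write("HadoopDB.xml", encoding="UTF-8", xml_declaration=True)

The resulting HadoopDB.xml is equivalent to the hand-merged file above (the element order may differ slightly, which does not matter here).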
- $ hadoop fs -rmr HadoopDB.xml
  Deleted hdfs://Cluster01:9000/user/hadoop/HadoopDB.xml
  $ hadoop fs -put HadoopDB.xml HadoopDB.xml
- Test whether the merged HadoopDB.xml works properly
- hive> select name,id from t1;
Total MapReduce jobs = 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201001201134_0011, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0011
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0011
2010-01-20 02:20:58,547 map = 0%, reduce =0%
2010-01-20 02:21:05,622 map = 33%, reduce =0%
2010-01-20 02:21:11,790 map = 67%, reduce =0%
2010-01-20 02:21:12,808 map = 100%, reduce =0%
Ended Job = job_201001201134_0011
OK
C 3
F 6
I 9
B 2
E 5
H 8
A 1
D 4
G 7
Time taken: 29.667 seconds
hive> select address,id from t2;
Total MapReduce jobs = 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201001201134_0012, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0012
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0012
2010-01-20 02:21:25,782 map = 0%, reduce =0%
2010-01-20 02:21:34,920 map = 33%, reduce =0%
2010-01-20 02:21:38,983 map = 67%, reduce =0%
2010-01-20 02:21:41,002 map = 100%, reduce =0%
Ended Job = job_201001201134_0012
OK
I_address 9
H_address 8
E_address 5
D_address 4
C_address 3
G_address 7
F_address 6
B_address 2
A_address 1
Time taken: 18.932 seconds
hive>
With the preparation above complete, it is time to actually test the Join
hive> select t1.id, t1.name, t2.address from t1 join t2 on ( t1.id = t2.id );
Total MapReduce jobs = 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0013, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0013
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0013
2010-01-20 02:23:24,293 map = 0%, reduce =0%
2010-01-20 02:23:36,400 map = 17%, reduce =0%
2010-01-20 02:23:46,544 map = 33%, reduce =0%
2010-01-20 02:23:52,248 map = 50%, reduce =0%
2010-01-20 02:23:55,274 map = 67%, reduce =0%
2010-01-20 02:23:57,291 map = 83%, reduce =0%
2010-01-20 02:23:58,308 map = 100%, reduce =0%
2010-01-20 02:24:03,360 map = 100%, reduce =28%
2010-01-20 02:24:05,381 map = 100%, reduce =100%
Ended Job = job_201001201134_0013
OK
1 A A_address
2 B B_address
3 C C_address
4 D D_address
5 E E_address
6 F F_address
7 G G_address
8 H H_address
9 I I_address
Time taken: 44.725 seconds
Verification (querying Cluster01's local PostgreSQL databases directly)
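The prompts below come from psql sessions opened against the two local databases on Cluster01; as an example of how to get there (assuming the default PostgreSQL port 5432 and the hadoop account listed in HadoopDB.xml):
$ psql -h Cluster01 -p 5432 -U hadoop udb_t1_0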
udb_t1_0=# select * from t1;
id | name
----+------
1 | A
4 | D
7 | G
(3 rows)
udb_t2_0=# select * from t2;
id | address
----+-----------
7 | G_address
6 | F_address
2 | B_address
1 | A_address
(4 rows)
The verification result is exactly what we expect: Cluster01 only holds rows 1, 4, 7 of table 1 and rows 7, 6, 2, 1 of table 2, and on top of that the two tables live in different local databases, udb_t1_0 and udb_t2_0. So even though the data is not gathered on any single machine, or even in a single database, HadoopDB still handles the Join just fine: each node's PostgreSQL only serves scans over its own partitions, and the join itself happens up in the MapReduce layer. Don't forget, this is simply the architecture Hive was designed with in the first place.
Below are a couple of extra tests: joining t1 against a sub-query, and counting over a filtered sub-query (the t2.id > 3 condition keeps ids 4 through 9, so the expected count is 6).
hive> select * from t1 join ( select t1.id , t1.name , t2.address
from t1 join t2 on ( t1.id = t2.id ) ) r1 on ( t1.id = r1.id ) ;
Total MapReduce jobs = 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0015, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0015
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0015
2010-01-20 02:50:55,389 map = 0%, reduce =0%
2010-01-20 02:51:07,511 map = 17%, reduce =0%
2010-01-20 02:51:11,560 map = 33%, reduce =0%
2010-01-20 02:51:18,632 map = 50%, reduce =0%
2010-01-20 02:51:21,685 map = 67%, reduce =0%
2010-01-20 02:51:23,724 map = 83%, reduce =0%
2010-01-20 02:51:25,750 map = 100%, reduce =0%
2010-01-20 02:51:30,794 map = 100%, reduce =17%
2010-01-20 02:51:35,856 map = 100%, reduce =100%
Ended Job = job_201001201134_0015
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0016, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0016
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0016
2010-01-20 02:51:41,127 map = 0%, reduce =0%
2010-01-20 02:51:51,224 map = 25%, reduce =0%
2010-01-20 02:52:04,345 map = 50%, reduce =0%
2010-01-20 02:52:08,409 map = 100%, reduce =0%
2010-01-20 02:52:09,441 map = 100%, reduce =8%
2010-01-20 02:52:21,548 map = 100%, reduce =100%
Ended Job = job_201001201134_0016
OK
1 A 1 A A_address
2 B 2 B B_address
3 C 3 C C_address
4 D 4 D D_address
5 E 5 E E_address
6 F 6 F F_address
7 G 7 G G_address
8 H 8 H H_address
9 I 9 I I_address
Time taken: 90.89 seconds
hive>
hive> select count(t1.id) from t1 join ( select t1.id , t1.name ,
t2.address from t1 join t2 on ( t1.id = t2.id ) where t2.id > 3 ) r1
on ( t1.id = r1.id ) ;
Total MapReduce jobs = 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0017, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0017
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0017
2010-01-20 02:54:57,465 map = 0%, reduce =0%
2010-01-20 02:55:06,563 map = 17%, reduce =0%
2010-01-20 02:55:18,722 map = 33%, reduce =0%
2010-01-20 02:55:26,829 map = 50%, reduce =0%
2010-01-20 02:55:28,860 map = 67%, reduce =0%
2010-01-20 02:55:29,878 map = 83%, reduce =0%
2010-01-20 02:55:30,908 map = 100%, reduce =0%
2010-01-20 02:55:34,947 map = 100%, reduce =11%
2010-01-20 02:55:45,039 map = 100%, reduce =100%
Ended Job = job_201001201134_0017
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0018, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0018
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0018
2010-01-20 02:55:49,246 map = 0%, reduce =0%
2010-01-20 02:55:58,324 map = 25%, reduce =0%
2010-01-20 02:56:09,456 map = 50%, reduce =0%
2010-01-20 02:56:10,481 map = 75%, reduce =0%
2010-01-20 02:56:12,516 map = 100%, reduce =0%
2010-01-20 02:56:19,594 map = 100%, reduce =8%
2010-01-20 02:56:28,678 map = 100%, reduce =100%
Ended Job = job_201001201134_0018
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0019, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0019
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0019
2010-01-20 02:56:34,726 map = 0%, reduce =0%
2010-01-20 02:56:45,829 map = 100%, reduce =0%
2010-01-20 02:56:58,937 map = 100%, reduce =100%
Ended Job = job_201001201134_0019
OK
6
Time taken: 125.945 seconds
hive>