Wednesday, January 20, 2010

HadoopDB Join Testing on 3-Node Cluster

Before I understood Hive, I kept assuming that HadoopDB's use of databases underneath would impose limits. For example, in a cluster made of three machines A, B, and C, with a database built on each machine plus two tables T1 and T2, would a join query issued through Hive fail because the data does not sit in the same database? Or would two tables be impossible to join simply because they live in different databases? If you have already read through Hive's design and architecture, you know that when using Hive on Hadoop there is no need to worry about any of this.


The experiment itself is simple: it just tests whether HadoopDB can really perform joins, which is the question my supervisor originally handed me. At the time I did not yet understand Hive, so I had some doubts myself. Once I understood Hive, it became clear that these problems are solved by Hive, so naturally HadoopDB does not run into them. Still, I spent a little time finishing the experiment anyway!


Experiment design:



  • Use a 3-node cluster: Cluster01, Cluster02, and Cluster03

  • Prepare the test data d1 and d2, plus the column definitions for each table (see the DDL sketch after this list)

    • d1_field_definition (t1, table 1)

      • ID int,
        NAME varchar(300)



    • d1

      • 1       A
        2       B
        3       C
        4       D
        5       E
        6       F
        7       G
        8       H
        9       I



    • d2_field_definition (t2, table 2)


      • ID int,
        ADDRESS varchar(300)



    • d2

      • 9       I_address
        8       H_address
        7       G_address
        6       F_address
        5       E_address
        4       D_address
        3       C_address
        2       B_address
        1       A_address





  • A helper script that makes setting up HadoopDB convenient (batch_setup.py)
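
For reference, the two field-definition files above presumably translate into per-partition PostgreSQL tables along these lines (my own sketch; the actual DDL is generated by batch_setup.py on each node):

-- Assumed DDL only; batch_setup.py creates one copy of each table
-- in every partition database (udb_t1_*, udb_t2_*).
CREATE TABLE t1 (
    id   int,            -- join key
    name varchar(300)
);

CREATE TABLE t2 (
    id      int,         -- join key, matches t1.id
    address varchar(300)
);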




Experiment procedure and results



  1. All of the following commands are run from the home directory on Cluster01

  2. Split the d1 data into 3 parts, load them into each machine, and create the Hive table

    • $ python batch_setup.py -H d1 -t t1 -i t1 -g

    • Back up HadoopDB.xml; we will need it shortly


      • $ cp HadoopDB.xml HadoopDB_t1.xml



    • Verify with Hive

      • $ ./SMS_dist/bin/hive
        hive> show tables;
        OK
        t1
        Time taken: 8.772 seconds
        hive> select id,name from t1;
        Total MapReduce jobs = 1
        Number of reduce tasks is set to 0 since there's no reduce operator
        Starting Job = job_201001201134_0008, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0008
        Kill Command = /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0008
        2010-01-20 02:12:42,545 map = 0%,  reduce =0%
        2010-01-20 02:12:50,659 map = 33%,  reduce =0%
        2010-01-20 02:12:56,828 map = 67%,  reduce =0%
        2010-01-20 02:12:57,855 map = 100%,  reduce =0%
        Ended Job = job_201001201134_0008
        OK
        3       C
        6       F
        9       I
        2       B
        5       E
        8       H
        1       A
        4       D
        7       G
        Time taken: 22.447 seconds
        hive> quit;

      • Remember to quit after verifying: by default Hive accepts only one client session (its default embedded metastore allows a single connection), and staying connected will interfere with batch_setup.py's setup





  3. Split the d2 data into 3 parts, load them into each machine, and create the Hive table

    • $ python batch_setup.py -H d2 -t t2 -i t2 -g

    • Back up HadoopDB.xml; we will need it shortly

      • $ cp HadoopDB.xml HadoopDB_t2.xml



    • Verify with Hive

      • $ ./SMS_dist/bin/hive
        hive> show tables;
        OK
        t1
        t2
        Time taken: 9.313 seconds
        hive> select id,address from t2;
        Total MapReduce jobs = 1
        Number of reduce tasks is set to 0 since there's no reduce operator
        Starting Job = job_201001201134_0010, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0010
        Kill Command = /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0010
        2010-01-20 02:16:45,971 map = 0%,  reduce =0%
        2010-01-20 02:16:57,115 map = 33%,  reduce =0%
        2010-01-20 02:17:03,240 map = 67%,  reduce =0%
        2010-01-20 02:17:04,283 map = 100%,  reduce =0%
        Ended Job = job_201001201134_0010
        OK
        9       I_address
        8       H_address
        5       E_address
        4       D_address
        3       C_address
        7       G_address
        6       F_address
        2       B_address
        1       A_address
        Time taken: 25.357 seconds
        hive> quit;

      • Remember to quit after verifying; again, Hive accepts only one querying client by default





  4. Merge HadoopDB.xml


    • Because batch_setup.py is only designed to load a single dataset at a time, this step has to be done by hand: combine the table 1 and table 2 entries (from the backups HadoopDB_t1.xml and HadoopDB_t2.xml) into one file

    • <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <DBClusterConfiguration xmlns="http://edu.yale.cs.db.hadoop/DBConfigurationSchema">
          <Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location="Cluster01">
              <Relations id="t1">
                  <Partitions url="jdbc:postgresql://Cluster01:5432/udb_t1_0" id="0"/>
              </Relations>
              <Relations id="t2">
                  <Partitions url="jdbc:postgresql://Cluster01:5432/udb_t2_0" id="0"/>
              </Relations>
              <Relations id="no_use"/>
          </Nodes>
          <Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location="Cluster02">
              <Relations id="t1">
                  <Partitions url="jdbc:postgresql://Cluster02:5432/udb_t1_1" id="1"/>
              </Relations>
              <Relations id="t2">
                  <Partitions url="jdbc:postgresql://Cluster02:5432/udb_t2_1" id="1"/>
              </Relations>
              <Relations id="no_use"/>
          </Nodes>
          <Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location="Cluster03">
              <Relations id="t1">
                  <Partitions url="jdbc:postgresql://Cluster03:5432/udb_t1_2" id="2"/>
              </Relations>
              <Relations id="t2">
                  <Partitions url="jdbc:postgresql://Cluster03:5432/udb_t2_2" id="2"/>
              </Relations>
              <Relations id="no_use"/>
          </Nodes>
      </DBClusterConfiguration>

    • $ hadoop fs -rmr HadoopDB.xml
      Deleted hdfs://Cluster01:9000/user/hadoop/HadoopDB.xml
      $ hadoop fs -put HadoopDB.xml HadoopDB.xml 

    • Test whether the merged HadoopDB.xml works correctly

      • hive> select name,id from t1;
        Total MapReduce jobs = 1
        Number of reduce tasks is set to 0 since there's no reduce operator
        Starting Job = job_201001201134_0011, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0011
        Kill Command = /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0011
        2010-01-20 02:20:58,547 map = 0%,  reduce =0%
        2010-01-20 02:21:05,622 map = 33%,  reduce =0%
        2010-01-20 02:21:11,790 map = 67%,  reduce =0%
        2010-01-20 02:21:12,808 map = 100%,  reduce =0%
        Ended Job = job_201001201134_0011
        OK
        C       3
        F       6
        I       9
        B       2
        E       5
        H       8
        A       1
        D       4
        G       7
        Time taken: 29.667 seconds
        hive> select address,id from t2;
        Total MapReduce jobs = 1
        Number of reduce tasks is set to 0 since there's no reduce operator
        Starting Job = job_201001201134_0012, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0012
        Kill Command = /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0012
        2010-01-20 02:21:25,782 map = 0%,  reduce =0%
        2010-01-20 02:21:34,920 map = 33%,  reduce =0%
        2010-01-20 02:21:38,983 map = 67%,  reduce =0%
        2010-01-20 02:21:41,002 map = 100%,  reduce =0%
        Ended Job = job_201001201134_0012
        OK
        I_address       9
        H_address       8
        E_address       5
        D_address       4
        C_address       3
        G_address       7
        F_address       6
        B_address       2
        A_address       1
        Time taken: 18.932 seconds
        hive>






With all of the preparation above done, it is time to test the join for real.


hive> select t1.id, t1.name, t2.address from t1 join t2 on ( t1.id = t2.id );
Total MapReduce jobs = 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0013, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0013
Kill Command = /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0013
2010-01-20 02:23:24,293 map = 0%,  reduce =0%
2010-01-20 02:23:36,400 map = 17%,  reduce =0%
2010-01-20 02:23:46,544 map = 33%,  reduce =0%
2010-01-20 02:23:52,248 map = 50%,  reduce =0%
2010-01-20 02:23:55,274 map = 67%,  reduce =0%
2010-01-20 02:23:57,291 map = 83%,  reduce =0%
2010-01-20 02:23:58,308 map = 100%,  reduce =0%
2010-01-20 02:24:03,360 map = 100%,  reduce =28%
2010-01-20 02:24:05,381 map = 100%,  reduce =100%
Ended Job = job_201001201134_0013
OK
1       A       A_address
2       B       B_address
3       C       C_address
4       D       D_address
5       E       E_address
6       F       F_address
7       G       G_address
8       H       H_address
9       I       I_address
Time taken: 44.725 seconds


Verification


udb_t1_0=# select * from t1;
 id | name
----+------
  1 | A
  4 | D
  7 | G
(3 rows)


udb_t2_0=# select * from t2;
 id |  address
----+-----------
  7 | G_address
  6 | F_address
  2 | B_address
  1 | A_address
(4 rows)


The verification checks out. Cluster01 holds only rows 1, 4, and 7 of table 1 and rows 7, 6, 2, and 1 of table 2, and on top of that the two tables live in different databases, udb_t1_0 and udb_t2_0. So even though the data is not concentrated on any single machine or in any single database, HadoopDB still handles the join correctly. And don't forget, this is the architecture Hive was designed with from the start.
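
For completeness, the same spot-check could be repeated on the other two nodes. I did not capture those sessions, but judging from the map-task output order of select id,name from t1 above, the remaining t1 partitions should hold ids 3, 6, 9 and 2, 5, 8. A hypothetical session on Cluster02 would look like this (partition database names taken from the merged HadoopDB.xml):

udb_t1_1=# select * from t1;   -- expect one of the remaining thirds of d1
udb_t2_1=# select * from t2;   -- expect the d2 rows assigned to partition 1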


Below are some additional tests.


hive> select * from t1 join ( select t1.id , t1.name , t2.address from t1 join t2 on ( t1.id = t2.id ) ) r1 on ( t1.id = r1.id ) ;
Total MapReduce jobs = 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0015, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0015
Kill Command = /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0015
2010-01-20 02:50:55,389 map = 0%,  reduce =0%
2010-01-20 02:51:07,511 map = 17%,  reduce =0%
2010-01-20 02:51:11,560 map = 33%,  reduce =0%
2010-01-20 02:51:18,632 map = 50%,  reduce =0%
2010-01-20 02:51:21,685 map = 67%,  reduce =0%
2010-01-20 02:51:23,724 map = 83%,  reduce =0%
2010-01-20 02:51:25,750 map = 100%,  reduce =0%
2010-01-20 02:51:30,794 map = 100%,  reduce =17%
2010-01-20 02:51:35,856 map = 100%,  reduce =100%
Ended Job = job_201001201134_0015
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0016, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0016
Kill Command = /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0016
2010-01-20 02:51:41,127 map = 0%,  reduce =0%
2010-01-20 02:51:51,224 map = 25%,  reduce =0%
2010-01-20 02:52:04,345 map = 50%,  reduce =0%
2010-01-20 02:52:08,409 map = 100%,  reduce =0%
2010-01-20 02:52:09,441 map = 100%,  reduce =8%
2010-01-20 02:52:21,548 map = 100%,  reduce =100%
Ended Job = job_201001201134_0016
OK
1       A       1       A       A_address
2       B       2       B       B_address
3       C       3       C       C_address
4       D       4       D       D_address
5       E       5       E       E_address
6       F       6       F       F_address
7       G       7       G       G_address
8       H       8       H       H_address
9       I       9       I       I_address
Time taken: 90.89 seconds
hive>
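
As an aside, the nested query above is logically the same thing as a flat three-way self-join; this rewrite (mine, not part of the original test run) should return the identical nine rows:

hive> select a.id, a.name, b.id, b.name, c.address from t1 a join t1 b on ( a.id = b.id ) join t2 c on ( b.id = c.id );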


hive> select count(t1.id) from t1 join ( select t1.id , t1.name , t2.address from t1 join t2 on ( t1.id = t2.id ) where t2.id > 3 ) r1 on ( t1.id = r1.id ) ;
Total MapReduce jobs = 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0017, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0017
Kill Command = /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0017
2010-01-20 02:54:57,465 map = 0%,  reduce =0%
2010-01-20 02:55:06,563 map = 17%,  reduce =0%
2010-01-20 02:55:18,722 map = 33%,  reduce =0%
2010-01-20 02:55:26,829 map = 50%,  reduce =0%
2010-01-20 02:55:28,860 map = 67%,  reduce =0%
2010-01-20 02:55:29,878 map = 83%,  reduce =0%
2010-01-20 02:55:30,908 map = 100%,  reduce =0%
2010-01-20 02:55:34,947 map = 100%,  reduce =11%
2010-01-20 02:55:45,039 map = 100%,  reduce =100%
Ended Job = job_201001201134_0017
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0018, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0018
Kill Command = /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0018
2010-01-20 02:55:49,246 map = 0%,  reduce =0%
2010-01-20 02:55:58,324 map = 25%,  reduce =0%
2010-01-20 02:56:09,456 map = 50%,  reduce =0%
2010-01-20 02:56:10,481 map = 75%,  reduce =0%
2010-01-20 02:56:12,516 map = 100%,  reduce =0%
2010-01-20 02:56:19,594 map = 100%,  reduce =8%
2010-01-20 02:56:28,678 map = 100%,  reduce =100%
Ended Job = job_201001201134_0018
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0019, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0019
Kill Command = /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0019
2010-01-20 02:56:34,726 map = 0%,  reduce =0%
2010-01-20 02:56:45,829 map = 100%,  reduce =0%
2010-01-20 02:56:58,937 map = 100%,  reduce =100%
Ended Job = job_201001201134_0019
OK
6
Time taken: 125.945 seconds
hive>
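
The answer 6 is what we should expect: the inner subquery keeps only the joined rows with t2.id > 3, that is ids 4 through 9, and each of those matches exactly one row of t1. The same count can be obtained without the nesting (my simplification, again not from the original run):

hive> select count(t1.id) from t1 join t2 on ( t1.id = t2.id ) where t2.id > 3;

This should also print 6.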

