I didn't expect to be installing Hadoop again so soon! My previous experience wasn't really a from-scratch install: someone else had built the whole environment, and all I did there was move Hadoop from 0.18 to 0.20. This time I need to test the HBase REST interface, so I'm building a fresh environment first, and a Single Node setup is all that's needed. Together with my recent experience installing Ubuntu, maybe this barely counts as starting from scratch?!
There are some very good articles online; my main references are listed below. They target version 0.20.0, however, and my tests also ran into an Ubuntu 9.04 quirk, so a few things had to change. Fortunately, after spending some time on it I got everything working. I recommend reading those references first; this post is mostly a note for my own memory.
- Running Hadoop On Ubuntu Linux (Single-Node Cluster)
- Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
- Hadoop 0.20 Documentation - Pseudo-Distributed Operation
Environment
# uname -a
Linux changyy-desktop 2.6.28-15-generic #52-Ubuntu SMP Wed Sep 9 10:49:34 UTC 2009 i686 GNU/Linux
- Install openssh-server
- # sudo apt-get install openssh-server
- I'm on an Ubuntu 7.04 Desktop that was upgraded to 9.04. The Desktop edition needs openssh-server installed before it accepts SSH logins; the Server edition should ship with it by default. You can test with ssh localhost.
- Install the Java environment
- # sudo apt-get install sun-java6-jdk
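- A quick sanity check that the JDK is installed and picked up on the PATH (the exact version string will differ; this is only an illustration):
- # java -version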
- Create and configure the account Hadoop will run under
- # sudo addgroup hadoop
- # sudo adduser --ingroup hadoop hadoop
- Set up passwordless SSH login for the hadoop account
- # su - hadoop
- Skip this step if you are already the hadoop user
- # ssh-keygen -t rsa -P ''
- # cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
- Test the connection as the hadoop user
- # ssh localhost
- You should be logged in without being asked for a password; if anything goes wrong, the articles above cover troubleshooting. A common culprit is file permissions; there is a quick check right after this list.
- # su - hadoop
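- If passwordless login still prompts for a password, overly open permissions on the key files are the usual cause; tightening them (same paths as above) is a safe first thing to try:
- # chmod 700 $HOME/.ssh
- # chmod 600 $HOME/.ssh/authorized_keys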
- Install Hadoop 0.20.1
- Download locations
- http://hadoop.apache.org/
- http://hadoop.apache.org/common/releases.html
- http://www.apache.org/dyn/closer.cgi/hadoop/core/
- Following the article above, install under /usr/local
- # cd /usr/local
- # sudo wget http://apache.stu.edu.tw/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
- # sudo tar -xvf hadoop-0.20.1.tar.gz
- # sudo chown -R hadoop:hadoop hadoop-0.20.1
- # sudo ln -s hadoop-0.20.1/ hadoop
- I use a symbolic link so that switching Hadoop versions later is easy (see the sketch at the end of this section); if you don't care about that, just use sudo mv hadoop-0.20.1 hadoop instead of the last step.
- Remove the downloaded archive
- # sudo rm -rf hadoop-0.20.1.tar.gz
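- For example, moving to a newer release later (the 0.20.2 directory below is purely hypothetical) would only mean unpacking it alongside and repointing the link:
- # cd /usr/local
- # sudo rm hadoop
- # sudo ln -s hadoop-0.20.2/ hadoop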
- Configure Hadoop 0.20.1
- Switch to the hadoop user first; skip this step if you already are
- # su - hadoop
- Set the environment variables
- # vim /usr/local/hadoop/conf/hadoop-env.sh
- Java settings
- export JAVA_HOME=/usr/lib/jvm/java-6-sun
- Disable IPv6
- HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
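- Putting the two together, the relevant lines in conf/hadoop-env.sh end up looking roughly like this (the export on HADOOP_OPTS follows the commented template already in that file):
- export JAVA_HOME=/usr/lib/jvm/java-6-sun
- export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true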
- Configure the data location, ports, and other settings
- I plan to keep the data under /home/hadoop/db; create that directory first and make sure it is owned by the hadoop account (the exact commands appear right after the XML below).
- # vim /usr/local/hadoop/conf/core-site.xml
- <property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/db/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
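- Creating the data directory mentioned above is straightforward; the chown only matters if it was created by an account other than hadoop (the path matches the hadoop.tmp.dir parent above):
- # sudo mkdir -p /home/hadoop/db
- # sudo chown -R hadoop:hadoop /home/hadoop/db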
- # vim /usr/local/hadoop/conf/mapred-site.xml
- <property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
- Whatever you do, don't take the shortcut of putting everything into core-site.xml. That's exactly what I did, and the default services wouldn't start; the jobtracker and tasktracker logs kept repeating the messages below.
- ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.RuntimeException: Not a host:port pair: local
- FATAL org.apache.hadoop.mapred.JobTracker: java.lang.RuntimeException: Not a host:port pair: local
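- For reference, the articles above keep dfs.replication in its own file rather than in core-site.xml as I did; both work here, since Hadoop merges the settings. The conventional split would look roughly like this (dropping that property from core-site.xml accordingly); I did not test it on this install:
- # vim /usr/local/hadoop/conf/hdfs-site.xml
- <property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.</description>
</property>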
- Format the HDFS storage
- # /usr/local/hadoop/bin/hadoop namenode -format
- Start Hadoop
- # /usr/local/hadoop/bin/start-all.sh
- Stop Hadoop
- # /usr/local/hadoop/bin/stop-all.sh
- Switch to the hadoop user first; skip this step if you already are
- Once Hadoop is up, the following commands show its status
- # jps
- # netstat -plten | grep java
- # /usr/local/hadoop/bin/hadoop dfsadmin -report
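- On a healthy pseudo-distributed node, jps should list the five Hadoop daemons, roughly like the output below (the process IDs are placeholders):
12111 NameNode
12222 DataNode
12333 SecondaryNameNode
12444 JobTracker
12555 TaskTracker
12666 Jps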
The parts marked in red above are where my setup differs from the reference articles. As for disabling IPv6 at the kernel level, on Ubuntu 9.04 that apparently requires recompiling the kernel, so after testing I settled on the Hadoop configuration option instead; more details can be found in this article.
Testing Hadoop (don't forget to do this as the hadoop user, with the Hadoop framework running)
- Create a directory
- # /usr/local/hadoop/bin/hadoop dfs -mkdir input
- List it
- # /usr/local/hadoop/bin/hadoop dfs -ls
- Upload a test file and take a look
- # /usr/local/hadoop/bin/hadoop dfs -put /usr/local/hadoop/conf/core-site.xml input/
- # /usr/local/hadoop/bin/hadoop dfs -lsr
- Run the example and check the result (the examples jar ships in the top of the Hadoop directory)
- # /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-0.20.1-examples.jar wordcount input output
- # /usr/local/hadoop/bin/hadoop dfs -cat output/*
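- To pull the result out of HDFS onto the local filesystem, -getmerge is handy (the local target path here is only an example):
- # /usr/local/hadoop/bin/hadoop dfs -getmerge output /tmp/wordcount-output.txt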
And that's a wrap!