It looks like some very dedicated netizens followed the Google Street View camera car and flashed a victory sign at it! It also shows that Google Maps keeps reaching deeper into Taiwan: it used to cover Taipei City only, and now many other places are gradually being filled in.
The possible applications will only get richer, such as mapping out street-food stalls, or simply reading a shop sign on Street View and dialing the phone number on it. If 3G or WiMAX data rates become more affordable, life will get more and more convenient!
To stay out of reach of those powerful search engines, I've quietly renamed things here. A "growth camp" is something I attended once in my first year of high school, also during winter break if I remember right, which echoes this one nicely. That one was held so classmates could get to know each other; even with boys and girls in separate classes we got the chance to dance with the girls, and if memory serves the song was 呢喃.
I got a lot out of this growth camp. Even though I still didn't meet a very wide range of people, through the activities, and even the idle chat over meals, I found many areas where I still have room to develop. I remember several stories told along the same line: to succeed you have to spend time in every unit. But once you have already earned good results or a solid record in one unit, how many people are willing to set that aside and move down into another one?
By coincidence, at one dinner gathering I met someone doing industry business-development work. What interested me was what kind of background suits the job; unlike the clearly named departments in today's universities, there is no explicit qualification for it, so how do you pick the person to take it on? The question never got answered, but it doesn't seem to need answering right away; like sales at any big company, isn't it mostly a matter of being willing to do it?
Also, in the creativity contest on the night before the camp ended, my colleague and I happened to be the only two on the team doing technical research, and I realized my experience is too narrow: I can't broaden my view as easily as the business people do. Whenever I think something through, I insist on controlling feasibility, so the architectures I propose are always fully implementable, even turning into integrated services. It reminded me that when I fill in project plans, the parts I write down are usually things I already know how to do! How am I supposed to leap any further that way? In that mode of thinking it's easy to just go quiet and say nothing more, because nothing more comes to mind; or maybe it's the technical background that boxes me in. I clearly need to keep more of a childlike heart, look around more, and dream more. The amusing part: maybe because the group skewed male, or because too many had already finished military service, a pile of the contest topics were adults-only and kept circling back to catching cheating spouses. Awkward.
Speaking of which, I also noticed how naturally the men and women in our group got along, perhaps because the others already knew each other, or were older or married. When one of the guys brought up a risqué topic, a female teammate would shoot straight back with "That's crude!" Having spent my school years almost entirely in single-sex classes, I was suddenly struck by how natural that kind of interaction was, and I rather enjoyed the atmosphere; it felt genuinely rare. Besides that, I had planned to room with my colleague so we could chat at night, but we unexpectedly got split up. When I went to ask about switching rooms, the staff even misunderstood and thought we wanted to sleep in the same bed, which was a bit much. In the end his phone was switched off, so I just enjoyed the unplanned arrangement.
The future? My goals are still not clear. Maybe the unit I'm in will simply keep going like this; I hope I can adapt to that mode faster. We are pioneers, and even more so a suicide squad.
Before I understood Hive, I kept assuming that using databases underneath HadoopDB would impose limits. For example, in a cluster of three machines A, B, and C, where each node has its own database plus the two tables T1 and T2: when I run a join query through Hive, will rows be missed because the data is not in the same database? Or can tables not be joined at all if they live in different databases? If you have already read about Hive's design, you know that with Hive on Hadoop there is no need to worry about any of this.
This experiment is simple: just test whether HadoopDB really supports joins, which was the question my boss threw at me at the start. At the time I didn't understand Hive yet, so I had some doubts. After learning Hive, I realized these issues are handled by Hive itself, so of course HadoopDB does not run into them. Still, I spent a little time finishing the experiment!
Experiment design:
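Roughly: two small tables, t1 (id, name) and t2 (id, address), have their rows spread across the three nodes' local PostgreSQL databases (udb_t1_*, udb_t2_*) and are exposed to Hive as external tables through HadoopDB's SMS connector. As a minimal sketch, assuming those table and column names (the LOCATION paths are placeholders), the Hive-side definitions look something like:
hive> create external table t1 ( id int, name string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS INPUTFORMAT 'edu.yale.cs.hadoopdb.sms.connector.SMSInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION '/db/t1';
hive> create external table t2 ( id int, address string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS INPUTFORMAT 'edu.yale.cs.hadoopdb.sms.connector.SMSInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION '/db/t2';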
Experiment procedure and results
With the preparation above done, it's time to actually test the join.
hive> select t1.id, t1.name, t2.address from t1 join t2 on ( t1.id = t2.id );
Total MapReduce jobs = 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0013, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0013
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0013
2010-01-20 02:23:24,293 map = 0%, reduce =0%
2010-01-20 02:23:36,400 map = 17%, reduce =0%
2010-01-20 02:23:46,544 map = 33%, reduce =0%
2010-01-20 02:23:52,248 map = 50%, reduce =0%
2010-01-20 02:23:55,274 map = 67%, reduce =0%
2010-01-20 02:23:57,291 map = 83%, reduce =0%
2010-01-20 02:23:58,308 map = 100%, reduce =0%
2010-01-20 02:24:03,360 map = 100%, reduce =28%
2010-01-20 02:24:05,381 map = 100%, reduce =100%
Ended Job = job_201001201134_0013
OK
1 A A_address
2 B B_address
3 C C_address
4 D D_address
5 E E_address
6 F F_address
7 G G_address
8 H H_address
9 I I_address
Time taken: 44.725 seconds
Verification
udb_t1_0=# select * from t1;
id | name
----+------
1 | A
4 | D
7 | G
(3 rows)
udb_t2_0=# select * from t2;
id | address
----+-----------
7 | G_address
6 | F_address
2 | B_address
1 | A_address
(4 rows)
The verification checks out: Cluster01 holds only rows 1, 4, 7 of table t1 and rows 7, 6, 2, 1 of table t2, and on top of that the two tables live in different databases, udb_t1_0 and udb_t2_0. So even though the data is not concentrated on any single machine or database, HadoopDB still handles the join just fine; and remember, this is simply how Hive was designed to work in the first place.
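To reproduce the check yourself, you can log in to each node and query its local PostgreSQL databases directly; a sketch, assuming the udb_* database names above and that the hadoop account can run psql locally:
hadoop@Cluster01:~$ psql -d udb_t1_0 -c 'select * from t1;'
hadoop@Cluster01:~$ psql -d udb_t2_0 -c 'select * from t2;'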
Here are some other tests.
hive> select * from t1 join ( select t1.id , t1.name , t2.address
from t1 join t2 on ( t1.id = t2.id ) ) r1 on ( t1.id = r1.id ) ;
Total MapReduce jobs = 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0015, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0015
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0015
2010-01-20 02:50:55,389 map = 0%, reduce =0%
2010-01-20 02:51:07,511 map = 17%, reduce =0%
2010-01-20 02:51:11,560 map = 33%, reduce =0%
2010-01-20 02:51:18,632 map = 50%, reduce =0%
2010-01-20 02:51:21,685 map = 67%, reduce =0%
2010-01-20 02:51:23,724 map = 83%, reduce =0%
2010-01-20 02:51:25,750 map = 100%, reduce =0%
2010-01-20 02:51:30,794 map = 100%, reduce =17%
2010-01-20 02:51:35,856 map = 100%, reduce =100%
Ended Job = job_201001201134_0015
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0016, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0016
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0016
2010-01-20 02:51:41,127 map = 0%, reduce =0%
2010-01-20 02:51:51,224 map = 25%, reduce =0%
2010-01-20 02:52:04,345 map = 50%, reduce =0%
2010-01-20 02:52:08,409 map = 100%, reduce =0%
2010-01-20 02:52:09,441 map = 100%, reduce =8%
2010-01-20 02:52:21,548 map = 100%, reduce =100%
Ended Job = job_201001201134_0016
OK
1 A 1 A A_address
2 B 2 B B_address
3 C 3 C C_address
4 D 4 D D_address
5 E 5 E E_address
6 F 6 F F_address
7 G 7 G G_address
8 H 8 H H_address
9 I 9 I I_address
Time taken: 90.89 seconds
hive>
hive> select count(t1.id) from t1 join ( select t1.id , t1.name ,
t2.address from t1 join t2 on ( t1.id = t2.id ) where t2.id > 3 ) r1
on ( t1.id = r1.id ) ;
Total MapReduce jobs = 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0017, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0017
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0017
2010-01-20 02:54:57,465 map = 0%, reduce =0%
2010-01-20 02:55:06,563 map = 17%, reduce =0%
2010-01-20 02:55:18,722 map = 33%, reduce =0%
2010-01-20 02:55:26,829 map = 50%, reduce =0%
2010-01-20 02:55:28,860 map = 67%, reduce =0%
2010-01-20 02:55:29,878 map = 83%, reduce =0%
2010-01-20 02:55:30,908 map = 100%, reduce =0%
2010-01-20 02:55:34,947 map = 100%, reduce =11%
2010-01-20 02:55:45,039 map = 100%, reduce =100%
Ended Job = job_201001201134_0017
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0018, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0018
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0018
2010-01-20 02:55:49,246 map = 0%, reduce =0%
2010-01-20 02:55:58,324 map = 25%, reduce =0%
2010-01-20 02:56:09,456 map = 50%, reduce =0%
2010-01-20 02:56:10,481 map = 75%, reduce =0%
2010-01-20 02:56:12,516 map = 100%, reduce =0%
2010-01-20 02:56:19,594 map = 100%, reduce =8%
2010-01-20 02:56:28,678 map = 100%, reduce =100%
Ended Job = job_201001201134_0018
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201001201134_0019, Tracking URL = http://Cluster01:50030/jobdetails.jsp?jobid=job_201001201134_0019
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=Cluster01:9001 -kill job_201001201134_0019
2010-01-20 02:56:34,726 map = 0%, reduce =0%
2010-01-20 02:56:45,829 map = 100%, reduce =0%
2010-01-20 02:56:58,937 map = 100%, reduce =100%
Ended Job = job_201001201134_0019
OK
6
Time taken: 125.945 seconds
hive>
Lately I've started receiving a pile of junk mail, and one kind is especially annoying: Yahoo Group invitations! Spammers just create a new Yahoo Group, add everyone they want to reach, and from then on anything they send to the group goes out like a mailing-list message, which serves their advertising purpose perfectly.
The infuriating part is that by default Yahoo Groups treats whoever gets added as having joined automatically, and to opt out you have to reply yourself to cancel! So I still end up handling it by hand. Today I couldn't resist writing to Yahoo about it, and was glad to find someone had already ranted at them, ha.
The current default for these Yahoo Group invitations leaves me with the same impression as Facebook's handling of privacy :P Everything has to be very "open"; on the surface they call it a trend, but underneath the idea is surely to make "legitimate" use of personal data.
I've digressed a bit; the workaround:
My suggestion is simply to add a user-confirmation link to this kind of invitation: you would only formally join the group after clicking the link. People who get spammed could then just delete the mail, while those who genuinely want to join only need one extra click.
I'm far too used to writing PHP; for a while I even used PHP as the scripting language for managing machines, and recently I tried Python, so this time let's give bash a try!
The first problem you hit is how to handle arguments; "Getopt and getopts" has plenty of examples for that, and after that you can browse 鳥哥的 Linux 私房菜 - 第十三章、學習 Shell Scripts #善用判斷式, which gets a lot done. Next, for the string replacement I wanted, sed handles it; see "Bash Shell: Replace a string with another string in all files using sed and perl -pie" for reference.
Finally, here is a sample script for my own future reference:
#!/bin/bash
WORK_DIR=
OPTS_ENABLE="false"
usage_help()
{
echo "Usage> $0 -i \"Path\" ..."
echo " -i \"/tmp\" # path for write a config file"
}
args=`getopt i:e $*`
if test $? != 0
then
usage_help
exit 1
fi
set -- $args
for i do
case "$i" in
-i) shift; WORK_DIR=$1 ;shift;;
-e) shift; OPTS_ENABLE="true" ;shift;;
esac
done
# check install path
if test ! -r $WORK_DIR || test ! -x $WORK_DIR || test ! -w $WORK_DIR ; then
echo "Please Check WorkDir: [$WORK_DIR]"
exit 1
fi
FILE_CONFIG=$WORK_DIR/config
# backup & replace
if test -r $FILE_CONFIG ; then
cp $FILE_CONFIG $FILE_CONFIG.bak.`date +%Y%m%d%H%M%S`
sed -e "s/KEY/$OPTS_ENABLE/g" $FILE_CONFIG > $FILE_CONFIG
fi
One more thing: if a bash variable holds a path, e.g. x=/path/bin.exe, then using it as sed -e "s/KEY/$x/g" breaks, because the slashes in $x collide with sed's '/' delimiter. The fix is to first replace every '/' in x with '\/':
x=$(echo $x | sed "s/\//\\\\\//g")
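As an alternative sketch (not what the script above does): sed also accepts other delimiter characters, which sidesteps the escaping entirely as long as the chosen character never appears in the value; config.in and config.out below are placeholder filenames:
sed -e "s|KEY|$x|g" config.in > config.out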
Recently I noticed that Hinedo could no longer play the radio streams. The obvious guess is that Hichannel changed how its radio is served, which broke not only Hinedo but every other player built on the same trick. I kept meaning to look into it but never found the time; today, while idly browsing, I found that someone on PTT's EZsoft board had already posted a fix!
Update Play.vbs by changing line 6 of that file:
Before: base = "http://hichannel.hinet.net/player/radio/index.jsp?radio_id="
After: base = "http://hichannel.hinet.net/player/radio/mediaplay.jsp?radio_id="
Hopefully the official Hinedo site gets updated soon as well!
I always used UltraEdit to edit game save files; it turns out VIM can do this too.
While editing, run :%! xxd to display the file in hex mode; when you're done, run :%! xxd -r to convert it back, and then save.
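A minimal end-to-end sketch (the filename is a placeholder); opening the file with vim -b keeps VIM from mangling line endings in binary files:
$ vim -b savegame.dat
:%! xxd        (show the buffer as a hex dump)
(edit the hex bytes as needed)
:%! xxd -r     (convert the dump back to binary)
:w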
Image source: http://books.com.tw/
Yesterday morning before work I glanced at this book on my desk; it had been sitting there for two weeks and I had only read the preface. I don't know what got into me, but with my shoes already on I grabbed it on my way out anyway. A high-school classmate recommended it at the end of last year, the day we were talking about sports lottery. I really read too little: I still haven't finished 「溫一壺月光下酒」, which he gave me before we graduated, let alone 「性格組合論」, a gift from a good friend who recently got married. I suppose I prefer watching the world around me to holing up with books, even though I'm clearly a homebody geek.
Every trip home involves a small inner struggle. After getting off work at five, I spend about twenty minutes on dinner (or, when I'm worried about holiday traffic, just grab bread and milk), then a 30-minute scooter ride to the train station plus hunting for a parking spot, and finally two to three hours on the train; so from leaving work to resting at home usually takes five hours. I still like going home, though. Waiting at the station yesterday, I started the book from the beginning. The early part was engaging enough, but once it started naming mathematical formulas I began to find it dry. The book targets readers with little or no math, so it never describes those formulas in detail; maybe that is exactly why I couldn't keep reading?
It felt like the first paper I picked for myself in grad school: I had no patience for the long-winded opening, yet when I hit the math I would focus hard on what it meant. Or maybe it's like reading code, where I only want to trace the flow and can't be bothered to glance at the comments! Still, the book didn't put me off that much; I kept reading on the station bench in the cold wind, since I couldn't sleep anyway.
Once on the train, I noticed the woman next to me carrying a book three times thicker than mine. Sneaking a few glances, I saw histograms, standard deviations, quartile deviations, probability, even the lookup tables at the back. What a coincidence! I'm not sure what department she studies in, but I'm convinced her book is subtly related to the one in my hands. The train was warmer, though, so after a few pages she fell asleep, and naturally I wasn't far behind.
While guessing what her future job might be, I remembered a recent meal with a colleague where we chatted about booking a dentist appointment; he was surprised that a nurse has to arrange it by hand. For us, a usable booking system is a quick hack, but what follows right behind it is the wave of unemployment that surplus labor brings. That echoes a simple script I wrote for setting up some environments: the installer types a single command of fewer than 30 characters, hits Enter, and at least 15 commands are generated automatically, in some cases more than 50. I thought about it at the time: some programs really are convenient, but convenient to the point that they shouldn't exist. The simplest example is game bots; once they spread, the game is nothing but numbers talking. Maybe one day a fatal killer program will appear and put a huge pile of engineers out of work!?
Back to the book: between the lines it insists on planning experimental steps carefully. I have since noticed that life really is full of mathematical formulas; both science and social science try to sketch models of events with math. That classmate of mine also wanted me to write some prediction programs, but I turned him down quickly. My reason: in the data mining course I took, the first thing the professor said when introducing the final project was not to do stock-market analysis. I more or less know how a golden cross is used, but frankly it's just a trick played with averages.
Back in real life, my own use of math turns out to be rather thin: at most I can look at a ticket's seat number and work out whether it's window or aisle, or do some disk-throughput arithmetic while programming, and that's about it.
Who knew a review of my life would start from this book.
Two months into the job, the perks had me wondering whether to just keep at it, but by the end of the third month I started wondering whether to be bolder and start something of my own. Among the life plans I've heard lately: someone opening a pizza shop, someone integrating SkypeOut for traditional industries, someone writing games for arcade machines. Thinking it over, are the skills in my hands really not enough to shine? I'll probably need another year or two of observing and learning, which makes this a good period to look around and try things. One more important point: company benefits are something you fight for yourself :P So when things are going smoothly, leave work on time! Every day of life, spend a third taking care of your body, a third earning a living for yourself and your family, and the variety in the last third is something you have to claim for yourself.
Hadoop + PostgreSQL = HadoopDB
Image sources: http://hadoop.apache.org/ and http://wiki.postgresql.org/
The combination of images above isn't exactly what HadoopDB means, but HadoopDB's development does use PostgreSQL as its example backend. So what is HadoopDB? It proposes and implements a model where a database runs on each DataNode, and the whole thing is integrated into one service architecture. The engineering that has gone into traditional databases cannot be summed up in a few short years, while the currently red-hot Hadoop stores data on HDFS by default, which is simple file-based storage. If data management is pushed down into databases, can the performance accumulated in the database world be brought into Hadoop? That is the direction HadoopDB wants to explore. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies; some also describe HadoopDB as an open-source parallel database.
Image source: http://hadoopdb.sourceforge.net/guide/
The figure above is HadoopDB's architecture: an SMS Planner translates SQL into MapReduce jobs, and at the bottom there is another translation turning MapReduce data access back into SQL against the local databases. I won't dig into it here; if you're interested, go read the paper.
Since HadoopDB was developed against Hadoop 0.19.x, I set it up with Hadoop 0.19.2. For the setup itself see: [Linux] 安裝 Hadoop 0.20.1 Multi-Node Cluster @ Ubuntu 9.10. Installing 0.19.2 differs only slightly from 0.20.1: the settings that article puts in hadoop-0.20.1/conf/core-site.xml and hadoop-0.20.1/conf/mapred-site.xml simply go into hadoop-0.19.2/conf/hadoop-site.xml instead. The walkthrough below continues from that installation guide, using a 3-node cluster named Cluster01, Cluster02 and Cluster03 as the example, with the hadoop account used on every machine.
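As a rough sketch of what that single hadoop-site.xml might contain for this cluster (the NameNode port 9000 is an assumption; Cluster01:9001 matches the mapred.job.tracker address that shows up in the job logs elsewhere in these notes):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://Cluster01:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>Cluster01:9001</value>
  </property>
</configuration>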
The above uses unchunked mode as the example. If you switch to chunked mode with, say, three machines and three databases per node, the configuration becomes 3 x 3, i.e. nine databases to set up, including creating the databases, creating the tables, importing the data and so on, so in practice you should write a script for it. The HadoopDB site has an example script that creates a /my_data directory on every node; I rewrote it slightly:
#!/usr/bin/python
import sys, os, thread, commands
import getopt

DEBUG_MODE = True
completed = {}

create_db_cmd_list = [ ''
    , 'createdb testdb'
    , 'echo "CREATE TABLE Helo ( ID int );" | psql testdb'
    , 'dropdb testdb'
]
cmd_list = []
cmd_list.extend( create_db_cmd_list )

def ParseHadoopXML( file_path ) :
    return

def executeThread(node, *args ):
    #Make sure key is accessible and is the correct key name.
    #os.system("ssh -i key -o 'StrictHostKeyChecking=no' %s \'mkdir /my_data \'" %(node))
    #You could replace the mkdir command with another command line or add more command lines,
    # as long as you prefix the command with the ssh connection.
    if DEBUG_MODE :
        print "\tShow Only"
    for cmd in cmd_list :
        if cmd == None or cmd == '' :
            continue
        cmd = cmd.strip()
        if cmd == '' :
            continue
        cmd_exec = "ssh %s \'%s\'" % (node , cmd )
        print "\t" , cmd_exec
        if DEBUG_MODE == False :
            os.system( cmd_exec )
    completed[node] = "true"

def main( argv=None ):
    hostfile = "nodes.txt"
    internalips = open(hostfile,'r').readlines()
    for i in internalips:
        os.system('sleep 1')
        node_info = i.strip() ,
        thread.start_new_thread(executeThread, node_info )
    while (len(completed.keys()) < len(internalips)):
        os.system('sleep 2')
    print "Execution Completed"

if __name__ == "__main__":
    main()
Finally, I did a major rewrite. If you plan to use it, I strongly suggest trying it in a virtual environment first to confirm the flow is right; ideally do the whole setup by hand once, and only adapt the script to your own needs after the flow is clear. Data to prepare before running it:
Dry run, which only prints the commands that would be executed; pay attention to which directories and databases it would remove
$ python this.py --source_dir_in_hdfs src
Actually execute
$ python this.py --source_dir_in_hdfs src --go
The default is unchunked; use --chunk_num to enable chunking
$ python this.py --source_dir_in_hdfs src --chunk_num 3
An actual run looks like this:
hadoop@Cluster01:~$ cat nodes.txt
192.168.56.168
192.168.56.169
192.168.56.170
hadoop@Cluster01:~$ cat table_create
ID int,
NAME varchar(250)
hadoop@Cluster01:~$ python batch_setup.py --source_dir_in_hdfs src
Current Status is just Debug Mode for show all commands
please set '-g' or '--go' option to execute them after check all commands.(look at the 'rm -rf' and 'hadoop fs -rmr')
$ /usr/bin/java -cp /home/hadoop/lib/hadoopdb.jar edu.yale.cs.hadoopdb.catalog.SimpleCatalogGenerator /tmp/Catalog.properties
=> Start to put the HadoopDB.xml into HDFS
$ /home/hadoop/bin/hadoop fs -rmr HadoopDB.xml
$ /home/hadoop/bin/hadoop fs -put HadoopDB.xml HadoopDB.xml
=> The data source(src) would be partitioned into 3 parts(tmp_out_hadoopdb) by the delimiter (\n)
$ /home/hadoop/bin/hadoop fs -rmr tmp_out_hadoopdb
$ /home/hadoop/bin/hadoop jar /home/hadoop/lib/hadoopdb.jar edu.yale.cs.hadoopdb.dataloader.GlobalHasher src tmp_out_hadoopdb 3 '\n' 0
=> To configure your nodes...
ssh 192.168.56.168 "dropdb udb_hadoopdb_0"
ssh 192.168.56.168 "createdb udb_hadoopdb_0"
ssh 192.168.56.168 "echo \"create table hadoopdb ( id int, name varchar(250) );\" | psql udb_hadoopdb_0"
ssh 192.168.56.168 "rm -rf /tmp/out_for_global_parition"
ssh 192.168.56.168 "/home/hadoop/bin/hadoop fs -get tmp_out_hadoopdb/part-00000 /tmp/out_for_global_parition"
ssh 192.168.56.168 "echo \"COPY hadoopdb FROM '/tmp/out_for_global_parition' WITH DELIMITER E'\t';\" | psql udb_hadoopdb_0"
ssh 192.168.56.168 "rm -rf /tmp/out_for_global_parition"
ssh 192.168.56.170 "dropdb udb_hadoopdb_2"
ssh 192.168.56.170 "createdb udb_hadoopdb_2"
ssh 192.168.56.170 "echo \"create table hadoopdb ( id int, name varchar(250) );\" | psql udb_hadoopdb_2"
ssh 192.168.56.170 "rm -rf /tmp/out_for_global_parition"
ssh 192.168.56.170 "/home/hadoop/bin/hadoop fs -get tmp_out_hadoopdb/part-00002 /tmp/out_for_global_parition"
ssh 192.168.56.170 "echo \"COPY hadoopdb FROM '/tmp/out_for_global_parition' WITH DELIMITER E'\t';\" | psql udb_hadoopdb_2"
ssh 192.168.56.170 "rm -rf /tmp/out_for_global_parition"
ssh 192.168.56.169 "dropdb udb_hadoopdb_1"
ssh 192.168.56.169 "createdb udb_hadoopdb_1"
ssh 192.168.56.169 "echo \"create table hadoopdb ( id int, name varchar(250) );\" | psql udb_hadoopdb_1"
ssh 192.168.56.169 "rm -rf /tmp/out_for_global_parition"
ssh 192.168.56.169 "/home/hadoop/bin/hadoop fs -get tmp_out_hadoopdb/part-00001 /tmp/out_for_global_parition"
ssh 192.168.56.169 "echo \"COPY hadoopdb FROM '/tmp/out_for_global_parition' WITH DELIMITER E'\t';\" | psql udb_hadoopdb_1"
ssh 192.168.56.169 "rm -rf /tmp/out_for_global_parition"
$ /home/hadoop/bin/hadoop fs -rmr tmp_out_hadoopdb
=> To setup the external table for Hive
$ /home/hadoop/bin/hadoop fs -mkdir /db
$ /home/hadoop/bin/hadoop fs -rmr /db/hadoopdb
$ echo "drop table hadoopdb;" | /home/hadoop/SMS_dist/bin/hive
$ echo "create external table hadoopdb ( id int, name string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS INPUTFORMAT 'edu.yale.cs.hadoopdb.sms.connector.SMSInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION '/db/hadoopdb'; " | /home/hadoop/SMS_dist/bin/hive
=> All Execution Completed...
#!/usr/bin/python
# At Python 2.6.4
# Yuan-Yi Chang
# 2010/01/07 15:09
#
import sys, os, thread, commands
import re, os.path
from optparse import OptionParser
BIN_JAVA = '/usr/bin/java'
BIN_HADOOP = '/home/hadoop/bin/hadoop'
BIN_HIVE = '/home/hadoop/SMS_dist/bin/hive'
JAR_HADOOPDB = '/home/hadoop/lib/hadoopdb.jar'
completed = {}
cmd_for_node = {}
def initHadoopDB( data_in_hdfs = None , data_delimiter = '\n' , data_field_delimiter = '\t' ,
data_partition_out = None ,
nodes_in_file = 'nodes.txt' , chunks_per_node = 3 ,
table_name = None , table_field_info = None ,
db_user = 'hadoop' , db_pass='1234' , db_field_delimiter = '|' , hive_db_dir_in_hdfs = '/db' ,
tmp_path_for_catelog = '/tmp/Catalog.properties' ,
out_hadoop_xml = 'HadoopDB.xml' , hadoop_xml_in_hdfs = 'HadoopDB.xml' ,
DEBUG_MODE = True ) :
if data_in_hdfs is None :
print 'Please input the path of the data source in HDFS'
return False
if data_partition_out is None :
print 'Please input the path for the data source parition in HDFS'
return False
if table_name is None or re.match( r'^[a-z0-9_]+$' , table_name ) is None :
print 'Please input the table name with [a-z0-9_] only'
return False
if table_field_info is None or os.path.isfile( table_field_info ) is False :
print 'Please check the "table_field_info" : ' + str(table_field_info)
return False
if os.path.isfile( nodes_in_file ) is False :
print 'Please check the "nodes_in_file" : ' + nodes_in_file
return False
if chunks_per_node < 0 :
print 'Please check the "chunks_per_node" : ' + chunks_per_node + ' , 0 for no chunk'
return False
data_delimiter = data_delimiter.replace( '\n' , '\\n' ).replace( '\t' , '\\t' )
data_field_delimiter = data_field_delimiter.replace( '\t' , '\\t' ).replace( '\n' , '\\n' )
db_field_delimiter = db_field_delimiter.replace( '\t' , '\\t' ).replace( '\n' , '\\n' )
make_catelog = ''
#Properties for Catalog Generation'
##################################
make_catelog += 'nodes_file='+nodes_in_file+'\n'
if chunks_per_node < 2 :
make_catelog += 'relations_chunked=no_use' + '\n'
make_catelog += 'relations_unchunked='+table_name + '\n'
else:
make_catelog += 'relations_unchunked=' + 'no_use' + '\n'
make_catelog += 'relations_chunked='+table_name + '\n'
make_catelog += 'catalog_file=' + out_hadoop_xml + '\n'
##
#DB Connection Parameters
##
make_catelog += 'port=5432' + '\n'
make_catelog += 'username=' + db_user + '\n'
make_catelog += 'password=' + db_pass + '\n'
make_catelog += 'driver=org.postgresql.Driver' + '\n'
make_catelog += 'url_prefix=jdbc\\:postgresql\\://'+ '\n'
##
#Chunking properties
##
make_catelog += 'chunks_per_node=' + str(chunks_per_node) + '\n'
make_catelog += 'unchunked_db_prefix=udb_' + table_name + '_' + '\n'
make_catelog += 'chunked_db_prefix=cdb_'+ table_name + '_' + '\n'
##
#Replication Properties
##
make_catelog += 'dump_script_prefix=/root/dump' + '\n'
make_catelog += 'replication_script_prefix=/root/load_replica_' + '\n'
make_catelog += 'dump_file_u_prefix=/mnt/dump_udb' + '\n'
make_catelog += 'dump_file_c_prefix=/mnt/dump_cdb'+ '\n'
##
#Cluster Connection
##
make_catelog += 'ssh_key=id_rsa-gsg-keypair' + '\n'
try:
f = open( tmp_path_for_catelog , 'w' )
f.write( make_catelog )
f.close()
except:
print 'Error to write a catelog:'+tmp_path_for_catelog
return False
cmd_exec = BIN_JAVA + ' -cp ' + JAR_HADOOPDB + ' edu.yale.cs.hadoopdb.catalog.SimpleCatalogGenerator ' + tmp_path_for_catelog
if DEBUG_MODE :
print '$ ' + cmd_exec
else:
os.system( cmd_exec )
if os.path.isfile( out_hadoop_xml ) is False :
print 'Please check the "out_hadoop_xml" : ' + out_hadoop_xml
return False
print '\n=> Start to put the HadoopDB.xml into HDFS\n'
if DEBUG_MODE :
print '$ ' + BIN_HADOOP + ' fs -rmr ' + hadoop_xml_in_hdfs
print '$ ' + BIN_HADOOP + ' fs -put ' + out_hadoop_xml + ' ' + hadoop_xml_in_hdfs
else:
os.system( BIN_HADOOP + ' fs -rmr ' + hadoop_xml_in_hdfs )
os.system( BIN_HADOOP + ' fs -put ' + out_hadoop_xml + ' ' + hadoop_xml_in_hdfs )
partition_num = 0
node_list = []
try:
tmp_list = open( nodes_in_file ,'r').readlines()
for line in tmp_list :
line = line.strip()
if line <> '' :
node_list.append( line )
partition_num = len( node_list )
except:
print 'Please check the "nodes_in_file" : ' + nodes_in_file
return False
if partition_num > 1 :
cmd_exec = BIN_HADOOP + ' jar ' + JAR_HADOOPDB + ' edu.yale.cs.hadoopdb.dataloader.GlobalHasher ' + data_in_hdfs + ' ' + data_partition_out + ' ' + str(partition_num) + ' \'' + data_delimiter + '\' 0 '
print '\n=> The data source('+data_in_hdfs+') would be partitioned into '+str(partition_num)+' parts('+data_partition_out+') by the delimiter ('+data_delimiter+')\n'
if DEBUG_MODE :
print '$ ' + BIN_HADOOP + ' fs -rmr ' + data_partition_out
print '$ ' + cmd_exec
else:
os.system( BIN_HADOOP + ' fs -rmr ' + data_partition_out )
os.system( cmd_exec )
else:
print '\n=> The number of datanodes should be > 1\n'
return False
HadoopDB_Info = ''
try:
HadoopDB_Info = open( out_hadoop_xml , 'r' ).read()
except:
print 'Error at read "out_hadoop_xml" : ' + out_hadoop_xml
return False
if HadoopDB_Info is '' :
print 'The info in the file is empty : ' + HadoopDB_Info
return False
DB_TABLE_CREATE_INFO = ''
try:
DB_TABLE_CREATE_INFO = open( table_field_info , 'r' ).read().strip()
except:
print 'Error at read "table_field_info" : ' + table_field_info
return False
if DB_TABLE_CREATE_INFO is '' :
print 'The info in the file is empty : ' + DB_TABLE_CREATE_INFO
return False
DB_TABLE_CREATE_INFO = DB_TABLE_CREATE_INFO.replace( "\n" , ' ' ).replace( '"' , '\\"' ).lower()
DB_TABLE_CREATE_INFO = 'create table ' + table_name + ' ( ' + DB_TABLE_CREATE_INFO + ' );'
#print node_list
partition_index = 0
for node in node_list:
cmd_for_node[ node ] = []
if chunks_per_node is 0 : # use unchunked mode
db_list = re.findall( '' + node +':[\d]+/(udb_' + table_name + '_'+'[\w]+)' , HadoopDB_Info )
for sub_db in db_list :
# Create Database & Table
cmd_for_node[ node ].append( 'dropdb ' + sub_db )
cmd_for_node[ node ].append( 'createdb ' + sub_db )
cmd_for_node[ node ].append( 'echo "'+DB_TABLE_CREATE_INFO+'" | psql '+ sub_db )
cmd_for_node[ node ].append( 'rm -rf /tmp/out_for_global_parition' )
cmd_for_node[ node ].append( BIN_HADOOP + ' fs -get ' + data_partition_out + '/part-%0.5d /tmp/out_for_global_parition' % partition_index )
cmd_for_node[ node ].append( 'echo "COPY '+table_name+' FROM \'/tmp/out_for_global_parition\' WITH DELIMITER E\''+data_field_delimiter+'\';" | psql '+ sub_db )
cmd_for_node[ node ].append( 'rm -rf /tmp/out_for_global_parition' )
else:
db_list = re.findall( '' + node +':[\d]+/(cdb_' + table_name + '_'+'[\w]+)' , HadoopDB_Info )
if db_list <> None :
cmd_for_node[ node ].append( 'rm -rf /tmp/*out_for_global_parition' )
cmd_for_node[ node ].append( BIN_HADOOP + ' fs -get ' + data_partition_out + '/part-%0.5d /tmp/out_for_global_parition' % partition_index )
cmd_for_node[ node ].append( 'cd /tmp; ' + BIN_JAVA + ' -cp ' + JAR_HADOOPDB + ' edu.yale.cs.hadoopdb.dataloader.LocalHasher out_for_global_parition ' + str( chunks_per_node ) + ' \'' + data_delimiter + '\' 0 ' )
sub_part = 0
for sub_db in db_list :
# Create Database & Table
cmd_for_node[ node ].append( 'dropdb ' + sub_db )
cmd_for_node[ node ].append( 'createdb ' + sub_db )
cmd_for_node[ node ].append( 'echo "'+DB_TABLE_CREATE_INFO+'" | psql '+ sub_db )
cmd_for_node[ node ].append( 'echo "COPY
'+table_name+' FROM \'/tmp/'+str(sub_part)+'-out_for_global_parition\'
WITH DELIMITER E\''+data_field_delimiter+'\';" | psql '+ sub_db )
sub_part = sub_part + 1
#cmd_for_node[ node ].append( 'rm -rf /tmp/'+str(sub_part)+'-out_for_global_parition' )
cmd_for_node[ node ].append( 'rm -rf /tmp/*out_for_global_parition' )
partition_index = partition_index + 1
print '\n=> To configure your nodes...\n'
for node in node_list:
thread.start_new_thread( executeThreadForNode , ( node, DEBUG_MODE ) )
while (len(completed.keys()) < len(node_list) ) :
os.system('sleep 2')
if DEBUG_MODE :
print '$ ' + BIN_HADOOP + ' fs -rmr ' + data_partition_out
else:
os.system( BIN_HADOOP + ' fs -rmr ' + data_partition_out )
print '\n=> To setup the external table for Hive\n'
if DEBUG_MODE :
print '$ ' + BIN_HADOOP + ' fs -mkdir ' + hive_db_dir_in_hdfs
print '$ ' + BIN_HADOOP + ' fs -rmr ' + hive_db_dir_in_hdfs + '/' + table_name
else:
os.system( BIN_HADOOP + ' fs -mkdir ' + hive_db_dir_in_hdfs )
os.system( BIN_HADOOP + ' fs -rmr ' + hive_db_dir_in_hdfs + '/' + table_name )
cmd_exec = ' echo "drop table '+table_name+';" | ' + BIN_HIVE
if DEBUG_MODE :
print '$ ' + cmd_exec
else:
os.system( cmd_exec )
create_hive_external_table = ' ROW FORMAT DELIMITED FIELDS TERMINATED BY \'' + db_field_delimiter + '\''
create_hive_external_table += ' STORED AS '
create_hive_external_table += ' INPUTFORMAT \'edu.yale.cs.hadoopdb.sms.connector.SMSInputFormat\' '
create_hive_external_table += ' OUTPUTFORMAT \'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\' '
create_hive_external_table += ' LOCATION \'' + hive_db_dir_in_hdfs + '/' + table_name + '\'; '
DB_TABLE_CREATE_INFO = DB_TABLE_CREATE_INFO.replace( ";" , ' ' ).replace( 'precision' , '' ).replace( 'create table' , 'create external table' )
DB_TABLE_CREATE_INFO = re.sub( 'varchar\([\d]+\)|text' , 'string' , DB_TABLE_CREATE_INFO )
create_hive_external_table = DB_TABLE_CREATE_INFO + create_hive_external_table
cmd_exec = ' echo "'+create_hive_external_table+'" | ' + BIN_HIVE
if DEBUG_MODE :
print '$ ' + cmd_exec
else:
os.system( cmd_exec )
def executeThreadForNode(node,DEBUG_MODE=True, *args ):
for cmd in cmd_for_node[node] :
if cmd == None or cmd == '' :
continue;
cmd = cmd.strip()
if cmd == '' :
continue;
cmd = cmd.replace( '"' , '\\"' )
cmd_exec = "ssh %s \"%s\"" % (node , cmd )
print "\t" , cmd_exec
if DEBUG_MODE == False :
os.system( cmd_exec )
completed[node] = "true"
def main( argv=None ):
parser = OptionParser()
parser.add_option( "-H" , "--source_dir_in_hdfs" ,
dest="source_dir_in_hdfs" , default=None, help="dir for data source in
HDFS" )
parser.add_option( "-D" , "--source_data_delimiter" ,
dest="source_data_delimiter" , default='\n' , help="record delimtier
for the source" )
parser.add_option( "-F" ,
"--source_field_delimiter" , dest="source_field_delimiter" ,
default='\t' , help="field delimiter for a record" )
parser.add_option( "-P" , "--source_partition_dir" ,
dest="source_partition_dir" , default="tmp_out_hadoopdb" , help="temp
dir in HDFS for source partition" )
parser.add_option( "-N"
, "--node_list_file" , dest="node_list_file" , default="nodes.txt" ,
help="path for a file saved each node's IP address" )
parser.add_option( "-c" , "--chunk_num" , dest="chunk_num" , default=0 , help="number of databases for each node" )
parser.add_option( "-t" , "--table_name" , dest="table_name" ,
default="hadoopdb" , help="table name for creation on Hive and
databases" )
parser.add_option( "-i" , "--table_field_info_file"
, dest="table_field_info_file" , default="table_create", help="file for
table field definition only" )
parser.add_option( "-u" ,
"--db_username" , dest="db_username" , default="hadoop" ,
help="username for login the databases on each node" )
parser.add_option( "-p" , "--db_password" , dest="db_password" ,
default="1234" , help="password for login the databases on each node" )
parser.add_option( "-d" , "--db_field_delimiter" ,
dest="db_field_delimiter" , default="|" , help="field delimiter for the
databases" )
parser.add_option( "-w" , "--hive_db_dir" ,
dest="hive_db_dir" , default='/db' , help="path in HDFS for Hive to
save the tables" )
parser.add_option( "-f" , "--catalog_properties"
, dest="catalog_properties" , default='/tmp/Catalog.properties' ,
help="output file for Catalog.Properties" )
parser.add_option(
"-x" , "--hadoopdb_xml" , dest="hadoopdb_xml" , default="HadoopDB.xml"
, help="output file for HadoopDB.xml" )
parser.add_option( "-y"
, "--hadoopdb_xml_in_hdfs" , dest="hadoopdb_xml_in_hdfs" ,
default="HadoopDB.xml" , help="filename for HadoopDB.xml in HDFS" )
parser.add_option( "-g" , "--go" , action="store_false" , dest="mode"
, default=True , help="set it to execute the commands" )
( options, args ) = parser.parse_args()
#print options
#return
#initHadoopDB( data_in_hdfs='src' , data_partition_out='tmp_out' , table_name='justtest' , table_field_info='table_create' )
if options.source_dir_in_hdfs is None :
print "Please input the source dir in HDFS by '--source_dir_in_hdfs' "
return
if os.path.isfile( options.node_list_file ) is False :
print "Please check the '" + options.node_list_file + "' path and setup by '--node_list_file'"
if options.mode is True :
print "\n Current Status is just Debug Mode for show all
commands\n please set '-g' or '--go' option to execute them after check
all commands.(look at the 'rm -rf' and 'hadoop fs -rmr')\n"
initHadoopDB( data_in_hdfs = options.source_dir_in_hdfs,
    data_delimiter = options.source_data_delimiter,
    data_field_delimiter = options.source_field_delimiter,
    data_partition_out = options.source_partition_dir,
    nodes_in_file = options.node_list_file,
    chunks_per_node = options.chunk_num,
    table_name = options.table_name,
    table_field_info = options.table_field_info_file,
    db_user = options.db_username,
    db_pass = options.db_password,
    db_field_delimiter = options.db_field_delimiter,
    hive_db_dir_in_hdfs = options.hive_db_dir,
    tmp_path_for_catelog = options.catalog_properties,
    out_hadoop_xml = options.hadoopdb_xml,
    hadoop_xml_in_hdfs = options.hadoopdb_xml_in_hdfs,
    DEBUG_MODE = options.mode )
print "\n\n=> All Execution Completed..."
if __name__ == "__main__":
main()
A few common problems I ran into are recorded here as well:
hadoop@Cluster01:~$ echo "drop table justest;" | /home/hadoop/SMS_dist/bin/hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201001081445_409015456.txt
hive> drop table justest;
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Failed to start database 'metastore_db', see the next exception for details.
NestedThrowables:
java.sql.SQLException: Failed to start database 'metastore_db', see the next exception for details.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>
This kind of message shows up because two clients are using Hive at the same time; with the default setup, only one client can operate at a time.
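The root cause is that Hive's default metastore is an embedded Derby database, which accepts only a single connection. If several concurrent clients are really needed, the usual workaround is to move the metastore to a standalone database; a rough sketch for hive-site.xml, assuming a reachable MySQL server and its JDBC driver on the Hive classpath (the host, database name, account and password below are placeholders):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://Cluster01:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
Otherwise, just make sure only one hive client is running at a time.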
Saving my Karma with a bot! Since my Plurk Karma keeps swinging around, I figured I might as well build a Plurk bot. This implementation follows [PHP] Official Plurk API 之PHP - cURL 使用教學.
Update @ 2011/07/26: added URL-shortening code, because news links were exceeding the 144-character limit.
Overall it's a small program that scrapes Yahoo! News. In theory RSS would be the right approach, but since I only care about the featured headlines, it scrapes the homepage instead and parses that; feel free to rewrite the getNews function. The script can be driven by Windows Task Scheduler or a Unix crontab, and the environment also needs PHP cURL support. It is extremely bare-bones, to the point that I don't even check for a wrong username or password :P
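For the Unix side, a crontab entry along these lines (the paths are placeholders) would run the bot every ten minutes:
*/10 * * * * /usr/bin/php /home/user/plurk_news_bot.php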
<?php
$cookie_db = '/tmp/http_cookie';
$log_db = '/tmp/job_log';
$hash_db = '/tmp/http_yahoo';
$bin_grep = '/bin/grep';
$plurk_api_key = 'YOUR_PLURK_API_KEY';
$plurk_id = 'LOGIN_ID';
$plurk_passwd = 'LOGIN_PASSWORD';
if( file_exists( $log_db ) )
return;
$result = do_act(
'http://www.plurk.com/API/Users/login' ,
'POST' ,
array(
'api_key' => $plurk_api_key ,
'username' => $plurk_id ,
'password' => $plurk_passwd
) ,
$cookie_db
);
$source = getNews();
foreach( $source as $data )
{
if( !check_exists( $data['url'] ) )
{
$target_url = NULL;
if( strlen( $data['url'] ) < 100 )
$target_url = $data['url'];
else
$target_url = getTinyurl( $data['url'] );
if( !empty( $target_url ) )
{
$plurk = '[News]'.$target_url.' ('.$data['title'].')';
$result = do_act(
'http://www.plurk.com/API/Timeline/plurkAdd' ,
'POST' ,
array(
'api_key' => $plurk_api_key ,
'qualifier' => 'shares' , // loves , likes , shares , ...
'content' => $plurk
) ,
$cookie_db
);
}
put_db( $data['url'] );
}
}
exit;
function check_exists( $url )
{
global $hash_db , $bin_grep;
if( !file_exists( $hash_db ) )
return false;
$cmd = $bin_grep.' -c "'. md5($url) .'" '.$hash_db;
$result = shell_exec( $cmd );
$result = trim( $result );
//echo "[$cmd]\n[$result]\n";
return !empty( $result );
}
function put_db( $url )
{
global $hash_db , $log_db ;
if( !file_exists( $hash_db ) )
$fp = fopen( $hash_db , 'w' );
else
$fp = fopen( $hash_db , 'a' );
if( $fp )
{
fwrite( $fp , md5($url) . "\n" );
fclose( $fp );
}
else
file_put_contents( $log_db , "Error" );
}
function getNews()
{
global $log_db;
$out = array();
$raw = file_get_contents( 'http://tw.yahoo.com/' );
$pattern = '<label>';
$raw = stristr( $raw , $pattern );
if( empty( $raw ) )
{
file_put_contents( $log_db , "Parser Error 1" );
return $out;
}
$raw = substr( $raw , strlen( $pattern ) );
$pattern = '<ol>';
$finish = strpos( $raw , $pattern );
if( $finish === false ) // strpos() returns false (not a negative number) when the pattern is missing
{
file_put_contents( $log_db , "Parser Error 2" );
return $out;
}
$raw = substr( $raw , 0 , $finish );
if( empty( $raw ) )
return $out;
$pattern = '{<h3[^>]*>[^<]*<a href="(.*?)"[^>]*>(.*?)</a></h3>}is';
if( preg_match_all( $pattern , $raw , $matches ) )
{
for( $i=0 , $cnt=count( $matches[1] ) ; $i<$cnt ; ++$i )
{
array_push( $out , array(
'url' => strstr( $matches[1][$i] , 'http:' ) ,
'title' => $matches[2][$i] )
);
}
}
else
file_put_contents( $log_db , "Parser Error 3" );
return $out;
}
function do_act( $target_url , $type , $data , $cookie_file = NULL )
{
$ch = curl_init();
if( $type == 'GET' ) // GET
{
$target_url .= '?' . http_build_query( $data ); // append the query string (assumes the URL has no query part yet)
curl_setopt($ch, CURLOPT_URL, $target_url );
}
else // POST
{
curl_setopt( $ch , CURLOPT_URL , $target_url );
curl_setopt( $ch , CURLOPT_POST , true );
curl_setopt( $ch , CURLOPT_POSTFIELDS , http_build_query( $data ) );
}
if( isset( $cookie_file ) ) // cookie
{
curl_setopt( $ch , CURLOPT_COOKIEFILE , $cookie_file );
curl_setopt( $ch , CURLOPT_COOKIEJAR , $cookie_file );
}
curl_setopt( $ch , CURLOPT_RETURNTRANSFER , true );
//curl_setopt( $ch , CURLOPT_FOLLOWLOCATION , true );
//curl_setopt( $ch , CURLOPT_SSL_VERIFYPEER , false );
$result = curl_exec( $ch );
curl_close( $ch );
return $result;
}
function getTinyurl( $url )
{
$new_url = @file_get_contents( 'http://tinyurl.com/api-create.php?url='.urlencode($url) );
$new_url = trim( $new_url );
if( !empty( $new_url ) )
return $new_url;
return NULL;
}
?>
Image source: http://hadoop.apache.org/
Today I used three machines and finally set up a real Hadoop cluster; before this I had only ever installed it on a single machine: [Linux] 安裝單機版 Hadoop 0.20.1 Single-Node Cluster (Pseudo-Distributed) @ Ubuntu 9.04. Because I wanted to try HadoopDB, I installed Hadoop 0.19.2 at work under VMware Workstation and hit a pile of random problems; even Hadoop's wordcount example wouldn't run! I did get it configured in the end, but I was far too unfamiliar with it, so back in my room I went through it again, and set up a personal development environment to play with while I was at it.
Environment: Windows 7 x64 + VirtualBox 3.1.2 r56127, with a 3-machine cluster planned. I originally intended to use ubuntu-9.10-server-amd64.iso, but VirtualBox had trouble booting it, so I switched to ubuntu-9.10-server-i386.iso. During installation I gave the VM 20 GB of disk space and created a hadoop user at the end. The approach from here is to configure one virtual machine fully and then add machines by cloning it.
Once one virtual machine is installed, the next step is to bring it up to date and install the required software.
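A rough sketch of the usual commands (the package choices are assumptions: Hadoop mainly needs Java 6 and an SSH server, and sun-java6-jdk may require enabling the partner repository, with openjdk-6-jdk as an alternative):
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install openssh-server sun-java6-jdk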
Here are some references:
Apart from that, while installing Hadoop 0.19.2 at the office I ran into the following problem:
It eventually turned out that the real cause was a bad /etc/hosts entry on one of the datanodes; once that was fixed, the problem went away.
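For reference, a consistent /etc/hosts on every node might look like the sketch below (the addresses and hostnames are placeholders for your own network; also avoid mapping the real hostname to 127.0.1.1, a common cause of datanodes registering with the wrong address):
127.0.0.1       localhost
192.168.56.168  Cluster01
192.168.56.169  Cluster02
192.168.56.170  Cluster03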
Other problems encountered with Hadoop 0.20.1: