Friday, January 24, 2014

[MongoDB] Playing with MapReduce using PyMongo @ Ubuntu 12.04

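After a brief first look at MongoDB, I find myself surprisingly excited about it. Compared with Hadoop, it feels much easier to get started with. That said, even with PyMongo it suits people who already know JavaScript, because the mapper and reducer are still written in JavaScript, or more precisely as BSON's JavaScript code type.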

Since this is MapReduce, let's start with a word count! To be honest, word count was also the first example I ran when I was learning Hadoop.

A quick test (text source: Taiwan wiki - Names):

$ python import.py -
{"field":"There are various names for the island of Taiwan in use today, derived from explorers or rulers by each particular period. The former name Formosa (福爾摩沙) dates from 1544, when Portuguese sailors sighted the main island of Taiwan and named it Ilha Formosa, which means \"Beautiful Island\".[21] In the early 17th century, the Dutch East India Company established a commercial post at Fort Zeelandia (modern Anping, Tainan) on a coastal islet called \"Tayouan\" in the local Siraya language; the name was later extended to the whole island as \"Taiwan\".[22] Historically, \"Taiwan\" has also been written as 大灣, 臺員, 大員, 臺圓, 大圓 and 臺窩灣."}
{"field":"The official name of the state is the \"Republic of China\"; it has also been known under various names throughout its existence. Shortly after the ROC's establishment in 1912, while it was still located on the Asian mainland, the government used the abbreviation \"China\" (\"Zhongguó\") to refer to itself. During the 1950s and 1960s, it was common to refer to it as \"Nationalist China\" (or \"Free China\") to differentiate it from \"Communist China\" (or \"Red China\").[23] It was present at the UN under the name \"China\" until 1971, when it lost its seat to the People's Republic of China. Since then, the name \"China\" has been commonly used internationally to refer only to the People's Republic of China.[24] Over subsequent decades, the Republic of China has become commonly known as \"Taiwan\", after the island that composes most of its territory. The Republic of China participates in most international forums and organizations under the name \"Chinese Taipei\" due to diplomatic pressure from the PRC. For instance, it is the name under which it has competed at the Olympic Games since 1984, and its name as an observer at the World Health Organization.[25]"}
Import:  2
[ObjectId('52e265a69c3fe514be2534fe'), ObjectId('52e265a69c3fe514be2534ff')]


$ python map-reduce-word-count.py --show-result --delete-result
Collection(Database(MongoClient('localhost', 27017), u'db'), u'tmp_2014-01-24_131025')
{u'_id': u'1544', u'value': {u'count': 1.0}}
{u'_id': u'17th', u'value': {u'count': 1.0}}
{u'_id': u'1912', u'value': {u'count': 1.0}}
{u'_id': u'1950s', u'value': {u'count': 1.0}}
{u'_id': u'1960s', u'value': {u'count': 1.0}}
{u'_id': u'1971', u'value': {u'count': 1.0}}
{u'_id': u'1984', u'value': {u'count': 1.0}}
{u'_id': u'21', u'value': {u'count': 1.0}}
{u'_id': u'22', u'value': {u'count': 1.0}}
{u'_id': u'23', u'value': {u'count': 1.0}}
{u'_id': u'24', u'value': {u'count': 1.0}}
{u'_id': u'25', u'value': {u'count': 1.0}}
{u'_id': u';', u'value': {u'count': 1.0}}
{u'_id': u'Anping', u'value': {u'count': 1.0}}
{u'_id': u'Asian', u'value': {u'count': 1.0}}
{u'_id': u'Beautiful', u'value': {u'count': 1.0}}
{u'_id': u'China', u'value': {u'count': 12.0}}
{u'_id': u'Chinese', u'value': {u'count': 1.0}}
{u'_id': u'Communist', u'value': {u'count': 1.0}}
{u'_id': u'Company', u'value': {u'count': 1.0}}
{u'_id': u'During', u'value': {u'count': 1.0}}
{u'_id': u'Dutch', u'value': {u'count': 1.0}}
{u'_id': u'East', u'value': {u'count': 1.0}}
{u'_id': u'For', u'value': {u'count': 1.0}}


Here mongodb-study/blob/master/tools/import.py simply loads the data into MongoDB, using database = db and collection = test by default. The more important parts, the mapper and reducer, are written in mongodb-study/blob/master/tools/map-reduce-word-count.py.
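As a rough idea of what such an import step involves, here is a minimal sketch. It is not the actual import.py: the function names are hypothetical, it assumes one JSON object per input line, and it assumes a PyMongo version that provides insert_many (PyMongo 3.0 or later).

```python
import json
import sys


def parse_documents(lines):
    """Parse one JSON object per non-empty line into a list of dicts."""
    return [json.loads(line) for line in lines if line.strip()]


def import_documents(collection, lines):
    """Insert the parsed documents and return their ids."""
    docs = parse_documents(lines)
    if not docs:
        # insert_many rejects an empty list, so return early.
        return []
    return collection.insert_many(docs).inserted_ids


def main():
    # Imported here so the helpers above can be used without PyMongo installed.
    from pymongo import MongoClient

    # Assumed defaults, matching the post: database "db", collection "test".
    collection = MongoClient("localhost", 27017)["db"]["test"]
    ids = import_documents(collection, sys.stdin)
    print("Import: ", len(ids))
    print(ids)
```

Calling main() with JSON lines on standard input would then print the number of imported documents and their ObjectIds, similar to the session above.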

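In map-reduce-word-count.py, the key parts are the mapper and reducer definitions. Even though this is PyMongo, they are still written in JavaScript, which is probably PyMongo's biggest drawback. As I recall, the Java examples for MongoDB that I came across could be written directly in Java, which would make quite a difference, although the map and reduce functions are executed as JavaScript by the server whichever driver is used. In short, to write a mapper and reducer with PyMongo you need to know a little JavaScript first.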

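One tip for writing the mapper with PyMongo: this.field gives you the original input data, but it has to be coerced to a string first (here by prepending an empty string) so that split can then break it into words. You can also define your own helper functions (func) and call them. After that, it all comes down to string processing :)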

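from bson.code import Code

# The map function is JavaScript, sent to the server as a BSON Code value.
# A raw string keeps the regex backslashes intact on their way to JavaScript.
mapper = Code(
    r"""
    function() {
        // Example of a user-defined helper that the map function can call.
        var func = {
            'author': function() {
                return 'changyy';
            }
        };
        // Coerce this.field to a string, then split on whitespace and punctuation.
        ("" + this.field).split(/[\s\[\],\(\)"\.]+/).forEach(function(v) {
            //emit(func.author(), 1);
            if (v && v.length)
                emit(v, {'count': 1});
        });
    }
    """
)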
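
To see what the split in the mapper produces without running MongoDB, the same character class can be tried in Python. This is only a rough counterpart: tokenize is a hypothetical helper, and str() does not behave exactly like the JavaScript "" + this.field coercion for non-string values.

```python
import re

# Same character class as the JavaScript regex in the mapper.
DELIMITERS = re.compile(r'[\s\[\],()".]+')


def tokenize(field):
    """Mimic ("" + this.field).split(...) followed by the empty-string check."""
    text = str(field)  # rough counterpart of the "" + this.field coercion
    return [token for token in DELIMITERS.split(text) if token]


tokens = tokenize('the "Republic of China"; it has (modern Anping, Tainan).[22]')
print(tokens)
# ['the', 'Republic', 'of', 'China', ';', 'it', 'has', 'modern', 'Anping', 'Tainan', '22']
```

The semicolon is not in the delimiter set, which is why ';' shows up as a "word" of its own in the result listing above.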


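reducer = Code(
    """
    function(key, values) {
        // Sum the partial counts emitted (or previously reduced) for this key.
        var total = 0;
        for (var i = 0; i < values.length; ++i) {
            total += values[i].count;
        }
        // Return the same shape as the emitted value so the result can be re-reduced.
        return {'count': total};
    }
    """
)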
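
The reason both the mapper and the reducer use the {'count': n} shape is that MongoDB may call the reduce function more than once for the same key, feeding earlier results back in, and does not call it at all for a key with a single value. The following pure-Python sketch imitates that flow. All function names are hypothetical and it is not how the server is implemented, only an illustration of the contract.

```python
from collections import defaultdict
import re


def map_document(doc):
    """Yield (key, value) pairs like the mapper's emit(v, {'count': 1})."""
    for word in re.split(r'[\s\[\],()".]+', str(doc["field"])):
        if word:
            yield word, {"count": 1}


def reduce_values(key, values):
    """Sum partial counts; the result has the same shape as each input value."""
    return {"count": sum(value["count"] for value in values)}


def map_reduce(docs):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_document(doc):
            grouped[key].append(value)
    # Keys with a single value are passed through without calling reduce.
    return {
        key: values[0] if len(values) == 1 else reduce_values(key, values)
        for key, values in grouped.items()
    }


docs = [{"field": 'China (or "Free China")'}, {"field": 'the name "China"'}]
result = map_reduce(docs)
print(result["China"])  # {'count': 3}

# Re-reducing a partial result gives the same total as reducing all at once.
partial = reduce_values("China", [{"count": 1}, {"count": 1}])
print(reduce_values("China", [partial, {"count": 1}]))  # {'count': 3}
```

If the reducer returned a bare number instead, a second reduce pass would fail to find .count on its inputs, so keeping the emitted and returned shapes identical is the safer design.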
