2014年11月3日 星期一

[Linux] 架設 MongoDB Sharded Cluster 與驗證資料分散性 @ Ubuntu 14.04

今年年初時,把玩過 MongoDB Replica Set ,也試過一些 Map-Reduce 計算等,最近終於又開始抽空實驗了 :P 當時看到不少 Sharded Cluster 搭建在 MongoDB Replica Set 身上,深感 process (server) 的數量不低,就遲遲沒跳下去玩,最近才發現就算只是單一個 MongoDB Server 也可以當作 shared Cluster 一員,就方便啦!

此次 MongoDB Sharded Cluster 架構:
  • 跑兩個 MongoDB Server,分別在 port 10001, 10002
  • 跑一個 MongoDB Config Server - port 20001
  • 跑一個 MongoDB Routing Server - port 30001
以下操作全部都在一台 ubuntu 14.04 server:

$ rm -rf /tmp/mongo && mkdir -p /tmp/mongo/db/1 /tmp/mongo/db/2 /tmp/mongo/db/c /tmp/mongo/log

$ mongod --port 10001 --dbpath /tmp/mongo/db/1 --logpath /tmp/mongo/log/1 &
$ mongod --port 10002 --dbpath /tmp/mongo/db/2 --logpath /tmp/mongo/log/2 &
$ mongod --port 20001 --configsvr --dbpath /tmp/mongo/db/c --logpath /tmp/mongo/log/c &

$ mongos --port 30001 --configdb 127.0.0.1:20001 &


設定 partition 用法:

$ mongo --port 30001
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 4,
        "minCompatibleVersion" : 4,
        "currentVersion" : 5,
        "clusterId" : ObjectId("###############")
}
  shards:
  databases:
        {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
mongos> sh.addShard('127.0.0.1:10001')
{ "shardAdded" : "shard0000", "ok" : 1 }
mongos> sh.addShard('127.0.0.1:10002')
{ "shardAdded" : "shard0001", "ok" : 1 }
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 4,
        "minCompatibleVersion" : 4,
        "currentVersion" : 5,
        "clusterId" : ObjectId("#####################")
}
  shards:
        {  "_id" : "shard0000",  "host" : "127.0.0.1:10001" }
        {  "_id" : "shard0001",  "host" : "127.0.0.1:10002" }
  databases:
        {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
mongos> sh.enableSharding('ydb')
{ "ok" : 1 }
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 4,
        "minCompatibleVersion" : 4,
        "currentVersion" : 5,
        "clusterId" : ObjectId("######################")
}
  shards:
        {  "_id" : "shard0000",  "host" : "127.0.0.1:10001" }
        {  "_id" : "shard0001",  "host" : "127.0.0.1:10002" }
  databases:
        {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
        {  "_id" : "test",  "partitioned" : false,  "primary" : "shard0000" }
        {  "_id" : "ydb",  "partitioned" : true,  "primary" : "shard0000" }
mongos> use ydb
switched to db ydb
mongos> db.ycollection.ensureIndex( { _id : "hashed" } )
{
        "raw" : {
                "127.0.0.1:10001" : {
                        "createdCollectionAutomatically" : true,
                        "numIndexesBefore" : 1,
                        "numIndexesAfter" : 2,
                        "ok" : 1
                }
        },
        "ok" : 1
}
mongos> sh.shardCollection("ydb.ycollection", { "_id": "hashed" } )
{ "collectionsharded" : "ydb.ycollection", "ok" : 1 }
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 4,
        "minCompatibleVersion" : 4,
        "currentVersion" : 5,
        "clusterId" : ObjectId("#####################")
}
  shards:
        {  "_id" : "shard0000",  "host" : "127.0.0.1:10001" }
        {  "_id" : "shard0001",  "host" : "127.0.0.1:10002" }
  databases:
        {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
        {  "_id" : "test",  "partitioned" : false,  "primary" : "shard0000" }
        {  "_id" : "ydb",  "partitioned" : true,  "primary" : "shard0000" }
                ydb.ycollection
                        shard key: { "_id" : "hashed" }
                        chunks:
                                shard0000       2
                                shard0001       2
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 Timestamp(2, 2)
                        { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong(0) } on : shard0000 Timestamp(2, 3)
                        { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387902") } on : shard0001 Timestamp(2, 4)
                        { "_id" : NumberLong("4611686018427387902") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 Timestamp(2, 5)


簡易驗證資料分散性:

mongos>

db.ycollection.insert({user:"a", item: "1", rate: 0.3})
db.ycollection.insert({user:"a", item: "3", rate: 0.5})
db.ycollection.insert({user:"a", item: "4", rate: 0.9})
db.ycollection.insert({user:"b", item: "2", rate: 0.1})
db.ycollection.insert({user:"b", item: "3", rate: 0.6})
db.ycollection.insert({user:"b", item: "5", rate: 0.2})
db.ycollection.insert({user:"c", item: "3", rate: 0.2})
db.ycollection.insert({user:"c", item: "4", rate: 0.7})

mongos> db.ycollection.find()
{ "_id" : ObjectId("#################"), "user" : "a", "item" : "3", "rate" : 0.5 }
{ "_id" : ObjectId("#################"), "user" : "a", "item" : "1", "rate" : 0.3 }
{ "_id" : ObjectId("#################"), "user" : "a", "item" : "4", "rate" : 0.9 }
{ "_id" : ObjectId("#################"), "user" : "b", "item" : "2", "rate" : 0.1 }
{ "_id" : ObjectId("#################"), "user" : "b", "item" : "5", "rate" : 0.2 }
{ "_id" : ObjectId("#################"), "user" : "b", "item" : "3", "rate" : 0.6 }
{ "_id" : ObjectId("#################"), "user" : "c", "item" : "3", "rate" : 0.2 }
{ "_id" : ObjectId("#################"), "user" : "c", "item" : "4", "rate" : 0.7 }


接著,想要驗證資料的確是儲存在不同台 mongodb server ,作法就是把資料透過 mongodump 出來(bson),再用 bsondump 印出來驗證:

$ cd /tmp
$ mongodump --port 10001 -d ydb -c ycollection
$ bsondump dump/ydb/ycollection.bson
{ "_id" : ObjectId( "#############" ), "user" : "a", "item" : "1", "rate" : 0.3 }
{ "_id" : ObjectId( "#############" ), "user" : "b", "item" : "2", "rate" : 0.1 }
{ "_id" : ObjectId( "#############" ), "user" : "b", "item" : "3", "rate" : 0.6 }
{ "_id" : ObjectId( "#############" ), "user" : "c", "item" : "4", "rate" : 0.7 }
4 objects found

$ mongodump --port 10002 -d ydb -c ycollection
$ bsondump dump/ydb/ycollection.bson
{ "_id" : ObjectId( "#############" ), "user" : "a", "item" : "3", "rate" : 0.5 }
{ "_id" : ObjectId( "#############" ), "user" : "a", "item" : "4", "rate" : 0.9 }
{ "_id" : ObjectId( "#############" ), "user" : "b", "item" : "5", "rate" : 0.2 }
{ "_id" : ObjectId( "#############" ), "user" : "c", "item" : "3", "rate" : 0.2 }
4 objects found


因此可完成驗證,資料的確分在不同區!有興趣還可以用 MongoDB Aggregate (Map-Reduce) 實作共現矩陣(Co-Occurrence Matrix)及簡易的推薦系統 來驗證 partition 演算法的正確性。

沒有留言:

張貼留言