I want to do some MapReduce work, probably with Hadoop Streaming, which means massaging the data into a line-based format (one record per line; a rough sketch follows the list below). That immediately raises the question of how to encode, or perhaps compress, each record, so I wanted to measure which of the following is actually the better fit:
- base64
- json
- bz2
- gzip
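As a side note, here is a minimal sketch of what "line-based" could look like in practice. This is my own illustration rather than anything taken from the original workflow: the tab-separated key/value layout and the sample records are made up, and it is written in the same Python 2 style as the benchmark script below. The point is simply that base64 output contains no newlines or tabs, so each record survives as a single line that Hadoop Streaming can hand to a mapper via stdin.

#!/usr/bin/env python
# Hypothetical sketch: turn (key, binary value) records into one line each so
# Hadoop Streaming can split them on '\n' and '\t'. The field layout here is
# an assumption for illustration, not part of the original post.
import base64
import sys

def encode_record(key, value):
    # base64 output never contains '\n' or '\t', so the line stays intact
    return '%s\t%s' % (key, base64.b64encode(value))

def decode_record(line):
    key, encoded = line.rstrip('\n').split('\t', 1)
    return key, base64.b64decode(encoded)

if __name__ == '__main__':
    for key, value in [('doc1', 'hello\nworld'), ('doc2', '\x00\x01\x02')]:
        line = encode_record(key, value)
        sys.stdout.write(line + '\n')
        assert decode_record(line) == (key, value)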
I already had a rough idea of the answer, but it doesn't hurt to measure:
#!/usr/bin/env python
from timeit import Timer
import json
import base64
import bz2
import zlib

# Test string with no repeated characters.
s = '1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()_+|'

def do_base64():
    encoded = base64.b64encode( s )
    decoded = base64.b64decode( encoded )

def do_json():
    encoded = json.dumps( s )
    decoded = json.loads( encoded )

def do_bz2():
    encoded = bz2.compress( s )
    decoded = bz2.decompress( encoded )

def do_gzip():
    # zlib implements the same deflate algorithm gzip uses
    encoded = zlib.compress( s )
    decoded = zlib.decompress( encoded )

if __name__ == '__main__':
    # Each Timer runs the round-trip 1,000,000 times (timeit's default).
    t1 = Timer( "do_base64()" , "from __main__ import do_base64" )
    try:
        print "Encode & Decode By base64: " + str( t1.timeit() )
    except:
        t1.print_exc()

    t2 = Timer( "do_json()" , "from __main__ import do_json" )
    try:
        print "Encode & Decode By json: " + str( t2.timeit() )
    except:
        t2.print_exc()

    t3 = Timer( "do_bz2()" , "from __main__ import do_bz2" )
    try:
        print "Encode & Decode By bz2: " + str( t3.timeit() )
    except:
        t3.print_exc()

    t4 = Timer( "do_gzip()" , "from __main__ import do_gzip" )
    try:
        print "Encode & Decode By gzip: " + str( t4.timeit() )
    except:
        t4.print_exc()
Results on an AMD X4 955 with 4 GB of DDR3-1200, running Ubuntu 10.04 i386:
$ python t.py
Encode & Decode By base64: 2.40118098259
Encode & Decode By json: 12.9051868916
Encode & Decode By bz2: 105.709769011
Encode & Decode By gzip: 19.3650279045
So base64 looks like a pretty solid choice. My old impression was that base64-encoded data grows by roughly 50%; in fact the overhead is fixed at 4 output bytes per 3 input bytes, so it's closer to 33%. Also note that timeit runs each snippet 1,000,000 times by default. (And since the test string has no repetition at all, the comparison is probably unfair to the compression-based options. XD)
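To sanity-check the size side of the story, here is a quick sketch (again my own, run separately; the printed numbers are whatever your interpreter reports, not figures from the original benchmark). base64 always emits 4 output bytes for every 3 input bytes, a fixed ~33% growth, while zlib and bz2 carry their own header and bookkeeping overhead, which on a short, non-repetitive string like the test data can make the "compressed" result larger than the input.

import base64
import bz2
import zlib

# Same non-repetitive test string as the benchmark above.
s = '1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()_+|'

print 'original:', len(s)
print 'base64  :', len(base64.b64encode(s))   # 4 bytes out per 3 bytes in
print 'zlib    :', len(zlib.compress(s))
print 'bz2     :', len(bz2.compress(s))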
References
- timeit — Measure execution time of small code snippets
- zlib — Compression compatible with gzip
- bz2 — Compression compatible with bzip2
- json — JSON encoder and decoder
- base64 — RFC 3548: Base16, Base32, Base64 Data Encodings