Thursday, September 16, 2010

[Python] Encoding performance test

I want to do some MapReduce work, probably by trying Hadoop Streaming, so I need to get my data into a line-based format. That led to the question of compressing the data as well, so I wanted to test which of these fits best (a sketch of the line-based encoding idea follows the list):



  • base64

  • json

  • bz2

  • gzip
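For context: Hadoop Streaming passes records as one line each over stdin/stdout and splits key from value on a tab, so a payload that may contain newlines or tabs has to be wrapped into a line-safe form first. A minimal sketch of that round trip, with a made-up record value (Python 2, to match the benchmark below):

import base64

# A raw value may contain tabs/newlines, which would break Hadoop
# Streaming's line- and tab-based framing, so base64-wrap it first.
value = 'some payload\nwith a newline\tand a tab'
line = 'key1\t' + base64.b64encode(value)  # one safe line of mapper output
key, encoded = line.split('\t', 1)         # the next stage splits it back
assert base64.b64decode(encoded) == value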


Although I already had a rough idea of the outcome, it's worth measuring anyway:


#!/usr/bin/env python

from timeit import Timer
import json
import base64
import bz2
import zlib

s = '1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()_+|'

def do_base64():
    # Round trip: encode to a newline-free ASCII form, then decode.
    encoded = base64.b64encode(s)
    decoded = base64.b64decode(encoded)

def do_json():
    encoded = json.dumps(s)
    decoded = json.loads(encoded)

def do_bz2():
    encoded = bz2.compress(s)
    decoded = bz2.decompress(encoded)

def do_gzip():
    # zlib uses the same DEFLATE algorithm as gzip, minus the file header.
    encoded = zlib.compress(s)
    decoded = zlib.decompress(encoded)

if __name__ == '__main__':
    for name in ('base64', 'json', 'bz2', 'gzip'):
        t = Timer('do_%s()' % name, 'from __main__ import do_%s' % name)
        try:
            # timeit() runs the statement 1,000,000 times by default.
            print 'Encode & Decode By %s: %s' % (name, t.timeit())
        except:
            # Timer.print_exc() shows the traceback with the timed source.
            t.print_exc()


On an AMD X4 955 with 4 GB of DDR3-1200, running Ubuntu 10.04 i386:


$ python t.py
Encode & Decode By base64: 2.40118098259
Encode & Decode By json: 12.9051868916
Encode & Decode By bz2: 105.709769011
Encode & Decode By gzip: 19.3650279045


Looks like base64 is a pretty good choice after all. My old impression was that it inflates data by about 50%, but base64 output is actually only about 33% larger than the input (4 output bytes for every 3 input bytes). Also, note that timeit runs the statement 1,000,000 times by default. (Since the test string has no repetition, the comparison is probably unfair to the compression-based options. XD)
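For the size side of the trade-off, a quick check on the same test string s (again Python 2; the exact byte counts depend on the input, so treat this as a sketch):

import base64
import json
import bz2
import zlib

s = '1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()_+|'
print 'raw   : %3d bytes' % len(s)
print 'base64: %3d bytes' % len(base64.b64encode(s))  # ~4/3 of the input
print 'json  : %3d bytes' % len(json.dumps(s))        # input plus quoting/escaping
print 'bz2   : %3d bytes' % len(bz2.compress(s))      # fixed header overhead
print 'zlib  : %3d bytes' % len(zlib.compress(s))     # stream header/checksum overhead

On a short, non-repetitive string like this, the compressors can actually make the data bigger, which is exactly the unfairness mentioned above.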

