Monday, September 27, 2010

[Python] Notes on MARC21 and ISO 2709

I have been working with library services lately, and the bibliographic lists they export under the hood use the MARC format, serialized as ISO 2709.



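It took a while, but I finally understand it. I suggest reading "Library of Congress >> MARC >> Authority >> Leader" side by side with "MARC的結構" (The Structure of MARC), because sometimes I honestly could not follow the Chinese!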


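In theory every library system supports MARC export. The exported data is in ISO 2709 format, which dates back to the days of magnetic tape. Exported MARC data looks like the unreadable example at the bottom of Wikipedia - ISO_2709, because MARC stands for MAchine-Readable Cataloging, not Human-Readable Cataloging. There is also a MARC XML format that is closer to something a person can read, but it is outside the scope of this post.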


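The background section of MARC的結構 explains that the data is a plain sequence of records, and that the first 24 bytes of each record effectively serve as the record-begin delimiter. Every MARC record has three parts: header (the leader) + directory + data. Bytes 12-16 of the header hold the base address of the data, that is, the combined length of the header and the directory, so they give you both the size of the directory and the position where the data starts. The directory is made of 12-byte entries, split into 3 bytes (field tag), 4 bytes (field length) and 5 bytes (starting position relative to the base address). The directory's total size is "a multiple of 12, plus 1", because a one-byte field terminator (0x1E) closes it. The details are in MARC的結構.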
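To make the layout concrete, here is a minimal sketch, written for Python 3.5+ rather than the Python 2 used in the rest of this post. It assembles a tiny record by hand and then decodes its header and directory. `build_record` and `parse_directory` are names made up for this illustration, and the fixed header characters (`nam a22`, `4500`) are simply typical MARC21 values, not taken from real data.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"


def build_record(fields):
    """Assemble a minimal ISO 2709 record from (tag, value) byte pairs.

    Illustrative helper only: real records also carry indicators and subfields.
    """
    directory = b""
    data = b""
    for tag, value in fields:
        body = value + FIELD_TERMINATOR
        # 12-byte entry: 3-byte tag, 4-digit length, 5-digit starting position
        directory += tag + b"%04d%05d" % (len(body), len(data))
        data += body
    directory += FIELD_TERMINATOR              # the "+ 1" after the 12-byte entries
    base_address = 24 + len(directory)         # header + directory
    record_length = base_address + len(data) + 1   # + record terminator
    header = (b"%05d" % record_length) + b"nam a22" + (b"%05d" % base_address) + b"   4500"
    return header + directory + data + RECORD_TERMINATOR


def parse_directory(record):
    """Return (base_address, [(tag, length, start), ...]) for one record."""
    header = record[:24]
    len_size = int(header[20:21])              # digits used for "length of field"
    pos_size = int(header[21:22])              # digits used for "starting position"
    base_address = int(header[12:17])
    entry_size = 3 + len_size + pos_size
    directory = record[24:base_address - 1]    # drop the directory's terminator
    entries = []
    for i in range(0, len(directory), entry_size):
        entry = directory[i:i + entry_size]
        tag = entry[:3].decode("ascii")
        length = int(entry[3:3 + len_size])
        start = int(entry[3 + len_size:])
        entries.append((tag, length, start))
    return base_address, entries


record = build_record([(b"001", b"12345"), (b"003", b"TEST")])
print(len(record))                 # -> 61
print(parse_directory(record))     # -> (49, [('001', 6, 0), ('003', 5, 6)])
```

Two entries give a directory of 2 x 12 + 1 = 25 bytes, so the data starts at 24 + 25 = 49, which is exactly what bytes 12-16 of the header say.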


Splitting the records:


def pre_process():
    target = 'marc_data'
    f = open( target , 'rb' )
    rec_cnt = 0
    total_size = 0
    # Ruler for reading header byte positions in the output below
    print("### 012345678901234567890123 ###")
    while True:
        header = f.read(24)
        total_size  = total_size + len( header )
        if not header:      # EOF
            break

        # Header bytes 0-4: total record size, header included
        record_size = int( header[0:5] )
        record_data = f.read( record_size - 24 )

        total_size  = total_size + len( record_data )
        rec_cnt = rec_cnt + 1

        print("--- %s --- %d" % ( header , record_size ))
        if False :          # set to True to dump each record to its own file
            o = open( '/tmp/marc.'+str(rec_cnt) , 'wb' )
            o.write( header )
            o.write( record_data )
            o.close()
        #print record_data

    print("Total: %d , Record Cnt: %d" % ( total_size , rec_cnt ))
    f.close()


Within the 24-byte header, the first five bytes record the size of the whole record (header included).
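The same splitting idea can be sketched as a generator, which also notices truncated input. This is only a sketch for Python 3.5+; `iter_records` and `fake_record` are hypothetical names, and the fake records contain just a length prefix, filler and a payload, which is enough to exercise the splitting logic but is not valid MARC.

```python
import io

HEADER_SIZE = 24


def iter_records(stream):
    """Yield each raw record (header included) from a binary stream."""
    while True:
        header = stream.read(HEADER_SIZE)
        if not header:
            return                              # clean end of input
        if len(header) < HEADER_SIZE:
            raise ValueError("truncated header")
        record_size = int(header[0:5])          # size of the record, header included
        rest = stream.read(record_size - HEADER_SIZE)
        if len(rest) != record_size - HEADER_SIZE:
            raise ValueError("truncated record")
        yield header + rest


def fake_record(payload):
    """Hypothetical test helper: 5-digit size, 19 filler bytes, then the payload."""
    return (b"%05d" % (HEADER_SIZE + len(payload))) + b" " * 19 + payload


blob = fake_record(b"first\x1d") + fake_record(b"second record\x1d")
sizes = [len(r) for r in iter_records(io.BytesIO(blob))]
print(sizes)    # -> [30, 38]
```

Reading from any file-like object keeps the function usable with both real files opened in 'rb' mode and in-memory test data.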


Parsing the header of a given record and returning the values of a given field:


def getFieldValue( rawdata , field = None , dictField = None ):
    if dictField is None:
        # First call for this record: parse the header and the directory
        header = rawdata[0:24]
        field_length = int( header[20:21] )         # digits in each "field length" (4 in MARC21)
        field_offset = int( header[21:22] )         # digits in each "starting position" (5 in MARC21)
        data_begin_offset = int( header[12:17] )    # base address of the data
        raw_field_info = rawdata[24:data_begin_offset - 1]      # skip field end delimiter
        entry_size = 3 + field_length + field_offset            # 12 in MARC21

        dictField = {}
        for i in range( 0 , len(raw_field_info) , entry_size ):
            begin = i
            end = i+3
            sub_field_name = raw_field_info[ begin : end ]

            begin = end
            end = begin + field_length
            sub_field_data_length = raw_field_info[ begin : end ]

            begin = end
            end = begin + field_offset
            sub_field_data_offset = raw_field_info[ begin : end ]

            # A tag can repeat, so keep a list of [ length , absolute offset ] per tag
            if sub_field_name not in dictField:
                dictField[ sub_field_name ] = []
            dictField[ sub_field_name ].append( [ int(sub_field_data_length) , int(sub_field_data_offset) + data_begin_offset ] )

    out = []
    if field is not None and field in dictField:
        #print dictField[field]
        for data_length_and_offset in dictField[field]:
            out.append( rawdata[ data_length_and_offset[1] : data_length_and_offset[0] + data_length_and_offset[1] ] )

    return ( out , dictField )


Usage:


tmp = None
value , tmp = getFieldValue( rawdata , '003' , tmp )
value , tmp = getFieldValue( rawdata , '005' , tmp )

...


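Here rawdata is the complete record, including all three parts (header + directory + data). value is a list, because a given field name may appear more than once in a record. tmp caches the parsed directory, so later calls on the same record skip the parsing and run faster.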
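The caching idea can be shown in isolation: parse the directory once into an index, then reuse it for every lookup. The following is a sketch for Python 3, not the code used in this post; `RECORD`, `build_index` and `lookup` are invented for the example, and the record is a hand-built sample with one 001 field and a repeated 650 field.

```python
from collections import defaultdict

# Hand-built sample record (made up for this sketch)
RECORD = (
    b"00076nam a2200061   4500"   # header: record size 76, base address 61
    b"001000400000"               # tag 001, length 4, start 0
    b"650000500004"               # tag 650, length 5, start 4
    b"650000500009"               # tag 650 again, length 5, start 9
    b"\x1e"                       # end of directory
    b"id1\x1e" b"cats\x1e" b"dogs\x1e"
    b"\x1d"                       # end of record
)


def build_index(record):
    """Parse the directory once: tag -> list of (absolute start, length)."""
    base = int(record[12:17])
    len_size = int(record[20:21])
    pos_size = int(record[21:22])
    entry_size = 3 + len_size + pos_size
    index = defaultdict(list)
    for i in range(24, base - 1, entry_size):
        tag = record[i:i + 3].decode("ascii")
        length = int(record[i + 3:i + 3 + len_size])
        start = int(record[i + 3 + len_size:i + entry_size])
        index[tag].append((base + start, length))
    return index


def lookup(record, index, tag):
    """Return every value stored under tag, without the trailing field terminator."""
    return [record[s:s + n].rstrip(b"\x1e") for s, n in index.get(tag, [])]


index = build_index(RECORD)             # parsed once...
print(lookup(RECORD, index, "650"))     # -> [b'cats', b'dogs']
print(lookup(RECORD, index, "001"))     # -> [b'id1']
print(lookup(RECORD, index, "999"))     # -> []
```

Returning a list even for a single match keeps the calling code the same whether a tag appears zero, one or many times.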


Wrapping it all in a class:


import re

class MARC( object ):
    def __init__ ( self , file_list=None ):
        # Files still waiting to be read, consumed from the front
        self.file_list = list(file_list) if file_list else []
        self.fd = None
        # Subfield: 0x1F delimiter, one code byte, then data up to the next 0x1E or 0x1F
        self.RE_FIELD_DATA = re.compile( '\x1f.([^\x1e\x1f]+)' )

    def get_raw_entries( self , cnt = None ):
        # Return up to cnt raw records; cnt of None or 0 means "read everything"
        out = []
        cnt = int(cnt) if cnt is not None else 0
        while True:
            if self.fd is None:
                if  self.file_list is None or len( self.file_list ) == 0 :
                    return out
                try:
                    self.fd = open( self.file_list[0] , 'rb' )
                    self.file_list = self.file_list[1:]
                except Exception as inst:
                    print(inst)
                    return out
            try:
                header = self.fd.read( 24 )

                if not header:  # EOF, move on to the next file
                    self.fd.close()
                    self.fd = None
                else:
                    record_size = int( header[0:5] )
                    record_data = self.fd.read( record_size - 24 )
                    out.append( header + record_data )
            except Exception as inst:
                print(inst)
                return out

            if cnt != 0 and len(out) == cnt:
                return out

    def get_field_value( self , rawdata , field , dictField = None ):
        if dictField is None:
            # First call for this record: parse the header and the directory
            header = rawdata[0:24]
            field_length = int( header[20:21] )
            field_offset = int( header[21:22] )
            data_begin_offset = int( header[12:17] )
            raw_field_info = rawdata[24:data_begin_offset - 1]      # skip field end delimiter
            entry_size = 3 + field_length + field_offset            # 12 in MARC21

            dictField = {}
            for i in range( 0 , len(raw_field_info) , entry_size ):
                begin = i
                end = i+3
                sub_field_name = raw_field_info[ begin : end ]

                begin = end
                end = begin + field_length
                sub_field_data_length = raw_field_info[ begin : end ]

                begin = end
                end = begin + field_offset
                sub_field_data_offset = raw_field_info[ begin : end ]

                if sub_field_name not in dictField:
                    dictField[ sub_field_name ] = []
                raw_value = [ int(sub_field_data_length) , int(sub_field_data_offset) + data_begin_offset ]
                dictField[ sub_field_name ].append( raw_value )

        out = []
        if field is not None and field in dictField:
            for data_length_and_offset in dictField[field]:
                out.append( rawdata[ data_length_and_offset[1] : data_length_and_offset[0] + data_length_and_offset[1] ] )

        return ( out , dictField )


Usage:


marc = MARC( [target_file] )

for rawdata in marc.get_raw_entries():
        tmp = None
        value , tmp = marc.get_field_value( rawdata , 'FIELD_ID' , tmp )
        if len(value) > 0:
                for raw in re.findall( marc.RE_FIELD_DATA , value[0] ):
                        print(raw)
                        break
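To see what RE_FIELD_DATA actually pulls out of a field, here is a small sketch for Python 3, where the pattern has to be a bytes pattern because the records are read in binary mode. The sample field is invented, and `RE_CODE_AND_DATA` is a variation of my own that also captures the one-byte subfield code. It is not part of the class above.

```python
import re

# Same pattern as RE_FIELD_DATA, written as bytes:
# 0x1F delimiter, one code byte, then the value up to the next 0x1E or 0x1F
RE_FIELD_DATA = re.compile(b"\x1f.([^\x1e\x1f]+)")
# Hypothetical variation that keeps the subfield code as well
RE_CODE_AND_DATA = re.compile(b"\x1f(.)([^\x1e\x1f]+)")

# Made-up title field: two indicator bytes, three subfields, field terminator
field = b"10\x1faA sample title :\x1fba subtitle /\x1fcSomeone.\x1e"

values = RE_FIELD_DATA.findall(field)
pairs = [(code.decode("ascii"), value) for code, value in RE_CODE_AND_DATA.findall(field)]

print(values)   # -> [b'A sample title :', b'a subtitle /', b'Someone.']
print(pairs)    # -> [('a', b'A sample title :'), ('b', b'a subtitle /'), ('c', b'Someone.')]
```

The two indicator bytes at the start of the field are skipped automatically, since the pattern only starts matching at the first 0x1F.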


One last note: there is actually a pymarc library you can use: http://pypi.python.org/pypi/pymarc/. What I needed to do is more or less finished, though, so I will not be using that lib.

