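I have been working with library services lately. The bibliographic lists they export use the MARC format underneath, serialized as ISO 2709. Related references: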
- Wikipedia - MARC_standards
- MARC 21 Format for Bibliographic Data: Table of Contents (Network Development and MARC Standards Office, Library of Congress)
- MARC的結構 (The Structure of MARC)
- Wikipedia - ISO_2709
- Library of Congress >> MARC >> Authority >> Leader
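It took some time, but I finally understood it. I suggest reading Library of Congress >> MARC >> Authority >> Leader side by side with MARC的結構, because at times I could not quite follow the Chinese text!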
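In theory every library system supports MARC export, and the exported data is in ISO 2709 format, which goes back to the days of magnetic tape backups. The exported data looks like the unreadable sample at the bottom of Wikipedia - ISO_2709; after all, MARC stands for MAchine-Readable Cataloging, not Human-Readable Cataloging. There is also a MARC XML format that is closer to human-readable, but it is outside the scope of this post.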
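From the history in MARC的結構, I learned that the data is a plain sequence of records, and that the first 24 bytes of each record effectively serve as the record-begin delimiter. Each MARC record has three parts: header + dictionary + data (the official MARC names are leader, directory, and variable fields). Bytes 12-16 of the header hold the offset at which the data begins, that is, the header size plus the dictionary size, so the data position can be read off directly. The dictionary is made of 12-byte entries, each split into 3 bytes (field name), 4 bytes (field length), and 5 bytes (starting offset), but the dictionary's total size is "a multiple of 12, plus 1", because a one-byte field terminator follows the last entry. See MARC的結構 for the details.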
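The layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: it is written for Python 3 (the listings in this post are Python 2), the `build_record` helper is hypothetical, and the two sample field values are made up instead of coming from a real catalog.

```python
FIELD_END = b'\x1e'   # terminates the dictionary and every field
RECORD_END = b'\x1d'  # terminates the whole record


def build_record(fields):
    """Hypothetical helper: pack (tag, value) pairs into one ISO 2709 record."""
    directory = b''
    data = b''
    for tag, value in fields:
        body = value + FIELD_END
        # 3-byte tag + 4-byte length + 5-byte offset = one 12-byte entry
        directory += tag + b'%04d%05d' % (len(body), len(data))
        data += body
    directory += FIELD_END               # the "+ 1" on top of the 12-byte entries
    base = 24 + len(directory)           # where the data part starts
    total = base + len(data) + len(RECORD_END)
    # bytes 0-4: record size, bytes 12-16: data start,
    # bytes 20-21: widths of the length and offset columns (4 and 5)
    header = b'%05d' % total + b'nam a22' + b'%05d' % base + b'   4500'
    return header + directory + data + RECORD_END


def parse_dictionary(record):
    """Return (tag, length, absolute offset) for every dictionary entry."""
    base = int(record[12:17])
    raw = record[24:base - 1]            # drop the trailing FIELD_END
    entries = []
    for i in range(0, len(raw), 12):
        tag = raw[i:i + 3].decode('ascii')
        length = int(raw[i + 3:i + 7])
        offset = int(raw[i + 7:i + 12]) + base
        entries.append((tag, length, offset))
    return entries


# Made-up sample data, for demonstration only.
record = build_record([(b'001', b'demo-0001'), (b'003', b'EXAMPLE')])
for tag, length, offset in parse_dictionary(record):
    # length counts the FIELD_END byte, so slice one byte short
    print(tag, record[offset:offset + length - 1])
```

Each stored length includes the one-byte field terminator, which is why the final slice stops one byte early.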
Splitting the file into records:
def pre_process():
    target = 'marc_data'
    f = open( target , 'rb' )
    rec_cnt = 0
    total_size = 0
    # column ruler for the 24-byte header printed below
    print "### 012345678901234567890123 ###"
    while True:
        header = f.read(24)
        total_size = total_size + len( header )
        if not header:
            break
        # first five bytes: size of the whole record, header included
        record_size = int( header[0:5] )
        record_data = f.read( record_size - 24 )
        total_size = total_size + len( record_data )
        rec_cnt = rec_cnt + 1
        print "---",header,"---",record_size
        if False :  # set to True to dump every record into its own file
            o = open( '/tmp/marc.'+str(rec_cnt) , 'wb' )
            o.write( header )
            o.write( record_data )
            o.close()
        #print record_data
    print "Total:",total_size,", Record Cnt:",rec_cnt
    f.close()
Within the 24-byte header, the first five bytes record the size of the record (header included).
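As a small illustration of that length prefix, the sketch below walks an in-memory buffer holding two records and slices each one off using only its first five bytes. It assumes Python 3; `iter_records` and `fake_record` are hypothetical names, and the sample records are placeholders that carry a correct length prefix but are not otherwise valid MARC.

```python
import io


def iter_records(stream):
    """Yield raw records from a binary stream, one per length prefix."""
    while True:
        header = stream.read(24)
        if not header:
            return                       # clean end of input
        if len(header) < 24:
            raise ValueError('truncated header')
        size = int(header[0:5])          # size includes the 24-byte header
        body = stream.read(size - 24)
        if len(body) != size - 24:
            raise ValueError('truncated record')
        yield header + body


def fake_record(payload):
    """Hypothetical stand-in: a 24-byte header followed by payload bytes."""
    size = 24 + len(payload)
    # only the length prefix is meaningful; the rest of the header is blank
    return b'%05d' % size + b' ' * 19 + payload


buffer = io.BytesIO(fake_record(b'first') + fake_record(b'second!'))
records = list(iter_records(buffer))
print(len(records), [len(r) for r in records])
```

Checking for short reads is a choice made for this sketch, so that a damaged file fails loudly instead of yielding a partial record.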
Parsing the header of a given record and returning the values of a specified field:
def getFieldValue( rawdata , field = None , dictField = None ):
    if dictField is None:
        # first call for this record: parse the dictionary once
        header = rawdata[0:24]
        field_length = int( header[20:21] )  # width of the length column
        field_offset = int( header[21:22] )  # width of the offset column
        data_begin_offset = int( header[12:17] )
        raw_field_info = rawdata[24:data_begin_offset - 1] # skip field end delimiter
        dictField = {}
        for i in range( 0 , len(raw_field_info) , 12 ):
            begin = i
            end = i+3
            sub_field_name = raw_field_info[ begin : end ]
            begin = end
            end = begin + field_length
            sub_field_data_length = raw_field_info[ begin : end ]
            begin = end
            end = begin + field_offset
            sub_field_data_offset = raw_field_info[ begin : end ]
            if sub_field_name not in dictField:
                dictField[ sub_field_name ] = []
            # store [ length , absolute offset within rawdata ]
            dictField[ sub_field_name ].append( [ int(sub_field_data_length) , int(sub_field_data_offset) + data_begin_offset ] )
    out = []
    if field is not None and field in dictField:
        #print dictField[field]
        for data_length_and_offset in dictField[field]:
            out.append( rawdata[ data_length_and_offset[1] : data_length_and_offset[0] + data_length_and_offset[1] ] )
    return ( out , dictField )
Usage:
tmp = None
value , tmp = getFieldValue( rawdata , '003' , tmp )
value , tmp = getFieldValue( rawdata , '005' , tmp )
...
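Here rawdata is the complete record, including all three parts: header + dictionary + data. value is an array, because a given field name may occur more than once. tmp caches the parsed dictionary so that it does not have to be rebuilt on every call, which saves work.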
Wrapping it in a class:
import re

class MARC( object ):
    def __init__ ( self , file_list=None ):
        self.file_list = file_list if file_list is not None and len(file_list) > 0 else []
        self.fd = None
        # subfield delimiter (0x1f) + one-character code, then the value
        self.RE_FIELD_DATA = re.compile( '\x1f.([^\x1e\x1f]+)' )

    def get_raw_entries( self , cnt = None ):
        # return up to cnt raw records ( cnt = None or 0 means all of them )
        out = []
        cnt = int(cnt) if cnt is not None else 0
        while True:
            if self.fd is None:
                if self.file_list is None or len( self.file_list ) == 0 :
                    return out
                try:
                    self.fd = open( self.file_list[0] , 'rb' )
                    self.file_list = self.file_list[1:]
                except Exception as inst:
                    print inst
                    return out
            try:
                header = self.fd.read( 24 )
                if not header: # EOF
                    self.fd.close()
                    self.fd = None
                else:
                    record_size = int( header[0:5] )
                    record_data = self.fd.read( record_size - 24 )
                    out.append( header + record_data )
            except Exception as inst:
                print inst
                return out
            if cnt != 0 and len(out) == cnt:
                return out

    def get_field_value( self , rawdata , field , dictField = None ):
        if dictField is None:
            # first call for this record: parse the dictionary once
            header = rawdata[0:24]
            field_length = int( header[20:21] )
            field_offset = int( header[21:22] )
            data_begin_offset = int( header[12:17] )
            raw_field_info = rawdata[24:data_begin_offset - 1] # skip field end delimiter
            dictField = {}
            for i in range( 0 , len(raw_field_info) , 12 ):
                begin = i
                end = i+3
                sub_field_name = raw_field_info[ begin : end ]
                begin = end
                end = begin + field_length
                sub_field_data_length = raw_field_info[ begin : end ]
                begin = end
                end = begin + field_offset
                sub_field_data_offset = raw_field_info[ begin : end ]
                if sub_field_name not in dictField:
                    dictField[ sub_field_name ] = []
                raw_value = [ int(sub_field_data_length) , int(sub_field_data_offset) + data_begin_offset ]
                dictField[ sub_field_name ].append( raw_value )
        out = []
        if field is not None and field in dictField:
            for data_length_and_offset in dictField[field]:
                out.append( rawdata[ data_length_and_offset[1] : data_length_and_offset[0] + data_length_and_offset[1] ] )
        return ( out , dictField )
Usage:
marc = MARC( [target_file] )
for rawdata in marc.get_raw_entries():
    tmp = None
    value , tmp = marc.get_field_value( rawdata , 'FIELD_ID' , tmp )
    if len(value) > 0:
        for raw in re.findall( marc.RE_FIELD_DATA , value[0] ):
            print raw
    break
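The regular expression in RE_FIELD_DATA relies on how a data field is laid out: after two indicator characters, each subfield starts with the delimiter 0x1F plus a one-character code, and the field ends with 0x1E. A minimal Python 3 sketch of that idea follows. The sample field content is made up, and `split_subfields` is a hypothetical helper that also keeps the subfield code, which the pattern in the class discards.

```python
import re

# 0x1F starts a subfield, the next byte is its code, and the value runs
# until the next subfield delimiter (0x1F) or the field terminator (0x1E).
SUBFIELD = re.compile(rb'\x1f(.)([^\x1e\x1f]+)')


def split_subfields(field):
    """Return (code, value) pairs found in one raw data field."""
    return [(code.decode('ascii'), value.decode('utf-8'))
            for code, value in SUBFIELD.findall(field)]


# Made-up field: two indicator bytes, subfields a and c, then the terminator.
sample = b'10\x1faAn example title\x1fcSomeone\x1e'
print(split_subfields(sample))
```

Decoding the values as UTF-8 is an assumption of this sketch; records in another character encoding would need a different codec.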
One last note: there is actually a pymarc library available at http://pypi.python.org/pypi/pymarc/, but what I needed to do is pretty much done already, so I will not be using that lib.