A work task required comparing data across Google Sheets, and the simplest approach is to export the sheets and compare them with git diff. So I tried having AI generate a small Python tool: given a Google Sheets URL, or the key identifier embedded in that URL (referred to here as the spreadsheet id), it downloads the data of every sheet in that Google Spreadsheet.
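Under the hood, each sheet is pulled from the spreadsheet's CSV export endpoint, which needs both the spreadsheet id and a per-sheet gid. A minimal sketch of a single-sheet download follows; the spreadsheet_id and gid values are placeholders (the first sheet's gid is typically 0):

```
import requests

# Placeholder values for illustration only.
spreadsheet_id = "XXXXXXXXXXXXXXXXXXXXXXX"
gid = "0"  # the first sheet's gid is typically 0

# Export a single sheet as CSV via the spreadsheet's export endpoint.
csv_url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?format=csv&gid={gid}"
response = requests.get(csv_url)
response.raise_for_status()

with open("sheet.csv", "wb") as f:
    f.write(response.content)
```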
The catch: to download a specific sheet you first have to know each sheet's gid. I went back and forth with ChatGPT-4o and Claude.ai on this for a long time, including uploading the static HTML code, and they still couldn't crack it, so in the end I jumped in and hand-crafted the last part myself. The idea (a minimal sketch follows the list):
- First, fetch the HTML code of the spreadsheet page
- Pull out each Sheet Name via the docs-sheet-tab-caption markup
- The var bootstrapData = {...}; blob in the page contains the Sheet Name and gid data
- Then locate each gid by searching for the [0,0,\"gid\",[ pattern
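Here is a minimal sketch of that extraction, using the same regular expressions and string patterns as the full script further down. The function name extract_sheet_gids is just illustrative, and html_content is assumed to already hold the downloaded page source:

```
import re

def extract_sheet_gids(html_content):
    # Sheet names appear in the tab captions of the rendered page.
    sheet_names = re.findall(r'docs-sheet-tab-caption[^>]+>([^<]+)</div>', html_content)

    # bootstrapData is a JS object embedded in the page; it maps each sheet to its gid.
    config_match = re.search(r'var bootstrapData\s*=\s*({.*?});', html_content, re.DOTALL)
    if not config_match:
        return {}
    config_str = config_match.group(1)

    sheet_info = {}
    for index, sheet_name in enumerate(sheet_names):
        # Inside bootstrapData each sheet shows up as [index,0,\"gid\",[ ...
        begin_pattern = f'[{index},0,\\"'
        end_pattern = '\\",['
        begin = config_str.find(begin_pattern)
        end = config_str.find(end_pattern, begin + len(begin_pattern))
        if begin != -1 and end != -1:
            sheet_info[sheet_name] = config_str[begin + len(begin_pattern):end]
    return sheet_info
```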
```
% python3 main.py
usage: main.py [-h] (--google-spreadsheet-url GOOGLE_SPREADSHEET_URL | --google-spreadsheet-id GOOGLE_SPREADSHEET_ID) [--output OUTPUT]
main.py: error: one of the arguments --google-spreadsheet-url --google-spreadsheet-id is required

% python3 main.py --google-spreadsheet-id 'XXXXXXXXXXXXXXXXXXXXXXX'
[INFO] Downloaded sheet: sheet01 to sheets_csv/sheet01.csv
[INFO] Downloaded sheet: sheet02 to sheets_csv/sheet02.csv
[INFO] Downloaded sheet: sheet03 to sheets_csv/sheet03.csv
[INFO] Downloaded sheet: sheet04 to sheets_csv/sheet04.csv
[INFO] Downloaded sheet: sheet05 to sheets_csv/sheet05.csv
```
```
% cat main.py
import argparse
import re
import requests
import json
import os


def extract_spreadsheet_id(url):
    match = re.search(r'/d/([a-zA-Z0-9-_]+)', url)
    return match.group(1) if match else None


def get_spreadsheet_info(spreadsheet_id):
    url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/edit"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        html_content = response.text

        # Extract sheet names
        sheet_names = re.findall(r'docs-sheet-tab-caption[^>]+>([^<]+)</div>', html_content)
        #print(f"Found sheet names: {sheet_names}")

        # Extract mergedConfig
        config_match = re.search(r'var bootstrapData\s*=\s*({.*?});', html_content, re.DOTALL)
        if config_match:
            config_str = config_match.group(1)
            sheet_info = {}
            try:
                for index, sheet_name in enumerate(sheet_names):
                    #print(f"Processing sheet: {sheet_name}, index: {index}")
                    beginPattern = f'[{index},0,\\"'
                    endPattern = f'\\",['
                    beginIndex = config_str.find(beginPattern)
                    endIndex = config_str.find(endPattern, beginIndex)
                    gidValue = config_str[beginIndex + len(beginPattern):endIndex]
                    sheet_info[sheet_name] = gidValue
                return sheet_info
            except Exception as e:
                print(f"[INFO] Error extracting sheet information: {e}")
                return None
        else:
            print("[INFO] Could not find bootstrapData in the HTML content")
            return None
    except requests.RequestException as e:
        print(f"[INFO] Error fetching the spreadsheet: {e}")
        return None


def download_sheet_as_csv(spreadsheet_id, sheet_name, gid, output_folder):
    csv_url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?format=csv&gid={gid}"
    csv_response = requests.get(csv_url)
    if csv_response.status_code == 200:
        os.makedirs(output_folder, exist_ok=True)  # make sure the output directory exists
        output_path = os.path.join(output_folder, f"{sheet_name}.csv")
        with open(output_path, 'wb') as f:
            f.write(csv_response.content)
        print(f"[INFO] Downloaded sheet: {sheet_name} to {output_path}")
    else:
        print(f"[INFO] Failed to download sheet: {sheet_name}. Status code: {csv_response.status_code}")


def main():
    parser = argparse.ArgumentParser(description="Extract Google Spreadsheet information")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--google-spreadsheet-url", help="Google Spreadsheet URL")
    group.add_argument("--google-spreadsheet-id", help="Google Spreadsheet ID")
    parser.add_argument('--output', type=str, default='sheets_csv', help='The directory to save the CSV files')
    args = parser.parse_args()

    if args.google_spreadsheet_url:
        spreadsheet_id = extract_spreadsheet_id(args.google_spreadsheet_url)
    else:
        spreadsheet_id = args.google_spreadsheet_id

    if not spreadsheet_id:
        print("[INFO] Invalid Google Spreadsheet URL or ID")
        return

    sheet_info = get_spreadsheet_info(spreadsheet_id)
    if sheet_info:
        for name, gid in sheet_info.items():
            download_sheet_as_csv(spreadsheet_id, name, gid, args.output)
    else:
        print("[INFO] Failed to extract sheet information")


if __name__ == "__main__":
    main()
```