Friday, August 23, 2024

Python Dev Notes - Downloading public Google Sheets data without a Google API key, exporting every sheet as CSV

I had a work task that required comparing Google Sheets data. The simplest approach is to export the sheets and then compare them with git diff, so I tried having AI generate a small Python tool: given a Google Sheets URL, or the key identifier embedded in that URL (referred to here as the spreadsheet id), it downloads the data from every sheet in that Google Spreadsheet.
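The core trick is that a public spreadsheet can be exported as CSV straight from the export endpoint, no API key needed. A minimal sketch of pulling the id out of a URL and building that endpoint (the sample URL and id below are made-up placeholders; the regex and export URL format are the same ones used in main.py further down):

```python
import re

def extract_spreadsheet_id(url):
    # The spreadsheet id is the path segment right after /d/ in a Sheets URL
    match = re.search(r'/d/([a-zA-Z0-9-_]+)', url)
    return match.group(1) if match else None

def export_csv_url(spreadsheet_id, gid):
    # Public sheets can be fetched from this endpoint without any credentials
    return (f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}"
            f"/export?format=csv&gid={gid}")

# "1AbC-dEf_123" is a placeholder id, not a real spreadsheet
sid = extract_spreadsheet_id(
    "https://docs.google.com/spreadsheets/d/1AbC-dEf_123/edit#gid=0")
print(export_csv_url(sid, 0))
# → https://docs.google.com/spreadsheets/d/1AbC-dEf_123/export?format=csv&gid=0
```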

However, downloading a specific sheet requires knowing each sheet's gid. I went back and forth with ChatGPT-4o and Claude.ai for ages without a solution, including uploading the static HTML code; in the end I stepped in myself and hand-crafted the last piece. The idea:

  • First, fetch the page's HTML code
  • Extract each sheet name via the docs-sheet-tab-caption class
  • Locate `var bootstrapData = {...};`, which contains both the sheet names and their gids
  • Then use the pattern `[0,0,\"gid\",[` to pinpoint each gid
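The steps above can be sketched offline against a hand-made HTML snippet (the snippet, sheet names, and gid values here are invented for illustration; the real page is fetched with requests in main.py):

```python
import re

# Fake stand-in for the fetched HTML; note bootstrapData stores gids
# inside JSON-escaped strings, hence the literal \" sequences.
html = (
    '<div class="docs-sheet-tab-caption">sheet01</div>'
    '<div class="docs-sheet-tab-caption">sheet02</div>'
    '<script>var bootstrapData = '
    '{"x":"[0,0,\\"0\\",[","y":"[1,0,\\"123456789\\",["};'
    '</script>'
)

# Step 2: sheet names from the tab captions
sheet_names = re.findall(r'docs-sheet-tab-caption[^>]+>([^<]+)</div>', html)

# Step 3: the bootstrapData blob
config = re.search(r'var bootstrapData\s*=\s*({.*?});', html, re.DOTALL).group(1)

# Step 4: the n-th sheet's gid sits between [n,0,\" and \",[
gids = {}
for index, name in enumerate(sheet_names):
    begin = f'[{index},0,\\"'
    start = config.find(begin) + len(begin)
    end = config.find('\\",[', start)
    gids[name] = config[start:end]

print(gids)  # → {'sheet01': '0', 'sheet02': '123456789'}
```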
In action:

```
% python3 main.py 
usage: main.py [-h] (--google-spreadsheet-url GOOGLE_SPREADSHEET_URL | --google-spreadsheet-id GOOGLE_SPREADSHEET_ID) [--output OUTPUT]
main.py: error: one of the arguments --google-spreadsheet-url --google-spreadsheet-id is required

% python3 main.py --google-spreadsheet-id 'XXXXXXXXXXXXXXXXXXXXXXX'
[INFO] Downloaded sheet: sheet01 to sheets_csv/sheet01.csv
[INFO] Downloaded sheet: sheet02 to sheets_csv/sheet02.csv
[INFO] Downloaded sheet: sheet03 to sheets_csv/sheet03.csv
[INFO] Downloaded sheet: sheet04 to sheets_csv/sheet04.csv
[INFO] Downloaded sheet: sheet05 to sheets_csv/sheet05.csv
```


The code:

```
% cat main.py 
import argparse
import re
import requests
import json
import os

def extract_spreadsheet_id(url):
    match = re.search(r'/d/([a-zA-Z0-9-_]+)', url)
    return match.group(1) if match else None

def get_spreadsheet_info(spreadsheet_id):
    url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/edit"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        html_content = response.text
        
        # Extract sheet names
        sheet_names = re.findall(r'docs-sheet-tab-caption[^>]+>([^<]+)</div>', html_content)
        #print(f"Found sheet names: {sheet_names}")

        # Extract the bootstrapData blob, which maps sheet names to gids
        config_match = re.search(r'var bootstrapData\s*=\s*({.*?});', html_content, re.DOTALL)
        if config_match:
            config_str = config_match.group(1)
            sheet_info = {}
            try:
                for index, sheet_name in enumerate(sheet_names):
                    #print(f"Processing sheet: {sheet_name}, index: {index}")
                    # The n-th sheet's gid sits between [n,0,\" and \",[
                    beginPattern = f'[{index},0,\\"'
                    endPattern = '\\",['
                    beginIndex = config_str.find(beginPattern)
                    endIndex = config_str.find(endPattern, beginIndex)
                    if beginIndex == -1 or endIndex == -1:
                        print(f"[INFO] Could not locate gid for sheet: {sheet_name}")
                        continue
                    gidValue = config_str[beginIndex + len(beginPattern):endIndex]
                    sheet_info[sheet_name] = gidValue
                return sheet_info
            except Exception as e:
                print(f"[INFO] Error extracting sheet information: {e}")
                return None
        else:
            print("[INFO] Could not find bootstrapData in the HTML content")
            return None
    except requests.RequestException as e:
        print(f"[INFO] Error fetching the spreadsheet: {e}")
        return None

def download_sheet_as_csv(spreadsheet_id, sheet_name, gid, output_folder):
    csv_url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?format=csv&gid={gid}"
    os.makedirs(output_folder, exist_ok=True)
    csv_response = requests.get(csv_url, timeout=30)
    
    if csv_response.status_code == 200:
        output_path = os.path.join(output_folder, f"{sheet_name}.csv")
        with open(output_path, 'wb') as f:
            f.write(csv_response.content)
        print(f"[INFO] Downloaded sheet: {sheet_name} to {output_path}")
    else:
        print(f"[INFO] Failed to download sheet: {sheet_name}. Status code: {csv_response.status_code}")

def main():
    parser = argparse.ArgumentParser(description="Extract Google Spreadsheet information")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--google-spreadsheet-url", help="Google Spreadsheet URL")
    group.add_argument("--google-spreadsheet-id", help="Google Spreadsheet ID")
    parser.add_argument('--output', type=str, default='sheets_csv', help='The directory to save the CSV files')
    args = parser.parse_args()

    if args.google_spreadsheet_url:
        spreadsheet_id = extract_spreadsheet_id(args.google_spreadsheet_url)
    else:
        spreadsheet_id = args.google_spreadsheet_id

    if not spreadsheet_id:
        print("[INFO] Invalid Google Spreadsheet URL or ID")
        return

    sheet_info = get_spreadsheet_info(spreadsheet_id)
    if sheet_info:
        for name, gid in sheet_info.items():
            download_sheet_as_csv(spreadsheet_id, name, gid, args.output)
    else:
        print("[INFO] Failed to extract sheet information")

if __name__ == "__main__":
    main()
```
