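A fellow hobbyist happened to ask how to deal with verification codes (CAPTCHAs), so I took the chance to brush up on scraping dynamic pages with node.js and Puppeteer: save the verification-code image, then let Tesseract recognize the text in it.
I had a rough idea of how optical character recognition (OCR) works but had never used it in practice, so this was a good excuse to try it: github.com/changyy/node-puppeteer-with-tesseract-tool
For now this is purely an exercise and not ready for real use, because Tesseract's recognition accuracy is still a problem; it is just a small tool to play with. The code only supports locating the <img> by its id (getElementById), then saves it as an image file on the fly and calls the tool to analyze it.
Since the image is then already saved locally, you can also keep re-running the tesseract command afterwards to try different parameters.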
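The flow described above can be sketched roughly as follows. This is not the repository's main.js; the function names parseArgs and saveCaptcha are made up for illustration, the output path /tmp/verify-code.png is borrowed from the tesseract command later in this post, and the `headless: "new"` option is the one the Puppeteer warning in the test log suggests (other Puppeteer versions may expect a different value).

```javascript
// Sketch only: hypothetical helpers, not the code from the repository.

// Read the two positional arguments shown in the usage line:
//   node main.js "WebURL" "ImageObjectID"
function parseArgs(argv) {
  const [url, imageId] = argv.slice(2);
  if (!url || !imageId) return null;
  return { url, imageId };
}

// Open the page, find the <img> by id, and save a screenshot of that element only.
async function saveCaptcha(url, imageId, outputPath = '/tmp/verify-code.png') {
  // Loaded lazily so parseArgs can be used without puppeteer installed.
  const puppeteer = require('puppeteer');
  // 'new' follows the deprecation warning; adjust for your Puppeteer version.
  const browser = await puppeteer.launch({ headless: 'new' });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Same lookup the post mentions: document.getElementById in the page context.
    const handle = await page.evaluateHandle(
      (id) => document.getElementById(id),
      imageId
    );
    const element = handle.asElement();
    if (!element) {
      throw new Error(`No element with id "${imageId}" found`);
    }

    // Capture the image exactly as rendered in this session.
    await element.screenshot({ path: outputPath });
    return outputPath;
  } finally {
    await browser.close();
  }
}
```

A caller would do something like `const args = parseArgs(process.argv)` and, when args is not null, `await saveCaptcha(args.url, args.imageId)`. Taking a screenshot of the element, rather than downloading its src again, is probably the safer choice here, since many sites generate a fresh code on every image request.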
Install:
% sudo port install tesseract tesseract-eng tesseract-chi-tra tesseract-chi-sim
% sw_vers
ProductName: macOS
ProductVersion: 13.5.1
BuildVersion: 22G90
% tesseract --version
tesseract 5.3.2
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5.1) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.11 : libwebp 1.3.1 : libopenjp2 2.5.0
Found SSE4.1
Found libarchive 3.6.2 zlib/1.2.11 liblzma/5.4.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.4
Found libcurl/8.1.2 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.51.0
Usage:
% nvm use v20
Now using node v20.5.1 (npm v9.8.0)
% npm install
% npm run main
> main
> node main.js
Usage> node /private/tmp/node-puppeteer-with-tesseract-tool/main.js "WebURL" "ImageObjectID"
% node main.js
Usage> node /private/tmp/node-puppeteer-with-tesseract-tool/main.js "WebURL" "ImageObjectID"
Test:
% node main.js 'https://xxx/login' 'verifyImgCode'
[INFO] WebURL: "https://xxx/login", The id of the DOM <img>: "verifyImgCode"
Puppeteer old Headless deprecation warning:
  In the near future `headless: true` will default to the new Headless mode for Chrome instead of the old Headless implementation.
  For more information, please see https://developer.chrome.com/articles/new-headless/.
  Consider opting in early by passing `headless: "new"` to `puppeteer.launch()`
  If you encounter any bugs, please report them to https://github.com/puppeteer/puppeteer/issues/new/choose.
on.domcontentloaded
on.framenavigated: about:blank
on.load
...
page.screenshot
browser.close
tesseract.recognize result: 058B47
Tuning accuracy:
% tesseract /tmp/verify-code.png stdout -c tessedit_char_whitelist=0123456789
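To repeat that kind of experiment from node.js, a small wrapper around the tesseract command can run several parameter sets against the same saved image. The sketch below is an assumption about how one might do it, not how the repository calls Tesseract; buildTesseractArgs, recognize, and compareVariants are hypothetical names. It only uses CLI options that tesseract itself documents: -l for the language, --psm for the page segmentation mode, and -c for config variables such as tessedit_char_whitelist.

```javascript
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Build the argument list for: tesseract <image> stdout [options...]
function buildTesseractArgs(imagePath, { lang, psm, whitelist } = {}) {
  const args = [imagePath, 'stdout'];
  if (lang) args.push('-l', lang);
  if (psm !== undefined) args.push('--psm', String(psm));
  if (whitelist) args.push('-c', `tessedit_char_whitelist=${whitelist}`);
  return args;
}

// Run the tesseract CLI once and return the recognized text without
// surrounding whitespace. Requires tesseract to be on the PATH.
async function recognize(imagePath, options) {
  const { stdout } = await execFileAsync(
    'tesseract',
    buildTesseractArgs(imagePath, options)
  );
  return stdout.trim();
}

// Try each option set in turn against the same image and collect the results.
async function compareVariants(imagePath, variants) {
  const results = [];
  for (const options of variants) {
    results.push({ options, text: await recognize(imagePath, options) });
  }
  return results;
}
```

For example, `compareVariants('/tmp/verify-code.png', [{ whitelist: '0123456789' }, { psm: 7, whitelist: '0123456789' }, { psm: 8, lang: 'eng' }])` would compare the digits-only whitelist with "single text line" (--psm 7) and "single word" (--psm 8) segmentation. Note that a digits-only whitelist cannot produce a result containing letters, so if the codes mix letters and digits the whitelist has to include them too.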