2021年6月12日 星期六

Node.js 筆記 - 使用 Puppeteer 監控瀏覽網頁的 Requests 跟 HTML Code @ macOS 11.4, node v14.14.0, npm 6.14.5

緣起於在追蹤網站架構時,每次都靠 chrome browser 開發人員工具去追蹤特定的 request 還好,若需要大量瀏覽找尋一些蛛絲馬跡時,就會很煩!以前很閒時,就會跳下去寫 chrome extension 或是熱衷於 C++ 時,就會寫一點點  Chromium Embedded Framework 來寫個小型瀏覽器,最後想起來試試看 Puppeteer 吧!

使用 Puppeteer 很輕鬆,除了是很常見的 Javascript 語言外,Puppeteer 套件提供了網頁端常見的事件偵測、所發的 Request 清單。就這樣拼拼湊湊即可完工:

(async () => {
const browser = await puppeteer.launch({headless: false});
//const page = await browser.newPage();
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
await page.emulate(device);
await page.setRequestInterception(true);

// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-event-domcontentloaded
// https://developer.mozilla.org/en-US/docs/Web/API/Window/DOMContentLoaded_event
//
// The DOMContentLoaded event fires when the initial HTML document has been completely loaded and parsed, without waiting for stylesheets, images, and subframes to finish loading.
//
page.on('domcontentloaded', () => {
console.log('on.domcontentloaded');
});

// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-event-domcontentloaded
// https://developer.mozilla.org/en-US/docs/Web/API/Window/load_event
//
// The load event is fired when the whole page has loaded, including all dependent resources such as stylesheets and images. This is in contrast to DOMContentLoaded, which is fired as soon as the page DOM has been loaded, without waiting for resources to finish loading.
//
page.on('load', async () => {
console.log('on.load');
watchTags(page, watch_tags);
});

// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-event-framenavigated
// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-class-frame
page.on('framenavigated', frame => {
console.log('on.framenavigated: '+frame.url());
});

// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-event-request
// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-class-httprequest
page.on('request', request => {
watchRequest(page.url(), request.url());
if (skip_resource_type[ request.resourceType() ])
request.abort();
else
request.continue();
});

// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-event-response
// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-class-httpresponse
page.on('response', response => {
//console.log('on.response: '+response.url());
});

while (true) {
if (url_init.length > 0) {
const target_url = url_init.shift();
await page.goto(target_url, {
waitUntil: 'networkidle0',
});
console.log('networkidle0');
}
if (url_init.length > 0)
await sleep(5000);
else
await sleep(50);
}
await browser.close();
})();

如此,搭配 command line 使用,就可很快的監控想要的 request 規則,或是 HTML DOM Tree 的變化,例如監控 <script> 的讀取位置在哪:

% node index.js -url "https://tw.yahoo.com" -tag "script.src"
...
on.load
[WATCH][TAG] script: src
Web URL: [https://tw.yahoo.com/]
[
  { src: 'https://s.yimg.com/wi/ytc.js' },
  {
    src: 'https://tw.mobi.yahoo.com/polyfill.min.js?features=array.isarray%2Carray.prototype.every%2Carray.prototype.foreach%2Carray.prototype.indexof%2Carray.prototype.map%2Cdate.now%2Cfunction.prototype.bind%2Cobject.keys%2Cstring.prototype.trim%2Cobject.defineproperty%2Cobject.defineproperties%2Cobject.create%2Cobject.freeze%2Carray.prototype.filter%2Carray.prototype.reduce%2Cobject.assign%2Cpromise%2Crequestanimationframe%2Carray.prototype.some%2Cobject.getownpropertynames%2Clocale-data-en-us%2Cintl%2Clocale-data-zh-hant-tw&version=2.1.23'
  },
  { src: 'https://s.yimg.com/aaq/yc/2.9.0/zh.js' },
  {
    src: 'https://s.yimg.com/ud/fp/js/vendor.174522d6d76b51858b93.min.js'
  },
  { src: 'https://s.yimg.com/ud/fp/js/common.js' },
  { src: 'https://s.yimg.com/oa/consent.js' },
  { src: 'https://consent.cmp.oath.com/cmpStub.min.js' },
  { src: 'https://consent.cmp.oath.com/cmp.js' },
  { src: 'https://s.yimg.com/ss/rapid3.js' },
  { src: 'https://s.yimg.com/rq/darla/boot.js' },
  {
    src: 'https://s.yimg.com/ud/fp/js/main.6622c6aeda051386afaa.min.js'
  },
  { src: 'https://mbp.yimg.com/sy/os/yaft/yaft-0.3.22.min.js' },
  {
    src: 'https://mbp.yimg.com/sy/os/yaft/yaft-plugin-aftnoad-0.1.3.min.js'
  },
  { src: 'https://s.yimg.com/aaq/vzm/perf-vitals_1.0.0.js' },
  {
    src: 'https://fc.yahoo.com/sdarla/php/client.php?dm=1&lang=zh-Hant-TW'
  },
  {
    src: 'https://s.yimg.com/aaq/c/e7fecc0.caas-abu_highlander.min.js'
  }
]
...

沒有留言:

張貼留言