2021年6月12日 星期六

Node.js 筆記 - 使用 Puppeteer 監控瀏覽網頁的 Requests 跟 HTML Code @ macOS 11.4, node v14.14.0, npm 6.14.5

緣起於在追蹤網站架構時,每次都靠 chrome browser 開發人員工具去追蹤特定的 request 還好,若需要大量瀏覽找尋一些蛛絲馬跡時,就會很煩!以前很閒時,就會跳下去寫 chrome extension 或是熱衷於 C++ 時,就會寫一點點  Chromium Embedded Framework 來寫個小型瀏覽器,最後想起來試試看 Puppeteer 吧!

使用 Puppeteer 很輕鬆,除了是很常見的 Javascript 語言外,Puppeteer 套件提供了網頁端常見的事件偵測、所發的 Request 清單。就這樣拼拼湊湊即可完工:

(async () => {
const browser = await puppeteer.launch({headless: false});
//const page = await browser.newPage();
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
await page.emulate(device);
await page.setRequestInterception(true);

// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-event-domcontentloaded
// https://developer.mozilla.org/en-US/docs/Web/API/Window/DOMContentLoaded_event
//
// The DOMContentLoaded event fires when the initial HTML document has been completely loaded and parsed, without waiting for stylesheets, images, and subframes to finish loading.
//
page.on('domcontentloaded', () => {
console.log('on.domcontentloaded');
});

// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-event-domcontentloaded
// https://developer.mozilla.org/en-US/docs/Web/API/Window/load_event
//
// The load event is fired when the whole page has loaded, including all dependent resources such as stylesheets and images. This is in contrast to DOMContentLoaded, which is fired as soon as the page DOM has been loaded, without waiting for resources to finish loading.
//
page.on('load', async () => {
console.log('on.load');
watchTags(page, watch_tags);
});

// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-event-framenavigated
// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-class-frame
page.on('framenavigated', frame => {
console.log('on.framenavigated: '+frame.url());
});

// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-event-request
// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-class-httprequest
page.on('request', request => {
watchRequest(page.url(), request.url());
if (skip_resource_type[ request.resourceType() ])
request.abort();
else
request.continue();
});

// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-event-response
// https://pptr.dev/#?product=Puppeteer&version=v10.0.0&show=api-class-httpresponse
page.on('response', response => {
//console.log('on.response: '+response.url());
});

while (true) {
if (url_init.length > 0) {
const target_url = url_init.shift();
await page.goto(target_url, {
waitUntil: 'networkidle0',
});
console.log('networkidle0');
}
if (url_init.length > 0)
await sleep(5000);
else
await sleep(50);
}
await browser.close();
})();

如此,搭配 command line 使用,就可很快的監控想要的 request 規則,或是 HTML DOM Tree 的變化,例如監控 <script> 的讀取位置在哪:

% node index.js -url "https://tw.yahoo.com" -tag "script.src"
...
on.load
[WATCH][TAG] script: src
Web URL: [https://tw.yahoo.com/]
[
  { src: 'https://s.yimg.com/wi/ytc.js' },
  {
    src: 'https://tw.mobi.yahoo.com/polyfill.min.js?features=array.isarray%2Carray.prototype.every%2Carray.prototype.foreach%2Carray.prototype.indexof%2Carray.prototype.map%2Cdate.now%2Cfunction.prototype.bind%2Cobject.keys%2Cstring.prototype.trim%2Cobject.defineproperty%2Cobject.defineproperties%2Cobject.create%2Cobject.freeze%2Carray.prototype.filter%2Carray.prototype.reduce%2Cobject.assign%2Cpromise%2Crequestanimationframe%2Carray.prototype.some%2Cobject.getownpropertynames%2Clocale-data-en-us%2Cintl%2Clocale-data-zh-hant-tw&version=2.1.23'
  },
  { src: 'https://s.yimg.com/aaq/yc/2.9.0/zh.js' },
  {
    src: 'https://s.yimg.com/ud/fp/js/vendor.174522d6d76b51858b93.min.js'
  },
  { src: 'https://s.yimg.com/ud/fp/js/common.js' },
  { src: 'https://s.yimg.com/oa/consent.js' },
  { src: 'https://consent.cmp.oath.com/cmpStub.min.js' },
  { src: 'https://consent.cmp.oath.com/cmp.js' },
  { src: 'https://s.yimg.com/ss/rapid3.js' },
  { src: 'https://s.yimg.com/rq/darla/boot.js' },
  {
    src: 'https://s.yimg.com/ud/fp/js/main.6622c6aeda051386afaa.min.js'
  },
  { src: 'https://mbp.yimg.com/sy/os/yaft/yaft-0.3.22.min.js' },
  {
    src: 'https://mbp.yimg.com/sy/os/yaft/yaft-plugin-aftnoad-0.1.3.min.js'
  },
  { src: 'https://s.yimg.com/aaq/vzm/perf-vitals_1.0.0.js' },
  {
    src: 'https://fc.yahoo.com/sdarla/php/client.php?dm=1&lang=zh-Hant-TW'
  },
  {
    src: 'https://s.yimg.com/aaq/c/e7fecc0.caas-abu_highlander.min.js'
  }
]
...

2021年6月4日 星期五

[Linux] 升級 Red Hat Enterprise Server 的 OpenSSL, OpenSSH 服務 @ Red Hat Enterprise Linux Server release 5.3

這是一台滿舊的機器,大概 10 年前吧。同事想要 git clone 時,出現了失敗問題,追蹤一下是機器的 openssl 過舊,由於機器有他的任務在,不好意思亂更新系統資源,怕爆!所以採用 tarball 的安裝方式。

openssl 太舊時,連用 wget/curl 去下載 https 來源時都會失敗的,只好靠 http 或是 scp 搬東西進去。

$ lsb_release -a
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Release: 5.3
Codename: Tikanga

$ openssl version
OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008

只好挑個 /opt/tarball 安裝啦,沒想到要編譯 openssl 又會碰到 perl 版本不夠新:

$ wget http://www.cpan.org/src/5.0/perl-5.10.0.tar.gz
$ cd ~/perl-5.10.0
$ ./configure.gnu --prefix=/opt/tarball
$ make -j4 install

接著再安裝個 zlib ,採用 static 方式方便後續大家使用:

$ wget http://zlib.net/zlib-1.2.11.tar.gz
$ cd ~/zlib-1.2.11
$ ./configure --prefix=/opt/tarball --static
$ make install

編譯 openssl:

$ wget http://www.openssl.org/source/openssl-1.1.1k.tar.gz 
$ cd ~/openssl-OpenSSL_1_1_1k 
$ PATH=/opt/tarball/bin:$PATH ./config --prefix=/opt/tarball
$ make -j4 install

編譯 openssh 工具包:https://github.com/openssh/openssh-portable/releases

$ cd ~/openssh-portable-V_8_6_P1
$ autoreconf
$ LD_LIBRARY_PATH=/opt/tarball/lib:$LD_LIBRARY_PATH ./configure --prefix=/opt/tarball/ --with-ssl-dir=/opt/tarball
$ LD_LIBRARY_PATH=/opt/tarball/lib:$LD_LIBRARY_PATH make -j4 install
$ file /opt/tarball/bin/ssh
/opt/tarball/bin/ssh: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.9, stripped

接著偷懶拉 code 方式:

$ GIT_SSH_COMMAND="LD_LIBRARY_PATH=/opt/tarball/lib:$LD_LIBRARY_PATH /opt/tarball/bin/ssh" git clone ...

或是在環境變數添加好:

LD_LIBRARY_PATH=/opt/tarball/lib:$LD_LIBRARY_PATH 
PATH=/opt/tarball/bin:$PATH

搞定!

2021年5月9日 星期日

[韓劇] 認識的妻子

前幾天想說好久沒用愛奇藝,上去逛了一下,想說看一下「消失的情人節」,結果幾分鐘被朋友提醒可以去 NETFLIX 看高畫質 XD 就這樣又遠離了愛奇藝。倒是在 NETFLIX 看完後,又被推薦服務推坑「認識的妻子」,主因是去年底看過韓劇「Start-Up」,內裡的另一位反派恰好也在這齣戲裡演出。

這是一部 2018 年初的韓劇。這故事跟那時很夯的穿越劇有點關係,但真正吸引到我的,卻是第一集描述雙薪家庭養小孩的困境,真的養了小孩更有感的,感受到劇情安排的那種生活壓力。

不知是不是年紀過 35 的關係,對於大叔熟女演的戲都特別有興趣看完 XD 就這樣花個一天追了 12 集,完食!隨後逛了一下男女主角的 IG ,才發現現實的靜態的照片反而吸引不了我,看來還是劇情關係,讓人喜歡笑起來古靈精怪的女主角,且我不太愛學生時代的裝扮,反而是 OL 時期的精明搭配特有的笑容令人印象深刻。

此時會觀看這齣戲也滿有緣分的,像是幾天前聽股癌 Podcast 回憶起在軍中看了 about time ,讓我回想起對時間的印象,像是電影裡提到小孩出生後,就不適合回到過去,避免改變到未來的小孩狀態。接著恰逢表弟的第一個小孩來到世上,為了體貼老婆,包括月子中心住滿30天、放棄親餵等規劃,著實已把生活的目標跟品質都規劃好也定位好,以及另一位年收稅率達 20% 級距的網路名人在哀哀叫初乳擠得多痛苦和 20% 後就沒了眾多社會福利,政府這稅務規劃到底是助生還阻生 XD

總總點滴,都讓人好好回憶著家庭、小孩、婆媳、雙薪、工作的天秤究竟要怎樣平衡著,就這樣一口氣就看完了這戲!

2021年4月22日 星期四

[PHP] 下載 Maxmind GeoIP Legacy Databases 和 ngx_http_geoip_module 相關處理 @ macOS 11.2, PHP 7.3.24

由於 Maxmind 在推 GeoIP2 ,不開放 .dat 的 GeoIP DB 了,以前下載位置:

  • http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
  • http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
現在官方都說要改用 GeoLite2 DB ,其格式:
  • GeoIP2 Binary (.mmdb)
  • GeoIP2 CSV 
目前先偷懶不處理 nginx ,因為 nginx 採用 GeoIP Legacy Databases。

有找到一個網站 https://www.miyuru.lk/geoiplegacy,從中下載:
連續動作:

% php -v
WARNING: PHP is not recommended
PHP is included in macOS for compatibility with legacy software.
Future versions of macOS will not include PHP.
PHP 7.3.24-(to be removed in future macOS) (cli) (built: Dec 21 2020 21:33:25) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.3.24, Copyright (c) 1998-2018 Zend Technologies

% php composer.phar require geoip/geoip:~1.16

% wget https://dl.miyuru.lk/geoip/maxmind/city/maxmind4.dat.gz
% gunzip -d maxmind4.dat.gz

% cat test-ip-via-geoip.php
<?php
require 'vendor/autoload.php';
$gi = geoip_open("maxmind4.dat",GEOIP_STANDARD);
$country = geoip_country_code_by_addr($gi, $argv[1]);
echo "lookup [".$argv[1]."], result: [$country]\n";

% php test-ip-via-geoip.php 8.8.8.8
lookup [8.8.8.8], result: [US]

% nslookup tw.yahoo.com
Server: 8.8.8.8
Address: 8.8.8.8#53

Non-authoritative answer:
tw.yahoo.com canonical name = atsv2-fp-shed.wg1.b.yahoo.com.
Name: atsv2-fp-shed.wg1.b.yahoo.com
Address: 180.222.102.201
Name: atsv2-fp-shed.wg1.b.yahoo.com
Address: 180.222.102.202

% php test-ip-via-geoip.php 180.222.102.202
lookup [180.222.102.202], result: [IN]

% curl ipinfo.io/180.222.102.202
{
  "ip": "180.222.102.202",
  "hostname": "media-router-fp74.prod.media.vip.tp2.yahoo.com",
  "city": "Taoyuan City",
  "region": "Taiwan",
  "country": "TW",
  "loc": "24.9937,121.2970",
  "org": "AS24506 YAHOO! TAIWAN",
  "timezone": "Asia/Taipei",
  "readme": "https://ipinfo.io/missingauth"
}

% nslookup facebook.com
Server: 8.8.8.8
Address: 8.8.8.8#53

Non-authoritative answer:
Name: facebook.com
Address: 31.13.87.36

% curl ipinfo.io/31.13.87.36
{
  "ip": "31.13.87.36",
  "hostname": "edge-star-mini-shv-01-tpe1.facebook.com",
  "city": "Hong Kong",
  "region": "Central and Western",
  "country": "HK",
  "loc": "22.2783,114.1747",
  "org": "AS32934 Facebook, Inc.",
  "timezone": "Asia/Hong_Kong",
  "readme": "https://ipinfo.io/missingauth"
}

% nslookup www.gov.tw
Server: 8.8.8.8
Address: 8.8.8.8#53

Non-authoritative answer:
Name: www.gov.tw
Address: 223.200.155.55

% php test-ip-via-geoip.php 223.200.155.55
lookup [223.200.155.55], result: [TW]

% curl ipinfo.io/223.200.155.55
{
  "ip": "223.200.155.55",
  "hostname": "223-200-155-55.hinet-ip.hinet.net",
  "city": "Hualien City",
  "region": "Taiwan",
  "country": "TW",
  "loc": "23.9769,121.6044",
  "org": "AS4782 Data Communication Business Group",
  "timezone": "Asia/Taipei",
  "readme": "https://ipinfo.io/missingauth"
}

2021年4月9日 星期五

[PHP] 透過 Maxmind GeoIP DB 統計用戶資訊 @ macOS 11.2, PHP 7.3.24

真是超級久沒用 maxmind.com 的 GeoIP DB 了,一時之間還以為沒有免費使用的方式,追了一下是要註冊帳號才能下載,而 maxmind 也有提供自動化每天更新 GeoIP DB 的機制。

在此對一份類似 Access Logs 做分析,將其轉成 CSV 格式來分析,其中 CSV 裡頭有 id 跟 remote_ip 兩個欄位,將 remote_ip 分析完後直接歸類屬於哪個 country_code,程式碼很簡單:

% php -v

WARNING: PHP is not recommended
PHP is included in macOS for compatibility with legacy software.
Future versions of macOS will not include PHP.
PHP 7.3.24-(to be removed in future macOS) (cli) (built: Dec 21 2020 21:33:25) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.3.24, Copyright (c) 1998-2018 Zend Technologies

% cat composer.json
{
  "require": {
    "geoip2/geoip2": "~2.0"
  }
}

% php composer.phar install
No lock file found. Updating dependencies instead of installing from lock file. Use composer update over composer install if you do not have a lock file.
Loading composer repositories with package information
Updating dependencies
Lock file operations: 4 installs, 0 updates, 0 removals
- Locking composer/ca-bundle (1.2.9)
- Locking geoip2/geoip2 (v2.11.0)
- Locking maxmind-db/reader (v1.10.0)
- Locking maxmind/web-service-common (v0.8.1)
Writing lock file
Installing dependencies from lock file (including require-dev)
Package operations: 4 installs, 0 updates, 0 removals
- Installing composer/ca-bundle (1.2.9): Extracting archive
- Installing maxmind/web-service-common (v0.8.1): Extracting archive
- Installing maxmind-db/reader (v1.10.0): Extracting archive
- Installing geoip2/geoip2 (v2.11.0): Extracting archive
2 package suggestions were added by new dependencies, use `composer suggest` to see details.

Generating autoload files
1 package you are using is looking for funding.
Use the `composer fund` command to find out more!

% cat job.php
<?php

require 'vendor/autoload.php';

// https://github.com/maxmind/GeoIP2-php
use GeoIp2\Database\Reader;
$reader = new Reader('GeoLite2-City.mmdb');

// https://www.php.net/manual/en/function.fgetcsv.php
$row = 1;
$header_row = NULL;
$header = array();
$country_group = array(); 
$lookup = array();

$tracking_time_cost_per_run = microtime(true);
echo "[".date("Y-m-d H:i:s")."]\tinit\n";
if (($handle = fopen("access_log.csv", "r")) !== FALSE) {
while (($data = fgetcsv($handle, 10240, ",")) !== FALSE) {
if (is_null($header_row)) {
$header_row = $data;
foreach($header_row as $index => $name) {
$header[$name] = $index;
}
continue;
}
$remote_ip = $data[ $header['remote_ip'] ];
$data_id = $data[ $header['id'] ];

try {
$record = $reader->city($remote_ip);
} catch (Exception $e) {
continue;
}
if (!is_null($record) && property_exists($record, 'country') && !is_null($record->country) && isset($record->country->isoCode)) {
if (!isset($country_group[$record->country->isoCode]))
$country_group[$record->country->isoCode] = array();
$hash_key = $remote_ip.'-'.$record->country->isoCode;
if (!isset($lookup[$hash_key])) {
$lookup[$hash_key] = 1;
array_push($country_group[$record->country->isoCode], $data_id);
}
}
++$row;
if ($row % 100000 == 0) {
echo "[".date("Y-m-d H:i:s")."]\t".number_format($row).", time cost: ".(microtime(true) - $tracking_time_cost_per_run)."\n";
$tracking_time_cost_per_run = microtime(true);
file_put_contents('/tmp/lookup-result.json', json_encode($country_group));
}
}
fclose($handle);

echo "[".date("Y-m-d H:i:s")."]\t".number_format($row)."\n";
echo "Total #: $row\n";
print_r($header);
file_put_contents('lookup-result.json', json_encode($country_group));
}

產出:
% time php job.php
[2021-04-09 21:03:51] init
[2021-04-09 21:05:18] 100,000, time cost: 86.867056131363
[2021-04-09 21:06:41] 200,000, time cost: 83.395349025726 
...

% jq '' /tmp/lookup-result.json | head -n10

{
  "TW": [
    "1",
    "68",
    "101",
    "121",
    "147",
    "193",
    "236",
    "259", 
... 

就把一堆 record id 擺在某個 country_code 下面,每 85秒處理完 10萬筆資料。