2021年4月9日 星期五

[PHP] 透過 Maxmind GeoIP DB 統計用戶資訊 @ macOS 11.2, PHP 7.3.24

真是超級久沒用 maxmind.com 的 GeoIP DB 了,一時之間還以為沒有免費使用的方式,追了一下是要註冊帳號才能下載,而 maxmind 也有提供自動化每天更新 GeoIP DB 的機制。

在此對一份類似 Access Logs 做分析,將其轉成 CSV 格式來分析,其中 CSV 裡頭有 id 跟 remote_ip 兩個欄位,將 remote_ip 分析完後直接歸類屬於哪個 country_code,程式碼很簡單:

% php -v

WARNING: PHP is not recommended
PHP is included in macOS for compatibility with legacy software.
Future versions of macOS will not include PHP.
PHP 7.3.24-(to be removed in future macOS) (cli) (built: Dec 21 2020 21:33:25) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.3.24, Copyright (c) 1998-2018 Zend Technologies

% cat composer.json
{
  "require": {
    "geoip2/geoip2": "~2.0"
  }
}

% php composer.phar install
No lock file found. Updating dependencies instead of installing from lock file. Use composer update over composer install if you do not have a lock file.
Loading composer repositories with package information
Updating dependencies
Lock file operations: 4 installs, 0 updates, 0 removals
- Locking composer/ca-bundle (1.2.9)
- Locking geoip2/geoip2 (v2.11.0)
- Locking maxmind-db/reader (v1.10.0)
- Locking maxmind/web-service-common (v0.8.1)
Writing lock file
Installing dependencies from lock file (including require-dev)
Package operations: 4 installs, 0 updates, 0 removals
- Installing composer/ca-bundle (1.2.9): Extracting archive
- Installing maxmind/web-service-common (v0.8.1): Extracting archive
- Installing maxmind-db/reader (v1.10.0): Extracting archive
- Installing geoip2/geoip2 (v2.11.0): Extracting archive
2 package suggestions were added by new dependencies, use `composer suggest` to see details.

Generating autoload files
1 package you are using is looking for funding.
Use the `composer fund` command to find out more!

% cat job.php
<?php

require 'vendor/autoload.php';

// https://github.com/maxmind/GeoIP2-php
use GeoIp2\Database\Reader;
$reader = new Reader('GeoLite2-City.mmdb');

// https://www.php.net/manual/en/function.fgetcsv.php
$row = 1;
$header_row = NULL;
$header = array();
$country_group = array(); 
$lookup = array();

$tracking_time_cost_per_run = microtime(true);
echo "[".date("Y-m-d H:i:s")."]\tinit\n";
if (($handle = fopen("access_log.csv", "r")) !== FALSE) {
while (($data = fgetcsv($handle, 10240, ",")) !== FALSE) {
if (is_null($header_row)) {
$header_row = $data;
foreach($header_row as $index => $name) {
$header[$name] = $index;
}
continue;
}
$remote_ip = $data[ $header['remote_ip'] ];
$data_id = $data[ $header['id'] ];

try {
$record = $reader->city($remote_ip);
} catch (Exception $e) {
continue;
}
if (!is_null($record) && property_exists($record, 'country') && !is_null($record->country) && isset($record->country->isoCode)) {
if (!isset($country_group[$record->country->isoCode]))
$country_group[$record->country->isoCode] = array();
$hash_key = $remote_ip.'-'.$record->country->isoCode;
if (!isset($lookup[$hash_key])) {
$lookup[$hash_key] = 1;
array_push($country_group[$record->country->isoCode], $data_id);
}
}
++$row;
if ($row % 100000 == 0) {
echo "[".date("Y-m-d H:i:s")."]\t".number_format($row).", time cost: ".(microtime(true) - $tracking_time_cost_per_run)."\n";
$tracking_time_cost_per_run = microtime(true);
file_put_contents('/tmp/lookup-result.json', json_encode($country_group));
}
}
fclose($handle);

echo "[".date("Y-m-d H:i:s")."]\t".number_format($row)."\n";
echo "Total #: $row\n";
print_r($header);
file_put_contents('lookup-result.json', json_encode($country_group));
}

產出:
% time php job.php
[2021-04-09 21:03:51] init
[2021-04-09 21:05:18] 100,000, time cost: 86.867056131363
[2021-04-09 21:06:41] 200,000, time cost: 83.395349025726 
...

% jq '' /tmp/lookup-result.json | head -n10

{
  "TW": [
    "1",
    "68",
    "101",
    "121",
    "147",
    "193",
    "236",
    "259", 
... 

就把一堆 record id 擺在某個 country_code 下面,每 85秒處理完 10萬筆資料。

沒有留言:

張貼留言