2020年4月4日 星期六

再看一次「心之谷」/ 「側耳傾聽」



最近宮崎駿搬上了 NETFLIX ,就順手滑了一下 XD 明明昨晚在看「地獄」(瘟疫+WHO),今晚卻在看宮崎駿心之谷。

一開始只是想回顧片中經典的合唱曲,看著看著也把後半段也都追完了。令人回想起初中生活,當時引響最深的就是心之谷的這首歌。述說著自己當年的青澀與茫無目標的生活,想做什麼,卻什麼也沒頭緒。

此時此刻,時間貢獻給工作、家庭,近幾年提不起勁的,隨手的 side project 也了無新意,很適合回顧這動畫,想想、動動身,別再遲疑了。

2020年3月28日 星期六

[Python] 數據分析筆記 - 透過 pandas, scikit-learn 和 xgboost 分析 Kaggle airbnb-recruiting-new-user-bookings 案例

記得 2017 年也曾註冊 Kaggle 帳號,在上面挑個題目試試手氣,當時也有選中 Airbnb 來研究,可惜當年沒堅持下去,太多有趣的事了 XD 今年春天就來複習一下。

當初我是先從史丹佛 Andrew Ng 的課程看的,但大概只看個幾集就沒繼續,再過一陣子後就是台大教授林軒田的機器學習基石,我印象中有看完,因為我還在遲疑要不要接著看另一個進階課程,那時過境遷,沒再用都忘光了!
這些課程看著看著就不小心恍神了,接著自己僅用著一些原理去土砲...殊不知 Pandas 跟 scikit-learn 套件有多好用,當時只粗略用 Pandas 當 csv parser ,剩下的資料轉換、陣列計算還自己刻 numpy 架構去運算。所幸,終於有個更適合我這種懶人的微課程,那就是 Kaggle 的 Faster Data Science Education
我大概只需要從第三章 Intermediate Machine Learning 走完一遍就得到我想要的東西了。接著想找個戰場試試手氣,就又回想起 Kaggle 的 airbnb-recruiting-new-user-bookings 數據
接著用 airbnb-recruiting-new-user-bookings 關鍵字問個 google ,會發現到現在也有非常多人拿他當例子來分析,經典果真歷久不衰!大概不用破百行,就可以組出 airbnb 數據分析達八成的水準:

匯入函式庫:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

import numpy as np
import pandas as pd

import datetime


匯入 csv 檔案:

train_users = pd.read_csv('input/train_users_2.csv')

將 age 內容調整,包含去除輸入錯誤(太大或大小者),例如明顯輸入的是西元年,就順便幫轉一下:

data_checker = train_users.select_dtypes(include=['number']).copy()
data_checker = data_checker[ (data_checker.age > 1000) & (data_checker.age < 2010) ]
data_checker['age'] = 2015 - data_checker['age'] # 推論當年的資料,用 2015 年來相減對方不小心輸錯的出生年來得到年紀

for idx,row in data_checker.iterrows():
        train_users.at[idx,'age'] = row['age']

data_checker = train_users.select_dtypes(include=['number']).copy()
data_checker = data_checker[ (data_checker.age >= 2010) | (data_checker.age >= 100) | (data_checker.age < 13) ]
data_checker['age'] = np.nan
for idx,row in data_checker.iterrows():
        train_users.at[idx,'age'] = row['age']


處理時間欄位,轉成 datetime 型態,並轉成 weekday:

data_checker = train_users.loc[:, 'timestamp_first_active'].copy()
data_checker = pd.to_datetime( (data_checker // 1000000), format='%Y%m%d')
train_users['timestamp_first_active'] = data_checker

str_to_datetime_fields = ['date_account_created', 'date_first_booking']

for field in str_to_datetime_fields:
        train_users[field] = pd.to_datetime(train_users[field])

# to weekday

train_users['first_active_weekday'] = train_users['timestamp_first_active'].dt.dayofweek
for field in str_to_datetime_fields:
        train_users[field+'_weekday'] = train_users[field].dt.dayofweek

# remove datetime fields

train_users.drop(str_to_datetime_fields, axis=1, inplace=True)
train_users.drop(['timestamp_first_active'], axis=1, inplace=True)


處理 label 資料,主要轉成 one-hot encoding,並且去除一些數值,統一轉成 NaN:

categorical_features = [
        'affiliate_channel',
        'affiliate_provider',
        #'country_destination',
        'first_affiliate_tracked',
        'first_browser',
        'first_device_type',
        'gender',
        'language',
        'signup_app',
        'signup_method'
]
for categorical_feature in categorical_features:
        train_users[categorical_feature].replace('-unknown-', np.nan, inplace=True)
        train_users[categorical_feature].replace('NaN', np.nan, inplace=True)
        train_users[categorical_feature] = train_users[categorical_feature].astype('category')

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
# Convert categorical variable into dummy/indicator variables.
train_users = pd.get_dummies(train_users, columns=categorical_features)


開始順練模型,建議先透過 sample 跑小量

# X = train_users.copy()
X = train_users.sample(n=3000,random_state=0).copy()
y = X['country_destination'].copy()
X = X.drop(['country_destination'], axis=1)

X_train, X_valid, y_train, y_valid = train_test_split(X, y)


print("Start to train...")

job_start = datetime.datetime.now()

my_model = XGBClassifier()
my_model.fit(X_train, y_train)

print("training done, time cost: ", (datetime.datetime.now() - job_start))

job_start = datetime.datetime.now()

predictions = my_model.predict(X_valid)
print("predict done, time cost: ", (datetime.datetime.now() - job_start))

print("score:", accuracy_score(predictions, y_valid))


運行結果:

Start to train...
training done, time cost:  0:00:14.270242
predict done, time cost:  0:00:00.056500
score: 0.8466666666666667


沒想到只需做一些處理,運算玩就有八成準確率了!以上還沒使用 sessions 資料。完整程式碼請參考:github.com/changyy/study-kaggle-airbnb-recruiting-new-user-bookings

2020年3月22日 星期日

[GoogleSheet] 透過 Query 取出動態欄位的極限值,以 GOOGLEFINANCE 為例

GoogleSheet, =QUERY(GOOGLEFINANCE("NASDAQ:ZM", "low","1/1/2020",365,"DAILY"),"Select Min(Col2) label Min(Col2)''",1)

故事是這樣的,目前透過 GoogleSheet 維護一些資訊,有些服務可以立即幫你長很多筆資料,但有時只需要其中一項極值資訊。

例如 GOOGLEFINANCE - https://support.google.com/docs/answer/3093281 可以一口氣列出幾定天數的數據,如:

=GOOGLEFINANCE("NASDAQ:ZM", "low","1/1/2020",365,"DAILY")

若只想看極值,就可以再透過 Query - https://support.google.com/docs/answer/3093343 處理

最低極值:

=QUERY(GOOGLEFINANCE("NASDAQ:ZM", "low","1/1/2020",365,"DAILY"),"Select Min(Col2) label Min(Col2)''",1)

最高極值:

=QUERY(GOOGLEFINANCE("NASDAQ:ZM", "high","1/1/2020",365,"DAILY"),"Select Max(Col2) label Max(Col2)''",1)

收工!真是美好的一天

2020年3月21日 星期六

[C] 嵌入式 Web Server + cgi-bin + JSON = thttpd + cgic + cJSON 開發筆記 @ macOS

記得七八年前,是在嵌入式平台上寫 C++ 的 CGI 練習,現在則是在試著寫 C 語言。以下幾個套件還滿夠用的,筆記一下:
而 cgic 跟 cJSON 都是很精簡的單一檔案,非常方便,建議直接看他們的 header file ,就可以很快了解其概念,其中 cJSON 適合看 README ,裡頭提到了需要面對的記憶體管理。

以下是 CMake 編譯範例:

# https://raw.githubusercontent.com/boutell/cgic/master/cgic.h
# https://raw.githubusercontent.com/boutell/cgic/master/cgic.c
set(CGIC_VERSION "cgic-2.07")

include_directories(deps/${CGIC_VERSION}/include)
add_library(cgic
        deps/${CGIC_VERSION}/src/cgic.c
)

# https://raw.githubusercontent.com/DaveGamble/cJSON/v1.7.12/cJSON.h
# https://raw.githubusercontent.com/DaveGamble/cJSON/v1.7.12/cJSON.c
set(CJSON_VERSION "cjson-1.7.12")
include_directories(deps/${CJSON_VERSION}/include)
add_library(cjson
deps/${CJSON_VERSION}/src/cJSON.c
)


如此,後續要編譯時,就可以:

add_executable(study-app.cgi
        src/study-app.c
)
target_link_libraries(study-app.cgi
        cgic
        cjson
        curl
)


相關筆記在此: changyy/thttpd-cgi-study/blob/master/src/study-app.c

2020年3月17日 星期二

使用 curl 仿 DLNA controller 指令播放影片: DLNA protocol + SOAP protocol

大概去年就拿 DLNA 當做個小題目研究了一下,最近工作又需要,再度翻出來複習一下。使用了別人寫的 python 小工具,非常小巧精美:github.com/cherezov/dlnap

裝置搜尋:

$ python dlnap.py
No compatible devices found.
$ python dlnap.py
Discovered devices:
 [a] Name1 @ 192.168.1.2
 [a] Name2 @ 192.168.1.3


播放影片:

$ python dlnap.py -d Name1 --play http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4
No compatible devices found.

$ python dlnap.py -d Name1 --play http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4
Name1 @ 192.168.1.2
POST AVTransport/control HTTP/1.1
User-Agent: dlnap.py/0.15
Accept: */*
Content-Type: text/xml; charset="utf-8"
HOST: 192.168.1.2:60099
Content-Length: 558
SOAPACTION: "urn:schemas-upnp-org:service:AVTransport:1#SetAVTransportURI"
Connection: close

<?xml version="1.0" encoding="utf-8"?>
         <s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/" s:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
            <s:Body>
               <u:SetAVTransportURI xmlns:u="urn:schemas-upnp-org:service:AVTransport:1">
                  <InstanceID>0</InstanceID><CurrentURIMetaData></CurrentURIMetaData><CurrentURI>http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4</CurrentURI>
               </u:SetAVTransportURI>
            </s:Body>
         </s:Envelope>
POST AVTransport/control HTTP/1.1
User-Agent: dlnap.py/0.15
Accept: */*
Content-Type: text/xml; charset="utf-8"
HOST: 192.168.1.2:60099
Content-Length: 401
SOAPACTION: "urn:schemas-upnp-org:service:AVTransport:1#Play"
Connection: close

<?xml version="1.0" encoding="utf-8"?>
         <s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/" s:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
            <s:Body>
               <u:Play xmlns:u="urn:schemas-upnp-org:service:AVTransport:1">
                  <InstanceID>0</InstanceID><Speed>1</Speed>
               </u:Play>
            </s:Body>
         </s:Envelope>


可惜的,在測試播放影片時總不太順利,推論應當是兩個指令太快,改用 curl 指令測試就很正常:

1. 先設定影片

$ cat SetAVTransportURI.record
<?xml version="1.0" encoding="utf-8"?>
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/" s:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<s:Body>
<u:SetAVTransportURI xmlns:u="urn:schemas-upnp-org:service:AVTransport:1">
<InstanceID>0</InstanceID>
<CurrentURIMetaData></CurrentURIMetaData>
<CurrentURI>http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4</CurrentURI>
</u:SetAVTransportURI>
</s:Body>
</s:Envelope>

$ curl -v -X POST --data @SetAVTransportURI.record  -H 'SOAPACTION: "urn:schemas-upnp-org:service:AVTransport:1#SetAVTransportURI"' -H 'Content-Type: text/xml; charset="utf-8"' http://192.168.1.2:60099/AVTransport/control
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 192.168.1.2:60099...
* TCP_NODELAY set
* Connected to 192.168.1.2 (192.168.1.2) port 60099 (#0)
> POST /AVTransport/control HTTP/1.1
> Host: 192.168.1.2:60099
> User-Agent: curl/7.68.0
> Accept: */*
> SOAPACTION: "urn:schemas-upnp-org:service:AVTransport:1#SetAVTransportURI"
> Content-Type: text/xml; charset="utf-8"
> Content-Length: 476
>
* upload completely sent off: 476 out of 476 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< EXT:
< CONTENT-TYPE: text/xml; charset="utf-8"
< SERVER: POSIX, UPnP/1.0, Intel MicroStack/1.0.2777
< Content-Length: 336
<
<?xml version="1.0" encoding="utf-8"?>
<s:Envelope s:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
   <s:Body>
      <u:SetAVTransportURIResponse xmlns:u="urn:schemas-upnp-org:service:AVTransport:1">

      </u:SetAVTransportURIResponse>
   </s:Body>
* Connection #0 to host 192.168.1.2 left intact
</s:Envelope>


2. 呼叫播放

$ cat Play.record
<?xml version="1.0" encoding="utf-8"?>
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/" s:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<s:Body>
<u:Play xmlns:u="urn:schemas-upnp-org:service:AVTransport:1">
<InstanceID>0</InstanceID>
<Speed>1</Speed>
</u:Play>
</s:Body>
</s:Envelope>

$ curl -v -X POST --data @Play.record  -H 'SOAPACTION: "urn:schemas-upnp-org:service:AVTransport:1#Play"' -H 'Content-Type: text/xml; charset="utf-8"' http://192.168.1.2:60099/AVTransport/control
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 192.168.1.2:60099...
* TCP_NODELAY set
* Connected to 192.168.1.2 (192.168.1.2) port 60099 (#0)
> POST /AVTransport/control HTTP/1.1
> Host: 192.168.1.2:60099
> User-Agent: curl/7.68.0
> Accept: */*
> SOAPACTION: "urn:schemas-upnp-org:service:AVTransport:1#Play"
> Content-Type: text/xml; charset="utf-8"
> Content-Length: 316
>
* upload completely sent off: 316 out of 316 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< EXT:
< CONTENT-TYPE: text/xml; charset="utf-8"
< SERVER: POSIX, UPnP/1.0, Intel MicroStack/1.0.2777
< Content-Length: 310
<
<?xml version="1.0" encoding="utf-8"?>
<s:Envelope s:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
   <s:Body>
      <u:PlayResponse xmlns:u="urn:schemas-upnp-org:service:AVTransport:1">

      </u:PlayResponse>
   </s:Body>
* Connection #0 to host 192.168.1.2 left intact
</s:Envelope>