當初我是先從史丹佛 Andrew Ng 的課程看的,但大概只看個幾集就沒繼續,再過一陣子後就是台大教授林軒田的機器學習基石,我印象中有看完,因為我還在遲疑要不要接著看另一個進階課程,那時過境遷,沒再用都忘光了!
- Andrew Ng - machine-learning
- 林軒田 - 機器學習基石上 (Machine Learning Foundations)---Mathematical Foundations
我大概只需要從第三章 Intermediate Machine Learning 走完一遍就得到我想要的東西了。接著想找個戰場試試手氣,就又回想起 Kaggle 的 airbnb-recruiting-new-user-bookings 數據
接著用 airbnb-recruiting-new-user-bookings 關鍵字問個 google ,會發現到現在也有非常多人拿他當例子來分析,經典果真歷久不衰!大概不用破百行,就可以組出 airbnb 數據分析達八成的水準:
匯入函式庫:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
import numpy as np
import pandas as pd
import datetime
匯入 csv 檔案:
train_users = pd.read_csv('input/train_users_2.csv')
將 age 內容調整,包含去除輸入錯誤(太大或大小者),例如明顯輸入的是西元年,就順便幫轉一下:
data_checker = train_users.select_dtypes(include=['number']).copy()
data_checker = data_checker[ (data_checker.age > 1000) & (data_checker.age < 2010) ]
data_checker['age'] = 2015 - data_checker['age'] # 推論當年的資料,用 2015 年來相減對方不小心輸錯的出生年來得到年紀
for idx,row in data_checker.iterrows():
train_users.at[idx,'age'] = row['age']
data_checker = train_users.select_dtypes(include=['number']).copy()
data_checker = data_checker[ (data_checker.age >= 2010) | (data_checker.age >= 100) | (data_checker.age < 13) ]
data_checker['age'] = np.nan
for idx,row in data_checker.iterrows():
train_users.at[idx,'age'] = row['age']
處理時間欄位,轉成 datetime 型態,並轉成 weekday:
data_checker = train_users.loc[:, 'timestamp_first_active'].copy()
data_checker = pd.to_datetime( (data_checker // 1000000), format='%Y%m%d')
train_users['timestamp_first_active'] = data_checker
str_to_datetime_fields = ['date_account_created', 'date_first_booking']
for field in str_to_datetime_fields:
train_users[field] = pd.to_datetime(train_users[field])
# to weekday
train_users['first_active_weekday'] = train_users['timestamp_first_active'].dt.dayofweek
for field in str_to_datetime_fields:
train_users[field+'_weekday'] = train_users[field].dt.dayofweek
# remove datetime fields
train_users.drop(str_to_datetime_fields, axis=1, inplace=True)
train_users.drop(['timestamp_first_active'], axis=1, inplace=True)
處理 label 資料,主要轉成 one-hot encoding,並且去除一些數值,統一轉成 NaN:
categorical_features = [
'affiliate_channel',
'affiliate_provider',
#'country_destination',
'first_affiliate_tracked',
'first_browser',
'first_device_type',
'gender',
'language',
'signup_app',
'signup_method'
]
for categorical_feature in categorical_features:
train_users[categorical_feature].replace('-unknown-', np.nan, inplace=True)
train_users[categorical_feature].replace('NaN', np.nan, inplace=True)
train_users[categorical_feature] = train_users[categorical_feature].astype('category')
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
# Convert categorical variable into dummy/indicator variables.
train_users = pd.get_dummies(train_users, columns=categorical_features)
開始順練模型,建議先透過 sample 跑小量
# X = train_users.copy()
X = train_users.sample(n=3000,random_state=0).copy()
y = X['country_destination'].copy()
X = X.drop(['country_destination'], axis=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
print("Start to train...")
job_start = datetime.datetime.now()
my_model = XGBClassifier()
my_model.fit(X_train, y_train)
print("training done, time cost: ", (datetime.datetime.now() - job_start))
job_start = datetime.datetime.now()
predictions = my_model.predict(X_valid)
print("predict done, time cost: ", (datetime.datetime.now() - job_start))
print("score:", accuracy_score(predictions, y_valid))
運行結果:
Start to train...
training done, time cost: 0:00:14.270242
predict done, time cost: 0:00:00.056500
score: 0.8466666666666667
沒想到只需做一些處理,運算玩就有八成準確率了!以上還沒使用 sessions 資料。完整程式碼請參考:github.com/changyy/study-kaggle-airbnb-recruiting-new-user-bookings