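Reference: Titanic best working Classifier by Sina
I really enjoyed following the whole process: means, standard deviations, noise removal, missing-value imputation, normalization, turning labels into numbers, and so on. It reminded me of my earlier notes, where I reached for LabelEncoder and friends, when it turns out that Pandas plus map takes care of everything in one go XD. That does not make LabelEncoder useless; rather, once you understand what your data looks like, you can lean on Pandas itself to get the job done. Note that this example does not use OneHotEncoder.
Quick notes: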
import re
import numpy as np
import pandas as pd

train = pd.read_csv('../input/train.csv', header = 0, dtype={'Age': np.float64})
test = pd.read_csv('../input/test.csv' , header = 0, dtype={'Age': np.float64})
full_data = [train, test]
for dataset in full_data:
    dataset['Name_length'] = dataset['Name'].apply(len)
    dataset['Has_Cabin'] = dataset["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
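The nice thing about full_data = [train, test] combined with "for dataset in full_data" is that the same transformation is applied to both the train and test sets in a single pass. Very convenient, and a trick I had never thought of. New columns are added directly through pandas, which is intuitive enough, but I had not realized you could pair that with apply and do it all in one line, which keeps the code really compact: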
dataset['Name_length'] = dataset['Name'].apply(len)
dataset['Has_Cabin'] = dataset["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
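To convince myself why the loop really changes train and test, here is a minimal sketch on two tiny made-up frames (the rows are stand-ins, not the real CSV files). It also swaps the type(x) == float check for notna(), which is my own substitution and not what the referenced kernel does; both flag a missing Cabin as 0.

```python
import pandas as pd

# Tiny stand-in frames; the notes above load train.csv / test.csv instead.
train = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Cabin": [float("nan"), "C85"],
})
test = pd.DataFrame({"Name": ["Kelly, Mr. James"], "Cabin": [float("nan")]})

full_data = [train, test]
for dataset in full_data:
    # Column assignment mutates the DataFrame object itself, so the new
    # columns are visible through the original `train` / `test` names.
    dataset["Name_length"] = dataset["Name"].apply(len)
    # Assumption: notna() is used here in place of the type(x) == float test.
    dataset["Has_Cabin"] = dataset["Cabin"].notna().astype(int)

print(train)
print(test)
```

One caveat: this only works for in-place changes. Rebinding the loop variable (for example dataset = dataset.drop(...)) creates a new object and leaves train and test untouched.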
Also, when working with strings, you can preprocess (normalize) them with replace, or with a regular expression applied through apply:
def get_title(name):
    # Capture the letters between a space and a period, e.g. "Mr" in " Mr."
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""
dataset['Title'] = dataset['Name'].apply(get_title)
#print(dataset['Title'].unique())
dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
#print(dataset['Title'].unique())
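A small self-contained sketch of the same idea, so the regex and the replace calls can be tried without the dataset. The three names are illustrative samples, and merging the synonyms through a single dict passed to replace is my own shortcut, not the kernel's code.

```python
import re
import pandas as pd

def get_title(name):
    # Match letters that follow a space and end with a period, e.g. " Mr."
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

# Illustrative sample names only.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Sagesser, Mlle. Emma",
    "Uruchurtu, Don. Manuel E",
])

titles = names.apply(get_title)
# A dict handles several one-to-one renames at once; a list plus a single
# value collapses many titles into one bucket.
merged = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
merged = merged.replace(["Don"], "Rare")
print(merged.tolist())
```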
After this round of cleanup, converting the strings to numbers is just a map call; fillna then sets anything the mapping did not cover to 0. So concise:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
dataset['Title'] = dataset['Title'].map(title_mapping)
dataset['Title'] = dataset['Title'].fillna(0)
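What makes fillna necessary is that map returns NaN for any value missing from the dict. A minimal sketch on a made-up Series (the "Unknown" entry and the final astype(int) are my additions for illustration):

```python
import pandas as pd

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
titles = pd.Series(["Mr", "Mrs", "", "Unknown"])  # last two are not in the mapping

codes = titles.map(title_mapping)    # unmapped values become NaN
print(codes.isna().tolist())

# Fill the gaps with 0, then cast back to int since NaN forced a float dtype.
codes = codes.fillna(0).astype(int)
print(codes.tolist())
```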
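Next come age and fare. These fields are somewhat private, so they are often missing. Here the author fills the missing ages with random integers drawn from within one standard deviation of the mean, which keeps the filled values close to the existing distribution. Clever! Missing fares are filled with the median instead: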
age_avg = dataset['Age'].mean()
age_std = dataset['Age'].std()
age_null_count = dataset['Age'].isnull().sum()
age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
#
# dataset['Age'][np.isnan(dataset['Age'])] = age_null_random_list
#
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
# See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
#
dataset.loc[ dataset['Age'][np.isnan(dataset['Age'])].index , 'Age' ] = age_null_random_list
dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
dataset['Fare'] = dataset['Fare'].astype(int)
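For my own reference, the fill-then-bin steps can be sketched on a toy Series. Everything here is an assumption for illustration: the ages are made up, the seeded default_rng generator replaces np.random.randint so the run is repeatable, and pd.cut stands in for the chain of dataset.loc assignments (same edges, with each bin closed on the right).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the random fill is reproducible
age = pd.Series([22.0, np.nan, 38.0, 70.0, np.nan, 4.0])  # made-up ages

missing = age.isna()
# Integer bounds one standard deviation either side of the mean.
low = int(age.mean() - age.std())
high = int(age.mean() + age.std())
age.loc[missing] = rng.integers(low, high, size=int(missing.sum()))

# Bins (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf) mapped to codes 0..4.
age_bin = pd.cut(age, bins=[-np.inf, 16, 32, 48, 64, np.inf], labels=False)
print(age_bin.tolist())
```

Using a mask with .loc for the assignment is also what avoids the SettingWithCopyWarning quoted in the comments above.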
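Finally, a few more words on adding new features. pandas really is convenient: you can compute something from an existing column and store it in a new one, or use dataset.loc to select specific rows and overwrite their values: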
dataset['Name_length'] = dataset['Name'].apply(len)
dataset['Has_Cabin'] = dataset["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
dataset['IsAlone'] = 0
dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
dataset.loc[ dataset['Age'][np.isnan(dataset['Age'])].index , 'Age' ] = age_null_random_list
dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
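These notes never show where FamilySize comes from. A minimal sketch, assuming it is the passenger plus siblings/spouses (SibSp) and parents/children (Parch); treat that definition and the toy rows as my assumption rather than something shown above:

```python
import pandas as pd

# Toy rows; SibSp and Parch are the Titanic column names for relatives aboard.
df = pd.DataFrame({"SibSp": [1, 0, 0], "Parch": [0, 0, 2]})

# Assumed definition: relatives aboard plus the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Default everyone to 0, then flip the rows selected by the boolean mask.
df["IsAlone"] = 0
df.loc[df["FamilySize"] == 1, "IsAlone"] = 1
print(df)
```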
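Reading this kernel was a real level-up! Everything above is the data-cleaning workflow, but there is one key point I have not mentioned yet: right after preparing each column, the author checks how it relates to the target column (Survived). That is what letting the numbers do the work looks like: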
print(train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean())
      Sex  Survived
0  female  0.742038
1    male  0.188908
print(train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
    Title  Survived
0  Master  0.575000
1    Miss  0.702703
2      Mr  0.156673
3     Mrs  0.793651
4    Rare  0.347826
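So the survival rate for both Miss and Mrs is above 70%! This is the mindset worth learning. Anyone who travels by ship regularly knows that women and children are evacuated first, so with that background you would start from the hypothesis that women survived at a higher rate. Without that background, you have to rely on your statistics skills.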
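Why does groupby(...).mean() give a survival rate? Survived is 0 or 1, so its mean within a group is simply the fraction of ones. A minimal sketch on made-up rows (not the real data):

```python
import pandas as pd

# Made-up rows: 3 of 4 females and 1 of 4 males survive.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the proportion of ones in each group.
rates = toy.groupby("Sex", as_index=False)["Survived"].mean()
print(rates)
```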