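Let's start with the hash table approach: scan every value once and build a lookup table per field, so that the labels are assigned in the same pass over the data: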
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

dataset = sns.load_dataset("tips")
print(dataset)
print(dataset.shape)
print(dataset.columns)
# Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

dataset_formated = None
# Hand-rolled approach
fields_lookup = {}
for index, row in dataset.iterrows():
    # Start each row as an empty 1-D array; np.empty([]) would be a 0-d array
    # holding one uninitialized value, which np.append would keep as an extra column
    row_formated = np.empty(0, dtype=int)
    #for fieldname in dataset.columns:
    for fieldname in ['day', 'smoker', 'time', 'sex', 'size']:
        if fieldname not in fields_lookup:
            fields_lookup[fieldname] = {}
        # First time this value is seen: give it the next free integer
        if row[fieldname] not in fields_lookup[fieldname]:
            fields_lookup[fieldname][row[fieldname]] = len(fields_lookup[fieldname])
        # field value from hash table
        field_data = np.zeros(1, dtype=int)  # np.int was removed from NumPy; the builtin int works
        field_data[0] = fields_lookup[fieldname][row[fieldname]]
        # handle row
        row_formated = np.append(row_formated, field_data.reshape(1, -1))
    # handle data
    if dataset_formated is None:
        dataset_formated = np.zeros([dataset.shape[0], row_formated.size], dtype=int)
    dataset_formated[index] = row_formated

print(fields_lookup)
print(dataset_formated)
That completes both the encoding and the data format conversion:
{'day': {'Sun': 0, 'Sat': 1, 'Thur': 2, 'Fri': 3}, 'smoker': {'No': 0, 'Yes': 1}, 'time': {'Dinner': 0, 'Lunch': 1}, 'sex': {'Female': 0, 'Male': 1}, 'size': {2: 0, 3: 1, 4: 2, 1: 3, 6: 4, 5: 5}}
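For comparison, the same first-seen numbering can probably be had from pandas without writing the loop. The sketch below is only an illustration: the `sample` frame is made-up data standing in for a few rows of tips, and `lookup` and `encoded` are names I chose, not part of the code above.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the tips data (values made up for illustration)
sample = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat", "Thur", "Sat"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
})

lookup = {}
encoded = pd.DataFrame(index=sample.index)
for fieldname in sample.columns:
    # factorize numbers the values in order of first appearance, like the loop above
    codes, uniques = pd.factorize(sample[fieldname])
    encoded[fieldname] = codes
    lookup[fieldname] = {value: code for code, value in enumerate(uniques)}

print(lookup)
print(encoded)
```

Because `pd.factorize` also assigns codes in order of first appearance, the resulting `lookup` has the same shape as `fields_lookup`.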
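However, some algorithms may do arithmetic on the encoded integers, and you may want to express each integer more precisely as a choice among options. The solution is to expand the field: with 5 possible values, for example, expand it into 5 columns, marking the chosen one 1 and the others 0. The hand-rolled way is a bit more work: scan once to build the hash table, then make a second pass to rebuild the data: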
# Start fresh so results from the previous snippet do not leak in
fields_lookup = {}
dataset_formated = None

# build hash table only
for index, row in dataset.iterrows():
    #for fieldname in dataset.columns:
    for fieldname in ['day', 'smoker', 'time', 'sex', 'size']:
        if fieldname not in fields_lookup:
            fields_lookup[fieldname] = {}
        if row[fieldname] not in fields_lookup[fieldname]:
            fields_lookup[fieldname][row[fieldname]] = len(fields_lookup[fieldname])
print(fields_lookup)

# build new matrix
for index, row in dataset.iterrows():
    # Empty 1-D array; np.empty([]) would add one uninitialized leading value
    row_formated = np.empty(0, dtype=int)
    for fieldname in ['day', 'smoker', 'time', 'sex', 'size']:
        # field value from hash table: one slot per known value, 1 marks the chosen one
        field_data = np.zeros([len(fields_lookup[fieldname]), 1], dtype=int)
        field_data[fields_lookup[fieldname][row[fieldname]]][0] = 1
        # handle row
        row_formated = np.append(row_formated, field_data.reshape(1, -1))
    # handle data
    if dataset_formated is None:
        dataset_formated = np.zeros([dataset.shape[0], row_formated.size], dtype=int)
    dataset_formated[index] = row_formated

print(dataset_formated)
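As a point of comparison, pandas can do this kind of column expansion in a single call with `pd.get_dummies`. This is a minimal sketch on made-up data (`sample` is a hypothetical stand-in, not the tips dataset), and the generated column names follow pandas' own `field_value` convention rather than anything in the code above.

```python
import pandas as pd

# Hypothetical stand-in data (values made up for illustration)
sample = pd.DataFrame({
    "day": ["Sun", "Sat", "Thur", "Sat"],
    "size": [2, 3, 2, 4],
})

# columns= forces the integer column "size" to be treated as categorical too;
# dtype=int yields 0/1 instead of booleans
onehot = pd.get_dummies(sample, columns=["day", "size"], dtype=int)
print(onehot)
```

Each row ends up with exactly one 1 per original field, which is the same property the two-pass loop produces. The column order may differ from the hand-rolled version, since that one numbers values in order of first appearance.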
Back to the more common approach, LabelEncoder:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

dataset_encode = dataset.copy()
labels = {}
for i, field in enumerate(dataset.columns):
    if field == 'tip' or field == 'total_bill':
        continue
    labels[field] = list(set(dataset[field].unique()))
    label_encoder = LabelEncoder()
    label_encoder.fit(labels[field])
    # original
    #print(dataset_encode[field])
    # encode
    # https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
    dataset_encode[field] = label_encoder.fit_transform(dataset_encode[field])
    #print(dataset_encode[field])

print(dataset_encode)
Output:
total_bill tip sex smoker day time size
0 16.99 1.01 0 0 2 0 1
1 10.34 1.66 1 0 2 0 2
2 21.01 3.50 1 0 2 0 2
3 23.68 3.31 1 0 2 0 1
4 24.59 3.61 0 0 2 0 3
5 25.29 4.71 1 0 2 0 3
6 8.77 2.00 1 0 2 0 1
7 26.88 3.12 1 0 2 0 3
8 15.04 1.96 1 0 2 0 1
9 14.78 3.23 1 0 2 0 1
10 10.27 1.71 1 0 2 0 1
11 35.26 5.00 0 0 2 0 3
12 15.42 1.57 1 0 2 0 1
13 18.43 3.00 1 0 2 0 3
14 14.83 3.02 0 0 2 0 1
15 21.58 3.92 1 0 2 0 1
16 10.33 1.67 0 0 2 0 2
17 16.29 3.71 1 0 2 0 2
18 16.97 3.50 0 0 2 0 2
19 20.65 3.35 1 0 1 0 2
20 17.92 4.08 1 0 1 0 1
21 20.29 2.75 0 0 1 0 1
22 15.77 2.23 0 0 1 0 1
23 39.42 7.58 1 0 1 0 3
24 19.82 3.18 1 0 1 0 1
25 17.81 2.34 1 0 1 0 3
26 13.37 2.00 1 0 1 0 1
27 12.69 2.00 1 0 1 0 1
28 21.70 4.30 1 0 1 0 1
29 19.65 3.00 0 0 1 0 1
.. ... ... ... ... ... ... ...
214 28.17 6.50 0 1 1 0 2
215 12.90 1.10 0 1 1 0 1
216 28.15 3.00 1 1 1 0 4
217 11.59 1.50 1 1 1 0 1
218 7.74 1.44 1 1 1 0 1
219 30.14 3.09 0 1 1 0 3
220 12.16 2.20 1 1 0 1 1
221 13.42 3.48 0 1 0 1 1
222 8.58 1.92 1 1 0 1 0
223 15.98 3.00 0 0 0 1 2
224 13.42 1.58 1 1 0 1 1
225 16.27 2.50 0 1 0 1 1
226 10.09 2.00 0 1 0 1 1
227 20.45 3.00 1 0 1 0 3
228 13.28 2.72 1 0 1 0 1
229 22.12 2.88 0 1 1 0 1
230 24.01 2.00 1 1 1 0 3
231 15.69 3.00 1 1 1 0 2
232 11.61 3.39 1 0 1 0 1
233 10.77 1.47 1 0 1 0 1
234 15.53 3.00 1 1 1 0 1
235 10.07 1.25 1 0 1 0 1
236 12.60 1.00 1 1 1 0 1
237 32.83 1.17 1 1 1 0 1
238 35.83 4.67 0 0 1 0 2
239 29.03 5.92 1 0 1 0 2
240 27.18 2.00 0 1 1 0 1
241 22.67 2.00 1 1 1 0 1
242 17.82 1.75 1 0 1 0 1
243 18.78 3.00 0 0 3 0 1
[244 rows x 7 columns]
To keep the original column, add the encoded values as a new column instead:
dataset_encode[field+"-encode"] = label_encoder.fit_transform(dataset_encode[field])
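Since the loop above reuses one `label_encoder` variable, the mapping for each field is gone once the loop moves on. A possible way to keep it, sketched here on made-up data, is to store one fitted encoder per column; the `encoders` dict and the `sample` frame are hypothetical names for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data (values made up for illustration)
sample = pd.DataFrame({"day": ["Sun", "Sat", "Thur", "Fri", "Sun"]})

encoders = {}  # hypothetical helper dict: one fitted encoder per column
encoders["day"] = LabelEncoder()
sample["day-encode"] = encoders["day"].fit_transform(sample["day"])

# classes_ is sorted, and the position of each label is its integer code
mapping = {label: code for code, label in enumerate(encoders["day"].classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels
restored = encoders["day"].inverse_transform(sample["day-encode"])
print(list(restored))
```

Note that LabelEncoder numbers labels in sorted order, so Sun becomes 2 here, whereas the hash table version numbered values in order of first appearance.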
OneHotEncoder, on the other hand, produces more dimensions, so the newly generated values have to be added back as extra columns:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

dataset_encode = dataset.copy()
labels = {}
for i, field in enumerate(dataset.columns):
    if field == 'tip' or field == 'total_bill':
        continue
    # LabelEncode
    labels[field] = list(set(dataset[field].unique()))
    label_encoder = LabelEncoder()
    label_encoder.fit(labels[field])
    # https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
    dataset_encode[field+"-LabelEncode"] = label_encoder.fit_transform(dataset_encode[field])
    # OneHotEncode
    feature = label_encoder.transform(dataset_encode[field])
    feature = feature.reshape(dataset.shape[0], 1)
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    # n_values was removed from scikit-learn; categories lists the expected codes instead.
    # The default output is sparse, so toarray() converts it to a dense array
    # (this avoids the sparse= keyword, which newer releases renamed).
    onehot_encoder = OneHotEncoder(categories=[list(range(len(labels[field])))])
    onehot_result = onehot_encoder.fit_transform(feature).toarray()
    # One new column per category
    for index in range(len(labels[field])):
        dataset_encode[field+"-OneHotEncode-"+str(index)] = onehot_result[:,index]

print(dataset_encode.head(5))
The result then looks like this:
total_bill tip sex smoker day time size sex-LabelEncode \
0 16.99 1.01 Female No Sun Dinner 2 0
1 10.34 1.66 Male No Sun Dinner 3 1
2 21.01 3.50 Male No Sun Dinner 3 1
3 23.68 3.31 Male No Sun Dinner 2 1
4 24.59 3.61 Female No Sun Dinner 4 0
sex-OneHotEncode-0 sex-OneHotEncode-1 ... \
0 1.0 0.0 ...
1 0.0 1.0 ...
2 0.0 1.0 ...
3 0.0 1.0 ...
4 1.0 0.0 ...
time-LabelEncode time-OneHotEncode-0 time-OneHotEncode-1 \
0 0 1.0 0.0
1 0 1.0 0.0
2 0 1.0 0.0
3 0 1.0 0.0
4 0 1.0 0.0
size-LabelEncode size-OneHotEncode-0 size-OneHotEncode-1 \
0 1 0.0 1.0
1 2 0.0 0.0
2 2 0.0 0.0
3 1 0.0 1.0
4 3 0.0 0.0
size-OneHotEncode-2 size-OneHotEncode-3 size-OneHotEncode-4 \
0 0.0 0.0 0.0
1 1.0 0.0 0.0
2 1.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 1.0 0.0
size-OneHotEncode-5
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
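The numbered column names above hide which category each column stands for. One possible variation, sketched on made-up data, is to feed the string columns to OneHotEncoder directly and build the column names from its `categories_` attribute. The `sample` frame, the `fields` list and the naming scheme are assumptions for illustration, not part of the code above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data (values made up for illustration)
sample = pd.DataFrame({
    "sex": ["Female", "Male", "Male"],
    "time": ["Dinner", "Dinner", "Lunch"],
})
fields = ["sex", "time"]

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense
onehot_result = encoder.fit_transform(sample[fields]).toarray()

# categories_ holds the sorted categories found in each input column
names = [
    f"{field}-OneHotEncode-{category}"
    for field, categories in zip(fields, encoder.categories_)
    for category in categories
]
sample_encode = sample.join(
    pd.DataFrame(onehot_result, columns=names, index=sample.index)
)
print(sample_encode)
```

This skips the intermediate LabelEncoder step, at the cost of losing the separate `-LabelEncode` columns.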