Feature Engineering Handbook with Jupyter: Data Preprocessing
1.1.1.1 Binarization

Binarize numerical features.
# load the sample data
from sklearn.datasets import fetch_california_housing

dataset = fetch_california_housing()
X, y = dataset.data, dataset.target
# we will take the first column as the example later

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sns.distplot(X[:, 0], hist=True, kde=True)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);
# this feature has a long-tail distribution

from sklearn.preprocessing import Binarizer

sample_columns = X[0:10, 0]  # select the top 10 samples
# returns array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

model = Binarizer(threshold=6)  # set 6 to be the threshold; if value <= 6, return 0, else return 1
result = model.fit_transform(sample_columns.reshape(-1, 1)).reshape(-1)
# returns array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0.])
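The same thresholding can be reproduced with plain NumPy, which makes the Binarizer semantics explicit: only values strictly greater than the threshold map to 1. A minimal sketch on made-up sample values (not the California housing column):

```python
import numpy as np

# hypothetical sample values for illustration
sample = np.array([8.3, 5.6, 6.0, 2.1, 7.2])

threshold = 6
# strictly greater than the threshold -> 1.0, otherwise -> 0.0
binarized = (sample > threshold).astype(float)
# note that 6.0 itself maps to 0.0, matching Binarizer's "value <= threshold" rule
```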
1.1.1.2 Binning

Bin numerical features.

Uniform binning:
from sklearn.preprocessing import KBinsDiscretizer

# in order to mimic the operation in the real world, we shall fit the KBinsDiscretizer
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and take the rest of the samples in the first column as the train set
test_set = X[0:10, 0]
train_set = X[10:, 0]

model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
# set 5 bins, return ordinal bin numbers, set all bins to have identical widths
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# returns array([2., 2., 2., 1., 1., 1., 1., 0., 0., 1.])

bin_edge = model.bin_edges_[0]
# returns array([0.4999, 3.39994, 6.29998, 9.20002, 12.10006, 15.0001]), the bin edges

# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist=True, kde=True)
for edge in bin_edge:  # uniform bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Uniform Bin Edges'], fontsize=10)
Quantile binning:
model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
# set 5 bins, return ordinal bin numbers, set all bins based on quantiles
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# returns array([4., 4., 4., 4., 2., 3., 2., 1., 0., 2.])

bin_edge = model.bin_edges_[0]
# returns array([0.4999, 2.3523, 3.1406, 3.9667, 5.10824, 15.0001]), the bin edges
# 2.3523 is the 20% quantile, 3.1406 is the 40% quantile, etc.

# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist=True, kde=True)
for edge in bin_edge:  # quantile based bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Quantile Bin Edges'], fontsize=10)
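Quantile bin edges can also be computed by hand, which clarifies what the quantile strategy does internally: the edges sit at evenly spaced quantiles of the training data, and `np.digitize` then assigns ordinal bin numbers. A sketch on made-up data (not the housing feature):

```python
import numpy as np

# hypothetical training values for illustration
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

n_bins = 5
# edges at the 0%, 20%, 40%, 60%, 80%, 100% quantiles
edges = np.quantile(train, np.linspace(0, 1, n_bins + 1))

test = np.array([1.5, 5.5, 9.5])
# np.digitize returns 1-based interval indices; subtract 1 for ordinal bins,
# and clip so values at or beyond the outer edges stay in the outer bins
bins = np.clip(np.digitize(test, edges) - 1, 0, n_bins - 1)
```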
1.1.2 Scaling
Features on different scales are hard to compare, especially in linear models such as linear regression and logistic regression. Models based on Euclidean distance, such as k-means clustering or KNN, require feature scaling, otherwise the distance measurements are meaningless. For any algorithm that uses gradient descent, scaling also speeds up convergence.
Some commonly used models:

Note: skewness affects PCA models, so it is better to remove skewness with a power transform first.
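The distance argument above can be made concrete: when one feature spans a much larger range than another, it dominates the Euclidean distance, and scaling restores the other feature's influence. A small sketch with made-up points:

```python
import numpy as np

# two made-up samples: feature 0 spans roughly [0, 1], feature 1 roughly [0, 100]
a = np.array([0.0, 0.0])
b = np.array([1.0, 100.0])

# the raw Euclidean distance is dominated almost entirely by feature 1
dist_raw = np.linalg.norm(a - b)  # sqrt(1 + 100**2), about 100.005

# after scaling each feature to [0, 1], both features contribute equally
a_scaled = np.array([0.0, 0.0])
b_scaled = np.array([1.0, 1.0])
dist_scaled = np.linalg.norm(a_scaled - b_scaled)  # sqrt(2), about 1.414
```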
1.1.2.1 Standard scaling (Z-score standardization)

Formula:

X' = (X - μ) / σ

where X is the variable (feature), μ is the mean of X, and σ is the standard deviation of X. This method is very sensitive to outliers, because outliers affect both μ and σ.
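To illustrate the sensitivity to outliers: a single extreme value shifts both μ and σ, which changes the z-score of every other point. A sketch with made-up numbers:

```python
import numpy as np

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

def zscore(x):
    # population standard deviation (ddof=0), as used by sklearn's StandardScaler
    return (x - x.mean()) / x.std()

# the z-score of the value 1.0 changes drastically once the outlier is present
z_clean = zscore(clean)[0]          # about -1.414
z_outlier = zscore(with_outlier)[0]  # about -0.538
```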
from sklearn.preprocessing import StandardScaler

# in order to mimic the operation in the real world, we shall fit the StandardScaler
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and take the rest of the samples in the first column as the train set
model = StandardScaler()
model.fit(train_set.reshape(-1, 1))
# fit on the train set and transform the test set (top ten numbers for simplification)
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# returns array([ 2.34539745,  2.33286782,  1.78324852,  0.93339178, -0.0125957 ,
#                 0.08774668, -0.11109548, -0.39490751, -0.94221041, -0.09419626])
# result is the same as (X[0:10, 0] - X[10:, 0].mean()) / X[10:, 0].std()

# visualize the distribution after the scaling
# fit and transform the entire first feature
fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12)
# this feature has a long-tail distribution

model = StandardScaler()
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12)
# the distribution is the same, but the scale changes
fig.tight_layout()
1.1.2.2 MinMaxScaler (scaling to a numerical range)

Suppose the target range we want to scale the feature to is (a, b). Formula:

X' = (X - Min) / (Max - Min) * (b - a) + a

where Min is the minimum of X and Max is the maximum of X. This method is also very sensitive to outliers, because outliers affect both Min and Max.
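The formula above can be written directly in NumPy, which is what MinMaxScaler computes for a given feature_range; a sketch with made-up values and the default target range (0, 1):

```python
import numpy as np

# hypothetical feature values for illustration
x = np.array([2.0, 4.0, 6.0, 10.0])
a, b = 0.0, 1.0  # target range

# normalize to [0, 1], then stretch/shift to [a, b]
x_std = (x - x.min()) / (x.max() - x.min())
x_scaled = x_std * (b - a) + a
# the minimum maps to a and the maximum maps to b
```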
from sklearn.preprocessing import MinMaxScaler

# in order to mimic the operation in the real world, we shall fit the MinMaxScaler
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and take the rest of the samples in the first column as the train set
model = MinMaxScaler(feature_range=(0, 1))  # set the range to be (0, 1)
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# returns array([0.53966842, 0.53802706, 0.46602805, 0.35469856, 0.23077613,
#                0.24392077, 0.21787286, 0.18069406, 0.1089985 , 0.22008662])
# result is the same as
# (X[0:10, 0] - X[10:, 0].min()) / (X[10:, 0].max() - X[10:, 0].min())

# visualize the distribution after the scaling
fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12)

model = MinMaxScaler(feature_range=(0, 1))
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12)
fig.tight_layout()
# now the scale changes to [0, 1]
1.1.2.3 RobustScaler (outlier-robust scaling)

Scale features using statistics that are robust to outliers (quantiles). Suppose the quantile range used for scaling is (a, b). Formula:

X' = (X - median(X)) / (Q_b(X) - Q_a(X))

where Q_a and Q_b are the a-th and b-th quantiles of X. This method is much more robust to outliers.
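The robustness claim can be checked directly: adding one extreme value barely moves the median and interquartile range, so the robust-scaled value of an ordinary point stays unchanged, whereas mean and standard deviation would shift. A sketch on made-up data:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
data_out = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # same data with one extreme value

def robust_scale(x, v):
    # center the value v by the median of x, scale by the interquartile range of x
    q1, q2, q3 = np.quantile(x, [0.25, 0.5, 0.75])
    return (v - q2) / (q3 - q1)

# scaled value of the point 2.0, without and with the outlier present:
r_clean = robust_scale(data, 2.0)
r_out = robust_scale(data_out, 2.0)
# both are -0.5, because the median and IQR are unaffected by the outlier
```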
import numpy as np
from sklearn.preprocessing import RobustScaler

# in order to mimic the operation in the real world, we shall fit the RobustScaler
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and take the rest of the samples in the first column as the train set
model = RobustScaler(with_centering=True, with_scaling=True,
                     quantile_range=(25.0, 75.0))
# with_centering=True => recenter the feature by X' = X - X.median()
# with_scaling=True => rescale the feature by the quantile range set by the user
# here the quantile range is set to (25%, 75%)
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# returns array([ 2.19755974,  2.18664281,  1.7077657 ,  0.96729508,  0.14306683,
#                 0.23049401,  0.05724508, -0.19003715, -0.66689601,  0.07196918])
# result is the same as
# (X[0:10, 0] - np.quantile(X[10:, 0], 0.5)) /
#     (np.quantile(X[10:, 0], 0.75) - np.quantile(X[10:, 0], 0.25))

# visualize the distribution after the scaling
sns.distplot(result, hist=True, kde=True, ax=ax[1])