Feature Engineering Handbook with Jupyter: Data Preprocessing
1.1.1.1 Binarization

Binarize numerical features.
# load the sample data
from sklearn.datasets import fetch_california_housing

dataset = fetch_california_housing()
X, y = dataset.data, dataset.target
# we will take the first column as the example later

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sns.distplot(X[:, 0], hist=True, kde=True)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);
# this feature has a long-tail distribution

from sklearn.preprocessing import Binarizer

sample_columns = X[0:10, 0]  # select the top 10 samples
# returns array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

model = Binarizer(threshold=6)  # set 6 to be the threshold; if value <= 6, return 0, else return 1
result = model.fit_transform(sample_columns.reshape(-1, 1)).reshape(-1)
# returns array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0.])
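The same thresholding can be reproduced with plain NumPy, which makes the Binarizer semantics explicit: only values strictly greater than the threshold map to 1. A minimal sketch on made-up sample values (not the California housing column):

```python
import numpy as np

# hypothetical sample values for illustration
sample = np.array([8.3, 5.6, 6.0, 2.1, 7.2])

threshold = 6
# strictly greater than the threshold -> 1.0, otherwise -> 0.0
binarized = (sample > threshold).astype(float)
# note that 6.0 itself maps to 0.0, matching Binarizer's "value <= threshold" rule
```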
1.1.1.2 Binning

Bin numerical features.

Uniform binning:
from sklearn.preprocessing import KBinsDiscretizer

# in order to mimic the operation in the real world, we shall fit the KBinsDiscretizer
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and take the rest of the samples in the first column as the train set
test_set = X[0:10, 0]
train_set = X[10:, 0]

model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
# set 5 bins, return ordinal bin numbers, set all bins to have identical widths
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# returns array([2., 2., 2., 1., 1., 1., 1., 0., 0., 1.])

bin_edge = model.bin_edges_[0]
# returns array([0.4999, 3.39994, 6.29998, 9.20002, 12.10006, 15.0001]), the bin edges

# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist=True, kde=True)
for edge in bin_edge:  # uniform bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Uniform Bin Edges'], fontsize=10)
Quantile binning:
model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
# set 5 bins, return ordinal bin numbers, set all bins based on quantiles
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# returns array([4., 4., 4., 4., 2., 3., 2., 1., 0., 2.])

bin_edge = model.bin_edges_[0]
# returns array([0.4999, 2.3523, 3.1406, 3.9667, 5.10824, 15.0001]), the bin edges
# 2.3523 is the 20% quantile, 3.1406 is the 40% quantile, etc.

# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist=True, kde=True)
for edge in bin_edge:  # quantile based bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Quantile Bin Edges'], fontsize=10)
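Quantile bin edges can also be computed by hand, which clarifies what the quantile strategy does internally: the edges sit at evenly spaced quantiles of the training data, and `np.digitize` then assigns ordinal bin numbers. A sketch on made-up data (not the housing feature):

```python
import numpy as np

# hypothetical training values for illustration
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

n_bins = 5
# edges at the 0%, 20%, 40%, 60%, 80%, 100% quantiles
edges = np.quantile(train, np.linspace(0, 1, n_bins + 1))

test = np.array([1.5, 5.5, 9.5])
# np.digitize returns 1-based interval indices; subtract 1 for ordinal bins,
# and clip so values at or beyond the outer edges stay in the outer bins
bins = np.clip(np.digitize(test, edges) - 1, 0, n_bins - 1)
```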
1.1.2 Scaling
Features on different scales are hard to compare, especially in linear models such as linear regression and logistic regression. Models based on Euclidean distance, such as k-means clustering or KNN, require feature scaling, otherwise the distance measurements are meaningless. For any algorithm that uses gradient descent, scaling also speeds up convergence.
Some commonly used models:

Note: skewness affects PCA models, so it is better to remove skewness with a power transform first.
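The distance argument above can be made concrete: when one feature spans a much larger range than another, it dominates the Euclidean distance, and scaling restores the other feature's influence. A small sketch with made-up points:

```python
import numpy as np

# two made-up samples: feature 0 spans roughly [0, 1], feature 1 roughly [0, 100]
a = np.array([0.0, 0.0])
b = np.array([1.0, 100.0])

# the raw Euclidean distance is dominated almost entirely by feature 1
dist_raw = np.linalg.norm(a - b)  # sqrt(1 + 100**2), about 100.005

# after scaling each feature to [0, 1], both features contribute equally
a_scaled = np.array([0.0, 0.0])
b_scaled = np.array([1.0, 1.0])
dist_scaled = np.linalg.norm(a_scaled - b_scaled)  # sqrt(2), about 1.414
```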
1.1.2.1 Standard scaling (Z-score standardization)

Formula:

X' = (X - μ) / σ

where X is the variable (feature), μ is the mean of X, and σ is the standard deviation of X. This method is very sensitive to outliers, because outliers affect both μ and σ.
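To illustrate the sensitivity to outliers: a single extreme value shifts both μ and σ, which changes the z-score of every other point. A sketch with made-up numbers:

```python
import numpy as np

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

def zscore(x):
    # population standard deviation (ddof=0), as used by sklearn's StandardScaler
    return (x - x.mean()) / x.std()

# the z-score of the value 1.0 changes drastically once the outlier is present
z_clean = zscore(clean)[0]          # about -1.414
z_outlier = zscore(with_outlier)[0]  # about -0.538
```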
from sklearn.preprocessing import StandardScaler

# in order to mimic the operation in the real world, we shall fit the StandardScaler
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and take the rest of the samples in the first column as the train set
model = StandardScaler()
model.fit(train_set.reshape(-1, 1))
# fit on the train set and transform the test set (top ten numbers for simplification)
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# returns array([ 2.34539745,  2.33286782,  1.78324852,  0.93339178, -0.0125957 ,
#                 0.08774668, -0.11109548, -0.39490751, -0.94221041, -0.09419626])
# result is the same as (X[0:10, 0] - X[10:, 0].mean()) / X[10:, 0].std()

# visualize the distribution after the scaling
# fit and transform the entire first feature
fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12)
# this feature has a long-tail distribution

model = StandardScaler()
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12)
# the distribution is the same, but the scale changes
fig.tight_layout()
1.1.2.2 MinMaxScaler (scaling to a numerical range)

Suppose the target range we want to scale the feature to is (a, b). Formula:

X' = (X - Min) / (Max - Min) * (b - a) + a

where Min is the minimum of X and Max is the maximum of X. This method is also very sensitive to outliers, because outliers affect both Min and Max.
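The formula above can be written directly in NumPy, which is what MinMaxScaler computes for a given feature_range; a sketch with made-up values and the default target range (0, 1):

```python
import numpy as np

# hypothetical feature values for illustration
x = np.array([2.0, 4.0, 6.0, 10.0])
a, b = 0.0, 1.0  # target range

# normalize to [0, 1], then stretch/shift to [a, b]
x_std = (x - x.min()) / (x.max() - x.min())
x_scaled = x_std * (b - a) + a
# the minimum maps to a and the maximum maps to b
```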
from sklearn.preprocessing import MinMaxScaler

# in order to mimic the operation in the real world, we shall fit the MinMaxScaler
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and take the rest of the samples in the first column as the train set
model = MinMaxScaler(feature_range=(0, 1))  # set the range to be (0, 1)
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# returns array([0.53966842, 0.53802706, 0.46602805, 0.35469856, 0.23077613,
#                0.24392077, 0.21787286, 0.18069406, 0.1089985 , 0.22008662])
# result is the same as
# (X[0:10, 0] - X[10:, 0].min()) / (X[10:, 0].max() - X[10:, 0].min())

# visualize the distribution after the scaling
fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12)

model = MinMaxScaler(feature_range=(0, 1))
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12)
fig.tight_layout()
# now the scale changes to [0, 1]
1.1.2.3 RobustScaler (outlier-robust scaling)

Scale features using statistics that are robust to outliers (quantiles). Suppose the quantile range used for scaling is (a, b). Formula:

X' = (X - median(X)) / (Q_b(X) - Q_a(X))

where Q_a and Q_b are the a-th and b-th quantiles of X. This method is much more robust to outliers.
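The robustness claim can be checked directly: adding one extreme value barely moves the median and interquartile range, so the robust-scaled value of an ordinary point stays unchanged, whereas mean and standard deviation would shift. A sketch on made-up data:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
data_out = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # same data with one extreme value

def robust_scale(x, v):
    # center the value v by the median of x, scale by the interquartile range of x
    q1, q2, q3 = np.quantile(x, [0.25, 0.5, 0.75])
    return (v - q2) / (q3 - q1)

# scaled value of the point 2.0, without and with the outlier present:
r_clean = robust_scale(data, 2.0)
r_out = robust_scale(data_out, 2.0)
# both are -0.5, because the median and IQR are unaffected by the outlier
```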
import numpy as np
from sklearn.preprocessing import RobustScaler

# in order to mimic the operation in the real world, we shall fit the RobustScaler
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and take the rest of the samples in the first column as the train set
model = RobustScaler(with_centering=True, with_scaling=True,
                     quantile_range=(25.0, 75.0))
# with_centering=True => recenter the feature by X' = X - X.median()
# with_scaling=True => rescale the feature by the quantile range set by the user
# here the quantile range is set to (25%, 75%)
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# returns array([ 2.19755974,  2.18664281,  1.7077657 ,  0.96729508,  0.14306683,
#                 0.23049401,  0.05724508, -0.19003715, -0.66689601,  0.07196918])
# result is the same as
# (X[0:10, 0] - np.quantile(X[10:, 0], 0.5)) /
#     (np.quantile(X[10:, 0], 0.75) - np.quantile(X[10:, 0], 0.25))

# visualize the distribution after the scaling
sns.distplot(result, hist=True, kde=True, ax=ax[1])