Data Mining Report
An Application of the PCA and K-means Algorithms
Abstract
1. Introduction
2. Principles of PCA
3. Principles of K-means
4. Experimental results
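4.1 How PCA works
4.1 Results on the Iris data (4-D to 2-D)
Figure 1. Iris data
4.2 Results on the face data (5000 × 1024)
Figure 2. Original data
Figure 3. The first 36 principal components after PCA
4.3 Original images and images recovered from the first 100 principal components
Figure 4. Original data
Figure 5. Images recovered from the principal components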
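The two figures above show that the images recovered from the principal components match the originals except for some fine detail. PCA therefore reduces the dimensionality of this data effectively. For large, high-dimensional datasets it offers an effective way to cut computation and improve program efficiency.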
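The recovery step behind Figures 4 and 5 can be written out explicitly. The following is a sketch under the usual PCA conventions, with symbols named after the variables in the code of section 6.2 (`X_norm`, `U`, `S`, `Z`, `X_rec`). The covariance-plus-SVD form is an assumption, because the report does not list its `pca` helper.

```latex
% Assumed decomposition: SVD of the covariance of the normalized data
% (m examples, n features; here m = 5000, n = 1024)
\Sigma = \frac{1}{m} X_{\mathrm{norm}}^{\top} X_{\mathrm{norm}} = U S U^{\top}

% Projection onto the first K columns of U (K = 100 in section 4.3)
Z = X_{\mathrm{norm}} \, U_{K}, \qquad Z \in \mathbb{R}^{m \times K}

% Approximate recovery in the original n-dimensional space
X_{\mathrm{rec}} = Z \, U_{K}^{\top}, \qquad X_{\mathrm{rec}} \in \mathbb{R}^{m \times n}

% Fraction of variance kept by the first K components
\frac{\sum_{i=1}^{K} S_{ii}}{\sum_{i=1}^{n} S_{ii}}
```

With K = 100, each face is described by 100 numbers instead of 1024. The detail lost in Figure 5 corresponds to the variance carried by the 924 discarded components.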
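4.3 PCA for visualization
Figure 6. 3-D view based on K-means
Figure 7. 2-D view based on PCA
4.4 K-means applied to image compression
4.4.1 How K-means works
Figure 8. The K-means working process
4.4.2 Image compression results (128 colors to 16 colors)
Figure 9. K-means image compression experiment
5. Experimental conclusions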
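1. PCA is an effective dimensionality reduction method. For large datasets with many features, lowering the dimensionality reduces both the amount of computation and the program's running time.
2. PCA also supports visualization: reducing data to two or three dimensions makes it possible to plot.
3. K-means is a clustering method. For unlabeled data whose clusters are roughly spherical, it achieves a relatively high recognition rate.
4. K-means can be used for image compression and effectively reduces the amount of data to be stored.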
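The storage saving claimed in conclusion 4 can be estimated with simple arithmetic. The sketch below assumes a 128 × 128 RGB image with 8 bits per channel. Neither figure is stated in the report, so treat both as illustrative. K = 16 matches the setting used in section 6.1.

```matlab
% Assumed image properties (not stated in the report)
rows = 128;
cols = 128;
bits_per_channel = 8;

K = 16;  % number of colors kept, as in section 6.1

% Original: every pixel stores three channels
original_bits = rows * cols * 3 * bits_per_channel;

% Compressed: a palette of K colors plus one palette index per pixel
index_bits = ceil(log2(K));   % bits needed to address K colors
compressed_bits = K * 3 * bits_per_channel + rows * cols * index_bits;

ratio = original_bits / compressed_bits;
fprintf('original: %d bits, compressed: %d bits, ratio: %.2f\n', ...
        original_bits, compressed_bits, ratio);
```

Under these assumptions the original needs 393,216 bits and the compressed form 65,920 bits, roughly a sixfold saving. The palette is a small fixed cost, while the per-pixel index dominates the total.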
6. Experimental code
6.1 K-means code (partial)
clear; close all; clc
fprintf('Finding closest centroids.\n\n');
% Load an example dataset that we will be using
load('ex7data2.mat');
% Select an initial set of centroids
K = 3; % 3 centroids
initial_centroids = [3 3; 6 2; 8 5];
% Find the closest centroids for the examples using the
% initial_centroids
idx = findClosestCentroids(X, initial_centroids);
fprintf('Closest centroids for the first 3 examples: \n')
fprintf(' %d', idx(1:3));
fprintf('\n(the closest centroids should be 1, 3, 2 respectively)\n');
fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf('\nComputing centroids means.\n\n');
% Compute means based on the closest centroids found in the previous part.
centroids = computeCentroids(X, idx, K);
fprintf('Centroids computed after initial finding of closest centroids: \n')
fprintf(' %f %f \n', centroids');
fprintf('\n(the centroids should be\n');
fprintf('   [ 2.428301 3.157924 ]\n');
fprintf('   [ 5.813503 2.633656 ]\n');
fprintf('   [ 7.119387 3.616684 ])\n\n');
fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf('\nRunning K-Means clustering on example dataset.\n\n');
% Load an example dataset
load('ex7data2.mat');
% Settings for running K-Means
K = 3;
max_iters = 10;
% The initial centroids are fixed here so that the run is repeatable; in
% practice they are chosen randomly (see kMeansInitCentroids).
initial_centroids = [3 3; 6 2; 8 5];
% Run K-Means algorithm. The 'true' at the end tells our function to plot
% the progress of K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters, true);
fprintf('\nK-Means Done.\n\n');
fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf('\nRunning K-Means clustering on pixels from an image.\n\n');
% Load an image of a bird
A = double(imread('bird_small.png'));
A = A / 255; % Divide by 255 so that all values are in the range 0-1
% Size of the image
img_size = size(A);
% Reshape the image into an N x 3 matrix where N = number of pixels.
% Each row will contain the Red, Green and Blue pixel values.
% This gives us our dataset matrix X that we will use K-Means on.
X = reshape(A, img_size(1) * img_size(2), 3);
% Run your K-Means algorithm on this data
% You should try different values of K and max_iters here
K = 16;
max_iters = 10;
% When using K-Means, it is important to initialize the centroids
% randomly.
% You should complete the code in kMeansInitCentroids.m before proceeding
initial_centroids = kMeansInitCentroids(X, K);
% Run K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);
fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf('\nApplying K-Means to compress an image.\n\n');
% Find closest cluster members
idx = findClosestCentroids(X, centroids);
% Essentially, now we have represented the image X in terms of the
% indices in idx.
% We can now recover the image from the indices (idx) by mapping each pixel
% (specified by its index in idx) to the centroid value
X_recovered = centroids(idx, :);
% Reshape the recovered image into proper dimensions
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);
% Display the original image
subplot(1, 2, 1);
imagesc(A);
title('Original');
% Display compressed image side by side
subplot(1, 2, 2);
imagesc(X_recovered)
title(sprintf('Compressed, with %d colors.', K));
fprintf('Program paused. Press enter to continue.\n');
pause;
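The listing above calls `findClosestCentroids`, `computeCentroids` and `runkMeans`, but the report does not include them. As a rough illustration of what such helpers typically do, the sketch below runs the two K-means steps inline on a small made-up dataset. The data, the starting centroids and the fixed iteration count are all assumptions chosen for the example, not the author's implementation.

```matlab
% Toy data: two well-separated groups in 2-D (made up for illustration)
X = [1.0 1.1; 0.9 1.0; 1.2 0.8; 5.0 5.1; 5.2 4.9; 4.8 5.0];
centroids = [0 0; 6 6];          % hypothetical starting centroids
K = size(centroids, 1);
m = size(X, 1);
max_iters = 10;

for iter = 1:max_iters
    % Assignment step: squared Euclidean distance to every centroid
    D = zeros(m, K);
    for k = 1:K
        delta = X - repmat(centroids(k, :), m, 1);
        D(:, k) = sum(delta .^ 2, 2);
    end
    [~, idx] = min(D, [], 2);    % index of the nearest centroid per point

    % Update step: move each centroid to the mean of its members
    for k = 1:K
        members = X(idx == k, :);
        if ~isempty(members)     % leave an empty cluster where it is
            centroids(k, :) = mean(members, 1);
        end
    end
end

disp(idx');
disp(centroids);
```

The assignment step plays the role of `findClosestCentroids` and the update step that of `computeCentroids`. Keeping an empty cluster's centroid in place is one simple choice, and reinitializing it at a random point is a common alternative.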
6.2 PCA code (partial)
clear; close all; clc
fprintf('Visualizing example dataset for PCA.\n\n');
% load('ex7data1.mat');
X = importdata('Iris.txt');
X = X(:, 1:4); % keep the first four columns (the features)
% Visualize the example dataset (first two features)
plot(X(:, 1), X(:, 2), 'bo');
axis([0.5 6.5 2 8]); axis square;
fprintf('Program paused. Press enter to continue.\n');
pause;
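fprintf('\nRunning PCA on example dataset.\n\n');
% Before running PCA, it is important to first normalize X
[X_norm, mu, sigma] = featureNormalize(X);
% Run PCA
[U, S] = pca(X_norm);
% mu, the mean of each feature, was returned by featureNormalize.
% Draw the first two eigenvectors centered at the mean of the data.
hold on;
drawLine(mu, mu + 1.5 * S(1,1) * U(:,1)', '-k', 'LineWidth', 2);
drawLine(mu, mu + 1.5 * S(2,2) * U(:,2)', '-k', 'LineWidth', 2);
hold off;
fprintf('Top eigenvector: \n');
fprintf(' U(:,1) = %f %f \n', U(1,1), U(2,1));
% Note: the reference value below belongs to the ex7data1 example in the
% commented-out load above; expect different numbers for Iris.txt.
fprintf('\n(you should expect to see -0.707107 -0.707107)\n');
fprintf('Program paused. Press enter to continue.\n');
pause;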
fprintf('\nDimension reduction on example dataset.\n\n');
% Plot the normalized dataset (returned from featureNormalize)
plot(X_norm(:, 1), X_norm(:, 2), 'bo');
axis([-4 3 -4 3]); axis square
% Project the data onto K dimensions (K = 1 in the original example)
% K = 1;
K = 2;
Z = projectData(X_norm, U, K);
fprintf('Projection of the first example: %f\n', Z(1));
% Note: the reference values below belong to the ex7data1 example with
% K = 1; expect different numbers for Iris.txt.
fprintf('\n(this value should be about 1.481274)\n\n');
X_rec = recoverData(Z, U, K);
fprintf('Approximation of the first example: %f %f\n', X_rec(1, 1), X_rec(1, 2));
fprintf('\n(this value should be about -1.047419 -1.047419)\n\n');
% Draw lines connecting the projected points to the original points
hold on;
plot(X_rec(:, 1), X_rec(:, 2), 'ro');
for i = 1:size(X_norm, 1)
    drawLine(X_norm(i, :), X_rec(i, :), '--k', 'LineWidth', 1);
end
hold off
fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf('\nLoading face dataset.\n\n');
% Load face dataset
load('ex7faces.mat')
% Display the first 36 faces in the dataset
displayData(X(1:36, :));
title('Original faces before dimensionality reduction');
fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf(['\nRunning PCA on face dataset.\n' ...
         '(this might take a minute or two ...)\n\n']);
% Before running PCA, it is important to first normalize X by subtracting
% the mean value from each feature
[X_norm, mu, sigma] = featureNormalize(X);
% Run PCA
[U, S] = pca(X_norm);
% Visualize the top 36 eigenvectors found
displayData(U(:, 1:36)');
title('Top 36 principal components of the face data');
fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf('\nDimension reduction for face dataset.\n\n');
K = 100;
Z = projectData(X_norm, U, K);
fprintf('The projected data Z has a size of: ')
fprintf('%d ', size(Z));
fprintf('\n\nProgram paused. Press enter to continue.\n');
pause;
fprintf('\nVisualizing the projected (reduced dimension) faces.\n\n');
K = 100;
X_rec = recoverData(Z, U, K);
figure(2);
% Display normalized data
% subplot(1, 2, 1);
displayData(X_norm(1:36, :));
title('Original faces');
axis square;
% Display reconstructed data from only K eigenfaces
% subplot(1, 2, 2);
figure(3);
displayData(X_rec(1:36, :));
title('Recovered faces');
axis square;
fprintf('Program paused. Press enter to continue.\n');
pause;
close all; clc
% Reload the bird image and rerun K-Means on its pixels
A = double(imread('bird_small.png'));
A = A / 255;
img_size = size(A);
X = reshape(A, img_size(1) * img_size(2), 3);
K = 16;
max_iters = 10;
initial_centroids = kMeansInitCentroids(X, K);
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);
% Sample 1000 random indexes
sel = floor(rand(1000, 1) * size(X, 1)) + 1;
% Set up the color palette
palette = hsv(K);
colors = palette(idx(sel), :);
% Visualize the data and centroid memberships in 3D
% (X holds the pixels of bird_small.png here, not the face data)
figure;
scatter3(X(sel, 1), X(sel, 2), X(sel, 3), 10, colors);
title('Pixel dataset plotted in 3D. Color shows centroid memberships');
fprintf('Program paused. Press enter to continue.\n');
pause;
% Use PCA to project this cloud to 2D for visualization
% Subtract the mean to use PCA
[X_norm, mu, sigma] = featureNormalize(X);
% PCA and project the data to 2D
[U, S] = pca(X_norm);
Z = projectData(X_norm, U, 2);
% Plot in 2D (again the pixel data, not the face data)
figure;
plotDataPoints(Z(sel, :), idx(sel), K);
title('Pixel dataset plotted in 2D, using PCA for dimensionality reduction');
fprintf('Program paused. Press enter to continue.\n');
pause;
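The PCA listing relies on `featureNormalize`, `pca`, `projectData` and `recoverData`, none of which appear in the report. Note that the two-output call `[U, S] = pca(X_norm)` refers to a user-defined helper, not the `pca` function of MATLAB's Statistics and Machine Learning Toolbox, which returns different outputs. The sketch below shows one common way to build these steps inline, assuming normalization to zero mean and unit standard deviation and an SVD of the covariance matrix. The toy matrix is made up for illustration.

```matlab
% Toy data: 5 examples, 3 features (made up for illustration)
X = [2.5 2.4 1.0; 0.5 0.7 0.2; 2.2 2.9 1.1; 1.9 2.2 0.8; 3.1 3.0 1.4];
m = size(X, 1);

% Normalization step (assumed behavior of featureNormalize)
mu = mean(X, 1);
sigma = std(X, 0, 1);
X_norm = (X - repmat(mu, m, 1)) ./ repmat(sigma, m, 1);

% PCA step (assumed behavior of the pca helper): SVD of the covariance
C = (X_norm' * X_norm) / m;
[U, S, ~] = svd(C);

% Projection onto the first K components (projectData)
K = 2;
Z = X_norm * U(:, 1:K);

% Approximate recovery in the original space (recoverData)
X_rec = Z * U(:, 1:K)';

% Fraction of the variance kept by the first K components
s = diag(S);
retained = sum(s(1:K)) / sum(s);
fprintf('Z is %d x %d, variance retained: %.4f\n', ...
        size(Z, 1), size(Z, 2), retained);
```

The columns of `U` are orthonormal, so projecting onto all of them and recovering reproduces `X_norm` exactly. Dropping columns trades accuracy for a smaller `Z`, which is the effect seen in the recovered faces.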