apriori算法的改进.docx

资源描述

apriori算法的改进.docx

《apriori算法的改进.docx》由会员分享，可在线阅读，更多相关《apriori算法的改进.docx（11页珍藏版）》请在冰点文库上搜索。

apriori算法的改进.docx

apriori算法的改进

ADEVELOPEDALGORITHMOFAPRIORIBASEDONASSOCIATIONANALYSIS

Abstract

Basedonassociationanalysis,animprovedalgorithmofAprioriispresentedinthepaper.Themainideaofthealgorithmare:

（1）Counttheprobabilityofeachattributeitem（A1,A2,…Am）ofaDBbyscanningtheDBfirsttime;

（2）TheprobabilityofanytwoitemsAkandAmappearedsynchronouslyinonerecordisPkm.

min（Pk,Pm）≤Pkm≤Pk*Pm,

ifAkandAmistotalcorrelation,thenthePkmistheminimumofthePkandPm,;ifAkandAmistotalindependent,thenthePkmisPk*Pm;

Sowecanestimate:

Pkm=（a*min（Pk,Pm）+b*Pk*Pm）/（a+b）;a+b=1

Parameter“a”istheprobabilitywhileAkandAmaretotalcorrelation,Parameter“b”istheprobabilitywhileAkandAmaretotalindependent,

Parameter“a”and“b”canuseothermethodsuchasassociationanalysistocount.Inthispaperamethodforcalculatetheparameter“a”and”b”withassociationanalysisisprovided.

ifPkmismorethanthethresholdvaluewhichtheuserset,thenAk,Amarethefrequentitemsets.

YoucanusethemethodwhichdescribedabovetofindoutallthefrequentitemsetswithoutscanningDBsomanytimes.

（3）CountthesupportofthefrequentitemsetsbyscanningtheDBanothertime;（4）Outputtheassociationrulesfromthefrequentitemsets.

Thedetailedalgorithmandit'ssamplearedescribedinthepaper.Atlastwecompareditwithalgorithmapriori.ThebestqualityisthatthealgorithminourpaperreducethetimesofscanningDB.

KEYWORDS:

AssociationRuleAlgorithmAprioriFrequentItemsetAssociationAnalysis

1Introduction

Withthedevelopmentoftheinformationtechnologyandapplicationofdatabase.Thecollecteddatafarexceedpeople'sabilitytoanalyzeit.Thusnewandefficientmethodsareneededtodiscoverknowledgefromlargedatabases.Anddataminingappeared.

Dataminingisthecoreofknowledgediscoveryindatabases.It’saproceduretofindtheusefulandpotentiallyknowledgeindatabase.Associationrulesareoneofthemostimportantknowledgeofdatamining’sresultwhichcanbedefinedastherelationanddependencybetweentheitemsetsbygivensupportandconfidenceindatabase.

Fromtheaboveitcanbeconcludedthatthekeyproblemoftheaprioriisittaketoomuchtimetominingthefrequentitemsets.Thetimemainlycostsintwoareas,oneisthetimeforscanningthelargedatabaserepeatedlytheotherisforgeneratingfrequentitemsetswithJOIN.TherearealotofimprovedalgorithmforapriorisuchasAprioriTID,AprioriHybri,Multipleoins,ReorderandDirectetc.Themainideaofallthesealgorithmisaccordingthetheorythatthesubsetofafrequentitemsetisafrequentitemsetandthesupersetofainfrequentitemsetisainfrequentitemset.Theyscanthedatabaserepeatdlytominingtheassociationrules.

ThereisanotherfeatureforalgorithmAprioriTID,thesupportofthecandidatefrequentitemsetsarecalculatedonlyatthefirsttimeitscannedthedatabaseDandalsogeneratedcandidatetransactiondatabaseD’whichonlyincludesthecandidatefrequentitemsets.ThenthelatterminingarebasedonthedatabaseD’,ItreducethetimeofI/OoperationbecauseD’issmallerthanD,so,itenhancetheefficiencyofthealgorithm.

ForalgorithmAprioriHybri,itisthecombinationofalgorithmAprioriandAprioriTID.WhenthecandidatetransactiondatabaseD’couldn’tbecontainedintheRAM,itmineswithalgorithmAprioriotherwisewithalgorithmAprioriTID.

Themostimportantstepinminingassociationisgenerationfrequentitemsets.Inalgorithmapriorithemosttimeisconsumedbyscanningthedatabaserepeatedly.Itwouldreducetherunningtimeofthealgorithmbyreducingthetimesitscansthedatabasefarandaway.Inthispaperamethodofminingfrequentitemsetsbyevaluatingtheirprobabilityofsupportsbasedonassociationanalyzingwerementioned.First,itgainedtheprobabilityofevery1-itemsetbyscanningthedatabase,the1-itemsetwiththemorelargersupportthantheprobabilitytheusersetswouldbefrequent1-itemsets.Second,itevaluatestheprobabilityofevery2-itemset,every3-itemset,everyk-itemsetfromthefrequent1-itemsets.Third,itgainsallthecandidatefrequentitemsets.Fourth,itscansthedatabaseforverifyingthesupportofthecandidatefrequentitemsets,Lastthefrequentitemsetsareminedandassociationrulesalsodo.Inthemethoditreducesalotoftimesofscanningdatabaseandshortenedthecalculatetimeofthealgorithm.

2Relatedconcepts

Inthissection,wepresentthedefinitionsoftheconceptsthatareusedtodescribetheimprovedalgorithm.Letus

startfromthefollowingdefinitionsforassociationrules.

Let

beasetofallitems,whereanitemisanobjectwithsomepredefinedattributes（e.g.,

price,weight,etc.）.Atransaction

isatuple,wheretid

istheidentifierofthetransactionand

.AtransactiondatabaseTconsistsofasetoftransactions.Anitemsetisasubsetofthesetofitems.A

k-itemsetisanitemsetofsizek.Wewriteitemsetsas

omittingsetbrackets.

Definition1.AnassociationruletakestheformX⇒YwhereX⊂I,Y⊂I,andX∩Y=Φ.thesupportoftheruleX⇒Yintransactiondatabaseis:

support（X⇒Y）=|{T:

X∪Y⊆T,T∈D}|/|D|

markedbysupport（X⇒Y）.

Definition2.theconfidenceoftheruleX⇒Yintransactiondatabaseis:

confidence（X⇒Y）=|{T:

X∪Y⊆T,T∈D}|/|{T:

X⊆T,T∈D}|

markedbyconfidence（X⇒Y）.

Theminingassociationrulesproblemisgeneratingassociationruleswithitemsetswhichhavemorelargerorequalsupportandconfidencethantheusersettheminsuppandminconf.

Definition3.LetA=A1,A2,…,Anbeasetofallevents.If

P（Ai1Ai2…Aik）=P（Ai1）P（Ai2）…P（Aik）,

（1）

Foranyk（1

（1）isrignt,thenA1,A2,…,Anareinter-independentevents.

Candidatefrequentitemsetsrefertotheitemsetswhoseprobabilityislargerthantheusersets,theymaybethelastfrequentitemsetornot.

3TheImprovedalgorithmforapriori

3.1themainideaoftheimprovedalgorithm

LetP1,P2…PnaretheindependentprobabilityofeveryitemA1,A2…An,theprobabilityforanytwoitemAk,Am（Pk

IfAkandAmaretotalnon-correlation,fromdefinition3itcanbeconcludedthatPkm=Pk*Pm,ifAkandAmaretotalcorrelation,thenPkmistheminimumofthePkandPmthatisPk,so,Pk*Pm≤Pkm≤Pk

Nowtheproblemis:

GivenPkandPm,alsoPk*Pm≤Pkm≤Pk,PleaseevaluatePkm.Theproblemcouldn’tbesolved

withtheconditionsinmathematics.Butinfact,thereis

alotof

informationwithoutaccuratemathematicformula

whichbeomitted.Inthispaperitofferedamethodbyassociationanalysistoconfirmtheformula.

LetparameterabetheprobabilitywhichAkandAm

aretotalcorrelation,andparameterbfortotalnon-correlation.

a+b=1,0

（2）below:

Pkm=a∗Pk+b∗Pk∗Pm

（2）

Therearealotmethodstoconfirmthevalueofparameteraandb,inthispaperitprovideawaytodefine“a”and“b”byassociationanalysis.

3.2ConfirmtheParameter“a”,“b”byassociationanalysis

Thereareaseriesofcriterionaboutenvironment.Themostingredientsofthepollutioncanbeconfirmedbasedonthesource.Sowecanconsiderthecriterionasareferencedlistandthelistneededtofindthecorrelationasacomparisonlist.Thenwe’llgetthecorrelationcoefficientwhichistheparameter“a”inourformula

（2）,andb=1-a.Thedetailsasbelow:

LetS={S1,S2,…Sm}bethevaluelistofitemAm,S1,S2,…SmaresampleextractedfromtheDBandX={X1,X2,…Xm}bethevaluelistforitemAk,X1,X2,…XmaresampleextractedfromtheDB.

akm=（3）

intheformula（3）,akmisthecorrelationcoefficientofitemAmandAk,

=|Si--Xi|,рisthedistinguished

coefficientwhichsetbyusers,usually,р∈（0,1）.

Wecanusetheformula（3）tocalculatethecorrelationcoefficientofanyitems.

4Comparisonforthealgorithms

Itwascomparedbetweenouralgorithmandaprioriinthissection,thenumberofthecandidatefrequentitemsetsandtimesofscanningthedatabase.Whichalsowerethelinchpinoftheefficiencyinaalgorithm.Thealgorithminthispaperaheadaprioriinreducingthetimesofscanningthedatabase.Itwouldscanthedatabasektimestofindoutfrequentk-itemsetsinaprioriwhileonly2timesinouralgorithmbyputtingforwardaconceptofcandidatefrequentitemset..

Inthetable1below,itlistedthefrequent1-itemsets,2-itemsets,3-itemsetsforbothofthetwoalgorithmrespectively.Table1TheComparisonofthefrequentitemsets

（candidate）frequent

AlgorithmApriori

Algorithminthispaper

itemsets

1-itemsets

{I1,I2,I3,I4,I5}

2-itemsets

{I1,I2},{I1,I3},{I1,I5},{I2,I3},{I2,I4},

{I2,I5}

{I2,I5},{I3,I5}

3-itemsets

{I1,I2,I3},{I1,I2,I5}

{I1,I2,I3},{I1,I2,I5},{I2,I3,I5}

Fromthetable1,weknewthatthereweremorecandidatefrequentitemsetsinthealgorithmthispaperprovidedthanthatinthealgorithmapriori.Itmakesurethatitwouldn’tmissanyfrequentitemsets.

5Experimentandtheanalysisoftheresults

Inthissection,wereportourexperimentalresults.Tovalidatethealgorithmpresentedinthepaper,Adatabaseforrecordingtheupdateofthespatialdatabaseareusedintheexperiment.Thereare10itemsinthedatabase,suchasI1formender,I2forDEPTofthemender,I3forlayerofthetower,I4forlayeroftheroad,I5forlayerofthepolluterarea,I6forlayerofline,I7forlayerofwater,I8forlayerofbuilding,I9forlayerofthunderandI10forlayerplant.Thepurposeoftheexperimentwasmin

展开阅读全文