apriori算法的改进.docx
《apriori算法的改进.docx》由会员分享,可在线阅读,更多相关《apriori算法的改进.docx(11页珍藏版)》请在冰点文库上搜索。
apriori算法的改进
ADEVELOPEDALGORITHMOFAPRIORIBASEDONASSOCIATIONANALYSIS
Abstract
Basedonassociationanalysis,animprovedalgorithmofAprioriispresentedinthepaper.Themainideaofthealgorithmare:
(1)Counttheprobabilityofeachattributeitem(A1,A2,…Am)ofaDBbyscanningtheDBfirsttime;
(2)TheprobabilityofanytwoitemsAkandAmappearedsynchronouslyinonerecordisPkm.
min(Pk,Pm)≤Pkm≤Pk*Pm,
ifAkandAmistotalcorrelation,thenthePkmistheminimumofthePkandPm,;ifAkandAmistotalindependent,thenthePkmisPk*Pm;
Sowecanestimate:
Pkm=(a*min(Pk,Pm)+b*Pk*Pm)/(a+b);a+b=1
Parameter“a”istheprobabilitywhileAkandAmaretotalcorrelation,Parameter“b”istheprobabilitywhileAkandAmaretotalindependent,
Parameter“a”and“b”canuseothermethodsuchasassociationanalysistocount.Inthispaperamethodforcalculatetheparameter“a”and”b”withassociationanalysisisprovided.
ifPkmismorethanthethresholdvaluewhichtheuserset,thenAk,Amarethefrequentitemsets.
YoucanusethemethodwhichdescribedabovetofindoutallthefrequentitemsetswithoutscanningDBsomanytimes.
(3)CountthesupportofthefrequentitemsetsbyscanningtheDBanothertime;(4)Outputtheassociationrulesfromthefrequentitemsets.
Thedetailedalgorithmandit'ssamplearedescribedinthepaper.Atlastwecompareditwithalgorithmapriori.ThebestqualityisthatthealgorithminourpaperreducethetimesofscanningDB.
KEYWORDS:
AssociationRuleAlgorithmAprioriFrequentItemsetAssociationAnalysis
1Introduction
Withthedevelopmentoftheinformationtechnologyandapplicationofdatabase.Thecollecteddatafarexceedpeople'sabilitytoanalyzeit.Thusnewandefficientmethodsareneededtodiscoverknowledgefromlargedatabases.Anddataminingappeared.
Dataminingisthecoreofknowledgediscoveryindatabases.It’saproceduretofindtheusefulandpotentiallyknowledgeindatabase.Associationrulesareoneofthemostimportantknowledgeofdatamining’sresultwhichcanbedefinedastherelationanddependencybetweentheitemsetsbygivensupportandconfidenceindatabase.
Fromtheaboveitcanbeconcludedthatthekeyproblemoftheaprioriisittaketoomuchtimetominingthefrequentitemsets.Thetimemainlycostsintwoareas,oneisthetimeforscanningthelargedatabaserepeatedlytheotherisforgeneratingfrequentitemsetswithJOIN.TherearealotofimprovedalgorithmforapriorisuchasAprioriTID,AprioriHybri,Multipleoins,ReorderandDirectetc.Themainideaofallthesealgorithmisaccordingthetheorythatthesubsetofafrequentitemsetisafrequentitemsetandthesupersetofainfrequentitemsetisainfrequentitemset.Theyscanthedatabaserepeatdlytominingtheassociationrules.
ThereisanotherfeatureforalgorithmAprioriTID,thesupportofthecandidatefrequentitemsetsarecalculatedonlyatthefirsttimeitscannedthedatabaseDandalsogeneratedcandidatetransactiondatabaseD’whichonlyincludesthecandidatefrequentitemsets.ThenthelatterminingarebasedonthedatabaseD’,ItreducethetimeofI/OoperationbecauseD’issmallerthanD,so,itenhancetheefficiencyofthealgorithm.
ForalgorithmAprioriHybri,itisthecombinationofalgorithmAprioriandAprioriTID.WhenthecandidatetransactiondatabaseD’couldn’tbecontainedintheRAM,itmineswithalgorithmAprioriotherwisewithalgorithmAprioriTID.
Themostimportantstepinminingassociationisgenerationfrequentitemsets.Inalgorithmapriorithemosttimeisconsumedbyscanningthedatabaserepeatedly.Itwouldreducetherunningtimeofthealgorithmbyreducingthetimesitscansthedatabasefarandaway.Inthispaperamethodofminingfrequentitemsetsbyevaluatingtheirprobabilityofsupportsbasedonassociationanalyzingwerementioned.First,itgainedtheprobabilityofevery1-itemsetbyscanningthedatabase,the1-itemsetwiththemorelargersupportthantheprobabilitytheusersetswouldbefrequent1-itemsets.Second,itevaluatestheprobabilityofevery2-itemset,every3-itemset,everyk-itemsetfromthefrequent1-itemsets.Third,itgainsallthecandidatefrequentitemsets.Fourth,itscansthedatabaseforverifyingthesupportofthecandidatefrequentitemsets,Lastthefrequentitemsetsareminedandassociationrulesalsodo.Inthemethoditreducesalotoftimesofscanningdatabaseandshortenedthecalculatetimeofthealgorithm.
2Relatedconcepts
Inthissection,wepresentthedefinitionsoftheconceptsthatareusedtodescribetheimprovedalgorithm.Letus
startfromthefollowingdefinitionsforassociationrules.
Let
beasetofallitems,whereanitemisanobjectwithsomepredefinedattributes(e.g.,
price,weight,etc.).Atransaction
isatuple,wheretid
istheidentifierofthetransactionand
.AtransactiondatabaseTconsistsofasetoftransactions.Anitemsetisasubsetofthesetofitems.A
k-itemsetisanitemsetofsizek.Wewriteitemsetsas
omittingsetbrackets.
Definition1.AnassociationruletakestheformX⇒YwhereX⊂I,Y⊂I,andX∩Y=Φ.thesupportoftheruleX⇒Yintransactiondatabaseis:
support(X⇒Y)=|{T:
X∪Y⊆T,T∈D}|/|D|
markedbysupport(X⇒Y).
Definition2.theconfidenceoftheruleX⇒Yintransactiondatabaseis:
confidence(X⇒Y)=|{T:
X∪Y⊆T,T∈D}|/|{T:
X⊆T,T∈D}|
markedbyconfidence(X⇒Y).
Theminingassociationrulesproblemisgeneratingassociationruleswithitemsetswhichhavemorelargerorequalsupportandconfidencethantheusersettheminsuppandminconf.
2
Definition3.LetA=A1,A2,…,Anbeasetofallevents.If
P(Ai1Ai2…Aik)=P(Ai1)P(Ai2)…P(Aik),
(1)
Foranyk(1(1)isrignt,thenA1,A2,…,Anareinter-independentevents.
Candidatefrequentitemsetsrefertotheitemsetswhoseprobabilityislargerthantheusersets,theymaybethelastfrequentitemsetornot.
3TheImprovedalgorithmforapriori
3.1themainideaoftheimprovedalgorithm
LetP1,P2…PnaretheindependentprobabilityofeveryitemA1,A2…An,theprobabilityforanytwoitemAk,Am(PkIfAkandAmaretotalnon-correlation,fromdefinition3itcanbeconcludedthatPkm=Pk*Pm,ifAkandAmaretotalcorrelation,thenPkmistheminimumofthePkandPmthatisPk,so,Pk*Pm≤Pkm≤Pk
Nowtheproblemis:
GivenPkandPm,alsoPk*Pm≤Pkm≤Pk,PleaseevaluatePkm.Theproblemcouldn’tbesolved
withtheconditionsinmathematics.Butinfact,thereis
alotof
informationwithoutaccuratemathematicformula
whichbeomitted.Inthispaperitofferedamethodbyassociationanalysistoconfirmtheformula.
LetparameterabetheprobabilitywhichAkandAm
aretotalcorrelation,andparameterbfortotalnon-correlation.
a+b=1,0(2)below:
Pkm=a∗Pk+b∗Pk∗Pm
(2)
Therearealotmethodstoconfirmthevalueofparameteraandb,inthispaperitprovideawaytodefine“a”and“b”byassociationanalysis.
3.2ConfirmtheParameter“a”,“b”byassociationanalysis
Thereareaseriesofcriterionaboutenvironment.Themostingredientsofthepollutioncanbeconfirmedbasedonthesource.Sowecanconsiderthecriterionasareferencedlistandthelistneededtofindthecorrelationasacomparisonlist.Thenwe’llgetthecorrelationcoefficientwhichistheparameter“a”inourformula
(2),andb=1-a.Thedetailsasbelow:
LetS={S1,S2,…Sm}bethevaluelistofitemAm,S1,S2,…SmaresampleextractedfromtheDBandX={X1,X2,…Xm}bethevaluelistforitemAk,X1,X2,…XmaresampleextractedfromtheDB.
akm=(3)
intheformula(3),akmisthecorrelationcoefficientofitemAmandAk,
=|Si--Xi|,рisthedistinguished
coefficientwhichsetbyusers,usually,р∈(0,1).
Wecanusetheformula(3)tocalculatethecorrelationcoefficientofanyitems.
4Comparisonforthealgorithms
Itwascomparedbetweenouralgorithmandaprioriinthissection,thenumberofthecandidatefrequentitemsetsandtimesofscanningthedatabase.Whichalsowerethelinchpinoftheefficiencyinaalgorithm.Thealgorithminthispaperaheadaprioriinreducingthetimesofscanningthedatabase.Itwouldscanthedatabasektimestofindoutfrequentk-itemsetsinaprioriwhileonly2timesinouralgorithmbyputtingforwardaconceptofcandidatefrequentitemset..
Inthetable1below,itlistedthefrequent1-itemsets,2-itemsets,3-itemsetsforbothofthetwoalgorithmrespectively.Table1TheComparisonofthefrequentitemsets
(candidate)frequent
AlgorithmApriori
Algorithminthispaper
itemsets
1-itemsets
{I1,I2,I3,I4,I5}
{I1,I2,I3,I4,I5}
2-itemsets
{I1,I2},{I1,I3},{I1,I5},{I2,I3},{I2,I4},
{I1,I2},{I1,I3},{I1,I5},{I2,I3},{I2,I4},
{I2,I5}
{I2,I5},{I3,I5}
3-itemsets
{I1,I2,I3},{I1,I2,I5}
{I1,I2,I3},{I1,I2,I5},{I2,I3,I5}
Fromthetable1,weknewthatthereweremorecandidatefrequentitemsetsinthealgorithmthispaperprovidedthanthatinthealgorithmapriori.Itmakesurethatitwouldn’tmissanyfrequentitemsets.
5Experimentandtheanalysisoftheresults
Inthissection,wereportourexperimentalresults.Tovalidatethealgorithmpresentedinthepaper,Adatabaseforrecordingtheupdateofthespatialdatabaseareusedintheexperiment.Thereare10itemsinthedatabase,suchasI1formender,I2forDEPTofthemender,I3forlayerofthetower,I4forlayeroftheroad,I5forlayerofthepolluterarea,I6forlayerofline,I7forlayerofwater,I8forlayerofbuilding,I9forlayerofthunderandI10forlayerplant.Thepurposeoftheexperimentwasmin