Springer.docx
《Springer.docx》由会员分享,可在线阅读,更多相关《Springer.docx(5页珍藏版)》请在冰点文库上搜索。
Springer
Springer
NearestNeighbourSearchForXMLTreesLszlKovcs,TiborRpsi,ErikaBaksa-Vargakovacs@iit.uni-miskolc.hu,repasi@iit.uni-miskolc.hu,iitev@uni-miskolc.huAbstract:
Thereisasignificantresearcheffortonefficientcomputingofsimilaritiesbetweenobjectsofnontraditionaldatatypesasstrings,documents,soundtracksorpictures.ItisreasonabletousetheresultsoftheseeffortsintheproblemofXMLtreematching,too.AsanXMLdocumenthasatreestructureandthetreescanbetransformedintoalinearstructure,atreecanberegardedasaspecialkindofstring.Inourapproach,theresultsofstringcomparisonwillbeextendedtomeasurethesimilaritybetweenXMLtrees.Keywords:
Treeeditdistance,Nearestneighboursearching1IntroductiontoXMLandXPathTheExtensibleMarkupLanguage(XML)isasimple,veryflexibletextformatderivedfromSGML(StandardGeneralizedMarkupLanguage)[1].XMLprovidesawayofdescribingdatainarich,flexibleandefficientwaybymarkingupdatawithdescriptivetags.However,XMLdoesnotprovideawaytolocatespecificpiecesofstructureddatawithinagivendocument.Instead,theXMLPathLanguage(XPath)providesasyntaxforlocatingspecificpartsofanXMLdocumenteffectivelyandefficiently.ItgetsitsnamefromitsuseofapathnotationasinURLsfornavigatingthroughthehierarchicalstructureofanXMLdocument.InXPath,anXMLdocumentisviewedconceptuallyasatreeinwhicheachpartofthedocumentisrepresentedasanode[2].ThestandardXPathmodelassumesrandomaccesstoXMLdataandalsotheabilitytoqueryXMLdataatanytime.XPathisnotastructurallanguagelikeXML;rather,itisastring-basedlanguageofexpressions.Anexpressionisevaluatedtoyieldanobject,whichhasoneofthefollowingfourbasictypes:
node-set(anunorderedcollectionofnodeswithoutduplicates);boolean;numberorstring.XPathexpressionsoftenoccurinXMLattributes.Expressionevaluationoccurswithrespecttoacontext.Thecontextconsistsof:
anode(thecontextnode),apairofnon-zeropositiveintegers(thecontextpositionandthecontextsize),asetofvariablebindings,afunctionlibraryandthesetofnamespacedeclarationsinscopefortheexpression[3].Animportantkindofexpressionisalocationpath.Therearetwokindsoflocationpaths:
relativelocationpathsandabsolutelocationpaths.
(1)Arelativelocationpathconsistsofasequenceofoneormorelocationstepsseparatedby/,andeachstepinturnselectsasetofnodesrelativetothecontextnode.Alocationstephasthreeparts:
anaxis,whichspecifiesthetreerelationshipbetweenthenodesselectedbythelocationstepandthecontextnode;anodetest,whichspecifiesthenodetypeandexpanded-nameofthenodesselectedbythelocationstep;andzeroormorepredicates,whichusearbitraryexpressionstofurtherrefinethesetofnodesselectedbythelocationstep.Thenode-setselectedbythelocationstepisthenode-setthatresultsfromgeneratinganinitialnode-setfromtheaxisandnode-test,andthenfilteringthatnode-setbyeachofthepredicatesinturn.
(2)Anabsolutelocationpathconsistsof/optionallyfollowedbyarelativelocationpath.A/byitselfselectstherootnodeofthedocumentcontainingthecontextnode.Ifitisfollowedbyarelativelocationpath,thenthelocationpathselectsthesetofnodesthatwouldbeselectedbytherelativelocationpathrelativetotherootnodeofthedocumentcontainingthecontextnode[3].Thusalocationpathselectsasetofnodes.Theresultofevaluatinganexpressionthatisalocationpathisthenode-setcontainingthenodesselectedbythelocationpath.Theinitialnode-setconsistsofthenodeshavingtherelationshiptothecontextnodespecifiedbytheaxis,andhavingthenodetypeandexpanded-namespecifiedbythenodetest.Theinitialnode-setisfilteredbythefirstpredicatetogenerateanewnode-set;thisnewnode-setisthenfilteredusingthesecondpredicate,andsoon.Thefinalnode-setisthenode-setselectedbythelocationstep.Theaxisaffectshowtheexpressionineachpredicateisevaluatedandsothesemanticsofapredicateisdefinedwithrespecttoanaxis.Everyaxishasaprincipalnodetype.Fortheattributeaxis,theprincipalnodetypeisattribute;forthenamespaceaxis,theprincipalnodetypeisnamespace;forotheraxes,theprincipalnodetypeiselement[3].ThedisadvantageofXPathisthatitisbasedonpartialmatching.Itmeansthatsomepartsofthetreemustmatchthegivenpatternexactly,whileothersaretotallyignored.Theresultedmeasuredoesnottakeintoaccountthedeviationfromthepattern.However,thisdeviationvaluewouldcarryusefulinformationforusers.Ouraimistofindanefficientalgorithmthatmeasuresthedifferencebetweenthepatternandtheselectedsubtrees.ThereforeallowingsearchingfornearestneighbourapproximationsofthepatternXMLdocument.2ThedistancebetweentwoXMLtreesTheproblemofobjectmatchingisaninterdisciplinarareaasitcollectsmethodsfromthreecurrentdisciplines:
computerscience,statisticsandoperationsresearch.Thereisasignificantresearcheffortonefficientcomputingofsimilaritiesbetweenobjectsofnontraditionaldatatypesasstrings,documents,soundtracksorpictures.ItisreasonabletousetheresultsoftheseeffortsintheproblemofXMLtreematching,too.AsanXMLdocumenthasatreestructureandthetreescanbetransformedintoalinearstructure,atreecanberegardedasaspecialkindofstring.Inourapproach,theresultsofstringcomparisonwillbeextendedtomeasurethesimilaritybetweenXMLtrees.ThestringsandXMLtreescanbeconsideredasobjectslocatedinmetricspace.AspaceXismetricifthereisarealnon-negativefunctionoftwoobjectsd(A,B)defined.Thefunctionisknownasthedistancebetweenthetwopoints.Itischaracterizedbythefollowingproperties.ForAX,BX,andCX1.d(A,B)=0ifandonlyifA=B(thedistanceis0ifandonlyifthepointscoincide);2.d(A,B)=d(B,A)(thedistancefromAtoBisthesameasthedistancefromBtoA);3.d(A,B)+d(B,C)d(A,C)(thesumoftwosidesofatriangleisneverlessthanthethirdside).ThebestknownstringmatchingalgorithmsincomputersciencearetheRusselSoundexcodingandtheLevenshteineditdistancemethods.TheSoundexmethodisusedinrecordlinkagemethodsmainlyforblockingthemasterrecordset.TheLevenshteineditdistanceisdefinedasthesmallestnumberofinsertions,deletions,andsubstitutionsrequiredtochangeonestringortreeintoanother.TheefficiencyofthealgorithmtocomputethedistancebetweenstringsisO(mn),wheremandnarethelengthsofthestrings.TheLevenshteindistanceisdefinedforstringsofarbitrarylength.Itcountsthedifferencesbetweentwostrings,wherewewouldcountadifferencenotonlywhenstringshavedifferentcharactersbutalsowhenonehasacharacterwhereastheotherdoesnot.Ingeneral,thetransformationofstringscanbeperformedbetweenstringsofdifferentcharactersets.Thus,theinputparameterforstringdistanceproblemisatriple(A,B,c),whereAisthecharactersetoftheinputstring,Bisthecharactersetoftheoutputstring,andcisthecostfunction.Thetransformationmayusuallyconsistsofthefollowingelementaryoperations:
insertion,deletionandsubstitution.Thereareseveralvariantsforelementarycostfunctions,amongwhichthesimplestoneistheunit-costfunction.Inthiscasetheinsertion,thedeletionofasinglecharacterhasthevalueof1,andthesubstitutionoftwodifferentcharactersisequalto1,too.Forastrings,lets(i)standforitsithcharacter.Fortwocharactersaandb,definec(0,a)=1(insertion)c(a,0)=1(deletion)c(a,b)=0ifa=b;1,otherwise(substitution)Assumewearegiventwostringssandtoflengthnandm,respectively.Thelengthvalueisdenotedbyanupperindex:
sn,tm..Thedistance,i.e.theminimalcostoftransformationfromstotisusuallycalculatedwithusingadynamicprogrammingalgorithm.Thedistancevalueisdefinedwitharecursiveformula:
d(sn,tm)=min{c(sn,tm)+d(sn-1,tm-1)c(sn,0)+d(sn-1,tm)c(0,tm)+d(sn,tm-1)wheresidenotesthecharateratthepositioniinstrings.Basically,editdistancecanbecomputedinO(nm)orO(nm/log(n))timeifthecostweightsarerational[7].Inalotofapplications,amoreprecisedistancevaluecanbeachievedwiththenormalizededitdistancemeasure.Thismeasuretakesnotonlythenumberofelementarytransformationsintoaccount,butalsothelengthoftheinputandoutputstrings.Thenormalizededitdistanceistheratioofthetotalweighttothelengthofthesequence.ThecomplexityofthenormalizededitdistanceisO(nmlog(n))[7].Thedistancevaluefortreescanbedefinedsimilarlytothedistanceofstrings.Animportantassumptioninourinvestigationisthatthelabelsofthetreesareordered.ThisassumptioncorrespondstotheXMLstandard.Usually,likein[4],thefollowingelementaryerationsaredefinedfortreeobjects:
-relabel:
assignsanewnodenametotherootofthetree-insert:
insertinganewnodeintothechildrenoftherootnode-delete:
deletinganodefromthechildrenoftherootnode-inserttree:
insertingatreeundertherootnode-deletingtree:
deletingatreefromthechildrenofthenodeInourapproach,thetreeinsertionortreedeletionoperatorsarereplacedwithalistofsinglenodeoperations.Accordingto[7],thetreedistancevaluecanbecalculatedusingthefollowingrecursiveformula:
d(0,0)=0d(F,0)=d(F-v,0)+c(v,0)d(0,F)=d(0,F-v)+c(0,v)d(F1,F2)=min{d(F1-v,F2)+c(T(v),0)d(F1,F2-v)+c(0,T(v))d(F1-T