Web Crawlers: Foreign-Literature Translation for a Graduation Thesis
Foreign-Language Material
ABSTRACT
Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
1. INTRODUCTION
A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.
Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is
(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)–(c)
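Steps (a)–(c) can be sketched as a simple loop over a frontier of unfetched URLs. In the sketch below, `fetch` and `extract_urls` are hypothetical caller-supplied stand-ins for an HTTP client and a link parser; they are not part of the paper's algorithm description.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_urls, max_pages=1000):
    """Sketch of the basic crawl loop: (a) fetch, (b) parse,
    (c) enqueue URLs not seen before."""
    seen = set(seed_urls)        # the membership test of step (c)
    frontier = deque(seed_urls)  # URLs discovered but not yet fetched
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        page = fetch(url)                  # step (a)
        pages += 1
        for link in extract_urls(page):    # step (b)
            if link not in seen:           # step (c)
                seen.add(link)
                frontier.append(link)
    return seen
```

For example, with a toy in-memory link graph, `fetch` can be a dictionary lookup and `extract_urls` the identity function.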
Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as , but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.
If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and taught in many introductory algorithms classes. (See for instance [34].)
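The two strategies can be sketched with a single traversal routine in which only the frontier discipline differs: a FIFO queue yields BFS, while a LIFO stack yields DFS. This is a minimal generic sketch (the graph is supplied as a neighbor function; it is not taken from the paper).

```python
from collections import deque

def traverse(start, neighbors, bfs=True):
    """Generic graph traversal. Popping from the front of the frontier
    gives BFS; popping from the back gives DFS."""
    seen = {start}
    frontier = deque([start])
    order = []                    # visit order, for illustration
    while frontier:
        node = frontier.popleft() if bfs else frontier.pop()
        order.append(node)
        for n in neighbors(node):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return order
```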
However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.
1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.
2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].
These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
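The quoted rates follow from simple arithmetic; the links-per-page figure below is an illustrative assumption, not a number given in the paper.

```python
# Back-of-the-envelope check of the crawl-rate figures in the text.
pages_per_day = 100_000_000
seconds_per_day = 24 * 60 * 60            # 86,400
fetches_per_second = pages_per_day / seconds_per_day   # "about 1,000 per second"

# Assuming roughly 10 links per page (an illustrative figure), the
# membership test of step (c) runs about ten times as often.
links_per_page = 10
membership_tests_per_second = fetches_per_second * links_per_page
print(round(fetches_per_second), round(membership_tests_per_second))
```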
A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache, when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries and show how these caches can be implemented very efficiently.
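Of the four practical techniques, CLOCK is perhaps the least familiar: it approximates LRU with a circular buffer of entries, each carrying a "use" bit that is set on a hit; on a miss, a sweeping hand clears set bits and evicts the first entry whose bit is already clear. The following is a minimal generic sketch, not the implementation evaluated in the paper.

```python
class ClockCache:
    """Minimal CLOCK cache sketch for URL membership testing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []   # list of [key, use_bit]
        self.index = {}   # key -> slot position
        self.hand = 0

    def contains(self, key):
        """Return True on a cache hit; on a miss, insert the key
        (evicting another entry if full) and return False."""
        pos = self.index.get(key)
        if pos is not None:
            self.slots[pos][1] = 1           # hit: set the use bit
            return True
        if len(self.slots) < self.capacity:  # cache still filling up
            self.index[key] = len(self.slots)
            self.slots.append([key, 1])
            return False
        while self.slots[self.hand][1]:      # sweep: give second chances
            self.slots[self.hand][1] = 0
            self.hand = (self.hand + 1) % self.capacity
        old_key = self.slots[self.hand][0]   # first clear bit: evict
        del self.index[old_key]
        self.slots[self.hand] = [key, 1]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False
```

In a crawler's step (c), `contains(url)` would be consulted first; only misses fall through to the (disk-based or remote) full set of seen URLs.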
The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.
2. CRAWLING
Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.
The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.
The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL server. The various processes communicate via the file system.
For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL's host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.
Cho and Garcia-Molina's crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called "C-procs"). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL's host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).
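The URL-based and site-based schemes differ only in what gets hashed. A minimal sketch, assuming a stable hash such as MD5 (the function names are illustrative, not taken from Cho and Garcia-Molina's system):

```python
import hashlib
from urllib.parse import urlsplit

def stable_hash(s):
    """Deterministic hash of a string. Python's built-in hash() is
    salted per process, so a crawler would use a stable hash instead."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def url_based(url, num_procs):
    """Assign an URL to a C-proc by hashing the entire URL."""
    return stable_hash(url) % num_procs

def site_based(url, num_procs):
    """Assign an URL to a C-proc by hashing only the host part, so all
    URLs of one site land on the same process."""
    return stable_hash(urlsplit(url).hostname) % num_procs
```

Under `site_based`, all URLs from one host map to the same C-proc, which keeps per-host politeness control local; `url_based` spreads even a single site across processes.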
The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the "ants"). An ant that discovers an URL for which it is not responsible sends this URL to a dedicated process (the "controller"), which forwards the URL to the appropriate ant.
UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.
Shkapenyuk and Suel's crawler [35] is similar to Google's; the different crawler components are implemented as different processes. A "crawling application" maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a "crawl manager", which forwards them to a pool of "downloader" processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.
Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk and ca