Web Crawlers: Foreign-Literature Translation for a Graduation Thesis
Foreign-Language Material
ABSTRACT
Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
1. INTRODUCTION
A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.
Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is
(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)–(c)
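Steps (a)–(c) can be sketched as a simple loop over a frontier of unfetched URLs. In the sketch below, `fetch` and `extract_urls` are hypothetical caller-supplied stand-ins for an HTTP client and a link parser; they are not part of the paper's algorithm description.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_urls, max_pages=1000):
    """Sketch of the basic crawl loop: (a) fetch, (b) parse,
    (c) enqueue URLs not seen before."""
    seen = set(seed_urls)        # the membership test of step (c)
    frontier = deque(seed_urls)  # URLs discovered but not yet fetched
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        page = fetch(url)                  # step (a)
        pages += 1
        for link in extract_urls(page):    # step (b)
            if link not in seen:           # step (c)
                seen.add(link)
                frontier.append(link)
    return seen
```

For example, with a toy in-memory link graph, `fetch` can be a dictionary lookup and `extract_urls` the identity function.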
Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as , but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.
If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and taught in many introductory algorithms classes. (See for instance [34].)
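The two strategies can be sketched with a single traversal routine in which only the frontier discipline differs: a FIFO queue yields BFS, while a LIFO stack yields DFS. This is a minimal generic sketch (the graph is supplied as a neighbor function; it is not taken from the paper).

```python
from collections import deque

def traverse(start, neighbors, bfs=True):
    """Generic graph traversal. Popping from the front of the frontier
    gives BFS; popping from the back gives DFS."""
    seen = {start}
    frontier = deque([start])
    order = []                    # visit order, for illustration
    while frontier:
        node = frontier.popleft() if bfs else frontier.pop()
        order.append(node)
        for n in neighbors(node):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return order
```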
However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.
1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.
2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].
These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
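The quoted rates follow from simple arithmetic; the links-per-page figure below is an illustrative assumption, not a number given in the paper.

```python
# Back-of-the-envelope check of the crawl-rate figures in the text.
pages_per_day = 100_000_000
seconds_per_day = 24 * 60 * 60            # 86,400
fetches_per_second = pages_per_day / seconds_per_day   # "about 1,000 per second"

# Assuming roughly 10 links per page (an illustrative figure), the
# membership test of step (c) runs about ten times as often.
links_per_page = 10
membership_tests_per_second = fetches_per_second * links_per_page
print(round(fetches_per_second), round(membership_tests_per_second))
```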
A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache, when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries and show how these caches can be implemented very efficiently.
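Of the four practical techniques, CLOCK is perhaps the least familiar: it approximates LRU with a circular buffer of entries, each carrying a "use" bit that is set on a hit; on a miss, a sweeping hand clears set bits and evicts the first entry whose bit is already clear. The following is a minimal generic sketch, not the implementation evaluated in the paper.

```python
class ClockCache:
    """Minimal CLOCK cache sketch for URL membership testing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []   # list of [key, use_bit]
        self.index = {}   # key -> slot position
        self.hand = 0

    def contains(self, key):
        """Return True on a cache hit; on a miss, insert the key
        (evicting another entry if full) and return False."""
        pos = self.index.get(key)
        if pos is not None:
            self.slots[pos][1] = 1           # hit: set the use bit
            return True
        if len(self.slots) < self.capacity:  # cache still filling up
            self.index[key] = len(self.slots)
            self.slots.append([key, 1])
            return False
        while self.slots[self.hand][1]:      # sweep: give second chances
            self.slots[self.hand][1] = 0
            self.hand = (self.hand + 1) % self.capacity
        old_key = self.slots[self.hand][0]   # first clear bit: evict
        del self.index[old_key]
        self.slots[self.hand] = [key, 1]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False
```

In a crawler's step (c), `contains(url)` would be consulted first; only misses fall through to the (disk-based or remote) full set of seen URLs.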
The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.
2. CRAWLING
Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.
The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.
The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL server. The various processes communicate via the file system.
For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL's host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.
Cho and Garcia-Molina's crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called "C-procs"). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL's host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).
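The URL-based and site-based schemes differ only in what gets hashed. A minimal sketch, assuming a stable hash such as MD5 (the function names are illustrative, not taken from Cho and Garcia-Molina's system):

```python
import hashlib
from urllib.parse import urlsplit

def stable_hash(s):
    """Deterministic hash of a string. Python's built-in hash() is
    salted per process, so a crawler would use a stable hash instead."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def url_based(url, num_procs):
    """Assign an URL to a C-proc by hashing the entire URL."""
    return stable_hash(url) % num_procs

def site_based(url, num_procs):
    """Assign an URL to a C-proc by hashing only the host part, so all
    URLs of one site land on the same process."""
    return stable_hash(urlsplit(url).hostname) % num_procs
```

Under `site_based`, all URLs from one host map to the same C-proc, which keeps per-host politeness control local; `url_based` spreads even a single site across processes.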
The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the "ants"). An ant that discovers an URL for which it is not responsible sends this URL to a dedicated process (the "controller"), which forwards the URL to the appropriate ant.
UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.
Shkapenyuk and Suel's crawler [35] is similar to Google's; the different crawler components are implemented as different processes. A "crawling application" maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a "crawl manager", which forwards them to a pool of "downloader" processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.
Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk and ca