Web Crawler - Graduation Thesis Foreign Literature Translation


Foreign Language Material

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is

(a) Fetch a page

(b) Parse it to extract all linked URLs

(c) For all the URLs not seen before, repeat (a)–(c)
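The three steps above can be sketched as a simple breadth-first crawl loop. This is an illustrative outline only, not the crawler the paper describes: the `LinkParser` helper, the BFS ordering, and the `max_pages` limit are assumptions made for the sake of the example.

```python
# A minimal sketch of steps (a)-(c); illustrative, not the paper's crawler.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href attributes from anchor tags (step (b))."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    seen = set(seeds)        # the "seen" set queried in step (c)
    frontier = deque(seeds)  # URLs awaiting download, in BFS order
    while frontier and max_pages > 0:
        url = frontier.popleft()
        max_pages -= 1
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")  # (a)
        except OSError:
            continue                         # skip unreachable pages
        parser = LinkParser()
        parser.feed(html)                    # (b) extract linked URLs
        for link in parser.links:
            absolute = urljoin(url, link)    # resolve relative URLs
            if absolute not in seen:         # (c) membership test
                seen.add(absolute)
                frontier.append(absolute)
    return seen
```

At web scale, the in-memory `set` above is exactly what breaks down; the rest of the paper is about caching a subset of it.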

Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as  , but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS) – they are easy to implement and taught in many introductory algorithms classes. (See for instance [34].)

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
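The rates quoted above follow directly from the 100-million-pages-per-day figure. A quick back-of-the-envelope check, where the average of 10 links per page is an assumed round number (the text itself only implies roughly an order of magnitude more membership tests than fetches):

```python
# Back-of-the-envelope check of the crawl rates cited in the text.
pages_per_day = 100_000_000
seconds_per_day = 24 * 60 * 60          # 86,400

fetches_per_second = pages_per_day / seconds_per_day
# ~1,157 fetches/sec, i.e. step (a) runs "about 1,000 times per second"

links_per_page = 10                      # assumed average out-degree
membership_tests_per_second = fetches_per_second * links_per_page
# ~11,574 tests/sec, "well over ten thousand times per second"
```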

A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache, when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries, and show how these caches can be implemented very efficiently.
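As an illustration of one of the practical policies named above, an LRU cache over the "seen" URLs might look like the following minimal sketch. The interface and eviction bookkeeping here are assumptions for the example, not the paper's implementation.

```python
# A minimal LRU "seen URL" cache; illustrative sketch only.
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity set of URLs with least-recently-used eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as recency order

    def seen(self, url):
        """Return True on a cache hit. On a miss, insert the URL,
        evicting the least recently used entry if the cache is full."""
        if url in self.entries:
            self.entries.move_to_end(url)     # mark as most recently used
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry
        self.entries[url] = True
        return False
```

In a real crawler, a miss here would still have to be checked against the full seen set on disk (or at a peer node); the point of the cache is that, at sizes around 50,000 entries, almost 80% of queries never reach that slow path.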

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL server. The various processes communicate via the file system.

For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL's host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.

Cho and Garcia-Molina's crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called "C-procs"). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL's host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).
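The site-based scheme (and Mercator's host-hash assignment) can be sketched in a few lines. The choice of MD5 as the hash and the process count are illustrative assumptions; the sources describe only "a hash of the URL's host part", not a specific function.

```python
# Illustrative site-based URL partitioning: hash the host component
# to pick the responsible crawler process. Hash choice is an assumption.
import hashlib
from urllib.parse import urlparse

def assign_process(url, num_processes):
    """Return the index of the crawler process responsible for this URL."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_processes
```

Because every URL on a given host maps to the same process, the membership test for that host's URLs can be answered locally; only URLs belonging to other hosts must be forwarded to a peer.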

The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the "ants"). An ant that discovers an URL for which it is not responsible sends this URL to a dedicated process (the "controller"), which forwards the URL to the appropriate ant.

UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.

Shkapenyuk and Suel's crawler [35] is similar to Google's; the different crawler components are implemented as different processes. A "crawling application" maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a "crawl manager", which forwards them to a pool of "downloader" processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk and ca
