Hadoop Distributed Storage Platform: Foreign Literature Translation (hadoop分布式存储平台外文文献翻译.docx)

Uploaded by: b****2 | Document ID: 17526395 | Uploaded: 2023-07-26 | Format: DOCX | Pages: 20 | Size: 30.81 KB
(Includes the English original and its Chinese translation)

Source:

Borthakur D. The Hadoop Distributed File System: Architecture and Design [J]. Hadoop Project Website, 2007, 11(11): 1-10.

English original

Hadoop Distributed File System: Architecture and Design

Dhruba Borthakur

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. It has a lot in common with existing distributed file systems, yet it also differs from them significantly. HDFS is highly fault-tolerant and is designed for deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets. HDFS relaxes some POSIX constraints to enable streaming access to file system data. HDFS was originally developed as infrastructure for the Apache Nutch search engine project and is part of the Apache Hadoop Core project.

Assumptions and design goals

Hardware failure

Hardware failure is the norm, not the exception. An HDFS instance may consist of hundreds of servers, each of which stores part of the file system's data. In reality, the number of components that make up the system is huge, and any component can fail, which means that some portion of HDFS components is always out of service. Therefore, fault detection and rapid, automatic recovery are core architectural goals of HDFS.

Streaming data access

Applications running on HDFS differ from ordinary applications in that they need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use. The emphasis is on high throughput of data access rather than low latency. Many of the hard requirements imposed by POSIX are not needed by HDFS applications, so POSIX semantics have been relaxed in a few areas to increase data throughput.

Large data sets

Applications running on HDFS have large data sets. A typical file on HDFS is gigabytes to terabytes in size. HDFS is therefore tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.

Simple consistency model

HDFS applications need a "write once, read many" access model for files. Once a file has been created, written, and closed, it does not need to be changed. This assumption simplifies data consistency issues and makes high-throughput data access possible. MapReduce applications and web crawlers fit this model well. There are plans to extend the model in the future so that it supports appending writes to files.

"Moving computation is cheaper than moving data"

A computation requested by an application is more efficient when it executes close to the data it operates on, especially when the data set is massive, because this reduces network congestion and increases the overall throughput of the system. Moving the computation closer to the data is clearly better than moving the data to the application. HDFS provides interfaces that let applications move themselves closer to where the data is located.
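The idea of moving computation to the data can be illustrated with a small scheduling sketch. It is not HDFS code: the function `pick_worker`, the block IDs, and the node names are hypothetical, and the sketch only assumes that the scheduler knows which nodes hold a copy of each block.

```python
from typing import Dict, List, Optional


def pick_worker(block_locations: Dict[str, List[str]],
                block_id: str,
                idle_workers: List[str]) -> Optional[str]:
    """Prefer an idle worker that already stores the block.

    Falls back to the first idle worker (the data then has to travel
    over the network), or None if no worker is idle.
    """
    holders = set(block_locations.get(block_id, []))
    for worker in idle_workers:
        if worker in holders:
            return worker  # the computation moves to the data
    return idle_workers[0] if idle_workers else None


# Hypothetical block-to-node mapping, for illustration only.
locations = {"blk_1": ["node-a", "node-c"], "blk_2": ["node-b"]}

print(pick_worker(locations, "blk_1", ["node-b", "node-c"]))  # node-c holds blk_1
print(pick_worker(locations, "blk_2", ["node-a", "node-c"]))  # no holder idle: falls back to node-a
```

A real scheduler would also weigh load and rack distance; the sketch keeps only the locality preference described above.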

Portability across heterogeneous hardware and software platforms

HDFS was designed with portability across platforms in mind. This facilitates the adoption of HDFS as a platform for large-scale data applications.

Namenode and Datanodes

HDFS uses a master/slave architecture. An HDFS cluster consists of a single Namenode and a number of Datanodes. The Namenode is a central server that manages the file system namespace and regulates client access to files. There is generally one Datanode per node in the cluster, and it manages the storage attached to the node it runs on. HDFS exposes a file system namespace and lets users store data in files. Internally, a file is split into one or more data blocks, which are stored on a set of Datanodes. The Namenode performs namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of data blocks to Datanodes. The Datanodes serve read and write requests from file system clients, and they create, delete, and replicate data blocks under the direction of the Namenode.
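The two kinds of metadata described above (the namespace, and the mapping of blocks to Datanodes) can be pictured with a toy model. This is a minimal sketch under stated assumptions, not the HDFS implementation: the class `ToyNamenode`, its method names, and the `blk_N` identifiers are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyNamenode:
    # Namespace: file path -> ordered list of block IDs.
    namespace: Dict[str, List[str]] = field(default_factory=dict)
    # Block map: block ID -> Datanodes that hold a copy.
    block_map: Dict[str, List[str]] = field(default_factory=dict)
    next_block: int = 0

    def create(self, path: str, placements: List[List[str]]) -> None:
        """Register a file; `placements` lists the Datanodes for each block."""
        block_ids = []
        for datanodes in placements:
            block_id = f"blk_{self.next_block}"
            self.next_block += 1
            self.block_map[block_id] = list(datanodes)
            block_ids.append(block_id)
        self.namespace[path] = block_ids

    def rename(self, old: str, new: str) -> None:
        """A namespace-only operation: no block data is touched."""
        self.namespace[new] = self.namespace.pop(old)

    def locate(self, path: str) -> List[Tuple[str, List[str]]]:
        """Tell a client which Datanodes to contact for each block."""
        return [(b, self.block_map[b]) for b in self.namespace[path]]


nn = ToyNamenode()
nn.create("/logs/a", [["dn1", "dn2"], ["dn2", "dn3"]])
nn.rename("/logs/a", "/logs/b")
print(nn.locate("/logs/b"))
```

Note that `locate` returns only locations. The client would then read the block contents directly from the Datanodes, so file data never passes through this object.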

The Namenode and the Datanodes are designed to run on commodity machines. These machines typically run the GNU/Linux operating system (OS). HDFS is written in Java, so any machine that supports Java can run a Namenode or a Datanode. Because Java is highly portable, HDFS can be deployed on a wide range of machines. In a typical deployment, one machine runs only the Namenode, and each of the other machines in the cluster runs one Datanode instance. The architecture does not rule out running multiple Datanodes on a single machine, but this is relatively rare.

Having a single Namenode in a cluster greatly simplifies the architecture of the system. The Namenode is the arbiter and repository of all HDFS metadata, and the system is designed so that user data never flows through the Namenode.

File system namespace

HDFS supports a traditional hierarchical file organization. Users and applications can create directories and store files in them. The namespace hierarchy is similar to that of most existing file systems: users can create, delete, move, or rename files. Currently, HDFS does not support user disk quotas or access control, nor does it support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

The Namenode maintains the file system namespace, and it records any change to the namespace or its properties. An application can specify how many copies of a file HDFS should keep. This number is called the replication factor of the file, and it is also stored by the Namenode.

Data replication

HDFS is designed to reliably store very large files across the machines of a large cluster. It stores each file as a sequence of data blocks; all blocks except the last one are the same size. For fault tolerance, the blocks of a file are replicated. The block size and the replication factor are configurable per file. An application can specify the number of replicas of a file, either when the file is created or later. Files in HDFS are write-once, and there is strictly one writer at any time.
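The block layout described above can be worked through as a short calculation. The sketch below is illustrative only: the function name `split_into_blocks` is hypothetical, and the 64 MB block size and factor of 3 copies are example values chosen for the arithmetic, since both are configurable per file.

```python
from typing import List

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the sizes of the blocks a file is split into.

    Every block has `block_size` bytes except possibly the last one,
    which holds whatever remains.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# Example values (assumptions for illustration): a 200 MB file,
# 64 MB blocks, and 3 copies of every block.
sizes = split_into_blocks(200 * MB, 64 * MB)
copies = 3
raw_bytes = sum(sizes) * copies

print([s // MB for s in sizes])  # block sizes in MB
print(raw_bytes // MB)           # raw storage consumed across the cluster, in MB
```

With these example values the file occupies three full blocks plus one 8 MB block, and the cluster stores 600 MB of raw data for it.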

The Namenode makes all decisions about the replication of data blocks. It periodically receives a heartbeat signal and a block status report from each Datanode in the cluster. Receiving a heartbeat means that the Datanode is working properly. A block status report contains a list of all data blocks on that Datanode.
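How heartbeats and block reports combine into a view of cluster health can be sketched as follows. This is a simplified model under stated assumptions, not HDFS code: the class `HeartbeatMonitor`, its methods, and the 30-second timeout are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from typing import Dict, Iterable, List, Set


class HeartbeatMonitor:
    """Tracks which Datanodes are alive and which blocks they report."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout = timeout_seconds
        self.last_seen: Dict[str, float] = {}
        self.blocks: Dict[str, Set[str]] = {}

    def heartbeat(self, datanode: str, now: float) -> None:
        self.last_seen[datanode] = now

    def block_report(self, datanode: str, block_ids: Iterable[str]) -> None:
        # A report replaces the previous list for that Datanode.
        self.blocks[datanode] = set(block_ids)

    def live_datanodes(self, now: float) -> List[str]:
        """Datanodes whose last heartbeat is within the timeout."""
        return sorted(d for d, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def replica_count(self, block_id: str, now: float) -> int:
        """Number of live Datanodes that reported holding the block."""
        live = set(self.live_datanodes(now))
        return sum(1 for d, ids in self.blocks.items()
                   if d in live and block_id in ids)


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.block_report("dn1", ["b1"])
monitor.block_report("dn2", ["b1", "b2"])
monitor.heartbeat("dn1", now=40)  # dn2 stays silent

print(monitor.live_datanodes(now=50))
print(monitor.replica_count("b1", now=50))
```

A block whose live replica count falls below its target number of copies is a candidate for re-replication, which is the decision the Namenode makes from this information.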

Replica placement: the first steps

The placement of replicas is critical to HDFS reliability and performance. An optimized replica placement policy is an important feature that distinguishes HDFS from most other distributed file systems. It requires a lot of tuning and accumulated experience. HDFS uses a strategy called rack awareness to improve data reliability, availability, and network bandwidth utilization. The current replica placement policy is only a first step in this direction. The short-term goals of implementing it are to validate it in production, to observe its behavior, and to lay the foundation for testing and researching more sophisticated policies.

Large HDFS instances typically run on clusters of computers that span multiple racks. Communication between two machines on different racks has to go through switches. In most cases, the bandwidth between two machines in the same rack is greater than the bandwidth between two machines in different racks.

Through a rack-aware process, the Namenode determines the ID of the rack each Datanode belongs to. A simple but non-optimal policy is to place the replicas on different racks. This prevents data loss when an entire rack fails and allows the bandwidth of multiple racks to be used when reading data. It also distributes replicas evenly across the cluster, which makes it easier to balance load when a component fails. However, because a write under this policy has to transfer blocks to multiple racks, it increases the cost of writes.

In most cases the replication factor is 3. The HDFS placement policy is then to put one replica on a node in the local rack, another on a different node in the same rack, and the last on a node in a different rack. This policy cuts inter-rack traffic, which improves write efficiency. Rack failures are far less common than node failures, so the policy does not affect data reliability or availability. At the same time, because the blocks are placed on only two different racks rather than three, it reduces the aggregate network bandwidth used when reading data. Under this policy, replicas are not evenly distributed across racks: one third of the replicas are on one node, two thirds are on one rack, and the remaining third is evenly distributed across the other racks. The policy improves write performance without compromising data reliability or read performance.
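The three-replica policy described above can be sketched in a few lines. This is a minimal illustration, not the HDFS placement code: the function `place_replicas`, the node names, and the rack IDs are hypothetical, the writer is assumed to run on a Datanode, and candidates are picked in sorted order only to keep the example deterministic (a real system would also consider load and free space).

```python
from typing import Dict, List


def place_replicas(writer: str, topology: Dict[str, str]) -> List[str]:
    """Choose three nodes for a block: the writer's node, another node
    on the same rack, and a node on a different rack.

    `topology` maps each node name to its rack ID.
    """
    local_rack = topology[writer]
    same_rack = sorted(n for n, r in topology.items()
                       if r == local_rack and n != writer)
    other_rack = sorted(n for n, r in topology.items() if r != local_rack)
    if not same_rack or not other_rack:
        raise ValueError("need a second node on the local rack "
                         "and at least one node on another rack")
    return [writer, same_rack[0], other_rack[0]]


# Hypothetical two-rack cluster.
topology = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
chosen = place_replicas("n1", topology)
racks_used = {topology[n] for n in chosen}

print(chosen)
print(len(racks_used))  # replicas span two racks, not three
```

Only one of the three transfers crosses a rack boundary, which is where the write-cost saving over the place-on-different-racks policy comes from.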

The default replica placement policy described here is currently a work in progress.

Replica selection

In order to reduc
