Hadoop Distributed Storage Platform: Translation of Foreign-Language Literature
(Includes: English original and Chinese translation)
Source:
Borthakur D. The Hadoop Distributed File System: Architecture and Design [J]. Hadoop Project Website, 2007, 11(11): 1-10.
English original
Hadoop Distributed File System: Architecture and Design
Dhruba Borthakur
Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. It has much in common with existing distributed file systems, but it also differs from them in significant ways. HDFS is highly fault-tolerant and is suitable for deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets. HDFS relaxes some POSIX requirements to enable streaming access to file system data. HDFS was originally developed as infrastructure for the Apache Nutch search engine project and is part of the Apache Hadoop Core project.
Assumptions and design goals
Hardware failure
Hardware failure is the norm, not the exception. An HDFS instance may consist of hundreds of servers, each storing part of the file system's data. Because the number of components is huge and any component can fail, some part of HDFS is always non-functional. Therefore, fault detection and quick, automatic recovery are core architectural goals of HDFS.
Streaming data access
Applications running on HDFS differ from general-purpose applications in that they need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use. High throughput of data access matters more than low latency. POSIX imposes many hard requirements that HDFS applications do not need, so some POSIX semantics have been changed to improve data throughput.
Large data sets
Applications running on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.
Simple consistency model
HDFS applications need a "write once, read many" access model for files. A file, once created, written, and closed, need not be changed. This assumption simplifies data consistency issues and makes high-throughput data access possible. MapReduce applications and web crawlers fit this model well. There are plans to extend the model in the future to support appending writes to files.
"Moving computation is cheaper than moving data"
A computation requested by an application is more efficient when it executes close to the data it operates on, especially when the data set is very large, because this reduces network congestion and increases overall system throughput. It is often better to move the computation closer to the data than to move the data to where the application is running. HDFS provides interfaces that let applications move themselves closer to where the data is located.
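The idea of moving computation to the data can be sketched as a small scheduling rule. This is an illustration only, not an HDFS interface: the function `pick_node_for_task`, the `block_locations` table, and the node names are all hypothetical.

```python
def pick_node_for_task(block_id, block_locations, free_nodes):
    """Prefer a free node that already stores the block (hypothetical helper)."""
    if not free_nodes:
        raise ValueError("no free nodes available")
    for node in block_locations.get(block_id, []):
        if node in free_nodes:
            # Local execution: the computation moves to the data.
            return node, True
    # No holder is free, so the chosen node must fetch the block over the network.
    return sorted(free_nodes)[0], False


# Assumed example data: which nodes hold a copy of each block.
block_locations = {
    "blk_1": ["dn1", "dn2", "dn4"],
    "blk_2": ["dn2", "dn3", "dn5"],
}

local_1 = pick_node_for_task("blk_1", block_locations, {"dn3", "dn4"})
local_2 = pick_node_for_task("blk_2", block_locations, {"dn3", "dn4"})
remote = pick_node_for_task("blk_1", block_locations, {"dn5"})
```

The boolean in each result records whether the task runs on a node that already holds the block. Only the last assignment needs a network transfer of block data.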
Portability across heterogeneous hardware and software platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates the adoption of HDFS as a platform for large-scale data applications.
Namenode and Datanodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single Namenode and a number of Datanodes. The Namenode is a central server that manages the file system namespace and regulates client access to files. There is usually one Datanode per node in the cluster, and it manages the storage attached to the node it runs on. HDFS exposes a file system namespace and lets users store data in files. Internally, a file is split into one or more blocks, which are stored on a set of Datanodes. The Namenode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to Datanodes. The Datanodes serve read and write requests from file system clients, and they create, delete, and replicate blocks on instruction from the Namenode.
The Namenode and Datanode are software designed to run on commodity machines. These machines typically run the GNU/Linux operating system (OS). HDFS is written in Java, so any machine that supports Java can run the Namenode or a Datanode. Because Java is highly portable, HDFS can be deployed on a wide range of machines. In a typical deployment, one dedicated machine runs only the Namenode, and each of the other machines in the cluster runs one Datanode instance. The architecture does not preclude running multiple Datanodes on the same machine, but this is rare in practice.
Having a single Namenode in a cluster greatly simplifies the architecture of the system. The Namenode is the arbitrator and repository for all HDFS metadata, and the system is designed so that user data never flows through the Namenode.
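The split between metadata and data can be illustrated with a toy in-memory model. The classes below are hypothetical and do not reflect the real HDFS classes or wire protocol; they only show that the Namenode answers "where are the blocks?" while the bytes themselves come from Datanodes.

```python
class Datanode:
    """Toy Datanode: stores block bytes keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class Namenode:
    """Toy Namenode: holds metadata only, never block contents."""

    def __init__(self):
        self.file_blocks = {}   # path -> ordered list of block ids
        self.block_map = {}     # block id -> names of Datanodes holding it

    def add_block(self, path, block_id, datanode_names):
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_map[block_id] = list(datanode_names)

    def get_block_locations(self, path):
        return [(b, self.block_map[b]) for b in self.file_blocks[path]]


def read_file(namenode, datanodes, path):
    data = b""
    # Step 1: ask the Namenode for metadata (block ids and their locations).
    for block_id, holders in namenode.get_block_locations(path):
        # Step 2: fetch the bytes directly from a Datanode holding the block.
        data += datanodes[holders[0]].read_block(block_id)
    return data


datanodes = {n: Datanode(n) for n in ("dn1", "dn2", "dn3")}
namenode = Namenode()
path = "/user/demo/greeting.txt"  # assumed example path

# Store two blocks, each on two Datanodes, and register them with the Namenode.
for block_id, payload, holders in [
    ("blk_1", b"hello ", ["dn1", "dn2"]),
    ("blk_2", b"hdfs", ["dn2", "dn3"]),
]:
    for holder in holders:
        datanodes[holder].write_block(block_id, payload)
    namenode.add_block(path, block_id, holders)

result = read_file(namenode, datanodes, path)
```

In this sketch the `Namenode` object never stores or returns file contents, which mirrors the statement that user data does not pass through it.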
The file system namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files in them. The file system namespace hierarchy is similar to that of most existing file systems: users can create, delete, move, or rename files. Currently, HDFS does not support user disk quotas or access control, nor does it support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
The Namenode maintains the file system namespace and records any change to the namespace or its properties. An application can specify the number of replicas of a file that HDFS should maintain. This number is called the replication factor of the file, and it is also stored by the Namenode.
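As a rough illustration of namespace bookkeeping, the sketch below keeps a per-file replication factor (the number of copies) and appends every change to a log. The `Namespace` class and its method names are hypothetical, the namespace is flattened to plain paths for brevity, and the default of 3 is an assumption for the example.

```python
class Namespace:
    """Toy namespace: path -> replication factor, plus a record of every change."""

    def __init__(self, default_replication=3):
        self.default_replication = default_replication
        self.files = {}
        self.log = []

    def create(self, path, replication=None):
        if replication is None:
            replication = self.default_replication
        self.files[path] = replication
        self.log.append(("create", path, replication))

    def rename(self, old, new):
        self.files[new] = self.files.pop(old)
        self.log.append(("rename", old, new))

    def set_replication(self, path, replication):
        self.files[path] = replication
        self.log.append(("set_replication", path, replication))

    def delete(self, path):
        del self.files[path]
        self.log.append(("delete", path))


ns = Namespace()
ns.create("/data/a.txt")
ns.create("/data/b.txt", replication=2)
ns.rename("/data/a.txt", "/data/c.txt")
ns.set_replication("/data/c.txt", 5)
ns.delete("/data/b.txt")
```

After these calls the namespace holds one file, `/data/c.txt`, with a replication factor of 5, and the log contains one entry per operation.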
Data replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be set at file creation time and changed later. Files in HDFS are write-once and have strictly one writer at any time.
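The block layout described above is simple arithmetic, sketched here. The helper `split_into_blocks` is hypothetical, and the 64 MB block size and replication factor of 3 are assumed values chosen only for the example.

```python
def split_into_blocks(file_size, block_size):
    """Return the block lengths for a file: equal-sized blocks, last one may be shorter."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        lengths.append(remainder)
    return lengths


MB = 1024 * 1024
block_size = 64 * MB      # assumed block size
replication = 3           # assumed replication factor

lengths = split_into_blocks(200 * MB, block_size)
# Every block is stored `replication` times, so raw storage is a multiple of file size.
raw_bytes_stored = sum(lengths) * replication
```

For a 200 MB file this gives three 64 MB blocks and one 8 MB block, and 600 MB of raw storage across the cluster.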
The Namenode makes all decisions about the replication of blocks. It periodically receives a heartbeat and a block report from each Datanode in the cluster. Receipt of a heartbeat implies that the Datanode is functioning properly. A block report contains a list of all blocks on that Datanode.
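A minimal sketch of how heartbeats and block reports could be tracked follows. The `HeartbeatMonitor` class is hypothetical, the 30-second timeout is an assumed value, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Toy tracker for Datanode liveness and reported blocks."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}       # Datanode name -> time of last heartbeat
        self.block_reports = {}   # Datanode name -> set of block ids it reported

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def block_report(self, datanode, block_ids):
        self.block_reports[datanode] = set(block_ids)

    def live_datanodes(self, now):
        # A Datanode counts as live if its last heartbeat is recent enough.
        return {d for d, t in self.last_seen.items() if now - t <= self.timeout}

    def replica_count(self, block_id, now):
        # Count only replicas reported by Datanodes that are still live.
        live = self.live_datanodes(now)
        return sum(
            1 for d, blocks in self.block_reports.items()
            if d in live and block_id in blocks
        )


monitor = HeartbeatMonitor(timeout_seconds=30)  # assumed timeout
reports = {"dn1": ["blk_1", "blk_2"], "dn2": ["blk_1"], "dn3": ["blk_1", "blk_2"]}
for name, blocks in reports.items():
    monitor.heartbeat(name, now=0)
    monitor.block_report(name, blocks)

# dn1 and dn2 keep sending heartbeats; dn3 goes silent.
monitor.heartbeat("dn1", now=40)
monitor.heartbeat("dn2", now=40)

live = monitor.live_datanodes(now=45)
blk_1_replicas = monitor.replica_count("blk_1", now=45)
blk_2_replicas = monitor.replica_count("blk_2", now=45)
```

Once `dn3` stops sending heartbeats, the replicas it reported no longer count. A Namenode could compare such counts with each file's replication factor when deciding which blocks need attention.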
Replica placement: the first step
The placement of replicas is critical to HDFS reliability and performance. Optimized replica placement distinguishes HDFS from most other distributed file systems. It is a feature that needs a lot of tuning and experience. HDFS uses a rack-aware replica placement policy to improve data reliability, availability, and network bandwidth utilization. The current replica placement policy is only a first step in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation for testing and researching more sophisticated policies.
Large HDFS instances typically run on clusters of computers that span multiple racks. Communication between two machines in different racks has to go through switches. In most cases, the bandwidth between two machines in the same rack is greater than the bandwidth between two machines in different racks.
Through a rack-awareness process, the Namenode determines the ID of the rack each Datanode belongs to. A simple but non-optimal policy is to place replicas on different racks. This prevents data loss when an entire rack fails and allows reads to use the bandwidth of multiple racks. It also distributes replicas evenly in the cluster, which makes it easy to balance load on component failure. However, this policy increases the cost of writes, because a write has to transfer blocks to multiple racks.
In the common case, when the replication factor is 3, HDFS places one replica on a node in the local rack, another on a different node in the same rack, and the last on a node in a different rack. This policy cuts inter-rack write traffic, which improves write performance. The chance of rack failure is far less than that of node failure, so this policy does not affect data reliability and availability. It also reduces the aggregate network bandwidth used when reading data, because a block is placed on only two racks rather than three. With this policy, the replicas of a file are not evenly distributed across racks: one third of the replicas are on one node, two thirds are on one rack, and the other third are evenly distributed across the remaining racks. The policy improves write performance without compromising data reliability or read performance.
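The three-replica placement rule can be sketched as follows. This is a simplified illustration, not HDFS code: the function names and topology are hypothetical, the first replica is assumed to go on the writer's own node, and the other nodes are picked deterministically (first available) rather than by any real selection logic.

```python
def rack_index(topology):
    """Map each Datanode name to its rack id."""
    return {node: rack for rack, nodes in topology.items() for node in nodes}


def place_replicas(writer_node, topology):
    """Pick 3 nodes: local node, another node in the same rack, a node in another rack."""
    local_rack = rack_index(topology)[writer_node]
    same_rack = [n for n in topology[local_rack] if n != writer_node]
    other_racks = [
        n for rack, nodes in topology.items() if rack != local_rack for n in nodes
    ]
    if not same_rack or not other_racks:
        raise ValueError("need a second node in the local rack and a node in another rack")
    return [writer_node, same_rack[0], other_racks[0]]


def cross_rack_copies(replicas, writer_node, topology):
    """Count replicas that must be sent to a rack other than the writer's."""
    racks = rack_index(topology)
    return sum(1 for n in replicas if racks[n] != racks[writer_node])


# Assumed example topology: rack id -> Datanodes in that rack.
topology = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5"],
    "rack3": ["dn6"],
}

replicas = place_replicas("dn2", topology)
racks_used = {rack_index(topology)[n] for n in replicas}

# For comparison: the simple policy that puts every replica on a different rack.
one_per_rack = ["dn2", "dn4", "dn6"]
policy_cost = cross_rack_copies(replicas, "dn2", topology)
simple_cost = cross_rack_copies(one_per_rack, "dn2", topology)
```

With this topology the rule uses two racks and sends one copy across racks, whereas the one-replica-per-rack alternative sends two. That difference is the write-cost saving the text describes.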
The default replica placement policy described here is currently a work in progress.
Replica selection
In order to reduc