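Hadoop Distributed Storage Platform: Foreign Literature Translation

(Includes the English original and the Chinese translation)

Source:

Borthakur D. The Hadoop Distributed File System: Architecture and Design [J]. Hadoop Project Website, 2007, 11(11): 1-10.

English original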

Hadoop Distributed File System: Architecture and Design

Dhruba Borthakur

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. It has a lot in common with existing distributed file systems, but it also differs from them significantly. HDFS is highly fault-tolerant and is designed for deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets. HDFS relaxes some POSIX requirements to enable streaming access to file system data. HDFS was originally developed as infrastructure for the Apache Nutch search engine project and is part of the Apache Hadoop Core project.

Assumptions and design goals

Hardware failure

Hardware failures are the norm, not the exception. An HDFS instance may consist of hundreds of servers, each storing part of the file system's data. Because the system is made up of a huge number of components and any component can fail, some portion of HDFS is always non-functional. Detecting faults and recovering from them quickly and automatically is therefore a core architectural goal of HDFS.

Streaming data access

Applications that run on HDFS differ from general-purpose applications in that they need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use. The emphasis is on high throughput of data access rather than low latency. Many of the hard requirements imposed by POSIX are not needed by applications that target HDFS, so POSIX semantics have been relaxed in a few key areas to increase data throughput.

Large data sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.

Simple consistency model

HDFS applications need a "write once, read many" access model for files. A file, once created, written, and closed, need not be changed. This assumption simplifies data consistency issues and makes high-throughput data access possible. MapReduce applications and web crawler applications fit this model well. There are plans to extend the model in the future to support appending writes to files.

"Moving computation is cheaper than moving data"

A computation requested by an application is much more efficient when it runs near the data it operates on, especially when the data set is huge, because this reduces network congestion and increases overall system throughput. It is therefore often better to move the computation closer to the data than to move the data to where the application is running. HDFS provides interfaces that let applications move themselves closer to where the data is located.
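The idea of running a computation near its data can be illustrated with a small scheduling sketch. This is not an HDFS interface: the function name pick_worker, its arguments, and the node names are hypothetical, and a real scheduler would weigh many more factors.

```python
def pick_worker(block_locations, free_workers):
    """Choose a worker for a task that reads one block.

    block_locations: set of node names that hold a replica of the block.
    free_workers: list of node names that can currently accept a task.
    Returns (worker, is_local); worker is None if nobody is free.
    """
    # Prefer a free worker that already stores the block, so the
    # computation moves to the data and nothing crosses the network.
    for worker in free_workers:
        if worker in block_locations:
            return worker, True
    # Otherwise fall back to any free worker; the block must be fetched.
    if free_workers:
        return free_workers[0], False
    return None, False
```

For example, with replicas on "n2" and "n3" and free workers "n1" and "n2", the sketch picks "n2" and reports the read as local.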

Portability across heterogeneous hardware and software platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates its adoption as a platform for large-scale data applications.

Namenode and Datanodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single Namenode and a number of Datanodes. The Namenode is a central server that manages the file system namespace and regulates client access to files. The Datanodes, usually one per node in the cluster, manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of Datanodes. The Namenode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to Datanodes. The Datanodes serve read and write requests from file system clients. They also create, delete, and replicate blocks on instruction from the Namenode.
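The division of labor described above can be pictured as two lookup tables held by the Namenode: one from file paths to block IDs, and one from block IDs to the Datanodes that store them. The sketch below is a toy model under that assumption; the class ToyNamenode and its method names are hypothetical and do not correspond to Hadoop classes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNamenode:
    # Namespace: file path -> ordered list of block IDs.
    files: dict = field(default_factory=dict)
    # Block map: block ID -> set of Datanode names holding a replica.
    block_locations: dict = field(default_factory=dict)

    def create(self, path, block_ids):
        """Register a file and the blocks it is made of."""
        self.files[path] = list(block_ids)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())

    def add_replica(self, block_id, datanode):
        """Record that a Datanode stores a replica of a block."""
        self.block_locations[block_id].add(datanode)

    def rename(self, old_path, new_path):
        """A namespace operation: only metadata changes, no data moves."""
        self.files[new_path] = self.files.pop(old_path)

    def locate(self, path):
        """Tell a client which Datanodes to contact for each block."""
        return [(b, sorted(self.block_locations[b])) for b in self.files[path]]
```

Note that the model holds only metadata. File contents would be read from and written to the Datanodes returned by locate, never through the Namenode object itself.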

The Namenode and Datanode are software designed to run on commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is built in Java, so any machine that supports Java can run the Namenode or a Datanode, and the portability of Java means HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the Namenode, while each of the other machines in the cluster runs one Datanode instance. The architecture does not preclude running multiple Datanodes on the same machine, but this is rare in practice.

Having a single Namenode in a cluster greatly simplifies the architecture of the system. The Namenode is the arbitrator and repository for all HDFS metadata, and the system is designed so that user data never flows through the Namenode.

File System Namespace

HDFS supports a traditional hierarchical file organization. Users or applications can create directories and store files inside them. The file system namespace hierarchy is similar to that of most existing file systems: users can create, delete, move, or rename files. Currently, HDFS does not support user disk quotas or access control, nor does it support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

The Namenode maintains the file system namespace and records any change to the namespace or its properties. An application can specify the number of copies of a file that HDFS should maintain. This number is called the replication factor of the file, and it is also stored by the Namenode.

Data replication

HDFS is designed to reliably store very large files across the machines of a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be set when the file is created and changed later. Files in HDFS are write-once and have strictly one writer at any time.
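The block layout rule (equal-sized blocks except possibly the last) is plain integer arithmetic, sketched below. The function name split_into_blocks is hypothetical, and the 64-unit block size and replication factor of 3 in the usage note are illustrative assumptions, since both values are configurable per file.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size would occupy.

    Every block has block_size except the last one, which holds the
    remainder. Sizes are in whatever unit the caller uses (e.g. MB).
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes
```

Under these assumptions, a 200 MB file with 64 MB blocks yields four blocks of 64, 64, 64, and 8 MB. With a replication factor of 3, the cluster stores 3 × 200 = 600 MB of raw data for that file.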

The Namenode makes all decisions about block replication. It periodically receives a Heartbeat and a Blockreport from each Datanode in the cluster. Receiving a Heartbeat means that the Datanode is working properly. A Blockreport contains a list of all blocks on that Datanode.
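A minimal sketch of how heartbeats and block reports could be tracked follows. The class HeartbeatMonitor, its method names, and the timeout value are hypothetical; the text does not state how long the Namenode waits before treating a Datanode as failed. Time is passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Toy bookkeeping for heartbeats and block reports."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds  # assumed value, not from the text
        self.last_seen = {}             # Datanode -> time of last heartbeat
        self.blocks = {}                # Datanode -> set of reported block IDs

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def block_report(self, datanode, block_ids):
        # A report lists every block on the Datanode, so it replaces
        # whatever was recorded before.
        self.blocks[datanode] = set(block_ids)

    def live_datanodes(self, now):
        """Datanodes whose last heartbeat is within the timeout."""
        return sorted(
            d for d, seen in self.last_seen.items()
            if now - seen <= self.timeout
        )

    def replica_count(self, block_id, now):
        """Number of replicas of a block held by live Datanodes."""
        live = set(self.live_datanodes(now))
        return sum(
            1 for d, ids in self.blocks.items()
            if d in live and block_id in ids
        )
```

A replica count that falls below the file's replication factor is the kind of signal the Namenode could use to schedule re-replication.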

Replica placement: the first step

The placement of replicas is critical to HDFS reliability and performance. Optimized replica placement is an important feature that distinguishes HDFS from most other distributed file systems, and it requires a lot of tuning and experience. HDFS uses a strategy called rack awareness to improve data reliability, availability, and network bandwidth utilization. The current replica placement policy is only a first step in this direction. The short-term goals are to validate the policy on production systems, learn more about its behavior, and build a foundation for testing and researching more sophisticated policies.

Large HDFS instances typically run on a cluster of computers that spans many racks. Communication between two machines in different racks has to go through switches. In most cases, the network bandwidth between machines in the same rack is greater than the bandwidth between machines in different racks.

Through a rack-awareness process, the Namenode determines the ID of the rack to which each Datanode belongs. A simple but non-optimal policy is to place replicas on different racks. This prevents data loss when an entire rack fails and allows reads to use the bandwidth of multiple racks. It also distributes replicas evenly across the cluster, which makes it easy to balance load when a component fails. However, this policy increases the cost of writes, because each write has to transfer blocks to multiple racks.
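The simple one-replica-per-rack policy can be sketched as follows. The function names and the topology format (a dict from rack ID to node list) are hypothetical, and the sketch picks the first node of each rack for determinism, where a real policy would presumably also consider load and free space.

```python
def place_on_distinct_racks(topology, replication):
    """Pick one node from each of `replication` different racks.

    topology: dict mapping rack ID -> list of node names.
    Returns a list of (rack, node) pairs.
    """
    racks = [rack for rack, nodes in topology.items() if nodes]
    if replication > len(racks):
        raise ValueError("not enough racks for one replica per rack")
    return [(rack, topology[rack][0]) for rack in racks[:replication]]


def survives_rack_failure(placement, failed_rack):
    """True if at least one replica lives outside the failed rack."""
    return any(rack != failed_rack for rack, _ in placement)
```

With three replicas on three racks, the data survives the loss of any single rack, but a writer located in one of those racks has to send the block across rack boundaries twice.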

In the common case, the replication factor is 3, and the HDFS placement policy is to put one replica on a node in the local rack, another on a different node in the same rack, and the last on a node in a different rack. This policy cuts inter-rack write traffic, which generally improves write performance. The chance of a rack failure is far lower than that of a node failure, so the policy does not affect data reliability or availability guarantees. Because a block is placed on only two racks rather than three, the policy also reduces the aggregate network bandwidth used when reading data. Under this policy, replicas are not evenly distributed across racks: one third of the replicas are on one node, two thirds are on one rack, and the remaining third are evenly distributed across the other racks. The policy therefore improves write performance without compromising data reliability or read performance.
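The three-replica policy described above can be written down as a short sketch. The function name place_three_replicas and the topology format are hypothetical. The sketch assumes the first replica goes on the writer's own node, that the local rack has at least two nodes, and that at least one other rack exists; it also picks nodes deterministically rather than at random.

```python
def place_three_replicas(writer, topology):
    """Place replicas: local node, same-rack peer, node on another rack.

    writer: name of the node that writes the block.
    topology: dict mapping rack ID -> list of node names.
    Returns a list of three (rack, node) pairs.
    """
    # Find the rack that contains the writer.
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    # Second replica: a different node in the same rack.
    peer = next(n for n in topology[local_rack] if n != writer)
    # Third replica: a node in any other non-empty rack.
    remote_rack = next(
        r for r, nodes in topology.items() if r != local_rack and nodes
    )
    return [
        (local_rack, writer),
        (local_rack, peer),
        (remote_rack, topology[remote_rack][0]),
    ]
```

The result always spans exactly two racks, so only one copy of the block crosses a rack boundary during the write, and the block still survives the loss of either rack.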

The default replica placement policy described here is currently a work in progress.

Replica selection

In order to reduc
