Mini-Apps: Foreign-Language Literature Translation, Chinese and English (Word format).docx
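Foreign-language literature translation: original text and translation
Title: ENHANCING APPLICATION PERFORMANCE USING MINI-APPS: COMPARISON OF HYBRID PARALLEL PROGRAMMING PARADIGMS
Authors: Gary Lawson, Michael Poteat, Masha Sosonkina, Robert Baurle
Journal: Computer Science
Year: 2016
Original text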
COMPARISON OF HYBRID PARALLEL PROGRAMMING PARADIGMS
Gary Lawson, Michael Poteat, Masha Sosonkina, Robert Baurle
ABSTRACT
In many fields, real-world applications for High Performance Computing have already been developed. For these applications to stay up-to-date, new parallel strategies must be explored to yield the best performance; however, restructuring or modifying a real-world application may be daunting depending on the size of the code. In this case, a mini-app may be employed to quickly explore such options without modifying the entire code. In this work, several mini-apps have been created to enhance a real-world application performance, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23 was measured for MPI+SMPI, but only 10 was measured for MPI+OpenMP.
Keywords: Mini-apps, Performance, VULCAN, Shared Memory, MPI, OpenMP
1 INTRODUCTION
In many fields, real-world applications have already been developed. For established applications to stay up-to-date, new parallel strategies must be explored to determine which may yield the best performance, especially with advances in computing hardware. However, restructuring or modifying a real-world application incurs increased cost depending on the size of the code and changes to be made. A mini-app may be created to quickly explore such options without modifying the entire code. Mini-apps reduce the overhead of applying new strategies, thus various strategies may be implemented and compared. This work presents the authors' experiences when following this strategy for a real-world application developed by NASA.
VULCAN (Viscous Upwind Algorithm for Complex Flow Analysis) is a turbulent, nonequilibrium, finite-rate chemical kinetics, Navier-Stokes flow solver for structured, cell-centered, multiblock grids that is maintained and distributed by the Hypersonic Air Breathing Propulsion Branch of the NASA Langley Research Center (NASA 2016). The mini-app developed in this work uses the Householder Reflector kernel for solving systems of linear equations. This kernel is used often by different workloads, and is a good candidate to decide what strategy type to apply to VULCAN. VULCAN is built on a single layer of MPI and the code has been optimized to obtain perfect vectorization; therefore, two levels of parallelism are currently used. This work investigates two flavors of shared-memory parallelism, OpenMP and Shared MPI, which will provide the third level of parallelism for the application. A third level of parallelism increases performance, which decreases the time-to-solution.
MPI has extended the standard to MPI version 3.0, which includes the Shared Memory (SHM) model (Mikhail B. (Intel) 2015, Message Passing Interface Forum 2012), known in this work as Shared MPI (SMPI). This extension allows MPI to create memory windows that are shared between MPI tasks on the same physical node. In this way, MPI tasks are equivalent to threads, except Shared MPI is more difficult for a programmer to implement. OpenMP is the most common shared-memory library used to date because of its ease-of-use (OpenMP 2016). In most cases, only a few OpenMP pragmas are required to parallelize a loop; however, OpenMP is subject to increased overhead, which may decrease performance if not properly tuned.
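The idea of an explicitly created shared memory window can be illustrated with a small analogy. The sketch below is not MPI and is not taken from the paper: it uses Python's standard `multiprocessing.shared_memory` module as a stand-in, and the function name `shared_window_demo` is hypothetical. It shows that data is visible to a second handle only because a shared buffer was explicitly created and attached by name, while an ordinary copy stays private.

```python
from multiprocessing import shared_memory


def shared_window_demo():
    """Analogy only (not MPI): create an explicit shared buffer, attach a
    second handle to it by name, and compare it with a private copy."""
    owner = shared_memory.SharedMemory(create=True, size=8)
    try:
        peer = shared_memory.SharedMemory(name=owner.name)
        try:
            owner.buf[0] = 0
            # An ordinary copy is private: later writes do not reach it.
            private_copy = bytes(owner.buf[:1])
            # A write through one handle is visible through the other,
            # because both handles map the same explicitly shared buffer.
            owner.buf[0] = 42
            seen_by_peer = peer.buf[0]
            seen_in_copy = private_copy[0]
        finally:
            peer.close()
    finally:
        owner.close()
        owner.unlink()  # release the shared segment
    return seen_by_peer, seen_in_copy


if __name__ == "__main__":
    print(shared_window_demo())
```

In an actual MPI-3 code the window would be created with the standard's shared-memory window routines and accessed by the tasks on one node; the analogy only conveys the private-by-default, shared-on-request behavior.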
As early as the year 2000, the authors in (Cappello and Etiemble 2000) found that latency sensitive codes seem to benefit from pure MPI implementations whereas bandwidth sensitive codes benefit from hybrid MPI+OpenMP. Also, the authors found that faster processors will benefit hybrid MPI+OpenMP codes if data movement is not an overwhelming bottleneck (Cappello and Etiemble 2000). Since this time, hybrid MPI+OpenMP implementations have improved, but not without difficulties. In (Drosinos and Koziris 2004, Chorley and Walker 2010), it was found that OpenMP incurs many performance reductions, including: overhead (fork/join, atomics, etc.), false sharing, imbalanced message passing, and a sensitivity to processor mapping. However, OpenMP overhead may be hidden when using more threads. In (Rabenseifner, Hager, and Jost 2009), the authors found that simply using OpenMP could incur performance penalties because the compiler avoids optimizing OpenMP loops (verified up to version 10.1). Although compilers have advanced considerably since this time, application users that still compile using older versions may be at risk if using OpenMP. In (Drosinos and Koziris 2004, Chorley and Walker 2010) the authors found that the hybrid MPI+OpenMP approach outperforms the pure MPI approach because the hybrid strategy diversifies the path to parallel execution. More recently, MPI extended its standard to include the SHM model (Mikhail B. (Intel) 2015). The authors in (Hoefler, Dinan, Thakur, Barrett, Balaji, Gropp, and Underwood 2015) present MPI RMA theory and examples, which are the basis of the SHM model. In (Gerstenberger, Besta, and Hoefler 2013), the authors conduct a thorough performance evaluation of MPI RMA, including an investigation of different synchronization techniques for memory windows. In (Hoefler, Dinan, Buntinas, Balaji, Barrett, Brightwell, Gropp, Kale, and Thakur 2013), the authors investigate the viability of MPI+SMPI execution, as well as compare it to MPI+OpenMP execution. It was found that an underlying limitation of OpenMP is the shared-by-default model for memory, which does not couple well with MPI since the memory model is private-by-default. For this reason, MPI+SMPI codes are expected to perform better, since shared memory is explicit and the memory model for the entire code is private-by-default. Most recently, a new MPI communication model has been introduced in (Gropp, Olson, and Samfass 2016), which better captures multinode communication performance, and offers an open-source benchmarking tool to capture the model parameters for a given system. Independent of the shared memory layer, MPI is the de facto standard in data movement between nodes and such a model can help any MPI program.
The remainder of this paper is organized into the following sections: Section 2 introduces the Householder mini-apps, Section 3 presents the performance testing results for the mini-apps considered, and Section 4 concludes this paper.
2 HOUSEHOLDER MINI-APP
The mini-apps use the Householder computation kernel from VULCAN, which is used in solving systems of linear equations. The Householder routine is an algorithm that is used to transform a square matrix into triangular form, without increasing the magnitude of each element significantly (Hansen 1992). The Householder routine is numerically stable, in that it does not lose a significant amount of accuracy due to very small or very large intermediate values used in the computation.
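To make the triangularization step concrete, the following is a minimal textbook sketch of a Householder-based linear solve in plain Python. It is not the VULCAN kernel, whose source is not shown here; the function name `householder_solve` and the dense list-of-lists layout are assumptions made for illustration, and the matrix is assumed to be square and nonsingular.

```python
import math


def householder_solve(A, b):
    """Solve A x = b by reducing A to upper-triangular form with
    Householder reflections, then back-substituting.
    Illustrative sketch; assumes A is square and nonsingular."""
    n = len(A)
    # Work on copies so the caller's data is left untouched.
    R = [[float(v) for v in row] for row in A]
    y = [float(v) for v in b]

    for k in range(n - 1):
        # Norm of column k on and below the diagonal.
        norm = math.sqrt(sum(R[i][k] ** 2 for i in range(k, n)))
        if norm == 0.0:
            continue  # nothing to eliminate in this column
        # Sign chosen to avoid cancellation when forming v[0].
        alpha = -math.copysign(norm, R[k][k])
        v = [R[i][k] for i in range(k, n)]
        v[0] -= alpha
        vtv = sum(x * x for x in v)
        # Apply H = I - 2 v v^T / (v^T v) to the remaining columns of R.
        for j in range(k, n):
            s = 2.0 * sum(v[i - k] * R[i][j] for i in range(k, n)) / vtv
            for i in range(k, n):
                R[i][j] -= s * v[i - k]
        # Apply the same reflection to the right-hand side.
        s = 2.0 * sum(v[i - k] * y[i] for i in range(k, n)) / vtv
        for i in range(k, n):
            y[i] -= s * v[i - k]

    # Back substitution on the upper-triangular system R x = y.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = y[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / R[i][i]
    return x


if __name__ == "__main__":
    print(householder_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

Because each reflection is an orthogonal transformation, column norms are preserved, which is the property behind the statement that element magnitudes do not grow significantly.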
Mini-apps are designed to perform specific functions. In this work, the important features are as follows:
Accept generic input.
Validate the numerical result of the optimized routine.
Measure performance of the original and optimized routines.
Tune optimizations.
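The four features listed above can be sketched as a small driver. Everything in this block is a hypothetical stand-in: `original_routine` and `optimized_routine` are trivial row-sum kernels rather than the Householder routines, `replicate_input` mimics duplicating a single input for all m instances, and `block_size` plays the role of a tunable parameter.

```python
import time


def replicate_input(row, m):
    """Duplicate a single input so that there are m instances of it."""
    return [list(row) for _ in range(m)]


def original_routine(systems):
    """Stand-in reference kernel: one row sum per system."""
    return [sum(row) for row in systems]


def optimized_routine(systems, block_size):
    """Stand-in tuned kernel: same result, processed block by block."""
    out = [0.0] * len(systems)
    for start in range(0, len(systems), block_size):
        stop = min(start + block_size, len(systems))
        for i in range(start, stop):
            out[i] = sum(systems[i])
    return out


def run_mini_app(systems, block_sizes):
    """Run the reference first, then each tuned variant; report the
    largest output difference and the relative execution time."""
    t0 = time.perf_counter()
    reference = original_routine(systems)
    t_ref = time.perf_counter() - t0

    report = []
    for block_size in block_sizes:
        t0 = time.perf_counter()
        result = optimized_routine(systems, block_size)
        t_opt = time.perf_counter() - t0
        max_diff = max(abs(r - o) for r, o in zip(reference, result))
        speedup = t_ref / t_opt if t_opt > 0 else float("inf")
        report.append(
            {"block_size": block_size, "max_diff": max_diff, "speedup": speedup}
        )
    return report


if __name__ == "__main__":
    data = replicate_input([1.0, 2.0, 3.0], 1000)
    for entry in run_mini_app(data, [1, 8, 64]):
        print(entry)
```

Timing a single call with a wall-clock timer is noisy; a real harness would likely repeat each measurement several times, which is omitted here to keep the sketch short.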
The generic input is read in from a file, where the file must contain at least one matrix A and resulting vector b. Should only one matrix and vector be supplied, the input will be duplicated for all instances of m. Validation of the optimized routine is performed by taking the difference of the output from the original and optimized routines. The mini-app will first compute the solution of the input using the original routine, and then the optimized routine. This way the output may be compared directly, and relative performance may also be measured using execution time. Should the optimized routine feature one or more parameters that may be varied, they are to be investigated such that the optimization may be tuned to the hardware. In this work, there is always at least one tunable parameter.
One feature that should have been factored into the mini-app design was modularizing the different versions of the Householder routine. In this work, two mini-apps were designed because each implements a different version of the parallel Householder routine; however, it would have been better to design a single mini-app that uses modules to include other versions of the parallel Householder kernel. With this functionality, it would be less cumbersome to work on each version of the kernel.
To parallelize the Householder routine, m is decomposed into separate, but equal chunks that are then solved by each thread (shared MPI tasks are equivalent to threads in this work for brevity). However, the original routine varies over m inside the inner-most computational loop (an optimization that benefits vectorization and caching), but the parallel loop must be the outer-most loop for best performance. Therefore, loop blocking has been invoked for the parallel sections of the code. Loop blocking is a technique commonly used to reduce the memory footprint of a computation such that it fits inside the cache for a given hardware. Therefore, the parallel Householder routine has at least one tunable parameter, block size.
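The loop structure described above can be sketched as follows. This is a serial illustration only, under stated assumptions: `chunk_bounds` and `blocked_kernel` are hypothetical names, the kernel is a toy sum of squares rather than the Householder routine, chunks are made near-equal when m does not divide evenly, and the outer loop over chunks marks where the parallel loop would sit.

```python
def chunk_bounds(m, n_workers):
    """Split range(m) into n_workers contiguous, near-equal chunks."""
    base, extra = divmod(m, n_workers)
    bounds = []
    start = 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def blocked_kernel(a, n_workers, block_size):
    """Toy kernel with the system index s innermost.

    a[j][s] is element j of system s. The result is the sum of squares
    of each system's elements, computed chunk by chunk and block by block.
    """
    n = len(a)
    m = len(a[0])
    out = [0.0] * m
    for lo, hi in chunk_bounds(m, n_workers):   # outer-most: one chunk per worker
        for b0 in range(lo, hi, block_size):    # blocks sized to fit in cache
            b1 = min(b0 + block_size, hi)
            for j in range(n):                  # loop of the algorithm itself
                row = a[j]
                for s in range(b0, b1):         # inner-most loop over systems
                    out[s] += row[s] * row[s]
    return out


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
    print(chunk_bounds(5, 2))
    print(blocked_kernel(a, n_workers=2, block_size=2))
```

The result does not depend on `n_workers` or `block_size`; only the order in which systems are visited changes, which is what makes block size safe to tune for a given cache.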
In this work, two flavors of the shared memory model are investigated: OpenMP and SMPI. The difference between OpenMP and SMPI lies in how memory is managed. OpenMP uses a public-memory model where all data is available to all thr