Mini-Apps: Foreign Literature Translation, English and Chinese (Word Format).docx


Foreign Literature Translation: Original Text and Translation

Title:

ENHANCING APPLICATION PERFORMANCE USING MINI-APPS: COMPARISON OF HYBRID PARALLEL PROGRAMMING PARADIGMS

Authors:

Gary Lawson, Michael Poteat, Masha Sosonkina, Robert Baurle

Journal:

Computer Science

Year:

2016

Original Text

COMPARISON OF HYBRID PARALLEL PROGRAMMING PARADIGMS

Gary Lawson, Michael Poteat, Masha Sosonkina, Robert Baurle

ABSTRACT

In many fields, real-world applications for High Performance Computing have already been developed. For these applications to stay up-to-date, new parallel strategies must be explored to yield the best performance; however, restructuring or modifying a real-world application may be daunting depending on the size of the code. In this case, a mini-app may be employed to quickly explore such options without modifying the entire code. In this work, several mini-apps have been created to enhance a real-world application performance, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23 was measured for MPI+SMPI, but only 10 was measured for MPI+OpenMP.

Keywords: Mini-apps, Performance, VULCAN, Shared Memory, MPI, OpenMP

1 INTRODUCTION

In many fields, real-world applications have already been developed. For established applications to stay up-to-date, new parallel strategies must be explored to determine which may yield the best performance, especially with advances in computing hardware. However, restructuring or modifying a real-world application incurs increased cost depending on the size of the code and the changes to be made. A mini-app may be created to quickly explore such options without modifying the entire code. Mini-apps reduce the overhead of applying new strategies, so various strategies may be implemented and compared. This work presents the authors' experiences when following this strategy for a real-world application developed by NASA.

VULCAN (Viscous Upwind Algorithm for Complex Flow Analysis) is a turbulent, nonequilibrium, finite-rate chemical kinetics, Navier-Stokes flow solver for structured, cell-centered, multiblock grids that is maintained and distributed by the Hypersonic Air Breathing Propulsion Branch of the NASA Langley Research Center (NASA 2016). The mini-app developed in this work uses the Householder Reflector kernel for solving systems of linear equations. This kernel is used often by different workloads, and is a good candidate to decide what strategy type to apply to VULCAN. VULCAN is built on a single layer of MPI and the code has been optimized to obtain perfect vectorization; therefore, two levels of parallelism are currently used. This work investigates two flavors of shared-memory parallelism, OpenMP and Shared MPI, which will provide the third level of parallelism for the application. A third level of parallelism increases performance, which decreases the time-to-solution.

MPI has extended the standard to MPI version 3.0, which includes the Shared Memory (SHM) model (Mikhail B. (Intel) 2015, Message Passing Interface Forum 2012), known in this work as Shared MPI (SMPI). This extension allows MPI to create memory windows that are shared between MPI tasks on the same physical node. In this way, MPI tasks are equivalent to threads, except Shared MPI is more difficult for a programmer to implement. OpenMP is the most common shared-memory library used to date because of its ease-of-use (OpenMP 2016). In most cases, only a few OpenMP pragmas are required to parallelize a loop; however, OpenMP is subject to increased overhead, which may decrease performance if not properly tuned.

As early as the year 2000, the authors in (Cappello and Etiemble 2000) found that latency sensitive codes seem to benefit from pure MPI implementations whereas bandwidth sensitive codes benefit from hybrid MPI+OpenMP. Also, the authors found that faster processors will benefit hybrid MPI+OpenMP codes if data movement is not an overwhelming bottleneck (Cappello and Etiemble 2000). Since this time, hybrid MPI+OpenMP implementations have improved, but not without difficulties. In (Drosinos and Koziris 2004, Chorley and Walker 2010), it was found that OpenMP incurs many performance reductions, including: overhead (fork/join, atomics, etc.), false sharing, imbalanced message passing, and a sensitivity to processor mapping. However, OpenMP overhead may be hidden when using more threads. In (Rabenseifner, Hager, and Jost 2009), the authors found that simply using OpenMP could incur performance penalties because the compiler avoids optimizing OpenMP loops (verified up to version 10.1). Although compilers have advanced considerably since this time, application users that still compile using older versions may be at risk if using OpenMP. In (Drosinos and Koziris 2004, Chorley and Walker 2010) the authors found that the hybrid MPI+OpenMP approach outperforms the pure MPI approach because the hybrid strategy diversifies the path to parallel execution.

More recently, MPI extended its standard to include the SHM model (Mikhail B. (Intel) 2015). The authors in (Hoefler, Dinan, Thakur, Barrett, Balaji, Gropp, and Underwood 2015) present MPI RMA theory and examples, which are the basis of the SHM model. In (Gerstenberger, Besta, and Hoefler 2013), the authors conduct a thorough performance evaluation of MPI RMA, including an investigation of different synchronization techniques for memory windows. In (Hoefler, Dinan, Buntinas, Balaji, Barrett, Brightwell, Gropp, Kale, and Thakur 2013), the authors investigate the viability of MPI+SMPI execution, as well as compare it to MPI+OpenMP execution. It was found that an underlying limitation of OpenMP is the shared-by-default model for memory, which does not couple well with MPI since MPI's memory model is private-by-default. For this reason, MPI+SMPI codes are expected to perform better, since shared memory is explicit and the memory model for the entire code is private-by-default. Most recently, a new MPI communication model has been introduced in (Gropp, Olson, and Samfass 2016), which better captures multinode communication performance, and offers an open-source benchmarking tool to capture the model parameters for a given system. Independent of the shared memory layer, MPI is the de facto standard in data movement between nodes and such a model can help any MPI program.

The remainder of this paper is organized into the following sections: Section 2 introduces the Householder mini-apps, Section 3 presents the performance testing results for the mini-apps considered, and Section 4 concludes this paper.

2 HOUSEHOLDER MINI-APP

The mini-apps use the Householder computation kernel from VULCAN, which is used in solving systems of linear equations. The Householder routine is an algorithm that is used to transform a square matrix into triangular form, without increasing the magnitude of each element significantly (Hansen 1992). The Householder routine is numerically stable, in that it does not lose a significant amount of accuracy due to very small or very large intermediate values used in the computation.
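The source does not show VULCAN's kernel, so the following is only a textbook-style sketch of the idea described above: Householder reflections reduce a square matrix to upper-triangular form, after which back substitution yields the solution. The function name householder_solve and the dense list-of-lists storage are assumptions made for illustration, not the authors' implementation.

```python
import math


def householder_solve(A, b):
    """Solve A x = b for a square matrix A using Householder reflections.

    Hypothetical helper for illustration; A is a list of rows, b a list.
    """
    n = len(A)
    # Work on copies so the caller's data is left untouched.
    R = [row[:] for row in A]
    y = b[:]
    for k in range(n):
        # Euclidean norm of column k from the diagonal downwards.
        norm = math.sqrt(sum(R[i][k] ** 2 for i in range(k, n)))
        if norm == 0.0:
            raise ValueError("matrix is singular")
        # Pick the sign of alpha opposite to R[k][k] to avoid cancellation.
        alpha = -math.copysign(norm, R[k][k])
        v = [R[i][k] for i in range(k, n)]
        v[0] -= alpha
        vtv = sum(c * c for c in v)
        # Apply H = I - 2 v v^T / (v^T v) to the trailing columns of R.
        for j in range(k, n):
            s = 2.0 * sum(v[i] * R[k + i][j] for i in range(n - k)) / vtv
            for i in range(n - k):
                R[k + i][j] -= s * v[i]
        # Apply the same reflection to the right-hand side.
        s = 2.0 * sum(v[i] * y[k + i] for i in range(n - k)) / vtv
        for i in range(n - k):
            y[k + i] -= s * v[i]
    # Back substitution on the upper-triangular system R x = y.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - tail) / R[i][i]
    return x
```

Because each reflection is orthogonal, column norms are preserved at every step, which is one way to see why element magnitudes do not grow significantly during the reduction.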

Mini-apps are designed to perform specific functions. In this work, the important features are as follows: accept generic input; validate the numerical result of the optimized routine; measure performance of the original and optimized routines; and tune optimizations.
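The features listed above can be sketched as a small harness. This is a hedged illustration only: run_mini_app, square_original, and square_blocked are hypothetical names, the toy kernel (squaring a list) stands in for the real routine, inputs are passed in memory rather than read from a file, and the tolerance value is an assumption.

```python
import time


def run_mini_app(original, optimized, inputs, tunables, tol=1e-12):
    """Validate and time `optimized` against `original` for each tunable value."""
    start = time.perf_counter()
    reference = original(inputs)
    t_original = time.perf_counter() - start

    report = []
    for value in tunables:
        start = time.perf_counter()
        result = optimized(inputs, value)
        elapsed = time.perf_counter() - start
        # Validation: largest absolute difference between the two outputs.
        max_diff = max(
            (abs(r - o) for r, o in zip(reference, result)), default=0.0
        )
        valid = len(result) == len(reference) and max_diff <= tol
        report.append({
            "tunable": value,
            "valid": valid,
            "max_diff": max_diff,
            # Relative performance from execution time, guarding against 0.
            "speedup": t_original / elapsed if elapsed > 0 else float("inf"),
        })
    return report


def square_original(xs):
    """Toy stand-in for the original routine."""
    return [x * x for x in xs]


def square_blocked(xs, block_size):
    """Toy stand-in for an optimized routine with one tunable parameter."""
    out = []
    for lo in range(0, len(xs), block_size):
        out.extend(x * x for x in xs[lo:lo + block_size])
    return out
```

Calling run_mini_app(square_original, square_blocked, data, [1, 2, 4]) returns one record per tunable value, so the setting with the best measured speedup can be picked while confirming that the result still matches the original routine.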

The generic input is read in from a file, where the file must contain at least one matrix A and resulting vector b. Should only one matrix and vector be supplied, the input will be duplicated for all instances of m. Validation of the optimized routine is performed by taking the difference of the output from the original and optimized routines. The mini-app will first compute the solution of the input using the original routine, and then the optimized routine. This way the output may be compared directly, and relative performance may also be measured using execution time. Should the optimized routine feature one or more parameters that may be varied, they are to be investigated such that the optimization may be tuned to the hardware. In this work, there is always at least one tunable parameter.

One feature that should have been factored into the mini-app design was modularizing the different versions of the Householder routine. In this work, two mini-apps were designed because each implements a different version of the parallel Householder routine; however, it would have been better to design a single mini-app that uses modules to include other versions of the parallel Householder kernel. With this functionality, it would be less cumbersome to work on each version of the kernel.

To parallelize the Householder routine, m is decomposed into separate but equal chunks that are then solved by each thread (shared MPI tasks are referred to as threads in this work for brevity). The original routine, however, varies over m inside the inner-most computational loop (an optimization that benefits vectorization and caching), whereas the parallel loop must be the outer-most loop for best performance. Therefore, loop blocking has been invoked for the parallel sections of the code. Loop blocking is a technique commonly used to reduce the memory footprint of a computation such that it fits inside the cache for a given hardware. Therefore, the parallel Householder routine has at least one tunable parameter, block size.
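A minimal Python sketch of the loop restructuring described above, under stated assumptions: m is read here as the number of independent systems, a batched matrix-vector product stands in for the Householder kernel, and the names matvec_original, matvec_blocked, and n_threads are hypothetical. The thread pool only shows the structure of a parallel outer loop over blocks; because of CPython's global interpreter lock it is not expected to run faster, and it does not reproduce the paper's OpenMP or SMPI implementations.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_original(A, x):
    """Batched y[s] = A[s] * x[s] with the batch index s in the innermost loop."""
    m, n = len(A), len(A[0])
    y = [[0.0] * n for _ in range(m)]
    for i in range(n):
        for j in range(n):
            for s in range(m):  # innermost loop runs over all m systems
                y[s][i] += A[s][i][j] * x[s][j]
    return y


def matvec_blocked(A, x, block_size, n_threads=2):
    """Same result, but blocks of systems form the outermost (parallel) loop."""
    m, n = len(A), len(A[0])
    y = [[0.0] * n for _ in range(m)]

    def work(lo):
        hi = min(lo + block_size, m)
        for i in range(n):
            for j in range(n):
                for s in range(lo, hi):  # still innermost, limited to one block
                    y[s][i] += A[s][i][j] * x[s][j]

    # Blocks write to disjoint rows of y, so they can be processed independently.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, range(0, m, block_size)))
    return y
```

Within a block the batch index stays innermost, which keeps the access pattern the original loop order was chosen for, while block_size controls how much data each unit of parallel work touches. That is why it appears as the tunable parameter.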

In this work, two flavors of the shared memory model are investigated: OpenMP and SMPI. The difference between OpenMP and SMPI lies in how memory is managed. OpenMP uses a public-memory model where all data is available to all threads ...
