projectWord文件下载.doc


.text
; initialize registers
daddi r1,r0,a
daddi r2,r0,b
daddi r3,r0,c
daddi r4,r0,6
Loop:
lw r5,0(r1)       ; element of a
lw r6,0(r2)       ; element of b
lw r7,0(r3)       ; element of c
dadd r8,r5,r6     ; a[i]+b[i]
dadd r9,r7,r8
sw r9,0(r1)       ; store value in a[i]
daddi r1,r1,8     ; increment memory pointers
daddi r2,r2,8
daddi r3,r3,8
daddi r4,r4,-1    ; i++
bnez r4,Loop
end:
halt

1) Load ex1.s into the memory of MIPS64 and disable the forwarding logic, the delay slot and the Branch target buffer from the Configure menu in the main toolbar. Before running the program, try to predict where stalls occur, how many clock cycles they will take, and for what kind of hazards they occur. Then compare your prediction with the simulation results.

Answer:

Clock cycles = 19 × 6 + 4 + 4 = 122

RAW data hazard stalls = 7 × 6 = 42
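The figure of 7 RAW stalls per iteration can be checked with a small model. The sketch below is a simplification, not the simulator's actual logic: it assumes a 5-stage in-order pipeline without forwarding in which a consumer can enter ID no earlier than the cycle of its producer's WB (ID + 3), and it ignores dependencies carried from one iteration to the next. The names `LOOP_BODY` and `raw_stalls` are made up for this illustration.

```python
# Simplified 5-stage pipeline (IF, ID, EX, MEM, WB) WITHOUT forwarding.
# Assumption: a register written in WB can be read by an instruction whose
# ID stage falls in the same cycle, so consumer ID >= producer ID + 3.

LOOP_BODY = [
    # (instruction text, destination register or None, source registers)
    ("lw r5,0(r1)",    "r5", ["r1"]),
    ("lw r6,0(r2)",    "r6", ["r2"]),
    ("lw r7,0(r3)",    "r7", ["r3"]),
    ("dadd r8,r5,r6",  "r8", ["r5", "r6"]),
    ("dadd r9,r7,r8",  "r9", ["r7", "r8"]),
    ("sw r9,0(r1)",    None, ["r9", "r1"]),
    ("daddi r1,r1,8",  "r1", ["r1"]),
    ("daddi r2,r2,8",  "r2", ["r2"]),
    ("daddi r3,r3,8",  "r3", ["r3"]),
    ("daddi r4,r4,-1", "r4", ["r4"]),
    ("bnez r4,Loop",   None, ["r4"]),
]


def raw_stalls(body):
    """Return the RAW stall cycles charged to each instruction of body."""
    producer_id = {}  # register -> ID cycle of its latest producer in body
    stalls = []
    prev_id = -1
    for _text, dest, sources in body:
        earliest = prev_id + 1          # ID cycle if nothing stalls
        id_cycle = earliest
        for reg in sources:
            if reg in producer_id:
                id_cycle = max(id_cycle, producer_id[reg] + 3)
        stalls.append(id_cycle - earliest)
        prev_id = id_cycle
        if dest is not None:
            producer_id[dest] = id_cycle
    return stalls


per_instruction = raw_stalls(LOOP_BODY)
print(per_instruction, sum(per_instruction), 6 * sum(per_instruction))
```

Under these assumptions the stalls fall on `dadd r8` (1), `dadd r9` (2), `sw r9` (2) and `bnez r4` (2), which adds up to the 7 stalls per iteration and 42 stalls over 6 iterations stated above.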

Simulator result: [screenshot not included]

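2) The program of ex1.s consists of a loop plus some other instructions outside it. After running the program to completion, estimate the CPI using the Statistics window. In the case of a program containing a "hot spot" (i.e. an internal loop whose instructions are executed much more frequently than all the other instructions) the CPI can be roughly estimated just using the asymptotic CPI, i.e.

$$\mathrm{CPI}_{\text{asymptotic}} = \lim_{L \to \infty} \frac{N_{out} + S_{out} + L\,(IC_{hot} + S_{hot})}{N_{out} + L \cdot IC_{hot}} = \frac{IC_{hot} + S_{hot}}{IC_{hot}}$$

where $N_{out}$ and $S_{out}$ are the number of instructions and the number of stalls outside the "hot spot", respectively, $L$ is the number of loop cycles of the innermost loop, and $IC_{hot}$ and $S_{hot}$ are the number of instructions and the number of stalls of the "hot spot". Compare the asymptotic CPI with the value resulting from simulations. Are results compatible?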

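Simulator CPI = ((11 + 7) * 6 + 4 + 5 + 5 + 1) / (11 * 6) = 1.848

(As written, the numerator evaluates to 123, and 123 / 66 ≈ 1.864; the reported 1.848 equals 122 / 66, i.e. the 122 clock cycles obtained in question 1.)

Asymptotic CPI = (11 + 7) / 11 = 1.636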
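The relation between the full CPI and its asymptotic value can be sketched in a few lines. This is only an illustration of the definitions in the question: the function names are invented here, and the values 5 and 8 used for the instructions and stalls outside the loop are hypothetical placeholders, not figures taken from the simulator.

```python
def cpi(n_out, s_out, loops, ic_hot, s_hot):
    """CPI = total cycles / total instructions, ignoring pipeline fill."""
    cycles = n_out + s_out + loops * (ic_hot + s_hot)
    instructions = n_out + loops * ic_hot
    return cycles / instructions


def asymptotic_cpi(ic_hot, s_hot):
    """Limit of cpi() as the number of loop iterations grows without bound."""
    return (ic_hot + s_hot) / ic_hot


# Hot spot of ex1.s without forwarding: 11 instructions, 7 stalls per iteration.
print(round(asymptotic_cpi(11, 7), 3))

# With hypothetical out-of-loop costs (5 instructions, 8 stalls), the full CPI
# approaches the asymptotic value as the iteration count increases.
for loops in (6, 60, 600):
    print(loops, round(cpi(5, 8, loops, 11, 7), 3))
```

With only 6 iterations the costs outside the loop still carry noticeable weight, which is one reason the simulated CPI sits above the asymptotic 1.636.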

Execution result: [screenshot not included]

3) Enable the forwarding logic and execute the code again. Compute the CPI again. Justify the remaining stalls and comment why some of them occur after ID stage rather than after IF.

Simulator CPI = ((11 + 1) * 6 + 5 + 5 + 4) / (11 * 6) = 1.303

Asymptotic CPI = (11 + 1) / 11 = 1.091

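The two values are not the same.

With forwarding enabled, most operands are forwarded to the EX stage. The bnez instruction, however, reads and tests its register in the ID stage, where the value of r4 is still stale. It therefore cannot use the value forwarded from the EX stage of the preceding daddi and has to wait in ID for the updated register value, which is why this stall occurs after ID rather than after IF.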
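The single remaining stall per iteration can be reproduced with a rough timing model. The sketch below is an assumption about how forwarding behaves, not a description of the simulator's internals: ALU results are taken to be usable one cycle after EX, load results one cycle after MEM, ordinary instructions consume operands in EX, and branches consume them in ID. `LOOP_BODY` and `forwarding_stalls` are names invented for this example.

```python
# Simplified 5-stage pipeline WITH forwarding (assumed timing, see text above).
LOOP_BODY = [
    # (instruction text, kind, destination register or None, source registers)
    ("lw r5,0(r1)",    "load",   "r5", ["r1"]),
    ("lw r6,0(r2)",    "load",   "r6", ["r2"]),
    ("lw r7,0(r3)",    "load",   "r7", ["r3"]),
    ("dadd r8,r5,r6",  "alu",    "r8", ["r5", "r6"]),
    ("dadd r9,r7,r8",  "alu",    "r9", ["r7", "r8"]),
    ("sw r9,0(r1)",    "store",  None, ["r9", "r1"]),
    ("daddi r1,r1,8",  "alu",    "r1", ["r1"]),
    ("daddi r2,r2,8",  "alu",    "r2", ["r2"]),
    ("daddi r3,r3,8",  "alu",    "r3", ["r3"]),
    ("daddi r4,r4,-1", "alu",    "r4", ["r4"]),
    ("bnez r4,Loop",   "branch", None, ["r4"]),
]


def forwarding_stalls(body):
    """Return the stall cycles charged to each instruction of body."""
    ready = {}  # register -> first cycle in which its new value can be consumed
    stalls = []
    prev_id = -1
    for _text, kind, dest, sources in body:
        earliest = prev_id + 1
        # Branches need operands in ID; all other kinds need them in EX (ID + 1).
        consume_offset = 0 if kind == "branch" else 1
        id_cycle = earliest
        for reg in sources:
            if reg in ready:
                id_cycle = max(id_cycle, ready[reg] - consume_offset)
        stalls.append(id_cycle - earliest)
        prev_id = id_cycle
        if dest is not None:
            # ALU result: produced in EX (ID + 1), usable from ID + 2.
            # Load result: produced in MEM (ID + 2), usable from ID + 3.
            ready[dest] = id_cycle + (3 if kind == "load" else 2)
    return stalls


print(forwarding_stalls(LOOP_BODY))
```

In this model every stall disappears except one cycle on `bnez r4,Loop`, consistent with the (11 + 1) used in the CPI expressions for this question.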

4) Disable the forwarding logic and assume that the MIPS hardware cannot detect hazards. Modify the source code by inserting NOPs where appropriate without reordering the code (NOP stuffing technique). Check with the simulator that no stall occurs, and check whether the CPI has changed. Do we have better performance?

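With NOPs inserted:

lw r5,0(r1)
lw r6,0(r2)
lw r7,0(r3)
NOP
dadd r8,r5,r6
dadd r9,r7,r8
sw r9,0(r1)
daddi r1,r1,8
daddi r2,r2,8
daddi r3,r3,8
daddi r4,r4,-1
bnez r4,Loop
end:

After inserting the NOPs no stalls occur and the CPI changes, but performance is worse rather than better.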
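To see where NOPs are required and why the CPI changes without any gain in speed, the loop body can be padded mechanically. The sketch assumes the same simplified no-forwarding rule as before (a consumer must sit at least 3 instruction slots after its producer) and ignores dependencies carried across iterations; `stuff_nops` is a hypothetical helper, not part of any tool. Under this assumed model the loop body needs seven NOPs, matching the seven RAW stalls per iteration reported earlier, which is more than the single NOP visible in the listing above.

```python
LOOP_BODY = [
    # (instruction text, destination register or None, source registers)
    ("lw r5,0(r1)",    "r5", ["r1"]),
    ("lw r6,0(r2)",    "r6", ["r2"]),
    ("lw r7,0(r3)",    "r7", ["r3"]),
    ("dadd r8,r5,r6",  "r8", ["r5", "r6"]),
    ("dadd r9,r7,r8",  "r9", ["r7", "r8"]),
    ("sw r9,0(r1)",    None, ["r9", "r1"]),
    ("daddi r1,r1,8",  "r1", ["r1"]),
    ("daddi r2,r2,8",  "r2", ["r2"]),
    ("daddi r3,r3,8",  "r3", ["r3"]),
    ("daddi r4,r4,-1", "r4", ["r4"]),
    ("bnez r4,Loop",   None, ["r4"]),
]


def stuff_nops(body, min_distance=3):
    """Insert NOPs so each consumer is min_distance slots after its producer."""
    out = []
    last_write = {}  # register -> position in out of its latest producer
    for text, dest, sources in body:
        needed = 0
        for reg in sources:
            if reg in last_write:
                gap = len(out) - last_write[reg]
                needed = max(needed, min_distance - gap)
        out.extend(["nop"] * needed)
        if dest is not None:
            last_write[dest] = len(out)  # position this instruction will take
        out.append(text)
    return out


padded = stuff_nops(LOOP_BODY)
print("\n".join(padded))

slots_per_iteration = len(padded)              # 11 instructions + 7 NOPs
cpi_with_stalls = slots_per_iteration / len(LOOP_BODY)
cpi_with_nops = slots_per_iteration / len(padded)
print(padded.count("nop"), round(cpi_with_stalls, 3), cpi_with_nops)
```

Each iteration still occupies 18 issue slots in this model, so the loop CPI falls from about 1.636 to 1.0 only because the NOPs are counted as instructions; the lower CPI does not translate into a shorter run time.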

5) Reschedule the instructions (code moving technique) in order to avoid stalls without modifying the program semantics (check the final result to see if after moving the instructions the result is the same). Recompute the normal and asymptotic CPI values.

The code is as follows:

Execution:

lw r5,0(r1)
daddi r4,r4,-1

Hence asymptotic CPI = (11 + 3) / 11 = 1.273

The actual value is 1.296.

6) Combine rescheduling and forwarding techniques and note the differences with respect to the forwarding-only and rescheduling-only cases. Try to enable the "Branch target buffer", look at the simulation code and determine the CPI. Has performance improved?

Try to modify the original code by adding 6 additional input values in a and b. What do you expect from CPI?

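Execution with forwarding added: [screenshot not included]

With the "Branch target buffer" enabled on top of this, the results are as follows:

forwarding: [screenshot not included]

rescheduling: [screenshot not included]

When the number of loop iterations is raised to 12, with correspondingly more input values, the CPI improves further, since the fixed cost outside the loop is spread over more iterations.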

.space 96
.word 10,11,12,13,0,1,1,0,13,12,11,10
.word 1,2,3,4,5,6,6,5,4,3,2,1

daddi r4,r0,12

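The program executes as follows:

rescheduling: [screenshot not included]

After increasing the number of loop iterations, the code becomes:

; element of b
; element of c
dadd r8,r5,r6     ; a[i]+b[i]
dadd r9,r7,r8
sw r9,0(r1)
daddi r1,r1,8
daddi r4,r4,-1

forwarding: [screenshot not included]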

7) A well-known compiler optimization is known as "loop unrolling". Basically, loop unrolling is the explicit repetition of the loop code a number of times. In this way we obtain a longer loop body that is executed fewer times. Consider the original code of ex1.s. Unroll the loop twice without any code moving, i.e. just repeat the first four loop instructions and make the necessary changes therein. Calculate the CPI for the case without forwarding. Is there any improvement?

lw r10,8(r1)
lw r11,8(r2)
lw r12,8(r3)
dadd r13,r10,r11
dadd r14,r12,r13
sw r14,8(r1)
daddi r1,r1,16
daddi r2,r2,16
daddi r3,r3,16
daddi r4,r4,-2

Asymptotic CPI = (17 + 13) / 17 = 1.765

Hence there is no improvement over the original loop (1.636).

8) Apply code rescheduling to the solution of the previous question and calculate both the CPI and the asymptotic CPI values with and without forwarding. Is there any improvement?

lw r10,8(r1)
lw r11,8(r2)
lw r12,8(r3)
daddi r4,r4,-2
dadd r13,r10,r11
daddi r1,r1,16
dadd r14,r12,r13
daddi r2,r2,16
daddi r3,r3,16
sw r14,-8(r1)
End:
halt

Normal CPI = ((17 + 0) * 6 + 5 + 0 + 4) / (17 * 6) = 1.088

Asymptotic CPI = (17 + 0) / 17 = 1.000

9) Suppose that the add operation in the original code is a floating point calculation and the loop is iterated for 12 times. Please use floating point registers for a[i], b[i], and c[i], and modify your assembly code. Please answer the following questions:

At least how many times do you need to unroll the loop to minimize stalls without forwarding?

What is the average latency of iterations for the original loop?

What is the code size?

Please show us your code.

The following is the input data of your code:

.text …………

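Execution result: [screenshot not included]

l.d f1,0(r1)
l.d f2,0(r2)
l.d f3,0(r3)
add.d f4,f2,f1
add.d f5,f3,f4
s.d f5,0(r1)

Unrolling the program above four times:

The program is as follows:

Execution result: [screenshot not included]

.double 10,11,12,13,0,1,1,0,13,12,11,10
.double 1,2,3,4,5,6,6,5,4,3,2,1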

l.d f5,0(r1)
l.d f6,0(r2)
l.d f10,8(r1)
l.d f11,8(r2)
add.d f8,f5,f6
l.d f15,16(r1)
l.d f16,16(r2)
add.d f13,f10,f11
l.d f7,0(r3)
l.d f12,8(r3)
add.d f18,f15,f16
l.d f17,16(r3)
add.d f9,f7,f8
add.d f14,f13,f12
daddi r1,r1,24
add.d f19,f17,f18
daddi r4,r4,-3
s.d f9,-24(r1)    ; r1 has already been advanced by 24, so offsets are negative
s.d f14,-16(r1)
daddi r3,r3,24
daddi r2,r2,24
s.d f19,-8(r1)
