09毕业设计算法类外文资料原文和中文译文.docx


Q-Learning By Examples

In this tutorial, you will discover step by step how an agent learns through training without a teacher (unsupervised, in the sense that nobody tells it the correct move) in an unknown environment. You will learn about a reinforcement learning algorithm called Q-learning. Reinforcement learning algorithms have been widely used in many applications, such as robotics, multi-agent systems, and games.

Instead of covering the theory of reinforcement learning, which you can read in many books and on other websites (see Resources for more references), this tutorial introduces the concept through a simple but comprehensive numerical example. You may also download the Matlab code or MS Excel spreadsheet for free.

Suppose we have 5 rooms in a building connected by certain doors, as shown in the figure below. We name the rooms A to E. We can consider the outside of the building as one big room that surrounds the building, and name it F. Notice that two doors lead into the building from F: one through room B and one through room E.

We can represent the rooms by a graph, with each room as a vertex (or node) and each door as an edge (or link). Refer to my other tutorial on Graph if you are not sure what a graph is.

We want to set the target room. If we put an agent in any room, we want the agent to go outside the building. In other words, the goal room is node F. To set this kind of goal, we give a reward value to each door (i.e. each edge of the graph). The doors that lead immediately to the goal have an instant reward of 100 (see the diagram below; they have red arrows). Other doors, which do not connect directly to the target room, have zero reward. Because each door is two-way (from A you can go to E, and from E you can go back to A), we assign two arrows to each door of the previous graph. Each arrow carries an instant reward value. The graph becomes the state diagram shown below.

An additional loop with the highest reward (100) is given to the goal room (F back to F), so that if the agent arrives at the goal, it will remain there forever. This type of goal is called an absorbing goal, because once the agent reaches the goal state, it stays in the goal state.

Ladies and gentlemen, now is the time to introduce our superstar agent….

Imagine our agent as a dumb virtual robot that can learn through experience. The agent can pass from one room to another but has no knowledge of the environment. It does not know which sequence of doors it must pass through to get outside the building.

Suppose we want to model a simple evacuation of an agent from any room in the building. Now suppose we have an agent in room C and we want the agent to learn to reach the outside of the house (F). (See the diagram below.)

How do we make our agent learn from experience?

Before we discuss how the agent will learn (using Q-learning) in the next section, let us discuss some terminology: state and action.

We call each room (including the outside of the building) a state. The agent's movement from one room to another is called an action. Let us redraw our state diagram. A state is depicted as a node in the state diagram, while an action is represented by an arrow.

Suppose the agent is now in state C. From state C, the agent can go to state D because state C is connected to D. From state C, however, the agent cannot go directly to state B because there is no door connecting rooms B and C (thus, no arrow). From state D, the agent can go to state B, state E, or back to state C (look at the arrows out of state D). If the agent is in state E, the three possible actions are to go to state A, state F, or state D. If the agent is in state B, it can go either to state F or to state D. From state A, it can only go back to state E.
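The arrows listed above can be written down as a small lookup table. This is only a sketch: the names `ACTIONS`, `can_move`, and `is_two_way` are illustrative and do not come from the tutorial's Matlab code. The arrows out of F (to B, to E, and the loop back to F) are taken from the earlier description of the two outside doors and the absorbing goal.

```python
# Arrows of the state diagram: state -> states reachable in one action.
# Transcribed from the text; F -> F is the absorbing loop at the goal.
ACTIONS = {
    "A": ["E"],
    "B": ["D", "F"],
    "C": ["D"],
    "D": ["B", "C", "E"],
    "E": ["A", "D", "F"],
    "F": ["B", "E", "F"],
}


def can_move(state, next_state):
    """Return True if there is an arrow from state to next_state."""
    return next_state in ACTIONS[state]


def is_two_way(actions):
    """Check that every arrow has a matching arrow in the opposite direction."""
    return all(s in actions[t] for s, targets in actions.items() for t in targets)


print(ACTIONS["D"])         # ['B', 'C', 'E']
print(can_move("C", "B"))   # False: no door between rooms B and C
print(is_two_way(ACTIONS))  # True: every door works in both directions
```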

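We can put the state diagram and the instant reward values into the following reward table, or matrix R.

Action to go to state (columns), agent now in state (rows):

          A     B     C     D     E     F
    A     -     -     -     -     0     -
    B     -     -     -     0     -     100
    C     -     -     -     0     -     -
    D     -     0     0     -     0     -
    E     0     -     -     0     -     100
    F     -     0     -     -     0     100

The minus sign in the table means that the row state has no action leading to the column state. For example, state A cannot go to state B (because no door connects rooms A and B, remember?).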
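The reward assignment described in this section can be sketched in code. The door list is read off the state description, `None` stands in for the minus sign, and all names (`DOORS`, `build_reward_matrix`, `goal_reward`) are assumptions made for illustration rather than part of the original material.

```python
STATES = ["A", "B", "C", "D", "E", "F"]
GOAL = "F"
# Two-way doors between rooms, as described in the text.
DOORS = [("A", "E"), ("B", "D"), ("B", "F"), ("C", "D"), ("D", "E"), ("E", "F")]


def build_reward_matrix(states, doors, goal, goal_reward=100):
    """Return R as a list of rows; None marks 'no action' (the minus sign)."""
    index = {s: i for i, s in enumerate(states)}
    R = [[None] * len(states) for _ in states]
    for u, v in doors:
        for src, dst in ((u, v), (v, u)):  # each door gives two arrows
            R[index[src]][index[dst]] = goal_reward if dst == goal else 0
    R[index[goal]][index[goal]] = goal_reward  # absorbing loop F -> F
    return R


R = build_reward_matrix(STATES, DOORS, GOAL)
for state, row in zip(STATES, R):
    print(state, ["-" if r is None else r for r in row])
```

Building R from the door list, instead of typing the matrix by hand, keeps the two arrows of each door consistent with each other.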

In the previous sections of this tutorial, we modeled the environment and the reward system for our agent. This section describes a learning algorithm called Q-learning (a simple form of reinforcement learning).

We have modeled the environment's reward system as matrix R.

Now we need to put a similar matrix, named Q, in the brain of our agent; it represents the memory of what the agent has learned through many experiences. Each row of matrix Q represents the current state of the agent, and each column of matrix Q represents the action of going to the next state.

In the beginning, we say that the agent knows nothing, so we set Q to a zero matrix. In this example, for simplicity of explanation, we assume the number of states is known (six). In a more general case, you can start with a zero matrix of a single cell. It is a simple task to add more columns and rows to matrix Q if a new state is found.
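Both ways of starting Q can be sketched as follows: a six-by-six zero matrix when the number of states is known, or a single zero cell that grows by one row and one column whenever a new state is found. The helper names `zero_matrix` and `add_state` are hypothetical.

```python
def zero_matrix(n):
    """Return an n-by-n matrix of zeros: the agent knows nothing yet."""
    return [[0] * n for _ in range(n)]


def add_state(Q):
    """Return a copy of Q with one extra row and one extra column of zeros."""
    grown = [row + [0] for row in Q]
    grown.append([0] * (len(Q) + 1))
    return grown


Q = zero_matrix(6)          # number of states known in advance

Q_general = zero_matrix(1)  # general case: start from a single cell
for _ in range(5):          # five more states are discovered
    Q_general = add_state(Q_general)

print(Q_general == Q)       # True
```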

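The transition rule of this Q-learning is a very simple formula:

$$Q(\text{state}, \text{action}) = R(\text{state}, \text{action}) + \gamma \cdot \max\big[Q(\text{next state}, \text{all actions})\big]$$

The formula above means that the entry value in matrix Q (where the row represents the state and the column represents the action) is equal to the corresponding entry of matrix R plus the product of a learning parameter $\gamma$ and the maximum value of Q over all actions in the next state.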
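As a worked illustration of the rule just described, the sketch below applies it twice, starting from an all-zero Q. The value 0.8 for the learning parameter is an assumption chosen for the example (the text does not fix one), and the names `GAMMA` and `update_q` are hypothetical. The maximum is taken over the actions that actually exist in the next state.

```python
GAMMA = 0.8  # assumed value of the learning parameter, for illustration only

# Reward matrix R for states A..F (indices 0..5); None means "no door".
R = [
    [None, None, None, None, 0, None],  # A
    [None, None, None, 0, None, 100],   # B
    [None, None, None, 0, None, None],  # C
    [None, 0, 0, None, 0, None],        # D
    [0, None, None, 0, None, 100],      # E
    [None, 0, None, None, 0, 100],      # F
]


def update_q(Q, R, state, action, gamma=GAMMA):
    """Set Q[state][action] = R[state][action] + gamma * max Q[next state][...]."""
    next_state = action  # the action "go to X" puts the agent in state X
    valid = [a for a, r in enumerate(R[next_state]) if r is not None]
    Q[state][action] = R[state][action] + gamma * max(Q[next_state][a] for a in valid)
    return Q[state][action]


Q = [[0] * 6 for _ in range(6)]
B, D, F = 1, 3, 5
print(update_q(Q, R, B, F))  # 100.0 = 100 + 0.8 * 0
print(update_q(Q, R, D, B))  # 80.0  = 0 + 0.8 * 100
```

The second update shows how the reward at the goal propagates backwards: room D earns a positive value only because room B has already learned that its door to F is worth 100.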

Our virtual agent will learn through experience without a teacher (this is called unsupervised learning). The agent explores from state to state until it reaches the goal. We call each exploration an episode. In one episode the agent moves from the initial state to the goal state. Once the agent arrives at the goal state, the program goes to the next episode. The algorithm below has been proved to converge (see the references for the proof).
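The episode loop described above might be sketched as follows. Several choices here are assumptions, not taken from the text: each episode starts in a random state, the agent picks uniformly among the available doors, the learning parameter is 0.8, and 500 episodes are run. An episode that happens to start at the goal ends immediately.

```python
import random

STATES = ["A", "B", "C", "D", "E", "F"]
GOAL = STATES.index("F")
GAMMA = 0.8  # assumed learning parameter

# Reward matrix R; None means there is no door between the two states.
R = [
    [None, None, None, None, 0, None],  # A
    [None, None, None, 0, None, 100],   # B
    [None, None, None, 0, None, None],  # C
    [None, 0, 0, None, 0, None],        # D
    [0, None, None, 0, None, 100],      # E
    [None, 0, None, None, 0, 100],      # F
]


def valid_actions(state):
    """Indices of the states reachable from `state` in one action."""
    return [a for a, r in enumerate(R[state]) if r is not None]


def run_episode(Q, rng):
    """Explore from a random initial state until the goal state is reached."""
    state = rng.randrange(len(STATES))
    while state != GOAL:
        action = rng.choice(valid_actions(state))  # random exploration
        best_next = max(Q[action][a] for a in valid_actions(action))
        Q[state][action] = R[state][action] + GAMMA * best_next
        state = action  # the action "go to X" moves the agent to X


def train(episodes=500, seed=0):
    """Run many episodes, starting from a Q matrix of zeros."""
    rng = random.Random(seed)
    Q = [[0.0] * len(STATES) for _ in STATES]
    for _ in range(episodes):
        run_episode(Q, rng)
    return Q


def greedy_path(Q, start):
    """Follow the highest Q value from `start` until the goal (or a step cap)."""
    state = STATES.index(start)
    path = [STATES[state]]
    while state != GOAL and len(path) <= len(STATES):
        state = max(valid_actions(state), key=lambda a: Q[state][a])
        path.append(STATES[state])
    return path


Q = train()
print(greedy_path(Q, "C"))
```

After training, the agent no longer needs to explore: from room C it can follow the largest Q value in each row and leave the building in three moves.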

