What is reinforcement learning?

2019-11-06 06:17:51

Source: reposted

Contributed by: a web user

Reinforcement learning:

Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called approximate dynamic programming. The problem has also been studied in the theory of optimal control, though most of that work is concerned with the existence and characterization of optimal solutions rather than with learning or approximation. In economics and game theory, reinforcement learning may be used to explain how equilibrium can arise under bounded rationality.

In machine learning, the environment is typically formulated as a Markov decision process (MDP), because many reinforcement learning algorithms for this setting use dynamic programming techniques.[1] The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not require an explicit model of the MDP, and they target large MDPs where exact methods become infeasible.

Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).[2] The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.
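The exploration vs. exploitation trade-off can be illustrated with a small multi-armed bandit. The sketch below is not taken from the original article: it assumes an epsilon-greedy agent (one common strategy among several), Gaussian rewards with unit variance, and a hypothetical function name `run_epsilon_greedy` with illustrative parameter values.

```python
import random


def run_epsilon_greedy(true_means, epsilon=0.1, steps=5000, seed=0):
    """Play a Gaussian multi-armed bandit with an epsilon-greedy agent.

    true_means is hidden from the agent; it only observes sampled rewards.
    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # running average reward per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a uniformly random arm.
            arm = rng.randrange(n_arms)
        else:
            # Exploitation: pick the arm with the best current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental mean update: no need to store past rewards.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


if __name__ == "__main__":
    estimates, counts, total = run_epsilon_greedy([0.0, 1.0, 3.0])
    print("estimated means:", [round(e, 2) for e in estimates])
    print("pull counts:", counts)
    print("average reward:", round(total / sum(counts), 3))
```

With `epsilon=0` the agent can lock onto whichever arm happened to look good first; with `epsilon=1` it never uses what it has learned. Intermediate values trade a small amount of reward for continued information about the other arms.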

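Reinforcement learning, also called reward learning or evaluative learning, is an important machine learning method with many applications in areas such as intelligent control, robotics, and analysis and prediction. Traditional taxonomies of machine learning did not mention reinforcement learning. In connectionist learning, however, learning algorithms are divided into three types: unsupervised learning, supervised learning, and reinforcement learning.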

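Reinforcement learning draws its inspiration from behaviorist theory in psychology: how an organism, stimulated by rewards or punishments from its environment, gradually forms expectations about those stimuli and develops habitual behavior that yields the greatest benefit. The approach is very general, so it is studied in many other fields, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms.

Reinforcement learning is also a product of many intersecting disciplines. At its core it addresses the problem of decision making, that is, learning to make decisions automatically. As a sequential decision making problem, it requires choosing a series of actions, and the best outcome is the one that yields the largest return once those actions have been carried out. With no label telling the algorithm what it should do, it first tries some actions, observes the result, and uses a judgment of whether that result was good or bad as feedback on the earlier actions. This feedback is used to adjust those actions, and through repeated adjustment the algorithm learns which action to choose in which situation to obtain the best result.

Reinforcement learning differs from supervised learning in several ways. As noted earlier, supervised learning has a label that tells the algorithm which output corresponds to which input. Reinforcement learning has no label telling it what to do in a given situation. It has only a reward signal, fed back after a series of actions, that indicates whether the chosen actions were good or bad. The feedback is also delayed: it may take many steps before it becomes clear whether an earlier choice was good or bad, whereas in supervised learning a poor choice is reported to the algorithm immediately. In addition, the inputs a reinforcement learning algorithm sees keep changing, unlike supervised learning, where inputs are assumed to be independent and identically distributed. Every action the algorithm takes affects the input to its next decision.

To summarize the contrast with standard supervised learning: correct input/output pairs are never presented, and sub-optimal actions are not explicitly corrected. Reinforcement learning focuses more on online performance, which requires finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).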
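The ideas of delayed reward and of each action changing the next input can be sketched with tabular Q-learning on a toy problem. Nothing below comes from the original article: the five-state chain environment, the `step`, `train`, and `greedy_policy` functions, and all hyperparameter values are hypothetical choices made only for illustration, and Q-learning is just one of many reinforcement learning algorithms.

```python
import random

N_STATES = 5       # states 0..4; state 4 is the terminal goal (assumed toy setup)
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical environment: the only reward is 1.0 on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0   # delayed: every earlier step earns nothing
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy action selection."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, or when the estimates are tied.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Bootstrapped target: the reward propagates back one step at a time.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state   # the action taken determines the next input
    return q


def greedy_policy(q):
    """Best known action for each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    q_table = train()
    print("greedy policy:", greedy_policy(q_table))
    print("value of moving right from state 0:", round(q_table[0][1], 3))
```

No label ever tells the agent that "right" is correct in state 0. The value of that action grows only because the single reward at the goal is propagated backward through the discounted targets, so after training the greedy policy should choose action 1 (move right) in every non-terminal state.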

http://blog.csdn.net/zz_1215/article/details/44138715

http://blog.csdn.net/dark_scope/article/details/8252969

http://www.cnblogs.com/jinxulin/p/3517377.html