Why You (Probably) Shouldn’t Use Reinforcement Learning
Original article: https://towardsdatascience.com/why-you-shouldnt-use-reinforcement-learning-163bae193da8



There is a lot of hype around this technology. And for good reason… it’s quite possibly one of the most important machine learning advancements toward enabling general AI. But beyond general interest, you may eventually come to the question: “is it right for your application?”

I am currently working on a team for vision-enabled robotics and, as a past researcher in RL, I was asked to answer this question for my team. Below, I’ve outlined some of the reasons I think you may not want to use reinforcement learning in your application, or at least think twice before walking down that path. Let’s dive in!

Extremely noisy

Below are two learning plots from a game that has a max score of 500. So which learning algorithm was better? Trick question. They were exactly the same; the second run is just a rerun of the first. The only difference between the one training session that totally rocked it and learned a perfect policy, and the other, which failed miserably, was the random seed.



Training curves for DQN on CartPole. Image by Author

  • Small changes in random initialization can greatly affect training performance, so reproducibility of experimental results is challenging.
  • The noise makes it very hard to compare algorithms, hyperparameter settings, etc., because you don’t know whether improved performance is due to the change you made or is just a random artifact.
  • You need to run 20+ training sessions under the exact same conditions to get consistent, robust results. This makes iterating on your algorithm very challenging (see the note below about how long these experiments can take, and the seed-sweep sketch right after this list!)
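As a concrete, hypothetical illustration of the seed problem, here is a minimal sketch of the kind of experiment behind the plots above: the exact same DQN configuration re-run on CartPole with only the random seed changed. It assumes the stable-baselines3 and gymnasium packages are installed; the step budget and seed list are arbitrary choices of mine, not the author’s.

```python
# Minimal sketch (not the author's code): re-run an identical DQN setup on
# CartPole-v1, changing only the random seed, then compare final scores.
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

for seed in [0, 1, 2, 3, 4]:
    model = DQN("MlpPolicy", "CartPole-v1", seed=seed, verbose=0)
    model.learn(total_timesteps=50_000)  # identical training budget per run
    mean_ret, std_ret = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    print(f"seed={seed}: mean episode return {mean_ret:.1f} +/- {std_ret:.1f}")
```

On a setup like this, it is entirely normal for one seed to hit the max score of 500 and another to plateau far below it.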


A large number of hyperparameters

One of the most successful algorithms on the market right now is Soft Actor-Critic (SAC), which has nearly 20 hyperparameters to tune. Check for yourself! But that’s not the end of it…

  • In deep RL, you have all the normal deep learning parameters related to network architecture: number of layers, nodes per layer, activation function, max pool, dropout, batch normalization, learning rate, etc.
  • Additionally, you have 10+ hyperparameters specific to RL: buffer size, entropy coefficient, gamma, action noise, etc.
  • Additionally, you have “hyperparameters” in the form of reward shaping to get the agent to act as you want it to.
  • Tuning even one of these can be very difficult! See the notes about extreme noise and long training times… now imagine tuning 30+.
  • As with most hyperparameter tuning, there’s not always an intuitive setting for each of these or a foolproof way to most efficiently find the best hyperparameters. You’re really just shooting in the dark until something seems to work. The sketch after this list gives a sense of how many knobs a single off-the-shelf implementation exposes.
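To make that count tangible, here is a minimal sketch (mine, not from the article) that spells out many of the knobs on one popular off-the-shelf SAC implementation, stable-baselines3. The values shown mirror that library’s documented defaults at the time of writing, so treat them as an inventory of decisions rather than recommended settings; the environment choice is an arbitrary placeholder.

```python
# Minimal sketch (not the author's code): an explicit stable-baselines3 SAC
# constructor. Every keyword below is a tunable decision, and this still
# excludes reward shaping and most network-architecture choices.
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",                 # policy/network family
    "Pendulum-v1",               # any continuous-action Gymnasium env
    learning_rate=3e-4,
    buffer_size=1_000_000,       # replay buffer size
    learning_starts=100,         # random steps collected before updates begin
    batch_size=256,
    tau=0.005,                   # target-network soft-update coefficient
    gamma=0.99,                  # discount factor
    train_freq=1,
    gradient_steps=1,
    action_noise=None,
    ent_coef="auto",             # entropy coefficient (auto-tuned here)
    target_update_interval=1,
    target_entropy="auto",
    use_sde=False,               # state-dependent exploration
    sde_sample_freq=-1,
    policy_kwargs=dict(net_arch=[256, 256]),  # layers and nodes per layer
    seed=0,
)
# ...and none of this covers the reward-shaping "hyperparameters" mentioned above.
```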


Still in research and development

As RL is still in its budding phases, the research community is still working out the kinks in how advancements are validated and shared. This causes headaches for those of us who want to use the findings and reproduce the results.

  • Papers are ambiguous in implementation details. You can’t always find the code, and it’s not always clear how to turn some of the complex loss functions into code. Papers also seem to leave out the little hand-wavy tweaks they used to get that superior performance.
  • Once some code does get out there on the interwebs, the implementations differ slightly from one another for the reasons listed above. This makes it hard to compare the results you’re getting to someone else’s online. Is my comparatively bad performance because I introduced a bug, or because they used a trick I don’t know about?


Hard to debug

  • Recent methods use the kitchen sink of techniques to get cutting-edge results. This makes it really hard to keep the code clean, which subsequently makes it hard to follow others’ code, or even your own!
  • On a related note, because there are so many moving parts, it’s really easy to introduce bugs and really hard to find them. RL often has multiple networks learning at once, and there’s a lot of randomness in the learning process, so things may work one run and not the next. Was it because of a bug you introduced or because of a fluke in the random seed? Hard to say without running many more experiments. Which takes… TIME.


Extremely sample inefficient

Model-free learning means we don’t try to build or learn a model of the environment, so the only way we learn a policy is by interacting directly with the environment. On-policy means that we can only learn/improve our policy with samples taken from acting with our current policy, i.e., we have to throw away all of these samples and collect new ones as soon as we run a single gradient update. PPO, for example, is a state-of-the-art model-free, on-policy algorithm. All of this means that we have to interact with the environment a lot (think millions of steps) before learning a policy.
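For a sense of scale, here is a minimal, hypothetical sketch (not the author’s code) of what such a training budget looks like with an off-the-shelf PPO implementation from stable-baselines3; the environment and step count are placeholders.

```python
# Minimal sketch (not the author's code): an on-policy, model-free algorithm
# burning through millions of environment interactions.
from stable_baselines3 import PPO

model = PPO("MlpPolicy", "Pendulum-v1", n_steps=2048, verbose=1)
# Each update consumes a freshly collected rollout of n_steps transitions;
# once the policy is updated, those samples are discarded and new ones are
# gathered. Budgets like the one below are why wall-clock times get so long.
model.learn(total_timesteps=2_000_000)
```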

This may be passable if we have high-level features in a relatively low-fidelity simulator. For example,



Image of Humanoid Environment by https://gym.openai.com/

Humanoid takes 5 hours to learn how to walk (2 million steps).

But as soon as we move to low-level features, like image space, our state space grows a lot, which means our network must grow a lot too, e.g., we must use CNNs.


Atari Phoenix. Image by mybrainongames.com
Atari games such as Phoenix take roughly 12 hours (40–200 million steps).
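To show what that jump to image observations looks like in code, here is a minimal sketch (mine, not the author’s) of switching to a convolutional policy on Atari frames with stable-baselines3. The exact environment id depends on which Atari package and ROMs you have installed, and the step budget is only indicative.

```python
# Minimal sketch (not the author's code): pixel observations force a CNN policy
# and a much larger interaction budget than low-dimensional state vectors.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Standard Atari preprocessing: 8 parallel envs, 4 stacked grayscale frames.
env = VecFrameStack(make_atari_env("PhoenixNoFrameskip-v4", n_envs=8, seed=0), n_stack=4)

model = PPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=40_000_000)  # tens of millions of frames
```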


And things get even worse when we start introducing 3D high-fidelity simulators like CARLA.



CARLA Driving Simulator. Image by Unreal Engine

Training a car to drive in CARLA takes ~3–5 days (2 million steps) with a GPU.

And it gets even worse if the policy is notably complex.
In 2018, OpenAI trained an agent that beat the world champions at DOTA 2. How long did the agent take to train, you ask? 10 months.


What if we wanted to train in the real world instead of a simulator? Here, we are bound by real-time steps (whereas before, we could simulate steps faster than real time). This could take weeks or, even worse, just be entirely intractable. For more on this, look up “the deadly triad of RL”.
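To make “weeks” concrete, here is a rough back-of-the-envelope calculation (my numbers, not the author’s): assume a 20 Hz control loop and ignore resets, hardware wear, and human supervision time.

```python
# Back-of-the-envelope sketch (assumed 20 Hz control rate; not from the article).
steps_humanoid_scale = 2_000_000
steps_atari_scale = 40_000_000
hz = 20

print(steps_humanoid_scale / hz / 3600, "hours of nonstop real-world interaction")  # ~27.8 hours
print(steps_atari_scale / hz / 86400, "days at Atari-scale step budgets")           # ~23 days
```

And that is the optimistic end of a 40–200 million step budget, before accounting for any of the practical overhead of running a physical system.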


Sim-to-real gap

What if we wanted to train in a simulator and then deploy in the real world? This is the case for most robotics applications. However, even if an agent learns to perform well in a simulator, it doesn’t necessarily mean that it will transfer to real-world applications. It depends on how good the simulator is. Ideally, we’d make the simulator as close to real life as possible. But see the previous section for the problem with high-fidelity simulators.


Unpredictability & Inexplainability

  • Even a well-trained RL agent can be unpredictable in the wild. We may try to punish disastrous behaviors severely, but we still don’t have a guarantee that the agent won’t choose that action, since, in the end, we are only optimizing the expectation of total reward (the objective is written out after this list).
  • Explainability: this is more of a problem with DL in general, but in reinforcement learning the issue takes on a new importance, since the networks are often choosing how to move physical machinery that could damage people or property (as in the case of self-driving or robotics). The RL agent may make a disastrous control decision and we have no idea exactly why, which in turn means we don’t know how to prevent it in the future.
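For reference (my addition, not the author’s notation), the quantity that standard RL algorithms maximize is just an expected discounted return:

$$ J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right] $$

An expectation can be large even if a rare trajectory is catastrophic, which is why a heavy penalty on a disastrous action discourages it but never strictly rules it out.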


Conclusion

Well, I don’t know if that was depressing or a buzzkill for you to read. I kind of meant it to be a reality check to cut through the hype, so I did go pretty hard. But I should also qualify all of these points with the fact that these issues are the very reason this is such a hot research area, and people are actively working on many, if not all, of these pain points. That makes me optimistic about the future of RL, but recognizing that these are still open problems is what makes me a realistic optimist.

Believe it or not, I wouldn’t totally discount RL for industrial applications… it is really awesome when it works. I’d just make sure you know what you’re getting yourself into so you don’t overpromise and underestimate the timeline. 😃