December 6, 2022
This is an essay on the many missteps I took in the Real Robot Challenge, documented here to encourage myself not to repeat the same mistakes in the future.
I will preface this by stating that everything I write here is in hindsight, and that although some of the decisions taken at each step of the challenge may seem illogical and stupid, at the time they seemed to be the best way to go. In future projects I will try to refrain from making similar missteps, but the allure of the wrong routes (which often seem quick and easy, only to turn out to be a misleading waste of time) may sometimes be too much to resist. In some serendipitous cases a wrong route could even lead to a novel outcome. So take these comments with a grain of salt, but try to identify when you are stuck in a methodological “local minimum” in the future.
The first major mistake was that I tried to apply a custom implementation of the PDDM algorithm to the problem. In hindsight I can see that this doesn’t make sense; PDDM, although an off-policy algorithm, was only ever used for online training, never offline. So there was only a slim chance that this approach would work, especially with a custom implementation written from scratch rather than an existing PDDM implementation. These kinds of algorithms tend to rely on some weird tricks that are crucial to making them work properly, so unless I am ready to do days (weeks, months) of debugging, I should just start off with some rigorously tested OSS that is readily available. The first thing you do should not be to try to reinvent the wheel.
And try as many variations of OSS as possible: first optimize the choice of the large overarching framework, rather than sticking with a single algorithm and tuning its hyperparameters over and over. I should say that I did in fact try different algorithms: I compared the offline RL algorithms in d3rlpy and found that none performed better than IQL. Still, I could’ve looked at the bigger picture and checked out behavior cloning (BC) algorithms as well, since that was the winning approach. My reasoning for choosing offline RL over BC was pretty clear (at that time): BC does not consider the reward, while RL tries to optimize it. Thus, BC would not work in the mixed setting. Looking back, I think this opinion was based on the guess that it would be hard to recover a single policy from the dataset. I had completely convinced myself that the dataset was composed of trial runs collected by policies from previous years’ challenges. This assumption was just plain wrong: the dataset was collected by a single policy (for “expert”) or two policies (for “mixed”) trained in simulation in IsaacGym. I guess what I can say from this is: think through all the assumptions used in deciding the approach, and if the approach doesn’t work, challenge them rather than keep barking up the wrong tree (i.e. keep trying to optimize the wrong approach). Again, try to identify local minima.
Especially at the beginning, leave as much to the algorithm as is reasonable and try to keep it vanilla. Here the mistake was including multiple time steps in the state and selecting a subset of the observations to be used within the state. In hindsight the “subset of observations” thing seems pointless: the Q function can probably learn to ignore particular modes of observation if they are “useless” for learning the correct Q function. The mistake I made was that I thought I was smarter and decided which observations to use myself, and probably discarded ones that actually contained useful information. It was a very bad idea to offhandedly test a few observations that seemed to work well and then more or less stick to them for the duration of the challenge. So, at the very least, keep an implementation of the algorithm that is as vanilla as possible as a baseline, and put some “blind trust” in well-tested implementations.
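As a rough sketch of what such a vanilla baseline state could look like, here is one way to build it, assuming each observation arrives as a dictionary of numeric lists. The field names and the `flatten_observation` helper are made up for illustration and are not the challenge's actual observation format.

```python
def flatten_observation(obs):
    """Concatenate every observation field into one flat state vector.

    Fields are visited in sorted key order so the layout is stable
    across calls. Nothing is dropped and no past time steps are stacked.
    """
    state = []
    for key in sorted(obs):
        state.extend(float(v) for v in obs[key])
    return state


# Hypothetical single-time-step observation (field names are assumptions).
obs = {
    "joint_position": [0.1, 0.2, 0.3],
    "joint_velocity": [0.0, 0.0, 0.1],
    "object_position": [0.05, 0.0, 0.03],
}

vanilla_state = flatten_observation(obs)
```

Any hand-picked subset or multi-step stacking can then be compared against this baseline instead of replacing it.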
One of the few things that did go well, possibly, was that I created an automatic submission system and an online logging system to get an accurate measure of the performance of different policies. Although hindered by the inherent stochasticity of the real robot, it gave me a useful measure of the actual policy performance. Maybe in hindsight, it would have been good to also plot something like a box plot of the results and not just the mean, since one or two failed runs out of 9 could drastically lower the average performance. That way I could split the evaluation between “high performance” and “reliable performance”.
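To make the point about the mean concrete, here is a minimal sketch that computes the numbers a box plot would draw, using only Python's standard library. The 9 returns are invented for illustration (they are not results from the challenge), and `summarize_runs` is a hypothetical helper name.

```python
import statistics


def summarize_runs(returns):
    """Return the mean plus the five-number summary behind a box plot."""
    ordered = sorted(returns)
    # Quartiles computed with the inclusive method, which treats the
    # data as the whole population of runs rather than a sample.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# Made-up returns for 9 evaluation runs, two of which failed outright.
returns = [950.0, 940.0, 960.0, 955.0, 945.0, 950.0, 100.0, 965.0, 50.0]
summary = summarize_runs(returns)

for name, value in summary.items():
    print(f"{name:>6}: {value:.1f}")
```

With these made-up numbers the two failed runs drag the mean below the first quartile, while the median still reflects the typical run. The median and upper quartile speak to “high performance”, and the minimum and lower quartile speak to “reliable performance”.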