Commit a77493a6 authored by Daniel Lukats

nicified future work. (nicify is a word now and it means to make something nice)
parent 47356079
Based on the work conducted for this thesis, three possible avenues for future work present themselves. First of all, we
could work on improved evaluation methods that allow for easier comparisons of algorithms across publications. This
includes increasing the number of training runs executed and potentially selecting a diverse subset of games, so robust
trendlines may be created and shared. Furthermore, we may examine advanced regression methods and evaluate whether they
can be used to create more suitable baselines. Lastly, this also includes devising a proper means of quantifying the
noise of a reward graph, for example by displaying the standard deviation or confidence intervals around the trendline.
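As a rough illustration of the kind of evaluation we have in mind, the following sketch computes a trendline and a
standard deviation band across several training runs. It assumes the episode rewards of each run are available as the
rows of a NumPy array; the moving average merely stands in for a more advanced regression method, and all names,
window sizes and the randomly generated data are illustrative only.
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

def trendline_with_band(runs: np.ndarray, window: int = 50):
    """Compute a smoothed trendline and deviation band across training runs.

    runs: array of shape (num_runs, num_episodes) holding episode rewards.
    """
    mean = runs.mean(axis=0)
    std = runs.std(axis=0)
    # simple moving average as a stand-in for a more advanced regression method
    kernel = np.ones(window) / window
    smooth_mean = np.convolve(mean, kernel, mode="valid")
    smooth_std = np.convolve(std, kernel, mode="valid")
    return smooth_mean, smooth_std

# hypothetical data: 5 training runs with 10,000 recorded episode rewards each
runs = np.random.randn(5, 10000).cumsum(axis=1)
mean, std = trendline_with_band(runs)
episodes = np.arange(mean.shape[0])

plt.plot(episodes, mean, label="trendline (mean of 5 runs)")
plt.fill_between(episodes, mean - std, mean + std, alpha=0.3, label="+/- 1 std")
plt.xlabel("episode")
plt.ylabel("reward")
plt.legend()
plt.show()
\end{verbatim}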
Second of all, the code written for this thesis can be adjusted to support other benchmarking environments. Most
importantly, by supporting robotics and control tasks, we could reproduce the findings of \citeA{ilyas2018} and
\citeA{engstrom2019}. Moreover, we could also reproduce the findings of modified Proximal Policy Optimization algorithms
such as Truly Proximal Policy Optimization \cite{wang2019}. Researchers have suggested a number of further improvements
to Proximal Policy Optimization, so a survey comparing them might yield insight into which modifications are beneficial
and which are less promising.
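Supporting robotics and control tasks mainly requires a policy over continuous actions, whereas an agent for the Atari
2600 games typically uses a categorical policy over discrete actions. The following PyTorch sketch shows one way a
diagonal Gaussian policy head could replace the categorical output; it is a sketch under this assumption, not part of
the thesis code, and all names and layer sizes are illustrative.
\begin{verbatim}
import torch
import torch.nn as nn

class GaussianPolicyHead(nn.Module):
    """Diagonal Gaussian policy head for continuous control tasks."""

    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, action_dim)
        # state-independent log standard deviation, as commonly used with PPO
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, features: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean(features)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)

# usage: sample an action and compute its log probability for the PPO ratio
head = GaussianPolicyHead(hidden_dim=64, action_dim=6)
features = torch.randn(1, 64)  # output of a shared feature extractor
dist = head(features)
action = dist.sample()
log_prob = dist.log_prob(action).sum(dim=-1)
\end{verbatim}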
Finally, Proximal Policy Optimization can be combined with further learning techniques. In particular, curiosity-driven
learning as detailed by \citeA{pathak2017} is an interesting technique to combine with PPO. Instead of having the agent
explore by simply picking random actions, the agent learns to predict the transition to the successor state. It then
receives an intrinsic reward proportional to its prediction error, which incentivizes it to visit novel states more
often. Combining this intrinsic reward with the extrinsic reward returned by the environment is an interesting topic
for further research.
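A minimal sketch of such an intrinsic reward is given below. It assumes a simple learned forward model over state
vectors; the full architecture of \citeA{pathak2017} additionally learns a feature encoding and an inverse model. All
names and sizes are illustrative.
\begin{verbatim}
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next state (or state encoding) from a state and an action."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(model, state, action, next_state, scale=0.1):
    """Reward the agent in proportion to the forward model's prediction error."""
    with torch.no_grad():
        predicted = model(state, action)
        error = (predicted - next_state).pow(2).mean(dim=-1)
    return scale * error

# the forward model itself is trained to minimize this prediction error,
# while the agent maximizes the sum of extrinsic and intrinsic reward
\end{verbatim}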