Profiling graph for 5 episodes on Breakout-v4 with 8 environments. Environment stepping must be parallelized. Most of the time spent in to() goes to transferring the model to the GPU. Setting up the environments also takes some time, but that can be optimized trivially as part of optimizing stepping.
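For reference, a profile like this could be collected with the standard-library cProfile module. This is only a minimal sketch: run_episodes is a hypothetical placeholder for the actual training loop, and the tool used to produce the graph above may differ.

```python
import cProfile
import io
import pstats


def run_episodes(n_episodes):
    """Hypothetical placeholder for the training loop being profiled."""
    total = 0
    for _ in range(n_episodes):
        total += sum(i * i for i in range(10_000))
    return total


def profile_run(n_episodes=5, top=10):
    """Profile run_episodes and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    run_episodes(n_episodes)
    profiler.disable()

    # Sort by cumulative time so the expensive call chains appear first.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()
```

Calling print(profile_run()) lists the functions with the highest cumulative time, which is the same information the graph visualizes.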
Parallelizing with multiprocessing led to a significant speedup: almost 140 time steps per second instead of 70 on an i7 6700K at 4 GHz. The speedup on the Tesla VM is a lot smaller: slightly over 80 time steps per second instead of about 50.
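The general pattern behind this kind of parallelization can be sketched as one worker process per environment, each driven over a pipe. This is an illustration under assumptions, not the code that was measured: DummyEnv, worker and step_all are hypothetical names, and DummyEnv stands in for the real emulator.

```python
import multiprocessing as mp


class DummyEnv:
    """Hypothetical stand-in for an Atari environment, not the real emulator."""

    def __init__(self, seed):
        self.state = seed

    def step(self, action):
        self.state += action
        return self.state, float(action), self.state >= 10


def worker(conn, seed):
    """Own one environment and serve step requests until told to close."""
    env = DummyEnv(seed)
    while True:
        cmd, arg = conn.recv()
        if cmd == "step":
            conn.send(env.step(arg))
        elif cmd == "close":
            conn.close()
            break


def step_all(n_envs=8, n_steps=3):
    """Step all environments in parallel and return the last results."""
    # Assumption: prefer "fork" where it exists so the sketch runs without a
    # __main__ guard; elsewhere (e.g. Windows) the default start method is
    # used and the call must sit under `if __name__ == "__main__":`.
    methods = mp.get_all_start_methods()
    ctx = mp.get_context("fork" if "fork" in methods else None)

    parents, procs = [], []
    for i in range(n_envs):
        parent, child = ctx.Pipe()
        proc = ctx.Process(target=worker, args=(child, i), daemon=True)
        proc.start()
        child.close()  # the parent only needs its own end of the pipe
        parents.append(parent)
        procs.append(proc)

    results = []
    for _ in range(n_steps):
        # Send every action first so all workers step concurrently,
        # then collect the results in order.
        for conn in parents:
            conn.send(("step", 1))
        results = [conn.recv() for conn in parents]

    for conn in parents:
        conn.send(("close", None))
    for proc in procs:
        proc.join()
    return results
```

Sending all actions before reading any result is what lets the workers overlap; reading each result right after sending its action would serialize the steps again.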
Most time is now spent waiting on inter-process communication, mainly recv. It is unclear whether those calls return immediately or block.
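One way to check the blocking question, assuming the workers communicate through multiprocessing.Pipe connections (the notes do not say which primitive is used): Connection.recv() blocks until a message is available, while Connection.poll() only reports whether data is ready. The sketch below measures how long recv() waits when the sender is delayed; measure_recv_wait is a hypothetical helper, and a timer thread stands in for a slow worker.

```python
import multiprocessing as mp
import threading
import time


def measure_recv_wait(delay=0.2):
    """Return (data_ready_before_send, message, seconds_spent_in_recv)."""
    receiver, sender = mp.Pipe()

    # poll(0) never blocks: it only reports whether data is ready to read.
    ready_before = receiver.poll(0)

    # Send from a helper thread after `delay` seconds, imitating a slow worker.
    timer = threading.Timer(delay, sender.send, args=("obs",))
    timer.start()

    start = time.perf_counter()
    message = receiver.recv()  # blocks until the message arrives
    waited = time.perf_counter() - start

    timer.join()
    receiver.close()
    sender.close()
    return ready_before, message, waited
```

If recv shows up as the dominant entry in the profile, the time attributed to it would then largely be time the main process spends idle waiting for the slowest worker, rather than communication overhead as such.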
Environment stepping most likely cannot be further optimized, as most of the time is spent in the emulator (36% + 15%), numpy.max() (18%) and cv2.resize() (15%). Only very minor speedups might be achievable.
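As an example of the kind of minor speedup that might remain, and assuming numpy.max() is used to take the pixel-wise maximum of two consecutive frames (the usual Atari preprocessing, although the notes do not confirm this), the intermediate stacked array can be avoided with numpy.maximum and a preallocated output buffer. Both function names below are hypothetical, and whether this is measurably faster would have to be profiled.

```python
import numpy as np


def max_pool_stacked(frame_a, frame_b):
    """Pixel-wise maximum via numpy.max over a freshly stacked array."""
    return np.max(np.stack([frame_a, frame_b]), axis=0)


def max_pool_inplace(frame_a, frame_b, out):
    """Same result, written into a reusable buffer without stacking."""
    np.maximum(frame_a, frame_b, out=out)
    return out
```

The buffer passed as out can be allocated once per environment, for example with np.empty_like(frame), and reused on every step.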