Performance Review

added Feature label

Rollout:

Currently, all items collected during trajectory recording are stored as tensors.
Check whether appending to lists and converting to a tensor once at the end is faster.

The influence seems negligible compared to the actual calculation steps, Python for-loops and, most importantly, the backward pass.

added In Progress label

Advantage Calculation:

https://stackoverflow.com/questions/47970683/vectorize-a-numpy-discount-calculation (can also do Returns)
https://github.com/rlworkgroup/garage/blob/1ea04b8b90d2da0d7da10e8604a144018b61b81c/src/garage/torch/algos/_utils.py#L56 (can do multiple dimensions)
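As a rough illustration of the discount calculation the links above deal with, here is a minimal NumPy sketch. It is not the code from either link: the function name `discount_cumsum` and the time-major layout `(T, n_envs)` are assumptions made for this example. It still loops over time steps, but each step is vectorized across all trailing dimensions, so it covers both the 1-D and the multi-dimensional case.

```python
import numpy as np


def discount_cumsum(x, gamma):
    """Discounted cumulative sum along axis 0: out[t] = x[t] + gamma * out[t + 1].

    `x` is assumed to be time-major, e.g. shape (T,) or (T, n_envs).
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    # Running discounted sum for every trailing index (e.g. every environment).
    running = np.zeros(x.shape[1:], dtype=np.float64)
    # Walk backwards through time; each step operates on all environments at once.
    for t in range(x.shape[0] - 1, -1, -1):
        running = x[t] + gamma * running
        out[t] = running
    return out


# Rewards of 1 for three steps with gamma = 0.5.
returns = discount_cumsum([1.0, 1.0, 1.0], gamma=0.5)
```

The same function can be applied to TD residuals with `gamma * lam` as the discount to obtain GAE-style advantages, assuming episode boundaries are handled separately.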

marked this issue as related to #1 (closed)

cuDNN is already being used: torch.backends.cudnn.version() returns 7603.

removed In Progress label

Half precision might be worth a look. Naively calling .half() on pretty much every tensor didn't work, though. Needs more research.

added Point of Interest label

removed Feature label

Check whether rollout generation in the runner can be sped up by using subprocesses or MPI.

added In Progress label

marked this issue as related to #62 (closed)

Profiling graph for 5 episodes on Breakout-v4 with 8 environments. Environment stepping must be parallelized. Most of the time spent in `to` goes to transferring the model to the GPU. The setup of the environments takes some time, too. However, this can be optimized trivially when optimizing stepping.

mentioned in commit 94f97209

Parallelizing with multiprocessing led to a significant speedup: almost 140 time steps per second instead of 70 on an i7 6700K at 4 GHz. The speedup on the Tesla VM is a lot smaller: slightly over 80 time steps per second instead of about 50.

Most time is now spent waiting on inter-process communication, mainly recv. Not sure whether those calls return immediately or block.

Further analysis showed 34% of runtime is spent waiting for data. A select-like behavior might help.

select brings the time spent polling down to 33% when using shared memory,
and down to 30% without shared memory. For some reason, recv time went up by 1%. Possibly due to memory allocation and re-ordering the returns list?
```
i = self.pipes.index(pipe)
rets[i] = pipe.recv()
```
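For reference, a select-like receive loop can be sketched with `multiprocessing.connection.wait` from the standard library. This is only an illustration under assumptions, not the runner's actual code: `worker`, `gather` and `demo` are hypothetical names, and the workers are threads rather than processes so that the sketch stays self-contained. It also looks up the result slot through a dict built once, instead of calling `self.pipes.index(pipe)` for every message.

```python
import threading
import time
from multiprocessing import Pipe
from multiprocessing.connection import wait


def worker(conn, delay, value):
    # Hypothetical stand-in for an environment worker: wait a bit, then send a result.
    time.sleep(delay)
    conn.send(value)


def gather(parent_conns):
    # Build the connection -> slot mapping once instead of list.index() per message.
    slot = {conn: i for i, conn in enumerate(parent_conns)}
    results = [None] * len(parent_conns)
    pending = set(parent_conns)
    while pending:
        # Blocks until at least one connection is readable, then returns all ready ones.
        for conn in wait(list(pending)):
            results[slot[conn]] = conn.recv()
            pending.remove(conn)
    return results


def demo():
    delays = [0.2, 0.0, 0.1]  # workers finish out of order on purpose
    parents, threads = [], []
    for i, delay in enumerate(delays):
        parent, child = Pipe()
        thread = threading.Thread(target=worker, args=(child, delay, i * 10))
        thread.start()
        parents.append(parent)
        threads.append(thread)
    results = gather(parents)
    for thread in threads:
        thread.join()
    return results
```

Results are written into their original slots regardless of arrival order, so no re-ordering of the returns list is needed afterwards.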

Environment stepping most likely cannot be further optimized, as most of the time is spent in the emulator (36% + 15%), numpy.max() (18%) and cv2.resize() (15%). Only very minor speedups might be achievable.

Model size does not affect runtime in a noticeable way.

closed

removed In Progress label
