I am currently going through the Berkeley lectures on Reinforcement Learning. Specifically, I am at slide 5 of this lecture.

At the bottom of that slide, the gradient of the expected sum of rewards is given by $$ \nabla J(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( Q(s_{i,t},a_{i,t}) - V(s_{i,t}) \right) $$ The q-value function is defined as $$Q(s_t,a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'}) \mid s_t,a_t]$$ At first glance, this makes sense: I compare the value of taking the chosen action $a_{i,t}$ to the average value $V(s_{i,t})$ of that state at time step $t$, which tells me how good my action was.
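To make sure I am reading the estimator correctly, here is a minimal sketch of how I would compute it for a tabular softmax policy (the setup, the function names, and the lookup tables `Q` and `V` are my own assumptions, not from the lecture):

```python
import numpy as np

def grad_log_softmax(theta, s, a):
    """Gradient of log pi_theta(a|s) for a tabular softmax policy,
    where theta has shape (n_states, n_actions)."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad = np.zeros_like(theta)
    grad[s] = -probs        # d/dtheta of -log(sum exp(theta[s]))
    grad[s, a] += 1.0       # d/dtheta of theta[s, a]
    return grad

def policy_gradient(theta, trajectories, Q, V):
    """Monte Carlo estimate of grad J(theta), weighting each
    grad-log-prob by the advantage Q(s, a) - V(s).
    trajectories: list of [(s_t, a_t), ...] pairs per sampled rollout;
    Q, V: precomputed lookup tables (assumed, as on the slide)."""
    g = np.zeros_like(theta)
    for traj in trajectories:
        for s, a in traj:
            g += grad_log_softmax(theta, s, a) * (Q[s, a] - V[s])
    return g / len(trajectories)
```

This just mirrors the double sum on the slide: average over the $N$ sampled trajectories, sum over time steps within each one.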

My question is: a specific state $s_{spec}$ can occur at any time step, for example $s_1 = s_{spec} = s_{10}$. But isn't there a difference in value depending on whether I hit $s_{spec}$ at time step 1 or at time step 10 when $T$ is fixed? Does this mean that for every state there is a different q-value for each possible $t \in \{0,\ldots,T\}$? I somehow doubt that this is the case, but I don't quite understand how the time horizon $T$ fits in.
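To illustrate what I mean, here is a toy example I made up (one state, one action, reward 1 per step, rewards collected at steps $t$ through a fixed horizon $T$): the value of the very same state differs depending on the time step at which you reach it.

```python
def value(t, T):
    """V_t(s): expected remaining reward from time step t in a toy MDP
    with a single state, a single action, reward 1 per step, and a
    fixed horizon T (my own conventions, not from the lecture)."""
    if t > T:
        return 0
    return 1 + value(t + 1, T)  # r(s, a) + V_{t+1}(s)

# Same state, different time steps, different values:
print(value(1, 10), value(10, 10))  # 10 1
```

So in this finite-horizon setting the value function really is time-indexed, which is exactly what makes me wonder how the slide's $Q(s,a)$ without a $t$ argument is meant.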

Or is $T$ not fixed (perhaps it is defined as the time step at which the trajectory reaches a terminal state)? But that would mean that during trajectory sampling, each simulation takes a different number of time steps.

I think I have found something helpful in John Schulman's thesis. So it sounds like I either sample trajectories of variable length (they end when a terminal state is reached), or I encode the time step into the state, so that the value function also takes the time step into consideration?
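If the second reading is right, the time encoding could be as simple as this (my own sketch, not from the thesis; I just append the normalized time step to the observation):

```python
def augment_with_time(observation, t, T):
    """Append the normalized time step t/T to the observation, so that
    a value function fed this input can distinguish the same underlying
    state reached at different time steps (hypothetical helper)."""
    return tuple(observation) + (t / T,)

print(augment_with_time((0.3, -1.2), 5, 10))  # (0.3, -1.2, 0.5)
```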

– Dummie Variable – 2018-09-15T13:49:18.070