Q1
a)
Hidden‐Layer Preactivation
$$ z^{(h)} \;=\; W_{xh}\,x_t \;+\; W_{hh}\,h_{t-1} \;+\; b_h, $$
where each hidden neuron has a bias $b_h = 0.1$. Numerically:
We have
$$ x_t = \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix}, \quad h_{t-1} = \begin{bmatrix} 0.05 \\ 0.2 \\ 0.15 \end{bmatrix}, \quad W_{xh} = \begin{bmatrix} -0.07 & 0.05\\ -0.04 & 0.1\\ -0.05 & -0.05 \end{bmatrix}, \quad W_{hh} = \begin{bmatrix} -0.22 & -0.1 & 0.05\\ -0.04 & -0.09 & -0.06\\ -0.09 & 0 & -0.08 \end{bmatrix}. $$
Compute $W_{xh}\,x_t$:
$$ W_{xh}\,x_t = \begin{bmatrix} -0.07 & 0.05\\ -0.04 & 0.1\\ -0.05 & -0.05 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} (-0.07)\cdot1.0 + 0.05\cdot0.5 \\ (-0.04)\cdot1.0 + 0.1\cdot0.5 \\ (-0.05)\cdot1.0 + (-0.05)\cdot0.5 \end{bmatrix} = \begin{bmatrix} -0.045 \\[6pt] 0.01 \\[6pt] -0.075 \end{bmatrix}. $$
Compute $W_{hh}\,h_{t-1}$:
$$ \begin{align*}W_{hh}\,h_{t-1} &= \begin{bmatrix} -0.22 & -0.1 & 0.05\\ -0.04 & -0.09 & -0.06\\ -0.09 & 0 & -0.08 \end{bmatrix} \begin{bmatrix} 0.05\\ 0.2\\ 0.15 \end{bmatrix}\\ &= \begin{bmatrix} -0.22\cdot0.05 \;+\; -0.1\cdot0.2 \;+\; 0.05\cdot0.15\\[4pt] -0.04\cdot0.05 \;+\; -0.09\cdot0.2 \;+\; -0.06\cdot0.15\\[4pt] -0.09\cdot0.05 \;+\; 0\cdot0.2 \;+\; -0.08\cdot0.15 \end{bmatrix}\\ &= \begin{bmatrix} -0.0235\\ -0.029\\ -0.0165 \end{bmatrix}.\end{align*} $$
Add bias $b_h = 0.1$ for each hidden neuron:
$$ z^{(h)} = \underbrace{ \begin{bmatrix} -0.045\\ 0.01\\ -0.075 \end{bmatrix}}_{W_{xh}\,x_t} \;+\; \underbrace{ \begin{bmatrix} -0.0235\\ -0.029\\ -0.0165 \end{bmatrix}}_{W_{hh}\,h_{t-1}} \;+\; \underbrace{ \begin{bmatrix} 0.1\\ 0.1\\ 0.1 \end{bmatrix}}_{b_h} = \begin{bmatrix} 0.0315\\ 0.081\\ 0.0085 \end{bmatrix}. $$
We apply the ReLU to each component of $z^{(h)}$:
$$ h_t \;=\; \mathrm{ReLU}\bigl(z^{(h)}\bigr) = \begin{bmatrix} \max(0,\,0.0315) \\[2pt] \max(0,\,0.081) \\[2pt] \max(0,\,0.0085) \end{bmatrix} = \begin{bmatrix} 0.0315 \\[2pt] 0.081 \\[2pt] 0.0085 \end{bmatrix}. $$
Output‐Layer Preactivation
$$ z^{(y)} \;=\; W_{yh}\,h_t \;+\; b_y, $$
where each output neuron also has bias $b_y = 0.1$.
We have
$$ W_{yh} = \begin{bmatrix} 0.06 & -0.01 & -0.18\\ -0.06 & 0.13 & 0.14 \end{bmatrix}, \quad b_y = 0.1. $$
Compute $W_{yh}\,h_t$:
$$ \begin{align*} \begin{bmatrix} 0.06 & -0.01 & -0.18\\ -0.06 & 0.13 & 0.14 \end{bmatrix} \begin{bmatrix} 0.0315\\ 0.081\\ 0.0085 \end{bmatrix} &= \begin{bmatrix} 0.06\cdot0.0315 + (-0.01)\cdot0.081 + (-0.18)\cdot0.0085\\[4pt] -0.06\cdot0.0315 + 0.13\cdot0.081 + 0.14\cdot0.0085 \end{bmatrix} \\[10pt] &= \begin{bmatrix} 0.00189 - 0.00081 - 0.00153\\ -0.00189 + 0.01053 + 0.00119 \end{bmatrix} \\[10pt] &= \begin{bmatrix} -0.00045\\ 0.00983 \end{bmatrix}. \end{align*} $$
Add bias ($b_y = 0.1$):
$$ z^{(y)} = \begin{bmatrix} -0.00045 + 0.1\\[3pt] 0.00983 + 0.1 \end{bmatrix} = \begin{bmatrix} 0.09955\\[3pt] 0.10983 \end{bmatrix}. $$
Again applying ReLU at the output layer (two output ReLU units):
$$ y_t = \mathrm{ReLU}(z^{(y)}) = \begin{bmatrix} \max(0,\,0.09955)\\ \max(0,\,0.10983) \end{bmatrix} = \begin{bmatrix} 0.09955\\ 0.10983 \end{bmatrix}. $$
Thus, the RNN’s output at time $t$ is:
$$ \boxed{y_t \;=\; \bigl[\,0.09955,\;0.10983\bigr].} $$
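As a sanity check, here is a minimal NumPy sketch of the same forward pass (the variable names are ours, not part of the problem statement):

```python
import numpy as np

# Single-step RNN forward pass from part (a), using the given weights and biases.
x_t = np.array([1.0, 0.5])
h_prev = np.array([0.05, 0.2, 0.15])
W_xh = np.array([[-0.07, 0.05],
                 [-0.04, 0.10],
                 [-0.05, -0.05]])
W_hh = np.array([[-0.22, -0.10, 0.05],
                 [-0.04, -0.09, -0.06],
                 [-0.09, 0.00, -0.08]])
W_yh = np.array([[0.06, -0.01, -0.18],
                 [-0.06, 0.13, 0.14]])
b_h, b_y = 0.1, 0.1

def relu(z):
    return np.maximum(0.0, z)

z_h = W_xh @ x_t + W_hh @ h_prev + b_h   # hidden preactivation
h_t = relu(z_h)                          # [0.0315, 0.081, 0.0085]
z_y = W_yh @ h_t + b_y                   # output preactivation
y_t = relu(z_y)                          # [0.09955, 0.10983]
print(h_t, y_t)
```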
b)
Output-Layer Deltas $ \delta_y $
For each output $j$, with target $t = [t_1,\,t_2] = [0,\,1]$ (as used in the calculations below),
$$ \frac{\partial E}{\partial y_j} = (y_j - t_j) $$
Since the output layer uses ReLU and both output preactivations are positive, the ReLU derivative is 1:
$$ \delta_{y_j} = (y_j - t_j) \times 1 $$
Neuron 1 (output):
$$ \delta_{y1} = (0.09955 - 0) = 0.09955 $$
Neuron 2 (output):
$$ \delta_{y2} = (0.10983 - 1) = -0.89017 $$
Therefore:
$$ \boxed{ \delta_{y1} = 0.09955, \quad \delta_{y2} = -0.89017 } $$
Hidden-Layer Deltas $ \delta_h $
For each hidden neuron $ i $:
$$ \delta_{h_i} = \left( \sum_j \delta_{y_j} W_{y_j, h_i} \right) \times \mathbf{1}_{(z_{h_i}>0)} $$
Since all hidden preactivations $z_{h_i}$ are positive, the ReLU derivative is 1 for every hidden neuron.
Using the matrix:
$$ W_{yh} = \begin{bmatrix} 0.06 & -0.01 & -0.18 \\ -0.06 & 0.13 & 0.14 \end{bmatrix} $$
Hidden neuron 1:
$$ \begin{align*}\delta_{h1} = (0.09955)(0.06) + (-0.89017)(-0.06) \\ = 0.005973 + 0.0534102 = 0.0593832 \approx \boxed{0.05938}\end{align*} $$
Hidden neuron 2:
$$ \begin{align*}\delta_{h2} = (0.09955)(-0.01) + (-0.89017)(0.13) \\ = -0.0009955 - 0.1157221 = -0.1167176 \approx \boxed{-0.11672}\end{align*} $$
Hidden neuron 3:
$$ \begin{align*}\delta_{h3} = (0.09955)(-0.18) + (-0.89017)(0.14) \\ = -0.017919 - 0.1246238 = -0.1425428 \approx \boxed{-0.14254}\end{align*} $$
Output-layer deltas:
$$ \boxed{ \delta_{y1} = 0.09955, \quad \delta_{y2} = -0.89017 } $$
Hidden-layer deltas:
$$ \boxed{ \delta_{h1} \approx 0.05938, \quad \delta_{h2} \approx -0.11672, \quad \delta_{h3} \approx -0.14254 } $$
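A matching sketch of the delta computation, assuming the squared-error-style derivative $y_j - t_j$ and target $t = [0, 1]$ used above:

```python
import numpy as np

# Backprop deltas for part (b); values carried over from the forward pass.
t = np.array([0.0, 1.0])
y_t = np.array([0.09955, 0.10983])        # output activations (all preactivations > 0)
z_h = np.array([0.0315, 0.081, 0.0085])   # hidden preactivations from part (a)
W_yh = np.array([[0.06, -0.01, -0.18],
                 [-0.06, 0.13, 0.14]])

delta_y = (y_t - t) * 1.0                       # ReLU derivative is 1 for positive preactivations
delta_h = (W_yh.T @ delta_y) * (z_h > 0).astype(float)
print(delta_y)  # [ 0.09955 -0.89017]
print(delta_h)  # [ 0.0593832 -0.1167176 -0.1425428]
```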
Q2
a)
$$ Q_{1} \;=\; P\,Q_{0}. $$
Given:
$$ Q_0 = \begin{bmatrix} 0.2 & 0.8 \end{bmatrix}^T = \begin{bmatrix} 0.2 \\ 0.8 \end{bmatrix}, \quad P = \begin{bmatrix} p & 1-p \\ 1-p & p \end{bmatrix}, $$
$$ P\,Q_0 = \begin{bmatrix} p & 1-p \\ 1-p & p \end{bmatrix} \begin{bmatrix} 0.2 \\ 0.8 \end{bmatrix} = \begin{bmatrix} p\cdot 0.2 \;+\; (1-p)\cdot 0.8 \\ (1-p)\cdot 0.2 \;+\; p\cdot 0.8 \end{bmatrix}. $$
- Top component
$$ p \times 0.2 \;+\; (1-p)\times 0.8 \;=\; 0.2p \;+\; 0.8 \;-\; 0.8p \;=\; 0.8 \;-\; 0.6p. $$
- Bottom component
$$ (1-p)\times 0.2 \;+\; p\times 0.8 \;=\; 0.2 \;-\; 0.2p \;+\; 0.8p \;=\; 0.2 \;+\; 0.6p. $$
Hence,
$$ Q_1 = \begin{bmatrix} 0.8 - 0.6p \\ 0.2 + 0.6p \end{bmatrix}. $$
b)
$$ P = \begin{bmatrix} p & 1-p\\ 1-p & p \end{bmatrix} $$
Base case ($n = 1$):
$$ (2p - 1)^1 = 2p - 1 $$
$$ \frac{1}{2} + \frac{1}{2}(2p - 1) = p,\quad \frac{1}{2} - \frac{1}{2}(2p - 1) = 1 - p $$
$$ P^1 = \begin{bmatrix} p & 1 - p\\ 1 - p & p \end{bmatrix} = \begin{bmatrix} \frac{1}{2} + \frac{1}{2}(2p - 1) & \frac{1}{2} - \frac{1}{2}(2p - 1)\\[6pt] \frac{1}{2} - \frac{1}{2}(2p - 1) & \frac{1}{2} + \frac{1}{2}(2p - 1) \end{bmatrix} $$
Inductive Step:
Assume
$$ P^n = \begin{bmatrix} A_n & B_n\\ B_n & A_n \end{bmatrix},\quad A_n = \frac{1}{2} + \frac{1}{2}(2p - 1)^n,\quad B_n = \frac{1}{2} - \frac{1}{2}(2p - 1)^n $$
Then
$$ P^{n+1} = P^n P = \begin{bmatrix} A_n & B_n\\ B_n & A_n \end{bmatrix} \begin{bmatrix} p & 1 - p\\ 1 - p & p \end{bmatrix} $$
Top-left entry:
Let $x = (2p - 1)^n$
$$ A_n p + B_n (1 - p) = \left( \frac{1}{2} + \frac{1}{2}x \right)p + \left( \frac{1}{2} - \frac{1}{2}x \right)(1 - p) $$
$$ = \frac{p}{2} + \frac{px}{2} + \frac{1 - p}{2} - \frac{(1 - p)x}{2} = \frac{1}{2} + \frac{x}{2}(2p - 1) = \frac{1}{2} + \frac{1}{2}(2p - 1)^{n+1} = A_{n+1} $$
Top-right entry:
$$ A_n (1 - p) + B_n p = \left( \frac{1}{2} + \frac{1}{2}x \right)(1 - p) + \left( \frac{1}{2} - \frac{1}{2}x \right)p $$
$$ = \frac{1}{2} - \frac{x}{2}(2p - 1) = \frac{1}{2} - \frac{1}{2}(2p - 1)^{n+1} = B_{n+1} $$
By symmetry:
$$ P^{n+1} = \begin{bmatrix} A_{n+1} & B_{n+1}\\ B_{n+1} & A_{n+1} \end{bmatrix} = \begin{bmatrix} \frac{1}{2} + \frac{1}{2}(2p - 1)^{n+1} & \frac{1}{2} - \frac{1}{2}(2p - 1)^{n+1}\\[6pt] \frac{1}{2} - \frac{1}{2}(2p - 1)^{n+1} & \frac{1}{2} + \frac{1}{2}(2p - 1)^{n+1} \end{bmatrix} $$
Result:
$$ \boxed{ P^n = \begin{bmatrix} \frac{1}{2} + \frac{1}{2}(2p - 1)^n & \frac{1}{2} - \frac{1}{2}(2p - 1)^n\\[6pt] \frac{1}{2} - \frac{1}{2}(2p - 1)^n & \frac{1}{2} + \frac{1}{2}(2p - 1)^n \end{bmatrix} \quad \text{for all } n \ge 1 } $$
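A quick numerical spot check of this closed form (the choice $p = 0.7$, $n = 6$ is arbitrary):

```python
import numpy as np

# Compare a direct matrix power of P with the closed form derived by induction.
p, n = 0.7, 6
P = np.array([[p, 1 - p],
              [1 - p, p]])

direct = np.linalg.matrix_power(P, n)
A_n = 0.5 + 0.5 * (2 * p - 1) ** n
B_n = 0.5 - 0.5 * (2 * p - 1) ** n
closed_form = np.array([[A_n, B_n],
                        [B_n, A_n]])

print(np.allclose(direct, closed_form))  # True
```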
c)
$$ P = \begin{bmatrix} p & 1 - p\\ 1 - p & p \end{bmatrix} $$
$$ \mu^T P = \mu^T, \qquad \mu = [\alpha,\;1 - \alpha] $$
$$ \mu^T P = \begin{bmatrix} \alpha & 1 - \alpha \end{bmatrix} \begin{bmatrix} p & 1 - p\\ 1 - p & p \end{bmatrix} = \begin{bmatrix} \alpha p + (1 - \alpha)(1 - p) & \alpha(1 - p) + (1 - \alpha)p \end{bmatrix} $$
$$ \alpha = \alpha p + (1 - \alpha)(1 - p) = \alpha p + 1 - p - \alpha + \alpha p = 1 - p + 2\alpha p - \alpha $$
$$ \Rightarrow\quad 2\alpha(1 - p) = 1 - p \quad\Rightarrow\quad \alpha = \frac{1}{2} \;\;(\text{for } p \neq 1),\qquad \mu = \left[ \frac{1}{2},\;\frac{1}{2} \right] $$
d)
Since $|2p - 1| < 1$ for $0 < p < 1$, we have $(2p - 1)^n \to 0$, so the state distribution $\pi(n)$ converges to the stationary distribution:
$$ \lim_{n \to \infty} \pi(n) = \mu = \left[ \frac{1}{2},\;\frac{1}{2} \right] $$
From the closed form in part (b),
$$ \lim_{n \to \infty} P^n = \mathbf{1}\,\mu^\top = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \end{bmatrix} = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{bmatrix} $$
$$ \boxed{ \mu = \left[ \frac{1}{2},\;\frac{1}{2} \right], \qquad \lim_{n \to \infty} P^n = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{bmatrix} } $$
Q3
$$ \text{Given: } \gamma = 0.5,\quad T = 5,\quad R_1 = -1,\; R_2 = 2,\; R_3 = 6,\; R_4 = 3,\; R_5 = 2 $$
$$ G_5 = 0 \quad \text{(no rewards remain after the final time step)} $$
$$ G_4 = R_5 = 2 $$
$$ G_3 = R_4 + \gamma R_5 = 3 + 0.5 \times 2 = 4 $$
$$ G_2 = R_3 + \gamma R_4 + \gamma^2 R_5 = 6 + 0.5 \times 3 + 0.5^2 \times 2 = 6 + 1.5 + 0.5 = 8 $$
$$ G_1 = R_2 + \gamma R_3 + \gamma^2 R_4 + \gamma^3 R_5 = 2 + 0.5 \times 6 + 0.5^2 \times 3 + 0.5^3 \times 2 = 2 + 3 + 0.75 + 0.25 = 6 $$
$$ \begin{align*} G_0 &= R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \gamma^4 R_5 \\ &= -1 + 0.5 \times 2 + 0.5^2 \times 6 + 0.5^3 \times 3 + 0.5^4 \times 2 \\ &= -1 + 1 + 1.5 + 0.375 + 0.125 \\ &= 2 \end{align*} $$
$$ \boxed{ G_0 = 2,\quad G_1 = 6,\quad G_2 = 8,\quad G_3 = 4,\quad G_4 = 2,\quad G_5 = 0 } $$
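The same returns can be computed with the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$; a minimal sketch:

```python
# Returns computed backwards from the terminal step, reproducing the hand calculation.
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]       # R_1 ... R_5

G = [0.0] * (len(rewards) + 1)   # G_5 = 0
for t in range(len(rewards) - 1, -1, -1):
    G[t] = rewards[t] + gamma * G[t + 1]   # rewards[t] is R_{t+1}

print(G)  # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```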
Q4
Consider an episode or trajectory $\tau$. The return starting at time $t$ is defined by:
$$ G_t(\tau) = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots $$
Adding a Constant $c$ to Each Reward
Now define a modified reward at each step:
$$ R_k' = R_k + c $$
for all $k$. Then the new return is:
$$ G_t'(\tau) = R_{t+1}' + \gamma R_{t+2}' + \gamma^2 R_{t+3}' + \dots $$
Substitute $R_k' = R_k + c$:
$$ G_t'(\tau) = (R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots) + (c + \gamma c + \gamma^2 c + \dots) $$
$$ = G_t(\tau) + \left(c + \gamma c + \gamma^2 c + \dots\right) $$
Summation of the Constant Series
The constant terms form a geometric series, which converges for $0 \le \gamma < 1$:
$$ c + \gamma c + \gamma^2 c + \dots = c \sum_{k=0}^{\infty} \gamma^k = \frac{c}{1 - \gamma} $$
Thus:
$$ G_t'(\tau) = G_t(\tau) + \frac{c}{1 - \gamma} $$
Because every trajectory’s return is shifted by the same constant $\tfrac{c}{1 - \gamma}$:
- The ordering of returns across trajectories does not change
- The optimal policy remains the same
- Only differences in rewards matter, not their absolute values
$$ \boxed{ G_t'(\tau) = G_t(\tau) + \frac{c}{1 - \gamma} } $$
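A small numerical illustration of this shift (the reward sequence below is arbitrary and long enough that the truncated geometric sum is effectively $\tfrac{c}{1-\gamma}$):

```python
import random

# Adding c to every reward shifts the return by approximately c / (1 - gamma).
gamma, c = 0.9, 2.0
random.seed(0)
rewards = [random.uniform(-1, 1) for _ in range(500)]  # arbitrary illustrative rewards

def ret(rs):
    g = 0.0
    for r in reversed(rs):
        g = r + gamma * g
    return g

G = ret(rewards)
G_shifted = ret([r + c for r in rewards])
print(G_shifted - G, c / (1 - gamma))  # both ~ 20 (up to the gamma^500 truncation error)
```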
Q5
a)
We are given that the greedy action in state $s$ (action 1) is selected with probability 0.6. For an $\varepsilon$-greedy policy over two actions, the greedy action receives probability
$$ \pi(1 \mid s) = (1 - \varepsilon) + \frac{\varepsilon}{2} = 1 - \frac{\varepsilon}{2} = 0.6 $$
$$ \frac{\varepsilon}{2} = 0.4 \quad\Rightarrow\quad \varepsilon = 0.8 $$
$$ \boxed{\varepsilon = 0.8} $$
b)
Environment Details:
Transitions:
$$ \begin{aligned} &\text{State 0:} \\ &\quad \text{Action 0} \to \text{State 0, Reward } -1 \\ &\quad \text{Action 1} \to \text{State 1, Reward } +1 \\ &\text{State 1:} \\ &\quad \text{Action 0} \to \text{State 1, Reward } +1 \\ &\quad \text{Action 1} \to \text{State 0, Reward } -1 \end{aligned} $$
Policy:
$$ \pi(1 \mid 0) = 0.6,\quad \pi(0 \mid 0) = 0.4,\quad \pi(0 \mid 1) = 0.6,\quad \pi(1 \mid 1) = 0.4 $$
Discount factor:
$$ \gamma = 0.1 $$
Bellman Equation for State 0:
$$ \begin{aligned} v(0) &= \pi(0 \mid 0)[-1 + \gamma v(0)] + \pi(1 \mid 0)[1 + \gamma v(1)] \\ &= 0.4[-1 + 0.1 v(0)] + 0.6[1 + 0.1 v(1)] \\ &= -0.4 + 0.04 v(0) + 0.6 + 0.06 v(1) \\ &= 0.2 + 0.04 v(0) + 0.06 v(1) \end{aligned} $$
Rearranged:
$$ 0.96 v(0) = 0.2 + 0.06 v(1) \quad \Rightarrow \quad v(0) = \frac{0.2 + 0.06 v(1)}{0.96} $$
Bellman Equation for State 1:
$$ \begin{aligned} v(1) &= \pi(0 \mid 1)[1 + \gamma v(1)] + \pi(1 \mid 1)[-1 + \gamma v(0)] \\ &= 0.6[1 + 0.1 v(1)] + 0.4[-1 + 0.1 v(0)] \\ &= 0.6 + 0.06 v(1) - 0.4 + 0.04 v(0) \\ &= 0.2 + 0.06 v(1) + 0.04 v(0) \end{aligned} $$
Rearranged:
$$ 0.94 v(1) = 0.2 + 0.04 v(0) \quad \Rightarrow \quad v(1) = \frac{0.2 + 0.04 v(0)}{0.94} $$
Solving the system of two linear equations:
$$ v(0) = v(1) = \frac{2}{9} \approx 0.2222 $$
Action-Value Functions:
$$ q(s, a) = R(s, a) + \gamma \cdot v(s') $$
State 0:
$$ \begin{aligned} q(0,0) &= -1 + 0.1 \cdot v(0) \approx -1 + 0.0222 = -0.9778 \\ q(0,1) &= 1 + 0.1 \cdot v(1) \approx 1 + 0.0222 = 1.0222 \end{aligned} $$
State 1:
$$ \begin{aligned} q(1,0) &= 1 + 0.1 \cdot v(1) \approx 1.0222 \\ q(1,1) &= -1 + 0.1 \cdot v(0) \approx -0.9778 \end{aligned} $$
Final Answer:
$$ \boxed{ v(0) \approx v(1) \approx 0.2222,\quad q(0,0) \approx -0.9778,\quad q(0,1) \approx 1.0222,\quad q(1,0) \approx 1.0222,\quad q(1,1) \approx -0.9778 } $$
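The two Bellman equations form a $2\times 2$ linear system; a small NumPy sketch that solves it and recovers the q-values (the ordering $[v(0), v(1)]$ is our convention):

```python
import numpy as np

# Linear system from part (b):
#   0.96 v(0) - 0.06 v(1) = 0.2
#  -0.04 v(0) + 0.94 v(1) = 0.2
A = np.array([[0.96, -0.06],
              [-0.04, 0.94]])
b = np.array([0.2, 0.2])
v = np.linalg.solve(A, b)
print(v)  # [0.2222 0.2222]  (both equal 2/9)

gamma = 0.1
q = {
    (0, 0): -1 + gamma * v[0],
    (0, 1):  1 + gamma * v[1],
    (1, 0):  1 + gamma * v[1],
    (1, 1): -1 + gamma * v[0],
}
print(q)  # q(0,0) = q(1,1) ~ -0.9778, q(0,1) = q(1,0) ~ 1.0222
```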
c)
Bellman Optimality Derivation for Optimal Policy
Environment Details:
$$ \begin{aligned} \text{State 0:} &\quad \begin{cases} \text{Action 0: } R = -1,\; s' = 0 \\ \text{Action 1: } R = +1,\; s' = 1 \end{cases} \\[1mm] \text{State 1:} &\quad \begin{cases} \text{Action 0: } R = +1,\; s' = 1 \\ \text{Action 1: } R = -1,\; s' = 0 \end{cases} \end{aligned} $$
Discount factor:
$$ \gamma = 0.1 $$
Bellman Optimality Equations:
For state 0:
$$ V^*(0) = \max \left\{ -1 + 0.1 \cdot V^*(0),\;\; 1 + 0.1 \cdot V^*(1) \right\} $$
For state 1:
$$ V^*(1) = \max \left\{ 1 + 0.1 \cdot V^*(1),\;\; -1 + 0.1 \cdot V^*(0) \right\} $$
Solving for Optimal Actions:
From state 1, action 0 (reward $+1$, stay in state 1) dominates action 1 (reward $-1$, move to state 0), so the maximum is attained by action 0:
$$ V^*(1) = 1 + 0.1 \cdot V^*(1) \quad \Rightarrow \quad 0.9\,V^*(1) = 1 \quad \Rightarrow \quad V^*(1) = \frac{1}{0.9} = \frac{10}{9} = 1.\overline{1} $$
Then from state 0, the maximum is attained by action 1 (reward $+1$, move to state 1):
$$ V^*(0) = 1 + 0.1 \cdot V^*(1) = 1 + 0.1 \cdot \frac{10}{9} = 1 + \frac{1}{9} = \frac{10}{9} = 1.\overline{1} $$
Optimal Action-Value Functions:
State 0:
$$ \begin{aligned} Q^*(0,0) &= -1 + 0.1 \cdot V^*(0) = -1 + 0.1 \cdot \frac{10}{9} = -\frac{8}{9} \approx -0.8889 \\ Q^*(0,1) &= 1 + 0.1 \cdot V^*(1) = 1 + 0.1 \cdot \frac{10}{9} = \frac{10}{9} \approx 1.1111 \end{aligned} $$
State 1:
$$ \begin{aligned} Q^*(1,0) &= 1 + 0.1 \cdot V^*(1) = \frac{10}{9} \approx 1.1111 \\ Q^*(1,1) &= -1 + 0.1 \cdot V^*(0) = -\frac{8}{9} \approx -0.8889 \end{aligned} $$
Optimal Policy:
$$ \pi^*(1 \mid 0) = 1, \quad \pi^*(0 \mid 1) = 1 $$
Final Answer:
$$ \boxed{ \begin{aligned} V^*(0) &= \frac{10}{9} \approx 1.1111,\quad V^*(1) = \frac{10}{9} \approx 1.1111 \\[1mm] Q^*(0,0) &= -\frac{8}{9} \approx -0.8889,\quad Q^*(0,1) = \frac{10}{9} \approx 1.1111 \\[1mm] Q^*(1,0) &= \frac{10}{9} \approx 1.1111,\quad Q^*(1,1) = -\frac{8}{9} \approx -0.8889 \\[1mm] \pi^*(1 \mid 0) &= 1,\quad \pi^*(0 \mid 1) = 1 \end{aligned} } $$
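A value-iteration sketch that converges to the same optimal values (the dictionary encoding of the deterministic MDP is ours, not from the problem statement):

```python
# Value iteration for the Bellman optimality equations in part (c).
gamma = 0.1
# (state, action) -> (reward, next_state)
model = {
    (0, 0): (-1, 0), (0, 1): (1, 1),
    (1, 0): (1, 1),  (1, 1): (-1, 0),
}

V = {0: 0.0, 1: 0.0}
for _ in range(100):
    V = {s: max(r + gamma * V[s2]
                for (s_, a), (r, s2) in model.items() if s_ == s)
         for s in V}

Q = {(s, a): r + gamma * V[s2] for (s, a), (r, s2) in model.items()}
print(V)  # {0: 1.111..., 1: 1.111...}  (both 10/9)
print(Q)  # Q(0,1) = Q(1,0) = 10/9, Q(0,0) = Q(1,1) = -8/9
```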
d)
Trajectory 1 (Start from state 0)
$$ 0 \xrightarrow{a=0,\;r=-1} 0 \xrightarrow{a=1,\;r=+1} 1 \xrightarrow{a=0,\;r=+1} \text{(terminal)} $$
$$ \begin{align*} G_0 &= -1 + 0.1(1) + 0.01(1) = -0.89 \\ G_1 &= 1 + 0.1(1) = 1.1 \\ G_2 &= 1 \end{align*} $$
Trajectory 2 (Start from state 1)
$$ 1 \xrightarrow{a=0,\;r=+1} 1 \xrightarrow{a=1,\;r=-1} 0 \xrightarrow{a=1,\;r=+1} \text{(terminal)} $$
$$ \begin{align*} G_0 &= 1 - 0.1 + 0.01 = 0.91 \\ G_1 &= -1 + 0.1 = -0.9 \\ G_2 &= 1 \end{align*} $$
Monte Carlo Updates with $\alpha = 0.6$
State-Value Updates
Initial: $V(s) = 0$
From Trajectory 1:
$$ \begin{align*} V(0) &\leftarrow 0 + 0.6(-0.89 - 0) = -0.534 \\ V(0) &\leftarrow -0.534 + 0.6(1.1 - (-0.534)) = 0.4464 \\ V(1) &\leftarrow 0 + 0.6(1 - 0) = 0.6 \end{align*} $$
From Trajectory 2:
$$ \begin{align*} V(1) &\leftarrow 0.6 + 0.6(0.91 - 0.6) = 0.786 \\ V(1) &\leftarrow 0.786 + 0.6(-0.9 - 0.786) = -0.2256 \\ V(0) &\leftarrow 0.4464 + 0.6(1 - 0.4464) = 0.77856 \end{align*} $$
Action-Value Updates
Initial: $Q(s,a) = 0$
From Trajectory 1:
$$ \begin{align*} Q(0,0) &\leftarrow 0 + 0.6(-0.89) = -0.534 \\ Q(0,1) &\leftarrow 0 + 0.6(1.1) = 0.66 \\ Q(1,0) &\leftarrow 0 + 0.6(1) = 0.6 \end{align*} $$
From Trajectory 2:
$$ \begin{align*} Q(1,0) &\leftarrow 0.6 + 0.6(0.91 - 0.6) = 0.786 \\ Q(1,1) &\leftarrow 0 + 0.6(-0.9) = -0.54 \\ Q(0,1) &\leftarrow 0.66 + 0.6(1 - 0.66) = 0.864 \end{align*} $$
Final Estimates
State-Value Function
$$ V(0) \approx 0.7786, \quad V(1) \approx -0.2256 $$
Action-Value Function
$$ \begin{align*} Q(0,0) &\approx -0.534 \\ Q(0,1) &\approx 0.864 \\ Q(1,0) &\approx 0.786 \\ Q(1,1) &\approx -0.54 \end{align*} $$
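A sketch of these every-visit, constant-$\alpha$ Monte Carlo updates (the episodes are encoded as (state, action, return) triples, with the returns computed above):

```python
# Every-visit Monte Carlo with constant step size, reproducing part (d).
alpha = 0.6

episodes = [
    [(0, 0, -0.89), (0, 1, 1.1), (1, 0, 1.0)],   # trajectory 1
    [(1, 0, 0.91), (1, 1, -0.9), (0, 1, 1.0)],   # trajectory 2
]

V = {0: 0.0, 1: 0.0}
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
for episode in episodes:
    for s, a, G in episode:
        V[s] += alpha * (G - V[s])
        Q[(s, a)] += alpha * (G - Q[(s, a)])

print(V)  # V(0) ~ 0.77856, V(1) ~ -0.2256
print(Q)  # Q(0,0) ~ -0.534, Q(0,1) ~ 0.864, Q(1,0) ~ 0.786, Q(1,1) ~ -0.54
```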
e)
Q-learning formula:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] $$
Trajectory 1
Time $t=0$:
$$ Q(0,0) \leftarrow 0 + 0.6 \left[ -1 + 0.1 \cdot 0 - 0 \right] = -0.6 $$
Time $t=1$:
$$ Q(0,1) \leftarrow 0 + 0.6 \left[ 1 + 0.1 \cdot 0 - 0 \right] = 0.6 $$
Time $t=2$:
$$ Q(1,0) \leftarrow 0 + 0.6 \left[ 1 + 0.1 \cdot 0 - 0 \right] = 0.6 $$
Trajectory 2
Time $t=0$:
$$ \begin{align*} \max_{a'} Q(1,a') &= \max\{0.6, 0\} = 0.6 \\ Q(1,0) &\leftarrow 0.6 + 0.6 \left[ 1 + 0.1 \cdot 0.6 - 0.6 \right] \\ &= 0.6 + 0.6 \cdot (1.06 - 0.6) = 0.6 + 0.6 \cdot 0.46 = 0.876 \end{align*} $$
Time $t=1$:
$$ \begin{align*} \max_{a'} Q(0,a') &= \max\{-0.6, 0.6\} = 0.6 \\ Q(1,1) &\leftarrow 0 + 0.6 \left[ -1 + 0.1 \cdot 0.6 - 0 \right] \\ &= 0 + 0.6 \cdot (-0.94) = -0.564 \end{align*} $$
Time $t=2$:
$$ \begin{align*} Q(0,1) &\leftarrow 0.6 + 0.6 \left[ 1 + 0.1 \cdot 0 - 0.6 \right] \\ &= 0.6 + 0.6 \cdot (1 - 0.6) = 0.6 + 0.24 = 0.84 \end{align*} $$
Final Answer
$$ \boxed{\begin{aligned} Q(0,0) &\approx -0.6 \\[3pt] Q(0,1) &\approx 0.84 \\[3pt] Q(1,0) &\approx 0.876 \\[3pt] Q(1,1) &\approx -0.564 \end{aligned}} $$
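A sketch of the same Q-learning updates (transitions encoded as (s, a, r, s') tuples, with `None` marking a terminal next state):

```python
# Tabular Q-learning reproducing part (e).
alpha, gamma = 0.6, 0.1
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

transitions = [
    (0, 0, -1, 0), (0, 1, 1, 1), (1, 0, 1, None),   # trajectory 1
    (1, 0, 1, 1), (1, 1, -1, 0), (0, 1, 1, None),   # trajectory 2
]

for s, a, r, s_next in transitions:
    target = r if s_next is None else r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q)  # Q(0,0) = -0.6, Q(0,1) = 0.84, Q(1,0) = 0.876, Q(1,1) = -0.564
```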
f)
we have
- All $Q(s,a) = 0$
- Initial policy: $\pi(1\mid0) = \pi(0\mid1) = 0.5$
- $\varepsilon = 0.8$, so the greedy probability is 0.6
- Tie-breaking: in state 0, action 1 is greedy; in state 1, action 0 is greedy
- Parameters: $\alpha=0.6,\ \gamma=0.1$
- Using the same random number sequence as in part (d):
$0.94,\ 0.27,\ 0.30,\ 0.18,\ 0.75$
The trajectory is:
$$ (0,0,r=-1) \rightarrow (0,1,r=+1) \rightarrow (1,0,r=+1) \rightarrow (1,0,r=+1) \rightarrow (1,1,r=-1) \rightarrow \text{terminal} $$
SARSA Update Rule
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s', a') - Q(s,a) \right] $$
All $Q$ values are initialized to 0.
Update Steps
$t=0$: $(s=0, a=0, r=-1)$, next: $(s'=0, a'=1)$
$$ Q(0,0) \leftarrow 0 + 0.6 \cdot \left[-1 + 0.1 \cdot Q(0,1) - 0 \right] = -0.6 $$
$t=1$: $(s=0, a=1, r=+1)$, next: $(s'=1, a'=0)$
$$ Q(0,1) \leftarrow 0 + 0.6 \cdot \left[1 + 0.1 \cdot Q(1,0) - 0 \right] = 0.6 $$
$t=2$: $(s=1, a=0, r=+1)$, next: $(s'=1, a'=0)$
$$ Q(1,0) \leftarrow 0 + 0.6 \cdot \left[1 + 0.1 \cdot Q(1,0) - 0 \right] = 0.6 $$
$t=3$: $(s=1, a=0, r=+1)$, next: $(s'=1, a'=1)$
$$ Q(1,0) \leftarrow 0.6 + 0.6 \cdot \left[1 + 0.1 \cdot Q(1,1) - 0.6 \right] = 0.6 + 0.6 \cdot 0.4 = 0.84 $$
$t=4$: $(s=1, a=1, r=-1)$, terminal
$$ Q(1,1) \leftarrow 0 + 0.6 \cdot \left[-1 + 0 - 0 \right] = -0.6 $$
Final SARSA Q-values
$$ \boxed{ \begin{aligned} Q(0,0) &\approx -0.6 \\ Q(0,1) &\approx 0.6 \\ Q(1,0) &\approx 0.84 \\ Q(1,1) &\approx -0.6 \end{aligned} } $$
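A sketch of the SARSA updates (steps encoded as (s, a, r, s', a') tuples, with `None` at the terminal step):

```python
# Tabular SARSA reproducing part (f).
alpha, gamma = 0.6, 0.1
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

steps = [
    (0, 0, -1, 0, 1),
    (0, 1,  1, 1, 0),
    (1, 0,  1, 1, 0),
    (1, 0,  1, 1, 1),
    (1, 1, -1, None, None),
]

for s, a, r, s_next, a_next in steps:
    target = r if a_next is None else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q)  # Q(0,0) = -0.6, Q(0,1) = 0.6, Q(1,0) = 0.84, Q(1,1) = -0.6
```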
Q6
a)
$$ P(S=s)\;=\;\sum_{b\in\{0,1\}}\sum_{r\in\{0,1\}}\sum_{a\in\{0,1\}} P(b)\,P(r\mid b)\,P(a\mid b)\,P(s\mid r,a). $$
We have

Prior on $B$:
$$ P(b=1)=0.5,\quad P(b=0)=0.5 $$
Conditional probabilities for $R$:
$$ \begin{aligned} P(r=1\mid b=1) &= 0.3,\quad P(r=0\mid b=1) = 0.7 \\ P(r=1\mid b=0) &= 0.8,\quad P(r=0\mid b=0) = 0.2 \end{aligned} $$
Conditional probabilities for $A$:
$$ \begin{aligned} P(a=1\mid b=1) &= 0.1,\quad P(a=0\mid b=1) = 0.9 \\ P(a=1\mid b=0) &= 0.5,\quad P(a=0\mid b=0) = 0.5 \end{aligned} $$
Conditional probabilities for $S$:
$$ \begin{aligned} P(s=1\mid r=1, a=1) &= 0 \\ P(s=1\mid r=1, a=0) &= 0.8 \\ P(s=1\mid r=0, a=1) &= 0.6 \\ P(s=1\mid r=0, a=0) &= 1.0 \end{aligned} $$
$$ \begin{aligned} P(s=0\mid r=1, a=1) &= 1 \\ P(s=0\mid r=1, a=0) &= 0.2 \\ P(s=0\mid r=0, a=1) &= 0.4 \\ P(s=0\mid r=0, a=0) &= 0 \end{aligned} $$
$b=1$:
$$ \begin{aligned} &0.5 \cdot 0.3 \cdot 0.1 \cdot 0 = 0 \\ &0.5 \cdot 0.3 \cdot 0.9 \cdot 0.8 = 0.108 \\ &0.5 \cdot 0.7 \cdot 0.1 \cdot 0.6 = 0.021 \\ &0.5 \cdot 0.7 \cdot 0.9 \cdot 1.0 = 0.315 \\ &\text{Sum for } b=1: \quad 0 + 0.108 + 0.021 + 0.315 = 0.444 \end{aligned} $$
$b=0$:
$$ \begin{aligned} &0.5 \cdot 0.8 \cdot 0.5 \cdot 0 = 0 \\ &0.5 \cdot 0.8 \cdot 0.5 \cdot 0.8 = 0.16 \\ &0.5 \cdot 0.2 \cdot 0.5 \cdot 0.6 = 0.03 \\ &0.5 \cdot 0.2 \cdot 0.5 \cdot 1.0 = 0.05 \\ &\text{Sum for } b=0: \quad 0 + 0.16 + 0.03 + 0.05 = 0.24 \end{aligned} $$
Final Result
$$ P(S=1) = 0.444 + 0.24 = \boxed{0.684} $$
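The same marginal can be obtained by enumerating the joint distribution; a sketch (the dictionary CPT encoding is ours, not from the problem statement):

```python
from itertools import product

# Enumerate P(b, r, a, s=1) and sum out b, r, a to get the marginal P(S=1).
P_b = {1: 0.5, 0: 0.5}
P_r_given_b = {1: {1: 0.3, 0: 0.7}, 0: {1: 0.8, 0: 0.2}}   # P_r_given_b[b][r]
P_a_given_b = {1: {1: 0.1, 0: 0.9}, 0: {1: 0.5, 0: 0.5}}   # P_a_given_b[b][a]
P_s1_given_ra = {(1, 1): 0.0, (1, 0): 0.8, (0, 1): 0.6, (0, 0): 1.0}

joint_s1 = {}
for b, r, a in product((0, 1), repeat=3):
    joint_s1[(b, r, a)] = (P_b[b] * P_r_given_b[b][r]
                           * P_a_given_b[b][a] * P_s1_given_ra[(r, a)])

print(sum(joint_s1.values()))  # P(S=1) = 0.684
```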
b)
From the diagram’s CPT for $S$, once you know $R = r$ and $A = a$, $S$ depends only on $r$ and $a$. Hence the conditional probabilities $P(s \mid r, a)$ can be read directly from the $S$‐table:
$$ P(s=1 \mid r,a)= \begin{cases} 0 & (r=1,\,a=1)\\ 0.8 & (r=1,\,a=0)\\ 0.6 & (r=0,\,a=1)\\ 1.0 & (r=0,\,a=0) \end{cases} \quad \text{and} \quad P(s=0 \mid r,a)\;=\;1 \;-\; P(s=1 \mid r,a). $$
c)
We check whether $R$ and $A$ are conditionally independent given $S$, i.e. whether
$$ P(r,a \mid s) = P(r \mid s)\, P(a \mid s) $$
Use the Joint and Normalize
We compute:
$$ P(r,a \mid s) = \frac{P(r,a,s)}{P(s)} = \frac{\sum_{b\in\{0,1\}} P(b)\,P(r\mid b)\,P(a\mid b)\,P(s\mid r,a)}{P(s)} $$
Once we have the joint distribution $P(r,a \mid s)$, we can get the marginals by summing:
$$ P(r \mid s) = \sum_{a} P(r,a \mid s), \quad P(a \mid s) = \sum_{r} P(r,a \mid s) $$
Numerical Values for $s = 1$
From earlier computations:
$$ P(S=s) = 0.684 $$
and from the same summations,
- $P(r=1, s=1) = 0.268$
- $P(r=0, s=1) = 0.416$
- $P(a=1, s=1) = 0.051$
- $P(a=0, s=1) = 0.633$
Hence:
$$ P(r=1 \mid s=1) = \frac{0.268}{0.684} \approx 0.392 $$
$$ P(a=1 \mid s=1) = \frac{0.051}{0.684} \approx 0.075 $$
Check Conditional Independence
Note: $ P(r=1, a=1, s=1) = 0 $
Then:
$$ P(r=1, a=1 \mid s=1) = \frac{0}{0.684} = 0 $$
But:
$$ P(r=1 \mid s=1) \cdot P(a=1 \mid s=1) = (0.392)(0.075) \approx 0.029 \neq 0 $$
This shows directly that
$$ P(r,a \mid s) \neq P(r \mid s)\, P(a \mid s) $$
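A sketch that reproduces this conditional-independence check numerically (it reuses the CPT encoding from part (a)):

```python
from itertools import product

# Compare P(r=1, a=1 | s=1) with P(r=1 | s=1) * P(a=1 | s=1).
P_b = {1: 0.5, 0: 0.5}
P_r_given_b = {1: {1: 0.3, 0: 0.7}, 0: {1: 0.8, 0: 0.2}}
P_a_given_b = {1: {1: 0.1, 0: 0.9}, 0: {1: 0.5, 0: 0.5}}
P_s1_given_ra = {(1, 1): 0.0, (1, 0): 0.8, (0, 1): 0.6, (0, 0): 1.0}

def joint_s1(b, r, a):
    return P_b[b] * P_r_given_b[b][r] * P_a_given_b[b][a] * P_s1_given_ra[(r, a)]

P_s1 = sum(joint_s1(b, r, a) for b, r, a in product((0, 1), repeat=3))               # 0.684
P_r1_a1_given_s1 = sum(joint_s1(b, 1, 1) for b in (0, 1)) / P_s1                     # 0.0
P_r1_given_s1 = sum(joint_s1(b, 1, a) for b, a in product((0, 1), repeat=2)) / P_s1  # ~0.392
P_a1_given_s1 = sum(joint_s1(b, r, 1) for b, r in product((0, 1), repeat=2)) / P_s1  # ~0.075

print(P_r1_a1_given_s1, P_r1_given_s1 * P_a1_given_s1)  # 0.0 vs ~0.029 -> not independent
```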
d)
Computing $ P(b \mid s=1) $
We want to compute the posterior probability of $B$ given $ S=1 $. This can be done using:
$$ P(b \mid s) = \frac{P(b, s)}{P(s)} = \frac{\sum_{r \in \{0,1\}} \sum_{a \in \{0,1\}} P(b)\,P(r \mid b)\,P(a \mid b)\,P(s \mid r, a)}{P(s)} $$
From earlier, we know that:
$$ P(s = 1) = 0.684 $$
Compute $ P(b=1, s=1)$
This involves summing over all combinations of $ r $ and $ a $:
$r=1, a=1$: contributes 0, since $P(s=1 \mid r=1, a=1) = 0$.
$r=1, a=0$:
$$ 0.5 \cdot 0.3 \cdot 0.9 \cdot 0.8 = 0.108 $$
$r=0, a=1$:
$$ 0.5 \cdot 0.7 \cdot 0.1 \cdot 0.6 = 0.021 $$
$r=0, a=0$:
$$ 0.5 \cdot 0.7 \cdot 0.9 \cdot 1.0 = 0.315 $$
Adding these:
$$ P(b=1, s=1) = 0.108 + 0.021 + 0.315 = 0.444 $$
Compute $ P(b=0, s=1) $
Again summing over all $r, a$ (the $r=1,\ a=1$ term is again 0):
$ r=1, a=0 $:
$$ 0.5 \cdot 0.8 \cdot 0.5 \cdot 0.8 = 0.16 $$
$ r=0, a=1 $:
$$ 0.5 \cdot 0.2 \cdot 0.5 \cdot 0.6 = 0.03 $$
$ r=0, a=0 $:
$$ 0.5 \cdot 0.2 \cdot 0.5 \cdot 1.0 = 0.05 $$
Adding:
$$ P(b=0, s=1) = 0.16 + 0.03 + 0.05 = 0.24 $$
Now compute the posterior:
$$ P(b=1 \mid s=1) = \frac{0.444}{0.684} \approx 0.65 $$
$$ P(b=0 \mid s=1) = \frac{0.24}{0.684} \approx 0.35 $$
Final Answer:
$$ \boxed{ P(b=1 \mid s=1) \approx 0.65, \quad P(b=0 \mid s=1) \approx 0.35 } $$
e)
Sample $ B $
- Prior: $ P(B = 1) = 0.5 $
We sample each variable in topological order, setting it to 1 when the corresponding uniform draw falls below the probability of value 1 given its parents. Given $u_1 = 0.94$, since $0.94 > 0.5$, we set:
$$ B = 0 $$
Sample $ R \mid B = 0 $
- Conditional: $ P(R = 1 \mid B = 0) = 0.8 $
Given $ u_2 = 0.27 $, since $ 0.27 < 0.8 $, we set:
$$ R = 1 $$
Sample $ A \mid B = 0 $
- Conditional: $ P(A = 1 \mid B = 0) = 0.5 $
Given $ u_3 = 0.30 $, since $ 0.30 < 0.5 $, we set:
$$ A = 1 $$
Sample $ S \mid R = 1, A = 1 $
- From the conditional table:
$P(S = 1 \mid R = 1, A = 1) = 0$, so $P(S = 0 \mid R = 1, A = 1) = 1$. Therefore, regardless of $u_4 = 0.18$, we must set:
$$ S = 0 $$
Putting it all together:
$$ \boxed{(B, R, A, S) = (0,\; 1,\; 1,\; 0)} $$
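A sketch of this ancestral (forward) sampling procedure with the given uniform draws (the `u < P(X = 1 | parents)` convention matches the steps above):

```python
# Ancestral sampling of (B, R, A, S) using the fixed uniform draws u1..u4.
P_r_given_b = {1: 0.3, 0: 0.8}                      # P(R = 1 | B = b)
P_a_given_b = {1: 0.1, 0: 0.5}                      # P(A = 1 | B = b)
P_s_given_ra = {(1, 1): 0.0, (1, 0): 0.8,
                (0, 1): 0.6, (0, 0): 1.0}           # P(S = 1 | R = r, A = a)

u1, u2, u3, u4 = 0.94, 0.27, 0.30, 0.18

B = 1 if u1 < 0.5 else 0                    # 0.94 > 0.5          -> B = 0
R = 1 if u2 < P_r_given_b[B] else 0         # 0.27 < 0.8          -> R = 1
A = 1 if u3 < P_a_given_b[B] else 0         # 0.30 < 0.5          -> A = 1
S = 1 if u4 < P_s_given_ra[(R, A)] else 0   # P(S=1 | 1, 1) = 0   -> S = 0

print(B, R, A, S)  # 0 1 1 0
```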