Q1

a)

Hidden‐Layer Preactivation

$$ z^{(h)} \;=\; W_{xh}\,x_t \;+\; W_{hh}\,h_{t-1} \;+\; b_h, $$

where each hidden neuron has a bias $b_h = 0.1$. Numerically:

We have

$$ x_t = \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix}, \quad h_{t-1} = \begin{bmatrix} 0.05 \\ 0.2 \\ 0.15 \end{bmatrix}, \quad W_{xh} = \begin{bmatrix} -0.07 & 0.05\\ -0.04 & 0.1\\ -0.05 & -0.05 \end{bmatrix}, \quad W_{hh} = \begin{bmatrix} -0.22 & -0.1 & 0.05\\ -0.04 & -0.09 & -0.06\\ -0.09 & 0 & -0.08 \end{bmatrix}. $$

Compute $W_{xh}\,x_t$:

$$ W_{xh}\,x_t = \begin{bmatrix} -0.07 & 0.05\\ -0.04 & 0.1\\ -0.05 & -0.05 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} (-0.07)\cdot1.0 + 0.05\cdot0.5 \\ (-0.04)\cdot1.0 + 0.1\cdot0.5 \\ (-0.05)\cdot1.0 + (-0.05)\cdot0.5 \end{bmatrix} = \begin{bmatrix} -0.045 \\[6pt] 0.01 \\[6pt] -0.075 \end{bmatrix}. $$

Compute $W_{hh}\,h_{t-1}$:

$$ \begin{align*}W_{hh}\,h_{t-1} &= \begin{bmatrix} -0.22 & -0.1 & 0.05\\ -0.04 & -0.09 & -0.06\\ -0.09 & 0 & -0.08 \end{bmatrix} \begin{bmatrix} 0.05\\ 0.2\\ 0.15 \end{bmatrix}\\ &= \begin{bmatrix} -0.22\cdot0.05 \;+\; -0.1\cdot0.2 \;+\; 0.05\cdot0.15\\[4pt] -0.04\cdot0.05 \;+\; -0.09\cdot0.2 \;+\; -0.06\cdot0.15\\[4pt] -0.09\cdot0.05 \;+\; 0\cdot0.2 \;+\; -0.08\cdot0.15 \end{bmatrix}\\ &= \begin{bmatrix} -0.0235\\ -0.029\\ -0.0165 \end{bmatrix}.\end{align*} $$

Add bias $b_h = 0.1$ for each hidden neuron:

$$ z^{(h)} = \underbrace{ \begin{bmatrix} -0.045\\ 0.01\\ -0.075 \end{bmatrix}}_{W_{xh}\,x_t} \;+\; \underbrace{ \begin{bmatrix} -0.0235\\ -0.029\\ -0.0165 \end{bmatrix}}_{W_{hh}\,h_{t-1}} \;+\; \underbrace{ \begin{bmatrix} 0.1\\ 0.1\\ 0.1 \end{bmatrix}}_{b_h} = \begin{bmatrix} 0.0315\\ 0.081\\ 0.0085 \end{bmatrix}. $$

We apply the ReLU to each component of $z^{(h)}$:

$$ h_t \;=\; \mathrm{ReLU}\bigl(z^{(h)}\bigr) = \begin{bmatrix} \max(0,\,0.0315) \\[2pt] \max(0,\,0.081) \\[2pt] \max(0,\,0.0085) \end{bmatrix} = \begin{bmatrix} 0.0315 \\[2pt] 0.081 \\[2pt] 0.0085 \end{bmatrix}. $$

Output‐Layer Preactivation

$$ z^{(y)} \;=\; W_{yh}\,h_t \;+\; b_y, $$

where each output neuron also has bias $b_y = 0.1$.

We have

$$ W_{yh} = \begin{bmatrix} 0.06 & -0.01 & -0.18\\ -0.06 & 0.13 & 0.14 \end{bmatrix}, \quad b_y = 0.1. $$

Compute $W_{yh}\,h_t$:

$$ \begin{align*} \begin{bmatrix} 0.06 & -0.01 & -0.18\\ -0.06 & 0.13 & 0.14 \end{bmatrix} \begin{bmatrix} 0.0315\\ 0.081\\ 0.0085 \end{bmatrix} &= \begin{bmatrix} 0.06\cdot0.0315 + (-0.01)\cdot0.081 + (-0.18)\cdot0.0085\\[4pt] -0.06\cdot0.0315 + 0.13\cdot0.081 + 0.14\cdot0.0085 \end{bmatrix} \\[10pt] &= \begin{bmatrix} 0.00189 - 0.00081 - 0.00153\\ -0.00189 + 0.01053 + 0.00119 \end{bmatrix} \\[10pt] &= \begin{bmatrix} -0.00045\\ 0.00983 \end{bmatrix}. \end{align*} $$

Add bias ($b_y = 0.1$):

$$ z^{(y)} = \begin{bmatrix} -0.00045 + 0.1\\[3pt] 0.00983 + 0.1 \end{bmatrix} = \begin{bmatrix} 0.09955\\[3pt] 0.10983 \end{bmatrix}. $$

Again applying ReLU at the output layer (two output ReLU units):

$$ y_t = \mathrm{ReLU}(z^{(y)}) = \begin{bmatrix} \max(0,\,0.09955)\\ \max(0,\,0.10983) \end{bmatrix} = \begin{bmatrix} 0.09955\\ 0.10983 \end{bmatrix}. $$

Thus, the RNN’s output at time $t$ is:

$$ \boxed{y_t \;=\; \bigl[\,0.09955,\;0.10983\bigr].} $$
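As a sanity check, here is a minimal NumPy sketch of the forward pass above; the vectors and matrices are copied from the problem data, and `relu` is an elementwise max with zero.

```python
import numpy as np

# Inputs and weights as given in the problem statement.
x_t    = np.array([1.0, 0.5])
h_prev = np.array([0.05, 0.2, 0.15])
W_xh = np.array([[-0.07,  0.05],
                 [-0.04,  0.10],
                 [-0.05, -0.05]])
W_hh = np.array([[-0.22, -0.10,  0.05],
                 [-0.04, -0.09, -0.06],
                 [-0.09,  0.00, -0.08]])
W_yh = np.array([[ 0.06, -0.01, -0.18],
                 [-0.06,  0.13,  0.14]])
b_h, b_y = 0.1, 0.1

relu = lambda z: np.maximum(0.0, z)

z_h = W_xh @ x_t + W_hh @ h_prev + b_h   # hidden preactivation
h_t = relu(z_h)                          # [0.0315, 0.081, 0.0085]
z_y = W_yh @ h_t + b_y                   # output preactivation
y_t = relu(z_y)                          # [0.09955, 0.10983]
print(h_t, y_t)
```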

b)

Output-Layer Deltas $ \delta_y $

For the squared-error loss $E = \tfrac{1}{2}\sum_j (y_j - t_j)^2$ with target $t = [0,\,1]$,

$$ \frac{\partial E}{\partial y_j} = y_j - t_j. $$

Since the output layer uses ReLU and both output preactivations are positive, the ReLU derivative is 1:

$$ \delta_{y_j} = (y_j - t_j) \times 1 $$

  • Neuron 1 (output):

    $$ \delta_{y1} = (0.09955 - 0) = 0.09955 $$

  • Neuron 2 (output):

    $$ \delta_{y2} = (0.10983 - 1) = -0.89017 $$

Therefore:

$$ \boxed{ \delta_{y1} = 0.09955, \quad \delta_{y2} = -0.89017 } $$

Hidden-Layer Deltas $ \delta_h $

For each hidden neuron $ i $:

$$ \delta_{h_i} = \left( \sum_j \delta_{y_j} W_{y_j, h_i} \right) \times \mathbf{1}_{(z_{h_i}>0)} $$

Since all hidden activations are positive, ReLU derivative is 1 for all.

Using the matrix:

$$ W_{yh} = \begin{bmatrix} 0.06 & -0.01 & -0.18 \\ -0.06 & 0.13 & 0.14 \end{bmatrix} $$

  • Hidden neuron 1:

    $$ \begin{align*}\delta_{h1} = (0.09955)(0.06) + (-0.89017)(-0.06) \\ = 0.005973 + 0.0534102 = 0.0593832 \approx \boxed{0.05938}\end{align*} $$

  • Hidden neuron 2:

    $$ \begin{align*}\delta_{h2} = (0.09955)(-0.01) + (-0.89017)(0.13) \\ = -0.0009955 - 0.1157221 = -0.1167176 \approx \boxed{-0.11672}\end{align*} $$

  • Hidden neuron 3:

    $$ \begin{align*}\delta_{h3} = (0.09955)(-0.18) + (-0.89017)(0.14) \\ = -0.017919 - 0.1246238 = -0.1425428 \approx \boxed{-0.14254}\end{align*} $$

Output-layer deltas:

$$ \boxed{ \delta_{y1} = 0.09955, \quad \delta_{y2} = -0.89017 } $$

Hidden-layer deltas:

$$ \boxed{ \delta_{h1} \approx 0.05938, \quad \delta_{h2} \approx -0.11672, \quad \delta_{h3} \approx -0.14254 } $$
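Continuing the NumPy sketch from part (a) (it reuses `y_t`, `z_y`, `z_h`, and `W_yh` defined there), the deltas follow from the squared-error loss with target $[0,\,1]$:

```python
import numpy as np

# Deltas for E = 1/2 * sum((y - t)^2) with target t = [0, 1]; reuses y_t, z_y, z_h, W_yh
# from the forward-pass sketch in part (a).
target = np.array([0.0, 1.0])

delta_y = (y_t - target) * (z_y > 0)        # ReLU'(z) = 1 where z > 0, else 0
delta_h = (W_yh.T @ delta_y) * (z_h > 0)    # backpropagate through W_yh and the hidden ReLU

print(delta_y)   # [ 0.09955, -0.89017]
print(delta_h)   # [ 0.05938..., -0.11672..., -0.14254...]
```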

Q2

a)

$$ Q_{1} \;=\; P\,Q_{0}. $$

Given:

$$ Q_0 = \begin{bmatrix} 0.2 & 0.8 \end{bmatrix}^T = \begin{bmatrix} 0.2 \\ 0.8 \end{bmatrix}, \quad P = \begin{bmatrix} p & 1-p \\ 1-p & p \end{bmatrix}, $$

$$ P\,Q_0 = \begin{bmatrix} p & 1-p \\ 1-p & p \end{bmatrix} \begin{bmatrix} 0.2 \\ 0.8 \end{bmatrix} = \begin{bmatrix} p\cdot 0.2 \;+\; (1-p)\cdot 0.8 \\ (1-p)\cdot 0.2 \;+\; p\cdot 0.8 \end{bmatrix}. $$

  • Top component

$$ p \times 0.2 \;+\; (1-p)\times 0.8 \;=\; 0.2p \;+\; 0.8 \;-\; 0.8p \;=\; 0.8 \;-\; 0.6p. $$

  • Bottom component

$$ (1-p)\times 0.2 \;+\; p\times 0.8 \;=\; 0.2 \;-\; 0.2p \;+\; 0.8p \;=\; 0.2 \;+\; 0.6p. $$

Hence,

$$ Q_1 = \begin{bmatrix} 0.8 - 0.6p \\ 0.2 + 0.6p \end{bmatrix}. $$

b)

$$ P = \begin{bmatrix} p & 1-p\\ 1-p & p \end{bmatrix} $$

when $n = 1$

$$ (2p - 1)^1 = 2p - 1 $$

$$ \frac{1}{2} + \frac{1}{2}(2p - 1) = p,\quad \frac{1}{2} - \frac{1}{2}(2p - 1) = 1 - p $$

$$ P^1 = \begin{bmatrix} p & 1 - p\\ 1 - p & p \end{bmatrix} = \begin{bmatrix} \frac{1}{2} + \frac{1}{2}(2p - 1) & \frac{1}{2} - \frac{1}{2}(2p - 1)\\[6pt] \frac{1}{2} - \frac{1}{2}(2p - 1) & \frac{1}{2} + \frac{1}{2}(2p - 1) \end{bmatrix} $$

Inductive Step:

Assume

$$ P^n = \begin{bmatrix} A_n & B_n\\ B_n & A_n \end{bmatrix},\quad A_n = \frac{1}{2} + \frac{1}{2}(2p - 1)^n,\quad B_n = \frac{1}{2} - \frac{1}{2}(2p - 1)^n $$

Then

$$ P^{n+1} = P^n P = \begin{bmatrix} A_n & B_n\\ B_n & A_n \end{bmatrix} \begin{bmatrix} p & 1 - p\\ 1 - p & p \end{bmatrix} $$

Top-left entry:

Let $x = (2p - 1)^n$

$$ A_n p + B_n (1 - p) = \left( \frac{1}{2} + \frac{1}{2}x \right)p + \left( \frac{1}{2} - \frac{1}{2}x \right)(1 - p) $$

$$ = \frac{p}{2} + \frac{px}{2} + \frac{1 - p}{2} - \frac{(1 - p)x}{2} = \frac{1}{2} + \frac{x}{2}(2p - 1) = \frac{1}{2} + \frac{1}{2}(2p - 1)^{n+1} = A_{n+1} $$

Top-right entry:

$$ A_n (1 - p) + B_n p = \left( \frac{1}{2} + \frac{1}{2}x \right)(1 - p) + \left( \frac{1}{2} - \frac{1}{2}x \right)p $$

$$ = \frac{1}{2} - \frac{x}{2}(2p - 1) = \frac{1}{2} - \frac{1}{2}(2p - 1)^{n+1} = B_{n+1} $$

By symmetry:

$$ P^{n+1} = \begin{bmatrix} A_{n+1} & B_{n+1}\\ B_{n+1} & A_{n+1} \end{bmatrix} = \begin{bmatrix} \frac{1}{2} + \frac{1}{2}(2p - 1)^{n+1} & \frac{1}{2} - \frac{1}{2}(2p - 1)^{n+1}\\[6pt] \frac{1}{2} - \frac{1}{2}(2p - 1)^{n+1} & \frac{1}{2} + \frac{1}{2}(2p - 1)^{n+1} \end{bmatrix} $$

Result:

$$ \boxed{ P^n = \begin{bmatrix} \frac{1}{2} + \frac{1}{2}(2p - 1)^n & \frac{1}{2} - \frac{1}{2}(2p - 1)^n\\[6pt] \frac{1}{2} - \frac{1}{2}(2p - 1)^n & \frac{1}{2} + \frac{1}{2}(2p - 1)^n \end{bmatrix} \quad \text{for all } n \ge 1 } $$
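A quick numerical spot-check of the closed form (a sketch with an arbitrary choice of $p$; any $p \in [0, 1]$ works):

```python
import numpy as np

# Compare the closed form for P^n against direct matrix powers.
p = 0.7
P = np.array([[p, 1 - p],
              [1 - p, p]])

def P_closed_form(n: int) -> np.ndarray:
    a = 0.5 + 0.5 * (2 * p - 1) ** n
    b = 0.5 - 0.5 * (2 * p - 1) ** n
    return np.array([[a, b], [b, a]])

for n in range(1, 6):
    assert np.allclose(np.linalg.matrix_power(P, n), P_closed_form(n))
print("closed form matches P^n for n = 1..5")
```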

c)

$$ P = \begin{bmatrix} p & 1 - p\\ 1 - p & p \end{bmatrix} $$

$$ \mu^T P = \mu^T, \qquad \mu = [\alpha,\;1 - \alpha] $$

$$ \mu^T P = \begin{bmatrix} \alpha & 1 - \alpha \end{bmatrix} \begin{bmatrix} p & 1 - p\\ 1 - p & p \end{bmatrix} = \begin{bmatrix} \alpha p + (1 - \alpha)(1 - p) & \alpha(1 - p) + (1 - \alpha)p \end{bmatrix} $$

$$ \alpha = \alpha p + (1 - \alpha)(1 - p) = \alpha p + 1 - p - \alpha + \alpha p = 1 - p + 2\alpha p - \alpha $$

$$ \Rightarrow\quad 2\alpha - 2\alpha p = 1 - p \quad\Rightarrow\quad 2\alpha(1 - p) = 1 - p \quad\Rightarrow\quad \alpha = \frac{1}{2}\;\;(\text{for } p \neq 1),\qquad \mu = \left[ \frac{1}{2},\;\frac{1}{2} \right] $$

d)

For $0 < p < 1$ we have $|2p - 1| < 1$, so $(2p - 1)^n \to 0$ and

$$ \lim_{n \to \infty} \pi(n) = \mu = \left[ \frac{1}{2},\;\frac{1}{2} \right], $$

$$ \lim_{n \to \infty} P^n = \mathbf{1}\,\mu^{\top} = \begin{bmatrix} \mu(1) & \mu(2) \\ \mu(1) & \mu(2) \end{bmatrix} = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{bmatrix} $$

$$ \boxed{ \mu = \left[ \frac{1}{2},\;\frac{1}{2} \right], \qquad \lim_{n \to \infty} P^n = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{bmatrix} } $$
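Reusing `P` from the sketch above, both the limit and the stationary distribution can be checked numerically:

```python
# With 0 < p < 1 we have |2p - 1| < 1, so P^n converges; numerical check for p = 0.7.
print(np.linalg.matrix_power(P, 50))   # ~[[0.5, 0.5], [0.5, 0.5]]
print(np.array([0.5, 0.5]) @ P)        # stationary distribution maps to itself: [0.5, 0.5]
```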

Q3

$$ \text{Given: } \gamma = 0.5,\quad T = 5,\quad R_1 = -1,\; R_2 = 2,\; R_3 = 6,\; R_4 = 3,\; R_5 = 2 $$

$$ G_5 = 0 \quad \text{(no rewards remain after the final step)} $$

$$ G_4 = R_5 = 2 $$

$$ G_3 = R_4 + \gamma R_5 = 3 + 0.5 \times 2 = 4 $$

$$ G_2 = R_3 + \gamma R_4 + \gamma^2 R_5 = 6 + 0.5 \times 3 + 0.5^2 \times 2 = 6 + 1.5 + 0.5 = 8 $$

$$ G_1 = R_2 + \gamma R_3 + \gamma^2 R_4 + \gamma^3 R_5 = 2 + 0.5 \times 6 + 0.5^2 \times 3 + 0.5^3 \times 2 = 2 + 3 + 0.75 + 0.25 = 6 $$

$$ \begin{align*} G_0 &= R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \gamma^4 R_5 \\ &= -1 + 0.5 \times 2 + 0.5^2 \times 6 + 0.5^3 \times 3 + 0.5^4 \times 2 \\ &= -1 + 1 + 1.5 + 0.375 + 0.125 \\ &= 2 \end{align*} $$

$$ \boxed{ G_0 = 2,\quad G_1 = 6,\quad G_2 = 8,\quad G_3 = 4,\quad G_4 = 2,\quad G_5 = 0 } $$
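The same returns follow from the backward recursion $G_t = R_{t+1} + \gamma\,G_{t+1}$; a short sketch:

```python
# Backward recursion for the returns; rewards R_1..R_5 arrive at steps 1..5 and G_5 = 0.
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]          # R_1, ..., R_5

G = [0.0] * (len(rewards) + 1)      # G[5] = 0
for t in range(len(rewards) - 1, -1, -1):
    G[t] = rewards[t] + gamma * G[t + 1]

print(G)   # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```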

Q4

Consider an episode or trajectory $\tau$. The return starting at time $t$ is defined by:

$$ G_t(\tau) = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots $$


Adding a Constant $c$ to Each Reward

Now define a modified reward at each step:

$$ R_k' = R_k + c $$

for all $k$. Then the new return is:

$$ G_t'(\tau) = R_{t+1}' + \gamma R_{t+2}' + \gamma^2 R_{t+3}' + \dots $$

Substitute $R_k' = R_k + c$:

$$ G_t'(\tau) = (R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots) + (c + \gamma c + \gamma^2 c + \dots) $$

$$ = G_t(\tau) + \left(c + \gamma c + \gamma^2 c + \dots\right) $$

Summation of the Constant Series

For $0 \le \gamma < 1$, the constant terms form a convergent geometric series:

$$ c + \gamma c + \gamma^2 c + \dots = c \sum_{k=0}^{\infty} \gamma^k = \frac{c}{1 - \gamma} $$

Thus:

$$ G_t'(\tau) = G_t(\tau) + \frac{c}{1 - \gamma} $$

Because every trajectory’s return is shifted by the same constant $\tfrac{c}{1 - \gamma}$:

  • The ordering of returns across trajectories does not change
  • The optimal policy remains the same
  • Only differences in rewards matter, not their absolute values

$$ \boxed{ G_t'(\tau) = G_t(\tau) + \frac{c}{1 - \gamma} } $$
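A quick numerical illustration of the result (a sketch with an arbitrary reward sequence and constants; the infinite sum is truncated at a long finite horizon, so the shift matches $c/(1-\gamma)$ only approximately):

```python
# Shifting every reward by c shifts the discounted return by ~ c / (1 - gamma).
gamma, c = 0.9, 2.0
rewards = [1.0, -3.0, 0.5] * 200          # arbitrary reward sequence, long horizon

ret         = sum(gamma**k * r       for k, r in enumerate(rewards))
ret_shifted = sum(gamma**k * (r + c) for k, r in enumerate(rewards))

print(ret_shifted - ret)    # ~20.0
print(c / (1 - gamma))      # 20.0
```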

Q5

a)

We are given:

$$ \pi(1 \mid s) = 1 - \frac{\varepsilon}{2} = 0.6 $$

$$ \frac{\varepsilon}{2} = 0.4 \quad\Rightarrow\quad \varepsilon = 0.8 $$

$$ \boxed{\varepsilon = 0.8} $$

b)

Environment Details:

Transitions:

$$ \begin{aligned} &\text{State 0:} \\ &\quad \text{Action 0} \to \text{State 0, Reward } -1 \\ &\quad \text{Action 1} \to \text{State 1, Reward } +1 \\ &\text{State 1:} \\ &\quad \text{Action 0} \to \text{State 1, Reward } +1 \\ &\quad \text{Action 1} \to \text{State 0, Reward } -1 \end{aligned} $$

Policy:

$$ \pi(1 \mid 0) = 0.6,\quad \pi(0 \mid 0) = 0.4,\quad \pi(0 \mid 1) = 0.6,\quad \pi(1 \mid 1) = 0.4 $$

Discount factor:

$$ \gamma = 0.1 $$

Bellman Equation for State 0:

$$ \begin{aligned} v(0) &= \pi(0 \mid 0)[-1 + \gamma v(0)] + \pi(1 \mid 0)[1 + \gamma v(1)] \\ &= 0.4[-1 + 0.1 v(0)] + 0.6[1 + 0.1 v(1)] \\ &= -0.4 + 0.04 v(0) + 0.6 + 0.06 v(1) \\ &= 0.2 + 0.04 v(0) + 0.06 v(1) \end{aligned} $$

Rearranged:

$$ 0.96 v(0) = 0.2 + 0.06 v(1) \quad \Rightarrow \quad v(0) = \frac{0.2 + 0.06 v(1)}{0.96} $$

Bellman Equation for State 1:

$$ \begin{aligned} v(1) &= \pi(0 \mid 1)[1 + \gamma v(1)] + \pi(1 \mid 1)[-1 + \gamma v(0)] \\ &= 0.6[1 + 0.1 v(1)] + 0.4[-1 + 0.1 v(0)] \\ &= 0.6 + 0.06 v(1) - 0.4 + 0.04 v(0) \\ &= 0.2 + 0.06 v(1) + 0.04 v(0) \end{aligned} $$

Rearranged:

$$ 0.94 v(1) = 0.2 + 0.04 v(0) \quad \Rightarrow \quad v(1) = \frac{0.2 + 0.04 v(0)}{0.94} $$

Solving the system: the two equations are symmetric, so $v(0) = v(1) = v$ with $v = 0.2 + 0.1v$, i.e. $0.9\,v = 0.2$:

$$ v(0) = v(1) = \frac{2}{9} \approx 0.2222 $$

Action-Value Functions:

$$ q(s, a) = R(s, a) + \gamma \cdot v(s') $$

State 0:

$$ \begin{aligned} q(0,0) &= -1 + 0.1 \cdot v(0) \approx -1 + 0.0222 = -0.9778 \\ q(0,1) &= 1 + 0.1 \cdot v(1) \approx 1 + 0.0222 = 1.0222 \end{aligned} $$

State 1:

$$ \begin{aligned} q(1,0) &= 1 + 0.1 \cdot v(1) \approx 1.0222 \\ q(1,1) &= -1 + 0.1 \cdot v(0) \approx -0.9778 \end{aligned} $$

Final Answer:

$$ \boxed{ v(0) \approx v(1) \approx 0.2222,\quad q(0,0) \approx -0.9778,\quad q(0,1) \approx 1.0222,\quad q(1,0) \approx 1.0222,\quad q(1,1) \approx -0.9778 } $$
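The two rearranged Bellman equations form a $2 \times 2$ linear system, which a short NumPy sketch can solve directly:

```python
import numpy as np

# Bellman equations rearranged:  0.96 v(0) - 0.06 v(1) = 0.2,  -0.04 v(0) + 0.94 v(1) = 0.2
A = np.array([[ 0.96, -0.06],
              [-0.04,  0.94]])
b = np.array([0.2, 0.2])
v = np.linalg.solve(A, b)
print(v)                                  # [0.2222..., 0.2222...]

gamma = 0.1
q = {(0, 0): -1 + gamma * v[0], (0, 1): 1 + gamma * v[1],
     (1, 0):  1 + gamma * v[1], (1, 1): -1 + gamma * v[0]}
print(q)                                  # matches the boxed values above
```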

c)

Bellman Optimality Derivation

Environment Details:

$$ \begin{aligned} \text{State 0:} &\quad \begin{cases} \text{Action 0: } R = -1,\; s' = 0 \\ \text{Action 1: } R = +1,\; s' = 1 \end{cases} \\[1mm] \text{State 1:} &\quad \begin{cases} \text{Action 0: } R = +1,\; s' = 1 \\ \text{Action 1: } R = -1,\; s' = 0 \end{cases} \end{aligned} $$

Discount factor:

$$ \gamma = 0.1 $$

Bellman Optimality Equations:

For state 0:

$$ V^*(0) = \max \left\{ -1 + 0.1 \cdot V^*(0),\;\; 1 + 0.1 \cdot V^*(1) \right\} $$

For state 1:

$$ V^*(1) = \max \left\{ 1 + 0.1 \cdot V^*(1),\;\; -1 + 0.1 \cdot V^*(0) \right\} $$

Solving for Optimal Actions:

From state 1, take action 0 as the candidate maximizer:

$$ V^*(1) = 1 + 0.1 \cdot V^*(1) \quad \Rightarrow \quad 0.9\,V^*(1) = 1 \quad \Rightarrow \quad V^*(1) = \frac{1}{0.9} = \frac{10}{9} = 1.\overline{1} $$

Then from state 0, taking action 1:

$$ V^*(0) = 1 + 0.1 \cdot V^*(1) = 1 + 0.1 \cdot \frac{10}{9} = 1 + \frac{1}{9} = \frac{10}{9} = 1.\overline{1} $$

(Consistency check: in both states the alternative action gives $-1 + 0.1\,V^*(0) = -\frac{8}{9} < \frac{10}{9}$, so the chosen actions are indeed the maximizers.)

Optimal Action-Value Functions:

State 0:

$$ \begin{aligned} Q^*(0,0) &= -1 + 0.1 \cdot V^*(0) = -1 + 0.1 \cdot \frac{10}{9} = -\frac{8}{9} \approx -0.8889 \\ Q^*(0,1) &= 1 + 0.1 \cdot V^*(1) = 1 + 0.1 \cdot \frac{10}{9} = \frac{10}{9} \approx 1.1111 \end{aligned} $$

State 1:

$$ \begin{aligned} Q^*(1,0) &= 1 + 0.1 \cdot V^*(1) = \frac{10}{9} \approx 1.1111 \\ Q^*(1,1) &= -1 + 0.1 \cdot V^*(0) = -\frac{8}{9} \approx -0.8889 \end{aligned} $$

Optimal Policy:

$$ \pi^*(1 \mid 0) = 1, \quad \pi^*(0 \mid 1) = 1 $$

Final Answer:

$$ \boxed{ \begin{aligned} V^*(0) &= \frac{10}{9} \approx 1.1111,\quad V^*(1) = \frac{10}{9} \approx 1.1111 \\[1mm] Q^*(0,0) &= -\frac{8}{9} \approx -0.8889,\quad Q^*(0,1) = \frac{10}{9} \approx 1.1111 \\[1mm] Q^*(1,0) &= \frac{10}{9} \approx 1.1111,\quad Q^*(1,1) = -\frac{8}{9} \approx -0.8889 \\[1mm] \pi^*(1 \mid 0) &= 1,\quad \pi^*(0 \mid 1) = 1 \end{aligned} } $$
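The same fixed point can be reached by value iteration; a minimal sketch for this two-state MDP, where the `transitions` dictionary simply encodes the table above:

```python
import numpy as np

# transitions[s][a] = (reward, next_state), as listed in the environment details.
transitions = {0: {0: (-1, 0), 1: (1, 1)},
               1: {0: ( 1, 1), 1: (-1, 0)}}
gamma = 0.1

V = np.zeros(2)
for _ in range(100):                       # plenty of sweeps for gamma = 0.1
    V = np.array([max(r + gamma * V[s2] for r, s2 in transitions[s].values())
                  for s in (0, 1)])

Q = {(s, a): r + gamma * V[s2]
     for s in (0, 1) for a, (r, s2) in transitions[s].items()}
print(V)   # ~[1.1111, 1.1111]
print(Q)   # Q*(0,1) = Q*(1,0) ~ 1.1111,  Q*(0,0) = Q*(1,1) ~ -0.8889
```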

d)

Trajectory 1 (Start from state 0)

$$ 0 \xrightarrow{a=0,\;r=-1} 0 \xrightarrow{a=1,\;r=+1} 1 \xrightarrow{a=0,\;r=+1} \text{(terminal)} $$

$$ \begin{align*} G_0 &= -1 + 0.1(1) + 0.01(1) = -0.89 \\ G_1 &= 1 + 0.1(1) = 1.1 \\ G_2 &= 1 \end{align*} $$

Trajectory 2 (Start from state 1)

$$ 1 \xrightarrow{a=0,\;r=+1} 1 \xrightarrow{a=1,\;r=-1} 0 \xrightarrow{a=1,\;r=+1} \text{(terminal)} $$

$$ \begin{align*} G_0 &= 1 - 0.1 + 0.01 = 0.91 \\ G_1 &= -1 + 0.1 = -0.9 \\ G_2 &= 1 \end{align*} $$

Monte Carlo Updates with $\alpha = 0.6$

State-Value Updates

Initial: $V(s) = 0$

From Trajectory 1:

$$ \begin{align*} V(0) &\leftarrow 0 + 0.6(-0.89 - 0) = -0.534 \\ V(0) &\leftarrow -0.534 + 0.6(1.1 - (-0.534)) = 0.4464 \\ V(1) &\leftarrow 0 + 0.6(1 - 0) = 0.6 \end{align*} $$

From Trajectory 2:

$$ \begin{align*} V(1) &\leftarrow 0.6 + 0.6(0.91 - 0.6) = 0.786 \\ V(1) &\leftarrow 0.786 + 0.6(-0.9 - 0.786) = -0.2256 \\ V(0) &\leftarrow 0.4464 + 0.6(1 - 0.4464) = 0.77856 \end{align*} $$

Action-Value Updates

Initial: $Q(s,a) = 0$

From Trajectory 1:

$$ \begin{align*} Q(0,0) &\leftarrow 0 + 0.6(-0.89) = -0.534 \\ Q(0,1) &\leftarrow 0 + 0.6(1.1) = 0.66 \\ Q(1,0) &\leftarrow 0 + 0.6(1) = 0.6 \end{align*} $$

From Trajectory 2:

$$ \begin{align*} Q(1,0) &\leftarrow 0.6 + 0.6(0.91 - 0.6) = 0.786 \\ Q(1,1) &\leftarrow 0 + 0.6(-0.9) = -0.54 \\ Q(0,1) &\leftarrow 0.66 + 0.6(1 - 0.66) = 0.864 \end{align*} $$

Final Estimates

State-Value Function

$$ V(0) \approx 0.7786, \quad V(1) \approx -0.2256 $$

Action-Value Function

$$ \begin{align*} Q(0,0) &\approx -0.534 \\ Q(0,1) &\approx 0.864 \\ Q(1,0) &\approx 0.786 \\ Q(1,1) &\approx -0.54 \end{align*} $$
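A minimal sketch of these every-visit Monte Carlo updates with a constant step size; each trajectory is written as a list of `(state, action, reward)` triples in the order visited:

```python
alpha, gamma = 0.6, 0.1
trajectories = [[(0, 0, -1), (0, 1, 1), (1, 0, 1)],     # Trajectory 1
                [(1, 0,  1), (1, 1, -1), (0, 1, 1)]]    # Trajectory 2

V = {0: 0.0, 1: 0.0}
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

for traj in trajectories:
    # Returns for this episode, computed backwards: G_t = r_{t+1} + gamma * G_{t+1}.
    G, returns = 0.0, []
    for _, _, r in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # Constant-step-size updates toward the observed returns.
    for (s, a, _), g in zip(traj, returns):
        V[s]      += alpha * (g - V[s])
        Q[(s, a)] += alpha * (g - Q[(s, a)])

print(V)   # {0: 0.77856, 1: -0.2256}
print(Q)   # Q(0,0)=-0.534, Q(0,1)=0.864, Q(1,0)=0.786, Q(1,1)=-0.54
```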

e)

Q-learning formula:

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] $$

Trajectory 1

Time $t=0$:

$$ Q(0,0) \leftarrow 0 + 0.6 \left[ -1 + 0.1 \cdot 0 - 0 \right] = -0.6 $$

Time $t=1$:

$$ Q(0,1) \leftarrow 0 + 0.6 \left[ 1 + 0.1 \cdot 0 - 0 \right] = 0.6 $$

Time $t=2$:

$$ Q(1,0) \leftarrow 0 + 0.6 \left[ 1 + 0.1 \cdot 0 - 0 \right] = 0.6 $$

Trajectory 2

Time $t=0$:

$$ \begin{align*} \max_{a'} Q(1,a') &= \max\{0.6, 0\} = 0.6 \\ Q(1,0) &\leftarrow 0.6 + 0.6 \left[ 1 + 0.1 \cdot 0.6 - 0.6 \right] \\ &= 0.6 + 0.6 \cdot (1.06 - 0.6) = 0.6 + 0.6 \cdot 0.46 = 0.876 \end{align*} $$

Time $t=1$:

$$ \begin{align*} \max_{a'} Q(0,a') &= \max\{-0.6, 0.6\} = 0.6 \\ Q(1,1) &\leftarrow 0 + 0.6 \left[ -1 + 0.1 \cdot 0.6 - 0 \right] \\ &= 0 + 0.6 \cdot (-0.94) = -0.564 \end{align*} $$

Time $t=2$:

$$ \begin{align*} Q(0,1) &\leftarrow 0.6 + 0.6 \left[ 1 + 0.1 \cdot 0 - 0.6 \right] \\ &= 0.6 + 0.6 \cdot (1 - 0.6) = 0.6 + 0.24 = 0.84 \end{align*} $$

Final Answer

$$ \boxed{\begin{aligned} Q(0,0) &\approx -0.6 \\[3pt] Q(0,1) &\approx 0.84 \\[3pt] Q(1,0) &\approx 0.876 \\[3pt] Q(1,1) &\approx -0.564 \end{aligned}} $$
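The same updates as a small sketch, replaying the two trajectories with the Q-learning rule (the bootstrap term is 0 at the terminal step):

```python
alpha, gamma = 0.6, 0.1
trajectories = [[(0, 0, -1), (0, 1, 1), (1, 0, 1)],     # (state, action, reward)
                [(1, 0,  1), (1, 1, -1), (0, 1, 1)]]

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
for traj in trajectories:
    for i, (s, a, r) in enumerate(traj):
        if i + 1 < len(traj):
            s_next = traj[i + 1][0]
            target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
        else:
            target = r                                   # next state is terminal
        Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q)   # Q(0,0)=-0.6, Q(0,1)=0.84, Q(1,0)=0.876, Q(1,1)=-0.564
```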

f)

we have

  • All $Q(s,a) = 0$
  • Initial policy: $\pi(1\mid0) = \pi(0\mid1) = 0.5$
  • $\varepsilon = 0.8$, so the greedy probability is 0.6
  • Tie-breaking: in state 0, action 1 is greedy; in state 1, action 0 is greedy
  • Parameters: $\alpha=0.6,\ \gamma=0.1$
  • Using the same random number sequence as in part (d):
    $0.94,\ 0.27,\ 0.30,\ 0.18,\ 0.75$

The trajectory is:

$$ (0,0,r=-1) \rightarrow (0,1,r=+1) \rightarrow (1,0,r=+1) \rightarrow (1,0,r=+1) \rightarrow (1,1,r=-1) \rightarrow \text{terminal} $$

SARSA Update Rule

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s', a') - Q(s,a) \right] $$

All $Q$ values are initialized to 0.

Update Steps

  • $t=0$: $(s=0, a=0, r=-1)$, next: $(s'=0, a'=1)$

    $$ Q(0,0) \leftarrow 0 + 0.6 \cdot \left[-1 + 0.1 \cdot Q(0,1) - 0 \right] = -0.6 $$

  • $t=1$: $(s=0, a=1, r=+1)$, next: $(s'=1, a'=0)$

    $$ Q(0,1) \leftarrow 0 + 0.6 \cdot \left[1 + 0.1 \cdot Q(1,0) - 0 \right] = 0.6 $$

  • $t=2$: $(s=1, a=0, r=+1)$, next: $(s'=1, a'=0)$

    $$ Q(1,0) \leftarrow 0 + 0.6 \cdot \left[1 + 0.1 \cdot Q(1,0) - 0 \right] = 0.6 $$

  • $t=3$: $(s=1, a=0, r=+1)$, next: $(s'=1, a'=1)$

    $$ Q(1,0) \leftarrow 0.6 + 0.6 \cdot \left[1 + 0.1 \cdot Q(1,1) - 0.6 \right] = 0.6 + 0.6 \cdot 0.4 = 0.84 $$

  • $t=4$: $(s=1, a=1, r=-1)$, terminal

    $$ Q(1,1) \leftarrow 0 + 0.6 \cdot \left[-1 + 0 - 0 \right] = -0.6 $$

Final SARSA Q-values

$$ \boxed{ \begin{aligned} Q(0,0) &\approx -0.6 \\ Q(0,1) &\approx 0.6 \\ Q(1,0) &\approx 0.84 \\ Q(1,1) &\approx -0.6 \end{aligned} } $$
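A matching sketch of the SARSA updates, bootstrapping on the action actually taken at the next step:

```python
alpha, gamma = 0.6, 0.1
episode = [(0, 0, -1), (0, 1, 1), (1, 0, 1), (1, 0, 1), (1, 1, -1)]   # (state, action, reward)

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
for i, (s, a, r) in enumerate(episode):
    if i + 1 < len(episode):
        s_next, a_next, _ = episode[i + 1]
        target = r + gamma * Q[(s_next, a_next)]   # bootstrap on the next action actually taken
    else:
        target = r                                 # terminal
    Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q)   # Q(0,0)=-0.6, Q(0,1)=0.6, Q(1,0)=0.84, Q(1,1)=-0.6
```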

Q6

a)

$$ P(S=s)\;=\;\sum_{b\in\{0,1\}}\sum_{r\in\{0,1\}}\sum_{a\in\{0,1\}} P(b)\,P(r\mid b)\,P(a\mid b)\,P(s\mid r,a). $$

We have

  • Prior on (B):

    $$ P(b=1)=0.5,\quad P(b=0)=0.5 $$

  • Conditional probabilities for (R):

    $$ \begin{aligned} P(r=1\mid b=1) &= 0.3,\quad P(r=0\mid b=1) = 0.7 \\ P(r=1\mid b=0) &= 0.8,\quad P(r=0\mid b=0) = 0.2 \end{aligned} $$

  • Conditional probabilities for (A):

    $$ \begin{aligned} P(a=1\mid b=1) &= 0.1,\quad P(a=0\mid b=1) = 0.9 \\ P(a=1\mid b=0) &= 0.5,\quad P(a=0\mid b=0) = 0.5 \end{aligned} $$

  • Conditional probabilities for (S):

    $$ \begin{aligned} P(s=1\mid r=1, a=1) &= 0 \\ P(s=1\mid r=1, a=0) &= 0.8 \\ P(s=1\mid r=0, a=1) &= 0.6 \\ P(s=1\mid r=0, a=0) &= 1.0 \end{aligned} $$

    $$ \begin{aligned} P(s=0\mid r=1, a=1) &= 1 \\ P(s=0\mid r=1, a=0) &= 0.2 \\ P(s=0\mid r=0, a=1) &= 0.4 \\ P(s=0\mid r=0, a=0) &= 0 \end{aligned} $$

Terms with $b = 1$ (evaluating the summand at $s = 1$):

$$ \begin{aligned} &0.5 \cdot 0.3 \cdot 0.1 \cdot 0 = 0 \\ &0.5 \cdot 0.3 \cdot 0.9 \cdot 0.8 = 0.108 \\ &0.5 \cdot 0.7 \cdot 0.1 \cdot 0.6 = 0.021 \\ &0.5 \cdot 0.7 \cdot 0.9 \cdot 1.0 = 0.315 \\ &\text{Sum for } b=1: \quad 0 + 0.108 + 0.021 + 0.315 = 0.444 \end{aligned} $$

Terms with $b = 0$:

$$ \begin{aligned} &0.5 \cdot 0.8 \cdot 0.5 \cdot 0 = 0 \\ &0.5 \cdot 0.8 \cdot 0.5 \cdot 0.8 = 0.16 \\ &0.5 \cdot 0.2 \cdot 0.5 \cdot 0.6 = 0.03 \\ &0.5 \cdot 0.2 \cdot 0.5 \cdot 1.0 = 0.05 \\ &\text{Sum for } b=0: \quad 0 + 0.16 + 0.03 + 0.05 = 0.24 \end{aligned} $$

Final Result

$$ P(S=1) = 0.444 + 0.24 = \boxed{0.684} $$
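The marginal can also be checked by brute-force enumeration of the joint distribution; a small sketch with the CPTs stored as dictionaries:

```python
from itertools import product

# CPTs as given: P(B), P(R|B), P(A|B), P(S=1|R,A).
P_b           = {1: 0.5, 0: 0.5}
P_r_given_b   = {1: {1: 0.3, 0: 0.7}, 0: {1: 0.8, 0: 0.2}}
P_a_given_b   = {1: {1: 0.1, 0: 0.9}, 0: {1: 0.5, 0: 0.5}}
P_s1_given_ra = {(1, 1): 0.0, (1, 0): 0.8, (0, 1): 0.6, (0, 0): 1.0}

def joint(b, r, a, s):
    """Joint probability P(b, r, a, s) by the chain rule of the Bayes net."""
    p_s1 = P_s1_given_ra[(r, a)]
    return P_b[b] * P_r_given_b[b][r] * P_a_given_b[b][a] * (p_s1 if s == 1 else 1 - p_s1)

p_s1 = sum(joint(b, r, a, 1) for b, r, a in product((0, 1), repeat=3))
print(p_s1)   # 0.684
```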

b)

From the diagram’s CPT for $S$, once you know $R = r$ and $A = a$, $S$ depends only on $r$ and $a$. Hence the conditional probabilities $P(s \mid r, a)$ can be read directly from the $S$‐table:

$$ P(s=1 \mid r,a)= \begin{cases} 0 & (r=1,\,a=1)\\ 0.8 & (r=1,\,a=0)\\ 0.6 & (r=0,\,a=1)\\ 1.0 & (r=0,\,a=0) \end{cases} \quad \text{and} \quad P(s=0 \mid r,a)\;=\;1 \;-\; P(s=1 \mid r,a). $$

c)

We check whether $R$ and $A$ are conditionally independent given $S$, i.e. whether

$$ P(r,a \mid s) \;\stackrel{?}{=}\; P(r \mid s)\, P(a \mid s). $$

Use the Joint and Normalize

We compute:

$$ P(r,a \mid s) = \frac{P(r,a,s)}{P(s)} = \frac{\sum_{b\in\{0,1\}} P(b)\,P(r\mid b)\,P(a\mid b)\,P(s\mid r,a)}{P(s)} $$

Once we have the joint distribution $P(r,a \mid s)$, we can get the marginals by summing:

$$ P(r \mid s) = \sum_{a} P(r,a \mid s), \quad P(a \mid s) = \sum_{r} P(r,a \mid s) $$

Numerical Values for $s = 1$

From earlier computations:

$$ P(S=1) = 0.684 $$

and from the same summations,

  • $P(r=1,\; s=1) = 0.268$
  • $P(r=0,\; s=1) = 0.416$
  • $P(a=1,\; s=1) = 0.051$
  • $P(a=0,\; s=1) = 0.633$

Hence:

$$ P(r=1 \mid s=1) = \frac{0.268}{0.684} \approx 0.392 $$

$$ P(a=1 \mid s=1) = \frac{0.051}{0.684} \approx 0.075 $$

Check Conditional Independence

Note: $ P(r=1, a=1, s=1) = 0 $

Then:

$$ P(r=1, a=1 \mid s=1) = \frac{0}{0.684} = 0 $$

But:

$$ P(r=1 \mid s=1) \cdot P(a=1 \mid s=1) = (0.392)(0.075) \approx 0.029 \neq 0 $$

This shows directly that

$$ P(r,a \mid s) \neq P(r \mid s)\, P(a \mid s) $$
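The conditional-independence check can be reproduced with the `joint()` helper and `p_s1` from the sketch in part (a):

```python
# Marginals and the joint, all conditioned on s = 1.
p_r1_s1   = sum(joint(b, 1, a, 1) for b in (0, 1) for a in (0, 1))   # 0.268
p_a1_s1   = sum(joint(b, r, 1, 1) for b in (0, 1) for r in (0, 1))   # 0.051
p_r1a1_s1 = sum(joint(b, 1, 1, 1) for b in (0, 1))                   # 0.0

print(p_r1a1_s1 / p_s1)                       # 0.0
print((p_r1_s1 / p_s1) * (p_a1_s1 / p_s1))    # ~0.029  ->  R and A are not independent given S
```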

d)

Computing $ P(b \mid s=1) $

We want to compute the posterior probability of $B$ given $ S=1 $. This can be done using:

$$ P(b \mid s) = \frac{P(b, s)}{P(s)} = \frac{\sum_{r \in \{0,1\}} \sum_{a \in \{0,1\}} P(b)\,P(r \mid b)\,P(a \mid b)\,P(s \mid r, a)}{P(s)} $$

From earlier, we know that:

$$ P(s = 1) = 0.684 $$

Compute $ P(b=1, s=1)$

This involves summing over all combinations of $ r $ and $ a $:

  • $ r=1, a=1$: contributes 0 (since $ P(s=1 \mid r=1, a=1) = 0 $)
  • $r=1, a=0$:

    $$ 0.5 \cdot 0.3 \cdot 0.9 \cdot 0.8 = 0.108 $$

  • $r=0, a=1$:

    $$ 0.5 \cdot 0.7 \cdot 0.1 \cdot 0.6 = 0.021 $$

  • $r=0, a=0$:

    $$ 0.5 \cdot 0.7 \cdot 0.9 \cdot 1.0 = 0.315 $$

Adding these:

$$ P(b=1, s=1) = 0.108 + 0.021 + 0.315 = 0.444 $$

Compute $ P(b=0, s=1) $

Again summing over $r, a$ (the $r=1,\,a=1$ term is again 0):

  • $ r=1, a=0 $:

    $$ 0.5 \cdot 0.8 \cdot 0.5 \cdot 0.8 = 0.16 $$

  • $ r=0, a=1 $:

    $$ 0.5 \cdot 0.2 \cdot 0.5 \cdot 0.6 = 0.03 $$

  • $ r=0, a=0 $:

    $$ 0.5 \cdot 0.2 \cdot 0.5 \cdot 1.0 = 0.05 $$

Adding:

$$ P(b=0, s=1) = 0.16 + 0.03 + 0.05 = 0.24 $$

Now compute the posterior:

$$ P(b=1 \mid s=1) = \frac{0.444}{0.684} \approx 0.65 $$

$$ P(b=0 \mid s=1) = \frac{0.24}{0.684} \approx 0.35 $$

Final Answer:

$$ \boxed{ P(b=1 \mid s=1) \approx 0.65, \quad P(b=0 \mid s=1) \approx 0.35 } $$

e)

Sample $ B $

  • Prior: $ P(B = 1) = 0.5 $
  • Given $ u_1 = 0.94 $, since $ 0.94 > 0.5 $, we set:

    $$ B = 0 $$

Sample $ R \mid B = 0 $

  • Conditional: $ P(R = 1 \mid B = 0) = 0.8 $
  • Given $ u_2 = 0.27 $, since $ 0.27 < 0.8 $, we set:

    $$ R = 1 $$

Sample $ A \mid B = 0 $

  • Conditional: $ P(A = 1 \mid B = 0) = 0.5 $
  • Given $ u_3 = 0.30 $, since $ 0.30 < 0.5 $, we set:

    $$ A = 1 $$

Sample $ S \mid R = 1, A = 1 $

  • From the conditional table:
    $ P(S = 1 \mid R = 1, A = 1) = 0 \Rightarrow P(S = 0) = 1 $
  • So regardless of $ u_4 = 0.18 $, we must set:

    $$ S = 0 $$

Putting it all together:

$$ \boxed{(B, R, A, S) = (0,\; 1,\; 1,\; 0)} $$
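The same draw can be reproduced programmatically; a sketch of ancestral sampling that reuses the CPT dictionaries from the part (a) sketch, where each variable is set to 1 exactly when its uniform draw falls below the probability of the value 1:

```python
# Ancestral sampling with the given uniform draws u1..u4.
u = [0.94, 0.27, 0.30, 0.18]

b = int(u[0] < P_b[1])                  # 0.94 >= 0.5  ->  B = 0
r = int(u[1] < P_r_given_b[b][1])       # 0.27 <  0.8  ->  R = 1
a = int(u[2] < P_a_given_b[b][1])       # 0.30 <  0.5  ->  A = 1
s = int(u[3] < P_s1_given_ra[(r, a)])   # 0.18 >= 0.0  ->  S = 0

print((b, r, a, s))   # (0, 1, 1, 0)
```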

Q7

a)

b)
