# Q1

1. Given $n$ independent sample points $x_i,\ i = 1, \dots, n$ from a Laplace distribution with PDF
$$
f(x) = \frac{\lambda}{2} e^{-\lambda |x|},
$$
provide the MLE estimator for the parameter $\lambda$.

1. The log-likelihood for $n$ i.i.d. samples $x_1,\dots,x_n$ from the Laplace distribution is
$$
f(x)=\frac{\lambda}{2} e^{-\lambda|x|}
\quad\Rightarrow\quad
\ell(\lambda) =\sum_{i=1}^n \ln\left(\frac{\lambda}{2}\right) - \lambda\sum_{i=1}^n |x_i| = n \ln(\lambda) - n\ln(2) - \lambda \sum_{i=1}^n |x_i|.
$$
2. Differentiate and set to zero:
$$
\frac{d\,\ell(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n |x_i| = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{n}{\displaystyle \sum_{i=1}^n |x_i|}.
$$
3. The second derivative $\frac{d^2\,\ell(\lambda)}{d\lambda^2}=-\frac{n}{\lambda^2}<0$ confirms a maximum.

Thus, the MLE for $\lambda$ is
$$
\boxed{\hat{\lambda} = \frac{n}{\sum_{i=1}^n |x_i|}}.
$$

# Q2

2. Maximum entropy is often used to select an a priori distribution when there is very little information about the RV. Consider the Bernoulli distribution
$$
p(x) = \begin{cases} p, & x = 1 \\ 1 - p, & x = 0 \\ 0, & \text{otherwise.} \end{cases}
$$
Find its entropy $H(X)$ and prove analytically that $H(X)$ achieves its maximum when $p = \frac{1}{2}$.

1. For a Bernoulli random variable $X$ with $\Pr(X=1)=p$ and $\Pr(X=0)=1-p$, the entropy is
$$
H(X)=-p\ln p-(1-p)\ln(1-p).
$$
2. Differentiate and set to zero:
$$
\frac{d\,H}{dp} = -\ln p - 1 + \ln(1-p) + 1 = \ln\!\bigl(\tfrac{1-p}{p}\bigr) = 0
\quad\Longrightarrow\quad
\tfrac{1-p}{p}=1
\quad\Longrightarrow\quad
p=\tfrac12.
$$
The second derivative $\tfrac{d^2\,H}{dp^2}=-\tfrac{1}{p} - \tfrac{1}{1-p}$ is negative for $0<p<1$, so the stationary point $p=\tfrac12$ is a maximum.

# Q3

3. A multivariate logistic regression model has been built to predict the propensity of shoppers to perform a repeat purchase of a free gift that they are given. The input features used by the model are the age of the customer, the socioeconomic band to which the customer belongs (a, b, or c), the average amount of money the customer spends on each visit to the shop, and the average number of visits the customer makes to the shop per week. This model is being used by the marketing department to determine who should be given the free gift. The weights in the trained model are shown in the following table. One-hot encoding is used for the socioeconomic bands, with band a as the reference level, i.e., band a is encoded as $b = c = 0$.

| Feature              | Weight   |
| -------------------- | -------- |
| Intercept ($w[0]$)   | -3.82398 |
| AGE                  | -0.02909 |
| Socioeconomic BAND B | -0.09089 |
| Socioeconomic BAND C | -0.19558 |
| SHOP VALUE           | 0.02999  |
| SHOP FREQUENCY       | 0.74572  |

Use this model to make predictions for each of the following query instances.
| ID | AGE | Socioeconomic BAND | SHOP FREQUENCY | SHOP VALUE |
| -- | --- | ------------------ | -------------- | ---------- |
| 1  | 56  | b                  | 1.60           | 109.32     |
| 2  | 21  | c                  | 4.92           | 11.28      |
| 3  | 48  | b                  | 1.21           | 161.19     |
| 4  | 37  | c                  | 0.72           | 170.65     |
| 5  | 32  | a                  | 1.08           | 165.39     |

```python
import math

# Given weights
w0 = -3.82398
w_age = -0.02909
w_band_b = -0.09089
w_band_c = -0.19558
w_value = 0.02999
w_frequency = 0.74572

# Helper function to compute the sigmoid
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (Age, Band, ShopFrequency, ShopValue), matching the query table above
queries = [
    (56, 'b', 1.60, 109.32),
    (21, 'c', 4.92, 11.28),
    (48, 'b', 1.21, 161.19),
    (37, 'c', 0.72, 170.65),
    (32, 'a', 1.08, 165.39)
]

# One-hot encoding for socioeconomic band: a => (0,0), b => (1,0), c => (0,1)
def encode_band(band):
    if band == 'a':
        return (0, 0)
    elif band == 'b':
        return (1, 0)
    elif band == 'c':
        return (0, 1)

# Compute predictions
for i, (age, band, freq, val) in enumerate(queries, start=1):
    b_enc, c_enc = encode_band(band)
    logit = (w0
             + w_age * age
             + w_band_b * b_enc
             + w_band_c * c_enc
             + w_value * val
             + w_frequency * freq)
    # Probability
    prob = sigmoid(logit)
    print(f"Query {i}: logit = {logit:.4f}, predicted probability = {prob:.4f}")
```

Query 1: logit = -1.0723, predicted probability = 0.2550
Query 2: logit = -0.6232, predicted probability = 0.3490
Query 3: logit = 0.4252, predicted probability = 0.6047
Query 4: logit = 0.5588, predicted probability = 0.6362
Query 5: logit = 1.0106, predicted probability = 0.7331

Regression model:
$$
\text{logit} = w_0 + w_{\text{age}}\cdot\text{Age} + w_{\text{bandB}}\cdot I_{\{\text{band=B}\}} + w_{\text{bandC}}\cdot I_{\{\text{band=C}\}} + w_{\text{value}}\cdot\text{ShopValue} + w_{\text{freq}}\cdot\text{ShopFrequency}.
$$
The predicted probability is the sigmoid of this linear combination:
$$
p = \frac{1}{1 + e^{-\text{logit}}}.
$$
For each row, substitute the features (with the appropriate one-hot encoding for the band) into the logit and apply the sigmoid. The predicted probabilities for the five queries are:

- $p \approx 0.255$
- $p \approx 0.349$
- $p \approx 0.605$
- $p \approx 0.636$
- $p \approx 0.733$

# Q4

4. In building multivariate logistic regression models, it is recommended that all continuous input features be normalized to the range $[-1, 1]$. The following table shows a data quality report for the dataset used to train the model described in the last question.

### Data Quality Report

| Feature        | Count | Miss. | Card. | Min. | 1st Qrt. | Mean  | Median | 3rd Qrt. | Max.  | Std. Dev. |
| -------------- | ----- | ----- | ----- | ---- | -------- | ----- | ------ | -------- | ----- | --------- |
| AGE            | 5,200 | 6     | 40    | 18   | 22       | 32.7  | 32     | 32       | 63    | 12.2      |
| SHOP FREQUENCY | 5,200 | 0     | 316   | 0.2  | 1.0      | 2.2   | 1.3    | 1.3      | 4.3   | 5.4       |
| SHOP VALUE     | 5,200 | 0     | 3,730 | 5    | 11.8     | 101.9 | 100.14 | 100.14   | 174.6 | 230.7     |

| Feature            | Count | Miss. | Card. | Mode | Mode Count | Mode % |
| ------------------ | ----- | ----- | ----- | ---- | ---------- | ------ |
| SOCIOECONOMIC BAND | 5,200 | 8     | 3     | a    | 2,664      | 51.2   |
| REPEAT PURCHASE    | 5,200 | 0     | 2     | no   | 2,791      | 53.7   |

On the basis of the information in this report, all continuous features were normalized using range normalization, and missing values were imputed using the mean for continuous features and the mode for categorical features. After applying these data preparation operations, a multivariate logistic regression model was trained to give the weights shown in the following table.
### Model Weights

| Feature              | Weight  |
| -------------------- | ------- |
| Intercept ($w[0]$)   | 0.6679  |
| AGE                  | -0.5795 |
| SOCIOECONOMIC BAND B | -0.1981 |
| SOCIOECONOMIC BAND C | -0.2318 |
| SHOP VALUE           | 3.4091  |
| SHOP FREQUENCY       | 2.0499  |

Use this model to make predictions for each of the query instances shown in the following table (question marks refer to missing values).

### Query Instances

| ID | AGE | Socioeconomic BAND | SHOP FREQUENCY | SHOP VALUE |
| -- | --- | ------------------ | -------------- | ---------- |
| 1  | 38  | a                  | 1.90           | 165.39     |
| 2  | 56  | b                  | 1.60           | 109.32     |
| 3  | 18  | c                  | 6.00           | 10.09      |
| 4  | ?   | b                  | 1.33           | 204.62     |
| 5  | 62  | ?                  | 0.85           | 110.50     |

```python
import numpy as np
import pandas as pd

# Training-set statistics taken from the data quality report
min_age, max_age = 18, 63
min_shop_freq, max_shop_freq = 0.2, 4.3
min_shop_value, max_shop_value = 5, 174.6
mean_age = 32.7
mode_socioeconomic_band = "a"

# Model weights
intercept = 0.6679
weights = {
    "AGE": -0.5795,
    "SOCIOECONOMIC_BAND_B": -0.1981,
    "SOCIOECONOMIC_BAND_C": -0.2318,
    "SHOP_VALUE": 3.4091,
    "SHOP_FREQUENCY": 2.0499,
}

# Query data (before imputation and normalization)
query_instances = pd.DataFrame([
    [1, 38, 'a', 1.90, 165.39],
    [2, 56, 'b', 1.60, 109.32],
    [3, 18, 'c', 6.00, 10.09],
    [4, np.nan, 'b', 1.33, 204.62],
    [5, 62, np.nan, 0.85, 110.50],
], columns=["ID", "AGE", "SOCIOECONOMIC_BAND", "SHOP_FREQUENCY", "SHOP_VALUE"])

# Impute missing values: mean for continuous, mode for categorical
query_instances["AGE"] = query_instances["AGE"].fillna(mean_age)
query_instances["SOCIOECONOMIC_BAND"] = query_instances["SOCIOECONOMIC_BAND"].fillna(mode_socioeconomic_band)

# Range-normalize continuous features to [-1, 1] using the training min/max
query_instances["AGE"] = 2 * (query_instances["AGE"] - min_age) / (max_age - min_age) - 1
query_instances["SHOP_FREQUENCY"] = 2 * (query_instances["SHOP_FREQUENCY"] - min_shop_freq) / (max_shop_freq - min_shop_freq) - 1
query_instances["SHOP_VALUE"] = 2 * (query_instances["SHOP_VALUE"] - min_shop_value) / (max_shop_value - min_shop_value) - 1

# One-hot encode the categorical feature (band a is the reference level)
query_instances["SOCIOECONOMIC_BAND_B"] = (query_instances["SOCIOECONOMIC_BAND"] == 'b').astype(int)
query_instances["SOCIOECONOMIC_BAND_C"] = (query_instances["SOCIOECONOMIC_BAND"] == 'c').astype(int)

# Logistic regression prediction
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Linear combination of inputs and weights
linear_combination = (
    intercept
    + weights["AGE"] * query_instances["AGE"]
    + weights["SOCIOECONOMIC_BAND_B"] * query_instances["SOCIOECONOMIC_BAND_B"]
    + weights["SOCIOECONOMIC_BAND_C"] * query_instances["SOCIOECONOMIC_BAND_C"]
    + weights["SHOP_FREQUENCY"] * query_instances["SHOP_FREQUENCY"]
    + weights["SHOP_VALUE"] * query_instances["SHOP_VALUE"]
)

# Apply the sigmoid function to get probabilities
query_instances["PREDICTION"] = sigmoid(linear_combination)
print(query_instances[["ID", "PREDICTION"]])
```

- Query 1: $0.968$
- Query 2: $0.551$
- Query 3: $0.826$
- Query 4: $0.987$
- Query 5: $0.388$

# Q5

5. Consider training a logistic regression model. Assume $K = 2$ and $y = 0$ or $1$. The loss function is defined below:
$$
J(w) = -y \log(\sigma(z)) - (1 - y) \log(1 - \sigma(z)),
$$
where $z = w^T x$ and $x_0 = 1$.

## (a) Show that
$$
\nabla_w J(w) = -x(y - \sigma(z))
$$

$$
z = w^T x, \quad \sigma(z) = \frac{1}{1+e^{-z}}, \quad
J(w) = -\,y \,\log\!\bigl(\sigma(z)\bigr) \;-\; \bigl(1-y\bigr)\,\log\!\bigl(1-\sigma(z)\bigr).
$$
We want $\nabla_{w} J(w)$.
A standard result (or direct differentiation) shows:
$$
\frac{\partial}{\partial z}\,J(w)
= -\,\frac{y}{\sigma(z)}\,\sigma'(z) \;-\;\frac{1-y}{1-\sigma(z)}\,\bigl[-\sigma'(z)\bigr],
$$
where $\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$. Simplify:
$$
\frac{\partial}{\partial z}\,J(w)
= -\,y \,\frac{\sigma'(z)}{\sigma(z)} \;+\;(1-y)\,\frac{\sigma'(z)}{1-\sigma(z)}
= \sigma(z)\bigl(1-\sigma(z)\bigr)\!
\Bigl( -\,\frac{y}{\sigma(z)} \;+\; \frac{1-y}{1-\sigma(z)}\Bigr).
$$
A little algebra shows that the combination inside the parentheses equals $\dfrac{\sigma(z)-y}{\sigma(z)\,(1-\sigma(z))}$. Indeed:
$$
-\,y \,\frac{1}{\sigma(z)} + (1-y)\,\frac{1}{1-\sigma(z)}
= \frac{-\,y\,[1-\sigma(z)] + (1-y)\,\sigma(z)}{\sigma(z)\,\bigl(1-\sigma(z)\bigr)}
= \frac{\sigma(z) - y}{\sigma(z)\,\bigl(1-\sigma(z)\bigr)},
$$
so the prefactor cancels and
$$
\frac{\partial J}{\partial z} = \sigma(z) - y.
$$
Finally, by the chain rule with $z = w^T x$ and $\nabla_w z = x$:
$$
\nabla_{w}\,J(w) \;=\; \bigl[\sigma(z) - y\bigr]\,x \;=\; -\,x\,(y \;-\;\sigma(z)).
$$

## (b) Now change the encoding for $y$ to $y = -1$ or $1$. Show the result again. Explain the impact of this change.

With labels $y \in \{0, 1\}$ we had
$$
\nabla_{\mathbf{w}} J(\mathbf{w}) = -\,\mathbf{x}\,[y - \sigma(z)].
$$
Now we switch the labels to $y \in \{-1, +1\}$. Below is the new loss function and the corresponding gradient derivation.

### Loss Function ($y \in \{-1, +1\}$)

When $y$ takes values $\pm 1$, we can replace the original $\{0,1\}$ indicators with
$$
\mathbb{1}[\,y=+1\,] = \frac{1 + y}{2},
\quad
\mathbb{1}[\,y=-1\,] = \frac{1 - y}{2},
$$
so
$$
\boxed{
J(\mathbf{w})
= -\,\frac{1 + y}{2}\,\ln\bigl(\sigma(z)\bigr)
- \frac{1 - y}{2}\,\ln\bigl(1 - \sigma(z)\bigr),
}
$$
where $z = \mathbf{w}^\top \mathbf{x}$ and $\sigma(z) = \frac{1}{1+e^{-z}}$.

### Gradient Derivation

Preparation:
$$
z = \mathbf{w}^\top \mathbf{x}, \quad \sigma(z) = \frac{1}{1 + e^{-z}}.
$$
To compute the gradient of the loss function $J(\mathbf{w})$ with respect to $\mathbf{w}$, we first compute the derivative of $J$ with respect to $z$, and then multiply by $\frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}$:
$$
\nabla_{\mathbf{w}} J(\mathbf{w}) = \frac{\partial J}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}} = \left( \frac{\partial J}{\partial z} \right) \mathbf{x}.
$$

### Derivation of $\frac{\partial J}{\partial z}$

$$
\begin{aligned}
J(\mathbf{w}) &= -\frac{1 + y}{2}\,\ln\bigl(\sigma(z)\bigr) - \frac{1 - y}{2}\,\ln\bigl(1-\sigma(z)\bigr), \\
\frac{\partial J}{\partial z} &= -\frac{1 + y}{2}\,\underbrace{\frac{\partial}{\partial z}\bigl[\ln(\sigma(z))\bigr]}_{\frac{1}{\sigma(z)}\,\sigma'(z)} - \frac{1 - y}{2}\,\underbrace{\frac{\partial}{\partial z}\bigl[\ln(1-\sigma(z))\bigr]}_{\frac{1}{1-\sigma(z)}\cdot[-\sigma'(z)]}.
\end{aligned}
$$
With
$$
\sigma'(z) = \sigma(z)\,[1 - \sigma(z)],
$$
$$
\frac{\partial}{\partial z}\ln(\sigma(z)) = \frac{\sigma'(z)}{\sigma(z)} = \frac{\sigma(z)[1 - \sigma(z)]}{\sigma(z)} = 1 - \sigma(z),
$$
$$
\frac{\partial}{\partial z}\ln\bigl(1 - \sigma(z)\bigr) = \frac{-\,\sigma'(z)}{1 - \sigma(z)} = -\,\frac{\sigma(z)[1 - \sigma(z)]}{1 - \sigma(z)} = -\,\sigma(z),
$$
we can write the derivative as:
$$
\begin{aligned}
\frac{\partial J}{\partial z}
&= -\frac{1 + y}{2}\,\bigl[\,1 - \sigma(z)\bigr] - \frac{1 - y}{2}\,\bigl[-\,\sigma(z)\bigr] \\
&= -\frac{1 + y}{2}\,\bigl[\,1 - \sigma(z)\bigr] + \frac{1 - y}{2}\,\sigma(z).
\end{aligned}
$$
Expanding,
$$
\begin{aligned}
\frac{\partial J}{\partial z}
&= -\frac{1 + y}{2} + \frac{1 + y}{2}\,\sigma(z) + \frac{1 - y}{2}\,\sigma(z) \\
&= -\frac{1 + y}{2} + \sigma(z)\,\Bigl[\,\frac{1 + y}{2} + \frac{1 - y}{2}\Bigr] \\
&= -\frac{1 + y}{2} + \sigma(z)\,\Bigl[\frac{(1 + y)+(1 - y)}{2}\Bigr] \\
&= -\frac{1 + y}{2} + \sigma(z)\,\Bigl[\frac{2}{2}\Bigr] \\
&= -\frac{1 + y}{2} + \sigma(z) \\
&= \boxed{\sigma(z) - \frac{1 + y}{2}}.
\end{aligned}
$$

### Substituting into the Gradient

Using the chain rule:
$$
\nabla_{\mathbf{w}} J(\mathbf{w}) = \left( \sigma(z) - \frac{1 + y}{2} \right) \mathbf{x}.
$$
Alternatively, we can write it in the "negative" form:
$$
\nabla_{\mathbf{w}} J(\mathbf{w}) = -\,\mathbf{x}\,\Bigl[\frac{1 + y}{2} - \sigma(z)\Bigr].
$$

- **Original labels $\{0,1\}$**
  The gradient is:
  $$
  \nabla_{\mathbf{w}} J(\mathbf{w}) = -\,\mathbf{x}\,[y - \sigma(z)].
  $$
- **New labels $\{-1,+1\}$**
  The gradient is:
  $$
  \nabla_{\mathbf{w}} J(\mathbf{w}) = \left[ \sigma(z) - \frac{1 + y}{2} \right]\,\mathbf{x}.
  $$

Here, $\frac{1 + y}{2}$ equals 1 when $y = +1$ and equals 0 when $y = -1$, so the new gradient is exactly the old one with the label replaced by its $\{0,1\}$ encoding. Changing the encoding therefore does not change the update rule or the trained model; only the bookkeeping of the labels changes.

# Q6

6. Consider training a logistic regression model using least squared error (LSE). Assume $K = 2$, and the loss function is defined through the errors in probabilities as below:
$$
J(\beta) = \sum_{i=1}^{M} \left[ I_1(y^{(i)}) \left( 1 - p_1(x^{(i)}) \right)^2 + I_2(y^{(i)}) \left( p_1(x^{(i)}) \right)^2 \right],
$$
where $z = \beta^\top x$ and $x_0 = 1$. Find the gradient of the loss function and compare with the MLE discussed in class.

$$
K = 2 \quad\Longrightarrow\quad \text{class 1 or class 2}.
$$
We define indicator functions
$$
I_1(y^{(i)})=
\begin{cases}
1 & \text{if }y^{(i)}=\text{class 1},\\
0 & \text{otherwise},
\end{cases}
\qquad
I_2(y^{(i)})=
\begin{cases}
1 & \text{if }y^{(i)}=\text{class 2},\\
0 & \text{otherwise}.
\end{cases}
$$
The model is logistic regression:

- We write
  $$
  z^{(i)} = \beta^\top x^{(i)}, \quad
  p_1\bigl(x^{(i)}\bigr) = \sigma\bigl(z^{(i)}\bigr) = \frac{1}{1+e^{-z^{(i)}}}.
  $$
- Here $p_1(x^{(i)})$ is the predicted probability that $x^{(i)}$ belongs to **class 1**.

In words:

- If the true label is class 1, the "error" term is $\bigl(1 - p_1(x^{(i)})\bigr)^2$.
- If the true label is class 2, the "error" term is $\bigl(p_1(x^{(i)})\bigr)^2$.

Focus on the $i$-th data point. Its contribution to $J(\beta)$ is:
$$
J_i(\beta)
=\; I_1(y^{(i)})\;\bigl(1 - p_1^{(i)}\bigr)^2
\;+\;
I_2(y^{(i)})\;\bigl(p_1^{(i)}\bigr)^2,
$$
where we set $p_1^{(i)} := p_1\bigl(x^{(i)}\bigr) = \sigma\bigl(\beta^\top x^{(i)}\bigr).$ We will compute $\nabla_{\beta} J_i(\beta)$. By the chain rule:

1. First,
$$
\frac{\partial p_1^{(i)}}{\partial \beta}
= \sigma'\bigl(\beta^\top x^{(i)}\bigr)\,x^{(i)}
= p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}.
$$
2. Hence,
$$
\begin{aligned}
\nabla_{\beta} \Bigl[\bigl(1 - p_1^{(i)}\bigr)^2\Bigr]
&= 2\bigl(1 - p_1^{(i)}\bigr)\,\nabla_{\beta}\bigl(1 - p_1^{(i)}\bigr)
\;=\; 2\bigl(1 - p_1^{(i)}\bigr)\,\bigl[-\,\nabla_{\beta} p_1^{(i)}\bigr] \\[4pt]
&= -2\,\bigl(1 - p_1^{(i)}\bigr)\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}
\;=\; -2\,p_1^{(i)}\,\bigl(1 - p_1^{(i)}\bigr)^2\,x^{(i)}. \\
\nabla_{\beta} \Bigl[\bigl(p_1^{(i)}\bigr)^2\Bigr]
&= 2\,p_1^{(i)}\,\nabla_{\beta} p_1^{(i)}
= 2\,p_1^{(i)}\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}
= 2\,\bigl(p_1^{(i)}\bigr)^2\,\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}.
\end{aligned}
$$
Therefore,
$$
\nabla_{\beta} J_i(\beta)
= I_1\bigl(y^{(i)}\bigr) \bigl[-2\,p_1^{(i)}\,\bigl(1 - p_1^{(i)}\bigr)^2\,x^{(i)}\bigr]
\;+\;
I_2\bigl(y^{(i)}\bigr) \bigl[\,2\,\bigl(p_1^{(i)}\bigr)^2\,\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}\bigr].
$$
Factor out $2\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}$:
$$
\begin{aligned}
\nabla_{\beta} J_i(\beta)
&= 2\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}
\Bigl[
I_2\bigl(y^{(i)}\bigr)\,p_1^{(i)}
\;-\;
I_1\bigl(y^{(i)}\bigr)\,\bigl(1 - p_1^{(i)}\bigr)
\Bigr] \\[6pt]
&= 2\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}
\Bigl[
-\,I_1\bigl(y^{(i)}\bigr)\,\bigl(1 - p_1^{(i)}\bigr)
\;+\;
I_2\bigl(y^{(i)}\bigr)\,p_1^{(i)}
\Bigr].
\end{aligned}
$$
Now notice:

- If $y^{(i)}=\text{class 1}$, then $I_1(y^{(i)})=1$, $I_2(y^{(i)})=0$. The bracket becomes:
  $$
  -\,\bigl(1 - p_1^{(i)}\bigr) \;=\; p_1^{(i)} \;-\;1.
  $$
- If $y^{(i)}=\text{class 2}$, then $I_1(y^{(i)})=0$, $I_2(y^{(i)})=1$. The bracket becomes:
  $$
  p_1^{(i)}.
  $$

In fact, that bracket is exactly
$$
p_1^{(i)} - I_1\bigl(y^{(i)}\bigr)
$$
in both cases. Therefore, a more compact final expression is:
$$
\boxed{
\nabla_{\beta} J_i(\beta)
\;=\;
2\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,
\Bigl[
p_1^{(i)} \;-\; I_1\bigl(y^{(i)}\bigr)
\Bigr]
\,x^{(i)}.
}
$$
Summing over all $i=1,\dots,M$,
$$
\boxed{
\nabla_{\beta} J(\beta)
\;=\;
\sum_{i=1}^M \nabla_{\beta} J_i(\beta)
\;=\;
2\,\sum_{i=1}^M p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,\Bigl[p_1^{(i)} - I_1\bigl(y^{(i)}\bigr)\Bigr]\,x^{(i)},
}
$$
where $p_1^{(i)} := \sigma\bigl(\beta^\top x^{(i)}\bigr)$.

### Comparison with the MLE (Cross-Entropy) Gradient

For **logistic regression** with the standard cross-entropy loss,
$$
J_{\text{CE}}(\beta)
\;=\;
-\,\sum_{i=1}^M
\Bigl[
I_1\bigl(y^{(i)}\bigr)\,\ln\bigl(p_1^{(i)}\bigr)
\;+\;
I_2\bigl(y^{(i)}\bigr)\,\ln\bigl(1 - p_1^{(i)}\bigr)
\Bigr],
$$
the gradient is the well-known
$$
\boxed{
\nabla_{\beta} J_{\text{CE}}(\beta)
=
\sum_{i=1}^M
\Bigl[
p_1^{(i)} - I_1\bigl(y^{(i)}\bigr)
\Bigr]\,
x^{(i)}.
}
$$

## Conclusion

1. **Gradient for LSE Loss**
   $$
   \nabla_{\beta} J(\beta)
   = 2 \sum_{i=1}^M p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr) \Bigl[ p_1^{(i)} \;-\; I_1(y^{(i)}) \Bigr] x^{(i)}.
   $$
2. **Gradient for MLE (Cross-Entropy) Loss**
   $$
   \nabla_{\beta} J_{\text{CE}}(\beta)
   = \sum_{i=1}^M \Bigl[ p_1^{(i)} \;-\; I_1(y^{(i)}) \Bigr] x^{(i)}.
   $$

The two gradients differ only by the per-sample factor $2\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)$. This factor goes to zero as $p_1^{(i)}$ approaches 0 or 1, so the LSE gradient can become very small even for points that are confidently misclassified, which slows learning; the MLE (cross-entropy) gradient keeps a contribution proportional to the error $p_1^{(i)} - I_1(y^{(i)})$ and does not saturate in this way.

# Q7

The following images are handwritten instances of the digits 0 and 1. The images are small, 8 pixels by 8 pixels, and each pixel contains a gray level from the range [0, 7].



Rather than use individual pixel values, which can lead to very high-dimensional feature vectors, a simpler way to represent images for use with regression models is to calculate a histogram for each image and use this as the feature vector instead. In this case, the histograms simply count the frequency of occurrence of each possible gray level in each image. The table that follows shows the histograms for a small dataset of 10 images split between examples of digits 0 and 1.
| ID | GL-0 | GL-1 | GL-2 | GL-3 | GL-4 | GL-5 | GL-6 | GL-7 | DIGIT |
| -- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
| 0  | 31   | 3    | 6    | 2    | 7    | 5    | 6    | 4    | 0     |
| 1  | 37   | 3    | 1    | 4    | 1    | 3    | 2    | 13   | 1     |
| 2  | 31   | 3    | 4    | 1    | 8    | 7    | 3    | 7    | 0     |
| 3  | 38   | 2    | 3    | 0    | 1    | 1    | 5    | 14   | 1     |
| 4  | 31   | 5    | 3    | 2    | 5    | 2    | 5    | 11   | 0     |
| 5  | 32   | 6    | 3    | 2    | 1    | 5    | 1    | 5    | 1     |
| 6  | 31   | 3    | 5    | 2    | 3    | 6    | 2    | 12   | 0     |
| 7  | 31   | 4    | 3    | 4    | 1    | 5    | 5    | 11   | 1     |
| 8  | 38   | 4    | 2    | 2    | 2    | 4    | 4    | 8    | 1     |
| 9  | 38   | 3    | 2    | 3    | 4    | 4    | 4    | 9    | 1     |

A logistic regression model has been trained to classify digits as either 0 or 1. The weights in this model are as follows:

| Intercept | GL-0  | GL-1   | GL-2   | GL-3  | GL-4   | GL-5   | GL-6   | GL-7   |
| --------- | ----- | ------ | ------ | ----- | ------ | ------ | ------ | ------ |
| w[0]      | w[1]  | w[2]   | w[3]   | w[4]  | w[5]   | w[6]   | w[7]   | w[8]   |
| 0.309     | 0.100 | -0.152 | -0.163 | 0.191 | -0.631 | -0.716 | -0.478 | -0.171 |

This model has been used to make predictions for the instances in the training set above. These predictions, and the calculations required for the error and $w[j]$ error values, are shown in the following table.

| ID | σ(zi)  | yi | error   | s.e.   | w[0]    | w[1]    | w[2]    | w[3]    | w[4]    | w[5]    | w[6]    | w[7]    | w[8]    |
| -- | ------ | -- | ------- | ------ | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| 0  | 0.0001 | 0  | -0.0001 | ?      | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 1  | ?      | 1  | 0.8586  | 0.7373 | 0.1042  | 3.8562  | 0.3127  | 0.1042  | 0.4169  | 0.1042  | 0.3127  | 0.2084  | 1.3549  |
| 2  | 0.0000 | 0  | 0.0000  | 0.0000 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | ?       |
| 3  | 0.0565 | 1  | 0.9435  | 0.8902 | ?       | 1.9118  | 0.1006  | 0.1509  | 0.0000  | 0.0503  | 0.0503  | 0.2516  | 0.7044  |
| 4  | ?      | 0  | -0.0018 | 0.0000 | -0.0001 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 5  | 0.0256 | 1  | ?       | 0.9495 | 0.0243  | 0.7765  | 0.1456  | 0.0728  | 0.0485  | 0.0243  | 0.0243  | 0.1213  | 0.3397  |
| 6  | ?      | 0  | -0.0013 | 0.0000 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 7  | 0.0045 | 0  | ?       | 0.0000 | 0.0000  | -0.0006 | ?       | -0.0001 | -0.0001 | -0.0000 | -0.0001 | -0.0001 | -0.0002 |
| 8  | 0.0209 | 1  | 0.9791  | ?      | 0.0200  | 0.7598  | 0.0800  | 0.0800  | 0.0400  | 0.0400  | 0.0800  | 0.8000  | 0.1600  |
| 9  | 0.0292 | 1  | 0.9708  | 0.9425 | 0.0275  | 1.0447  | 0.0825  | 0.0550  | 0.0825  | 0.1100  | 0.1100  | 0.0275  | 0.2474  |

In the table, the $w[j]$ errors are calculated as
$$
\frac{\partial L(w)}{\partial w[j]} = (\gamma_i - \sigma(z_i))\,\sigma(z_i)\,(1-\sigma(z_i))\,x_i[j],
$$
which is the gradient for $w[j]$ estimated using least squared error (LSE) instead of MLE. Here $\gamma_i$ is the target for the $i^{th}$ sample and $x_i[j]$ is the $j^{th}$ feature of the $i^{th}$ sample point.

## a) Some of the model predictions are missing in the preceding table (marked with a ?). Calculate these.

Since $\text{error}_i = y_i - \sigma(z_i)$, each missing prediction is $\sigma(z_i) = y_i - \text{error}_i$.

**ID = 1**

- $y_1 = 1$
- $\text{error}_1 = 0.8586$

Hence,
$$
\sigma(z_1) = y_1 - \text{error}_1 = 1 - 0.8586 = 0.1414.
$$

**ID = 4**

- $y_4 = 0$
- $\text{error}_4 = -0.0018$

Hence,
$$
\sigma(z_4) = y_4 - \text{error}_4 = 0 - (-0.0018) = 0.0018.
$$

**ID = 6**

- $y_6 = 0$
- $\text{error}_6 = -0.0013$

Hence,
$$
\sigma(z_6) = y_6 - \text{error}_6 = 0 - (-0.0013) = 0.0013.
$$
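These back-calculated values can also be cross-checked by computing $\sigma(z_i)$ directly from the model weights and the histogram features. A minimal sketch (it reproduces 0.1414, 0.0018, and 0.0013):

```python
import math

# Model weights: intercept w[0], then w[1]..w[8] for GL-0 .. GL-7
w = [0.309, 0.100, -0.152, -0.163, 0.191, -0.631, -0.716, -0.478, -0.171]

# Histograms (GL-0 .. GL-7) of the three instances with missing predictions
histograms = {
    1: [37, 3, 1, 4, 1, 3, 2, 13],
    4: [31, 5, 3, 2, 5, 2, 5, 11],
    6: [31, 3, 5, 2, 3, 6, 2, 12],
}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for i, x in histograms.items():
    # z = w[0] + sum_j w[j] * x[j]
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    print(f"ID {i}: z = {z:.3f}, sigma(z) = {sigmoid(z):.4f}")
```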
Therefore, the three missing model predictions ($\sigma(z_i)$ values) are:

- **ID 1**: $\sigma(z_1) = 0.1414$
- **ID 4**: $\sigma(z_4) = 0.0018$
- **ID 6**: $\sigma(z_6) = 0.0013$

## b) Some of the Error and Squared Error values are missing in the preceding table (marked with a ?). Calculate these values.

**ID = 0**

Given:
- $\sigma(z_0) = 0.0001$
- $y_0 = 0$
- $\text{error}_0 = -0.0001$ (already provided in the table)
- $\text{s.e.}_0 = ?$

Thus:
$$
\text{s.e.}_0 \;=\; (-0.0001)^2 \;=\; 0.00000001 \;\approx\; 0.0000
\quad \text{(rounded to four decimal places).}
$$
Hence the missing **s.e.** for ID=0 is **0.0000**.

**ID = 5**

Given:
- $\sigma(z_5) = 0.0256$
- $y_5 = 1$
- $\text{error}_5 = ?$
- $\text{s.e.}_5 = 0.9495$ (already provided in the table)

Thus:
$$
\text{error}_5 \;=\; y_5 - \sigma(z_5) \;=\; 1 - 0.0256 \;=\; 0.9744.
$$
Hence the missing **error** for ID=5 is **0.9744**.

**ID = 7**

Given:
- $\sigma(z_7) = 0.0045$
- $y_7 = 0$
- $\text{error}_7 = ?$

$$
\text{error}_7 \;=\; y_7 - \sigma(z_7) \;=\; 0 - 0.0045 \;=\; -0.0045.
$$
Hence the missing value for ID=7 is **error = -0.0045**.

**ID = 8**

Given:
- $\sigma(z_8) = 0.0209$
- $y_8 = 1$
- $\text{error}_8 = 0.9791$ (already in the table)
- $\text{s.e.}_8 = ?$

$$
\text{s.e.}_8 \;=\; (0.9791)^2 \;=\; 0.95864881 \;\approx\; 0.9586
\quad \text{(rounded to four decimal places).}
$$
Hence the missing **s.e.** for ID=8 is **0.9586**.

Final answers for the missing values:

- **ID 0**: $\text{s.e.} = 0.0000$
- **ID 5**: $\text{error} = 0.9744$
- **ID 7**: $\text{error} = -0.0045$
- **ID 8**: $\text{s.e.} = 0.9586$

## c) Some of the $w[j]$ error values are missing in the preceding table (marked with a ?). Calculate these.

The **$w[j]$ error** for sample $i$ is:
$$
\frac{\partial L(w)}{\partial w[j]} = (\gamma_i - \sigma(z_i)) \;\sigma(z_i)\bigl(1 - \sigma(z_i)\bigr)\; x_i[j].
$$

- $\gamma_i$ is the true label.
- $\sigma(z_i)$ is the model prediction.
- $x_i[j]$ is the feature value.
- $w[0]$ is the intercept ($x_i[0] = 1$).

### **ID = 2, $w[8]$ error**

- $\sigma(z_2) = 0.0000$, $\gamma_2 = 0$
- $\gamma_2 - \sigma(z_2) = 0$
- $\sigma(z_2)(1 - \sigma(z_2)) = 0$

$$
(\gamma_2 - \sigma(z_2))\,\sigma(z_2)(1-\sigma(z_2))\,x_2[8] = 0.0 \times 0.0 \times (\dots) = 0.0.
$$
$$
\boxed{w[8]\text{ error for ID=2} = 0.0000.}
$$

### **ID = 3, $w[0]$ error**

- $\sigma(z_3) = 0.0565$, $\gamma_3 = 1$
- $\gamma_3 - \sigma(z_3) = 0.9435$
- $\sigma(z_3)(1 - \sigma(z_3)) \approx 0.0533$

$$
\frac{\partial L}{\partial w[0]} = (0.9435) \times (0.0533) \times (1) \approx 0.0503.
$$
$$
\boxed{w[0]\text{ error for ID=3} = 0.0503.}
$$

### **ID = 7, $w[2]$ error**

- $\sigma(z_7) = 0.0045$, $\gamma_7 = 0$
- $\gamma_7 - \sigma(z_7) = -0.0045$
- $\sigma(z_7)(1-\sigma(z_7)) \approx 0.00448$
- $w[2]$ multiplies GL-1, and $x_7[2] = 4$

$$
\frac{\partial L}{\partial w[2]} = (-0.0045) \times 0.00448 \times 4 \approx -0.00008.
$$
Rounded to four decimal places: **-0.0001**.
$$
\boxed{w[2]\text{ error for ID=7} = -0.0001.}
$$

**Final missing $w[j]$ errors**

- **ID 2, $w[8]$** = 0.0000
- **ID 3, $w[0]$** = 0.0503
- **ID 7, $w[2]$** = -0.0001

## d) Calculate a new set of weights for this model using a learning rate of 0.01 with LSE.

The completed table of $w[j]$ errors (with the missing values filled in) is:
| ID | σ(zi)  | yi | error   | s.e.   | w[0]    | w[1]    | w[2]    | w[3]    | w[4]    | w[5]    | w[6]    | w[7]    | w[8]    |
| -- | ------ | -- | ------- | ------ | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| 0  | 0.0001 | 0  | -0.0001 | 0.0000 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 1  | 0.1414 | 1  | 0.8586  | 0.7373 | 0.1042  | 3.8562  | 0.3127  | 0.1042  | 0.4169  | 0.1042  | 0.3127  | 0.2084  | 1.3549  |
| 2  | 0.0000 | 0  | 0.0000  | 0.0000 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 3  | 0.0565 | 1  | 0.9435  | 0.8902 | 0.0503  | 1.9118  | 0.1006  | 0.1509  | 0.0000  | 0.0503  | 0.0503  | 0.2516  | 0.7044  |
| 4  | 0.0018 | 0  | -0.0018 | 0.0000 | -0.0001 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 5  | 0.0256 | 1  | 0.9744  | 0.9495 | 0.0243  | 0.7765  | 0.1456  | 0.0728  | 0.0485  | 0.0243  | 0.0243  | 0.1213  | 0.3397  |
| 6  | 0.0013 | 0  | -0.0013 | 0.0000 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 7  | 0.0045 | 0  | -0.0045 | 0.0000 | 0.0000  | -0.0006 | -0.0001 | -0.0001 | -0.0001 | -0.0000 | -0.0001 | -0.0001 | -0.0002 |
| 8  | 0.0209 | 1  | 0.9791  | 0.9586 | 0.0200  | 0.7598  | 0.0800  | 0.0800  | 0.0400  | 0.0400  | 0.0800  | 0.8000  | 0.1600  |
| 9  | 0.0292 | 1  | 0.9708  | 0.9425 | 0.0275  | 1.0447  | 0.0825  | 0.0550  | 0.0825  | 0.1100  | 0.1100  | 0.0275  | 0.2474  |

Let $G_j$ denote the column sum of the $w[j]$ errors over the ten training instances (for example, $G_0 = 0.2262$ and $G_1 = 8.3484$). A single GD update with learning rate $\alpha = 0.01$ is
$$
w[j] \;\leftarrow\; w[j] \;-\; \alpha\,G_j \;=\; w[j] \;-\; 0.01 \times G_j.
$$

1. **$w[0]$**:
   $$
   0.309 - 0.01 \times 0.2262 = 0.309 - 0.002262 \approx 0.3067.
   $$
2. **$w[1]$**:
   $$
   0.100 - 0.01 \times 8.3484 = 0.100 - 0.083484 \approx 0.0165.
   $$
3. **$w[2]$**:
   $$
   -0.152 - 0.01 \times 0.7213 = -0.152 - 0.007213 \approx -0.1592.
   $$
4. **$w[3]$**:
   $$
   -0.163 - 0.01 \times 0.4628 = -0.163 - 0.004628 \approx -0.1676.
   $$
5. **$w[4]$**:
   $$
   0.191 - 0.01 \times 0.5878 = 0.191 - 0.005878 \approx 0.1851.
   $$
6. **$w[5]$**:
   $$
   -0.631 - 0.01 \times 0.3288 = -0.631 - 0.003288 \approx -0.6343.
   $$
7. **$w[6]$**:
   $$
   -0.716 - 0.01 \times 0.5772 = -0.716 - 0.005772 \approx -0.7218.
   $$
8. **$w[7]$**:
   $$
   -0.478 - 0.01 \times 1.4087 = -0.478 - 0.014087 \approx -0.4921.
   $$
9. **$w[8]$**:
   $$
   -0.171 - 0.01 \times 2.8062 = -0.171 - 0.028062 \approx -0.1991.
   $$

Final Updated Weights

| $w[0]$ | $w[1]$ | $w[2]$  | $w[3]$  | $w[4]$ | $w[5]$  | $w[6]$  | $w[7]$  | $w[8]$  |
| ------ | ------ | ------- | ------- | ------ | ------- | ------- | ------- | ------- |
| 0.3067 | 0.0165 | -0.1592 | -0.1676 | 0.1851 | -0.6343 | -0.7218 | -0.4921 | -0.1991 |

## e) The following table shows handwritten examples of the digits 7 and 8 and their corresponding histogram values.

| ID | GL-0 | GL-1 | GL-2 | GL-3 | GL-4 | GL-5 | GL-6 | GL-7 |
| -- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|  | 35 | 1 | 5 | 4 | 5 | 2 | 4 | 8 |
|  | 30 | 6 | 2 | 0 | 5 | 4 | 4 | 13 |

### i Calculate the output of the model (using the updated weights calculated in the previous part) for these two instances.

$$
z \;=\; w[0]\cdot 1 \;+\; w[1]\cdot \text{GL-0} \;+\; \dots \;+\; w[8]\cdot \text{GL-7},
\qquad
\sigma(z) \;=\; \frac{1}{1 + e^{-z}}.
$$
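A short script (a sketch, using the rounded updated weights from part d) reproduces the hand calculations below:

```python
import math

# Updated weights from part (d), rounded to four decimal places
w = [0.3067, 0.0165, -0.1592, -0.1676, 0.1851, -0.6343, -0.7218, -0.4921, -0.1991]

# Histograms (GL-0 .. GL-7) of the two new instances
digit7 = [35, 1, 5, 4, 5, 2, 4, 8]
digit8 = [30, 6, 2, 0, 5, 4, 4, 13]

def predict(w, x):
    # z = w[0] + sum_j w[j] * x[j], then squash with the sigmoid
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return z, 1.0 / (1.0 + math.exp(-z))

for name, x in [("digit 7", digit7), ("digit 8", digit8)]:
    z, p = predict(w, x)
    print(f"{name}: z = {z:.4f}, sigma(z) = {p:.6f}")
```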
**For the digit "7"**, the histogram is:

| GL-0 | GL-1 | GL-2 | GL-3 | GL-4 | GL-5 | GL-6 | GL-7 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 35   | 1    | 5    | 4    | 5    | 2    | 4    | 8    |

Compute $z_7$:
$$
\begin{aligned}
z_7 &= w[0]\cdot 1 + w[1]\cdot 35 + w[2]\cdot 1 + w[3]\cdot 5 + w[4]\cdot 4 + w[5]\cdot 5 + w[6]\cdot 2 + w[7]\cdot 4 + w[8]\cdot 8 \\[6pt]
&= 0.3067 + (0.0165)(35) + (-0.1592)(1) + (-0.1676)(5) + (0.1851)(4) + (-0.6343)(5) + (-0.7218)(2) + (-0.4921)(4) + (-0.1991)(8) \\[6pt]
&\approx -7.5489.
\end{aligned}
$$
$$
\sigma(z_7) = \frac{1}{1+e^{-(-7.5489)}} = \frac{1}{1+e^{7.5489}} \;\approx\; 0.00053.
$$

**For the digit "8"**, the histogram is:

| GL-0 | GL-1 | GL-2 | GL-3 | GL-4 | GL-5 | GL-6 | GL-7 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 30   | 6    | 2    | 0    | 5    | 4    | 4    | 13   |

Compute $z_8$:
$$
\begin{aligned}
z_8 &= w[0]\cdot 1 + w[1]\cdot 30 + w[2]\cdot 6 + w[3]\cdot 2 + w[4]\cdot 0 + w[5]\cdot 5 + w[6]\cdot 4 + w[7]\cdot 4 + w[8]\cdot 13 \\[6pt]
&= 0.3067 + (0.0165)(30) + (-0.1592)(6) + (-0.1676)(2) + (0.1851)(0) + (-0.6343)(5) + (-0.7218)(4) + (-0.4921)(4) + (-0.1991)(13) \\[6pt]
&\approx -11.1041.
\end{aligned}
$$
$$
\sigma(z_8) = \frac{1}{1 + e^{11.1041}} \;\approx\; 0.000015.
$$

### ii Comment on the appropriateness of these outputs.

Using the updated logistic regression weights, both new images produce very negative $z$ values, hence very small sigmoid outputs:

- **Predicted probability for "7"** $\approx 0.00053$.
- **Predicted probability for "8"** $\approx 0.000015$.

In other words, the model strongly assigns both samples to the "digit 0" class rather than "digit 1". These outputs are not really appropriate: the model was trained only to separate 0s from 1s, so although it produces a probability for any input, its confident predictions for digits 7 and 8, which lie outside the classes it was trained on, are not meaningful.

## f) Now calculate new weights using MLE and one sample point ID = 0 only instead of step (d) and repeat step (e). Use the same learning rate.

For a single sample, the MLE (cross-entropy) gradient is $\frac{\partial L}{\partial w[j]} = \bigl(\sigma(z_0) - y_0\bigr)\,x_0[j]$. With $\sigma(z_0) = 0.0001$, $y_0 = 0$, and the ID-0 histogram $(31, 3, 6, 2, 7, 5, 6, 4)$:
$$
\begin{aligned}
\frac{\partial L}{\partial w[0]} &=\; 0.0001\times1 \,=\,0.0001,\\
\frac{\partial L}{\partial w[1]} &=\; 0.0001\times31 \,=\,0.0031,\\
\frac{\partial L}{\partial w[2]} &=\; 0.0001\times3 \,=\,0.0003,\\
\frac{\partial L}{\partial w[3]} &=\; 0.0001\times6 \,=\,0.0006,\\
\frac{\partial L}{\partial w[4]} &=\; 0.0001\times2 \,=\,0.0002,\\
\frac{\partial L}{\partial w[5]} &=\; 0.0001\times7 \,=\,0.0007,\\
\frac{\partial L}{\partial w[6]} &=\; 0.0001\times5 \,=\,0.0005,\\
\frac{\partial L}{\partial w[7]} &=\; 0.0001\times6 \,=\,0.0006,\\
\frac{\partial L}{\partial w[8]} &=\; 0.0001\times4 \,=\,0.0004.
\end{aligned}
$$
A gradient-descent update with $\alpha = 0.01$ gives:
$$
\begin{aligned}
w[0] &\approx 0.309 - 0.01 \times 0.0001 = 0.309 - 0.000001 = 0.308999 \approx 0.3090,\\
w[1] &\approx 0.100 - 0.01 \times 0.0031 = 0.100 - 0.000031 = 0.099969 \approx 0.1000,\\
w[2] &\approx -0.152 - 0.01 \times 0.0003 = -0.152003 \approx -0.1520,\\
w[3] &\approx -0.163 - 0.01 \times 0.0006 = -0.163006 \approx -0.1630,\\
w[4] &\approx 0.191 - 0.01 \times 0.0002 = 0.190998 \approx 0.1910,\\
w[5] &\approx -0.631 - 0.01 \times 0.0007 = -0.631007 \approx -0.6310,\\
w[6] &\approx -0.716 - 0.01 \times 0.0005 = -0.716005 \approx -0.7160,\\
w[7] &\approx -0.478 - 0.01 \times 0.0006 = -0.478006 \approx -0.4780,\\
w[8] &\approx -0.171 - 0.01 \times 0.0004 = -0.171004 \approx -0.1710.
\end{aligned}
$$

**Digit "7"**

Histogram: $\mathrm{GL{-}0}=35,\ \mathrm{GL{-}1}=1,\ \mathrm{GL{-}2}=5,\ \mathrm{GL{-}3}=4,\ \mathrm{GL{-}4}=5,\ \mathrm{GL{-}5}=2,\ \mathrm{GL{-}6}=4,\ \mathrm{GL{-}7}=8.$

Using the (slightly) updated weights $w^{\text{new}}[j]$ (kept to 6 decimals for accuracy), we get
$$
\begin{aligned}
z_{7} &\;=\;0.308999 \;+\; (0.099969)\times35 \;+\;(-0.152003)\times1 \;+\;(-0.163006)\times5 \;+\;(0.190998)\times4 \;+\;(-0.631007)\times5 \;+\;(-0.716005)\times2 \;+\;(-0.478006)\times4 \;+\;(-0.171004)\times8\\
&\;\approx\;-4.2622.
\end{aligned}
$$
$$
\sigma(z_7) \;=\; \frac{1}{1+e^{-(-4.2622)}} \;=\; \frac{1}{1 + e^{4.2622}} \;\approx\; 0.014.
$$

**Digit "8"**

Histogram: $\mathrm{GL{-}0}=30,\ \mathrm{GL{-}1}=6,\ \mathrm{GL{-}2}=2,\ \mathrm{GL{-}3}=0,\ \mathrm{GL{-}4}=5,\ \mathrm{GL{-}5}=4,\ \mathrm{GL{-}6}=4,\ \mathrm{GL{-}7}=13.$

Similarly,
$$
\begin{aligned}
z_{8} &=\;0.308999 + (0.099969)\times30 + (-0.152003)\times6 + (-0.163006)\times2 + (0.190998)\times0 + (-0.631007)\times5 + (-0.716005)\times4 + (-0.478006)\times4 + (-0.171004)\times13\\
&\approx\;-8.0841.
\end{aligned}
$$
$$
\sigma(z_8) \;=\; \frac{1}{1 + e^{8.0841}} \;\approx\;0.0003.
$$

## Conclusion

Because the update from a single, almost-correct example $(y_0=0,\ \sigma(z_0)=0.0001)$ is vanishingly small,

- **all new weights remain nearly the same** as before (to four decimals), and
- consequently, the model outputs for the "7" and "8" digits are still very close to zero: about 1.4% for the "7" and 0.03% for the "8". Both are strongly classified as "digit 0" in this scenario.
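For reference, a small sketch of this single-sample MLE update and re-evaluation (using the tabulated $\sigma(z_0) = 0.0001$ for ID 0) reproduces the $\approx 0.014$ and $\approx 0.0003$ outputs:

```python
import math

# Original weights w[0]..w[8] and the ID-0 histogram (GL-0 .. GL-7)
w = [0.309, 0.100, -0.152, -0.163, 0.191, -0.631, -0.716, -0.478, -0.171]
x0 = [1, 31, 3, 6, 2, 7, 5, 6, 4]   # leading 1 for the intercept term
y0, sigma0 = 0, 0.0001              # label and tabulated prediction for ID 0
alpha = 0.01

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# MLE (cross-entropy) gradient for one sample: (sigma(z) - y) * x[j]
w_new = [wj - alpha * (sigma0 - y0) * xj for wj, xj in zip(w, x0)]

# Re-evaluate the digit-7 and digit-8 histograms with the updated weights
for name, x in [("digit 7", [35, 1, 5, 4, 5, 2, 4, 8]),
                ("digit 8", [30, 6, 2, 0, 5, 4, 4, 13])]:
    z = w_new[0] + sum(wj * xj for wj, xj in zip(w_new[1:], x))
    print(f"{name}: z = {z:.4f}, sigma(z) = {sigmoid(z):.4f}")
```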