# Q1

1. Given $n$ independent sample points $x_i,\ i = 1, \dots, n$ from a Laplace distribution with PDF
$$
f(x) = \frac{\lambda}{2} e^{-\lambda |x|},
$$
provide the MLE estimator for the parameter $\lambda$.

1. The log-likelihood for $n$ i.i.d. samples $x_1,\dots,x_n$ from the Laplace distribution is
$$
f(x)=\frac{\lambda}{2} e^{-\lambda|x|}
\quad\Rightarrow\quad
\ell(\lambda) =\sum_{i=1}^n \ln\left(\frac{\lambda}{2}\right) - \lambda\sum_{i=1}^n |x_i| = n \ln(\lambda) - n\ln(2) - \lambda \sum_{i=1}^n |x_i|.
$$
2. Differentiate and set to zero:
$$
\frac{d\,\ell(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n |x_i| = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{n}{\displaystyle \sum_{i=1}^n |x_i|}.
$$
3. The second derivative $\frac{d^2\,\ell(\lambda)}{d\lambda^2}=-\frac{n}{\lambda^2}<0$ confirms a maximum.

Thus, the MLE for $\lambda$ is
$$
\boxed{\hat{\lambda} = \frac{n}{\sum_{i=1}^n |x_i|}}.
$$

# Q2

2. Maximum entropy is often used to select an a priori distribution when there is very little information about the RV. Consider the Bernoulli distribution
$$
p(x) = \begin{cases} p, & x = 1 \\ 1 - p, & x = 0 \\ 0, & \text{otherwise.} \end{cases}
$$
Find its entropy $H(X)$ and prove analytically that $H(X)$ achieves its maximum when $p = \frac{1}{2}$.

1. For a Bernoulli random variable $X$ with $\Pr(X=1)=p$ and $\Pr(X=0)=1-p$, the entropy is
$$
H(X)=-p\ln p-(1-p)\ln(1-p).
$$
2. Differentiate and set to zero:
$$
\frac{d\,H}{dp} = -\ln p - 1 + \ln(1-p) + 1 = \ln\!\bigl(\tfrac{1-p}{p}\bigr) = 0
\quad\Longrightarrow\quad
\tfrac{1-p}{p}=1
\quad\Longrightarrow\quad
p=\tfrac12.
$$
The second derivative $\tfrac{d^2\,H}{dp^2}=-\tfrac{1}{p} - \tfrac{1}{1-p}$ is negative for $0<p<1$, so the stationary point $p=\tfrac12$ is a maximum.

# Q3

3. A multivariate logistic regression model has been built to predict the propensity of shoppers to perform a repeat purchase of a free gift that they are given. The input features used by the model are the age of the customer, the socioeconomic band to which the customer belongs (a, b, or c), the average amount of money the customer spends on each visit to the shop, and the average number of visits the customer makes to the shop per week. This model is being used by the marketing department to determine who should be given the free gift. The weights in the trained model are shown in the following table. One-hot encoding is used for the socioeconomic bands, with band a as the reference level, i.e., band a is encoded as $b = c = 0$.

| Feature              | Weight   |
| -------------------- | -------- |
| Intercept ($w[0]$)   | -3.82398 |
| AGE                  | -0.02909 |
| Socioeconomic BAND B | -0.09089 |
| Socioeconomic BAND C | -0.19558 |
| SHOP VALUE           | 0.02999  |
| SHOP FREQUENCY       | 0.74572  |

Use this model to make predictions for each of the following query instances.
| ID | AGE | Socioeconomic BAND | SHOP FREQUENCY | SHOP VALUE |
| -- | --- | ------------------ | -------------- | ---------- |
| 1  | 56  | b                  | 1.60           | 109.32     |
| 2  | 21  | c                  | 4.92           | 11.28      |
| 3  | 48  | b                  | 1.21           | 161.19     |
| 4  | 37  | c                  | 0.72           | 170.65     |
| 5  | 32  | a                  | 1.08           | 165.39     |

```python
import math

# Given weights
w0 = -3.82398
w_age = -0.02909
w_band_b = -0.09089
w_band_c = -0.19558
w_value = 0.02999
w_frequency = 0.74572

# Helper function to compute the sigmoid
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (Age, Band, ShopFrequency, ShopValue), matching the query table above
queries = [
    (56, 'b', 1.60, 109.32),
    (21, 'c', 4.92, 11.28),
    (48, 'b', 1.21, 161.19),
    (37, 'c', 0.72, 170.65),
    (32, 'a', 1.08, 165.39)
]

# One-hot encoding for socioeconomic band: a => (0,0), b => (1,0), c => (0,1)
def encode_band(band):
    if band == 'a':
        return (0, 0)
    elif band == 'b':
        return (1, 0)
    elif band == 'c':
        return (0, 1)

# Compute predictions
for i, (age, band, freq, val) in enumerate(queries, start=1):
    b_enc, c_enc = encode_band(band)
    logit = (w0
             + w_age * age
             + w_band_b * b_enc
             + w_band_c * c_enc
             + w_value * val
             + w_frequency * freq)
    # Probability
    prob = sigmoid(logit)
    print(f"Query {i}: logit = {logit:.4f}, predicted probability = {prob:.4f}")
```

Query 1: logit = -1.0723, predicted probability = 0.2550
Query 2: logit = -0.6232, predicted probability = 0.3490
Query 3: logit = 0.4252, predicted probability = 0.6047
Query 4: logit = 0.5588, predicted probability = 0.6362
Query 5: logit = 1.0106, predicted probability = 0.7331

Regression model:
$$
\text{logit} = w_0 + w_{\text{age}}\cdot\text{Age} + w_{\text{bandB}}\cdot I_{\{\text{band=B}\}} + w_{\text{bandC}}\cdot I_{\{\text{band=C}\}} + w_{\text{value}}\cdot\text{ShopValue} + w_{\text{freq}}\cdot\text{ShopFrequency}.
$$
The predicted probability is the sigmoid of this linear combination:
$$
p = \frac{1}{1 + e^{-\text{logit}}}.
$$
For each row, substitute the features (with the appropriate one-hot encoding for the band) into the logit and apply the sigmoid. The predicted probabilities for the five queries are:

- $p \approx 0.255$
- $p \approx 0.349$
- $p \approx 0.605$
- $p \approx 0.636$
- $p \approx 0.733$

# Q4

4. In building multivariate logistic regression models, it is recommended that all continuous input features be normalized to the range $[-1, 1]$. The following table shows a data quality report for the dataset used to train the model described in the last question.

### Data Quality Report

| Feature        | Count | Miss. | Card. | Min. | 1st Qrt. | Mean  | Median | 3rd Qrt. | Max.  | Std. Dev. |
| -------------- | ----- | ----- | ----- | ---- | -------- | ----- | ------ | -------- | ----- | --------- |
| AGE            | 5,200 | 6     | 40    | 18   | 22       | 32.7  | 32     | 32       | 63    | 12.2      |
| SHOP FREQUENCY | 5,200 | 0     | 316   | 0.2  | 1.0      | 2.2   | 1.3    | 1.3      | 4.3   | 5.4       |
| SHOP VALUE     | 5,200 | 0     | 3,730 | 5    | 11.8     | 101.9 | 100.14 | 100.14   | 174.6 | 230.7     |

| Feature            | Count | Miss. | Card. | Mode | Mode Count | Mode % |
| ------------------ | ----- | ----- | ----- | ---- | ---------- | ------ |
| SOCIOECONOMIC BAND | 5,200 | 8     | 3     | a    | 2,664      | 51.2   |
| REPEAT PURCHASE    | 5,200 | 0     | 2     | no   | 2,791      | 53.7   |

On the basis of the information in this report, all continuous features were normalized using range normalization, and missing values were imputed using the mean for continuous features and the mode for categorical features. After applying these data preparation operations, a multivariate logistic regression model was trained to give the weights shown in the following table.
### Model Weights

| Feature              | Weight  |
| -------------------- | ------- |
| Intercept ($w[0]$)   | 0.6679  |
| AGE                  | -0.5795 |
| SOCIOECONOMIC BAND B | -0.1981 |
| SOCIOECONOMIC BAND C | -0.2318 |
| SHOP VALUE           | 3.4091  |
| SHOP FREQUENCY       | 2.0499  |

Use this model to make predictions for each of the query instances shown in the following table (question marks refer to missing values).

### Query Instances

| ID | AGE | Socioeconomic BAND | SHOP FREQUENCY | SHOP VALUE |
| -- | --- | ------------------ | -------------- | ---------- |
| 1  | 38  | a                  | 1.90           | 165.39     |
| 2  | 56  | b                  | 1.60           | 109.32     |
| 3  | 18  | c                  | 6.00           | 10.09      |
| 4  | ?   | b                  | 1.33           | 204.62     |
| 5  | 62  | ?                  | 0.85           | 110.50     |

```python
import numpy as np
import pandas as pd

# Training-set statistics taken from the data quality report
min_age, max_age = 18, 63
min_shop_freq, max_shop_freq = 0.2, 4.3
min_shop_value, max_shop_value = 5, 174.6
mean_age = 32.7
mode_socioeconomic_band = "a"

# Model weights
intercept = 0.6679
weights = {
    "AGE": -0.5795,
    "SOCIOECONOMIC_BAND_B": -0.1981,
    "SOCIOECONOMIC_BAND_C": -0.2318,
    "SHOP_VALUE": 3.4091,
    "SHOP_FREQUENCY": 2.0499,
}

# Query data (before imputation and normalization)
query_instances = pd.DataFrame([
    [1, 38, 'a', 1.90, 165.39],
    [2, 56, 'b', 1.60, 109.32],
    [3, 18, 'c', 6.00, 10.09],
    [4, np.nan, 'b', 1.33, 204.62],
    [5, 62, np.nan, 0.85, 110.50],
], columns=["ID", "AGE", "SOCIOECONOMIC_BAND", "SHOP_FREQUENCY", "SHOP_VALUE"])

# Impute missing values: mean for continuous, mode for categorical
query_instances["AGE"] = query_instances["AGE"].fillna(mean_age)
query_instances["SOCIOECONOMIC_BAND"] = query_instances["SOCIOECONOMIC_BAND"].fillna(mode_socioeconomic_band)

# Range-normalize continuous features to [-1, 1] using the training min/max
query_instances["AGE"] = 2 * (query_instances["AGE"] - min_age) / (max_age - min_age) - 1
query_instances["SHOP_FREQUENCY"] = 2 * (query_instances["SHOP_FREQUENCY"] - min_shop_freq) / (max_shop_freq - min_shop_freq) - 1
query_instances["SHOP_VALUE"] = 2 * (query_instances["SHOP_VALUE"] - min_shop_value) / (max_shop_value - min_shop_value) - 1

# One-hot encode the categorical feature (band a is the reference level)
query_instances["SOCIOECONOMIC_BAND_B"] = (query_instances["SOCIOECONOMIC_BAND"] == 'b').astype(int)
query_instances["SOCIOECONOMIC_BAND_C"] = (query_instances["SOCIOECONOMIC_BAND"] == 'c').astype(int)

# Logistic regression prediction
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Linear combination of inputs and weights
linear_combination = (
    intercept
    + weights["AGE"] * query_instances["AGE"]
    + weights["SOCIOECONOMIC_BAND_B"] * query_instances["SOCIOECONOMIC_BAND_B"]
    + weights["SOCIOECONOMIC_BAND_C"] * query_instances["SOCIOECONOMIC_BAND_C"]
    + weights["SHOP_FREQUENCY"] * query_instances["SHOP_FREQUENCY"]
    + weights["SHOP_VALUE"] * query_instances["SHOP_VALUE"]
)

# Apply the sigmoid function to get probabilities
query_instances["PREDICTION"] = sigmoid(linear_combination)
print(query_instances[["ID", "PREDICTION"]])
```

- Query 1: $0.968$
- Query 2: $0.551$
- Query 3: $0.826$
- Query 4: $0.987$
- Query 5: $0.388$

# Q5

5. Consider training a logistic regression model. Assume $K = 2$ and $y = 0$ or $1$. The loss function is defined below:
$$
J(w) = -y \log(\sigma(z)) - (1 - y) \log(1 - \sigma(z)),
$$
where $z = w^T x$ and $x_0 = 1$.

## (a) Show that
$$
\nabla_w J(w) = -x(y - \sigma(z))
$$

$$
z = w^T x, \quad \sigma(z) = \frac{1}{1+e^{-z}}, \quad
J(w) = -\,y \,\log\!\bigl(\sigma(z)\bigr) \;-\; \bigl(1-y\bigr)\,\log\!\bigl(1-\sigma(z)\bigr).
$$
We want $\nabla_{w} J(w)$.
A standard result (or direct differentiation) shows:
$$
\frac{\partial}{\partial z}\,J(w)
= -\,\frac{y}{\sigma(z)}\,\sigma'(z) \;-\;\frac{1-y}{1-\sigma(z)}\,\bigl[-\sigma'(z)\bigr],
$$
where $\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$. Simplify:
$$
\frac{\partial}{\partial z}\,J(w)
= -\,y \,\frac{\sigma'(z)}{\sigma(z)} \;+\;(1-y)\,\frac{\sigma'(z)}{1-\sigma(z)}
= \sigma(z)\bigl(1-\sigma(z)\bigr)\!
\Bigl( -\,\frac{y}{\sigma(z)} \;+\; \frac{1-y}{1-\sigma(z)}\Bigr).
$$
A little algebra shows that the combination inside the parentheses equals $\dfrac{\sigma(z)-y}{\sigma(z)\,(1-\sigma(z))}$. Indeed:
$$
-\,y \,\frac{1}{\sigma(z)} + (1-y)\,\frac{1}{1-\sigma(z)}
= \frac{-\,y\,[1-\sigma(z)] + (1-y)\,\sigma(z)}{\sigma(z)\,\bigl(1-\sigma(z)\bigr)}
= \frac{\sigma(z) - y}{\sigma(z)\,\bigl(1-\sigma(z)\bigr)},
$$
so the prefactor cancels and
$$
\frac{\partial J}{\partial z} = \sigma(z) - y.
$$
Finally, by the chain rule with $z = w^T x$ and $\nabla_w z = x$:
$$
\nabla_{w}\,J(w) \;=\; \bigl[\sigma(z) - y\bigr]\,x \;=\; -\,x\,(y \;-\;\sigma(z)).
$$

## (b) Now change the encoding for $y$ to $y = -1$ or $1$. Show the result again. Explain the impact of this change.

With labels $y \in \{0, 1\}$ we had
$$
\nabla_{\mathbf{w}} J(\mathbf{w}) = -\,\mathbf{x}\,[y - \sigma(z)].
$$
Now we switch the labels to $y \in \{-1, +1\}$. Below is the new loss function and the corresponding gradient derivation.

### Loss Function ($y \in \{-1, +1\}$)

When $y$ takes values $\pm 1$, we can replace the original $\{0,1\}$ indicators with
$$
\mathbb{1}[\,y=+1\,] = \frac{1 + y}{2},
\quad
\mathbb{1}[\,y=-1\,] = \frac{1 - y}{2},
$$
so
$$
\boxed{
J(\mathbf{w})
= -\,\frac{1 + y}{2}\,\ln\bigl(\sigma(z)\bigr)
- \frac{1 - y}{2}\,\ln\bigl(1 - \sigma(z)\bigr),
}
$$
where $z = \mathbf{w}^\top \mathbf{x}$ and $\sigma(z) = \frac{1}{1+e^{-z}}$.

### Gradient Derivation

Preparation:
$$
z = \mathbf{w}^\top \mathbf{x}, \quad \sigma(z) = \frac{1}{1 + e^{-z}}.
$$
To compute the gradient of the loss function $J(\mathbf{w})$ with respect to $\mathbf{w}$, we first compute the derivative of $J$ with respect to $z$, and then multiply by $\frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}$:
$$
\nabla_{\mathbf{w}} J(\mathbf{w}) = \frac{\partial J}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}} = \left( \frac{\partial J}{\partial z} \right) \mathbf{x}.
$$

### Derivation of $\frac{\partial J}{\partial z}$

$$
\begin{aligned}
J(\mathbf{w}) &= -\frac{1 + y}{2}\,\ln\bigl(\sigma(z)\bigr) - \frac{1 - y}{2}\,\ln\bigl(1-\sigma(z)\bigr), \\
\frac{\partial J}{\partial z} &= -\frac{1 + y}{2}\,\underbrace{\frac{\partial}{\partial z}\bigl[\ln(\sigma(z))\bigr]}_{\frac{1}{\sigma(z)}\,\sigma'(z)} - \frac{1 - y}{2}\,\underbrace{\frac{\partial}{\partial z}\bigl[\ln(1-\sigma(z))\bigr]}_{\frac{1}{1-\sigma(z)}\cdot[-\sigma'(z)]}.
\end{aligned}
$$
With
$$
\sigma'(z) = \sigma(z)\,[1 - \sigma(z)],
$$
$$
\frac{\partial}{\partial z}\ln(\sigma(z)) = \frac{\sigma'(z)}{\sigma(z)} = \frac{\sigma(z)[1 - \sigma(z)]}{\sigma(z)} = 1 - \sigma(z),
$$
$$
\frac{\partial}{\partial z}\ln\bigl(1 - \sigma(z)\bigr) = \frac{-\,\sigma'(z)}{1 - \sigma(z)} = -\,\frac{\sigma(z)[1 - \sigma(z)]}{1 - \sigma(z)} = -\,\sigma(z),
$$
we can write the derivative as:
$$
\begin{aligned}
\frac{\partial J}{\partial z}
&= -\frac{1 + y}{2}\,\bigl[\,1 - \sigma(z)\bigr] - \frac{1 - y}{2}\,\bigl[-\,\sigma(z)\bigr] \\
&= -\frac{1 + y}{2}\,\bigl[\,1 - \sigma(z)\bigr] + \frac{1 - y}{2}\,\sigma(z).
\end{aligned}
$$
Expanding,
$$
\begin{aligned}
\frac{\partial J}{\partial z}
&= -\frac{1 + y}{2} + \frac{1 + y}{2}\,\sigma(z) + \frac{1 - y}{2}\,\sigma(z) \\
&= -\frac{1 + y}{2} + \sigma(z)\,\Bigl[\,\frac{1 + y}{2} + \frac{1 - y}{2}\Bigr] \\
&= -\frac{1 + y}{2} + \sigma(z)\,\Bigl[\frac{(1 + y)+(1 - y)}{2}\Bigr] \\
&= -\frac{1 + y}{2} + \sigma(z)\,\Bigl[\frac{2}{2}\Bigr] \\
&= -\frac{1 + y}{2} + \sigma(z) \\
&= \boxed{\sigma(z) - \frac{1 + y}{2}}.
\end{aligned}
$$

### Substituting into the Gradient

Using the chain rule:
$$
\nabla_{\mathbf{w}} J(\mathbf{w}) = \left( \sigma(z) - \frac{1 + y}{2} \right) \mathbf{x}.
$$
Alternatively, we can write it in the "negative" form:
$$
\nabla_{\mathbf{w}} J(\mathbf{w}) = -\,\mathbf{x}\,\Bigl[\frac{1 + y}{2} - \sigma(z)\Bigr].
$$

- **Original labels $\{0,1\}$**
  The gradient is:
  $$
  \nabla_{\mathbf{w}} J(\mathbf{w}) = -\,\mathbf{x}\,[y - \sigma(z)].
  $$
- **New labels $\{-1,+1\}$**
  The gradient is:
  $$
  \nabla_{\mathbf{w}} J(\mathbf{w}) = \left[ \sigma(z) - \frac{1 + y}{2} \right]\,\mathbf{x}.
  $$

Here, $\frac{1 + y}{2}$ equals 1 when $y = +1$ and equals 0 when $y = -1$, so the new gradient is exactly the old one with the label replaced by its $\{0,1\}$ encoding. Changing the encoding therefore does not change the update rule or the trained model; only the bookkeeping of the labels changes.

# Q6

6. Consider training a logistic regression model using least squared error (LSE). Assume $K = 2$, and the loss function is defined through the errors in probabilities as below:
$$
J(\beta) = \sum_{i=1}^{M} \left[ I_1(y^{(i)}) \left( 1 - p_1(x^{(i)}) \right)^2 + I_2(y^{(i)}) \left( p_1(x^{(i)}) \right)^2 \right],
$$
where $z = \beta^\top x$ and $x_0 = 1$. Find the gradient of the loss function and compare with the MLE discussed in class.

$$
K = 2 \quad\Longrightarrow\quad \text{class 1 or class 2}.
$$
We define indicator functions
$$
I_1(y^{(i)})=
\begin{cases}
1 & \text{if }y^{(i)}=\text{class 1},\\
0 & \text{otherwise},
\end{cases}
\qquad
I_2(y^{(i)})=
\begin{cases}
1 & \text{if }y^{(i)}=\text{class 2},\\
0 & \text{otherwise}.
\end{cases}
$$
The model is logistic regression:

- We write
  $$
  z^{(i)} = \beta^\top x^{(i)}, \quad
  p_1\bigl(x^{(i)}\bigr) = \sigma\bigl(z^{(i)}\bigr) = \frac{1}{1+e^{-z^{(i)}}}.
  $$
- Here $p_1(x^{(i)})$ is the predicted probability that $x^{(i)}$ belongs to **class 1**.

In words:

- If the true label is class 1, the "error" term is $\bigl(1 - p_1(x^{(i)})\bigr)^2$.
- If the true label is class 2, the "error" term is $\bigl(p_1(x^{(i)})\bigr)^2$.

Focus on the $i$-th data point. Its contribution to $J(\beta)$ is:
$$
J_i(\beta)
=\; I_1(y^{(i)})\;\bigl(1 - p_1^{(i)}\bigr)^2
\;+\;
I_2(y^{(i)})\;\bigl(p_1^{(i)}\bigr)^2,
$$
where we set $p_1^{(i)} := p_1\bigl(x^{(i)}\bigr) = \sigma\bigl(\beta^\top x^{(i)}\bigr).$ We will compute $\nabla_{\beta} J_i(\beta)$. By the chain rule:

1. First,
$$
\frac{\partial p_1^{(i)}}{\partial \beta}
= \sigma'\bigl(\beta^\top x^{(i)}\bigr)\,x^{(i)}
= p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}.
$$
2. Hence,
$$
\begin{aligned}
\nabla_{\beta} \Bigl[\bigl(1 - p_1^{(i)}\bigr)^2\Bigr]
&= 2\bigl(1 - p_1^{(i)}\bigr)\,\nabla_{\beta}\bigl(1 - p_1^{(i)}\bigr)
\;=\; 2\bigl(1 - p_1^{(i)}\bigr)\,\bigl[-\,\nabla_{\beta} p_1^{(i)}\bigr] \\[4pt]
&= -2\,\bigl(1 - p_1^{(i)}\bigr)\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}
\;=\; -2\,p_1^{(i)}\,\bigl(1 - p_1^{(i)}\bigr)^2\,x^{(i)}. \\
\nabla_{\beta} \Bigl[\bigl(p_1^{(i)}\bigr)^2\Bigr]
&= 2\,p_1^{(i)}\,\nabla_{\beta} p_1^{(i)}
= 2\,p_1^{(i)}\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}
= 2\,\bigl(p_1^{(i)}\bigr)^2\,\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}.
\end{aligned}
$$
Therefore,
$$
\nabla_{\beta} J_i(\beta)
= I_1\bigl(y^{(i)}\bigr) \bigl[-2\,p_1^{(i)}\,\bigl(1 - p_1^{(i)}\bigr)^2\,x^{(i)}\bigr]
\;+\;
I_2\bigl(y^{(i)}\bigr) \bigl[\,2\,\bigl(p_1^{(i)}\bigr)^2\,\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}\bigr].
$$
Factor out $2\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}$:
$$
\begin{aligned}
\nabla_{\beta} J_i(\beta)
&= 2\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}
\Bigl[
I_2\bigl(y^{(i)}\bigr)\,p_1^{(i)}
\;-\;
I_1\bigl(y^{(i)}\bigr)\,\bigl(1 - p_1^{(i)}\bigr)
\Bigr] \\[6pt]
&= 2\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,x^{(i)}
\Bigl[
-\,I_1\bigl(y^{(i)}\bigr)\,\bigl(1 - p_1^{(i)}\bigr)
\;+\;
I_2\bigl(y^{(i)}\bigr)\,p_1^{(i)}
\Bigr].
\end{aligned}
$$
Now notice:

- If $y^{(i)}=\text{class 1}$, then $I_1(y^{(i)})=1$, $I_2(y^{(i)})=0$. The bracket becomes:
  $$
  -\,\bigl(1 - p_1^{(i)}\bigr) \;=\; p_1^{(i)} \;-\;1.
  $$
- If $y^{(i)}=\text{class 2}$, then $I_1(y^{(i)})=0$, $I_2(y^{(i)})=1$. The bracket becomes:
  $$
  p_1^{(i)}.
  $$

In fact, that bracket is exactly
$$
p_1^{(i)} - I_1\bigl(y^{(i)}\bigr)
$$
in both cases. Therefore, a more compact final expression is:
$$
\boxed{
\nabla_{\beta} J_i(\beta)
\;=\;
2\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,
\Bigl[
p_1^{(i)} \;-\; I_1\bigl(y^{(i)}\bigr)
\Bigr]
\,x^{(i)}.
}
$$
Summing over all $i=1,\dots,M$,
$$
\boxed{
\nabla_{\beta} J(\beta)
\;=\;
\sum_{i=1}^M \nabla_{\beta} J_i(\beta)
\;=\;
2\,\sum_{i=1}^M p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)\,\Bigl[p_1^{(i)} - I_1\bigl(y^{(i)}\bigr)\Bigr]\,x^{(i)},
}
$$
where $p_1^{(i)} := \sigma\bigl(\beta^\top x^{(i)}\bigr)$.

### Comparison with the MLE (Cross-Entropy) Gradient

For **logistic regression** with the standard cross-entropy loss,
$$
J_{\text{CE}}(\beta)
\;=\;
-\,\sum_{i=1}^M
\Bigl[
I_1\bigl(y^{(i)}\bigr)\,\ln\bigl(p_1^{(i)}\bigr)
\;+\;
I_2\bigl(y^{(i)}\bigr)\,\ln\bigl(1 - p_1^{(i)}\bigr)
\Bigr],
$$
the gradient is the well-known
$$
\boxed{
\nabla_{\beta} J_{\text{CE}}(\beta)
=
\sum_{i=1}^M
\Bigl[
p_1^{(i)} - I_1\bigl(y^{(i)}\bigr)
\Bigr]\,
x^{(i)}.
}
$$

## Conclusion

1. **Gradient for LSE Loss**
   $$
   \nabla_{\beta} J(\beta)
   = 2 \sum_{i=1}^M p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr) \Bigl[ p_1^{(i)} \;-\; I_1(y^{(i)}) \Bigr] x^{(i)}.
   $$
2. **Gradient for MLE (Cross-Entropy) Loss**
   $$
   \nabla_{\beta} J_{\text{CE}}(\beta)
   = \sum_{i=1}^M \Bigl[ p_1^{(i)} \;-\; I_1(y^{(i)}) \Bigr] x^{(i)}.
   $$

The two gradients differ only by the per-sample factor $2\,p_1^{(i)}\bigl(1 - p_1^{(i)}\bigr)$. This factor goes to zero as $p_1^{(i)}$ approaches 0 or 1, so the LSE gradient can become very small even for points that are confidently misclassified, which slows learning; the MLE (cross-entropy) gradient keeps a contribution proportional to the error $p_1^{(i)} - I_1(y^{(i)})$ and does not saturate in this way.

# Q7

The following images are handwritten instances of the digits 0 and 1. The images are small, 8 pixels by 8 pixels, and each pixel contains a gray level from the range [0, 7].



Rather than use individual pixel values, which can lead to very high-dimensional feature vectors, a simpler way to represent images for use with regression models is to calculate a histogram for each image and use this as the feature vector instead. In this case, the histograms simply count the frequency of occurrence of each possible gray level in each image. The table that follows shows the histograms for a small dataset of 10 images split between examples of digits 0 and 1.
| ID | GL-0 | GL-1 | GL-2 | GL-3 | GL-4 | GL-5 | GL-6 | GL-7 | DIGIT |
| -- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
| 0  | 31   | 3    | 6    | 2    | 7    | 5    | 6    | 4    | 0     |
| 1  | 37   | 3    | 1    | 4    | 1    | 3    | 2    | 13   | 1     |
| 2  | 31   | 3    | 4    | 1    | 8    | 7    | 3    | 7    | 0     |
| 3  | 38   | 2    | 3    | 0    | 1    | 1    | 5    | 14   | 1     |
| 4  | 31   | 5    | 3    | 2    | 5    | 2    | 5    | 11   | 0     |
| 5  | 32   | 6    | 3    | 2    | 1    | 5    | 1    | 5    | 1     |
| 6  | 31   | 3    | 5    | 2    | 3    | 6    | 2    | 12   | 0     |
| 7  | 31   | 4    | 3    | 4    | 1    | 5    | 5    | 11   | 1     |
| 8  | 38   | 4    | 2    | 2    | 2    | 4    | 4    | 8    | 1     |
| 9  | 38   | 3    | 2    | 3    | 4    | 4    | 4    | 9    | 1     |

A logistic regression model has been trained to classify digits as either 0 or 1. The weights in this model are as follows:

| Intercept | GL-0  | GL-1   | GL-2   | GL-3  | GL-4   | GL-5   | GL-6   | GL-7   |
| --------- | ----- | ------ | ------ | ----- | ------ | ------ | ------ | ------ |
| w[0]      | w[1]  | w[2]   | w[3]   | w[4]  | w[5]   | w[6]   | w[7]   | w[8]   |
| 0.309     | 0.100 | -0.152 | -0.163 | 0.191 | -0.631 | -0.716 | -0.478 | -0.171 |

This model has been used to make predictions for the instances in the training set above. These predictions, and the calculations required for the error and $w[j]$ error values, are shown in the following table.

| ID | σ(zi)  | yi | error   | s.e.   | w[0]    | w[1]    | w[2]    | w[3]    | w[4]    | w[5]    | w[6]    | w[7]    | w[8]    |
| -- | ------ | -- | ------- | ------ | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| 0  | 0.0001 | 0  | -0.0001 | ?      | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 1  | ?      | 1  | 0.8586  | 0.7373 | 0.1042  | 3.8562  | 0.3127  | 0.1042  | 0.4169  | 0.1042  | 0.3127  | 0.2084  | 1.3549  |
| 2  | 0.0000 | 0  | 0.0000  | 0.0000 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | ?       |
| 3  | 0.0565 | 1  | 0.9435  | 0.8902 | ?       | 1.9118  | 0.1006  | 0.1509  | 0.0000  | 0.0503  | 0.0503  | 0.2516  | 0.7044  |
| 4  | ?      | 0  | -0.0018 | 0.0000 | -0.0001 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 5  | 0.0256 | 1  | ?       | 0.9495 | 0.0243  | 0.7765  | 0.1456  | 0.0728  | 0.0485  | 0.0243  | 0.0243  | 0.1213  | 0.3397  |
| 6  | ?      | 0  | -0.0013 | 0.0000 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 7  | 0.0045 | 0  | ?       | 0.0000 | 0.0000  | -0.0006 | ?       | -0.0001 | -0.0001 | -0.0000 | -0.0001 | -0.0001 | -0.0002 |
| 8  | 0.0209 | 1  | 0.9791  | ?      | 0.0200  | 0.7598  | 0.0800  | 0.0800  | 0.0400  | 0.0400  | 0.0800  | 0.8000  | 0.1600  |
| 9  | 0.0292 | 1  | 0.9708  | 0.9425 | 0.0275  | 1.0447  | 0.0825  | 0.0550  | 0.0825  | 0.1100  | 0.1100  | 0.0275  | 0.2474  |

In the table, the $w[j]$ errors are calculated as
$$
\frac{\partial L(w)}{\partial w[j]} = (\gamma_i - \sigma(z_i))\,\sigma(z_i)\,(1-\sigma(z_i))\,x_i[j],
$$
which is the gradient for $w[j]$ estimated using least squared error (LSE) instead of MLE. Here $\gamma_i$ is the target for the $i^{th}$ sample and $x_i[j]$ is the $j^{th}$ feature of the $i^{th}$ sample point.

## a) Some of the model predictions are missing in the preceding table (marked with a ?). Calculate these.

Since $\text{error}_i = y_i - \sigma(z_i)$, each missing prediction is $\sigma(z_i) = y_i - \text{error}_i$.

**ID = 1**

- $y_1 = 1$
- $\text{error}_1 = 0.8586$

Hence,
$$
\sigma(z_1) = y_1 - \text{error}_1 = 1 - 0.8586 = 0.1414.
$$

**ID = 4**

- $y_4 = 0$
- $\text{error}_4 = -0.0018$

Hence,
$$
\sigma(z_4) = y_4 - \text{error}_4 = 0 - (-0.0018) = 0.0018.
$$

**ID = 6**

- $y_6 = 0$
- $\text{error}_6 = -0.0013$

Hence,
$$
\sigma(z_6) = y_6 - \text{error}_6 = 0 - (-0.0013) = 0.0013.
$$
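These back-calculated values can also be cross-checked by computing $\sigma(z_i)$ directly from the model weights and the histogram features. A minimal sketch (it reproduces 0.1414, 0.0018, and 0.0013):

```python
import math

# Model weights: intercept w[0], then w[1]..w[8] for GL-0 .. GL-7
w = [0.309, 0.100, -0.152, -0.163, 0.191, -0.631, -0.716, -0.478, -0.171]

# Histograms (GL-0 .. GL-7) of the three instances with missing predictions
histograms = {
    1: [37, 3, 1, 4, 1, 3, 2, 13],
    4: [31, 5, 3, 2, 5, 2, 5, 11],
    6: [31, 3, 5, 2, 3, 6, 2, 12],
}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for i, x in histograms.items():
    # z = w[0] + sum_j w[j] * x[j]
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    print(f"ID {i}: z = {z:.3f}, sigma(z) = {sigmoid(z):.4f}")
```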
Therefore, the three missing model predictions ($\sigma(z_i)$ values) are:

- **ID 1**: $\sigma(z_1) = 0.1414$
- **ID 4**: $\sigma(z_4) = 0.0018$
- **ID 6**: $\sigma(z_6) = 0.0013$

## b) Some of the Error and Squared Error values are missing in the preceding table (marked with a ?). Calculate these values.

**ID = 0**

Given:
- $\sigma(z_0) = 0.0001$
- $y_0 = 0$
- $\text{error}_0 = -0.0001$ (already provided in the table)
- $\text{s.e.}_0 = ?$

Thus:
$$
\text{s.e.}_0 \;=\; (-0.0001)^2 \;=\; 0.00000001 \;\approx\; 0.0000
\quad \text{(rounded to four decimal places).}
$$
Hence the missing **s.e.** for ID=0 is **0.0000**.

**ID = 5**

Given:
- $\sigma(z_5) = 0.0256$
- $y_5 = 1$
- $\text{error}_5 = ?$
- $\text{s.e.}_5 = 0.9495$ (already provided in the table)

Thus:
$$
\text{error}_5 \;=\; y_5 - \sigma(z_5) \;=\; 1 - 0.0256 \;=\; 0.9744.
$$
Hence the missing **error** for ID=5 is **0.9744**.

**ID = 7**

Given:
- $\sigma(z_7) = 0.0045$
- $y_7 = 0$
- $\text{error}_7 = ?$

$$
\text{error}_7 \;=\; y_7 - \sigma(z_7) \;=\; 0 - 0.0045 \;=\; -0.0045.
$$
Hence the missing value for ID=7 is **error = -0.0045**.

**ID = 8**

Given:
- $\sigma(z_8) = 0.0209$
- $y_8 = 1$
- $\text{error}_8 = 0.9791$ (already in the table)
- $\text{s.e.}_8 = ?$

$$
\text{s.e.}_8 \;=\; (0.9791)^2 \;=\; 0.95864881 \;\approx\; 0.9586
\quad \text{(rounded to four decimal places).}
$$
Hence the missing **s.e.** for ID=8 is **0.9586**.

Final answers for the missing values:

- **ID 0**: $\text{s.e.} = 0.0000$
- **ID 5**: $\text{error} = 0.9744$
- **ID 7**: $\text{error} = -0.0045$
- **ID 8**: $\text{s.e.} = 0.9586$

## c) Some of the $w[j]$ error values are missing in the preceding table (marked with a ?). Calculate these.

The **$w[j]$ error** for sample $i$ is:
$$
\frac{\partial L(w)}{\partial w[j]} = (\gamma_i - \sigma(z_i)) \;\sigma(z_i)\bigl(1 - \sigma(z_i)\bigr)\; x_i[j].
$$

- $\gamma_i$ is the true label.
- $\sigma(z_i)$ is the model prediction.
- $x_i[j]$ is the feature value.
- $w[0]$ is the intercept ($x_i[0] = 1$).

### **ID = 2, $w[8]$ error**

- $\sigma(z_2) = 0.0000$, $\gamma_2 = 0$
- $\gamma_2 - \sigma(z_2) = 0$
- $\sigma(z_2)(1 - \sigma(z_2)) = 0$

$$
(\gamma_2 - \sigma(z_2))\,\sigma(z_2)(1-\sigma(z_2))\,x_2[8] = 0.0 \times 0.0 \times (\dots) = 0.0.
$$
$$
\boxed{w[8]\text{ error for ID=2} = 0.0000.}
$$

### **ID = 3, $w[0]$ error**

- $\sigma(z_3) = 0.0565$, $\gamma_3 = 1$
- $\gamma_3 - \sigma(z_3) = 0.9435$
- $\sigma(z_3)(1 - \sigma(z_3)) \approx 0.0533$

$$
\frac{\partial L}{\partial w[0]} = (0.9435) \times (0.0533) \times (1) \approx 0.0503.
$$
$$
\boxed{w[0]\text{ error for ID=3} = 0.0503.}
$$

### **ID = 7, $w[2]$ error**

- $\sigma(z_7) = 0.0045$, $\gamma_7 = 0$
- $\gamma_7 - \sigma(z_7) = -0.0045$
- $\sigma(z_7)(1-\sigma(z_7)) \approx 0.00448$
- $w[2]$ multiplies GL-1, and $x_7[2] = 4$

$$
\frac{\partial L}{\partial w[2]} = (-0.0045) \times 0.00448 \times 4 \approx -0.00008.
$$
Rounded to four decimal places: **-0.0001**.
$$
\boxed{w[2]\text{ error for ID=7} = -0.0001.}
$$

**Final missing $w[j]$ errors**

- **ID 2, $w[8]$** = 0.0000
- **ID 3, $w[0]$** = 0.0503
- **ID 7, $w[2]$** = -0.0001

## d) Calculate a new set of weights for this model using a learning rate of 0.01 with LSE.

The completed table of $w[j]$ errors (with the missing values filled in) is:
| ID | σ(zi)  | yi | error   | s.e.   | w[0]    | w[1]    | w[2]    | w[3]    | w[4]    | w[5]    | w[6]    | w[7]    | w[8]    |
| -- | ------ | -- | ------- | ------ | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| 0  | 0.0001 | 0  | -0.0001 | 0.0000 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 1  | 0.1414 | 1  | 0.8586  | 0.7373 | 0.1042  | 3.8562  | 0.3127  | 0.1042  | 0.4169  | 0.1042  | 0.3127  | 0.2084  | 1.3549  |
| 2  | 0.0000 | 0  | 0.0000  | 0.0000 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 3  | 0.0565 | 1  | 0.9435  | 0.8902 | 0.0503  | 1.9118  | 0.1006  | 0.1509  | 0.0000  | 0.0503  | 0.0503  | 0.2516  | 0.7044  |
| 4  | 0.0018 | 0  | -0.0018 | 0.0000 | -0.0001 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 5  | 0.0256 | 1  | 0.9744  | 0.9495 | 0.0243  | 0.7765  | 0.1456  | 0.0728  | 0.0485  | 0.0243  | 0.0243  | 0.1213  | 0.3397  |
| 6  | 0.0013 | 0  | -0.0013 | 0.0000 | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  | 0.0000  |
| 7  | 0.0045 | 0  | -0.0045 | 0.0000 | 0.0000  | -0.0006 | -0.0001 | -0.0001 | -0.0001 | -0.0000 | -0.0001 | -0.0001 | -0.0002 |
| 8  | 0.0209 | 1  | 0.9791  | 0.9586 | 0.0200  | 0.7598  | 0.0800  | 0.0800  | 0.0400  | 0.0400  | 0.0800  | 0.8000  | 0.1600  |
| 9  | 0.0292 | 1  | 0.9708  | 0.9425 | 0.0275  | 1.0447  | 0.0825  | 0.0550  | 0.0825  | 0.1100  | 0.1100  | 0.0275  | 0.2474  |

Let $G_j$ denote the column sum of the $w[j]$ errors over the ten training instances (for example, $G_0 = 0.2262$ and $G_1 = 8.3484$). A single GD update with learning rate $\alpha = 0.01$ is
$$
w[j] \;\leftarrow\; w[j] \;-\; \alpha\,G_j \;=\; w[j] \;-\; 0.01 \times G_j.
$$

1. **$w[0]$**:
   $$
   0.309 - 0.01 \times 0.2262 = 0.309 - 0.002262 \approx 0.3067.
   $$
2. **$w[1]$**:
   $$
   0.100 - 0.01 \times 8.3484 = 0.100 - 0.083484 \approx 0.0165.
   $$
3. **$w[2]$**:
   $$
   -0.152 - 0.01 \times 0.7213 = -0.152 - 0.007213 \approx -0.1592.
   $$
4. **$w[3]$**:
   $$
   -0.163 - 0.01 \times 0.4628 = -0.163 - 0.004628 \approx -0.1676.
   $$
5. **$w[4]$**:
   $$
   0.191 - 0.01 \times 0.5878 = 0.191 - 0.005878 \approx 0.1851.
   $$
6. **$w[5]$**:
   $$
   -0.631 - 0.01 \times 0.3288 = -0.631 - 0.003288 \approx -0.6343.
   $$
7. **$w[6]$**:
   $$
   -0.716 - 0.01 \times 0.5772 = -0.716 - 0.005772 \approx -0.7218.
   $$
8. **$w[7]$**:
   $$
   -0.478 - 0.01 \times 1.4087 = -0.478 - 0.014087 \approx -0.4921.
   $$
9. **$w[8]$**:
   $$
   -0.171 - 0.01 \times 2.8062 = -0.171 - 0.028062 \approx -0.1991.
   $$

Final Updated Weights

| $w[0]$ | $w[1]$ | $w[2]$  | $w[3]$  | $w[4]$ | $w[5]$  | $w[6]$  | $w[7]$  | $w[8]$  |
| ------ | ------ | ------- | ------- | ------ | ------- | ------- | ------- | ------- |
| 0.3067 | 0.0165 | -0.1592 | -0.1676 | 0.1851 | -0.6343 | -0.7218 | -0.4921 | -0.1991 |

## e) The following table shows handwritten examples of the digits 7 and 8 and their corresponding histogram values.

| ID | GL-0 | GL-1 | GL-2 | GL-3 | GL-4 | GL-5 | GL-6 | GL-7 |
| -- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|  | 35 | 1 | 5 | 4 | 5 | 2 | 4 | 8 |
|  | 30 | 6 | 2 | 0 | 5 | 4 | 4 | 13 |

### i Calculate the output of the model (using the updated weights calculated in the previous part) for these two instances.

$$
z \;=\; w[0]\cdot 1 \;+\; w[1]\cdot \text{GL-0} \;+\; \dots \;+\; w[8]\cdot \text{GL-7},
\qquad
\sigma(z) \;=\; \frac{1}{1 + e^{-z}}.
$$
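A short script (a sketch, using the rounded updated weights from part d) reproduces the hand calculations below:

```python
import math

# Updated weights from part (d), rounded to four decimal places
w = [0.3067, 0.0165, -0.1592, -0.1676, 0.1851, -0.6343, -0.7218, -0.4921, -0.1991]

# Histograms (GL-0 .. GL-7) of the two new instances
digit7 = [35, 1, 5, 4, 5, 2, 4, 8]
digit8 = [30, 6, 2, 0, 5, 4, 4, 13]

def predict(w, x):
    # z = w[0] + sum_j w[j] * x[j], then squash with the sigmoid
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return z, 1.0 / (1.0 + math.exp(-z))

for name, x in [("digit 7", digit7), ("digit 8", digit8)]:
    z, p = predict(w, x)
    print(f"{name}: z = {z:.4f}, sigma(z) = {p:.6f}")
```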
**For the digit "7"**, the histogram is:

| GL-0 | GL-1 | GL-2 | GL-3 | GL-4 | GL-5 | GL-6 | GL-7 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 35   | 1    | 5    | 4    | 5    | 2    | 4    | 8    |

Compute $z_7$:
$$
\begin{aligned}
z_7 &= w[0]\cdot 1 + w[1]\cdot 35 + w[2]\cdot 1 + w[3]\cdot 5 + w[4]\cdot 4 + w[5]\cdot 5 + w[6]\cdot 2 + w[7]\cdot 4 + w[8]\cdot 8 \\[6pt]
&= 0.3067 + (0.0165)(35) + (-0.1592)(1) + (-0.1676)(5) + (0.1851)(4) + (-0.6343)(5) + (-0.7218)(2) + (-0.4921)(4) + (-0.1991)(8) \\[6pt]
&\approx -7.5489.
\end{aligned}
$$
$$
\sigma(z_7) = \frac{1}{1+e^{-(-7.5489)}} = \frac{1}{1+e^{7.5489}} \;\approx\; 0.00053.
$$

**For the digit "8"**, the histogram is:

| GL-0 | GL-1 | GL-2 | GL-3 | GL-4 | GL-5 | GL-6 | GL-7 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 30   | 6    | 2    | 0    | 5    | 4    | 4    | 13   |

Compute $z_8$:
$$
\begin{aligned}
z_8 &= w[0]\cdot 1 + w[1]\cdot 30 + w[2]\cdot 6 + w[3]\cdot 2 + w[4]\cdot 0 + w[5]\cdot 5 + w[6]\cdot 4 + w[7]\cdot 4 + w[8]\cdot 13 \\[6pt]
&= 0.3067 + (0.0165)(30) + (-0.1592)(6) + (-0.1676)(2) + (0.1851)(0) + (-0.6343)(5) + (-0.7218)(4) + (-0.4921)(4) + (-0.1991)(13) \\[6pt]
&\approx -11.1041.
\end{aligned}
$$
$$
\sigma(z_8) = \frac{1}{1 + e^{11.1041}} \;\approx\; 0.000015.
$$

### ii Comment on the appropriateness of these outputs.

Using the updated logistic regression weights, both new images produce very negative $z$ values, hence very small sigmoid outputs:

- **Predicted probability for "7"** $\approx 0.00053$.
- **Predicted probability for "8"** $\approx 0.000015$.

In other words, the model strongly assigns both samples to the "digit 0" class rather than "digit 1". These outputs are not really appropriate: the model was trained only to separate 0s from 1s, so although it produces a probability for any input, its confident predictions for digits 7 and 8, which lie outside the classes it was trained on, are not meaningful.

## f) Now calculate new weights using MLE and one sample point ID = 0 only instead of step (d) and repeat step (e). Use the same learning rate.

For a single sample, the MLE (cross-entropy) gradient is $\frac{\partial L}{\partial w[j]} = \bigl(\sigma(z_0) - y_0\bigr)\,x_0[j]$. With $\sigma(z_0) = 0.0001$, $y_0 = 0$, and the ID-0 histogram $(31, 3, 6, 2, 7, 5, 6, 4)$:
$$
\begin{aligned}
\frac{\partial L}{\partial w[0]} &=\; 0.0001\times1 \,=\,0.0001,\\
\frac{\partial L}{\partial w[1]} &=\; 0.0001\times31 \,=\,0.0031,\\
\frac{\partial L}{\partial w[2]} &=\; 0.0001\times3 \,=\,0.0003,\\
\frac{\partial L}{\partial w[3]} &=\; 0.0001\times6 \,=\,0.0006,\\
\frac{\partial L}{\partial w[4]} &=\; 0.0001\times2 \,=\,0.0002,\\
\frac{\partial L}{\partial w[5]} &=\; 0.0001\times7 \,=\,0.0007,\\
\frac{\partial L}{\partial w[6]} &=\; 0.0001\times5 \,=\,0.0005,\\
\frac{\partial L}{\partial w[7]} &=\; 0.0001\times6 \,=\,0.0006,\\
\frac{\partial L}{\partial w[8]} &=\; 0.0001\times4 \,=\,0.0004.
\end{aligned}
$$
A gradient-descent update with $\alpha = 0.01$ gives:
$$
\begin{aligned}
w[0] &\approx 0.309 - 0.01 \times 0.0001 = 0.309 - 0.000001 = 0.308999 \approx 0.3090,\\
w[1] &\approx 0.100 - 0.01 \times 0.0031 = 0.100 - 0.000031 = 0.099969 \approx 0.1000,\\
w[2] &\approx -0.152 - 0.01 \times 0.0003 = -0.152003 \approx -0.1520,\\
w[3] &\approx -0.163 - 0.01 \times 0.0006 = -0.163006 \approx -0.1630,\\
w[4] &\approx 0.191 - 0.01 \times 0.0002 = 0.190998 \approx 0.1910,\\
w[5] &\approx -0.631 - 0.01 \times 0.0007 = -0.631007 \approx -0.6310,\\
w[6] &\approx -0.716 - 0.01 \times 0.0005 = -0.716005 \approx -0.7160,\\
w[7] &\approx -0.478 - 0.01 \times 0.0006 = -0.478006 \approx -0.4780,\\
w[8] &\approx -0.171 - 0.01 \times 0.0004 = -0.171004 \approx -0.1710.
\end{aligned}
$$

**Digit "7"**

Histogram: $\mathrm{GL{-}0}=35,\ \mathrm{GL{-}1}=1,\ \mathrm{GL{-}2}=5,\ \mathrm{GL{-}3}=4,\ \mathrm{GL{-}4}=5,\ \mathrm{GL{-}5}=2,\ \mathrm{GL{-}6}=4,\ \mathrm{GL{-}7}=8.$

Using the (slightly) updated weights $w^{\text{new}}[j]$ (kept to 6 decimals for accuracy), we get
$$
\begin{aligned}
z_{7} &\;=\;0.308999 \;+\; (0.099969)\times35 \;+\;(-0.152003)\times1 \;+\;(-0.163006)\times5 \;+\;(0.190998)\times4 \;+\;(-0.631007)\times5 \;+\;(-0.716005)\times2 \;+\;(-0.478006)\times4 \;+\;(-0.171004)\times8\\
&\;\approx\;-4.2622.
\end{aligned}
$$
$$
\sigma(z_7) \;=\; \frac{1}{1+e^{-(-4.2622)}} \;=\; \frac{1}{1 + e^{4.2622}} \;\approx\; 0.014.
$$

**Digit "8"**

Histogram: $\mathrm{GL{-}0}=30,\ \mathrm{GL{-}1}=6,\ \mathrm{GL{-}2}=2,\ \mathrm{GL{-}3}=0,\ \mathrm{GL{-}4}=5,\ \mathrm{GL{-}5}=4,\ \mathrm{GL{-}6}=4,\ \mathrm{GL{-}7}=13.$

Similarly,
$$
\begin{aligned}
z_{8} &=\;0.308999 + (0.099969)\times30 + (-0.152003)\times6 + (-0.163006)\times2 + (0.190998)\times0 + (-0.631007)\times5 + (-0.716005)\times4 + (-0.478006)\times4 + (-0.171004)\times13\\
&\approx\;-8.0841.
\end{aligned}
$$
$$
\sigma(z_8) \;=\; \frac{1}{1 + e^{8.0841}} \;\approx\;0.0003.
$$

## Conclusion

Because the update from a single, almost-correct example $(y_0=0,\ \sigma(z_0)=0.0001)$ is vanishingly small,

- **all new weights remain nearly the same** as before (to four decimals), and
- consequently, the model outputs for the "7" and "8" digits are still very close to zero: about 1.4% for the "7" and 0.03% for the "8". Both are strongly classified as "digit 0" in this scenario.
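For reference, a small sketch of this single-sample MLE update and re-evaluation (using the tabulated $\sigma(z_0) = 0.0001$ for ID 0) reproduces the $\approx 0.014$ and $\approx 0.0003$ outputs:

```python
import math

# Original weights w[0]..w[8] and the ID-0 histogram (GL-0 .. GL-7)
w = [0.309, 0.100, -0.152, -0.163, 0.191, -0.631, -0.716, -0.478, -0.171]
x0 = [1, 31, 3, 6, 2, 7, 5, 6, 4]   # leading 1 for the intercept term
y0, sigma0 = 0, 0.0001              # label and tabulated prediction for ID 0
alpha = 0.01

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# MLE (cross-entropy) gradient for one sample: (sigma(z) - y) * x[j]
w_new = [wj - alpha * (sigma0 - y0) * xj for wj, xj in zip(w, x0)]

# Re-evaluate the digit-7 and digit-8 histograms with the updated weights
for name, x in [("digit 7", [35, 1, 5, 4, 5, 2, 4, 8]),
                ("digit 8", [30, 6, 2, 0, 5, 4, 4, 13])]:
    z = w_new[0] + sum(wj * xj for wj, xj in zip(w_new[1:], x))
    print(f"{name}: z = {z:.4f}, sigma(z) = {sigmoid(z):.4f}")
```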