The course was taught in Spring 2017. The course material is here.
If an input has $N$ possible discrete values, we should encode the $i$-th value as
\[(0, \dots, \stackrel{i}{1}, \dots, 0)\]In essence, we introduce an $N$-dimensional binary-valued representation for the input.
It is not possible to represent it as a single variable in $\mathbb{Z}_N$, because that variable would be embedded in $\mathbb{R}$ and would introduce a spurious magnitude between the values.
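As a concrete illustration, here is a minimal NumPy sketch of this encoding (the helper name `one_hot` is just an assumption for this example, not something from the course material):

```python
import numpy as np

def one_hot(i, N):
    """Encode the i-th of N discrete values as an N-dimensional binary vector."""
    v = np.zeros(N)
    v[i] = 1.0
    return v

# Example: the value with index 2 out of N = 5 possible values
print(one_hot(2, 5))  # [0. 0. 1. 0. 0.]
```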
Ridge regression shares the idea of weight decay in machine learning, but their starting points differ.
When we introduce the regularization term, it leads to a new solution:
\[w = (X^T X + \lambda I)^{-1} X^T y\]I originally thought that a small enough $\lambda$ should both ensure the stability of the solution and approximate the original solution. But in fact the $\lambda$ here can be plugged back into the original loss function, which then becomes
\[L(w) = ||y-Xw||_2^2 + \lambda||w||_2^2\]The latter term is simply the weight decay penalty in machine learning.
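To make the equivalence concrete, here is a minimal NumPy sketch (the data and variable names are made up for illustration, not taken from the course) that computes the closed-form ridge solution and checks that the gradient of the regularized loss vanishes there:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                      # design matrix
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)
lam = 0.1                                         # regularization strength lambda

# Closed-form ridge solution: w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# The same w minimizes L(w) = ||y - Xw||_2^2 + lambda ||w||_2^2,
# so the gradient 2 X^T (Xw - y) + 2 lambda w should vanish at w_ridge.
grad = 2 * X.T @ (X @ w_ridge - y) + 2 * lam * w_ridge
print(np.allclose(grad, 0))  # True
```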
Written on April 17th, 2017 by Hanezu