Machine Learning 機器學習
Performance Metrics
Confusion Matrix
| Actual Condition \ Predicted Condition | Positive (PP) | Negative (PN) |
| --- | --- | --- |
| Positive (P) | True Positive (TP) | False Negative (FN, Type II error) |
| Negative (N) | False Positive (FP, Type I error) | True Negative (TN) |
Total Population = P + N
準確率 Accuracy = (TP + TN) / (TP + FP + FN + TN)
The proportion of correct predictions over all samples.
精確率 Precision = TP / (TP + FP)
The proportion of correct predictions among the samples predicted positive (PP).
The smaller FP is, the better ⇔ lowers the chance of misjudging a negative as positive.
- e.g., security systems
召回率 Recall = TP / (TP + FN)
The proportion of correct predictions among the actually positive samples (P).
Raising the prediction accuracy among positives (P) ⇔ finding as many of the positives as possible.
- e.g., face recognition to identify wanted suspects
F1-score = 2 / [(1 / Precision) + (1 / Recall) ]
The F-score balances Precision and Recall.
$$ F_{\beta} \textit{-} score := \frac{(1 + \beta^2) \, Precision \times Recall}{\beta^2 \cdot Precision + Recall} $$
Setting β = 1 ⇒ F1-score:
$$ \frac{2}{F_1 \textit{-} score} = \frac{1}{Precision} + \frac{1}{Recall} \Rightarrow F_1 \textit{-} score = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} $$
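As a minimal sketch (plain Python; the counts below are illustrative, not from the notes), all four metrics can be computed directly from the confusion-matrix entries:

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of precision and recall
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only
print(confusion_metrics(tp=90, fp=10, fn=5, tn=95))
```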
Supervised Learning 監督式學習
Defined by its use of labeled datasets
to train algorithms to classify data (Classification)
or predict outcomes accurately (Regression).
Features
\begin{align}
x_j & := j^{th} \text{ feature} \\
n & := \text{number of features} \\
\vec{x}^{(i)} & := \text{features of } i^{th} \text{ training example} \\
x_j^{(i)} & := \text{value of feature } j \text{ in } i^{th} \text{ training example}
\end{align}
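As a small illustration (assuming the training features are stored in a NumPy array X of shape (m, n); the values are made up), this notation maps onto array indexing as follows. Note that Python indices start at 0:

```python
import numpy as np

# Illustrative training set: m = 3 examples, n = 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

m, n = X.shape   # m = number of examples, n = number of features
x_1 = X[0]       # features of the 1st training example, i.e. vec{x}^(1)
x_2_1 = X[0, 1]  # value of feature 2 in the 1st training example, i.e. x_2^(1)
```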
Data Annotation/Labeling
Data annotation is the process of labeling individual elements of training data (whether text, images, audio, or video) to help machines understand what exactly is in it and what is important. Data annotation also plays a part in the larger quality control process of data collection, as well-annotated datasets become ground truth datasets: data that is held up as a gold standard and used to measure model performance and the quality of other datasets.
Feature Scaling
\begin{align}
\mu_{x} &: \text{mean of }x \\
\sigma_{x} &: \text{standard deviation of }x
\end{align}
Mean Normalization
$$ x^{(i)} = \frac{x^{(i)} - \mu_{x}}{\max(x) - \min(x)} $$
Z-score Normalization
$$ x^{(i)} = \frac{x^{(i)} - \mu_{x}}{\sigma_{x}} $$
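A minimal NumPy sketch of both scaling methods (the feature values are illustrative):

```python
import numpy as np

def mean_normalize(x: np.ndarray) -> np.ndarray:
    """Mean normalization: (x - mean) / (max - min)."""
    return (x - x.mean()) / (x.max() - x.min())

def zscore_normalize(x: np.ndarray) -> np.ndarray:
    """Z-score normalization: (x - mean) / standard deviation."""
    return (x - x.mean()) / x.std()

x = np.array([120.0, 250.0, 175.0, 300.0])  # illustrative feature values
print(mean_normalize(x))
print(zscore_normalize(x))
```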
Feature Engineering
Using intuition to design new features,
by transforming or combining original features.
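For instance (a hypothetical housing example, not from the notes), two original features can be combined into a new, more informative one:

```python
import numpy as np

# Hypothetical raw features: lot frontage and lot depth of three houses
frontage = np.array([30.0, 40.0, 25.0])
depth = np.array([60.0, 50.0, 80.0])

# Engineered feature: lot area, obtained by combining the two original features
area = frontage * depth

# Design matrix containing both the original and the engineered features
X = np.column_stack([frontage, depth, area])
```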
Regression Analysis 迴歸分析 (the dependent variable Y is continuous)
Definition
Find a model f that describes the relationship between one or more independent variables X / [X_1, X_2, …, X_n] and the dependent variable Y.
\begin{align}
(x^{(i)} , y^{(i)}) \ \text{or} \ ([x_1^{(i)}, x_2^{(i)}, …, x_n^{(i)}], y^{(i)}) & := i^{th} \text{ training example} \\
\hat{y}^{(i)} = f(x^{(i)}) \ \text{or} \ \hat{y}^{(i)} = f(x_1^{(i)}, x_2^{(i)}, …, x_n^{(i)}) & := \text{model}
\end{align}
※ Independent Variable (自變數):
the "cause"; the features, used to explain other variables
※ Dependent Variable (應變數):
the "effect"; the targets, the variables being explained
Linear Regression 線性迴歸
- Univariate Linear Regression Model(單變數線性迴歸模型)
$$ \hat{y} = f_{w, b}(x) = w \cdot x + b $$
- Multiple Linear Regression Model(多變數線性迴歸模型)
$$ \hat{y} = f_{\vec{w}, b}(\vec{x}) = w_1 \cdot x_1 + w_2 \cdot x_2 + … + w_n \cdot x_n + b $$
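A minimal sketch of the multiple linear regression prediction (the parameter values are illustrative):

```python
import numpy as np

def predict(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Multiple linear regression: y_hat = w . x + b."""
    return float(np.dot(w, x)) + b

w = np.array([2.0, -1.0, 0.5])  # illustrative parameters
b = 3.0
x = np.array([1.0, 4.0, 2.0])   # features of one example
print(predict(x, w, b))         # 2*1 - 1*4 + 0.5*2 + 3 = 2.0
```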
Cost Function
A cost function J measures how well a machine learning model performs on a given dataset. The goal is to find the values of the model parameters for which the cost function returns as small a number as possible.
$$ \text{Goal: Find } \mathop{min \ }_{\theta_1, \theta_2, …}{J(\theta_1, \theta_2, …)} $$
Squared Error Cost Function
\begin{align}
& J(\theta_1, \theta_2, …) = \frac{1}{2m}\sum_{i = 1}^{m}{(\hat{y}^{(i)} - y^{(i)})^2} \\
& \theta_1, \theta_2, … := \text{parameters of } y = f(x)
\end{align}
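As a sketch, the squared error cost over m training examples can be computed as:

```python
import numpy as np

def squared_error_cost(y_hat: np.ndarray, y: np.ndarray) -> float:
    """J = (1 / 2m) * sum over i of (y_hat_i - y_i)^2."""
    m = len(y)
    return float(np.sum((y_hat - y) ** 2) / (2 * m))
```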
Polynomial Regression 多項式迴歸
$$ \hat{y} = f_{\vec{w}, b}(x) = w_1 \cdot x + w_2 \cdot x^2 + … + w_n \cdot x^n + b $$
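Polynomial regression can be treated as linear regression on powers of x used as extra features. A sketch (the degree and values are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
degree = 3
# Columns are x, x^2, x^3; the model then stays linear in the parameters w
X_poly = np.column_stack([x ** k for k in range(1, degree + 1)])

w = np.array([0.5, -0.2, 0.1])  # illustrative parameters
b = 1.0
y_hat = X_poly @ w + b
```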
Gradient Descent
$$ \theta_{k + 1} = \theta_k - \alpha \frac{\partial}{\partial \theta} J $$
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point.
Checking Gradient Descent for Convergence.
\begin{eqnarray}
& \text{Convergence } : \text{ If } J_{k - 1} - J_k < \varepsilon & \\
& \varepsilon : \text{ a small threshold (such as 0.001)} &
\end{eqnarray}
Choosing the learning rate.
- Make sure α > 0 and of an appropriate size.
- You might start by trying a learning rate of 0.001,
then try values roughly 10 times larger, say 0.01, 0.1, and so on.
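A sketch of batch gradient descent for linear regression, combining the update rule, the convergence check, and a user-chosen learning rate (the defaults for alpha, epsilon, and max_iters are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, epsilon=1e-3, max_iters=10_000):
    """Batch gradient descent for linear regression with the squared error cost.

    Stops when the decrease in cost between two iterations falls below epsilon.
    """
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    prev_cost = np.inf
    for _ in range(max_iters):
        error = X @ w + b - y               # y_hat - y for every example
        cost = np.sum(error ** 2) / (2 * m)
        if prev_cost - cost < epsilon:      # convergence check
            break
        prev_cost = cost
        w -= alpha * (X.T @ error) / m      # w := w - alpha * dJ/dw
        b -= alpha * error.sum() / m        # b := b - alpha * dJ/db
    return w, b

# Illustrative usage on data generated from y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
w, b = gradient_descent(X, y, alpha=0.05)
```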
Classification 分類 (the dependent variable Y is discrete)
Predict categories.
Decision Boundary
In a statistical-classification problem with two classes,
a decision boundary or decision surface
is a hypersurface that partitions the underlying vector space into two sets,
one for each class.
The classifier will classify all the points on one side of the decision boundary
as belonging to one class
and all those on the other side as belonging to the other class.
Logistic Regression
Logistic regression is a binary classification method.
Sigmoid Function
$$ g(z) = \tfrac{1}{1 + e^{-z}} \quad 0 < g(z) < 1 $$
Logistic Function
$$ f(x) = \frac{L}{1 + e^{-k(x - x_0)}} $$
\begin{align}
& x_0 \text{: the } x \text{ value of the function’s midpoint} \\
& L \text{: the supremum of the values of the function} \\
& k \text{: the logistic growth rate or steepness of the curve}
\end{align}
Definition of Logistic Regression
\begin{align}
& \text{Let } z = 0 \text{ be the decision boundary.} \\
& P(Y = 1|X = x) = g(z) = \frac{1}{1 + e^{-z}} \text{ is the “probability” that y is 1.} \\
& \hat{y} =
\begin{cases}
1 &\quad \text{if } P(Y = 1|X = x) \geq \text{threshold (usually 0.5)} \\
0 &\quad \text{otherwise}
\end{cases}
\end{align}
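A minimal sketch of this prediction step (threshold 0.5 as above; the parameters would come from training and are not shown here):

```python
import numpy as np

def sigmoid(z: float) -> float:
    """Sigmoid function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x: np.ndarray, w: np.ndarray, b: float, threshold: float = 0.5) -> int:
    """Logistic regression: estimate P(Y = 1 | X = x) and threshold it into a label."""
    p = sigmoid(np.dot(w, x) + b)
    return 1 if p >= threshold else 0
```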
Loss Function
The squared error cost used in linear regression is not convex when applied to logistic regression, so the following loss function is used instead to keep the cost convex:
\begin{align}
L(\hat{y}^{(i)}, y^{(i)}) =
\begin{cases}
-\log(\hat{y}^{(i)}) &\quad \text{if } y^{(i)} = 1 \\
-\log(1 - \hat{y}^{(i)}) &\quad \text{if } y^{(i)} = 0
\end{cases}
\end{cases}
\end{align}
Simplified:
$$ L(\hat{y}^{(i)}, y^{(i)}) = -y^{(i)} \log(\hat{y}^{(i)}) - (1 - y^{(i)})\log(1 - \hat{y}^{(i)}) $$
Cost Function
$$ J = \frac{1}{m}\sum^m_{i = 1}L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum^m_{i = 1}\left(y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\right) $$
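As a sketch, this cost can be computed with NumPy (y_hat is assumed to hold the model's predicted probabilities, strictly between 0 and 1):

```python
import numpy as np

def logistic_cost(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Cross-entropy cost J = -(1/m) * sum(y * log(y_hat) + (1 - y) * log(1 - y_hat))."""
    m = len(y)
    return float(-np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m)
```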
Ensemble learning
In statistics and machine learning,
ensemble methods use multiple learning algorithms
to obtain better predictive performance
than could be obtained from any of the constituent learning algorithms alone.
Boosting
A strong learner is created from a set of weak learners.
A weak learner is defined to be a classifier
that is only slightly correlated with the true classification.
(It can label examples better than random guessing)
In contrast,
a strong learner is a classifier
that is arbitrarily well-correlated with the true classification.
Random Forest
k-Nearest Neighbors Algorithm
Support Vector Machine
Relevance Vector Machine
Neural Network 
Underfitting | Generalization | Overfitting
Our goal when creating a model
is to be able to use the model to predict outcomes correctly for new examples.
A model which does this is said to generalize well.
Underfitting: High Bias
Underfitting occurs when a mathematical model
cannot adequately capture the underlying structure of the data.
An under-fitted model is a model where some parameters or terms
that would appear in a correctly specified model are missing.
Under-fitting would occur,
for example, when fitting a linear model to non-linear data.
Such a model will tend to have poor predictive performance.
泛化 Generalization: Just Right
Overfitting: High Variance
In mathematical modeling, overfitting is
“the production of an analysis that corresponds too closely or exactly to a particular set of data,
and may therefore fail to fit additional data or predict future observations reliably”.
An overfitted model is a mathematical model
that contains more parameters than can be justified by the data.
In a mathematical sense,
these parameters represent the degree of a polynomial.
The essence of overfitting is
to have unknowingly extracted some of the residual variation (i.e., the noise)
as if that variation represented the underlying model structure.
Addressing Overfitting
To Collect More Training Data
One way to address Overfitting is to collect more training data.
If you’re able to get more data, with the larger training set,
the learning algorithm will learn to fit a function that is less wiggly.
You can continue to fit a high-order polynomial
or some of the functions with a lot of features,
and if you have enough training examples, it will still do okay.
But getting more data isn’t always an option.
Feature Selection
If you have too many features but don’t have enough training data,
then your learning algorithm may also overfit to your training set.
If you instead pick just a subset of the most useful features,
the ones you think are the most relevant,
then using only that smaller subset of features,
you may find that your model no longer overfits as badly.
One way you could do so is to use your intuition
to choose what you think is the best set of features.
Choosing the most appropriate set of features to use
is sometimes also called feature selection.
Now, one disadvantage of feature selection
is that by using only a subset of the features,
the algorithm is throwing away some of the information,
and useful features could be lost.
Maybe you don’t want to throw away some of the information
by throwing away some of the features.
Regularization 
Regularization lets you keep all of your features,
but it prevents the features from having an overly large effect,
which is what can sometimes cause overfitting.
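As an illustration (assuming L2 regularization added to a squared error cost; the notes do not fix a specific form), a penalty proportional to the squared weights can be added so that large weights are discouraged:

```python
import numpy as np

def regularized_cost(y_hat: np.ndarray, y: np.ndarray, w: np.ndarray,
                     lambda_: float = 1.0) -> float:
    """Squared error cost plus an L2 penalty on the weights (b is not penalized)."""
    m = len(y)
    mse = np.sum((y_hat - y) ** 2) / (2 * m)
    penalty = lambda_ * np.sum(w ** 2) / (2 * m)  # shrinks the weights toward zero
    return float(mse + penalty)
```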
Unsupervised Learning 非監督式學習
Clustering 分群
Dimensionality Reduction 降維
Anomaly Detection 異常檢測
Reinforcement learning 強化學習
Recommender system 推薦系統
Last Updated on 2023/08/02 by A1go