Notes for Machine Learning
发表于 2023-07-14更新于 365 日前语言暂不支持许可证 CC BY-NC-SA 4.0aimachine-learningcourseranotepythonoctavelinear-regressionlogistic-regression目录
- Lesson 1
- Lesson 2
- Machine Learning definition
- Machine learning algorithms
- Lesson 3
- Lesson 4
- Lesson 5
- Lesson 6
- Lesson 7
- Lesson 8
- Lesson 9
- Lesson 10
- Lesson 11
- Lesson 12
- Lesson 13
- Lesson 14 - 19
- Lesson 20 - 26
- Lesson 27
- Lesson 28
- Lesson 29
- Feature Scaling (特征缩放)
- Mean normalization (均值归一化)
- Lesson 30
- "Debugging": How to make sure gradient descent is working correctly
- How to choose learning rate
- Lesson 31
- Lesson 32
- Lesson 33
- Lesson 34 - 35
- Lesson 36
- Lesson 37
- Lesson 38 - 42
- Lesson 43 (Exercise 1)
- ex1-plotData.py
- ex1-computeCost.py
- ex1-gradientDescent.py
- ex1-featureNormalize.py
- ex1-normalEqn.py
- Lesson 44
- Lesson 45
- Logistic Regression Model
- Interpretation of Hypothesis Output
- Lesson 46
- Logistic regression
- Decision Boundary (决策界限)
- Lesson 47
- Lesson 48
- Lesson 49
- Lesson 50
- Multiclass classification (多类别分类)
- One-vs-all (one-vs-rest)
- Lesson 51
- Lesson 52
- What's overfitting?
- Addressing overfitting
- Lesson 53
- Lesson 54
- Gradient descent
- Normal equation
- Lesson 55
- Lesson 56 (Exercise 2)
- ex2-plotData.py
- ex2-sigmoid.py
- ex2-constFunction.py
- ex2-predict.py
- ex2-costFunctionReg.py
- Lesson 57
- Lesson 58
- Lesson 59
- Neural Networks
- The "one learning algorithm" hypothesis
- Lesson 60
- Neural Network
- Lesson 61
- Forward propagation (前向传播): Vectorized implementation
- Lesson 62 (Exercise 3)
- Lesson 63 - 64
- Lesson 65
- Lesson 66
- Lesson 67
- Lesson 68
- Gradient computation: Backpropagation algorithm (反向传播算法)
- Backpropagation algorithm
Here are my notes for course Machine Learning taught by Andrew Ng.
About exercise: I didn't do the original version exercise which depends on Octave or Matlab, but a third-party Python version (See nsoojin/coursera-ml-py).
Lesson 1
Spam: 垃圾邮件 Spam filter: 垃圾邮件过滤器
Examples:
- Database mining (Web click data, medical records, biology, engineering)
- Applications can't program by hand (Autonomous helicopter, handwriting recognition, most of Natural Language Processing(NLP), Computer Vision)
- Self-customizing programs (Amazon, Netflix product recommendations)
- Understanding human learning (brain, real AI)
Lesson 2
Machine Learning definition
- Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. (如果一个计算机程序在任务 T 和性能指标 P 方面的性能指标随着经验 E 的增加而提高,则称该计算机程序从经验 E 中学习。)
Machine learning algorithms
- Supervised learning (监督学习)
- Unsupervised learning (无监督学习)
Others:
- Reinforcement learning (强化学习)
- Recommender systems (推荐系统)
Lesson 3
In supervised learning, "right answers" are given.
A regression (回归) problem: predict continuous valued output
A classification (分类) problem: discrete valued output
Lesson 4
Given data set without labels, the machine finds some structures hiding in it.
Examples:
- Organize computing clusters (管理计算集群)
- Social network analysis
- Market segmentation
- Astronomical data analysis
Cocktail party problem (鸡尾酒会问题): extracting two audio from two audio sources
Lesson 5
...
Lesson 6
Linear Regression (线性回归)
Notation:
- = Number of training examples 
- 's = "input" variable / features 
- 's = "output" variable / "target" variable 
- = one training example 
- = - training example 
How does supervised learning work?

How do we represent 
where 
This is Linear regression with one variable, or Univariate linear regression (单变量线性回归).
Lesson 7
Cost function (代价函数)
Hypothesis: 
How to choose 
Idea: Choose 
Cost function (Square error cost function, 平方误差代价函数): 
We want to minimize the cost function.
Lesson 8
Hypothesis: 
Parameters: 
Cost Function: 
Goal: 
Lesson 9
contour plots / contour figures (等高线图)
We use contour plots / contour figures to show 
Lesson 10
Gradient Descent (梯度下降)
Here we use this algorithm to minimize the cost function.
Outline:
- Start with some , 
- Keep changing , to reduce until we hopefully end up at a minimum 
Gradient descent algorithm:
Repeat until convergence (收敛) {
}
Notice that we need to update 
Lesson 11
- If is too small, gradient descent can be slow; 
- If is too large, gradient descent can overshoot the minimum. It may fail to converge (收敛), or even diverge (发散). 
Gradient descent can converge to a local minimum, even with the learning rate 
Lesson 12
In linear regression, there is no local optimum except for a global optimum in its cost function.
"Batch" Gradient Descent (批量梯度下降): which means that each step of gradient descent uses all the training examples.
Lesson 13
...
Lesson 14 - 19
Matrix and Vector
Matrix Addition and Scalar Multiplication (标量乘法)
Matrix-vector multiplication
Matrix-matrix multiplication
Matrix multiplication properties
Identity Matrix (单位矩阵)
Matrix Inverse (逆) and Matrix Transpose (转置)
Lesson 20 - 26
...
Lesson 27
Multiple features (variables)
Notation:
- = number of features 
- = input (features) of - training example. 
- = value of feature - in - training example. 
Hypothesis:
For convenience of notation, define 
Multivariate linear regression (多元线性回归)
Lesson 28
Multivariate gradient descent:
- Hypothesis: 
- Parameters: , , ..., or (n+1)-dimensioned vector 
- Cost function: 
Gradient Descent:
Repeat: {
}
Lesson 29
Feature Scaling (特征缩放)
Idea: Make sure features are on a similar scale. (If you can make sure that the features are on a similar scale, by which I mean make sure that the different features take on similar ranges of values, then gradient descents can converge more quickly.)
Get every feature into approximately a 
Mean normalization (均值归一化)
Replace 
Lesson 30
"Debugging": How to make sure gradient descent is working correctly
画图:纵轴为 
How to choose learning rate 
- If is too small: slow convergence; 
- If is too large: may not decrease on every iteration; may not converge. 
To choose 
Lesson 31
Polynomial regression (多项式回归)
Use multivariate linear regression: 
Lesson 32
Normal equation (正规方程)
不迭代,直接求解 
In Octave: pinv(X'*X)*X'*y
| Gradient Descent | Normal Equation | 
|---|---|
| Need to choose | No need to choose | 
| Need many iterations | Don't need to iterate | 
| Work well even when | Need to compute | 
Lesson 33
Normal equation and non-invertibility (optional)
What if 
In Octave, there is two functions pinv (pseudo-inverse, 伪逆) and inv (inverse, 逆). The former will actually compute the answer you want even if 
Lesson 34 - 35
...
Lesson 36
Basic operations of Octave
% comment stating with %
5+6  % plus
3-2  % minus
5*8  % multiply
1/2  % divide by
2^6  % pow
% logic operation
1 == 2  % equal
1 ~= 2  % not equal
1 && 0  % AND
1 || 0  % OR
xor(1, 0)  % XOR
% customize prompt
PS1('>> '); % change the prompt to '>> '
% variable and assignment
a=3;  % semicolon suppressing output
b='hi';
c=(3>=1);
a=pi;
a  % this print "a = 3.1416"
disp(a);  % this print "3.1416"
disp(sprintf('2 decimals: %0.2f', a))  % this print "2 decimals: 3.14"
format long  % display more decimal parts
format short % display less ...
% matrix
A=[1 2; 3 4; 5 6]
% vector
v=[1 2 3]  % row vector
v=[1;2;3]  % column vector
v=1:0.1:2  % v=[1.0000 1.1000, 1.2000, ..., 2.0000]
v=1:6      % v=[1, 2, 3, 4, 5, 6]
ones(2,3)  % [1, 1, 1; 1, 1, 1]
C=2*ones(2,3)  % C=[2, 2, 2; 2, 2, 2]
w=zeros(1,3)  % w=[0, 0, 0]
w=rand(1, 3)  % a 1*3 matrix with random numbers
w=randn(1, 3) % normal random variable (正态分布)
hist(w)  % plot a histogram (直方图)
eye(4)  % a 4*4 identity matrix (单位矩阵)
help eye  % show helpLesson 37
size(A)  % return the size of matrix A
size(A,1)  % return the size of the first dimension of A
length(v)  % return the size of vector v (actually the longest dimension)
pwd  % similar to linux pwd
ls  % similar to linux ls
load featuresX.dat  % load the featuresX.dat file
load('featuresX.dat')
who  % show all the variables in the current scope
whos  % more detailed who command
clear(featuresX)  % delete the variable featuresX
v=priceY(1:10)  % set v to be the first ten elements of priceY
save hello.mat v;  % save v to a file called hello.mat (in binary)
clear  % delete all the variables
save hello.txt v -ascii  % save v to a file called hello.txt with ascii characters
A(3,2)
A(2,:)  % colon means every element along that row / column
A([1 3], :)  % get all the elements from the first and the third row
A(:,2) = [10;11;12]  % replace the second column with [10;11;12]
A=[A,[100;101;102]]  % append another column vector to right
A(:)  % put all elements of A into a single vector
...Lesson 38 - 42
...
Lesson 43 (Exercise 1)
ex1-plotData.py
See scatter
import matplotlib.pyplot as plt
def plot_data(x, y):
    # ===================== Your Code Here =====================
    # Instructions : Plot the training data into a figure using the matplotlib.pyplot
    #                using the "plt.scatter" function. Set the axis labels using
    #                "plt.xlabel" and "plt.ylabel". Assume the population and revenue data
    #                have been passed in as the x and y.
    # Hint : You can use the 'marker' parameter in the "plt.scatter" function to change the marker type (e.g. "x", "o").
    #        Furthermore, you can change the color of markers with 'c' parameter.
    plt.scatter(x, y, marker='x', c='r')
    plt.xlabel("Population")
    plt.ylabel("Revenue")
    # ===========================================================
    plt.show()ex1-computeCost.py
I'm a green hand to numpy, so I used a loop.
import numpy as np
def compute_cost(X, y, theta):
    # Initialize some useful values
    m = y.size
    cost = 0
    # ===================== Your Code Here =====================
    # Instructions : Compute the cost of a particular choice of theta.
    #                You should set the variable "cost" to the correct value.
    for i in range(m):
        h_theta = np.dot(X[i, :], theta)
        cost += (h_theta - y[i]) ** 2
    cost /= (2 * m)
    # ==========================================================
    return costLearn the elegant code from nsoojin!
ex1-gradientDescent.py
This code is from nsoojin, which is so elegant! I spent a long time to understand the most important two lines.
import numpy as np
from computeCost import *
def gradient_descent_multi(X, y, theta, alpha, num_iters):
    # Initialize some useful values
    m = y.size
    J_history = np.zeros(num_iters)
    for i in range(0, num_iters):
        # ===================== Your Code Here =====================
        # Instructions : Perform a single gradient step on the parameter vector theta
        #
        error = np.dot(X, theta).flatten() - y  # 误差
        theta -= alpha / m * np.sum(X * error[:, np.newaxis], 0)
        # ===========================================================
        # Save the cost every iteration
        J_history[i] = compute_cost(X, y, theta)
    return theta, J_history
ex1-featureNormalize.py
numpy.std compute the standard deviation (标准差) along the specified axis. (See Doc)
The hint says that:
To get the same result as Octave 'std', use np.std(X, 0, ddof=1)
where ddof means Delta Degrees of Freedom. The divisor used in calculation is N - ddof, where N represents the number of elements. By default ddof is zero.
Note
这实际上就是总体标准差和样本标准差的区别:
总体标准差:
样本标准差:
import numpy as np
def feature_normalize(X):
    # You need to set these values correctly
    n = X.shape[1]  # the number of features
    X_norm = X
    mu = np.zeros(n)
    sigma = np.zeros(n)
    # ===================== Your Code Here =====================
    # Instructions : First, for each feature dimension, compute the mean
    #                of the feature and subtract it from the dataset,
    #                storing the mean value in mu. Next, compute the
    #                standard deviation of each feature and divide
    #                each feature by its standard deviation, storing
    #                the standard deviation in sigma
    #
    #                Note that X is a 2D array where each column is a
    #                feature and each row is an example. You need
    #                to perform the normalization separately for
    #                each feature.
    #
    # Hint: You might find the 'np.mean' and 'np.std' functions useful.
    #       To get the same result as Octave 'std', use np.std(X, 0, ddof=1)
    #
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0, ddof=1)
    X_norm = (X - mu) / sigma
    # ===========================================================
    return X_norm, mu, sigma
ex1-normalEqn.py
Compute pseudo inverse with numpy.linalg.pinv():
import numpy as np
def normal_eqn(X, y):
    theta = np.zeros((X.shape[1], 1))
    # ===================== Your Code Here =====================
    # Instructions : Complete the code to compute the closed form solution
    #                to linear regression and put the result in theta
    #
    theta = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    return thetaLesson 44
Examples of classification problem:
- Email: Spam / Not Spam?
- Online Transactions (在线交易): Fraudulent (欺诈) (Yes / No)?
- Tumor (肿瘤): Malignant (恶性的) / Benign (良性的) ?
Binary classification problem:
Logistic Regression is a classification algorithm though it has "regression" in its name.
Lesson 45
Logistic Regression Model
Want 
Sigmoid function / Logistic function: 
And we get:
Interpretation of Hypothesis Output
Example: If
"Probability that 
Lesson 46
Logistic regression
Suppose
- predict "- predict "
Since we know that:
So:
- predict "- predict "
Decision Boundary (决策界限)
...
Lesson 47
Cost function:
- 
Linear regression: 
- 
Logistic regression cost function:
Captures intuition that if 
(When 
Lesson 48
Logistic regression cost function:
Since that 
To fit parameters 
To make a prediction given new 
Output 
Want 
Repeat {
}
in which
Algorithm looks identical to linear regression!
But attention that, there are two different 
Lesson 49
Optimization algorithms:
- Conjugate gradient (共轭梯度)
- BFGS
- L-BFGS
Advantages:
- No need to manually pick 
- Often faster than gradient descent
Disadvantages:
- More complex
In particular, you probably should not implement these algorithms (conjugate gradient, L-BFGS, BFGS) yourself, unless you're an expert in numerical computing.
Lesson 50
Multiclass classification (多类别分类)
Ex.
- Email foldering / tagging: Work, Friends, Family, Hobby
- Medical diagrams: Not ill, Cold, Flu
- Weather: Sunny, Cloudy, Rain, Snow
One-vs-all (one-vs-rest)
Train a logistic regression classifier 
On a new input 
Lesson 51
...
Lesson 52
The problem of overfitting (过拟合问题)
What's overfitting?

- Underfit: 欠拟合
- High bias: 高偏差
- Overfit: 过拟合
- High variance: 高方差
- generalize: 泛化
Addressing overfitting
Options:
- Reduce number of features
- Manually select which features to keep
- Model selection algorithm (later in course)
 
- Regularization (正则化)
- Keep all the features, but reduce magnitude (量级) / values of parameters 
- Works well when we have a lot of features, each of which contributes a bit to predicting 
 
- Keep all the features, but reduce magnitude (量级) / values of parameters 
Lesson 53
Regularization.
Small values for parameters 
- "Simpler" hypothesis
- Less prone to overfitting
Notice that usually we regularize only 
- Regularization term: 
- Regularization parameter (正则化参数): - Fit the training data well
- Keep the parameters small
 
What if 
Underfitting!
Lesson 54
Regularized linear regression
Gradient descent
Repeat {
}
Normal equation
Non-invertibility (optional / advanced)
...
Lesson 55
...
Lesson 56 (Exercise 2)
ex2-plotData.py
Use edgecolors= in numpy.scatter to customize the color of edges.
import matplotlib.pyplot as plt
import numpy as np
def plot_data(X, y):
    plt.figure()
    # ===================== Your Code Here =====================
    # Instructions : Plot the positive and negative examples on a
    #                2D plot, using the marker="+" for the positive
    #                examples and marker="o" for the negative examples
    #
    plt.scatter(x=X[y == 1, 0], y=X[y == 1, 1], marker='+', c="black")
    plt.scatter(x=X[y == 0, 0], y=X[y == 0, 1],
                marker='o', c="yellow", edgecolors="black")ex2-sigmoid.py
import numpy as np
def sigmoid(z):
    g = np.zeros(z.size)
    # ===================== Your Code Here =====================
    # Instructions : Compute the sigmoid of each value of z (z can be a matrix,
    #                vector or scalar
    #
    # Hint : Do not import math
    g = 1 / (1 + np.exp(-z))
    return gex2-constFunction.py
Cost function:
grad:
import numpy as np
from sigmoid import *
def cost_function(theta, X, y):
    m = y.size
    # You need to return the following values correctly
    cost = 0
    grad = np.zeros(theta.shape)
    # ===================== Your Code Here =====================
    # Instructions : Compute the cost of a particular choice of theta
    #                You should set cost and grad correctly.
    #
    h_theta = sigmoid(X @ theta)
    cost = 1 / m * np.sum(-y*np.log(h_theta)-(1-y)*np.log(1-h_theta), axis=0)
    grad = 1 / m * np.sum((h_theta-y)[:, np.newaxis]*X, axis=0)
    # ===========================================================
    return cost, grad
ex2-predict.py
import numpy as np
from sigmoid import *
def predict(theta, X):
    m = X.shape[0]
    # Return the following variable correctly
    p = np.zeros(m)
    # ===================== Your Code Here =====================
    # Instructions : Complete the following code to make predictions using
    #                your learned logistic regression parameters.
    #                You should set p to a 1D-array of 0's and 1's
    #
    h_theta = sigmoid(X @ theta)
    p[h_theta >= 0.5] = 1
    p[h_theta < 0.5] = 0
    # ===========================================================
    return p
ex2-costFunctionReg.py
Cost function:
grad:
For 
For 
import numpy as np
from sigmoid import *
def cost_function_reg(theta, X, y, lmd):
    m = y.size
    # You need to return the following values correctly
    cost = 0
    grad = np.zeros(theta.shape)
    # ===================== Your Code Here =====================
    # Instructions : Compute the cost of a particular choice of theta
    #                You should set cost and grad correctly.
    #
    h_theta = sigmoid(X @ theta)
    cost = 1 / m * np.sum((-y*np.log(h_theta)-(1-y) *
                          np.log(1-h_theta)), axis=0) + lmd / (2 * m) * (np.sum(np.power(theta, 2))-theta[0]**2)
    # let theta[0]=0 temporarily. This is helpful to calculation.
    theta_0 = theta[0]
    theta[0] = 0
    grad = (1 / m * np.sum((h_theta-y)
            [:, np.newaxis]*X, axis=0)) + lmd / m * theta
    theta[0] = theta_0
    # ===========================================================
    return cost, grad
Lesson 57
Regularized logistic regression
Gradient descent:
Repeat {
}
(看上去就和上边线性回归的一摸一样,但是使用了不同的 
Lesson 58
Non-linear hypotheses
When 
Lesson 59
Neurons and the brain
Neural Networks
Origins:
Algorithms that try to mimic (模仿) the brain.
Was very widely used in 80s and early 90s; popularity diminished in late 90s.
Recent resurgence (兴起): State-of-the-art technique for many applications
The "one learning algorithm" hypothesis
...
Lesson 60
Neural Network

- The input layer: the first layer
- The output layer: the last layer
- The hidden layer: the layer(s) between the first and the last layer
bias unit: 偏置单元
Notations:

If network has 
Lesson 61
Forward propagation (前向传播): Vectorized implementation

Lesson 62 (Exercise 3)
(Not yet completed)
Lesson 63 - 64
...
Lesson 65

Lesson 66
...
Lesson 67
Cost function in Neural network:
Lesson 68
Gradient computation: Backpropagation algorithm (反向传播算法)
Intuition: 
For each output unit (layer L = 4)
where
And then we can get:
Backpropagation algorithm
- Training set: 
- Set - For - Set 
- Perform forward propagation to compute - Using - Compute 
 
- Set