Contents
I. OpenCV continued: cascade classifiers
II. Evolutionary algorithm in C
III. LeNet network for MNIST handwritten digit recognition
IV. The simplest neural network model
V. Multi-Layered Perceptron
  1. Gradient Descent Optimization
  2. Multi-Layered Perceptrons and Backpropagation
  3. Single-layer perceptron model
  4. The network model
  5. Putting the code together
  6. A 3-layer network for MNIST handwritten digit recognition
VI. Neural Network Frameworks
  1. Keras
    <1> Training One-Layer Network (Perceptron)
    <2> Multi-Class Classification
    <3> Multi-Label Classification
    <4> Summary of Classification Loss Functions
  2. TensorFlow 2.x + Keras
Reference: microsoft/AI-For-Beginners: 12 Weeks, 24 Lessons, AI for All! (github.com)
I. OpenCV continued: cascade classifiers
Reference: OpenCV study notes — "Digital Image Processing Based on OpenCV" (CSDN blog)
#include "opencv.hpp"
#include "highgui.hpp"
#include "imgproc.hpp"
#include <vector>
using namespace cv;
using namespace std;
#pragma comment(lib,"opencv_world480d.lib")
VideoCapture capture(0);
Mat image;
CascadeClassifier face_cascade;
// detected faces
vector<Rect> faces;
int main()
{
    Mat frame_gray;
    face_cascade.load("OPENCV_INSTALL_PATH/opencv/sources/data/haarcascades/haarcascade_frontalface_alt.xml");
    while (capture.isOpened())
    {
        capture >> image;
        if (image.empty()) break;
        if (waitKey(1) == 27) break;
        // convert BGR to grayscale
        cvtColor(image, frame_gray, COLOR_BGR2GRAY);
        face_cascade.detectMultiScale(frame_gray, faces);
        for (size_t i = 0; i < faces.size(); i++)
        {
            // draw a box around each detected face
            rectangle(image, faces[i], Scalar(255, 0, 0), 1, 8);
        }
        imshow("Face detection", image);
    }
}
II. Evolutionary algorithm in C
III. LeNet network for MNIST handwritten digit recognition
IV. The simplest neural network model
A one-layer perceptron: a linear two-class classification model.
Perceptron Model:
Suppose our model has N features; the input is then a vector of size N. A perceptron is a binary classification model, i.e. it can distinguish between two classes of input data. We will assume that for each input vector x the output of the perceptron is either +1 or -1, depending on the class. The output is computed as:
y(x) = f(wᵀx)
Training the Perceptron:
To train the perceptron, we need to find a weight vector w that classifies most values correctly, i.e. results in the smallest error. The error is defined by the perceptron criterion:
E(w) = -∑ᵢ wᵀxᵢtᵢ
where the sum is taken over the training data points i that result in wrong classification, xᵢ is the input data, and tᵢ is -1 or +1 for negative and positive examples respectively.
This criterion is considered as a function of the weights w, and we need to minimize it. Often we use a method called gradient descent, in which we start from some initial weights w(0) and at each step update the weights according to the formula
w(t+1) = w(t) - η∇E(w)
where η is the so-called learning rate and ∇E(w) is the gradient of E. After computing the gradient, this becomes
w(t+1) = w(t) + η∑ᵢ xᵢtᵢ
//perceptron.h
#ifndef _PERCEPTRON_H
#define _PERCEPTRON_H
//the simplest neural network model - one-layered perceptron, a linear two-class classification model.
#include<stdio.h>
#include<stdlib.h>
#include<time.h>
#define FEATURE_NUM 2   //number of features (dimension of the input vector)
#define LEARNING_RATE 1 //learning rate
typedef struct input_data{
    double feature[FEATURE_NUM];
    int label;
}input_data;
typedef struct input_dataset{
    input_data* input;
    int set_num;
}input_dataset;
double weight[FEATURE_NUM]={0};
void train(input_dataset dataset,int iteration);
void perceptron(input_data *input);
#endif
//perceptron.c
#include"perceptron.h"
void train(input_dataset dataset,int iteration)
{
    //seed the random number generator
    srand((unsigned)time(NULL));
    int set_num=dataset.set_num;
    int i,j,k;
    for(i=0;i<iteration;i++){
        //pick a random training sample
        k=rand()%set_num;
        //gradient-descent update of the weights
        for(j=0;j<FEATURE_NUM;j++)
        {
            weight[j]+=1.0*LEARNING_RATE*dataset.input[k].feature[j]*dataset.input[k].label;
            // printf("%lf %lf\n",weight[j],dataset.input[k].feature[j]);
        }
    }
    return;
}
void perceptron(input_data *input){
    int i;
    double temp;
    for(i=0,temp=0;i<FEATURE_NUM;i++)temp+=weight[i]*input->feature[i];
    if(temp>=0)input->label=1;
    else input->label=-1;
    printf("label:%d\n",input->label);
    return;
}
#include<stdio.h>
#include"perceptron.c"
int main(){
    input_data input[4];
    input[0].feature[0]=-3.0;
    input[0].feature[1]=1.0;
    input[0].label=1;
    input[1].feature[0]=-1.0;
    input[1].feature[1]=3.0;
    input[1].label=1;
    input[2].feature[0]=2.0;
    input[2].feature[1]=4.0;
    input[2].label=-1;
    input[3].feature[0]=4.0;
    input[3].feature[1]=-2.0;
    input[3].label=-1;
    input_dataset dataset;
    dataset.input=input;
    dataset.set_num=4;
    train(dataset,10);
    int i;
    for(i=0;i<FEATURE_NUM;i++)printf("%lf\n",weight[i]);
    input_data test;
    scanf("%lf%lf",&test.feature[0],&test.feature[1]);
    perceptron(&test);
    return 0;
}
A Python implementation and MNIST handwritten digit recognition (two classes): [NeuralNetworks/03-Perceptron at main](https://github.com/microsoft/AI-For-Beginners/tree/main/lessons/3-NeuralNetworks/03-Perceptron)
(features: 28 px × 28 px images)
To implement an N-class perceptron, train N perceptrons:
- Create 10 one-vs-all datasets for all digits
- Train 10 perceptrons
- Define a `classify` function to perform digit classification
- Measure the accuracy of classification and print the confusion matrix
- [Optional] Create an improved `classify` function that performs the classification using one matrix multiplication (see the sketch below)
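The matrix-based `classify` step can be sketched roughly as follows. This is only an illustration with assumed names: a hypothetical `weights` array of shape (10, 784) stacking the ten one-vs-all perceptron weight vectors, and `images` of shape (n, 784) holding flattened 28×28 digits.
# Hedged sketch of the one-matrix-multiplication classifier (assumed shapes, random placeholders).
import numpy as np

def classify(images, weights):
    scores = images @ weights.T        # (n, 10): score of every digit perceptron for every image
    return np.argmax(scores, axis=1)   # the highest-scoring digit wins

# usage with random placeholders standing in for real data and trained weights
weights = np.random.randn(10, 28 * 28)
images = np.random.randn(5, 28 * 28)
print(classify(images, weights))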
V. Multi-Layered Perceptron
Overview:
We will extend the model above into a more flexible framework, allowing us to:
- perform multi-class classification in addition to two-class
- solve regression problems in addition to classification
- separate classes that are not linearly separable
We will also develop our own modular framework in Python that will allow us to
construct different neural network architectures.
Suppose we have a training dataset X with labels Y, and we need to build a **model f** that will make the most accurate predictions. The quality of predictions is measured by the **loss function** ℒ. The following loss functions are often used:
- For **regression problems**, when we need to predict a number, we can use the **absolute error** ∑ᵢ|f(x(i))-y(i)| or the **squared error** ∑ᵢ(f(x(i))-y(i))²
- For **classification problems**, we use **0-1 loss** (which is essentially the same as the accuracy of the model), or **logistic loss**.
Looking at how the predicted probability p influences the loss ℒ, the logistic loss is the better choice.
For the one-layer perceptron, the function f was defined as a **linear function f(x)=wx+b** (here w is the weight matrix, x is the vector of input features, and b is the bias vector). For different neural network architectures, this function can take a more complex form.
In the case of classification, it is often desirable to get **probabilities of the corresponding classes** as the network output. To convert arbitrary numbers to probabilities (e.g. to normalize the output), we often use the **softmax function** σ, and the function f becomes f(x)=σ(wx+b).
In the definition of f above, w and b are called **parameters** θ=⟨w,b⟩. Given the dataset ⟨X,Y⟩, we can compute an overall error on the whole dataset as a function of the parameters θ.
✅ The goal of neural network training is to minimize the error (Loss
function ℒ) by varying parameters θ
1. Gradient Descent Optimization
This can be formalized as follows:
- Initialize parameters by some random values w(0), b(0)
- Repeat the following step many times:
- w(i+1) = w(i)-η∂ℒ/∂w
- b(i+1) = b(i)-η∂ℒ/∂b
During training, the optimization steps are supposed to be calculated considering the whole dataset (remember that the loss is calculated as a sum over all training samples). However, in real life we take small portions of the dataset called **minibatches**, and calculate gradients based on a subset of the data. Because the subset is taken randomly each time, this method is called **stochastic gradient descent** (SGD); a rough sketch follows.
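As an illustration (not code from the original lesson), one SGD pass could look like the sketch below; `grad_loss(w, b, xb, yb)` is a hypothetical helper assumed to return ∂ℒ/∂w and ∂ℒ/∂b computed on the given minibatch.
# Minimal SGD sketch. `grad_loss` is a hypothetical helper returning (dL/dw, dL/db) on the minibatch.
import numpy as np

def sgd(w, b, X, Y, grad_loss, lr=0.1, batch_size=16, steps=1000):
    n = len(X)
    for _ in range(steps):
        idx = np.random.choice(n, batch_size, replace=False)  # random minibatch
        dw, db = grad_loss(w, b, X[idx], Y[idx])
        w = w - lr * dw    # w(i+1) = w(i) - eta * dL/dw
        b = b - lr * db    # b(i+1) = b(i) - eta * dL/db
    return w, b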
2. Multi-Layered Perceptrons and Backpropagation
An example: a two-layer perceptron.
One-layer network, as we have seen above, is capable of classifying linearly
separable classes. To build a richer model, we can combine several layers of
the network. Mathematically it would mean that the function f would have
a more complex form, and will be computed in several steps:
- z1=w1x+b1
- z2=w2α(z1)+b2
- f = σ(z2)
Here, α is a **non-linear activation function**, σ is the softmax function, and the parameters are θ=⟨w1,b1,w2,b2⟩.
The gradient descent algorithm would remain the same, but it would be more
difficult to calculate gradients. Given the chain differentiation rule, we can
calculate derivatives as:
- ∂ℒ/∂w2 = (∂ℒ/∂σ)(∂σ/∂z2)(∂z2/∂w2)
- ∂ℒ/∂w1 = (∂ℒ/∂σ)(∂σ/∂z2)(∂z2/∂α)(∂α/∂z1)(∂z1/∂w1)
✅ The chain differentiation rule is used to calculate derivatives of the
loss function with respect to parameters.
Chain rule and backpropagation for updating the parameters θ:
Note that the left-most part of all those expressions is the same, and thus we can effectively calculate derivatives starting from the loss function and going "backwards" through the computational graph. Thus the method of training a multi-layered perceptron is called **backpropagation**, or 'backprop'.
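To make the chain rule concrete, here is a small numerical sketch of my own (a simplification, not from the original text): scalar parameters, a tanh activation, and a squared-error loss in place of softmax plus cross-entropy; the hand-derived gradient is checked against a finite difference.
# Chain-rule sketch for a tiny 2-layer scalar "network" with squared-error loss.
import numpy as np

x, y = 1.5, 0.3
w1, b1, w2, b2 = 0.4, 0.1, -0.7, 0.2

# forward pass
z1 = w1 * x + b1
a = np.tanh(z1)                      # alpha(z1)
z2 = w2 * a + b2
loss = 0.5 * (z2 - y) ** 2

# backward pass: chain rule applied from the loss backwards
dL_dz2 = z2 - y
dL_dw2 = dL_dz2 * a                  # dL/dw2 = dL/dz2 * dz2/dw2
dL_da  = dL_dz2 * w2                 # dL/da  = dL/dz2 * dz2/da
dL_dz1 = dL_da * (1 - a ** 2)        # da/dz1 = 1 - tanh(z1)^2
dL_dw1 = dL_dz1 * x                  # dL/dw1 = dL/dz1 * dz1/dw1

# finite-difference check of dL/dw1
eps = 1e-6
z1p = (w1 + eps) * x + b1
loss_p = 0.5 * (w2 * np.tanh(z1p) + b2 - y) ** 2
print(dL_dw1, (loss_p - loss) / eps)  # the two numbers should be nearly equal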
3. Single-layer perceptron model
Two outputs of the network correspond to the two classes, and the class with the higher value among the two outputs corresponds to the right solution.
The model is a linear layer followed by softmax, as defined above.
Dependencies:
import matplotlib.pyplot as plt
from matplotlib import gridspec
from sklearn.datasets import make_classification
import numpy as np
# pick the seed for reproducibility - change it to explore the effects of random variations
np.random.seed(0)
import random
<1> Create the dataset; X contains the feature vectors, Y the labels:
n = 100
X, Y = make_classification(n_samples = n, n_features=2,
                           n_redundant=0, n_informative=2, flip_y=0.2)
X = X.astype(np.float32)
Y = Y.astype(np.int32)
# Split into train and test dataset
train_x, test_x = np.split(X, [n*8//10])
train_labels, test_labels = np.split(Y, [n*8//10])
# show a few samples of the dataset
print(train_x[:5])
print(train_labels[:5])
[[-0.836906 -1.382417 ]
[ 3.0352616 -1.1195285]
[ 1.6688806 2.4989042]
[-0.5790065 2.1814067]
[-0.8730455 -1.4692409]]
[0 1 1 1 0]
<2> Forward-pass computation:
class Linear:
    # initialize the weights
    def __init__(self,nin,nout):
        self.W = np.random.normal(0, 1.0/np.sqrt(nin), (nout, nin))
        self.b = np.zeros((1,nout))
    # forward-pass computation
    def forward(self, x):
        return np.dot(x, self.W.T) + self.b

net = Linear(2,2)
net.forward(train_x[0:5])
# outputs for the 5 inputs (row index, output 0, output 1):
0,  1.772021, -0.253845
1,  0.283708, -0.396106
2, -0.300974,  0.305132
3, -0.812048,  0.560794
4, -1.235197,  0.339497
<3> Convert the outputs to probabilities with the softmax function:
class Softmax:
def forward(self,z):
zmax = z.max(axis=1,keepdims=True)
expz = np.exp(z-zmax)
Z = expz.sum(axis=1,keepdims=True)
return expz / Z
softmax = Softmax()
softmax.forward(net.forward(train_x[0:10]))
In case we have more than 2 classes, softmax will normalize probabilities across all of them.
<4> Cross-entropy loss
A loss function in classification is typically a logistic function , which can be generalized as cross-entropy loss. Cross-entropy loss is a function that can calculate similarity between two arbitrary probability distributions.
def cross_ent(prediction, ground_truth):
t = 1 if ground_truth > 0.5 else 0
return -t * np.log(prediction) - (1 - t) * np.log(1 - prediction)
plot_cross_ent()
Cross-entropy loss will again be defined as a separate layer, but its `forward` function will have two input values: the output of the previous layers of the network, p, and the expected class, y.
Usage:
class CrossEntropyLoss:
def forward(self,p,y):
self.p = p
self.y = y
p_of_y = p[np.arange(len(y)), y]
log_prob = np.log(p_of_y)
return -log_prob.mean() # average over all input samples
cross_ent_loss = CrossEntropyLoss()
p = softmax.forward(net.forward(train_x[0:10]))
cross_ent_loss.forward(p,train_labels[0:10])
IMPORTANT: The loss function returns a number that shows how good (or bad) our network performs. It should return one number for the whole dataset, or for a part of the dataset (a minibatch). Thus, after calculating the cross-entropy loss for each individual component of the input vector, we need to average (or add) all the components together, which is done by the call to `.mean()`.
z = net.forward(train_x[0:10])                       # raw network outputs
p = softmax.forward(z)                               # softmax normalization
loss = cross_ent_loss.forward(p,train_labels[0:10])  # cross_ent_loss = CrossEntropyLoss()
print(loss)
<5>Loss Minimization Problem and Network Training:
Mathematical formulation:
The computation uses gradient descent (see the gradient descent section above).
Network training consists of a forward pass and a backward pass (for the principles, see sections 2 and 3.<2> above).
One pass of the network training consists of two parts:
- Forward pass , when we calculate the value of loss function for a given input minibatch
- Backward pass , when we try to minimize this error by distributing it back to the model parameters through the computational graph.
Concrete implementation of the backward pass:
Note that the parameters are updated only after a whole minibatch has been processed, not after each individual sample.
def update(self,lr):
    self.W -= lr*self.dW
    self.b -= lr*self.db
# lr is the learning rate
<6> Summary of the layer classes
class Linear:
    def __init__(self,nin,nout):
self.W = np.random.normal(0, 1.0/np.sqrt(nin), (nout, nin))
self.b = np.zeros((1,nout))
self.dW = np.zeros_like(self.W)
self.db = np.zeros_like(self.b)
def forward(self, x):
self.x=x
return np.dot(x, self.W.T) + self.b
def backward(self, dz):
dx = np.dot(dz, self.W)
dW = np.dot(dz.T, self.x)
db = dz.sum(axis=0)
self.dW = dW
self.db = db
return dx
def update(self,lr):
self.W -= lr*self.dW
self.b -= lr*self.db
class Softmax:
def forward(self,z):
self.z = z
zmax = z.max(axis=1,keepdims=True)
expz = np.exp(z-zmax)
Z = expz.sum(axis=1,keepdims=True)
return expz / Z
def backward(self,dp):
p = self.forward(self.z)
pdp = p * dp
return pdp - p * pdp.sum(axis=1, keepdims=True)
class CrossEntropyLoss:
def forward(self,p,y):
self.p = p
self.y = y
p_of_y = p[np.arange(len(y)), y]
log_prob = np.log(p_of_y)
return -log_prob.mean()
def backward(self,loss):
dlog_softmax = np.zeros_like(self.p)
dlog_softmax[np.arange(len(self.y)), self.y] -= 1.0/len(self.y)
return dlog_softmax / self.p
<7>Training the Model
Now we are ready to write the **training loop**, which will go through our dataset and perform the optimization minibatch by minibatch. One complete pass through the dataset is often called **an epoch**:
lin = Linear(2,2)
softmax = Softmax()
cross_ent_loss = CrossEntropyLoss()
learning_rate = 0.1
pred = np.argmax(lin.forward(train_x),axis=1)
acc = (pred==train_labels).mean()
print("Initial accuracy: ",acc)
batch_size=4
for i in range(0,len(train_x),batch_size):
xb = train_x[i:i+batch_size]
yb = train_labels[i:i+batch_size]
# forward pass
z = lin.forward(xb)
p = softmax.forward(z)
loss = cross_ent_loss.forward(p,yb)
# backward pass
dp = cross_ent_loss.backward(loss)
dz = softmax.backward(dp)
dx = lin.backward(dz)
lin.update(learning_rate)
pred = np.argmax(lin.forward(train_x),axis=1)
acc = (pred==train_labels).mean()
print("Final accuracy: ",acc)
Initial accuracy: 0.2625
Final accuracy: 0.7875
4. The network model
<1> Defining a network class:
Since in many cases a neural network is just a **composition of layers**, we can build a class that allows us to **stack layers together** and **make forward and backward passes** through them without explicitly programming that logic. We will store the list of layers inside the `Net` class, and use the `add()` function to add new layers:
class Net:
    def __init__(self):
self.layers = []
def add(self,l):
self.layers.append(l)
def forward(self,x):
for l in self.layers:
x = l.forward(x)
return x
def backward(self,z):
for l in self.layers[::-1]:
z = l.backward(z)
return z
def update(self,lr):
for l in self.layers:
if 'update' in l.__dir__():
l.update(lr)
Define the network and train it:
net = Net()
net.add(Linear(2,2))
net.add(Softmax())
loss = CrossEntropyLoss()
def get_loss_acc(x,y,loss=CrossEntropyLoss()):
p = net.forward(x)
l = loss.forward(p,y)
pred = np.argmax(p,axis=1)
acc = (pred==y).mean()
return l,acc
print("Initial loss={}, accuracy={}: ".format(*get_loss_acc(train_x,train_labels)))
def train_epoch(net, train_x, train_labels, loss=CrossEntropyLoss(), batch_size=4, lr=0.1):
for i in range(0,len(train_x),batch_size):
xb = train_x[i:i+batch_size]
yb = train_labels[i:i+batch_size]
p = net.forward(xb)
l = loss.forward(p,yb)
dp = loss.backward(l)
dx = net.backward(dp)
net.update(lr)
train_epoch(net,train_x,train_labels)
print("Final loss={}, accuracy={}: ".format(*get_loss_acc(train_x,train_labels)))
print("Test loss={}, accuracy={}: ".format(*get_loss_acc(test_x,test_labels)))
Initial loss=0.8977914474068779, accuracy=0.4625:
Final loss=0.47908832233966514, accuracy=0.825:
Test loss=0.5317198099647931, accuracy=0.8:
<2>Multi-Layered Models
A very important thing to note, however, is that in between linear layers we need to have a **non-linear activation function**, such as tanh. Without such non-linearity, several linear layers would have the same expressive power as just one layer, because a composition of linear functions is also linear!
(Add an activation function between linear layers; a stack of purely linear layers is still linear.)
class Tanh:
def forward(self,x):
y = np.tanh(x)
self.y = y
return y
def backward(self,dy):
return (1.0-self.y**2)*dy
Adding several layers makes sense because, unlike a one-layer network, a multi-layered model will be able to accurately classify sets that are not linearly separable. In other words, a model with several layers will be richer.
It can be demonstrated that with a sufficient number of neurons a two-layered model is capable of classifying any convex set of data points, and a three-layered network can classify virtually any set.
For the general form of a multi-layer network, see section 2 above.
Example of a two-layer network:
net = Net()
net.add(Linear(2,10))
net.add(Tanh())
net.add(Linear(10,2))
net.add(Softmax())
loss = CrossEntropyLoss()
On the difference between a linear model and a complex multi-layered model, and the problem of **overfitting**:
A linear model:
- We are likely to get high training loss - so-called underfitting , when the model does not have enough power to correctly separate all data.
- Validation loss and training loss are more or less the same. The model is likely to generalize well to test data.
Complex multi-layered model
- Low training loss - the model can approximate training data well, because it has enough expressive power.
- Validation loss can be much higher than training loss and can start to increase during training - this is because the model “memorizes” training points, and loses the “overall picture”
Takeaways:
- Simple models (fewer layers, fewer neurons) with low number of parameters (“low capacity”) are less likely to overfit
- More complex models (more layers, more neurons on each layer, high capacity) are likely to overfit. We need to monitor validation error to make sure it does not start to rise with further training
- More complex models need more data to train on.
- You can solve overfitting problem by either:
- simplifying your model
- increasing the amount of training data
- **Bias-variance trade-off** is a term describing the compromise you need to find
  - between the power of the model and the amount of data,
  - between overfitting and underfitting
- There is no single recipe for how many layers or parameters you need - the best way is to experiment (a monitoring sketch follows)
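One way to put the last point into practice with the framework built in this section (reusing the `Net`, `Linear`, `Tanh`, `Softmax`, `train_epoch` and `get_loss_acc` definitions above) is to track training and validation loss per epoch and watch for the validation loss turning upward:
# Monitoring sketch: a validation loss that starts rising while the training loss
# keeps falling is the overfitting signal described above.
net = Net()
net.add(Linear(2,10))
net.add(Tanh())
net.add(Linear(10,2))
net.add(Softmax())

for epoch in range(30):
    train_epoch(net, train_x, train_labels)
    train_loss, train_acc = get_loss_acc(train_x, train_labels)
    val_loss, val_acc = get_loss_acc(test_x, test_labels)
    print("epoch {}: train loss={:.3f}, val loss={:.3f}".format(epoch, train_loss, val_loss))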
5. Putting the code together
###################################################################
# package
# matplotlib nbagg
import matplotlib.pyplot as plt
from matplotlib import gridspec
from sklearn.datasets import make_classification
import numpy as np
# pick the seed for reproducibility - change it to explore the effects of random variations
np.random.seed(0)
import random
###################################################################
# dataset
n = 100
X, Y = make_classification(n_samples = n, n_features=2,
n_redundant=0, n_informative=2, flip_y=0.2)
X = X.astype(np.float32)
Y = Y.astype(np.int32)
# Split into train and test dataset
train_x, test_x = np.split(X, [n*8//10])
train_labels, test_labels = np.split(Y, [n*8//10])
###################################################################
# layers
class Linear:
    def __init__(self,nin,nout):
self.W = np.random.normal(0, 1.0/np.sqrt(nin), (nout, nin))
self.b = np.zeros((1,nout))
self.dW = np.zeros_like(self.W)
self.db = np.zeros_like(self.b)
def forward(self, x):
self.x=x
return np.dot(x, self.W.T) + self.b
def backward(self, dz):
dx = np.dot(dz, self.W)
dW = np.dot(dz.T, self.x)
db = dz.sum(axis=0)
self.dW = dW
self.db = db
return dx
def update(self,lr):
self.W -= lr*self.dW
self.b -= lr*self.db
class Tanh:
def forward(self,x):
y = np.tanh(x)
self.y = y
return y
def backward(self,dy):
return (1.0-self.y**2)*dy
class Softmax:
def forward(self,z):
self.z = z
zmax = z.max(axis=1,keepdims=True)
expz = np.exp(z-zmax)
Z = expz.sum(axis=1,keepdims=True)
return expz / Z
def backward(self,dp):
p = self.forward(self.z)
pdp = p * dp
return pdp - p * pdp.sum(axis=1, keepdims=True)
class CrossEntropyLoss:
def forward(self,p,y):
self.p = p
self.y = y
p_of_y = p[np.arange(len(y)), y]
log_prob = np.log(p_of_y)
return -log_prob.mean()
def backward(self,loss):
dlog_softmax = np.zeros_like(self.p)
dlog_softmax[np.arange(len(self.y)), self.y] -= 1.0/len(self.y)
return dlog_softmax / self.p
###################################################################
# network
class Net:
    def __init__(self):
self.layers = []
def add(self,l):
self.layers.append(l)
def forward(self,x):
for l in self.layers:
x = l.forward(x)
return x
def backward(self,z):
for l in self.layers[::-1]:
z = l.backward(z)
return z
def update(self,lr):
for l in self.layers:
if 'update' in l.__dir__():
l.update(lr)
def get_loss_acc(x,y,loss=CrossEntropyLoss()):
p = net.forward(x)
l = loss.forward(p,y)
pred = np.argmax(p,axis=1)
acc = (pred==y).mean()
return l,acc
def train_epoch(net, train_x, train_labels, loss=CrossEntropyLoss(), batch_size=4, lr=0.1):
for i in range(0,len(train_x),batch_size):
xb = train_x[i:i+batch_size]
yb = train_labels[i:i+batch_size]
p = net.forward(xb)
l = loss.forward(p,yb)
dp = loss.backward(l)
dx = net.backward(dp)
net.update(lr)
print("epoch={}: ".format(i),end="")
print("Final loss={}, accuracy={}: ".format(*get_loss_acc(train_x,train_labels)))
print("Test loss={}, accuracy={}: ".format(*get_loss_acc(test_x,test_labels)))
###################################################################
# main
net = Net()
net.add(Linear(2,10))
net.add(Tanh())
net.add(Linear(10,2))
net.add(Softmax())
train_epoch(net,train_x,train_labels)
6. A 3-layer network for MNIST handwritten digit recognition
Train the model and save the result:
###################################################################
# packages
import matplotlib.pyplot as plt
from matplotlib import gridspec
from sklearn.datasets import make_classification
import numpy as np
# pick the seed for reproducibility - change it to explore the effects of random variations
np.random.seed(0)
import random
###################################################################
# dataset
n=70000
# generate data
# X, Y = make_classification(n_samples = n, n_features=2828,n_redundant=0, n_informative=88, flip_y=0.2)
# get data from mnist
from torchvision import datasets, transforms
mnist_train = datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor())
X = mnist_train.data.numpy()
Y = mnist_train.targets.numpy()
X = X.reshape(X.shape[0],-1)
X = X.astype(np.float32)
Y = Y.astype(np.int32)
# Split into train and test dataset
train_x, test_x = np.split(X, [n*8//10]) # 80% training and 20% test
train_labels, test_labels = np.split(Y, [n*8//10])
###################################################################
# layers
class Linear:
    def __init__(self,nin,nout):
self.W = np.random.normal(0, 1.0/np.sqrt(nin), (nout, nin))
self.b = np.zeros((1,nout))
self.dW = np.zeros_like(self.W)
self.db = np.zeros_like(self.b)
def forward(self, x):
self.x=x
return np.dot(x, self.W.T) + self.b
def backward(self, dz):
dx = np.dot(dz, self.W)
dW = np.dot(dz.T, self.x)
db = dz.sum(axis=0)
self.dW = dW
self.db = db
return dx
def update(self,lr):
self.W -= lr*self.dW
self.b -= lr*self.db
class Tanh:
def forward(self,x):
y = np.tanh(x)
self.y = y
return y
def backward(self,dy):
return (1.0-self.y**2)*dy
class Softmax:
def forward(self,z):
self.z = z
zmax = z.max(axis=1,keepdims=True)
expz = np.exp(z-zmax)
Z = expz.sum(axis=1,keepdims=True)
return expz / Z
def backward(self,dp):
p = self.forward(self.z)
pdp = p * dp
return pdp - p * pdp.sum(axis=1, keepdims=True)
class CrossEntropyLoss:
def forward(self,p,y):
self.p = p
self.y = y
p_of_y = p[np.arange(len(y)), y]
log_prob = np.log(p_of_y)
return -log_prob.mean()
def backward(self,loss):
dlog_softmax = np.zeros_like(self.p)
dlog_softmax[np.arange(len(self.y)), self.y] -= 1.0/len(self.y)
return dlog_softmax / self.p
###################################################################
# network
class Net:
    def __init__(self):
self.layers = []
def add(self,l):
self.layers.append(l)
def forward(self,x):
for l in self.layers:
x = l.forward(x)
return x
def backward(self,z):
for l in self.layers[::-1]:
z = l.backward(z)
return z
def update(self,lr):
for l in self.layers:
if 'update' in l.__dir__():
l.update(lr)
def get_loss_acc(x,y,loss=CrossEntropyLoss()):
p = net.forward(x)
l = loss.forward(p,y)
pred = np.argmax(p,axis=1)
acc = (pred==y).mean()
return l,acc
def train_epoch(net, train_x, train_labels, loss=CrossEntropyLoss(), batch_size=4, lr=0.1):
for i in range(0,len(train_x),batch_size):
xb = train_x[i:i+batch_size]
yb = train_labels[i:i+batch_size]
p = net.forward(xb)
l = loss.forward(p,yb)
dp = loss.backward(l)
dx = net.backward(dp)
net.update(lr)
print("epoch={}: ".format(i//batch_size))
print("Final loss={}, accuracy={}: ".format(*get_loss_acc(train_x,train_labels)))
print("Test loss={}, accuracy={}: ".format(*get_loss_acc(test_x,test_labels)))
###################################################################
# main
if __name__ == '__main__':
# model
net = Net()
net.add(Linear(28*28,300))
net.add(Tanh())
net.add(Linear(300,10))
net.add(Softmax())
train_epoch(net,train_x,train_labels,batch_size=1000)
#save the model
import pickle
with open('model.pkl', 'wb') as f:
pickle.dump(net, f)
Load the model and test it:
import OwnFramework
import torchvision
import numpy as np
import pickle
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import random
# import the model
with open('model.pkl', 'rb') as f:
    OwnFramework.net = pickle.load(f)
# test data from MNIST
test_data = torchvision.datasets.MNIST('./data', train=False, download=False)
test_x = test_data.data.numpy().reshape(-1,28*28)
test_labels = test_data.targets.numpy()
# test the model
print("Test loss={}, accuracy={}: ".format(*OwnFramework.get_loss_acc(test_x,test_labels)))
# show the images and the predictions
fig=plt.figure(figsize=(8, 8))
gs = gridspec.GridSpec(4, 4)
for i in range(16):
    j=random.randint(0,len(test_x)-1)
ax = plt.subplot(gs[i])
ax.imshow(test_x[j].reshape(28,28))
ax.set_title("Predicted: {}".format(np.argmax(OwnFramework.net.forward(test_x[j:j+1]))))
ax.axis('off')
plt.show()
# show the images that were not predicted correctly
fig=plt.figure(figsize=(12, 8))
gs = gridspec.GridSpec(4, 4)
i=0
for j in range(len(test_x)):
if np.argmax(OwnFramework.net.forward(test_x[j:j+1])) != test_labels[j]:
ax = plt.subplot(gs[i])
ax.imshow(test_x[j].reshape(28,28))
ax.set_title("Predicted: {}, True: {}".format(np.argmax(OwnFramework.net.forward(test_x[j:j+1])),test_labels[j]))
ax.axis('off')
i+=1
if i==16:
break
plt.show()
VI. Neural Network Frameworks
Framework APIs:
To be able to train neural networks efficiently we need to do two things:
- operate on **tensors**, e.g. multiply, add, and compute functions such as sigmoid or softmax
- compute **gradients** of all expressions, in order to perform gradient descent optimization
While the `numpy` library can do the first part, we need some mechanism to compute gradients. In the framework that we developed in the previous section we had to manually program all derivative functions inside the `backward` method, which does backpropagation. Ideally, a framework should give us the opportunity to compute gradients of **any expression** that we can define. Another important thing is to be able to perform computations on **GPU**, or any other specialized compute units, such as [TPU](https://en.wikipedia.org/wiki/Tensor_Processing_Unit "TPU"). Deep neural network training requires _a lot_ of computations, and being able to parallelize those computations on GPUs is very important.
Low-level and high-level APIs:
Currently, the two most popular neural frameworks are **[TensorFlow](http://tensorflow.org/ "TensorFlow")** and **[PyTorch](https://pytorch.org/ "PyTorch")**. Both provide a **low-level API** to operate with **tensors on both CPU and GPU**. On top of the low-level API there is also a **higher-level API**, called **[Keras](https://keras.io/ "Keras")** and **[PyTorch Lightning](https://pytorchlightning.ai/ "PyTorch Lightning")** respectively.
| Low-level API | TensorFlow | PyTorch |
|---|---|---|
| High-level API | Keras | PyTorch Lightning |
Low-level APIs in both frameworks allow you to build so-called
computational graphs. This graph defines how to compute the output
(usually the loss function) with given input parameters , and can be
pushed for computation on GPU , if it is available. There are functions to
differentiate this computational graph and compute gradients, which can then
be used for optimizing model parameters.
High-level APIs pretty much consider neural networks as a **sequence of layers**, and make constructing most neural networks much easier. Training the model usually requires preparing the data and then calling a `fit` function to do the job.
The high-level API allows you to construct typical neural networks **very quickly without worrying about lots of details**. At the same time, low-level API offer much more control over the training process, and thus they are **used a lot in research** , when you are dealing with **new neural network architectures.**
It is also important to understand that you can **use both APIs together**, e.g. you can develop your own network layer architecture using the low-level API, and then use it inside a larger network constructed and trained with the high-level API. Or you can define a network using the high-level API as a sequence of layers, and then use your own low-level training loop to perform optimization. Both APIs use the same basic underlying concepts, and they are designed to work well together, as sketched below.
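As a rough illustration of mixing the two levels (my own example, not from the original text): the model below is declared with the high-level Keras Sequential API, while the single training step is written by hand with `tf.GradientTape`; the minibatch here is just a random placeholder.
import tensorflow as tf
from tensorflow import keras

# high-level API: declare the network as a sequence of layers
model = keras.Sequential([
    keras.layers.Dense(10, activation='relu', input_shape=(2,)),
    keras.layers.Dense(2, activation='softmax'),
])
loss_fn = keras.losses.SparseCategoricalCrossentropy()
optimizer = keras.optimizers.SGD(learning_rate=0.1)

# low-level API: a hand-written training step on one (placeholder) minibatch
x = tf.random.normal((32, 2))
y = tf.random.uniform((32,), maxval=2, dtype=tf.int32)
with tf.GradientTape() as tape:
    p = model(x, training=True)
    loss = loss_fn(y, p)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
print(float(loss))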
How to detect overfitting:
Overfitting can be detected by a very low training error combined with a high validation error. Normally during training we will see both training and validation errors starting to decrease; then, at some point, the validation error might stop decreasing and start rising. This is a sign of overfitting, and an indicator that we should probably **stop training at this point** (or at least **make a snapshot of the model**); a possible Keras sketch follows.
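In Keras (introduced in the next subsection) this can be automated with callbacks. A possible sketch, assuming a compiled `model` and the `train_x_norm`/`test_x_norm` splits that are defined later in this chapter:
from tensorflow import keras

# Stop when validation loss stops improving and keep a snapshot of the best weights seen so far.
callbacks = [
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    keras.callbacks.ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True),
]
model.fit(x=train_x_norm, y=train_labels,
          validation_data=(test_x_norm, test_labels),
          epochs=50, callbacks=callbacks)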
1.Keras
Keras is **a part of Tensorflow 2.x framework**. Let’s make sure we have version 2.x.x of Tensorflow installed:
# packages
import tensorflow as tf
from tensorflow import keras
import numpy as np
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
print(f'Tensorflow version = {tf.__version__}')
# data prepare
np.random.seed(0) # pick the seed for reproducibility - change it to explore the effects of random variations
n = 100
X, Y = make_classification(n_samples = n, n_features=2,
n_redundant=0, n_informative=2, flip_y=0.05,class_sep=1.5)
X = X.astype(np.float32)
Y = Y.astype(np.int32)
split = [ 70*n//100 ]
train_x, test_x = np.split(X, split)
train_labels, test_labels = np.split(Y, split)
The concept of a tensor (a multi-dimensional array):
Tensor is a multi-dimensional array. It is very convenient to use
tensors to represent different types of data:
- 400x400 - black-and-white picture
- 400x400x3 - color picture
- 16x400x400x3 - minibatch of 16 color pictures
- 25x400x400x3 - one second of 25-fps video
- 8x25x400x400x3 - minibatch of 8 1-second videos
Tensors give us a convenient way to represent input/output data, as well as the weights inside the neural network; a few shape examples are shown below.
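As a small illustration of these shapes (zero-filled placeholders, not real data):
import tensorflow as tf  # already imported above

bw_image    = tf.zeros((400, 400))             # black-and-white picture
rgb_image   = tf.zeros((400, 400, 3))          # color picture
image_batch = tf.zeros((16, 400, 400, 3))      # minibatch of 16 color pictures
video_1s    = tf.zeros((25, 400, 400, 3))      # one second of 25-fps video
video_batch = tf.zeros((8, 25, 400, 400, 3))   # minibatch of 8 one-second videos
print(video_batch.shape)                       # (8, 25, 400, 400, 3)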
Normalizing the data (keeping the values flowing through the network in a bounded range):
Before training, it is common to bring our input features to the standard range of [0,1] (or [-1,1]). The exact reasons for that we will discuss later in the course, but in short the reason is the following: we want to avoid values that flow through our network getting too big or too small, and we normally agree to keep all values in a small range close to 0. Thus we initialize the weights with small random numbers, and we keep the signals in the same range.
train_x_norm = (train_x-np.min(train_x,axis=0)) / (np.max(train_x,axis=0)-np.min(train_x,axis=0))
test_x_norm = (test_x-np.min(train_x,axis=0)) / (np.max(train_x,axis=0)-np.min(train_x,axis=0))
<1>Training One-Layer Network (Perceptron)
① Model definition
In many cases, a neural network will be a sequence of layers. It can be defined in Keras using the **Sequential** model in the following manner:
model = keras.models.Sequential()
model.add(keras.Input(shape=(2,)))
model.add(keras.layers.Dense(1))
model.add(keras.layers.Activation(keras.activations.sigmoid))
model.summary()
# Alternatively, the input size and the activation function can be specified directly in the Dense layer for brevity:
model = keras.models.Sequential()
model.add(keras.layers.Dense(1,input_shape=(2,),activation='sigmoid'))
model.summary()
Notes:
Here, we first create the model, and then add layers to it:
- The first `Input` layer (which is not strictly speaking a layer) contains the specification of the network's input size
- The `Dense` layer is the actual perceptron that contains the trainable weights
- Finally, there is a layer with the **sigmoid `Activation` function** to bring the result of the network into the 0-1 range (to make it a probability)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense (Dense)               (None, 1)                 3
 activation (Activation)     (None, 1)                 0
=================================================================
Total params: 3 (12.00 Byte)
Trainable params: 3 (12.00 Byte)
Non-trainable params: 0 (0.00 Byte)
② Compiling the model (specifying the loss function, the optimizer and the metrics)
Before training the model, we need to **compile it**, which essentially means specifying:
- The **loss function**, which defines how the loss is calculated. Because we have a two-class classification problem, we will use **binary cross-entropy** loss.
- The **optimizer** to use. The simplest option would be `sgd` for **stochastic gradient descent**, or you can use more sophisticated optimizers such as `adam`.
- The **metrics** that we want to use to measure the success of our training. Since it is a classification task, a good metric would be `Accuracy` (or `acc` for short)

We can specify the loss, metrics and optimizer either as **strings**, or by providing objects from the Keras framework. In our example, we need to specify the `learning_rate` parameter to fine-tune the learning speed of our model, and thus we provide the full Keras SGD optimizer object.

model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.2),loss='binary_crossentropy',metrics=['acc'])
③ Training
After compiling the model, we can do the actual training by calling the `fit` method. The most important parameters are:
- `x` and `y` specify the training data: features and labels respectively
- If we want validation to be performed on each epoch, we can specify the `validation_data` parameter, which would be a tuple of features and labels
- `epochs` specifies the number of epochs
- If we want training to happen in minibatches, we can specify the `batch_size` parameter. You can also pre-batch the data manually before passing it to `x`/`y`/`validation_data`, in which case you do not need `batch_size`

model.fit(x=train_x_norm,y=train_labels,validation_data=(test_x_norm,test_labels),epochs=10,batch_size=1)
Note that you can call the `fit` function several times in a row to further train the network. If you want to start training from scratch, you need to re-run the cell with the model definition.
(Training is cumulative; to train from scratch, re-define the network.)
<2> Multi-Class Classification
If you need to solve a multi-class classification problem, your network will have more than one output, corresponding to the number of classes. **Each output will contain the probability of a given class.**
When we expect a network to output a set of probabilities, we need all of them to add up to 1. To ensure this, we use **softmax** as the final activation function on the last layer. Softmax takes a vector input and makes sure that all components of that vector are transformed into probabilities (i.e. they sum to 1).
Also, since the output of the network is a C-dimensional vector, we need the labels to have the same form. This can be achieved by using **one-hot encoding**, where class number i is converted to a vector of zeros with a 1 at the i-th position.
To compare the probability output of the neural network with the expected one-hot-encoded label, we use the **cross-entropy loss** function. It takes two probability distributions and outputs a value showing how different they are.
So, to summarize what we need to do for multi-class classification with C classes:
- The network should have C neurons in the last layer
- The last activation function should be softmax
- The loss should be cross-entropy loss
- Labels should be converted to one-hot encoding (this can be done using `numpy`, or using the Keras util `to_categorical`)

model = keras.models.Sequential([
    keras.layers.Dense(5,input_shape=(2,),activation='relu'),
    keras.layers.Dense(2,activation='softmax')
])
model.compile(keras.optimizers.Adam(0.01),'categorical_crossentropy',['acc'])

# Two ways to convert to one-hot encoding
train_labels_onehot = keras.utils.to_categorical(train_labels)
test_labels_onehot = np.eye(2)[test_labels]

hist = model.fit(x=train_x_norm,y=train_labels_onehot,validation_data=[test_x_norm,test_labels_onehot],batch_size=1,epochs=10)
Sparse categorical cross-entropy (integer labels instead of one-hot labels)
Often labels in multi-class classification are represented by class numbers. Keras also supports another kind of loss function called **sparse categorical crossentropy**, which expects the class number as an integer rather than a one-hot vector. Using this kind of loss function, we can simplify our training code:
model.compile(keras.optimizers.Adam(0.01),'sparse_categorical_crossentropy',['acc'])
model.fit(x=train_x_norm,y=train_labels,validation_data=[test_x_norm,test_labels],batch_size=1,epochs=10)
<3> Multi-Label Classification
With multi-label classification, instead of a one-hot encoded vector we have a label vector with a 1 in every position corresponding to a class that is relevant to the input sample. Thus, the output of the network should not contain probabilities normalized across all classes, but rather an independent probability for each class, which corresponds to using the **sigmoid** activation function on the last layer. Cross-entropy loss can still be used as the loss function; a minimal sketch follows.
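A minimal Keras sketch of a multi-label head (my own illustration with placeholder shapes and random data, not from the original lesson); binary cross-entropy applied per class is a common pairing with a sigmoid output:
import numpy as np
from tensorflow import keras

num_classes = 5
model = keras.models.Sequential([
    keras.layers.Dense(16, input_shape=(10,), activation='relu'),
    keras.layers.Dense(num_classes, activation='sigmoid'),   # independent per-class probabilities
])
model.compile(keras.optimizers.Adam(0.01), 'binary_crossentropy', ['accuracy'])

# multi-hot labels: a 1 in every position corresponding to a class present in the sample
x = np.random.randn(32, 10).astype(np.float32)
y = (np.random.rand(32, num_classes) > 0.7).astype(np.float32)
model.fit(x, y, epochs=2, batch_size=8)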
<4> Summary of Classification Loss Functions
We have seen that binary, multi-class and multi-label classification **differ by the type of loss function and activation function on the last layer** of the network. It may all be a little bit confusing if you are just starting to learn, but here are a few rules to keep in mind:
- If the network has one output (binary classification), we use sigmoid activation function , for multiclass classification - softmax
- If the output class is represented as one-hot-encoding, the loss function will be cross entropy loss (categorical cross-entropy), if the output contains class number - sparse categorical cross-entropy. For binary classification - use binary cross-entropy (same as log loss)
- Multi-label classification is when we can have an object belonging to several classes at the same time. In this case, we need to encode labels using one-hot encoding, and use sigmoid as activation function, so that each class probability is between 0 and 1.
Classification | Label Format | Activation Function | Loss |
---|---|---|---|
Binary | Probability of 1st class | sigmoid | binary crossentropy |
Binary | One-hot encoding (2 outputs) | softmax | categorical crossentropy |
Multiclass | One-hot encoding | softmax | categorical crossentropy |
Multiclass | Class Number | softmax | sparse categorical crossentropy |
Multilabel | One-hot encoding | sigmoid | categorical crossentropy |
2. TensorFlow 2.x + Keras
TensorFlow 2.x + Keras is the new version of TensorFlow with integrated Keras functionality. It supports a **dynamic computation graph**, allowing us to perform tensor operations very similar to numpy (and PyTorch).
import tensorflow as tf
import numpy as np
print(tf.__version__)
<1> Simple tensor operations
① Creating tensors
You can easily create simple tensors from lists of np-arrays, or generate
random ones
# create a constant tensor
a = tf.constant([[1,2],[3,4]])
print(a)
# create a random 10x3 tensor drawn from a normal distribution
a = tf.random.normal(shape=(10,3))
print(a)
② Operations on tensors
You can use arithmetic operations on tensors; they are performed element-wise, as in numpy. Tensors are automatically broadcast to the required dimension if needed. To extract a numpy array from a tensor, use `.numpy()`. A couple of example operations:
print(a-a[0])
print(tf.exp(a)[0].numpy())
<2> Computing gradients
For backpropagation you need to compute gradients. This is done using the `tf.GradientTape()` idiom:
- Add a `with tf.GradientTape() as tape:` block around our computations
- Mark the tensors with respect to which we need to compute gradients by calling `tape.watch` (all variables are watched automatically)
- Compute whatever we need (build the computational graph)
- Obtain the gradients using `tape.gradient`

a = tf.random.normal(shape=(2, 2))
b = tf.random.normal(shape=(2, 2))
with tf.GradientTape() as tape:
    tape.watch(a)  # Start recording the history of operations applied to `a`
    c = tf.sqrt(tf.square(a) + tf.square(b))  # Do some math using `a`
# What's the gradient of `c` with respect to `a`?
dc_da = tape.gradient(c, a)
print(dc_da)

In short: watch the variables, build the computation, then compute the gradients.
<3> Example 1: linear regression
Generate the dataset:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
import random
np.random.seed(13) # pick the seed for reproducibility - change it to explore the effects of random variations
train_x = np.linspace(0, 3, 120)
train_labels = 2 * train_x + 0.9 + np.random.randn(*train_x.shape) * 0.5
plt.scatter(train_x,train_labels)
Define the model and the loss function:
input_dim = 1
output_dim = 1
learning_rate = 0.1
# This is our weight matrix
w = tf.Variable([[100.0]])
# This is our bias vector
b = tf.Variable(tf.zeros(shape=(output_dim,)))
def f(x):
return tf.matmul(x,w) + b
def compute_loss(labels, predictions):
return tf.reduce_mean(tf.square(labels - predictions))
The training function (one gradient step per minibatch):
def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        predictions = f(x)
        loss = compute_loss(y, predictions)
        # Note that `tape.gradient` works with a list of variables ([w, b]) as well.
        dloss_dw, dloss_db = tape.gradient(loss, [w, b])
    w.assign_sub(learning_rate * dloss_dw)
    b.assign_sub(learning_rate * dloss_db)
    return loss
Prepare the training data:
# Shuffle the data.
indices = np.random.permutation(len(train_x))
features = tf.constant(train_x[indices],dtype=tf.float32)
labels = tf.constant(train_labels[indices],dtype=tf.float32)
The training loop (samples i to i+batch_size form one minibatch):
batch_size = 4
for epoch in range(10):
for i in range(0,len(features),batch_size):
loss = train_on_batch(tf.reshape(features[i:i+batch_size],(-1,1)),tf.reshape(labels[i:i+batch_size],(-1,1)))
    print('Epoch %d: last batch loss = %.4f' % (epoch, float(loss)))
Plot the result:
plt.scatter(train_x,train_labels)
x = np.array([min(train_x),max(train_x)])
y = w.numpy()[0,0]*x+b.numpy()[0]
plt.plot(x,y,color='red')
We have now obtained the optimized parameters W and b. Note that their values are close to the values used when generating the dataset (W=2, b=0.9).
This article is reposted from https://blog.csdn.net/qq_32971095/article/details/137124492; if there is any infringement, please contact the author for removal.