Multilayer Perceptrons: Strengths, Weaknesses, and Worked Examples

Reposted

mob64ca1414098d 2024-04-03 08:52:29


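I. Linear Regression

1. Linear model

(1) Given an n-dimensional input

$X = [x_1, x_2, \dots, x_n]^{T}$

(2) Define an n-dimensional weight vector and a scalar bias

$W = [w_1, w_2, \dots, w_n]^{T}, \quad b$

(3) The output is the weighted sum of the inputs plus the bias

$y = \sum_{i=1}^{n}w_ix_i+b$

that is,

$y = \langle W,X \rangle + b$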

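(4) Structure: the linear model can be viewed as a single-layer neural network in which every input is connected directly to the single output.

(5) Measuring prediction quality: compare the true value $y$ with the predicted value $\hat{y}$, usually with the squared loss $\frac{1}{2}(y-\hat{y})^{2}$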

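(6) Training data: n samples $X_i$ with labels $y_i$, collected to fit the parameters

(7) Parameter learning

The training loss $\ell$ is the average squared loss over the n samples:

$\ell (X,Y,W,b)=\frac{1}{2n}\sum_{i=1}^{n}(y_i-\langle X_i,W \rangle-b)^{2}=\frac{1}{2n}\left \| y-XW-b \right \|^{2}$

The parameters are learned by minimizing this loss:

$W^{*},b^{*}=\arg\min_{W,b}\ell (X,Y,W,b)$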

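2. Optimization methods

(1) Gradient descent: update the parameters along the negative gradient direction

Pick an initial value $w_{0}$ and repeat, for t = 1, 2, 3, ...,

$w_{t} = w_{t-1}-\eta \frac{\partial \ell }{\partial w_{t-1}}$

Here $\ell$ is the training loss. The gradient points in the direction in which the loss increases fastest, so stepping against it decreases the loss. $\eta$ is the learning rate, i.e. the step size: a value that is too large makes the updates overshoot, and a value that is too small makes training slow.

(2) Minibatch stochastic gradient descent

Computing the gradient over the whole training set at every step costs too much time and compute. Instead, randomly sample a small batch of examples at each step and use the average loss on that batch to approximate the gradient and update the parameters.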
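The update rule above can be sketched without any framework. The following is a minimal illustration, assuming noise-free synthetic data and hypothetical names (`minibatch_sgd`, `true_w`, `true_b`) that do not come from the original notes; the learning rate and batch size are arbitrary choices, not recommendations.

```python
import random

def minibatch_sgd(features, labels, lr=0.1, batch_size=4, num_epochs=200, seed=0):
    """Fit y = <w, x> + b by minibatch SGD on the squared loss (1/2)(y_hat - y)^2."""
    rng = random.Random(seed)
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    indices = list(range(n))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            grad_w, grad_b = [0.0] * dim, 0.0
            for i in batch:
                x, y = features[i], labels[i]
                err = sum(wj * xj for wj, xj in zip(w, x)) + b - y
                for j in range(dim):
                    grad_w[j] += err * x[j] / len(batch)  # batch-averaged gradient
                grad_b += err / len(batch)
            # step against the gradient
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b

# Synthetic data from a known linear model (assumed values, for illustration only)
data_rng = random.Random(1)
true_w, true_b = [2.0, -3.4], 4.2
features = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(64)]
labels = [sum(wj * xj for wj, xj in zip(true_w, x)) + true_b for x in features]

w, b = minibatch_sgd(features, labels)
print(w, b)
```

Because the data contain no noise, the learned `w` and `b` should end up very close to `true_w` and `true_b`; with noisy data the estimates would only be approximately equal.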

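II. Softmax Regression: Classification with Multiple Inputs and Multiple Outputs

(1) Loss: the loss should reflect how confidently the model gives the true class i a higher score than the other classes.

(2) Procedure: encode each label as a one-hot vector y. A first attempt is to train with the squared loss and take the largest output as the prediction,

$\hat{y}=\arg\max_{i} o_{i}$

To make the outputs match probabilities (non-negative and summing to 1), apply softmax to the outputs o:

$\hat{y}_{i} = \frac{\exp(o_{i})}{\sum_k \exp(o_{k})}$

The loss is then the difference between the true probabilities y and the predicted probabilities $\hat{y}$.

(3) Cross-entropy is the usual choice for measuring this difference:

$\ell(y,\hat{y}) = -\sum_{i} y_{i}\log \hat{y}_{i} = -\log \hat{y}_{c}$

where c is the index of the true class.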
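The softmax and cross-entropy computations can be sketched in a few lines. This is a minimal illustration with made-up scores; the function names are hypothetical, and subtracting the maximum score is a common stabilization trick that the notes do not mention.

```python
import math

def softmax(logits):
    """Turn raw scores o into probabilities; subtracting the max avoids overflow in exp."""
    m = max(logits)
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """With a one-hot target, the loss reduces to -log of the true class probability."""
    return -math.log(probs[label])

logits = [1.0, 2.0, 0.5]           # example scores for three classes
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
loss = cross_entropy(probs, label=1)
print(probs, predicted, loss)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1, and grows without bound as that probability approaches 0.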

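III. Multilayer Perceptron

(1) A perceptron is a linear model followed by a threshold activation function. Its training rule updates on one sample at a time, like gradient descent with a batch size of 1. Unlike linear regression, it solves a binary classification problem: the output is a class label rather than a real value.

(2) Convergence theorem: suppose all data lie within a ball of radius r, and there exist w and b with

$\left | \left | w \right | \right |^{2}+b^{2}\leq 1$

that classify every sample correctly with margin $\rho > 0$,

$y(x^{T}w+b)\geq \rho$

Then the perceptron is guaranteed to converge after at most

$\frac{r^{2}+1}{\rho ^{2}}$

steps.

(3) A single perceptron cannot fit the XOR function, because its decision boundary is linear. A multilayer perceptron solves this by adding hidden layers that extract further features. The activation function after each layer makes the model nonlinear; without it, stacked linear layers collapse into a single linear layer.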
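A small experiment makes the XOR limitation concrete. The sketch below assumes labels in {-1, +1} and uses hypothetical helper names; the hidden-layer weights in `xor_mlp` are picked by hand for illustration and are not learned.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(samples, labels, max_epochs=100):
    """Classic perceptron rule with labels in {-1, +1}; returns (w, b, converged)."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if y * (dot(w, x) + b) <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: converged
            return w, b, True
    return w, b, False

def relu(z):
    return max(0.0, z)

def xor_mlp(x1, x2):
    """Two hidden ReLU units with hand-picked weights, then a linear output layer."""
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    return h1 - 2.0 * h2

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]   # linearly separable
xor_labels = [-1, 1, 1, -1]    # not linearly separable

_, _, and_converged = train_perceptron(inputs, and_labels)
_, _, xor_converged = train_perceptron(inputs, xor_labels)
print(and_converged, xor_converged)
print([xor_mlp(a, c) for a, c in inputs])
```

The perceptron converges on AND but never completes an error-free pass on XOR, while the two-unit hidden layer reproduces XOR exactly on the four inputs.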

import torch
from torch import nn
from d2l import torch as d2l

# MLP with a single hidden layer, implemented from scratch
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
num_inputs, num_outputs, num_hiddens = 784, 10, 256
# Scale the random initial weights by 0.01 so that early activations stay small
W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens, requires_grad=True) * 0.01)  # first layer
b1 = nn.Parameter(torch.zeros(num_hiddens, requires_grad=True))
W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs, requires_grad=True) * 0.01)  # second layer
b2 = nn.Parameter(torch.zeros(num_outputs, requires_grad=True))
params = [W1, b1, W2, b2]

def relu(X):  # hand-written ReLU activation
    a = torch.zeros_like(X)
    return torch.max(X, a)

def net(X):  # hand-written forward pass
    X = X.reshape(-1, num_inputs)  # -1 lets reshape infer the batch dimension
    H = relu(X @ W1 + b1)
    return H @ W2 + b2

loss = nn.CrossEntropyLoss()
num_epochs, lr = 100, 0.1
updater = torch.optim.SGD(params, lr=lr)  # optimizer over the hand-made parameters
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)

# Concise implementation
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)  # std=0 would set every weight to zero

net.apply(init_weights)
# The new network needs its own optimizer; the earlier one only updates W1, b1, W2, b2
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

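IV. Model Selection, Overfitting, and Underfitting

(1) Key concepts

Training error: the model's error on the training dataset.

Generalization error: the model's error on new data.

Validation dataset: data used to evaluate the model during model selection. It is held out from the training data and is not used to fit the parameters.

Test dataset: new data that is used only once.

k-fold cross-validation: when data is scarce, split the training data into k equal parts and, in each of k rounds, use one part as the validation dataset and the rest for training.

Overfitting: the model fits the training data too closely. It memorizes every feature of the data, including the noise it should have filtered out.

Underfitting: the model struggles to fit even the training data.

Data complexity: the number of samples, the number of elements per sample, temporal and spatial structure, and diversity (number of classes).

(2) The model capacity has to match the data complexity. Compare the training error and the generalization error to judge whether the model is suitable: both high suggests underfitting, while a low training error with a much higher generalization error suggests overfitting.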
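The k-fold split described above can be sketched as an index generator. This is a minimal illustration with a hypothetical function name; it uses contiguous folds, and in practice the indices would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train_idx, valid_idx) per fold."""
    # Spread the remainder over the first n % k folds so sizes differ by at most 1
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, valid
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, valid_idx in folds:
    print(len(train_idx), valid_idx)
```

Each sample lands in the validation fold exactly once, so averaging the k validation errors uses all of the data for validation without ever validating on a sample the model was trained on in that round.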

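V. Weight Decay

1. Control model capacity by restricting the range of the parameter values

(1) Hard constraint on the squared L2 norm

$\min \ell (w,b) \quad \text{subject to} \quad \left \| w \right \|^{2}\leq \theta$

Here $\ell$ is the training loss. A smaller $\theta$ means stronger regularization.

(2) Soft constraint on the squared L2 norm

$\min \ell (w,b)+\frac{\lambda }{2}\left \| w \right \|^{2}$

The hyperparameter $\lambda$ controls the strength of the penalty: $\lambda = 0$ has no effect, and as $\lambda \to \infty$ the optimal w goes to 0.

2. Gradient computation

$\frac{\partial }{\partial w}\left ( \ell (w,b)+\frac{\lambda }{2}\left \| w \right \|^{2} \right )=\frac{\partial \ell (w,b)}{\partial w}+\lambda w$

so with learning rate $\eta$ the update at step t is

$w_{t+1}=(1-\eta \lambda )w_{t}-\eta \frac{\partial \ell (w_{t},b_{t})}{\partial w_{t}}$

With $\eta \lambda < 1$, every step first shrinks the weights and then applies the usual gradient step, hence the name weight decay.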
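For the soft-constrained objective, the penalty $\frac{\lambda}{2}\|w\|^{2}$ contributes $\lambda w$ to the gradient, so a gradient step can equivalently be written as shrinking w by the factor $(1-\eta\lambda)$ before applying the loss gradient. The sketch below checks that the two forms agree numerically; the function names and the example numbers are made up for illustration.

```python
def sgd_step_with_penalty(w, grad_loss, lr, lambd):
    """Gradient step on loss + (lambd/2)*||w||^2: the penalty adds lambd*w to the gradient."""
    return [wi - lr * (gi + lambd * wi) for wi, gi in zip(w, grad_loss)]

def sgd_step_decay_form(w, grad_loss, lr, lambd):
    """Same step rewritten: shrink w by (1 - lr*lambd), then apply the loss gradient."""
    return [(1 - lr * lambd) * wi - lr * gi for wi, gi in zip(w, grad_loss)]

w = [1.0, -2.0]
grad_loss = [0.5, 0.25]   # stand-in for the gradient of the data loss at w
step_a = sgd_step_with_penalty(w, grad_loss, lr=0.1, lambd=3.0)
step_b = sgd_step_decay_form(w, grad_loss, lr=0.1, lambd=3.0)
print(step_a, step_b)
```

Both forms give the same new weights up to floating-point rounding, which is why optimizers can expose the penalty as a single `weight_decay` setting.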

3. Implementation

%matplotlib inline
import torch
from torch import nn
from d2l import torch as d2l

# Generate data: few samples and many features, so overfitting is easy to provoke
n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
true_w, true_b = torch.ones((num_inputs, 1)) * 0.01, 0.05
train_data = d2l.synthetic_data(true_w, true_b, n_train)  # training set
train_iter = d2l.load_array(train_data, batch_size)  # shuffled minibatches of size batch_size
test_data = d2l.synthetic_data(true_w, true_b, n_test)  # test set
test_iter = d2l.load_array(test_data, batch_size, is_train=False)  # minibatches in fixed order, no shuffling

# Initialize the model parameters
def init_params():
    w = torch.normal(0, 1, size=(num_inputs, 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    return [w, b]

# L2 norm penalty
def l2_penalty(w):
    return torch.sum(w.pow(2)) / 2

# Training function
def train(lambd):
    w, b = init_params()
    # the lambda wraps a single linear layer around w and b
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    num_epochs, lr = 100, 0.003  # same learning rate as the concise version below
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[5, num_epochs], legend=['train', 'test'])
    for epoch in range(num_epochs):
        for X, y in train_iter:
            l = loss(net(X), y) + lambd * l2_penalty(w)  # loss plus the weight decay penalty
            l.sum().backward()  # compute the gradients
            d2l.sgd([w, b], lr, batch_size)  # update the parameters; d2l.sgd also zeroes the gradients
        if (epoch + 1) % 5 == 0:
            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                     d2l.evaluate_loss(net, test_iter, loss)))
    print('L2 norm of w:', torch.norm(w).item())

train(lambd=5)

# Concise implementation
def train_concise(wd):
    net = nn.Sequential(nn.Linear(num_inputs, 1))
    for param in net.parameters():
        param.data.normal_()
    loss = nn.MSELoss()
    num_epochs, lr = 100, 0.003
    # weight_decay plays the role of lambd; it is set for the weight only, so the bias is not decayed
    trainer = torch.optim.SGD([{"params": net[0].weight, 'weight_decay': wd},
                               {"params": net[0].bias}], lr=lr)
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[5, num_epochs], legend=['train', 'test'])
    for epoch in range(num_epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            trainer.step()
        if (epoch + 1) % 5 == 0:
            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                     d2l.evaluate_loss(net, test_iter, loss)))
    print('L2 norm of w:', net[0].weight.norm().item())

train_concise(3)

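VI. Dropout: A Remedy for Overfitting

(1) Inject noise between layers in a way that keeps the expectation unchanged, $E[x'] = x$. Each element is perturbed as

$x_{i}^{'}=\begin{cases} 0 & \text{with probability } p \\ \frac{x_{i}}{1-p} & \text{otherwise} \end{cases}$

Dropout is usually applied to the outputs of the fully connected hidden layers, and only during training.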
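The scaling by $\frac{1}{1-p}$ is what keeps the expectation unchanged: $E[x_i'] = p \cdot 0 + (1-p)\cdot\frac{x_i}{1-p} = x_i$. The sketch below checks this empirically with a hypothetical pure-Python `dropout` helper; the seed and trial count are arbitrary.

```python
import random

def dropout(xs, p, rng):
    """Zero each element with probability p; scale the survivors by 1/(1-p)."""
    assert 0 <= p < 1
    return [0.0 if rng.random() < p else x / (1 - p) for x in xs]

rng = random.Random(0)
x, p, trials = 2.0, 0.5, 100_000
samples = [dropout([x], p, rng)[0] for _ in range(trials)]
mean = sum(samples) / trials
print(mean)
```

Every individual output is either 0 or x / (1 - p), yet the average over many draws stays close to the original x.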

(2) Implementation

import torch
from torch import nn
from d2l import torch as d2l

def dropout_layer(X, dropout):  # dropout layer written by hand
    assert 0 <= dropout <= 1  # the drop probability must lie in [0, 1]
    if dropout == 1:
        return torch.zeros_like(X)
    if dropout == 0:
        return X
    # torch.rand samples uniformly from [0, 1), so each element is kept with probability 1 - dropout
    # (torch.randn samples from a standard normal and would drop the wrong fraction)
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)

num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
dropout1, dropout2 = 0.2, 0.5

# Network with two hidden layers
class Net(nn.Module):
    def __init__(self, num_inputs, num_outputs, num_hiddens1, num_hiddens2, is_training=True):
        super(Net, self).__init__()
        self.num_inputs = num_inputs
        self.training = is_training
        self.lin1 = nn.Linear(num_inputs, num_hiddens1)
        self.lin2 = nn.Linear(num_hiddens1, num_hiddens2)
        self.lin3 = nn.Linear(num_hiddens2, num_outputs)
        self.relu = nn.ReLU()

    def forward(self, X):
        H1 = self.relu(self.lin1(X.reshape((-1, self.num_inputs))))
        if self.training:  # apply dropout only while training
            H1 = dropout_layer(H1, dropout1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            H2 = dropout_layer(H2, dropout2)
        out = self.lin3(H2)
        return out


net = Net(num_inputs, num_outputs, num_hiddens1, num_hiddens2)
num_epochs, lr, batch_size = 10, 0.5, 256
loss = nn.CrossEntropyLoss()
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

# Concise implementation: nn.Dropout is active in train mode and a no-op in eval mode
net1 = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                     nn.Dropout(dropout1), nn.Linear(256, 256), nn.ReLU(),
                     nn.Dropout(dropout2), nn.Linear(256, 10))
trainer1 = torch.optim.SGD(net1.parameters(), lr=lr)
d2l.train_ch3(net1, train_iter, test_iter, loss, num_epochs, trainer1)

VII. Numerical Stability and Model Initialization

(1) Gradients in a neural network

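For a network with d layers, where $h^{t}=f_{t}(h^{t-1})$ and the loss is $\ell$, the chain rule gives the gradient with respect to the weights $W^{t}$ of layer t as a product of per-layer terms:

$\frac{\partial \ell }{\partial W^{t}}=\frac{\partial \ell }{\partial h^{d}}\frac{\partial h^{d}}{\partial h^{d-1}}\cdots \frac{\partial h^{t+1}}{\partial h^{t}}\frac{\partial h^{t}}{\partial W^{t}}$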
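By the chain rule, the gradient of a deep network is a product of one term per layer, and a long product is numerically fragile. The sketch below reduces each layer to a single scalar factor, which is a strong simplification assumed here for illustration (real layers contribute matrices); the factor values 0.8 and 1.5 and the depth of 50 are arbitrary.

```python
def chain_product(layer_factor, depth):
    """Multiply `depth` identical per-layer derivative factors, as the chain rule does."""
    grad = 1.0
    for _ in range(depth):
        grad *= layer_factor
    return grad

vanishing = chain_product(0.8, 50)   # factors slightly below 1 shrink the gradient toward 0
exploding = chain_product(1.5, 50)   # factors slightly above 1 blow the gradient up
print(vanishing, exploding)
```

Even factors close to 1 produce gradients that differ by many orders of magnitude after 50 layers, which is the motivation for careful initialization.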