NumPy is the foundational package for high-performance scientific computing and data analysis in Python.

The multidimensional array object: ndarray

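All elements of an ndarray must be of the same type. Every array has a shape (a tuple giving the size of each dimension) and a dtype (an object describing the data type of the array).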

Creating an ndarray

The simplest way is the array function, which accepts any sequence-like object (including other NumPy arrays):

>>> import numpy as np
>>> data1 = [1,2,3,4,5]
>>> arr1 = np.array(data1)
>>> arr1
array([1, 2, 3, 4, 5])

You can also pass a list of equal-length lists, which produces a multidimensional array:

>>> data2 = [[1,2,3,4],[5,6,7,8]]
>>> arr2 = np.array(data2)
>>> arr2
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
>>> arr2.ndim
2
>>> arr2.shape
(2, 4)
>>> arr2.dtype
dtype('int64')

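Other commonly used constructors are zeros and ones, which create arrays of all 0s or all 1s with a given length or shape, and empty, which creates an array without initializing its values to anything in particular.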

>>> np.zeros(10)
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> np.ones(3)
array([ 1., 1., 1.])

>>> np.ones((3,1))
array([[ 1.],
[ 1.],
[ 1.]])
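
The empty function is mentioned above but not shown. A minimal sketch of how it is typically used follows; the variable name buf and the fill value are illustrative assumptions. Because the memory is not initialized, the array should be filled before it is read.

```python
import numpy as np

# np.empty allocates memory without initializing it, so the initial
# contents are arbitrary and should not be relied on.
buf = np.empty((2, 3))
print(buf.shape, buf.dtype)

# Fill the buffer before reading from it.
buf[:] = 7.0
print(buf)
```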

There is also arange, an array version of the built-in range function:

>>> np.arange(15)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])

The eye and identity functions create a square N×N identity matrix.
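
A short sketch of both functions, with illustrative variable names. It also shows that eye, unlike identity, accepts a diagonal offset k and a non-square shape.

```python
import numpy as np

i3 = np.identity(3)        # 3x3 identity matrix
e3 = np.eye(3)             # same result for a square matrix
shifted = np.eye(3, k=1)   # ones on the first superdiagonal instead
rect = np.eye(2, 3)        # eye also accepts a non-square shape

print(i3)
print(shifted)
print(rect)
```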

Data types for ndarrays

>>> arr1 = np.array([1,2,3],dtype=np.float64)
>>> arr2 = np.array([1,2,3],dtype=np.int32)
>>> arr1.dtype
dtype('float64')
>>> arr2.dtype
dtype('int32')

You can explicitly convert an array to another dtype with the ndarray astype method:

>>> import numpy as np
>>> arr = np.array([1,2,3,4,5])
>>> arr.dtype
dtype('int64')
>>> float_arr = arr.astype(np.float64)
>>> float_arr.dtype
dtype('float64')
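
Two further astype conversions are sketched below with made-up sample values. Casting floats to integers truncates the fractional part, and an array of numeric strings can be parsed into floats. astype always returns a new array and leaves the original unchanged.

```python
import numpy as np

floats = np.array([3.7, -1.2, 0.5, 9.9])
ints = floats.astype(np.int32)   # fractional part is truncated toward zero
print(ints)                      # [ 3 -1  0  9]

numeric_strings = np.array(['1.25', '-9.6', '42'])
parsed = numeric_strings.astype(np.float64)  # strings are parsed as numbers
print(parsed)

print(floats)                    # the source array is not modified
```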

Operations between arrays and scalars

>>> arr = np.array([[1.,2.,3.],[4.,5.,6.]])
>>> arr
array([[ 1., 2., 3.],
[ 4., 5., 6.]])
>>> arr * arr
array([[ 1., 4., 9.],
[ 16., 25., 36.]])
>>> arr - arr
array([[ 0., 0., 0.],
[ 0., 0., 0.]])
>>> 1/arr
array([[ 1. , 0.5 , 0.33333333],
[ 0.25 , 0.2 , 0.16666667]])
>>> arr ** 0.5
array([[ 1. , 1.41421356, 1.73205081],
[ 2. , 2.23606798, 2.44948974]])

Basic indexing and slicing

>>> import numpy as np
>>> arr = np.arange(10)
>>> arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> arr[5]
5
>>> arr[5:8]
array([5, 6, 7])
>>> arr[5:8] = 12
>>> arr
array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])

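The most important difference between NumPy arrays and lists is that an array slice is a view of the original array, so any modification made through the view is reflected directly in the source array: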

>>> arr_slice = arr[5:8]
>>> arr_slice[1] = 5
>>> arr
array([ 0, 1, 2, 3, 4, 12, 5, 12, 8, 9])
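
When an independent copy is wanted rather than a view, the slice can be copied explicitly. A minimal sketch contrasting the two (the names view and copied are illustrative):

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]            # a view: shares memory with arr
copied = arr[5:8].copy()   # an independent copy

copied[:] = -1             # does not touch arr
view[:] = 64               # writes through to arr
print(arr)                 # [ 0  1  2  3  4 64 64 64  8  9]
```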

Indexing with slices

>>> arr
array([ 0, 1, 2, 3, 4, 12, 5, 12, 8, 9])
>>> arr[:6]
array([ 0, 1, 2, 3, 4, 12])
>>> arr2d = np.array([[1,2,3,],[4,5,6],[7,8,9]])
>>> arr2d
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> arr2d[:2]
array([[1, 2, 3],
[4, 5, 6]])

You can also pass several slices at once:

>>> arr2d[:2,1:]
array([[2, 3],
[5, 6]])

Mixing an integer index with a slice yields a lower-dimensional slice (one-dimensional here):

>>> arr2d[1,:2]
array([4, 5])
>>> arr2d[2,:1]
array([7])

Boolean indexing

Suppose we have one array holding data and another holding names.

>>> names = np.array(['Bob','Joe','Will','Bob','Will','Joe','Joe'])
>>> data = np.random.randn(7,4)
>>> names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'],
dtype='|S4')
>>> data
array([[-0.67023922, -1.14645338, -0.69756796, 0.58820624],
[ 0.76535231, -1.37177997, 1.21513091, 0.34985306],
[-0.64221811, 0.63777144, 0.50440548, -1.19693581],
[ 0.70748924, -0.22569276, 1.3192902 , 0.51158935],
[-0.43007899, 0.43423987, -0.36440682, -0.06110292],
[-0.25360922, 0.01430185, 0.302572 , -0.81345566],
[ 1.56183223, -0.15471421, 0.79385929, 0.01137061]])

Comparing names with the string "Bob" produces a boolean array:

>>> names == 'Bob'
array([ True, False, False, True, False, False, False], dtype=bool)
>>> data[names=='Bob']
array([[-0.67023922, -1.14645338, -0.69756796, 0.58820624],
[ 0.70748924, -0.22569276, 1.3192902 , 0.51158935]])

As shown above, this selects rows 0 and 3.

Boolean arrays can also be combined with slices and integers:

>>> data[names=='Bob',2:]
array([[-0.69756796, 0.58820624],
[ 1.3192902 , 0.51158935]])

To select everything except "Bob", you can either use != or negate the condition:

>>> names != 'Bob'
array([False, True, True, False, True, True, True], dtype=bool)
>>> data[~(names == 'Bob')]
array([[ 0.76535231, -1.37177997, 1.21513091, 0.34985306],
[-0.64221811, 0.63777144, 0.50440548, -1.19693581],
[-0.43007899, 0.43423987, -0.36440682, -0.06110292],
[-0.25360922, 0.01430185, 0.302572 , -0.81345566],
[ 1.56183223, -0.15471421, 0.79385929, 0.01137061]])
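
Several boolean conditions can be combined with the element-wise operators & (and) and | (or); the Python keywords and and or do not work on boolean arrays. The sketch below uses stand-in integer data instead of random values so the selected rows are easy to recognize.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)  # stand-in data: row i starts at 4*i

mask = (names == 'Bob') | (names == 'Will')  # element-wise "or"
selected = data[mask]    # rows 0, 2, 3, 4
rest = data[~mask]       # rows 1, 5, 6 (the mask inverted with ~)

print(selected)
print(rest)
```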

To set all negative values in data to 0, we only need:

>>> data[data<0] = 0
>>> data
array([[ 0. , 0. , 0. , 0.58820624],
[ 0.76535231, 0. , 1.21513091, 0.34985306],
[ 0. , 0.63777144, 0.50440548, 0. ],
[ 0.70748924, 0. , 1.3192902 , 0.51158935],
[ 0. , 0.43423987, 0. , 0. ],
[ 0. , 0.01430185, 0.302572 , 0. ],
[ 1.56183223, 0. , 0.79385929, 0.01137061]])

Fancy indexing

Fancy indexing means indexing with integer arrays:

>>> arr = np.empty((8,4))
>>> for i in range(8):
... arr[i] = i
...
>>> arr
array([[ 0., 0., 0., 0.],
[ 1., 1., 1., 1.],
[ 2., 2., 2., 2.],
[ 3., 3., 3., 3.],
[ 4., 4., 4., 4.],
[ 5., 5., 5., 5.],
[ 6., 6., 6., 6.],
[ 7., 7., 7., 7.]])

To select a subset of rows in a particular order, simply pass a list or ndarray of integers specifying the desired order:

>>> arr[[4,3,0,5]]
array([[ 4., 4., 4., 4.],
[ 3., 3., 3., 3.],
[ 0., 0., 0., 0.],
[ 5., 5., 5., 5.]])
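
A few more fancy-indexing patterns, sketched on an illustrative 8×4 array: negative indices select rows from the end, passing two index arrays selects individual elements pairwise, and chaining two index operations selects a rectangular block. Unlike slicing, fancy indexing copies the data into a new array.

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)   # row i holds 4*i .. 4*i+3

last_rows = arr[[-3, -5, -7]]                 # rows 5, 3, 1
pairs = arr[[1, 5, 7, 2], [0, 3, 1, 2]]       # elements (1,0), (5,3), (7,1), (2,2)
block = arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]    # pick rows, then reorder columns

print(last_rows)
print(pairs)      # [ 4 23 29 10]
print(block)
```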

Transposing arrays and swapping axes

Both the transpose method and the T attribute can be used to transpose an array:

In [1]: import numpy as np

In [2]: arr = np.arange(15).reshape((3,5))

In [3]: arr
Out[3]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])

In [4]: arr.transpose()
Out[4]:
array([[ 0, 5, 10],
[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14]])

In [5]: arr.T
Out[5]:
array([[ 0, 5, 10],
[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14]])

In In [2] above, reshape turns the one-dimensional array of 15 elements (shape (15,), not 15×1) into a 3×5 matrix.

Use np.dot to compute the matrix inner product X^T X:

In [6]: arr = np.random.randn(6,3)

In [7]: np.dot(arr.T,arr)
Out[7]:
array([[ 5.89566092, 3.33083617, -4.03329493],
[ 3.33083617, 11.88231079, -9.15776969],
[ -4.03329493, -9.15776969, 17.59249066]])

A 3×6 matrix multiplied by a 6×3 matrix gives a 3×3 matrix.
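
The shape rule can be checked directly. The sketch below uses a fixed 6×3 array instead of random data (an assumption made so the run is repeatable) and also shows the @ operator, which performs the same matrix product as np.dot for two-dimensional arrays.

```python
import numpy as np

X = np.arange(18, dtype=np.float64).reshape(6, 3)

gram = np.dot(X.T, X)   # (3, 6) times (6, 3) -> (3, 3)
same = X.T @ X          # the @ operator gives the same matrix product

print(X.T.shape, X.shape, gram.shape)
print(np.allclose(gram, gram.T))   # X^T X is always symmetric
```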

Universal functions

A universal function (ufunc) is a function that performs element-wise operations on the data in an ndarray.

In [8]: arr = np.arange(10)
In [9]: np.sqrt(arr)
Out[9]:
array([ 0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
In [10]: np.exp(arr)
Out[10]:
array([ 1.00000000e+00, 2.71828183e+00, 7.38905610e+00,
2.00855369e+01, 5.45981500e+01, 1.48413159e+02,
4.03428793e+02, 1.09663316e+03, 2.98095799e+03,
8.10308393e+03])

These are unary ufuncs. Others, such as add or maximum, are binary: they take two arrays and return a single result array:

In [13]: x = np.random.randn(8)

In [14]: y = np.random.randn(8)

In [15]: x
Out[15]:
array([-0.59020498, -1.24075283, 2.11452765, 0.49729268, -0.77325658,
-0.55457265, -0.97342472, -0.99479623])

In [16]: y
Out[16]:
array([-0.07633226, -0.69846562, -0.2431167 , 0.17992333, 2.11079139,
-0.57698677, -1.64499275, 0.70726203])

In [17]: np.maximum(x,y)  # element-wise maximum
Out[17]:
array([-0.07633226, -0.69846562, 2.11452765, 0.49729268, 2.11079139,
-0.55457265, -0.97342472, 0.70726203])
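
A small deterministic sketch of the binary ufuncs named above, using made-up values. It also shows that a scalar argument is broadcast against the array.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.5])
y = np.array([0.5, 4.0, -1.0])

total = np.add(x, y)           # same as x + y           -> [1.5, 2.0, 2.5]
larger = np.maximum(x, y)      # element-wise maximum    -> [1.0, 4.0, 3.5]
clipped = np.maximum(x, 0.0)   # scalar is broadcast     -> [1.0, 0.0, 3.5]

print(total, larger, clipped)
```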

Data processing using arrays

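NumPy arrays let you express many data processing tasks as concise array expressions. Replacing explicit loops with array expressions is known as vectorization.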

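Suppose we want to evaluate the function sqrt(x^2 + y^2) over a regular grid of values.
First generate the data: the np.meshgrid function takes two one-dimensional arrays and produces two two-dimensional matrices corresponding to all (x, y) pairs from the two arrays: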

In [18]: points = np.arange(-5,5,0.01)  # 1000 equally spaced points

In [19]: xs,ys = np.meshgrid(points,points)

In [22]: z = np.sqrt(xs ** 2 + ys ** 2)

In [23]: z
Out[23]:
array([[ 7.07106781, 7.06400028, 7.05693985, ..., 7.04988652,
7.05693985, 7.06400028],
[ 7.06400028, 7.05692568, 7.04985815, ..., 7.04279774,
7.04985815, 7.05692568],
[ 7.05693985, 7.04985815, 7.04278354, ..., 7.03571603,
7.04278354, 7.04985815],
...,
[ 7.04988652, 7.04279774, 7.03571603, ..., 7.0286414 ,
7.03571603, 7.04279774],
[ 7.05693985, 7.04985815, 7.04278354, ..., 7.03571603,
7.04278354, 7.04985815],
[ 7.06400028, 7.05692568, 7.04985815, ..., 7.04279774,
7.04985815, 7.05692568]])
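
What meshgrid returns is easier to see on a tiny input. In this sketch (the input values are arbitrary), the first output repeats the x values along every row and the second repeats the y values along every column, so the pair (xs[i, j], ys[i, j]) enumerates the whole grid.

```python
import numpy as np

xs, ys = np.meshgrid(np.array([0, 1, 2]), np.array([10, 20]))
print(xs)   # [[0 1 2]
            #  [0 1 2]]
print(ys)   # [[10 10 10]
            #  [20 20 20]]

# The function is evaluated at every (x, y) pair without a Python loop.
z = np.sqrt(xs ** 2 + ys ** 2)
print(z.shape)   # (2, 3)
```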

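Expressing conditional logic as array operations

The numpy.where function is a vectorized version of the ternary expression x if condition else y.
First, consider the version that uses the ternary expression: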

In [24]: xarr = np.array([1.1,1.2,1.3,1.4,1.5])

In [25]: yarr = np.array([2.1,2.2,2.3,2.4,2.5])

In [26]: cond = np.array([True,False,True,True,False])

# take the value from xarr when cond is True, otherwise from yarr
In [27]: result = [(x if c else y) for x,y,c in zip(xarr,yarr,cond)]

In [28]: result
Out[28]: [1.1000000000000001, 2.2000000000000002, 1.3, 1.3999999999999999, 2.5]

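This approach has two problems: 1. it is slow for large arrays, because all the work is done in interpreted Python code; 2. it does not work with multidimensional arrays.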

Now consider the np.where version:

In [29]: result = np.where(cond,xarr,yarr)

In [30]: result
Out[30]: array([ 1.1, 2.2, 1.3, 1.4, 2.5])

The second and third arguments to np.where can also be scalars (single numbers).
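
A short sketch of the scalar forms on a small two-dimensional array with made-up values: either both choices are scalars, or a scalar is mixed with an array.

```python
import numpy as np

arr = np.array([[1.5, -0.3], [-2.0, 0.7]])

signs = np.where(arr > 0, 2, -2)      # both choices are scalars
clipped = np.where(arr > 0, 2, arr)   # replace positives with 2, keep the rest

print(signs)     # [[ 2 -2]
                 #  [-2  2]]
print(clipped)   # [[ 2.  -0.3]
                 #  [-2.   2. ]]
```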

Mathematical and statistical methods

A set of mathematical functions, available as array methods, compute statistics over an entire array or over the data along one axis.

In [31]: arr = np.random.randn(5,4)

In [32]: arr.mean()  # equivalent to the function call below
Out[32]: -0.49958506603328134

In [33]: np.mean(arr)
Out[33]: -0.49958506603328134

In [34]: arr.sum()
Out[34]: -9.9917013206656264

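Functions such as mean and sum accept an optional axis argument, which computes the statistic along that axis and produces an array with one fewer dimension: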

In [35]: arr
Out[35]:
array([[ 0.16070503, -1.76923734, -0.37564558, 0.81347613],
[-0.07424384, -2.70670764, -1.01160143, -1.67078168],
[-0.38030595, -1.48797376, -0.70779078, 1.36638934],
[-0.87429003, -0.5968173 , -0.26610553, -1.99155306],
[ 0.80147259, 0.55104424, 0.3596426 , -0.13137731]])

In [36]: arr.mean(axis=1)  # axis=1: mean of each row
Out[36]: array([-0.29267544, -1.36583365, -0.30242029, -0.93219148, 0.39519553])

In [37]: arr.sum(0)  # axis=0: sum of each column
Out[37]: array([-0.36666221, -6.0096918 , -2.00150073, -1.61384659])

Method     Description
cumsum     Cumulative sum of all elements
cumprod    Cumulative product of all elements
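
Unlike mean and sum, cumsum and cumprod do not aggregate: they return an array of intermediate results. A minimal sketch on an illustrative 3×3 array, along an axis and on the flattened array:

```python
import numpy as np

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

down = arr.cumsum(0)      # running sum down each column
across = arr.cumprod(1)   # running product across each row
flat = arr.cumsum()       # no axis: running sum over the flattened array

print(down)     # [[ 0  1  2]
                #  [ 3  5  7]
                #  [ 9 12 15]]
print(across)   # [[  0   0   0]
                #  [  3  12  60]
                #  [  6  42 336]]
print(flat)     # [ 0  1  3  6 10 15 21 28 36]
```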

Methods for boolean arrays

In [40]: arr = np.random.randn(100)

In [41]: (arr>0).sum()  # number of positive values
Out[41]: 51

In methods such as sum(), boolean values are coerced to 1 (True) and 0 (False).

There is also any, which tests whether one or more values in an array are True,
and all, which checks whether every value is True.
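
A brief sketch of any and all with made-up values, including the axis argument, which applies the test per row or per column:

```python
import numpy as np

bools = np.array([False, False, True, False])
has_true = bools.any()    # True: at least one element is True
all_true = bools.all()    # False: not every element is True

arr = np.array([[1, -2], [3, 4]])
rows_positive = (arr > 0).all(axis=1)   # per-row test -> [False, True]

print(has_true, all_true, rows_positive)
```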

File input and output with arrays

np.save and np.load save an array to disk in binary format and load it back:

In [2]: arr = np.arange(10)
In [3]: np.save('some_array',arr)  # the .npy extension is appended automatically if missing
In [4]: arr = np.load('some_array')
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-4-47445c2470a6> in <module>()
----> 1 arr = np.load('some_array')
/usr/local/python27/lib/python2.7/site-packages/numpy/lib/npyio.pyc in load(file, mmap_mode, allow_pickle, fix_imports, encoding)
368 own_fid = False
369 if isinstance(file, basestring):
--> 370 fid = open(file, "rb")
371 own_fid = True
372 elif is_pathlib_path(file):
IOError: [Errno 2] No such file or directory: 'some_array'
In [5]: arr = np.load('some_array.npy')  # note the .npy extension
In [6]: arr
Out[6]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

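You can also save several arrays into a single .npz archive with np.savez (the archive is not compressed; np.savez_compressed writes a compressed one):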

In [7]: np.savez('array_archive.npz',a=arr,b=arr)

In [8]: arch = np.load('array_archive.npz')

In [9]: arch['b']  # arrays are loaded lazily
Out[9]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
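
A self-contained round trip is sketched below. It writes to a temporary directory so nothing is left on disk; the file name demo_archive.npz and the array contents are hypothetical. It uses np.savez_compressed, and opens the archive in a with block so the underlying file is closed afterwards.

```python
import os
import tempfile

import numpy as np

a = np.arange(5)
b = np.linspace(0.0, 1.0, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_archive.npz')  # hypothetical file name
    np.savez_compressed(path, a=a, b=b)           # keyword names become the keys

    with np.load(path) as arch:
        names = sorted(arch.files)   # keys stored in the archive
        a_loaded = arch['a']         # each array is read when it is accessed
        b_loaded = arch['b']

print(names, a_loaded, b_loaded)
```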

Linear algebra

In [10]: x = np.array([[1.,2.,3.],[4.,5.,6.]])
In [12]: y = np.array([[6.,23.],[-1,7],[8,9]])
In [13]: x
Out[13]:
array([[ 1., 2., 3.],
[ 4., 5., 6.]])
In [14]: y
Out[14]:
array([[ 6., 23.],
[ -1., 7.],
[ 8., 9.]])
In [15]: x.dot(y)  # equivalent to np.dot(x,y); computes the matrix product
Out[15]:
array([[ 28., 64.],
[ 67., 181.]])

numpy.linalg provides a standard set of matrix decompositions along with functions such as inverse and determinant.

In [16]: from numpy.linalg import inv,qr
In [18]: X = np.random.randn(5,5)
In [20]: mat = X.T.dot(X)  # matrix product of X transposed and X
In [21]: inv(mat)  # inverse of mat
Out[21]:
array([[ 5.36516943, -0.58814193, -2.90872906, -0.82617219, -0.59256828],
[-0.58814193, 2.86767583, 0.8169438 , 0.32931394, 0.38375424],
[-2.90872906, 0.8169438 , 3.96375795, 0.60280722, 0.09418734],
[-0.82617219, 0.32931394, 0.60280722, 0.37214256, 0.47226758],
[-0.59256828, 0.38375424, 0.09418734, 0.47226758, 1.05462013]])
In [23]: mat.dot(inv(mat))
Out[23]:
array([[ 1.00000000e+00, 3.80031876e-17, -8.37358277e-16,
-1.81650867e-16, 4.96478433e-16],
[ -5.92206043e-17, 1.00000000e+00, 1.03500100e-16,
3.80106825e-17, -6.52342104e-17],
[ 1.50718339e-16, 3.46505958e-18, 1.00000000e+00,
-1.64306169e-16, -6.61741125e-17],
[ -1.18755757e-16, 8.25238725e-17, -3.11759945e-16,
1.00000000e+00, 3.40196146e-16],
[ -2.18770741e-16, -1.55558051e-16, 5.95809687e-17,
-3.85967026e-17, 1.00000000e+00]])

In [24]: q,r = qr(mat)  # compute the QR decomposition

In [25]: r
Out[25]:
array([[ -0.87023312, 0.31361688, 1.08561332, -12.91192628,
5.21749656],
[ 0. , -0.4015009 , 0.01416619, 0.44230432,
0.10987083],
[ 0. , 0. , -1.24470462, 7.69508472,
-3.73580595],
[ 0. , 0. , 0. , -1.37119593,
1.00207318],
[ 0. , 0. , 0. , 0. ,
0.73670226]])
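
The QR result can be sanity-checked numerically: q should have orthonormal columns, r should be upper triangular, and their product should reproduce mat up to floating-point error. The sketch below assumes NumPy 1.17 or later for np.random.default_rng, and the seed 0 is an arbitrary choice made only for repeatability.

```python
import numpy as np
from numpy.linalg import qr

rng = np.random.default_rng(0)     # seeded generator (assumption, for repeatability)
X = rng.standard_normal((5, 5))
mat = X.T @ X

q, r = qr(mat)

ok_product = np.allclose(q @ r, mat)             # q times r rebuilds mat
ok_orthonormal = np.allclose(q.T @ q, np.eye(5)) # columns of q are orthonormal
ok_upper = np.allclose(r, np.triu(r))            # r is upper triangular

print(ok_product, ok_orthonormal, ok_upper)
```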

Random number generation

In [26]: samples = np.random.normal(size=(4,4))  # 4x4 array of samples from the standard normal distribution

In [27]: samples
Out[27]:
array([[ -9.59175783e-02, 2.80064084e+00, 1.30208308e-03,
1.18368077e+00],
[ 1.43324694e+00, -5.31049050e-01, 4.98357067e-01,
5.19960030e-01],
[ -1.90984853e-01, -8.13233482e-01, -1.20337155e+00,
3.65353344e-01],
[ 6.13791888e-01, -4.95977294e-01, -1.41134819e-01,
1.01121801e+00]])
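
The samples above differ on every run. One way to make them repeatable is to draw from a seeded generator. This sketch assumes NumPy 1.17 or later, where np.random.default_rng is available, and the seed 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded generator
samples = rng.normal(loc=0.0, scale=1.0, size=(4, 4))

# A second generator with the same seed reproduces the same draws.
again = np.random.default_rng(42).normal(size=(4, 4))

print(samples.shape)
print(np.array_equal(samples, again))   # True
```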