NumPy Case Study: Random Walks


致于数据科学家的小陈 · 2022-08-23 10:00:12 · Category: Data Analysis


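© Copyright belongs to the author. This is an original work by 51CTO blog author 致于数据科学家的小陈. Please contact the author for permission to reproduce it; unauthorized reproduction may be pursued legally.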

import numpy as np

The numpy.random module supplements the built-in Python random module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions. For example, you can get a 4x4 array of samples from the standard normal distribution using normal:

samples = np.random.normal(size=(4,4))
samples

array([[-0.49777854,  1.01894039,  0.3542692 ,  1.0187122 ],
       [-0.07139068, -0.44245259, -2.05535526,  0.49974435],
       [ 0.80183078, -0.11299759,  1.22759314,  1.37571884],
       [ 0.32086762, -0.79930024, -0.31965109,  0.23004107]])

Python's built-in random module, by contrast, only samples one value at a time. As you can see from this benchmark, numpy.random is well over an order of magnitude faster for generating very large samples:

from random import normalvariate

n = 100000

'Python run time:'
%timeit samples = [normalvariate(0,1) for _ in range(n)]

'Python run time:'

127 ms ± 7.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

'np.random run time:'
%timeit np.random.normal(size=n)

'np.random run time:'

4.2 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
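The %timeit magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library timeit module can be used instead. The helper names python_samples and numpy_samples and the repeat count of 5 are choices made for this illustration, and the timings will differ from machine to machine.

```python
import timeit
from random import normalvariate

import numpy as np

n = 100_000  # same sample size as the %timeit cells above


def python_samples():
    # One draw per loop iteration with the built-in random module
    return [normalvariate(0, 1) for _ in range(n)]


def numpy_samples():
    # All n draws in a single vectorized call
    return np.random.normal(size=n)


repeats = 5  # arbitrary; more repeats give steadier averages
t_python = timeit.timeit(python_samples, number=repeats) / repeats
t_numpy = timeit.timeit(numpy_samples, number=repeats) / repeats

print(f"pure Python: {t_python * 1e3:.1f} ms per call")
print(f"NumPy:       {t_numpy * 1e3:.1f} ms per call")
```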

%%time

# I later came to prefer %%time for timing a whole cell

from random import normalvariate 

n = 1000000

'Python run time:'
samples = [normalvariate(0,1) for _ in range(n)]

Wall time: 1.08 s

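We say that these are pseudorandom numbers because they are generated by an algorithm with deterministic behavior. You can change NumPy's random number generation seed using np.random.seed:

"cj: what does seed() do? Does it pin the algorithm so that every run gives the same numbers?"

np.random.seed(1234)
"Understood now: seed(n) sets the random seed. n is not a number of draws; it fixes the generator's starting state, so re-seeding with the same n replays the same sequence of draws."

'cj: what does seed() do? Does it pin the algorithm so that every run gives the same numbers?'

"Understood now: seed(n) sets the random seed. n is not a number of draws; it fixes the generator's starting state, so re-seeding with the same n replays the same sequence of draws."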
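A small sketch may make the effect of the seed concrete: calling np.random.seed with the same value resets the global generator to the same starting state, so the draws that follow repeat, while further draws without re-seeding simply continue the stream. The variable names first, second and third are only for this illustration.

```python
import numpy as np

np.random.seed(1234)
first = np.random.randn(3)

np.random.seed(1234)         # reset to the same starting state
second = np.random.randn(3)  # replays the same three values

third = np.random.randn(3)   # no re-seed: the stream just continues

print(np.array_equal(first, second))  # True
print(np.array_equal(second, third))  # False
```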

The data generation functions in numpy.random use a global random seed. To avoid global state, you can use numpy.random.RandomState to create a random number generator isolated from others:

rng = np.random.RandomState(1234)

rng.randn(10)

array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 , -0.72058873,
        0.88716294,  0.85958841, -0.6365235 ,  0.01569637, -2.24268495])
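To illustrate what "isolated" means here, the sketch below creates two RandomState objects with the same seed and shows that drawing from the global generator in between does not disturb them. The names rng_a and rng_b are made up for the example. The last lines use np.random.default_rng, the newer generator interface available in NumPy 1.17 and later; it is not part of the original text and produces a different stream from RandomState.

```python
import numpy as np

rng_a = np.random.RandomState(1234)
rng_b = np.random.RandomState(1234)

# Draw from the global generator; this does not touch rng_a or rng_b
np.random.seed(0)
np.random.randn(5)

a = rng_a.randn(4)
b = rng_b.randn(4)
print(np.array_equal(a, b))  # True: same seed, independent of global state

# Newer interface (NumPy >= 1.17), shown only as an aside
gen = np.random.default_rng(1234)
print(gen.standard_normal(4).shape)  # (4,)
```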

See Table 4-8 for a partial list of functions available in numpy.random. I will give some examples of leveraging these functions' ability to generate large arrays of samples all at once in the next section.

Function      Description
seed          Seed the random number generator
permutation   Return a random permutation of a sequence, or return a permuted range
shuffle       Randomly permute a sequence in place
rand          Draw samples from a uniform distribution over [0, 1); rand(d0, d1, ...)
uniform       Draw samples from a uniform distribution over [low, high); uniform(low=0.0, high=1.0, size=None)
randint       Draw random integers from a given low-to-high range; low is included, high is excluded
randn         Draw samples from the standard normal distribution \(N(\mu=0, \sigma=1)\); randn(d0, d1, ...)
normal        Draw samples from a normal distribution \(N(\mu, \sigma)\); normal(loc=0.0, scale=1.0, size=None)
binomial      Draw samples from a binomial distribution
chisquare     Draw samples from a chi-square distribution
beta          Draw samples from a beta distribution
gamma         Draw samples from a gamma distribution

"permutation(x): a random permutation of the integers 0 to x-1"
np.random.permutation(6)

"shuffle(seq): shuffle a sequence in place"
arr1 = np.arange(6)
np.random.shuffle(arr1)
arr1

'permutation(x): a random permutation of the integers 0 to x-1'

array([1, 2, 5, 0, 3, 4])

'shuffle(seq): shuffle a sequence in place'

array([5, 2, 3, 0, 4, 1])
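The practical difference between the two functions is easy to miss, so here is a minimal sketch: permutation returns a new, shuffled array and leaves its input alone, while shuffle rearranges its argument in place and returns None. The seed 42 and the variable names are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.RandomState(42)  # arbitrary seed for repeatability
original = np.arange(6)

shuffled_copy = rng.permutation(original)  # returns a new array
unchanged = np.array_equal(original, np.arange(6))
print(unchanged)  # True: the input was not modified

result = rng.shuffle(original)  # reorders `original` itself
print(result)  # None: nothing is returned

# Both results still contain exactly the values 0..5
print(sorted(shuffled_copy.tolist()))  # [0, 1, 2, 3, 4, 5]
print(sorted(original.tolist()))       # [0, 1, 2, 3, 4, 5]
```

A common mistake is writing arr = np.random.shuffle(arr), which replaces the array with None.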

"rand(d0, d1, ...): uniform distribution over [0, 1)"
"shape (2, 1, 1): 2 blocks, each holding 1 row with 1 value"
"[ [[x]], [[x]] ]"

np.random.rand(2,1,1)

"uniform(low, high, size): here 5 samples from [3, 4)"
np.random.uniform(3,4,5)

'rand(d0, d1, ...): uniform distribution over [0, 1)'

'shape (2, 1, 1): 2 blocks, each holding 1 row with 1 value'

'[ [[x]], [[x]] ]'

array([[[0.06152103]],

       [[0.8725525 ]]])

'uniform(low, high, size): here 5 samples from [3, 4)'

array([3.35219682, 3.62783057, 3.51758469, 3.37434633, 3.64026243])

"randn(d0, d1, ...): standard normal distribution"
np.random.randn(1,2,3)

"normal(loc=0, scale=1, size=None); here mean 80 and standard deviation 20"
np.random.normal(80, 20, 6)

'randn(d0, d1, ...): standard normal distribution'

array([[[0.49112636, 0.90638754, 0.05000051],
        [1.21431522, 0.67847748, 1.3797269 ]]])

'normal(loc=0, scale=1, size=None); here mean 80 and standard deviation 20'

array([55.07130243, 56.34397557, 68.95608996, 31.40875572, 89.80741058,
       37.38567435])
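Six samples are too few to see which argument is the mean and which is the standard deviation. As a rough check, the sketch below draws a much larger sample with the same arguments, normal(80, 20, ...), and prints its sample mean and standard deviation, which should land close to 80 and 20. It also shows the equivalent rescaling of randn output. The seed and the sample size of 100,000 are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed for repeatability

# Positional arguments are loc (mean) and scale (standard deviation)
scores = rng.normal(80, 20, size=100_000)
sample_mean = scores.mean()
sample_std = scores.std()
print(f"normal(80, 20): mean ~ {sample_mean:.1f}, std ~ {sample_std:.1f}")

# The same distribution, built by shifting and scaling standard normal draws
rescaled = 80 + 20 * rng.randn(100_000)
print(f"80 + 20 * randn: mean ~ {rescaled.mean():.1f}, std ~ {rescaled.std():.1f}")
```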

"binomial(n, p, size): n trials, success probability p, 5 samples"
np.random.binomial(100, 0.4, 5)

"chisquare(10, 2): 2 samples from a chi-square distribution with 10 degrees of freedom"
np.random.chisquare(10, 2)

'binomial(n, p, size): n trials, success probability p, 5 samples'

array([47, 42, 41, 34, 44])

array([20.07301382, 14.54581473])

Example: Random Walks

The simulation of random walks provides an illustrative application of utilizing array operations. Let's first consider a simple random walk starting at 0 with steps of 1 and -1 occurring with equal probability.

Here is a pure Python way to implement a single random walk with 1000 steps using the built-in random module.

import random
import matplotlib.pyplot as plt


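def random_walk(position=0, steps=1000, walk=None):
    """
    position: starting position (default 0)
    steps: number of steps to take (default 1000)
    walk: optional list to append positions to; a new list is created when None
    """
    # A mutable default such as walk=[] would be shared between calls
    if walk is None:
        walk = []
    for i in range(steps):
        state = 1 if random.randint(0,1) else -1
        position += state
        # Record the position after every step
        walk.append(position)
        
    return walk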

# test
walk_result = random_walk()

"plot the first 100 values on one of these random walks:"
"line plot by default"

plt.plot(walk_result[:100])

'plot the first 100 values on one of these random walks:'

'line plot by default'

[<matplotlib.lines.Line2D at 0x19329c95208>]
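The pure Python walk calls random.randint(0, 1), which can return both endpoints. NumPy's np.random.randint behaves differently: the lower bound is included but the upper bound is excluded, so the NumPy counterpart of a coin flip is randint(0, 2). The sketch below collects the distinct values each function produces; the seeds and the 1000 draws are arbitrary choices for the illustration.

```python
import random

import numpy as np

random.seed(0)
# Built-in random.randint(a, b): both a and b can be returned
py_values = {random.randint(0, 2) for _ in range(1000)}

rng = np.random.RandomState(0)
# NumPy randint(low, high): low is included, high is excluded
np_values = set(rng.randint(0, 2, size=1000).tolist())

print(sorted(py_values))  # [0, 1, 2]
print(sorted(np_values))  # [0, 1]
```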

You might make the observation that walk is simply the cumulative sum of the random steps and could be evaluated as an array expression. Thus, I use the np.random module to draw 1000 coin flips at once, set these to 1 and -1, and compute the cumulative sum:

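nsteps = 1000

"size replaces the for loop: array-oriented programming"
draws = np.random.randint(0, 2, size=nsteps)

"np.where works as a vectorized ternary, replacing if-else"
steps = np.where(draws > 0, 1, -1)

'cumsum: cumulative sum'
walk = steps.cumsum()

plt.plot(walk)

'size replaces the for loop: array-oriented programming'

'np.where works as a vectorized ternary, replacing if-else'

'cumsum: cumulative sum'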

[<matplotlib.lines.Line2D at 0x1932a032278>]
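To check that the array expression really computes the same thing as the loop, the sketch below builds one set of steps and then accumulates it both ways. The names walk_loop and walk_vec, and the seed, are only for this illustration.

```python
import numpy as np

rng = np.random.RandomState(1234)  # arbitrary seed for repeatability
draws = rng.randint(0, 2, size=1000)
steps = np.where(draws > 0, 1, -1)

# Loop version: keep a running position and record it after each step
position = 0
walk_loop = []
for step in steps:
    position += int(step)
    walk_loop.append(position)

# Array version: the walk is the cumulative sum of the steps
walk_vec = steps.cumsum()

print(np.array_equal(walk_loop, walk_vec))  # True
```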

From this we can begin to extract statistics like the minimum and maximum value along the walk's trajectory:

"Extremes of the walk"
walk.min()
walk.max()

type(walk)

'Extremes of the walk'

-17

numpy.ndarray

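# cj test cumsum()
cj_arr = np.array([[1,2,3],[4,5,6]])
cj_arr

"cumsum() computes running totals and returns an ndarray, so the intermediate sums stay visible"
np.cumsum(cj_arr, axis=0) # down the rows: running total of each column

np.cumsum(cj_arr, axis=1) # across the columns: running total of each row

array([[1, 2, 3],
       [4, 5, 6]])

'cumsum() computes running totals and returns an ndarray, so the intermediate sums stay visible'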

array([[1, 2, 3],
       [5, 7, 9]], dtype=int32)

array([[ 1,  3,  6],
       [ 4,  9, 15]], dtype=int32)

A more complicated statistic is the first crossing time, the step at which the random walk reaches a particular value. Here we might want to know how long it took the random walk to get at least 10 steps away from the origin 0 in either direction. np.abs(walk) >= 10 gives us a boolean array indicating where the walk has reached or exceeded 10, but we want the index of the first 10 or -10. It turns out that we can compute this using argmax, which returns the first index of the maximum value in the boolean array (True is the maximum value):

"Index of the first step whose absolute distance from the origin is at least 10"
(np.abs(walk) >= 10).argmax()

'Index of the first step whose absolute distance from the origin is at least 10'

Note that using argmax here is not always efficient because it always makes a full scan of the array. In this special case, once a True is observed we know it to be the maximum value.
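One more caveat is worth a sketch: if the walk never reaches the level, the boolean array is all False and argmax returns 0, which looks like a crossing at the very first step. A small helper can guard against that by checking any() first. The function name first_crossing and the toy walk below are made up for this illustration.

```python
import numpy as np


def first_crossing(walk, level):
    """Index of the first step with |walk| >= level, or None if never reached."""
    hits = np.abs(walk) >= level
    if not hits.any():
        return None  # argmax alone would misleadingly return 0 here
    return int(hits.argmax())


toy_walk = np.array([1, 2, 1, 0, -1, -2, -3, -2])

print(first_crossing(toy_walk, 3))         # 6
print(first_crossing(toy_walk, 10))        # None
print((np.abs(toy_walk) >= 10).argmax())   # 0, although 10 is never reached
```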

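# cj test

np.max([1,3,2,5,7])

"maximum returns an array: each element is compared with the second argument and the larger of the two is kept"
np.maximum([10,1,3,2,5,7],6)

'maximum returns an array: each element is compared with the second argument and the larger of the two is kept'

array([10,  6,  6,  6,  6,  7])

"To be continued... (note: this gives the index of the largest |walk| value, or 0 if the walk never exceeds 10; it is not the first crossing time)"
np.maximum(np.abs(walk), 10).argmax()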

# cj_arr1 = np.array([10,6,6,6,6,7]).argmax()
# cj_arr1

Simulating Many Random Walks at Once

If your goal is to simulate many random walks, say 5000 of them, you can generate all of the random walks with minor modifications to the preceding code. If passed a 2-tuple, the numpy.random functions will generate a two-dimensional array of draws, and we can compute the cumulative sum across the rows to compute all 5000 random walks in one shot:

# cj test
np.random.randint(0,2, size=(3,4))

array([[0, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 0, 1, 1]])

"Generate 5000 walks at once, 1000 steps each: a 5000 x 1000 dataset"
nwalks = 5000
nsteps = 1000

# 0 or 1
%time draws = np.random.randint(0, 2, size=(nwalks, nsteps))

"5,000,000 draws in so little time, impressive (1 s = 10^3 ms)"

"draws > 0 -> 1, otherwise -1"
steps = np.where(draws > 0, 1, -1)

"axis=1: accumulate along each row, moving across the columns"
walks = steps.cumsum(axis=1)

walks

'Generate 5000 walks at once, 1000 steps each: a 5000 x 1000 dataset'

Wall time: 58 ms

'5,000,000 draws in so little time, impressive (1 s = 10^3 ms)'

'draws > 0 -> 1, otherwise -1'

'axis=1: accumulate along each row, moving across the columns'

array([[ -1,   0,  -1, ...,  22,  21,  22],
       [ -1,  -2,  -3, ...,  52,  53,  54],
       [  1,   0,  -1, ..., -20, -19, -20],
       ...,
       [ -1,  -2,  -3, ..., -48, -47, -48],
       [  1,   2,   1, ...,  -8,  -7,  -8],
       [ -1,  -2,  -1, ..., -10, -11, -10]], dtype=int32)

Now, we can compute the maximum and minimum values obtained over all of the walks:

"Over the whole array"

'max', walks.max()
'min', walks.min()

'Over the whole array'

('max', 112)

('min', -109)

Out of these walks, let's compute the minimum crossing time to 30 or -30. This is slightly tricky because not all 5000 of them reach 30. We can check this using the any method:

hist30 = (np.abs(walks) >= 30).any(axis=1)  # one True/False per row (walk)

hist30

'Number of walks that hit 30 or -30'
hist30.sum()

array([ True,  True,  True, ...,  True,  True,  True])

'Number of walks that hit 30 or -30'

We can use this boolean array to select out the rows of walks that actually cross the absolute 30 level and call argmax across axis=1 to get the crossing times:

'cj: still need to digest this'
crossing_times = (np.abs(walks[hist30]) >= 30).argmax(1)
crossing_times.mean()

'cj: still need to digest this'

497.102877070619
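The two-step pattern (filter with any, then argmax along axis=1) may be easier to follow on an array small enough to read by eye. The sketch below uses three hand-written walks and a level of 3 instead of 30; the array toy_walks and the level are invented for the example.

```python
import numpy as np

# Three tiny walks, one per row
toy_walks = np.array([
    [1, 2, 3, 2, 1],       # reaches +3 at index 2
    [-1, 0, 1, 0, -1],     # never gets 3 away from the origin
    [-1, -2, -1, -2, -3],  # reaches -3 at index 4
])
level = 3

# Step 1: which walks ever reach the level?
hits = (np.abs(toy_walks) >= level).any(axis=1)
print(hits.tolist())  # [True, False, True]

# Step 2: for those walks only, the index of the first True in each row
crossing_times = (np.abs(toy_walks[hits]) >= level).argmax(axis=1)
print(crossing_times.tolist())  # [2, 4]
print(crossing_times.mean())    # 3.0
```

Filtering first matters because argmax would report 0 for the second walk, which never crosses, and that would drag the mean down.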

Feel free to experiment with other distributions for the steps other than equal-sized coin flips. You need only use a different random number generation function, like normal to generate normally distributed steps with some mean and standard deviation:

import sys

steps = np.random.normal(loc=0, scale=0.25, size=(5000, 1000))

"Check how much memory this object uses"

"{} array of {} takes {} bytes".format(steps.shape, steps.dtype, sys.getsizeof(steps))

'Check how much memory this object uses'

'(5000, 1000) array of float64 takes 40000112 bytes'
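The reported size is the raw data (5000 x 1000 values at 8 bytes each) plus a small object header. As a hedged sketch, the nbytes attribute gives the data size directly, which is often what you want when comparing arrays, since sys.getsizeof only counts the data buffer for arrays that own it. The header size varies between NumPy versions, so no exact value is asserted for it here. np.zeros stands in for the random steps because only the shape and dtype matter.

```python
import sys

import numpy as np

steps = np.zeros((5000, 1000))  # float64, same shape as the steps array above

print(steps.nbytes)  # 40000000: 5000 * 1000 elements * 8 bytes

# Difference between the two measures is the ndarray object header
overhead = sys.getsizeof(steps) - steps.nbytes
print(f"object overhead: {overhead} bytes")

# A slice is a view that shares memory with `steps`; nbytes still reports
# the size of the data it spans
view = steps[:10]
print(view.nbytes)  # 80000
```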

Conclusion

While much of the rest of the book will focus on building data wrangling skills with pandas, we will continue to work in a similar array-based style. In Appendix A, we will dig deeper into NumPy features to help you further develop your array computing skills.

Patience and perseverance always pay off.