using the GPU
想要看GPU的介绍性的讨论和对密集并行计算的使用,查阅:GPGPU.
theano设计的一个目标就是在一个抽象层面上进行特定的计算,所以内部的函数编译器需要灵活的处理这些计算,其中一个灵活性体现在可以在显卡上进行计算。
当前有两种方式来使用gpu,一种只支持NVIDIA cards (CUDA backend) ;另一种,还在开发中,可以支持任何 OpenCL设备,就像和NVIDIA cards (GpuArray Backend)一样。
一、CUDA backend
如果你没有准备好,那么就需要安装Nvidia 的 GPU编程工具链 (CUDA),然后配置好 Theano。我们提供了安装指南Linux, MacOS and Windows.(我的安装)。
1.1 测试theano和GPU
为了检查你的GPU是否启用了,可以剪切下面的代码然后保存成一个文件,运行看看。
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
print 'Used the cpu'
else:
print 'Used the gpu'
exp() 。注意到我们使用了 shared
device=cpu, 那么计算机将会花费大约 3 ;而在GPU 上,只需要0.64秒。不过 GPU不会一直生成完全和CPU一致的浮点数。 作为一个基准来说,调用numpy.exp(x.get_value())
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761
1.62323284]
Used the cpu
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
注意到在theano中GPU的操作在目前来说,只支持 floatX 为float32类型。
1.2 返回设备分配数据的句柄
在前面的例子中,加速并没有那么明显,这是因为函数返回的结果是作为一个 NumPy ndarray,而为了方便,已经从设备复制到主机上了。这就是为什么在device=gpu下很容易交换的原因,不过如果你不建议更少的可移植性,可以通过改变graph来用GPU的存储结果表示一个计算的过程来得到更大的加速。 gpu_from_host 操作也就是说“将输入从主机复制到GPU上”,然后在T.exp(x)被GPU版本的exp()替换后进行优化。
from theano import function, config, shared, sandbox
import theano.sandbox.cuda.basic_ops
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
print 'Used the cpu'
else:
print 'Used the gpu'
输出结果为:
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.34898686409 seconds
Result is <CudaNdarray object at 0x6a7a5f0>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
numpy.asarray())来转换成一个NumPy ndarray。
对更对你可以使用borrow flag加速的资料,查阅:Borrowing when Constructing Function Objects.
1.3 在GPU上加速的是什么?
在当我们接着优化我们的实现的时候,效果的特性也会改变,而且在从设备到设备之间会有所变化,不过现在还是给出一个粗略的想法吧:
- 只有float32 的数据类型的计算可以加速。针对float64的更好的支持期待将来的硬件,不过在目前(2010年1月)float64还是相当慢的。
- 当参数是足够大而保持30个处理器都工作的时候,矩阵乘法,卷积和大型的逐元素计算可以加速大概5-50x。
- 索引、维度重排和常量时间的reshaping在gpu和cpu上一样块。
- 在张量上基于行/列的求和在gpu上可能会比cpu上慢一点。
- 设备与主机之间大量的数据的复制是相当慢的,通常会抵消掉在数据上一两个加速函数的大部分优势。让gpu取得性能上的提升的关键取决于数据传输到设备上的时间消耗。
1.4 在gpu上提升效果的提示
- floatX=float32 加到你的 .theanorc
- allow_gc=False. 见 GPU Async capabilities
- matrix, vector 和 scalar 来替换dmatrix, dvector 和 dscalar。因为前者当设定floatX = float32
- 确保你的输出变量为float32 dtype而不是float64。在graph中更多的float32变量会让你将更多的工作放在gpu上实现。
- 使用shared float32变量存储频繁访问的数据(见shared())来最大程度的减少转移到gpu设备上花费的时间。当使用gpu的时候,float32 张量共享变量存储在gpu上,并默认的使用这些变量来消除到gpu上的传输时间。(这里的意思应该是创建的时候就放在gpu上,而无需每次调用都从cpu上传给gpu,从而这份数据能够一直保持在gpu上,减少多次的传输)。
- 如果你对你得到的效果不满意,试着用 mode='ProfileMode'来建立你的函数。这在程序终止的时候,会打印出一些时间信息。如果一个op或者apply花费了它共享还多的时间,那么如果你知道一些gpu变成,就可以看看在theano.sandbox.cuda上它是怎么实现的。检查下载cpu上花费的时间比例Xs(X%) ,和在gpu上花费的时间比例 Xs(X%) 和在传输操作上花费的时间比例 Xs(X%) 。这可以告诉你你的graph所花费的时间是在gpu上还是更多的在内存的传输上。
- -ftz=true to flush denormals values to zeros., –prec-div=false 和 –prec-sqrt=false 选项可以通过使用更少的精度来对除法和平方根操作进行加速,。你可以通过 nvcc.flags=–use_fast_math Theano flag 来一次启用它们,或者如子nvcc.flags=-ftz=true –prec-div=false一样分别对它们进行启用。
1.5 GPU 异步功能
从Theano 0.6开始,我们就开始使用gpu的异步功能了。这可以让我们运行的更快,不过可能会让一些错误在它们本应该出现的地方延迟抛出异常。则会导致当分析 theano apply节点的时候有些困难。这里有一个 NVIDIA 驱动特性有助于解决这些问题。如果你将环境变量设置成CUDA_LAUNCH_BLOCKING=1 那么,所有的kernel调用都会自动同步的。这会降低性能,不过却提供很好的profiling和合理的位置错误信息。
在graph中插入同步点。设置theano flag allow_gc=False 来得到甚至更快的速度!不过这会引起内存使用率上升的问题。
1.6 改变共享变量的值
为了改变共享变量的值,即对进程提供新的数据,可以使用函数shared_variable.set_value(new_value). 更详细的资料,查阅 Understanding Memory Aliasing for Speed and Correctness.
练习:再次拿逻辑回归做例子
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w,b])
# Compile expressions to functions
train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
updates={w:w-0.01*gw, b:b-0.01*gb},
name = "train")
predict = theano.function(inputs=[x], outputs=prediction,
name = "predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print 'Used the gpu'
else:
print 'ERROR, not able to tell if theano used the cpu or the gpu'
print train.maker.fgraph.toposort()
for i in range(training_steps):
pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()
print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])
通过使用floatX= float32来在gpu上执行该例子,并使用time python file.py。来查看执行时间 (帮助资料:Configuration Settings and Compiling Mode)。
从cpu到gpu上有速度的提升吗?
ProfileMode)
在gpu上如何有更好的速度的提升?
note:
- 当前只支持32 位 floats (其他待开发)。
- 有着float32 dtype的Shared 变量默认会放到gpu内存空间上.
- 当前一个gpu被限制成只允许一个进程。
- device=gpu
- device=gpu{0, 1, ...}
- floatX=float32 (through theano.config.floatX) 。
- 在存储到一个shared变量之前记得Cast 输入。
- 在代码中手动插入cast或者使用 [u]int{8,16}.
- 在均值操作的周围手动插入cast (这会涉及到length的除法,而这是一个int64类型的).
- 注意:一个新的casting机制在开发中。
答案(Solution)
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Using the GPU'
# 1. Raw results
from __future__ import print_function
import numpy
import theano
import theano.tensor as tt
from theano import sandbox, Out
theano.config.floatX = 'float32'
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = theano.shared(D[0], name="x")
y = theano.shared(D[1], name="y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1) # Cross-entropy
cost = tt.cast(xent.mean(), 'float32') + \
0.01 * (w ** 2).sum() # The cost to optimize
gw, gb = tt.grad(cost, [w, b])
"""
# Compile expressions to functions
train = theano.function(
inputs=[x, y],
outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')),borrow=True), Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[x], outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
name="predict")
"""
# Compile expressions to functions
train = theano.function(
inputs=[],
outputs=[prediction, xent],
updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
name="train")
predict = theano.function(inputs=[], outputs=prediction,
name="predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if theano used the cpu or the gpu')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
pred, err = train()
#print "Final model:"
#print w.get_value(), b.get_value()
print("target values for D")
print(D[1])
print("prediction on D")
print(predict())
"""
# 2. Profiling
# 2.1 Profiling for CPU computations
# In your terminal, type:
$ THEANO_FLAGS=profile=True,device=cpu python using_gpu_solution_1.py
# You'll see first the output of the script:
Used the cpu
target values for D
prediction on D
# Followed by the output of profiling.. You'll see profiling results for each function
# in the script, followed by a summary for all functions.
# We'll show here only the summary:
Results were produced using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz
Function profiling
==================
Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
Time in 10002 calls to Function.__call__: 1.590916e+00s
Time in Function.fn.__call__: 1.492365e+00s (93.805%)
Time in thunks: 1.408159e+00s (88.512%)
Total compile time: 6.309664e+00s
Number of Apply nodes: 25
Theano Optimizer time: 4.848340e-01s
Theano validate time: 5.454302e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
59.6% 59.6% 0.839s 4.19e-05s C 20001 3 theano.tensor.blas_c.CGemv
30.1% 89.7% 0.424s 4.71e-06s C 90001 10 theano.tensor.elemwise.Elemwise
5.5% 95.2% 0.078s 7.79e-02s Py 1 1 theano.tensor.blas.Gemv
1.9% 97.1% 0.026s 1.30e-06s C 20001 3 theano.tensor.basic.Alloc
1.3% 98.4% 0.018s 1.85e-06s C 10000 1 theano.tensor.elemwise.Sum
1.0% 99.4% 0.014s 4.78e-07s C 30001 4 theano.tensor.elemwise.DimShuffle
0.6% 100.0% 0.008s 4.23e-07s C 20001 3 theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
59.6% 59.6% 0.839s 4.19e-05s C 20001 3 CGemv{inplace}
15.8% 75.4% 0.223s 2.23e-05s C 10000 1 Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)]
7.7% 83.1% 0.109s 1.09e-05s C 10000 1 Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)]
5.5% 88.7% 0.078s 7.79e-02s Py 1 1 Gemv{no_inplace}
4.3% 92.9% 0.060s 6.00e-06s C 10000 1 Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}
1.9% 94.8% 0.026s 1.30e-06s C 20001 3 Alloc
1.3% 96.1% 0.018s 1.85e-06s C 10000 1 Sum{acc_dtype=float64}
0.7% 96.8% 0.009s 4.73e-07s C 20001 3 InplaceDimShuffle{x}
0.6% 97.4% 0.009s 8.52e-07s C 10000 1 Elemwise{sub,no_inplace}
0.6% 98.0% 0.008s 4.23e-07s C 20001 3 Shape_i{0}
0.5% 98.5% 0.007s 7.06e-07s C 10000 1 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
0.5% 98.9% 0.007s 6.57e-07s C 10000 1 Elemwise{neg,no_inplace}
0.3% 99.3% 0.005s 4.88e-07s C 10000 1 InplaceDimShuffle{1,0}
0.3% 99.5% 0.004s 3.78e-07s C 10000 1 Elemwise{inv,no_inplace}
0.2% 99.8% 0.003s 3.44e-07s C 10000 1 Elemwise{Cast{float32}}
0.2% 100.0% 0.003s 3.01e-07s C 10000 1 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
0.0% 100.0% 0.000s 8.11e-06s C 1 1 Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
31.6% 31.6% 0.445s 4.45e-05s 10000 7 CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
27.9% 59.6% 0.393s 3.93e-05s 10000 17 CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417})
15.8% 75.4% 0.223s 2.23e-05s 10000 14 Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
7.7% 83.1% 0.109s 1.09e-05s 10000 15 Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
5.5% 88.7% 0.078s 7.79e-02s 1 0 Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0})
4.3% 92.9% 0.060s 6.00e-06s 10000 13 Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
1.3% 94.2% 0.018s 1.85e-06s 10000 16 Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0)
1.0% 95.2% 0.013s 1.34e-06s 10000 5 Alloc(TensorConstant{0.0}, Shape_i{0}.0)
0.9% 96.1% 0.013s 1.27e-06s 10000 12 Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
0.6% 96.7% 0.009s 8.52e-07s 10000 4 Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
0.5% 97.2% 0.007s 7.06e-07s 10000 9 Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
0.5% 97.6% 0.007s 6.57e-07s 10000 11 Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
0.4% 98.1% 0.006s 6.27e-07s 10000 0 InplaceDimShuffle{x}(b)
0.4% 98.5% 0.006s 5.90e-07s 10000 1 Shape_i{0}(x)
0.3% 98.9% 0.005s 4.88e-07s 10000 2 InplaceDimShuffle{1,0}(x)
0.3% 99.1% 0.004s 3.78e-07s 10000 10 Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0)
0.2% 99.4% 0.003s 3.44e-07s 10000 8 Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
0.2% 99.6% 0.003s 3.19e-07s 10000 6 InplaceDimShuffle{x}(Shape_i{0}.0)
0.2% 99.8% 0.003s 3.01e-07s 10000 18 Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
0.2% 100.0% 0.003s 2.56e-07s 10000 3 Shape_i{0}(y)
... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime)
# 2.2 Profiling for GPU computations
# In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py
# You'll see first the output of the script:
Used the gpu
target values for D
prediction on D
Results were produced using a GeForce GTX TITAN
# Profiling summary for all functions:
Function profiling
==================
Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
Time in 10002 calls to Function.__call__: 3.535239e+00s
Time in Function.fn.__call__: 3.420863e+00s (96.765%)
Time in thunks: 2.865905e+00s (81.067%)
Total compile time: 4.728150e-01s
Number of Apply nodes: 36
Theano Optimizer time: 4.283385e-01s
Theano validate time: 7.687330e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
45.7% 45.7% 1.308s 1.64e-05s C 80001 9 theano.sandbox.cuda.basic_ops.GpuElemwise
17.2% 62.8% 0.492s 2.46e-05s C 20002 4 theano.sandbox.cuda.blas.GpuGemv
15.1% 77.9% 0.433s 2.17e-05s C 20001 3 theano.sandbox.cuda.basic_ops.GpuAlloc
8.2% 86.1% 0.234s 1.17e-05s C 20002 4 theano.sandbox.cuda.basic_ops.HostFromGpu
7.2% 93.3% 0.207s 2.07e-05s C 10000 1 theano.sandbox.cuda.basic_ops.GpuCAReduce
4.4% 97.7% 0.127s 1.27e-05s C 10003 4 theano.sandbox.cuda.basic_ops.GpuFromHost
0.9% 98.6% 0.025s 8.23e-07s C 30001 4 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.7% 99.3% 0.020s 9.88e-07s C 20001 3 theano.tensor.elemwise.Elemwise
0.5% 99.8% 0.014s 7.18e-07s C 20001 3 theano.compile.ops.Shape_i
0.2% 100.0% 0.006s 5.78e-07s C 10000 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
17.2% 17.2% 0.492s 2.46e-05s C 20001 3 GpuGemv{inplace}
8.2% 25.3% 0.234s 1.17e-05s C 20002 4 HostFromGpu
8.0% 33.3% 0.228s 2.28e-05s C 10001 2 GpuAlloc{memset_0=True}
7.4% 40.7% 0.211s 2.11e-05s C 10000 1 GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}
7.2% 47.9% 0.207s 2.07e-05s C 10000 1 GpuCAReduce{add}{1}
7.1% 55.0% 0.205s 2.05e-05s C 10000 1 GpuAlloc
6.9% 62.0% 0.198s 1.98e-05s C 10000 1 GpuElemwise{sub,no_inplace}
6.9% 68.9% 0.198s 1.98e-05s C 10000 1 GpuElemwise{inv,no_inplace}
6.2% 75.1% 0.178s 1.78e-05s C 10000 1 GpuElemwise{neg,no_inplace}
5.6% 80.6% 0.159s 1.59e-05s C 10000 1 GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)]
4.4% 85.1% 0.127s 1.27e-05s C 10003 4 GpuFromHost
4.3% 89.4% 0.124s 1.24e-05s C 10000 1 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
4.2% 93.6% 0.121s 1.21e-05s C 10000 1 GpuElemwise{ScalarSigmoid}[(0, 0)]
4.2% 97.7% 0.119s 1.19e-05s C 10000 1 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
0.5% 98.2% 0.014s 7.18e-07s C 20001 3 Shape_i{0}
0.5% 98.7% 0.013s 1.33e-06s C 10001 2 Elemwise{gt,no_inplace}
0.3% 99.0% 0.010s 9.81e-07s C 10000 1 GpuDimShuffle{1,0}
0.3% 99.3% 0.008s 7.90e-07s C 10000 1 GpuDimShuffle{0}
0.2% 99.6% 0.007s 6.97e-07s C 10001 2 GpuDimShuffle{x}
0.2% 99.8% 0.006s 6.50e-07s C 10000 1 Elemwise{Cast{float32}}
... (remaining 3 Ops account for 0.20%(0.01s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
8.8% 8.8% 0.251s 2.51e-05s 10000 22 GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417})
8.4% 17.2% 0.241s 2.41e-05s 10000 7 GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
8.0% 25.1% 0.228s 2.28e-05s 10000 5 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0)
7.4% 32.5% 0.211s 2.11e-05s 10000 13 GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
7.2% 39.7% 0.207s 2.07e-05s 10000 21 GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0)
7.1% 46.9% 0.205s 2.05e-05s 10000 17 GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0)
6.9% 53.8% 0.198s 1.98e-05s 10000 4 GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y)
6.9% 60.7% 0.198s 1.98e-05s 10000 12 GpuElemwise{inv,no_inplace}(GpuFromHost.0)
6.2% 66.9% 0.178s 1.78e-05s 10000 11 GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
5.6% 72.5% 0.159s 1.59e-05s 10000 19 GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0)
4.8% 77.3% 0.138s 1.38e-05s 10000 18 HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
4.4% 81.7% 0.126s 1.26e-05s 10000 10 GpuFromHost(Elemwise{Cast{float32}}.0)
4.3% 86.0% 0.124s 1.24e-05s 10000 9 GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
4.2% 90.2% 0.121s 1.21e-05s 10000 15 GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
4.2% 94.4% 0.119s 1.19e-05s 10000 23 GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0)
3.4% 97.7% 0.096s 9.61e-06s 10000 16 HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0)
0.5% 98.2% 0.013s 1.33e-06s 10000 20 Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
0.3% 98.5% 0.010s 9.81e-07s 10000 2 GpuDimShuffle{1,0}(x)
0.3% 98.8% 0.008s 8.27e-07s 10000 1 Shape_i{0}(x)
0.3% 99.1% 0.008s 7.90e-07s 10000 14 GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0)
... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime)
# 3. Conclusions
Examine and compare 'Ops' summaries for CPU and GPU. Usually GPU ops 'GpuFromHost' and 'HostFromGpu' by themselves
consume a large amount of extra time, but by making as few as possible data transfers between GPU and CPU, you can minimize their overhead.
Notice that each of the GPU ops consumes more time than its CPU counterpart. This is because the ops operate on small inputs;
if you increase the input data size (e.g. set N = 4000), you will see a gain from using the GPU.
"""
二、 GpuArray Backend
如果你还没有准备好,你需要安装 libgpuarray 和至少一个计算工具箱。可以看相关的介绍说明 libgpuarray.
如果使用OpenGL,那么所有设备的类型都支持的,对于该章节剩下的部分,不管你使用的计算设备是什么,都表示是gpu。
waring:我们想完全支持OpenCL, 在2014年5月的时候,该支持仍然是个想法而已。一些有用的ops仍然没有被支持,因为 想要在旧的后端以最小化变化来移植。
2.1 Testing Theano with GPU
为了查看是否使用的是GPU,可以将下面代码剪切然后创建个文件运行:
from theano import function, config, shared, tensor, sandbox
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, tensor.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print 'Used the cpu'
else:
print 'Used the gpu'
该程序只计算一群随机数的 exp() 。注意到我们使用 theano.shared() 函数来确保输入x存储在gpu上。
$ THEANO_FLAGS=device=cpu python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.6071999073 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
$ THEANO_FLAGS=device=cuda0 python check1.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 2.28562092781 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
2.2 返回在设备上分配数据的句柄
在默认情况下,在gpu上执行的函数仍然返回一个标准的numpy ndarray。在得到结果之前会有一个迁移操作,将数据传输会cpu上从而来确保与cpu代码的兼容。这可以让在不改变源代码的情况下只使用flag device来改变代码运行的位置。
如果不建议损失一些灵活性,可以让theano直接返回gpu对象。下面的代码就是这样:
from theano import function, config, shared, tensor, sandbox
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, tensor.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print 'Used the cpu'
else:
print 'Used the gpu'
theano.sandbox.gpuarray.basic.gpu_from_host()
输出为:
$ THEANO_FLAGS=device=cuda0 python check2.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
Looping 1000 times took 0.455810785294 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
然而每次调用的时间看上去会比之前的两个调用更少 (的确是会更少,因为这里避免了数据传输r)这里这么大的加速是因为gpu上执行的异步过程所导致的,也就是说工作并没有完成,只是“启动”了。
numpy.asarray()来转换成一个常规的ndarray 。
为了更快的速度,可以使用borrow flag,查阅: Borrowing when Constructing Function Objects.
2.3 什么能够在gpu上加速?
当然在不同设备之间,性能特性还是不太的,同样的,我们会改进我们的实现。
该backend支持所有的常规theano数据类型 (float32, float64, int, ...),然而GPU的支持是变化的,而且一些单元没法处理 double (float64)或者更小的 (小于32 位,比如 int16)数据类型。如果使用了这些单元,那么会在编译的时候或者运行的时候得到一个错误。
复杂的支持还未测试,而且大多数都不行。
通常来说,大的操作,比如矩阵乘法或者有着大量输入的逐元素操作将会明显更快的。
2.4 GPU 异步功能
默认情况下,在gpu上所有的操作都是异步的,这可以通过底层的libgpuarray来使得这些操作都是透明的。
当在设备和主机之间进行内存迁移的时候,可以通过引入同步点。当在gpu上释放活动的(活动的缓冲区就是仍然会被kernel使用的缓冲区)内存缓冲区的时候,可以引入另一个同步点。
可以通过调用它的sync()方法来对一个特定的GpuArray强制同步。这在做基准的时候可以用来得到准确的耗时计算。
强制的同步点会和中间结果的垃圾回收相关联。为了得到最快的速度,你应该通过使用theano flag allow_gc=False来禁用垃圾回收器。不过要注意这会导致内存使用提升的问题。
三、直接对gpu编程的一些软件
撇开theano这种元编程,有:
- CUDA: GPU 编程API,是NVIDIA 对C的扩展 (CUDA C)
- 特定供应商
- 成熟的数值库 (BLAS, RNG, FFT) 。
- OpenCL: CUDA的多供应商版本
- 更加的通用和标准。
- 更少的库,传播不广
- PyCUDA:对CUDA驱动接口的python绑定,允许通过python来访问 Nvidia的 CUDA 并行计算API
- 方便:使用python来更容易的进行GPU 元编程。从python中能够抽象的编译更低层的 CUDA 代码 (pycuda.driver.SourceModule).GPU 内存缓存 (pycuda.gpuarray.GPUArray).帮助文档.
- 完整性: 绑定了所有的CUDA驱动 API.
- 自动的错误检测:所有的 CUDA 错误都会自动的转到python异常。
- 速度: PyCUDA的底层是用 C++写的。
- 针对GPU对象,具有很好的内存管理:对象的清理是和对象的生命周期绑定的 (RAII, ‘Resource Acquisition Is Initialization’).使得更容易编写正确的,无漏洞的和不容易崩溃的代码。PyCUDA 会知道依赖条件 (例如,它不会在所有分配的内存释放之前对上下文进行分离)。
(查阅PyCUDA的 documentation 和 在PyCUDA上Andreas Kloeckner的 website )
- PyOpenCL: PyCUDA for OpenCL
四、学习用PyCUDA编程
如果你已经精通C了,那么你就可以很容易的通过学习来充分利用你的知识,首先用CUDA C来编写GPU,然后,使用 PyCUDA来访问 CUDA API。
下面的资源有助于你学习的过程:
- CUDA API 和CUDA C: 入门
- NVIDIA’s slides
- Stein’s (NYU) slides
- CUDA API 和 CUDA C: 高级
- MIT IAP2009 CUDA (full coverage: lectures, leading Kirk-Hwu textbook, 例子,额外的资源)
- Course U. of Illinois (full lectures, Kirk-Hwu 教科书)
- NVIDIA’s knowledge base (覆盖范围广,从入门到高级)
- practical issues ( grids, blocks 和 threads之间的关系;并在同一页还有相对应的问题)
- CUDA optimisation
- PyCUDA: 入门
- Kloeckner’s slides
- Kloeckner’ website
- PYCUDA: 高级
- PyCUDA documentation website
下面的例子是用来说明用PyCUDA来对GPU编程的一个预言。一旦你觉得完全足够了,你就可以尝试去做相对应的练习。
Example: PyCUDA
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
assert numpy.allclose(dest, a*b)
print dest
Exercise
运行之前的例子
修改并执行一个shape(20,10)的矩阵
Example: Theano + PyCUDA
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda
class PyCUDADoubleOp(theano.Op):
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash(type(self))
def __str__(self):
return self.__class__.__name__
def make_node(self, inp):
inp = cuda.basic_ops.gpu_contiguous(
cuda.basic_ops.as_cuda_ndarray_variable(inp))
assert inp.dtype == "float32"
return theano.Apply(self, [inp], [inp.type()])
def make_thunk(self, node, storage_map, _, _2):
mod = SourceModule("""
__global__ void my_fct(float * i0, float * o0, int size) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
if(i<size){
o0[i] = i0[i]*2;
}
}""")
pycuda_fct = mod.get_function("my_fct")
inputs = [storage_map[v] for v in node.inputs]
outputs = [storage_map[v] for v in node.outputs]
def thunk():
z = outputs[0]
if z[0] is None or z[0].shape != inputs[0][0].shape:
z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
block=(512, 1, 1), grid=grid)
return thunk
使用这个代码来测试:
>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))
Exercise
运行前面的例子
修改并执行两个矩阵的乘法: x * y.
修改并执行返回两个输出: x + y 和 x - y.
(注意到theano当前的逐元素优化只对涉及到单一输出的计算有用。所以,为了提供基本解决情况下的效率,需要在代码中显式的对这两个操作进行优化)。
修改然后执行来支持跨越行为(stride) (即,避免受限于输入一定是C-连续的)。
五、注意
查阅 Other Implementations 来了解如何在gpu上处理随机数
参考资料:
[1]官网:http://deeplearning.net/software/theano/tutorial/using_gpu.html