Abstract:
This article compares three ways of preparing image data for training, original images, tobytes, and gfile, on metrics such as training-file size, training time, and GPU utilization, and uses real measurements to analyze the pros and cons of each. It concludes that the gfile approach gives the best combination of disk footprint and GPU utilization. Along the way it stresses the correct use of tf.decode_raw, tf.image.decode_jpeg, tf.image.convert_image_dtype, and tf.image.resize_images, functions that beginners often trip over, in the hope of saving readers search time on these problems.
Contents
1. Introduction
1.1 Three common image-data forms in TensorFlow
Approach 1 (original images)
Approach 2 (tobytes write)
Approach 3 (gfile write)
1.2 About TFRecords
1.3 Training environment
1.4 Data
2. Data preparation
2.1 Approach 1 (original images): data preparation
2.2 Approach 2 (tobytes write): data preparation
2.3 Approach 3 (gfile write): data preparation
2.4 Summary of data preparation
3. Training evaluation
3.1 Training setup
3.2 Approach 1 (original images): data reading and training-time analysis
3.3 Approach 2 (tobytes write): data reading and training-time analysis
3.4 Approach 3 (gfile write): data reading and training-time analysis
3.5 Summary of training evaluation
4. Conclusion
1. Introduction
1.1 Three common image-data forms in TensorFlow
Approach 1 (original images)
Use cv2.imread to read the images straight into an in-memory queue, and read elements from that queue during training (common among beginners).
Note: OpenCV returns a numpy.ndarray directly, with channels in BGR order (yes, BGR), and channel values in the default range 0-255.
Approach 2 (tobytes write)
Use the tobytes() method of PIL.Image to write the pixel data into a tfrecords file, and read the tfrecords file during training (the approach most online tutorials use).
Note: tobytes() returns a string containing the pixel data, produced by the standard "raw" encoder.
Approach 3 (gfile write)
Use the read method of tf.gfile.FastGFile to read the image file and write its contents into a tfrecords file, then read the tfrecords file during training (the approach TensorFlow itself uses; see, for example, models-master\research\slim\datasets\download_and_convert_cifar10.py).
Note: read() returns the whole file as a string by default; when the file is opened with 'rb', it returns bytes.
1.2 About TFRecords
TFRecords lets you convert arbitrary data into a format TensorFlow supports, making it easier to match your datasets to the network architecture. A TFRecords file contains tf.train.Example protocol buffers, and each Example holds a Features field. The recommended workflow is: write code that fetches your data, fills it into an Example protocol buffer, serializes that buffer to a string, and writes the string to the TFRecords file through a tf.python_io.TFRecordWriter.
The TFRecords format is a good fit for image recognition: it stores the binary image data and the label data (the training class label) in the same file, and images can be converted to TFRecords in a preprocessing step before model training. Its biggest practical advantage is that each input image sits in the same file as its associated label. A TFRecords file is a binary file that does not compress its contents, so it loads into memory quickly. The format does not support random access, so it suits large streams of data but not fast sharding or other non-sequential reads.
1.3 Training environment
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
GPU: Tesla V100
MEM: 250 GB
Disk: SSD
1.4 Data
The data are photos of two different ONU models, already split into a training set (train) and a validation set (val).
Running du -h shows the directory structure:
201M ./603_663_img/train/F603
235M ./603_663_img/train/F663N
435M ./603_663_img/train
102M ./603_663_img/val/F603
87M ./603_663_img/val/F663N
188M ./603_663_img/val
623M ./603_663_img
623M .
Running ls -lR | grep "^-" | wc -l gives the number of image files:
3562
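The same file count can be reproduced in Python. A small sketch (the throwaway directory tree here is purely illustrative; the real data lives under 603_663_img):

```python
import os
import tempfile

def count_files(root):
    """Count regular files under root, recursively: the Python
    equivalent of `ls -lR | grep "^-" | wc -l`."""
    return sum(len(files) for _, _, files in os.walk(root))

# Toy usage with a throwaway tree instead of the real dataset.
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, 'train', 'F603'))
    for i in range(3):
        open(os.path.join(d, 'train', 'F603', '%d.jpg' % i), 'w').close()
    print(count_files(d))  # 3
```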
2. Data preparation
2.1 Approach 1 (original images): data preparation
The training code reads the data straight from the directory; no conversion is needed. 3562 images, 623 MB in total.
2.2 Approach 2 (tobytes write): data preparation
Write the images into tfrecords with tobytes() to create the training and validation datasets. Key code:
import os
import tensorflow as tf
from PIL import Image

tfrecords_path = '603_663_train_raw.tfrecord'
# tfrecords_path = '603_663_val_raw.tfrecord'
writer = tf.python_io.TFRecordWriter(tfrecords_path)
for line in lines:  # lines: (relative path, label) pairs
    img = Image.open(os.path.join(base_path, line[0]))
    width = img.size[0]
    height = img.size[1]
    img_raw = img.tobytes()  # convert the image to raw bytes
    example = tf.train.Example(features=tf.train.Features(feature={
        "image/class/label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(line[1])])),
        'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(width)])),
        'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(height)])),
        'image/raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw]))
    }))
    writer.write(example.SerializeToString())  # serialize to string
writer.close()
File sizes:
total 6.8G
-rw-r--r-- 1 root root 4.8G Jul 21 00:45 603_663_train_raw.tfrecord
-rw-r--r-- 1 root root 2.1G Jul 21 00:45 603_663_val_raw.tfrecord
The generated files total 6.8 GB, about 11.1 times the 623 MB of original data!
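The blowup is simple arithmetic: tobytes() stores every pixel uncompressed (3 bytes per RGB pixel), while the original files are JPEG-compressed. Working it through with the figures above:

```python
# Figures from the measurements above.
num_images = 3562
raw_total_mb = 6.8 * 1024   # tobytes tfrecords: 6.8 GB of raw pixels
jpeg_total_mb = 623         # original JPEG files: 623 MB

print(round(raw_total_mb / jpeg_total_mb, 1))    # 11.2 -- the blowup factor
print(round(raw_total_mb / num_images, 2))       # 1.95 -- MB of raw pixels per image
print(round(jpeg_total_mb * 1024 / num_images))  # 179  -- KB of JPEG per image
```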
2.3 Approach 3 (gfile write): data preparation
Read each image with tf.gfile.FastGFile's read method and write the bytes into tfrecords to create the training and validation datasets. Key code:
import os
import tensorflow as tf

tfrecords_path = '603_663_train_gfile.tfrecord'
# tfrecords_path = '603_663_val_gfile.tfrecord'
writer = tf.python_io.TFRecordWriter(tfrecords_path)
# Build the decode op once, outside the loop, so the graph does not grow per image
decode_jpeg_data = tf.placeholder(dtype=tf.string)
decode_jpeg = tf.image.decode_jpeg(decode_jpeg_data, channels=3)
with tf.Session('') as sess:
    for line in lines:  # lines: (relative path, label) pairs
        image_data = tf.gfile.FastGFile(os.path.join(base_path, line[0]), 'rb').read()
        # decode only to obtain the dimensions; the encoded JPEG bytes are what get stored
        image = sess.run(decode_jpeg, feed_dict={decode_jpeg_data: image_data})
        height, width = image.shape[0], image.shape[1]
        example = tf.train.Example(features=tf.train.Features(feature={
            "image/class/label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(line[1])])),
            'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(width)])),
            'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(height)])),
            'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_data]))
        }))
        writer.write(example.SerializeToString())  # serialize to string
writer.close()
File sizes:
total 617M
-rw-rw-r-- 1 hadoop hadoop 431M Jul 21 00:26 603_663_train_g.tfrecord
-rw-rw-r-- 1 hadoop hadoop 186M Jul 21 00:29 603_663_val_g.tfrecord
The generated files total 617 MB, even smaller than the 623 MB of original data!
Note: this mirrors TensorFlow's official data-preparation code.
2.4 Summary of data preparation
By disk footprint: Approach 3 (gfile) < Approach 1 (original images) < Approach 2 (tobytes);
Approach 3 (gfile) takes about as much space as Approach 1 (original images);
Approach 2 (tobytes) takes more than ten times the space of either of the other two.
3. Training evaluation
3.1 Training setup
Classification model: Inception_Resnet_v2, whose source lives in the TensorFlow models project at models-master\research\slim\nets\inception_resnet_v2.py.
Training: every image in the train set passes through random data augmentation during training, so every training image is different. Every 10 epochs, accuracy is measured on the val set and a corresponding pb file is exported. The training script is adapted from models-master\research\slim\train_image_classifier.py.
Measurement: each batch holds 32 images; duration is the time difference measured from just before the batch's images are read to just after the loss value's run completes.
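The duration measurement is plain wall-clock timing around the load-plus-train step. A minimal sketch of the pattern (timed_step and the dummy workload are illustrative, not part of the actual training script):

```python
import time

def timed_step(step_fn):
    """Wall-clock time from just before the work starts to just after
    it completes -- the same pattern as the `duration` measurement."""
    start_time = time.time()
    result = step_fn()
    return result, time.time() - start_time

# Dummy workload standing in for "load a batch, then run the train step".
out, duration = timed_step(lambda: sum(range(1000)))
print(out)  # 499500
```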
3.2 Approach 1 (original images): data reading and training-time analysis
Data reading: load_img loads the training images' file paths into memory:
import os
import random
import numpy as np

def load_img():
    train_path = 'train/'
    X_train = []
    y_train = []
    class_list = os.listdir(train_path)
    for c in range(len(class_list)):
        # training data
        img_list = os.listdir(train_path + class_list[c])
        # sample at most TRAIN_DATA_SAMPLE_SIZE images per class, to avoid GPU
        # out-of-memory in tensorflow and to keep the classes balanced
        sample_size = len(img_list) if len(img_list) <= TRAIN_DATA_SAMPLE_SIZE else TRAIN_DATA_SAMPLE_SIZE
        train_img_list_random = random.sample(img_list, sample_size)
        for img_name in train_img_list_random:
            img_path = train_path + class_list[c] + '/' + img_name
            # store only the path; images are loaded and augmented at training time
            X_train.append(img_path)
            y_train.append(c)
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    return X_train, y_train
At training time, load_augmentation_data reads the images from disk and applies random data augmentation:
def load_augmentation_data(X_train_array):
    X_train = []
    for img_path in X_train_array:
        img = cv2.imread(img_path)
        img = random_rotation(img)
        img = random_exposure(img)
        img = cv2.resize(img, (default_image_size, default_image_size))
        # normalize to [0, 1]
        img = img / 255.0
        X_train.append(img)
    return np.array(X_train)
Each batch's training time is measured via duration:
np.random.shuffle(train_indicies)
# make sure we iterate over the dataset once
for i in range(int(math.ceil(Xd_num / BATCH_SIZE)) - 1):
    # generate indices for the batch
    start_idx = (i * BATCH_SIZE) % Xd_num
    idx = train_indicies[start_idx:start_idx + BATCH_SIZE]
    start_time = time.time()
    X_train_batch = load_augmentation_data(X_train[idx])
    X_rs = np.reshape(X_train_batch, [BATCH_SIZE, image_size_H, image_size_W, num_channels])
    # create a feed dictionary for this batch
    feed_dict = {X: X_rs,
                 y: y_train[idx],
                 k_prob: keep_prob,
                 is_training: True}
    _, loss_value, step = sess.run([train_step, loss, global_step], feed_dict=feed_dict)
    duration = time.time() - start_time
A randomly chosen epoch (573) shows the per-batch training times:
=epoch : 573 ==========
==shuffle training data==
===epoch : 573 ,batch : 0 ,training loss : 0.000939467, duration : 6.990 ===
===epoch : 573 ,batch : 1 ,training loss : 0.00018375, duration : 5.298 ===
===epoch : 573 ,batch : 2 ,training loss : 0.00446536, duration : 3.874 ===
===epoch : 573 ,batch : 3 ,training loss : 0.00122624, duration : 6.115 ===
===epoch : 573 ,batch : 4 ,training loss : 4.27897e-05, duration : 5.215 ===
===epoch : 573 ,batch : 5 ,training loss : 0.000973509, duration : 7.537 ===
===epoch : 573 ,batch : 6 ,training loss : 0.00018321, duration : 4.449 ===
===epoch : 573 ,batch : 7 ,training loss : 0.000185303, duration : 5.799 ===
===epoch : 573 ,batch : 8 ,training loss : 0.000451507, duration : 4.623 ===
===epoch : 573 ,batch : 9 ,training loss : 0.00191352, duration : 6.497 ===
===epoch : 573 ,batch : 10 ,training loss : 0.000114542, duration : 7.248 ===
===epoch : 573 ,batch : 11 ,training loss : 0.000165993, duration : 4.750 ===
===epoch : 573 ,batch : 12 ,training loss : 0.000519277, duration : 5.391 ===
===epoch : 573 ,batch : 13 ,training loss : 0.000214111, duration : 3.950 ===
===epoch : 573 ,batch : 14 ,training loss : 0.000279504, duration : 4.913 ===
===epoch : 573 ,batch : 15 ,training loss : 0.000331382, duration : 7.457 ===
===epoch : 573 ,batch : 16 ,training loss : 0.000179745, duration : 6.576 ===
===epoch : 573 ,batch : 17 ,training loss : 0.000352278, duration : 5.640 ===
===epoch : 573 ,batch : 18 ,training loss : 0.000146118, duration : 6.106 ===
===epoch : 573 ,batch : 19 ,training loss : 0.000144974, duration : 5.775 ===
===epoch : 573 ,batch : 20 ,training loss : 0.000233854, duration : 5.554 ===
===epoch : 573 ,batch : 21 ,training loss : 0.00214951, duration : 6.736 ===
===epoch : 573 ,batch : 22 ,training loss : 0.000471303, duration : 6.563 ===
===epoch : 573 ,batch : 23 ,training loss : 0.0403984, duration : 4.057 ===
===epoch : 573 ,batch : 24 ,training loss : 0.000338226, duration : 5.766 ===
===epoch : 573 ,batch : 25 ,training loss : 0.000138452, duration : 4.761 ===
===epoch : 573 ,batch : 26 ,training loss : 0.000272711, duration : 5.982 ===
===epoch : 573 ,batch : 27 ,training loss : 8.50688e-05, duration : 6.151 ===
===epoch : 573 ,batch : 28 ,training loss : 0.00122387, duration : 4.286 ===
===epoch : 573 ,batch : 29 ,training loss : 0.000393897, duration : 6.040 ===
===epoch : 573 ,batch : 30 ,training loss : 0.0108629, duration : 4.741 ===
===epoch : 573 ,batch : 31 ,training loss : 0.000192479, duration : 4.712 ===
===epoch : 573 ,batch : 32 ,training loss : 8.31924e-05, duration : 4.378 ===
===epoch : 573 ,batch : 33 ,training loss : 0.000232057, duration : 5.263 ===
===epoch : 573 ,batch : 34 ,training loss : 0.000416539, duration : 6.058 ===
===epoch : 573 ,batch : 35 ,training loss : 0.000387605, duration : 6.199 ===
===epoch : 573 ,batch : 36 ,training loss : 0.000366599, duration : 6.499 ===
===epoch : 573 ,batch : 37 ,training loss : 0.00061386, duration : 5.950 ===
===epoch : 573 ,batch : 38 ,training loss : 0.000292077, duration : 4.811 ===
===epoch : 573 ,batch : 39 ,training loss : 0.000189406, duration : 6.279 ===
===epoch : 573 ,batch : 40 ,training loss : 0.000306261, duration : 4.495 ===
===epoch : 573 ,batch : 41 ,training loss : 0.000369673, duration : 6.078 ===
===epoch : 573 ,batch : 42 ,training loss : 0.000378717, duration : 7.124 ===
===epoch : 573 ,batch : 43 ,training loss : 0.000119711, duration : 4.739 ===
===epoch : 573 ,batch : 44 ,training loss : 0.000523766, duration : 3.970 ===
===epoch : 573 ,batch : 45 ,training loss : 0.000165045, duration : 6.820 ===
===epoch : 573 ,batch : 46 ,training loss : 0.000131299, duration : 6.152 ===
===epoch : 573 ,batch : 47 ,training loss : 0.000267241, duration : 5.596 ===
===epoch : 573 ,batch : 48 ,training loss : 0.000307097, duration : 5.661 ===
===epoch : 573 ,batch : 49 ,training loss : 7.79769e-05, duration : 6.451 ===
===epoch : 573 ,batch : 50 ,training loss : 0.000377345, duration : 3.622 ===
===epoch : 573 ,batch : 51 ,training loss : 0.000201783, duration : 5.403 ===
===epoch : 573 ,batch : 52 ,training loss : 0.000388394, duration : 4.989 ===
===epoch : 573 ,batch : 53 ,training loss : 0.00268338, duration : 8.330 ===
===epoch : 573 ,batch : 54 ,training loss : 0.000297159, duration : 3.592 ===
===epoch : 573 ,batch : 55 ,training loss : 0.00193538, duration : 5.746 ===
===epoch : 573 ,batch : 56 ,training loss : 0.000743068, duration : 5.976 ===
===epoch : 573 ,batch : 57 ,training loss : 0.000149464, duration : 5.746 ===
===epoch : 573 ,batch : 58 ,training loss : 0.000302717, duration : 6.536 ===
===epoch : 573 ,batch : 59 ,training loss : 0.000622811, duration : 7.743 ===
===epoch : 573 ,batch : 60 ,training loss : 0.00157522, duration : 6.933 ===
===epoch : 573 ,batch : 61 ,training loss : 0.00044104, duration : 6.160 ===
Training-time analysis: each batch takes 5.9 s on average, and watching nvidia-smi shows GPU utilization stuck at 0% for long stretches.
The reason: reading images from disk with OpenCV, decoding them, and augmenting them gets no GPU acceleration; the GPU is busy only while the TensorFlow ops themselves run. To speed up training, the code should be restructured to keep GPU utilization as high as possible.
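One standard fix is to overlap the CPU-side loading and augmentation with the GPU step, which is essentially what TensorFlow's input queues do in the next two approaches. A conceptual sketch in plain Python (load_batch is a hypothetical stand-in for load_augmentation_data):

```python
import queue
import threading

def prefetching_batches(load_batch, num_batches, buffer_size=4):
    """Load batches on a background thread so the training step never
    has to wait for disk I/O or augmentation (a conceptual sketch of
    what tf.train.shuffle_batch's queue runners provide)."""
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))   # runs concurrently with training
        q.put(None)                # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch

# Toy usage: "loading" is just a list lookup here.
print(list(prefetching_batches(lambda i: [i] * 2, num_batches=3)))  # [[0, 0], [1, 1], [2, 2]]
```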
3.3 Approach 2 (tobytes write): data reading and training-time analysis
Data reading: TensorFlow's input queues generate the batches automatically.
Note: see the article "tensorflow读取数据-tfrecord格式" for background.
Pay particular attention to two points:
1. Data written with the tobytes method must be decoded with the matching tf.decode_raw method;
2. To normalize with tf.image.convert_image_dtype, every input pixel must be an integer type. Mind the order relative to tf.image.resize_images: normalize first, then resize, because with some arguments resize_images returns non-integer pixel values. The author once overlooked this detail and paid for it with a great deal of debugging time.
Note: see "Tensorflow中图像处理函数(图像大小调整)" for reference.
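The pitfall in point 2 can be illustrated without TensorFlow. convert_image_dtype_like below is a NumPy mimic of the relevant behavior of tf.image.convert_image_dtype: integer input is rescaled into [0, 1], while float input is assumed to already be in [0, 1] and is passed through unscaled:

```python
import numpy as np

def convert_image_dtype_like(image, dtype=np.float32):
    """NumPy mimic of tf.image.convert_image_dtype's key behavior:
    integer inputs are scaled into [0, 1] by the input dtype's max;
    float inputs are assumed to already be in [0, 1] and are NOT rescaled."""
    if np.issubdtype(image.dtype, np.integer):
        return image.astype(dtype) / np.iinfo(image.dtype).max
    return image.astype(dtype)

pixels_u8 = np.array([0, 128, 255], dtype=np.uint8)
print(convert_image_dtype_like(pixels_u8))    # values scaled into [0, 1]

# If resize (or anything else) has already produced float pixels in 0..255,
# converting afterwards does NOT divide by 255: the values stay out of range.
pixels_f32 = pixels_u8.astype(np.float32)
print(convert_image_dtype_like(pixels_f32))   # still 0..255, NOT normalized
```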
def pre_process_img(image):
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    image = tf.image.random_brightness(image, max_delta=32. / 255)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_hue(image, max_delta=0.2)
    # tf.contrib.image.rotate expects radians, so convert from degrees
    image = tf.contrib.image.rotate(image, np.deg2rad(np.random.randint(-10, 10)))
    return image

def read_and_decode(filename, batch_size, capacity, min_after_dequeue):
    filename_queue = tf.train.string_input_producer([filename], shuffle=False)  # enqueue the file name
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)  # returns file name and contents
    features = tf.parse_single_example(serialized_example,
                                       features={
                                           'image/class/label': tf.FixedLenFeature([], tf.int64),
                                           'image/raw': tf.FixedLenFeature([], tf.string),
                                           'image/width': tf.FixedLenFeature([], tf.int64),
                                           'image/height': tf.FixedLenFeature([], tf.int64),
                                       })  # extract the feature object holding image and label
    image = tf.decode_raw(features['image/raw'], tf.uint8)  # raw pixels: must use tf.decode_raw
    width = tf.cast(features['image/width'], tf.int32)
    height = tf.cast(features['image/height'], tf.int32)
    image = tf.reshape(image, [height, width, 3])
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)  # normalize while pixels are still uint8
    image = tf.image.resize_images(image, [default_image_size, default_image_size])
    image = pre_process_img(image)
    label = tf.cast(features['image/class/label'], tf.int32)
    # assemble shuffled batches
    batch_image, batch_label = tf.train.shuffle_batch([image, label], batch_size=batch_size,
                                                      capacity=capacity, min_after_dequeue=min_after_dequeue)
    return batch_image, batch_label
The training data come from sess.run([X_train_batch, Y_train_label]) and are then fed into the network.
Each batch's training time is again measured via duration:
b_num = int(math.ceil(FLAGS.TRAIN_NUM_EPOCH / FLAGS.BATCH_SIZE)) - 1
for i in range(b_num):
    start_time = time.time()
    train_batch, train_label = sess.run([X_train_batch, Y_train_label])
    feed_dict = {X: train_batch,
                 y: train_label,
                 is_training: True}
    _, loss_value, step, acc = sess.run([train_step, loss, global_step, accuracy], feed_dict=feed_dict)
    duration = time.time() - start_time
A randomly chosen epoch (739) shows the per-batch training times:
===epoch : 739 ,batch : 0 ,training loss : 0.000458688 , accuracy : 1.000000, duration : 0.560 ===
===epoch : 739 ,batch : 1 ,training loss : 0.000252981 , accuracy : 1.000000, duration : 0.591 ===
===epoch : 739 ,batch : 2 ,training loss : 0.000421034 , accuracy : 1.000000, duration : 0.578 ===
===epoch : 739 ,batch : 3 ,training loss : 0.000621131 , accuracy : 1.000000, duration : 0.633 ===
===epoch : 739 ,batch : 4 ,training loss : 0.000471209 , accuracy : 1.000000, duration : 0.559 ===
===epoch : 739 ,batch : 5 ,training loss : 0.00109829 , accuracy : 1.000000, duration : 0.578 ===
===epoch : 739 ,batch : 6 ,training loss : 0.000799854 , accuracy : 1.000000, duration : 0.606 ===
===epoch : 739 ,batch : 7 ,training loss : 0.00094151 , accuracy : 1.000000, duration : 0.569 ===
===epoch : 739 ,batch : 8 ,training loss : 0.00357334 , accuracy : 1.000000, duration : 0.561 ===
===epoch : 739 ,batch : 9 ,training loss : 0.000717244 , accuracy : 1.000000, duration : 0.581 ===
===epoch : 739 ,batch : 10 ,training loss : 0.00162466 , accuracy : 1.000000, duration : 0.563 ===
===epoch : 739 ,batch : 11 ,training loss : 0.00216218 , accuracy : 1.000000, duration : 0.591 ===
===epoch : 739 ,batch : 12 ,training loss : 0.00305078 , accuracy : 1.000000, duration : 0.578 ===
===epoch : 739 ,batch : 13 ,training loss : 0.00485827 , accuracy : 1.000000, duration : 0.570 ===
===epoch : 739 ,batch : 14 ,training loss : 0.000763111 , accuracy : 1.000000, duration : 0.612 ===
===epoch : 739 ,batch : 15 ,training loss : 0.0012173 , accuracy : 1.000000, duration : 0.559 ===
===epoch : 739 ,batch : 16 ,training loss : 0.000912588 , accuracy : 1.000000, duration : 0.590 ===
===epoch : 739 ,batch : 17 ,training loss : 0.000284491 , accuracy : 1.000000, duration : 0.591 ===
===epoch : 739 ,batch : 18 ,training loss : 0.000173526 , accuracy : 1.000000, duration : 0.595 ===
===epoch : 739 ,batch : 19 ,training loss : 0.000246254 , accuracy : 1.000000, duration : 0.563 ===
===epoch : 739 ,batch : 20 ,training loss : 0.000206566 , accuracy : 1.000000, duration : 0.596 ===
===epoch : 739 ,batch : 21 ,training loss : 0.000116531 , accuracy : 1.000000, duration : 0.564 ===
===epoch : 739 ,batch : 22 ,training loss : 8.58229e-05 , accuracy : 1.000000, duration : 0.596 ===
===epoch : 739 ,batch : 23 ,training loss : 0.000143532 , accuracy : 1.000000, duration : 0.634 ===
===epoch : 739 ,batch : 24 ,training loss : 0.000129403 , accuracy : 1.000000, duration : 0.577 ===
===epoch : 739 ,batch : 25 ,training loss : 4.88722e-05 , accuracy : 1.000000, duration : 0.557 ===
===epoch : 739 ,batch : 26 ,training loss : 0.000311686 , accuracy : 1.000000, duration : 0.593 ===
===epoch : 739 ,batch : 27 ,training loss : 0.000107129 , accuracy : 1.000000, duration : 0.577 ===
===epoch : 739 ,batch : 28 ,training loss : 8.82945e-05 , accuracy : 1.000000, duration : 0.567 ===
===epoch : 739 ,batch : 29 ,training loss : 5.18051e-05 , accuracy : 1.000000, duration : 0.572 ===
===epoch : 739 ,batch : 30 ,training loss : 4.80934e-05 , accuracy : 1.000000, duration : 0.634 ===
===epoch : 739 ,batch : 31 ,training loss : 0.000163095 , accuracy : 1.000000, duration : 0.575 ===
===epoch : 739 ,batch : 32 ,training loss : 0.00010896 , accuracy : 1.000000, duration : 0.552 ===
===epoch : 739 ,batch : 33 ,training loss : 0.000137048 , accuracy : 1.000000, duration : 0.629 ===
===epoch : 739 ,batch : 34 ,training loss : 0.00027466 , accuracy : 1.000000, duration : 0.595 ===
===epoch : 739 ,batch : 35 ,training loss : 0.000142379 , accuracy : 1.000000, duration : 0.570 ===
===epoch : 739 ,batch : 36 ,training loss : 0.000336119 , accuracy : 1.000000, duration : 0.573 ===
===epoch : 739 ,batch : 37 ,training loss : 0.00034926 , accuracy : 1.000000, duration : 0.558 ===
===epoch : 739 ,batch : 38 ,training loss : 0.000100725 , accuracy : 1.000000, duration : 0.629 ===
===epoch : 739 ,batch : 39 ,training loss : 0.0015204 , accuracy : 1.000000, duration : 0.556 ===
===epoch : 739 ,batch : 40 ,training loss : 0.000176535 , accuracy : 1.000000, duration : 0.571 ===
===epoch : 739 ,batch : 41 ,training loss : 0.000625736 , accuracy : 1.000000, duration : 0.570 ===
===epoch : 739 ,batch : 42 ,training loss : 0.000519971 , accuracy : 1.000000, duration : 0.556 ===
===epoch : 739 ,batch : 43 ,training loss : 0.00245568 , accuracy : 1.000000, duration : 0.555 ===
===epoch : 739 ,batch : 44 ,training loss : 0.000495821 , accuracy : 1.000000, duration : 0.634 ===
===epoch : 739 ,batch : 45 ,training loss : 0.000780712 , accuracy : 1.000000, duration : 0.581 ===
===epoch : 739 ,batch : 46 ,training loss : 0.00104021 , accuracy : 1.000000, duration : 0.595 ===
===epoch : 739 ,batch : 47 ,training loss : 0.000877889 , accuracy : 1.000000, duration : 0.580 ===
===epoch : 739 ,batch : 48 ,training loss : 0.00239627 , accuracy : 1.000000, duration : 0.569 ===
===epoch : 739 ,batch : 49 ,training loss : 0.00187364 , accuracy : 1.000000, duration : 0.563 ===
===epoch : 739 ,batch : 50 ,training loss : 0.00190735 , accuracy : 1.000000, duration : 0.594 ===
===epoch : 739 ,batch : 51 ,training loss : 0.00623525 , accuracy : 1.000000, duration : 0.574 ===
===epoch : 739 ,batch : 52 ,training loss : 0.00365224 , accuracy : 1.000000, duration : 0.594 ===
===epoch : 739 ,batch : 53 ,training loss : 0.00609043 , accuracy : 1.000000, duration : 0.561 ===
===epoch : 739 ,batch : 54 ,training loss : 0.000788014 , accuracy : 1.000000, duration : 0.566 ===
===epoch : 739 ,batch : 55 ,training loss : 0.000209338 , accuracy : 1.000000, duration : 0.625 ===
===epoch : 739 ,batch : 56 ,training loss : 3.40514e-05 , accuracy : 1.000000, duration : 0.596 ===
===epoch : 739 ,batch : 57 ,training loss : 0.000287742 , accuracy : 1.000000, duration : 0.574 ===
===epoch : 739 ,batch : 58 ,training loss : 6.50318e-05 , accuracy : 1.000000, duration : 0.567 ===
===epoch : 739 ,batch : 59 ,training loss : 0.000213395 , accuracy : 1.000000, duration : 0.558 ===
===epoch : 739 ,batch : 60 ,training loss : 5.59622e-05 , accuracy : 1.000000, duration : 0.590 ===
===epoch : 739 ,batch : 61 ,training loss : 0.000133713 , accuracy : 1.000000, duration : 0.563 ===
Training-time analysis: each batch takes 0.58 s on average, and nvidia-smi shows GPU utilization holding steady at a high level.
With the GPU used effectively, training throughput rises and total training time drops.
3.4 Approach 3 (gfile write): data reading and training-time analysis
Data reading: TensorFlow's input queues generate the batches automatically.
Note: the matching decoder here is tf.image.decode_jpeg, unlike Approach 2.
def pre_process_img(image):
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    image = tf.image.random_brightness(image, max_delta=32. / 255)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_hue(image, max_delta=0.2)
    # tf.contrib.image.rotate expects radians, so convert from degrees
    image = tf.contrib.image.rotate(image, np.deg2rad(np.random.randint(-10, 10)))
    return image

def read_and_decode(filename, batch_size, capacity, min_after_dequeue):
    filename_queue = tf.train.string_input_producer([filename], shuffle=False)  # enqueue the file name
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)  # returns file name and contents
    features = tf.parse_single_example(serialized_example,
                                       features={
                                           'image/class/label': tf.FixedLenFeature([], tf.int64),
                                           'image/encoded': tf.FixedLenFeature([], tf.string),
                                           'image/width': tf.FixedLenFeature([], tf.int64),
                                           'image/height': tf.FixedLenFeature([], tf.int64),
                                       })  # extract the feature object holding image and label
    image = tf.image.decode_jpeg(features['image/encoded'])  # encoded JPEG: must use decode_jpeg
    width = tf.cast(features['image/width'], tf.int32)
    height = tf.cast(features['image/height'], tf.int32)
    image = tf.reshape(image, [height, width, 3])
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)  # normalize while pixels are still uint8
    image = tf.image.resize_images(image, [default_image_size, default_image_size])
    image = pre_process_img(image)
    label = tf.cast(features['image/class/label'], tf.int32)
    # assemble shuffled batches
    batch_image, batch_label = tf.train.shuffle_batch([image, label], batch_size=batch_size,
                                                      capacity=capacity, min_after_dequeue=min_after_dequeue)
    return batch_image, batch_label
The training script is the same as for Approach 2 and is not repeated here.
A randomly chosen epoch (200) shows the per-batch training times:
===epoch : 200 ,batch : 0 ,training loss : 9.96804e-06 , accuracy : 1.000000, duration : 0.617 ===
===epoch : 200 ,batch : 1 ,training loss : 2.79766e-06 , accuracy : 1.000000, duration : 0.652 ===
===epoch : 200 ,batch : 2 ,training loss : 7.1747e-06 , accuracy : 1.000000, duration : 0.628 ===
===epoch : 200 ,batch : 3 ,training loss : 2.34188e-05 , accuracy : 1.000000, duration : 0.671 ===
===epoch : 200 ,batch : 4 ,training loss : 3.16147e-05 , accuracy : 1.000000, duration : 0.728 ===
===epoch : 200 ,batch : 5 ,training loss : 0.000155467 , accuracy : 1.000000, duration : 0.645 ===
===epoch : 200 ,batch : 6 ,training loss : 0.000191888 , accuracy : 1.000000, duration : 0.671 ===
===epoch : 200 ,batch : 7 ,training loss : 3.72732e-05 , accuracy : 1.000000, duration : 0.599 ===
===epoch : 200 ,batch : 8 ,training loss : 7.74464e-06 , accuracy : 1.000000, duration : 0.623 ===
===epoch : 200 ,batch : 9 ,training loss : 1.76095e-05 , accuracy : 1.000000, duration : 0.610 ===
===epoch : 200 ,batch : 10 ,training loss : 8.60901e-06 , accuracy : 1.000000, duration : 0.604 ===
===epoch : 200 ,batch : 11 ,training loss : 0.000154941 , accuracy : 1.000000, duration : 0.620 ===
===epoch : 200 ,batch : 12 ,training loss : 0.00011795 , accuracy : 1.000000, duration : 0.600 ===
===epoch : 200 ,batch : 13 ,training loss : 4.72731e-06 , accuracy : 1.000000, duration : 0.647 ===
===epoch : 200 ,batch : 14 ,training loss : 6.96867e-05 , accuracy : 1.000000, duration : 0.638 ===
===epoch : 200 ,batch : 15 ,training loss : 2.29104e-06 , accuracy : 1.000000, duration : 0.665 ===
===epoch : 200 ,batch : 16 ,training loss : 3.4905e-06 , accuracy : 1.000000, duration : 0.617 ===
===epoch : 200 ,batch : 17 ,training loss : 3.55389e-06 , accuracy : 1.000000, duration : 0.696 ===
===epoch : 200 ,batch : 18 ,training loss : 6.92515e-06 , accuracy : 1.000000, duration : 0.686 ===
===epoch : 200 ,batch : 19 ,training loss : 6.75381e-06 , accuracy : 1.000000, duration : 0.645 ===
===epoch : 200 ,batch : 20 ,training loss : 2.99511e-06 , accuracy : 1.000000, duration : 0.718 ===
===epoch : 200 ,batch : 21 ,training loss : 3.66562e-06 , accuracy : 1.000000, duration : 0.596 ===
===epoch : 200 ,batch : 22 ,training loss : 8.90663e-06 , accuracy : 1.000000, duration : 0.684 ===
===epoch : 200 ,batch : 23 ,training loss : 4.35108e-06 , accuracy : 1.000000, duration : 0.635 ===
===epoch : 200 ,batch : 24 ,training loss : 1.28055e-05 , accuracy : 1.000000, duration : 0.613 ===
===epoch : 200 ,batch : 25 ,training loss : 2.93427e-05 , accuracy : 1.000000, duration : 0.686 ===
===epoch : 200 ,batch : 26 ,training loss : 2.97213e-05 , accuracy : 1.000000, duration : 0.621 ===
===epoch : 200 ,batch : 27 ,training loss : 3.35554e-05 , accuracy : 1.000000, duration : 0.627 ===
===epoch : 200 ,batch : 28 ,training loss : 8.73921e-06 , accuracy : 1.000000, duration : 0.640 ===
===epoch : 200 ,batch : 29 ,training loss : 8.87718e-06 , accuracy : 1.000000, duration : 0.633 ===
===epoch : 200 ,batch : 30 ,training loss : 1.19316e-05 , accuracy : 1.000000, duration : 0.730 ===
===epoch : 200 ,batch : 31 ,training loss : 7.80038e-06 , accuracy : 1.000000, duration : 0.639 ===
===epoch : 200 ,batch : 32 ,training loss : 3.0789e-05 , accuracy : 1.000000, duration : 0.639 ===
===epoch : 200 ,batch : 33 ,training loss : 3.27821e-06 , accuracy : 1.000000, duration : 0.635 ===
===epoch : 200 ,batch : 34 ,training loss : 3.6002e-05 , accuracy : 1.000000, duration : 0.641 ===
===epoch : 200 ,batch : 35 ,training loss : 1.35211e-05 , accuracy : 1.000000, duration : 0.635 ===
===epoch : 200 ,batch : 36 ,training loss : 5.59903e-06 , accuracy : 1.000000, duration : 0.733 ===
===epoch : 200 ,batch : 37 ,training loss : 7.75578e-06 , accuracy : 1.000000, duration : 0.649 ===
===epoch : 200 ,batch : 38 ,training loss : 5.64479e-05 , accuracy : 1.000000, duration : 0.622 ===
===epoch : 200 ,batch : 39 ,training loss : 4.32871e-06 , accuracy : 1.000000, duration : 0.664 ===
===epoch : 200 ,batch : 40 ,training loss : 1.12876e-06 , accuracy : 1.000000, duration : 0.674 ===
===epoch : 200 ,batch : 41 ,training loss : 0.000441137 , accuracy : 1.000000, duration : 0.638 ===
===epoch : 200 ,batch : 42 ,training loss : 1.77718e-05 , accuracy : 1.000000, duration : 0.634 ===
===epoch : 200 ,batch : 43 ,training loss : 2.89471e-05 , accuracy : 1.000000, duration : 0.659 ===
===epoch : 200 ,batch : 44 ,training loss : 8.2659e-06 , accuracy : 1.000000, duration : 0.641 ===
===epoch : 200 ,batch : 45 ,training loss : 1.32167e-05 , accuracy : 1.000000, duration : 0.641 ===
===epoch : 200 ,batch : 46 ,training loss : 2.93152e-05 , accuracy : 1.000000, duration : 0.704 ===
===epoch : 200 ,batch : 47 ,training loss : 1.20351e-05 , accuracy : 1.000000, duration : 0.588 ===
===epoch : 200 ,batch : 48 ,training loss : 0.000137681 , accuracy : 1.000000, duration : 0.731 ===
===epoch : 200 ,batch : 49 ,training loss : 5.31589e-06 , accuracy : 1.000000, duration : 0.638 ===
===epoch : 200 ,batch : 50 ,training loss : 2.68441e-05 , accuracy : 1.000000, duration : 0.597 ===
===epoch : 200 ,batch : 51 ,training loss : 7.99048e-06 , accuracy : 1.000000, duration : 0.590 ===
===epoch : 200 ,batch : 52 ,training loss : 3.29691e-05 , accuracy : 1.000000, duration : 0.650 ===
===epoch : 200 ,batch : 53 ,training loss : 1.87034e-05 , accuracy : 1.000000, duration : 0.681 ===
===epoch : 200 ,batch : 54 ,training loss : 1.56731e-05 , accuracy : 1.000000, duration : 0.609 ===
===epoch : 200 ,batch : 55 ,training loss : 9.93123e-06 , accuracy : 1.000000, duration : 0.720 ===
===epoch : 200 ,batch : 56 ,training loss : 1.89678e-05 , accuracy : 1.000000, duration : 0.630 ===
===epoch : 200 ,batch : 57 ,training loss : 1.54174e-05 , accuracy : 1.000000, duration : 0.625 ===
===epoch : 200 ,batch : 58 ,training loss : 7.36479e-06 , accuracy : 1.000000, duration : 0.642 ===
===epoch : 200 ,batch : 59 ,training loss : 9.15997e-06 , accuracy : 1.000000, duration : 0.705 ===
===epoch : 200 ,batch : 60 ,training loss : 1.20092e-05 , accuracy : 1.000000, duration : 0.716 ===
===epoch : 200 ,batch : 61 ,training loss : 3.4272e-06 , accuracy : 1.000000, duration : 0.684 ===
Training-time analysis: each batch takes 0.64 s on average.
3.5 Summary of training evaluation
By average training time per 32-image batch: Approach 2 (tobytes) < Approach 3 (gfile) < Approach 1 (original images);
Approach 2 (tobytes) shortens training time by about 10% relative to Approach 3 (gfile);
Approach 1 (original images) takes roughly ten times as long as either of the other two.
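The ratios behind these rankings follow directly from the measured averages reported above:

```python
# Average seconds per 32-image batch, from sections 3.2-3.4.
avg_duration = {"original": 5.9, "tobytes": 0.58, "gfile": 0.64}

print(round(avg_duration["original"] / avg_duration["tobytes"], 1))  # 10.2 -- original vs tobytes
print(round(avg_duration["original"] / avg_duration["gfile"], 1))    # 9.2  -- original vs gfile
# tobytes saves roughly 10% of the per-batch time relative to gfile:
print(round((avg_duration["gfile"] - avg_duration["tobytes"]) / avg_duration["gfile"], 2))  # 0.09
```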
4. Conclusion

| | Per-batch training time (seconds) | Total size of train + val data (MB) |
| --- | --- | --- |
| Approach 3 (gfile write, tfrecords) | 0.64 | 617 |
| Approach 2 (tobytes write, tfrecords) | 0.58 | 6963 |
| Approach 1 (original images) | 5.9 | 623 |
Overall, under comparable hardware:
Approach 3 (gfile write to tfrecords) strikes the best balance between space and speed, is the approach the TensorFlow project itself uses, and is recommended for most scenarios;
Approach 2 (tobytes write to tfrecords) suits cases where minimizing training time outweighs disk usage;
Approach 1 (original images) leaves the GPU idle most of the time and is not recommended.