在这篇文章中,我们将逐字逐句地尝试找到图片中的单词!基于最近的一篇论文进行文字检测。

EAST: An Efficient and Accurate Scene Text Detector.

应该注意,文本检测不同于文本识别。在文本检测中,我们只检测文本周围的边界框。但是,在文本识别中,我们实际上找到了框中所写的内容。例如,在下面给出的图像中,文本检测将为您提供单词周围的边界框,文本识别将告诉您该框包含单词STOP。本文只进行文本检测。

本文基于tensorflow模型,基于OpenCV调用tensorflow模型。我们将逐步讨论算法是如何工作的。您将需要OpenCV3.4.3以上版本来运行代码。其他opencv DNN模型读取也类似这样步骤。

涉及的步骤如下: 1. 下载EAST模型1. 将模型加载到内存中1. 准备输入图像1. 正向传递blob通过网络1. 处理输出 # 1 网络加载

我们将使用cv :: dnn :: readnet(C++版本)或cv2.dnn.ReadNet(python版本)函数将网络加载到内存中。它会根据指定的文件名自动检测配置和框架。在我们的例子中,它是一个pb文件,因此,它将假定要加载Tensorflow网络。和加载图像不大一样,没有模型结构描述文件。

C++

Net net = readNet(model);

Python

net = cv.dnn.readNet(model)

2 读取图像

我们需要创建一个4-D输入blob,用于将图像输送到网络。这是使用blobFromImage函数完成的。

C++

blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);

Python

blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)

我们需要为此函数指定一些参数。它们如下: 1. 第一个参数是图像本身。1. 第二个参数指定每个像素值的缩放。在这种情况下,它不是必需的。因此我们将其保持为1。1. 第三个参数是设定网络的默认输入为320×320。因此,我们需要在创建blob时指定它。最好和网络输入一致。1. 第四个参数是训练时候设定的模型均值。需要减去模型均值。1. 第五个参数是我们是否要交换R和B通道。这是必需的,因为OpenCV使用BGR格式,Tensorflow使用RGB格式,caffe模型使用BGR格式。1. 最后一个参数是我们是否要裁剪图像并采取中心裁剪。在这种情况下我们指定False。 # 3 前向传播

现在我们已准备好输入,我们将通过网络传递它。网络有两个输出。一个指定文本框的位置,另一个指定检测到的框的置信度分数。两个输出层如下:

feature_fusion/concat_3

feature_fusion/Conv_7/Sigmoid

这两个输出可以直接用netron这个软件打开pb模型,看到最后输出结果。Netron是一个模型结构可视化神器,支持tf, caffe, keras,mxnet等多种框架。Netron下载地址:

c++读取输出代码如下:

std::vector<String> outputLayers(2);

outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";

outputLayers[1] = "feature_fusion/concat_3";

python读取输出代码如下:

outputLayers = []

outputLayers.append("feature_fusion/Conv_7/Sigmoid")

outputLayers.append("feature_fusion/concat_3")

接下来,我们通过将输入图像传递到网络来获得输出。如前所述,输出由两部分组成:置信度和位置。

C++

std::vector<Mat> output;

net.setInput(blob);

net.forward(output, outputLayers);

Mat scores = output[0];

Mat geometry = output[1];

python:


net.setInput(blob)

output = net.forward(outputLayers)

scores = output[0]

geometry = output[1]

4 处理输出

如前所述,我们将使用两个层的输出并解码文本框的位置及其方向。我们可能会得到许多文本框。因此,我们需要从该批次中筛选出看起来最好的文本框。这是使用非极大值抑制算法完成的。

非极大值抑制算法在目标检测中应用很广泛,具体可以参考

1 解码

C++:

std::vector<RotatedRect> boxes;

std::vector<float> confidences;

decode(scores, geometry, confThreshold, boxes, confidences);

python:

[boxes, confidences] = decode(scores, geometry, confThreshold)

2 非极大值抑制

我们使用OpenCV函数NMSBoxes(C ++)或NMSBoxesRotated(Python)来过滤掉误报并获得最终预测。

C++:

std::vector<int> indices;

NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);

Python:

indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)

3结果和代码

3.1结果

在VS2017下运行了C++代码,其中OpenCV版本至少要3.4.5以上。不然模型读取会有问题。模型文件太大,见下载链接:

如果没有积分(系统自动设定资源分数)看看参考链接。我搬运过来的,大修改没有。

或者梯子直接下载模型:

结果如下,效果还不错,速度也还好。

https://s4.51cto.com/images/blog/202203/30233917_624479a5e28ed31362.jpg?x-oss-process=image/watermark,size_14,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=

https://s4.51cto.com/images/blog/202203/30233918_624479a6aa60c56807.jpg?x-oss-process=image/watermark,size_14,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=

https://s4.51cto.com/images/blog/202203/30233920_624479a8b988126069.jpg?x-oss-process=image/watermark,size_14,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=

3.2 代码

C++代码有所更改,python没有。对文本检测不熟悉,注释不多,但是实际代码不需要太大变化。

C++代码:

// text_detection.cpp : 此文件包含 "main" 函数。程序执行将在此处开始并结束。
//

#include "pch.h"
#include <iostream>
#include <opencv2/opencv.hpp>

using namespace std;
using namespace cv;
using namespace cv::dnn;

//解码
void decode(const Mat &scores, const Mat &geometry, float scoreThresh,
    std::vector<RotatedRect> &detections, std::vector<float> &confidences);

/**
 * @brief
 *
 * @param srcImg 检测图像
 * @param inpWidth 深度学习图像输入宽
 * @param inpHeight 深度学习图像输入高
 * @param confThreshold 置信度
 * @param nmsThreshold 非极大值抑制算法阈值
 * @param net
 * @return Mat
 */
Mat text_detect(Mat srcImg, int inpWidth, int inpHeight, float confThreshold, float nmsThreshold, Net net)
{
    //输出
    std::vector<Mat> output;
    std::vector<String> outputLayers(2);
    outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";
    outputLayers[1] = "feature_fusion/concat_3";

    //检测图像
    Mat frame, blob;
    frame = srcImg.clone();
    //获取深度学习模型的输入
    blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);
    net.setInput(blob);
    //输出结果
    net.forward(output, outputLayers);

    //置信度
    Mat scores = output[0];
    //位置参数
    Mat geometry = output[1];

    // Decode predicted bounding boxes, 对检测框进行解码,获取文本框位置方向
    //文本框位置参数
    std::vector<RotatedRect> boxes;
    //文本框置信度
    std::vector<float> confidences;
    decode(scores, geometry, confThreshold, boxes, confidences);

    // Apply non-maximum suppression procedure, 应用非极大性抑制算法
    //符合要求的文本框
    std::vector<int> indices;
    NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);

    // Render detections. 输出预测
    //缩放比例
    Point2f ratio((float)frame.cols / inpWidth, (float)frame.rows / inpHeight);
    for (size_t i = 0; i < indices.size(); ++i)
    {
        RotatedRect &box = boxes[indices[i]];

        Point2f vertices[4];
        box.points(vertices);
        //还原坐标点
        for (int j = 0; j < 4; ++j)
        {
            vertices[j].x *= ratio.x;
            vertices[j].y *= ratio.y;
        }
        //画框
        for (int j = 0; j < 4; ++j)
        {
            line(frame, vertices[j], vertices[(j + 1) % 4], Scalar(0, 255, 0), 2, LINE_AA);
        }
    }

    // Put efficiency information. 时间
    std::vector<double> layersTimes;
    double freq = getTickFrequency() / 1000;
    double t = net.getPerfProfile(layersTimes) / freq;
    std::string label = format("Inference time: %.2f ms", t);
    putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

    return frame;
}

//模型地址
auto model = "./model/frozen_east_text_detection.pb";
//检测图像
auto detect_image = "./image/patient.jpg";
//输入框尺寸
auto inpWidth = 320;
auto inpHeight = 320;
//置信度阈值
auto confThreshold = 0.5;
//非极大值抑制算法阈值
auto nmsThreshold = 0.4;

int main()
{
    //读取模型
    Net net = readNet(model);
    //读取检测图像
    Mat srcImg = imread(detect_image);
    if (!srcImg.empty())
    {
        cout << "read image success!" << endl;
    }
    Mat resultImg = text_detect(srcImg, inpWidth, inpHeight, confThreshold, nmsThreshold, net);
    imshow("result", resultImg);
    waitKey();
    return 0;
}

/**
 * @brief 输出检测到的文本框相关信息
 *
 * @param scores 置信度
 * @param geometry 位置信息
 * @param scoreThresh 置信度阈值
 * @param detections 位置
 * @param confidences 分类概率
 */
void decode(const Mat &scores, const Mat &geometry, float scoreThresh,
    std::vector<RotatedRect> &detections, std::vector<float> &confidences)
{
    detections.clear();
    //判断是不是符合提取要求
    CV_Assert(scores.dims == 4);
    CV_Assert(geometry.dims == 4);
    CV_Assert(scores.size[0] == 1);
    CV_Assert(geometry.size[0] == 1);
    CV_Assert(scores.size[1] == 1);
    CV_Assert(geometry.size[1] == 5);
    CV_Assert(scores.size[2] == geometry.size[2]);
    CV_Assert(scores.size[3] == geometry.size[3]);

    const int height = scores.size[2];
    const int width = scores.size[3];
    for (int y = 0; y < height; ++y)
    {
        //识别概率
        const float *scoresData = scores.ptr<float>(0, 0, y);
        //文本框坐标
        const float *x0_data = geometry.ptr<float>(0, 0, y);
        const float *x1_data = geometry.ptr<float>(0, 1, y);
        const float *x2_data = geometry.ptr<float>(0, 2, y);
        const float *x3_data = geometry.ptr<float>(0, 3, y);
        //文本框角度
        const float *anglesData = geometry.ptr<float>(0, 4, y);
        //遍历所有检测到的检测框
        for (int x = 0; x < width; ++x)
        {
            float score = scoresData[x];
            //低于阈值忽略该检测框
            if (score < scoreThresh)
            {
                continue;
            }

            // Decode a prediction.
            // Multiple by 4 because feature maps are 4 time less than input image.
            float offsetX = x * 4.0f, offsetY = y * 4.0f;
            //角度及相关正余弦计算
            float angle = anglesData[x];
            float cosA = std::cos(angle);
            float sinA = std::sin(angle);
            float h = x0_data[x] + x2_data[x];
            float w = x1_data[x] + x3_data[x];

            Point2f offset(offsetX + cosA * x1_data[x] + sinA * x2_data[x],
                offsetY - sinA * x1_data[x] + cosA * x2_data[x]);
            Point2f p1 = Point2f(-sinA * h, -cosA * h) + offset;
            Point2f p3 = Point2f(-cosA * w, sinA * w) + offset;
            //旋转矩形,分别输入中心点坐标,图像宽高,角度
            RotatedRect r(0.5f * (p1 + p3), Size2f(w, h), -angle * 180.0f / (float)CV_PI);
            //保存检测框
            detections.push_back(r);
            //保存检测框的置信度
            confidences.push_back(score);
        }
    }
}

Python代码:

# Import required modules
import cv2 as cv
import math
import argparse

parser = argparse.ArgumentParser(description='Use this script to run text detection deep learning networks using OpenCV.')
# Input argument
parser.add_argument('--input', help='Path to input image or video file. Skip this argument to capture frames from a camera.')
# Model argument
parser.add_argument('--model', default="./model/frozen_east_text_detection.pb",
                    help='Path to a binary .pb file of model contains trained weights.'
                    )
# Width argument
parser.add_argument('--width', type=int, default=320,
                    help='Preprocess input image by resizing to a specific width. It should be multiple by 32.'
                   )
# Height argument
parser.add_argument('--height',type=int, default=320,
                    help='Preprocess input image by resizing to a specific height. It should be multiple by 32.'
                   )
# Confidence threshold
parser.add_argument('--thr',type=float, default=0.5,
                    help='Confidence threshold.'
                   )
# Non-maximum suppression threshold
parser.add_argument('--nms',type=float, default=0.4,
                    help='Non-maximum suppression threshold.'
                   )

args = parser.parse_args()

############ Utility functions ############
def decode(scores, geometry, scoreThresh):
    detections = []
    confidences = []

    ############ CHECK DIMENSIONS AND SHAPES OF geometry AND scores ############
    assert len(scores.shape) == 4, "Incorrect dimensions of scores"
    assert len(geometry.shape) == 4, "Incorrect dimensions of geometry"
    assert scores.shape[0] == 1, "Invalid dimensions of scores"
    assert geometry.shape[0] == 1, "Invalid dimensions of geometry"
    assert scores.shape[1] == 1, "Invalid dimensions of scores"
    assert geometry.shape[1] == 5, "Invalid dimensions of geometry"
    assert scores.shape[2] == geometry.shape[2], "Invalid dimensions of scores and geometry"
    assert scores.shape[3] == geometry.shape[3], "Invalid dimensions of scores and geometry"
    height = scores.shape[2]
    width = scores.shape[3]
    for y in range(0, height):

        # Extract data from scores
        scoresData = scores[0][0][y]
        x0_data = geometry[0][0][y]
        x1_data = geometry[0][1][y]
        x2_data = geometry[0][2][y]
        x3_data = geometry[0][3][y]
        anglesData = geometry[0][4][y]
        for x in range(0, width):
            score = scoresData[x]

            # If score is lower than threshold score, move to next x
            if(score < scoreThresh):
                continue

            # Calculate offset
            offsetX = x * 4.0
            offsetY = y * 4.0
            angle = anglesData[x]

            # Calculate cos and sin of angle
            cosA = math.cos(angle)
            sinA = math.sin(angle)
            h = x0_data[x] + x2_data[x]
            w = x1_data[x] + x3_data[x]

            # Calculate offset
            offset = ([offsetX + cosA * x1_data[x] + sinA * x2_data[x], offsetY - sinA * x1_data[x] + cosA * x2_data[x]])

            # Find points for rectangle
            p1 = (-sinA * h + offset[0], -cosA * h + offset[1])
            p3 = (-cosA * w + offset[0],  sinA * w + offset[1])
            center = (0.5*(p1[0]+p3[0]), 0.5*(p1[1]+p3[1]))
            detections.append((center, (w,h), -1*angle * 180.0 / math.pi))
            confidences.append(float(score))

    # Return detections and confidences
    return [detections, confidences]

if __name__ == "__main__":
    # Read and store arguments
    confThreshold = args.thr
    nmsThreshold = args.nms
    inpWidth = args.width
    inpHeight = args.height
    model = args.model

    # Load network
    net = cv.dnn.readNet(model)

    # Create a new named window
    kWinName = "EAST: An Efficient and Accurate Scene Text Detector"
    outputLayers = []
    outputLayers.append("feature_fusion/Conv_7/Sigmoid")
    outputLayers.append("feature_fusion/concat_3")

    # Read frame
    frame = cv.imread("./image/stop1.jpg")

    # Get frame height and width
    height_ = frame.shape[0]
    width_ = frame.shape[1]
    rW = width_ / float(inpWidth)
    rH = height_ / float(inpHeight)

    # Create a 4D blob from frame.
    blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)

    # Run the model
    net.setInput(blob)
    output = net.forward(outputLayers)
    t, _ = net.getPerfProfile()
    label = 'Inference time: %.2f ms' % (t * 1000.0 / cv.getTickFrequency())

    # Get scores and geometry
    scores = output[0]
    geometry = output[1]
    [boxes, confidences] = decode(scores, geometry, confThreshold)
    # Apply NMS
    indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold,nmsThreshold)
    for i in indices:
        # get 4 corners of the rotated rect
        vertices = cv.boxPoints(boxes[i[0]])
        # scale the bounding box coordinates based on the respective ratios
        for j in range(4):
            vertices[j][0] *= rW
            vertices[j][1] *= rH
        for j in range(4):
            p1 = (vertices[j][0], vertices[j][1])
            p2 = (vertices[(j + 1) % 4][0], vertices[(j + 1) % 4][1])
            cv.line(frame, p1, p2, (0, 255, 0), 2, cv.LINE_AA);
            # cv.putText(frame, "{:.3f}".format(confidences[i[0]]), (vertices[0][0], vertices[0][1]), cv.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1, cv.LINE_AA)

    # Put efficiency information
    cv.putText(frame, label, (0, 15), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))

    # Display the frame
    cv.imshow("result",frame)
    cv.waitKey(0)