Cython: Strings

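Original

SongpingWang 2020-03-14 21:24:44 Category: Cython

© Copyright belongs to the author: an original work by SongpingWang on the 51CTO blog. Contact the author for permission to reproduce it; otherwise legal liability will be pursued.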

Contents

Passing byte strings
Accepting strings from Python code
Decoding bytes to text
Encoding text to bytes
C++ strings
Iteration

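As a general rule: unless you know what you are doing, avoid C strings where possible and use Python string objects instead. The obvious exception is passing them to or from external C code. C++ strings also store their length, so in some cases they are a suitable replacement for Python bytes objects, e.g. when reference counting is not needed within a well-defined context.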

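Unicode and passing strings
Similar to the string semantics of Python 3, Cython strictly separates byte strings and unicode strings. Above all, this means that by default there is no automatic conversion between byte strings and unicode strings (except for what Python 2 does in string operations). All encoding and decoding must pass through an explicit encoding/decoding step. To ease conversion between Python and C strings in simple cases, the module-level c_string_type and c_string_encoding directives can be used to insert these encoding/decoding steps implicitly.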
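
The strict separation can also be observed in plain Python 3, whose semantics Cython follows here. The sketch below is ordinary Python (and therefore also valid Cython source); the helper name `try_mixing` is made up for illustration and is not part of any API.

```python
def try_mixing(text, data):
    """Return True if str and bytes can be concatenated implicitly."""
    try:
        text + data
    except TypeError:
        # str and bytes are never mixed implicitly.
        return False
    return True


text = "hello"
data = text.encode("UTF-8")         # explicit encoding step: str -> bytes
round_trip = data.decode("UTF-8")   # explicit decoding step: bytes -> str
mixing_allowed = try_mixing(text, data)
```

Every crossing between the two types goes through `encode()` or `decode()`, which is exactly the step that the directives mentioned above would insert automatically.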

Passing byte strings

c_func.pyx

from libc.stdlib cimport malloc
from libc.string cimport strcpy, strlen

cdef char* hello_world = 'hello world'
cdef Py_ssize_t n = strlen(hello_world)


cdef char* c_call_returning_a_c_string():
    cdef char* c_string = <char *> malloc((n + 1) * sizeof(char))
    if not c_string:
        raise MemoryError()
    strcpy(c_string, hello_world)
    return c_string


cdef void get_a_c_string(char** c_string_ptr, Py_ssize_t *length):
    c_string_ptr[0] = <char *> malloc((n + 1) * sizeof(char))
    if not c_string_ptr[0]:
        raise MemoryError()

    strcpy(c_string_ptr[0], hello_world)
    length[0] = n

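strcpy copies a string. Remember that a C string must be terminated by \0, so space for n + 1 characters has to be allocated. One ASCII character fits in one byte, because a byte can represent 2^8 = 256 distinct values.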
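
The n + 1 rule can be checked from plain Python with the standard ctypes module. This is only an illustrative sketch: `ctypes.create_string_buffer` stands in for the malloc/strcpy pair used above, and the variable names are invented for the example.

```python
import ctypes

text = b"hello world"
# create_string_buffer() copies the bytes and appends a terminating NUL,
# so the buffer is one byte larger than the string itself.
buf = ctypes.create_string_buffer(text)

string_length = len(text)           # what strlen() would report
buffer_size = ctypes.sizeof(buf)    # bytes actually allocated
last_byte = buf.raw[-1:]            # the terminator
```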

Passing byte strings between C code and Python is easy, as follows:

from c_func cimport c_call_returning_a_c_string


cdef char* c_string = c_call_returning_a_c_string()
cdef bytes py_string = c_string            # C byte string is converted to a Python byte string
py_string = <bytes> c_string               # a type cast to a Python object does the same

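Besides not working for strings that contain null bytes, the above is also inefficient for long strings, because Cython first has to call strlen() on the C string to find its length by counting bytes up to the terminating null byte. In many cases, user code already knows the length. It is then more efficient to tell Cython the exact number of bytes by slicing the C string, as follows: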

from libc.stdlib cimport free
from c_func cimport get_a_c_string


def main():
    cdef char* c_string = NULL
    cdef Py_ssize_t length = 0

    # get pointer and length from a C function
    get_a_c_string(&c_string, &length)

    try:
        py_bytes_string = c_string[:length]  # Performs a copy of the data
    finally:
        free(c_string)

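Here, no additional byte counting is required, and length bytes of c_string are copied into the Python bytes object, including any null bytes. Keep in mind that the slice indices are assumed to be accurate in this case and no bounds checking is done, so incorrect slice indices will lead to data corruption and crashes.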
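
The difference between scanning for the terminator and copying an explicit number of bytes can be reproduced in plain Python with ctypes. This is a sketch under the assumption that `ctypes.string_at` with a size argument is a fair stand-in for the Cython slice `c_string[:length]`; the variable names are illustrative.

```python
import ctypes

payload = b"abc\x00def"                      # 7 bytes with an embedded NUL
buf = ctypes.create_string_buffer(payload)   # 8 bytes: payload + terminator

# Reading up to the first NUL, as a strlen()-based conversion does:
until_nul = buf.value

# Reading an explicit number of bytes, like c_string[:length] in Cython:
with_length = ctypes.string_at(ctypes.addressof(buf), len(payload))
```

Only the second form preserves the data after the embedded null byte.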

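Note that creating a Python bytes string can fail with an exception, e.g. because of insufficient memory. If you need to free() the string after the conversion, wrap the assignment in a try-finally construct: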

from libc.stdlib cimport free
from c_func cimport c_call_returning_a_c_string

cdef bytes py_string
cdef char* c_string = c_call_returning_a_c_string()
try:
    py_string = c_string
finally:
    free(c_string)

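To convert the byte string back into a C char*, use the opposite assignment. The resulting pointer refers to the buffer of the Python bytes object itself, so it stays valid only as long as that object is alive: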
cdef char* other_c_string = py_string

Accepting strings from Python code

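If the API only deals with byte strings, i.e. binary data or encoded text, it is best not to type the input argument as something like bytes, because that restricts the allowed input to exactly that type and excludes both subtypes and other kinds of byte containers, e.g. bytearray objects or memory views. Depending on how and where the data is processed, it may be a good idea to receive a one-dimensional memory view instead: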

def process_byte_data(unsigned char[:] data):
    length = data.shape[0]
    first_byte = data[0]
    slice_view = data[1:-1]

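The example above already shows most of the relevant functionality of one-dimensional byte views. They allow efficient processing of arrays and accept anything that can unpack itself into a byte buffer, without intermediate copying. The processed content can finally be returned in the memory view itself (or a slice of it), but it is often better to copy the data back into a flat and simple bytes or bytearray object, especially when only a small slice is returned. Since memoryviews do not copy the data, they would otherwise keep the entire original buffer alive. The general idea is to be liberal with input by accepting any kind of byte buffer, but strict with output by returning a simple, well-adapted object. This can be done as follows: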

def process_byte_data(unsigned char[:] data):
    # ... process the data, here, dummy processing.
    cdef bint return_all = (data[0] == 108)

    if return_all:
        return bytes(data)
    else:
        # example for returning a slice
        return bytes(data[5:7])
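
For comparison, the same "liberal input, strict output" idea can be sketched in plain Python with the built-in memoryview type. This is not the typed-memoryview code above, only an analogue; the check against 108 (the code point of 'l') is carried over from it.

```python
def process_byte_data(data):
    """Accept any byte buffer, return a plain bytes object."""
    view = memoryview(data)          # no copy of the underlying buffer
    return_all = (view[0] == 108)    # 108 is ord('l')

    if return_all:
        return bytes(view)           # copy everything into a new bytes object
    # Copy only a small slice so the original buffer is not kept alive.
    return bytes(view[5:7])


from_bytes = process_byte_data(b"hello world")
from_bytearray = process_byte_data(bytearray(b"lorem ipsum"))
```

Both bytes and bytearray inputs are accepted, and the caller always gets an independent bytes object back.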

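For read-only buffers, like bytes, the memoryview item type should be declared as const (see read-only views). If the byte input is actually encoded text, and the further processing should happen at the Unicode level, then the right thing to do is to decode the input straight away. This is almost exclusively a problem in Python 2.x, where Python code expects that it can pass a byte string (str) with encoded text into a text API. Since this usually happens in more than one place in the module's API, a helper function is almost always the way to go, because it allows easy adaptation of the input normalisation process later.
Such an input normalisation function commonly looks similar to the following:
to_unicode.pyx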

from cpython.version cimport PY_MAJOR_VERSION

cdef unicode _text(s):
    if type(s) is unicode:
        # Fast path for most common case(s).
        return <unicode>s

    elif PY_MAJOR_VERSION < 3 and isinstance(s, bytes):
        # Only accept byte strings as text input in Python 2.x, not in Py3.
        return (<bytes>s).decode('ascii')

    elif isinstance(s, unicode):
        # We know from the fast path above that 's' can only be a subtype here.
        # An evil cast to <unicode> might still work in some(!) cases,
        # depending on what the further processing does.  To be safe,
        # we can always create a copy instead.
        return unicode(s)
    else:
        raise TypeError("Could not convert to unicode.")

from to_unicode cimport _text

def api_func(s):
    text_input = _text(s)
    # ...
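
On Python 3 alone, the same normalisation reduces to a few lines. The following pure-Python sketch mirrors the Python 3 branches of `_text()`; the names `to_text` and `Label` are hypothetical and exist only for this demonstration.

```python
def to_text(s):
    """Normalise text input to an exact str instance (Python 3 only)."""
    if type(s) is str:
        return s                     # fast path for the common case
    if isinstance(s, str):
        return str(s)                # subclass: make an exact str copy
    # bytes are deliberately rejected, as in the Python 3 branch of _text().
    raise TypeError("Could not convert to unicode.")


class Label(str):
    """Hypothetical str subclass used only for this demonstration."""


normalised = to_text(Label("abc"))
try:
    to_text(b"abc")
    bytes_rejected = False
except TypeError:
    bytes_rejected = True
```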

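Similarly, if the further processing happens at the byte level, but Unicode string input should be accepted, then the following might work when using memory views: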

# define a global name for whatever char type is used in the module
ctypedef unsigned char char_type

cdef char_type[:] _chars(s):
    if isinstance(s, unicode):
        # encode to the specific encoding used inside of the module
        s = (<unicode>s).encode('utf8')
    return s
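
A plain-Python analogue of this helper, using the built-in memoryview instead of a typed one, might look like the sketch below. The function name `to_byte_view` is an assumption made for the example, and UTF-8 is assumed to be the module's internal encoding.

```python
def to_byte_view(s):
    """Return a byte-level memoryview for either text or byte input."""
    if isinstance(s, str):
        # Encode to the one encoding used inside the module (UTF-8 here).
        s = s.encode("utf8")
    return memoryview(s)


view_from_text = to_byte_view("h\u00e9llo")
view_from_bytes = to_byte_view(bytearray(b"abc"))
```

Text input is encoded once at the boundary, while byte buffers pass through without copying.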

Decoding bytes to text

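If your code only deals with binary data in strings, the approaches for passing and receiving C strings introduced at the beginning are sufficient. When dealing with encoded text, however, the best practice is to decode C byte strings into Python Unicode strings on reception, and to encode Python Unicode strings into C byte strings on the way out.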

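For Python byte string objects, you would normally just call the bytes.decode() method to decode them into a Unicode string:
ustring = byte_string.decode('UTF-8')

Cython allows you to do the same for a C string, as long as it contains no null bytes: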

from c_func cimport c_call_returning_a_c_string

cdef char* some_c_string = c_call_returning_a_c_string()
ustring = some_c_string.decode('UTF-8')

from c_func cimport get_a_c_string

cdef char* c_string = NULL
cdef Py_ssize_t length = 0

# get pointer and length from a C function
get_a_c_string(&c_string, &length)

ustring = c_string[:length].decode('UTF-8')    # more efficient for strings of known length

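The same approach should be used when the string contains null bytes, e.g. when it uses an encoding like UCS-4, where each character is encoded in four bytes, most of which tend to be 0.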

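Again, no bounds checking is done when slice indices are provided, so incorrect indices lead to data corruption and crashes. Negative indices are allowed, however, and cause a call to strlen() to determine the string length. Obviously, this only works for 0-terminated strings without internal null bytes. Text encoded in UTF-8 or one of the ISO-8859 encodings is usually a good candidate. If in doubt, it is better to pass indices that are "obviously" correct than to rely on the data being as expected.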
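
Why a known length matters for four-byte encodings can be seen in plain Python. The sketch uses the `utf-32-le` codec as a stand-in for UCS-4; the `split` call merely imitates what a strlen()-style scan would see and is not how Cython works internally.

```python
text = "hi"
encoded = text.encode("utf-32-le")   # 4 bytes per character, no BOM

# What a strlen()-style scan would see: everything before the first NUL.
seen_by_strlen = encoded.split(b"\x00", 1)[0]

# Decoding with the full, known length recovers the text.
decoded = encoded[:len(encoded)].decode("utf-32-le")
```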

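It is common practice to wrap string conversions (and non-trivial type conversions in general) in dedicated functions, because they need to be done in exactly the same way whenever text is received from C. This could look as follows: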

from libc.stdlib cimport free

cdef unicode tounicode(char* s):
    return s.decode('UTF-8', 'strict')

cdef unicode tounicode_with_length(char* s, size_t length):
    return s[:length].decode('UTF-8', 'strict')

cdef unicode tounicode_with_length_and_free(char* s, size_t length):
    try:
        return s[:length].decode('UTF-8', 'strict')
    finally:
        free(s)

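Most likely, you will prefer shorter function names in your code, based on the kind of string being handled. Different types of content often imply different ways of handling them on reception. To make the code more readable and to anticipate future changes, it is good practice to use separate conversion functions for different types of strings.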

Encoding text to bytes

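The reverse way, converting a Python unicode string to a C char*, is quite efficient by itself, assuming that what you actually want is a memory-managed byte string: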

py_byte_string = py_unicode_string.encode('UTF-8')
cdef char* c_string = py_byte_string
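
The encoding step can be tried out in plain Python as well. In the sketch below, `ctypes.c_char_p` is only an assumed stand-in for a C char*, and the variable names are illustrative.

```python
import ctypes

text = "h\u00e9llo"                  # 5 characters, one of them non-ASCII
encoded = text.encode("UTF-8")       # 6 bytes: the e-acute takes two bytes

# c_char_p stands in for a C char* here. In Cython the pointer would borrow
# the buffer of `encoded`, so that object must outlive the pointer.
c_string = ctypes.c_char_p(encoded)
read_back = c_string.value
```

Note that the byte length differs from the character count as soon as non-ASCII text is involved.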

C++ strings

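When wrapping a C++ library, strings usually come in the form of the std::string class. As with C strings, Python byte strings automatically coerce from and to C++ strings: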

# distutils: language = c++

from libcpp.string cimport string

def get_bytes():
    py_bytes_object = b'hello world'
    cdef string s = py_bytes_object

    s.append('abc')
    py_bytes_object = s
    return py_bytes_object

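The memory management situation is different from C, because creating a C++ string makes an independent copy of the string buffer, which the string object then owns. It is therefore possible to convert temporarily created Python objects directly into C++ strings. A common use of this is encoding a Python unicode string into a C++ string: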

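cdef string cpp_string = py_unicode_string.encode('UTF-8')

Note that this involves some overhead, because it first encodes the Unicode string into a temporarily created Python bytes object and then copies its buffer into a new C++ string.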

For the other direction, efficient decoding support is available in Cython 0.17 and later:

# distutils: language = c++

from libcpp.string cimport string

def get_ustrings():
    cdef string s = string(b'abcdefg')

    ustring1 = s.decode('UTF-8')
    ustring2 = s[2:-2].decode('UTF-8')
    return ustring1, ustring2

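For C++ strings, decoding slices always takes the proper length of the string into account and applies Python slicing semantics (e.g. returning an empty string for out-of-bounds indices).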
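
The slicing semantics referred to here are those of ordinary Python sequences. As a reminder, this plain-Python sketch applies them to a bytes object with the same content as the C++ string above; it does not involve std::string itself.

```python
s = b"abcdefg"

whole = s.decode("UTF-8")
middle = s[2:-2].decode("UTF-8")          # negative index counts from the end
out_of_bounds = s[10:20].decode("UTF-8")  # no error, just an empty string
```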

Iteration

Efficient iteration over char*, bytes and unicode strings is supported:

cdef char* c_string = "Hello to A C-string's world"

cdef char c
for c in c_string[:11]:
    if c == 'A':
        print("Found the letter A")

=======================================================
cdef bytes bytes_string = b"hello to A bytes' world"

cdef char c
for c in bytes_string:
    if c == 'A':
        print("Found the letter A")

=======================================================
# For unicode objects, Cython automatically infers the type of the loop variable as Py_UCS4:
cdef unicode ustring = u'Hello world'

# NOTE: no typing required for 'uchar' !
for uchar in ustring:
    if uchar == u'A':
        print("Found the letter A")

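The automatic type inference usually leads to much more efficient code. Note, however, that some unicode operations still require the value to be a Python object, so Cython may end up generating redundant conversion code for the loop variable inside the loop. If this degrades performance for a specific piece of code, you can either type the loop variable explicitly as a Python object, or assign its value to a Python-typed variable somewhere inside the loop, which enforces a one-time coercion before Python operations are run on it.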

There are also optimisations for in tests, so that the following code runs as plain C code (actually using a switch statement):

cpdef void is_in(Py_UCS4 uchar_val):
    if uchar_val in u'abcABCxY':
        print("The character is in the string.")
    else:
        print("The character is not in the string")

Combined with the looping optimisation above, this can result in very efficient character switching code, e.g. in unicode parsers.
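
As a small illustration of the loop-plus-membership pattern, the following plain-Python sketch counts characters from a fixed alphabet. The function `count_matches` is hypothetical; because it takes the alphabet as a parameter rather than a literal, it shows the shape of such code rather than the exact case Cython turns into a switch statement.

```python
def count_matches(text, alphabet="abcABCxY"):
    """Count the characters of `text` that occur in `alphabet`."""
    count = 0
    for ch in text:            # one character per iteration
        if ch in alphabet:     # membership test, as in is_in()
            count += 1
    return count


matches = count_matches("A box of Yarn")
```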

from cpython cimport array
 
def cfuncA():
    cdef str a
    cdef int i,j
    for j in range(1000):
        a = ''.join([chr(i) for i in range(127)])
 
def cfuncB():
    cdef:
        str a
        array.array[char] arr,template = array.array('c')
        int i,j
 
    for j in range(1000):
        arr = array.clone(template,127,False)
 
        for i in range(127):
            arr[i] = i
 
        a = arr.tostring()

>>> python2 -m timeit -s "import pyximport; pyximport.install(); import cyytn" "cyytn.cfuncA()"
100 loops,best of 3: 14.3 msec per loop
 
>>> python2 -m timeit -s "import pyximport; pyximport.install(); import cyytn" "cyytn.cfuncB()"
1000 loops,best of 3: 512

Reference: https://cython.readthedocs.io/en/latest/src/tutorial/strings.html