Python charset detection with chardet

Reposted

mob64ca1402665b 2023-12-14 11:25:50


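Background

String encoding is one of the more frustrating parts of Python. In web development, user input is passed straight through from the front end, and unusual characters can force you to deal with Python's encoding and decoding. Python converts between str and bytes with encode() and decode(), but the awkward part is that you must first know which encoding the bytes use before you can decode them correctly. A quick search turns up the chardet library, which is built for exactly this kind of encoding puzzle.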
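To make the problem concrete, here is a minimal sketch using only the standard library. The sample text and the choice of GBK versus UTF-8 are assumptions picked for illustration; the point is that the same text maps to different bytes under different codecs, so decoding with the wrong one fails.

```python
text = "编码"  # sample text, chosen only for illustration

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# The same text produces different byte sequences under different codecs.
print(utf8_bytes)
print(gbk_bytes)

# Decoding with the matching codec round-trips the text.
assert utf8_bytes.decode("utf-8") == text
assert gbk_bytes.decode("gbk") == text

# Decoding GBK bytes as UTF-8 raises an error here, because the first
# byte is not a valid UTF-8 start byte.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(wrong_codec_failed)
```

When the bytes arrive without any declared encoding, there is nothing to tell you which of the two decode() calls is the right one, which is the gap a detector tries to fill.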

Introduction

Project page: https://github.com/chardet/chardet

Supported character sets

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR, Johab (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-5, windows-1251 (Bulgarian)
ISO-8859-1, windows-1252 (Western European languages)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)

Required version: Python 3.6+. (In practice, older chardet releases also work on Python 2.7.)

Installation

sudo pip3 install chardet

Usage

1. Command line

chardetect somefile someotherfile

Example:

chardetect get-pip.py tune.sh


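The output (a screenshot in the original post) lists the detected encoding of each of the two files together with its confidence: 99% and 100%.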

2. Python module

1) Detect the encoding with chardet.detect

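import urllib.request
import chardet

# Fetch the raw response body as bytes (Python 3: urlopen lives in urllib.request)
rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
# Detect the encoding of rawdata; returns a dict with the guess and its confidence
chardet.detect(rawdata)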
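Detection is usually only the first step; the guessed encoding is then used to decode the bytes. The helper below is a minimal sketch of that step, not part of chardet itself. The name decode_with_detection, the fallback parameter, and the injectable detect hook are all hypothetical; by default the helper calls chardet.detect and reads the 'encoding' key of the returned dict.

```python
def decode_with_detection(raw, fallback="utf-8", detect=None):
    """Decode bytes using a detected encoding (hypothetical helper)."""
    if detect is None:
        # Imported lazily so a stand-in detector can be supplied instead.
        import chardet
        detect = chardet.detect

    guess = detect(raw)
    # 'encoding' can be None when the detector has no guess (e.g. empty input).
    encoding = guess.get("encoding") or fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The detector may name a charset that Python has no codec for.
        return raw.decode(fallback, errors="replace")
```

With chardet installed, decode_with_detection(rawdata) would decode the page fetched above. Using errors="replace" is a design choice made here so that a wrong guess degrades to replacement characters instead of raising.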


2) Detect the encoding of a large file with UniversalDetector (non-greedy mode: stop as soon as the detector is confident)

import urllib.request
from chardet.universaldetector import UniversalDetector

usock = urllib.request.urlopen('http://yahoo.co.jp/')
# Create the UniversalDetector object
detector = UniversalDetector()
# Iterate over the response line by line
for line in usock:
    # Feed the current line to the detector; it updates its guess incrementally
    detector.feed(line)
    # detector.done becomes True once enough data has been fed for a confident guess
    if detector.done:
        break
# close() runs a final pass in case the detector never became confident enough
detector.close()
usock.close()
print(detector.result)

Final printed result: {'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
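The same feed / done / close loop applies to a local file read in fixed-size chunks, which avoids loading a large file into memory. This is a minimal sketch under stated assumptions: the function name detect_file_encoding, the 64 KiB default chunk size, and the optional detector argument are hypothetical; only UniversalDetector and its reset(), feed(), done, close() and result members come from chardet.

```python
def detect_file_encoding(path, chunk_size=64 * 1024, detector=None):
    """Feed a file to a detector chunk by chunk and return its result dict."""
    if detector is None:
        # Imported lazily so a stand-in detector can be supplied instead.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()

    detector.reset()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            detector.feed(chunk)
            # Stop reading early once the detector is confident.
            if detector.done:
                break
    # Finalize the guess even if the detector never reported done.
    detector.close()
    return detector.result
```

Reading in binary chunks rather than lines is a choice made for this sketch; it keeps memory use bounded even when the file has no line breaks.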

3) Detect the encoding of several large files with UniversalDetector (non-greedy mode)

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
# Iterate over every large file ending in .xml
for filename in glob.glob('*.xml'):
    print(filename.ljust(60), end=' ')
    # Reset the detector before each file
    detector.reset()
    # Open in binary mode: the detector expects bytes, not str
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    # Call close() after each file to finalize the result
    detector.close()
    print(detector.result)

That covers how to use chardet to detect character sets. Do you feel more confident about Python's charset problems now?
