The test script as it stands is flawed, because it doesn't compare like with like.

For a fairer comparison, all functions must be run with the same set of punctuation (i.e. all ascii or all unicode).
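
For reference, here is a minimal sketch (Python 2) of how the full set of unicode punctuation can be collected; it uses the same unicodedata approach as the reworked script further down:

import sys
import unicodedata

# collect every code point whose category starts with 'P' (all unicode punctuation),
# the same way the reworked script below builds its translate table
unicode_punc = u''.join(unichr(i) for i in xrange(sys.maxunicode)
                        if unicodedata.category(unichr(i)).startswith('P'))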

Once that is done, the regex and replace methods come out much worse when run against the full set of unicode punctuation.

For full unicode, it looks like the "set" method is the best. However, if you only want to strip ascii punctuation from a unicode string, then encode, translate and decode may be fastest (depending on the length of the input string).
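
A minimal sketch of that encode/translate/decode idea (Python 2); it is essentially the same as the test_enc_trans function in the script below, and it works because ascii punctuation bytes never occur inside multi-byte UTF-8 sequences:

import string

def strip_ascii_punct(u):
    b = u.encode('utf-8')                       # work on a UTF-8 byte string
    b = b.translate(None, string.punctuation)   # fast 2-argument str.translate
    return b.decode('utf-8')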

The "replace" method can also be improved considerably by doing a containment test before each replace (depending on the exact makeup of the string).
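
A sketch of that optimisation (Python 2); it mirrors test_in_repl in the script below, and it helps because the cheap membership test skips replace calls for characters that do not occur in the string at all:

import string

def strip_punct(s, punc=string.punctuation):
    for c in punc:
        if c in s:              # only pay for replace when the character is present
            s = s.replace(c, "")
    return s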

Here are some sample results with the reworked test script:

$ python2 test.py
running ascii punctuation test...
using byte strings...
set: 0.862006902695
re: 0.17484498024
trans: 0.0207080841064
enc_trans: 0.0206489562988
repl: 0.157525062561
in_repl: 0.213351011276
$ python2 test.py a
running ascii punctuation test...
using unicode strings...
set: 0.927773952484
re: 0.18892288208
trans: 1.58275294304
enc_trans: 0.0794939994812
repl: 0.413739919662
in_repl: 0.249747991562
$ python2 test.py u
running unicode punctuation test...
using unicode strings...
set: 0.978360176086
re: 7.97941994667
trans: 1.72471117973
enc_trans: 0.0784001350403
repl: 7.05612301826
in_repl: 3.66821289062
And here is the reworked script:
# -*- coding: utf-8 -*-
import re, string, timeit
import unicodedata
import sys
#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/
s = """For me, Reddit brings to mind Obi Wan’s enduring description of the Mos
Eisley cantina: a wretched hive of scum and villainy. But, you know, one you
still kinda want to hang out in occasionally. The thing is, though, Reddit
isn’t some obscure dive bar in a remote corner of the universe—it’s a huge
watering hole at the very center of it. The site had some 400 million unique
visitors in 2012. They can’t all be Greedos. So maybe my problem is just that
I’ve never been able to find the places where the decent people hang out."""
su = u"""For me, Reddit brings to mind Obi Wan’s enduring description of the
Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one
you still kinda want to hang out in occasionally. The thing is, though,
Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a
huge watering hole at the very center of it. The site had some 400 million
unique visitors in 2012. They can’t all be Greedos. So maybe my problem is
just that I’ve never been able to find the places where the decent people
hang out."""
def test_trans(s):
    return s.translate(tbl)

def test_enc_trans(s):
    s = s.encode('utf-8').translate(None, string.punctuation)
    return s.decode('utf-8')

def test_set(s): # with list comprehension fix
    return ''.join([ch for ch in s if ch not in exclude])

def test_re(s): # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_repl(s): # From S.Lott's solution
    for c in punc:
        s = s.replace(c, "")
    return s

def test_in_repl(s): # From S.Lott's solution, with fix
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

txt = 'su'
ptn = u'[%s]'

if 'u' in sys.argv[1:]:
    print 'running unicode punctuation test...'
    print 'using unicode strings...'
    punc = u''
    tbl = {}
    for i in xrange(sys.maxunicode):
        char = unichr(i)
        if unicodedata.category(char).startswith('P'):
            tbl[i] = None
            punc += char
else:
    print 'running ascii punctuation test...'
    punc = string.punctuation
    if 'a' in sys.argv[1:]:
        print 'using unicode strings...'
        punc = punc.decode()
        tbl = {ord(ch): None for ch in punc}
    else:
        print 'using byte strings...'
        txt = 's'
        ptn = '[%s]'

        def test_trans(s):
            return s.translate(None, punc)
        test_enc_trans = test_trans

exclude = set(punc)
regex = re.compile(ptn % re.escape(punc))

def time_func(func, n=10000):
    timer = timeit.Timer(
        'func(%s)' % txt,
        'from __main__ import %s, test_%s as func' % (txt, func))
    print '%s: %s' % (func, timer.timeit(n))

print
time_func('set')
time_func('re')
time_func('trans')
time_func('enc_trans')
time_func('repl')
time_func('in_repl')