python cookbook3第一章
- 序列中出现次数最多的元素
- 通过某个关键字排序一个字典列表
- 排序不支持原生比较的对象
- 通过某个字段将记录分组
- 过滤序列元素
- 从字典中提取子集
序列中出现次数最多的元素
标准答案应该是 collections.Counter
类,它甚至有一个有用的 most_common()
方法直接给了你答案。
words = [
'look', 'into', 'my', 'eyes', 'look', 'into', 'my', 'eyes',
'the', 'eyes', 'the', 'eyes', 'the', 'eyes', 'not', 'around', 'the',
'eyes', "don't", 'look', 'around', 'the', 'eyes', 'look', 'into',
'my', 'eyes', "you're", 'under'
]
from collections import Counter
word_counts = Counter(words)
# 出现频率最高的3个单词
top_three = word_counts.most_common(3)
print(top_three)
# Outputs [('eyes', 8), ('the', 5), ('look', 4)]
作为输入, Counter
对象可以接受任意的由可哈希(hashable
)元素构成的序列对象。 在底层实现上,一个 Counter
对象就是一个字典,将元素映射到它出现的次数上。下面随便贴一点源码:
def _count_elements(mapping, iterable):
'Tally elements from the iterable.'
mapping_get = mapping.get
for elem in iterable:
mapping[elem] = mapping_get(elem, 0) + 1
def update(self, iterable=None, /, **kwds):
'''Like dict.update() but add counts instead of replacing them.
Source can be an iterable, a dictionary, or another Counter instance.
>>> c = Counter('which')
>>> c.update('witch') # add elements from another iterable
>>> d = Counter('watch')
>>> c.update(d) # add elements from another counter
>>> c['h'] # four 'h' in which, witch, and watch
4
'''
# The regular dict.update() operation makes no sense here because the
# replace behavior results in the some of original untouched counts
# being mixed-in with all of the other counts for a mismash that
# doesn't have a straight-forward interpretation in most counting
# contexts. Instead, we implement straight-addition. Both the inputs
# and outputs are allowed to contain zero and negative counts.
if iterable is not None:
if isinstance(iterable, _collections_abc.Mapping):
if self:
self_get = self.get
for elem, count in iterable.items():
self[elem] = count + self_get(elem, 0)
else:
super(Counter, self).update(iterable) # fast path when counter is empty
else:
_count_elements(self, iterable)
if kwds:
self.update(kwds)
update
函数可以更新(增加个数)Counter
对象,这里参数iterable
可以是list
或者dict
,同时相应地定义了subtract
函数做减少的操作。此外也定义了__add__
和__sub__
方法使得Counter
可以直接加减(Counter('abbb') + Counter('bcc')
),定义了__or__
和__and__
方式使得可以直接做集合操作(Counter('abbb') | Counter('bcc')
),还有很多方法就不一一讲了,源码拿去。
通过某个关键字排序一个字典列表
排序不支持原生比较的对象
分别使用了operator
模块里的itemgetter
和attrgetter
函数,拿到排序需要的信息,传递给sorted
函数的关键字参数key
,比如:
rows = [
{'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
{'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
{'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
{'fname': 'Big', 'lname': 'Jones', 'uid': 1004}
]
from operator import itemgetter
rows_by_fname = sorted(rows, key=itemgetter('fname'))
print(rows_by_fname)
class User:
def __init__(self, user_id):
self.user_id = user_id
def __repr__(self):
return 'User({})'.format(self.user_id)
from operator import attrgetter
users = [User(23), User(3), User(99)]
rows_by_id = sorted(users, key=attrgetter('user_id'))
print(rows_by_id)
当然这两个函数还能同时允许多个字段进行比较,那时会返回包含多个目标元素值的元祖,并根据顺序比较然后排序。如果你不是很在乎运行速度的话,也能自己写个匿名函数lambda
来代替。
源码里的一些魔法方法的实现:
class attrgetter:
"""
Return a callable object that fetches the given attribute(s) from its operand.
After f = attrgetter('name'), the call f(r) returns r.name.
After g = attrgetter('name', 'date'), the call g(r) returns (r.name, r.date).
After h = attrgetter('name.first', 'name.last'), the call h(r) returns
(r.name.first, r.name.last).
"""
__slots__ = ('_attrs', '_call')
def __init__(self, attr, *attrs):
if not attrs:
if not isinstance(attr, str):
raise TypeError('attribute name must be a string')
self._attrs = (attr,)
names = attr.split('.')
def func(obj):
for name in names:
obj = getattr(obj, name)
return obj
self._call = func
else:
self._attrs = (attr,) + attrs
getters = tuple(map(attrgetter, self._attrs))
def func(obj):
return tuple(getter(obj) for getter in getters)
self._call = func
def __call__(self, obj):
return self._call(obj)
def __repr__(self):
return '%s.%s(%s)' % (self.__class__.__module__,
self.__class__.__qualname__,
', '.join(map(repr, self._attrs)))
def __reduce__(self):
return self.__class__, self._attrs
class itemgetter:
"""
Return a callable object that fetches the given item(s) from its operand.
After f = itemgetter(2), the call f(r) returns r[2].
After g = itemgetter(2, 5, 3), the call g(r) returns (r[2], r[5], r[3])
"""
__slots__ = ('_items', '_call')
def __init__(self, item, *items):
if not items:
self._items = (item,)
def func(obj):
return obj[item]
self._call = func
else:
self._items = items = (item,) + items
def func(obj):
return tuple(obj[i] for i in items)
self._call = func
def __call__(self, obj):
return self._call(obj)
def __repr__(self):
return '%s.%s(%s)' % (self.__class__.__module__,
self.__class__.__name__,
', '.join(map(repr, self._items)))
def __reduce__(self):
return self.__class__, self._items
通过某个字段将记录分组
如果你需要将这样的数据进行分组处理:
[(1, 'ame'), (2, 'xm'), (1, 'paparize'), (2, 'ori'), (3 ,'yang'), (1, 'sccc'), (3, 'old eleven')]
处理之后我可能需要得到这样的结果:
[(1, ['ame', 'paparize', 'sccc']), (2, ['xm', 'ori']), (3, ['yang', 'old eleven'])]
最实用的方法是itertools.groupby()
函数:
from itertools import groupby
from operator import itemgetter
x = [(1, 'ame'), (2, 'xm'), (1, 'paparize'), (2, 'ori'), (3 ,'yang'), (1, 'sccc'), (3, 'old eleven')]
soooo = sorted(x, key=itemgetter(0))
p = groupby(soooo, key=itemgetter(0))
for i,l in p:
print (i, [_[1] for _ in l])
在这里,如果你想按key
字段分组,那么你首先得按该字段排序,否则达不到效果,因为源代码里是直接用next(iter(iterable))
来遍历传入的可迭代容器的。返回值要注意,第一个是分组的key
,第二个是一个包含所有分到该组的元素的迭代器。
具体源码在这:
https://docs.python.org/3/library/itertools.html#itertools.groupby
过滤序列元素
mylist = [1, 4, -5, 10, -7, 2, 3, -1]
print([n for n in mylist if n > 0]) # output [1, 4, 10, 2, 3]
print(list(n for n in mylist if n > 0))
不考虑内存占用的话就直接用列表推导式,否则就用生成器表达式。
过滤条件复杂的时候就需要用上filter()
,将过滤条件写成一个函数,注意返回也是一个迭代器:
values = ['1', '2', '-3', '-', '4', 'N/A', '5']
def is_int(val):
try:
x = int(val)
return True
except ValueError:
return False
ivals = list(filter(is_int, values))
print(ivals)
# Outputs ['1', '2', '-3', '4', '5']
从字典中提取子集
和上一小节差不多,用字典推导式即可:
prices = {
'yyf': 'Ti2',
'Mu': 'Ti4',
'Hao': 'Ti4',
'blink': 'Ti6',
'iceice': 'Ti6'
}
print({key: value for key, value in prices.items() if value == 'Ti4'})