This document collects some data-processing idioms that are commonly used in Python.


Python : Util common data-processing methods

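  • Iterating over all edges from a polygon's vertex list
  • Transposing a 2D array (swapping row and column indices)
  • Flattening a two-level list into one level
  • Concatenating with sum
  • The reduce function
  • List comprehension
  • The itertools library
  • Performance comparison
  • Splitting a sequence into groups of N
  • Removing duplicates from a multi-dimensional array
  • Intersection, union, and difference of 1D arrays
  • Difference of 2D arrays
  • Merging dicts with matching keys into one object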


Iterating over all edges from a polygon's vertex list

# The vertex list is closed: the first vertex is repeated at the end.
vertex_lst = [[0, 0], [1, 1], [1.5, 0.5], [1, 2], [0.6, 0.8], [0, 2], [0, 0]]
print("vertex_lst[:-1] : ", vertex_lst[:-1])
print("vertex_lst[1:] : ", vertex_lst[1:])
print("zip : ", list(zip(vertex_lst[:-1], vertex_lst[1:])), "\n")

# Pair every vertex with its successor to get the edges.
for spoi, epoi in zip(vertex_lst[:-1], vertex_lst[1:]):
    print("start:", spoi, "end:", epoi)

Output:

vertex_lst[:-1] :  [[0, 0], [1, 1], [1.5, 0.5], [1, 2], [0.6, 0.8], [0, 2]]
vertex_lst[1:] :  [[1, 1], [1.5, 0.5], [1, 2], [0.6, 0.8], [0, 2], [0, 0]]
zip :  [([0, 0], [1, 1]), ([1, 1], [1.5, 0.5]), ([1.5, 0.5], [1, 2]), ([1, 2], [0.6, 0.8]), ([0.6, 0.8], [0, 2]), ([0, 2], [0, 0])] 

start: [0, 0] end: [1, 1]
start: [1, 1] end: [1.5, 0.5]
start: [1.5, 0.5] end: [1, 2]
start: [1, 2] end: [0.6, 0.8]
start: [0.6, 0.8] end: [0, 2]
start: [0, 2] end: [0, 0]
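The slicing approach in this section relies on the vertex list being closed, that is, on the first vertex being repeated at the end. As a minimal sketch for the other case, the hypothetical helper iter_edges below (not part of the original notes) adds the closing edge itself when the last vertex differs from the first.

```python
def iter_edges(vertices):
    """Yield (start, end) pairs for every edge of a polygon.

    Hypothetical helper: accepts both closed lists (first vertex repeated
    at the end) and open lists (closing edge added automatically).
    """
    if len(vertices) < 2:
        return
    ring = list(vertices)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # add the closing edge back to the start
    yield from zip(ring[:-1], ring[1:])


open_square = [[0, 0], [1, 0], [1, 1], [0, 1]]
edges = list(iter_edges(open_square))
for start, end in edges:
    print("start:", start, "end:", end)
```

Copying the input into ring keeps the caller's list unchanged, which seemed the safer default for a utility function.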

Transposing a 2D array (swapping row and column indices)

test = [[1, 5, 3], [5, 4, 7], [6, 3, 7]]
print(list(zip(*test)))

Output:

[(1, 5, 6), (5, 4, 3), (3, 7, 7)]
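Two details of zip(*test) are easy to miss: it returns tuples rather than lists, and it stops at the shortest row. The sketch below converts the columns back to lists and uses itertools.zip_longest for rows of unequal length; the names transposed, ragged, and padded, and the fill value None, are illustrative choices.

```python
from itertools import zip_longest

test = [[1, 5, 3], [5, 4, 7], [6, 3, 7]]

# Convert each column tuple back into a list.
transposed = [list(col) for col in zip(*test)]
print(transposed)

# For rows of unequal length, zip() would silently drop the extra items;
# zip_longest pads the short rows instead (here with None).
ragged = [[1, 2, 3], [4, 5], [6]]
padded = [list(col) for col in zip_longest(*ragged, fillvalue=None)]
print(padded)
```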

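Flattening a two-level list into one level

This section shows several ways to flatten a list of lists into a single flat list in Python, and then compares their performance. For example:

# Given input
nested = [[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]

# Expected output
flat = [['A', 1], ['B', 2], ['C', 3], ['D', 4]]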

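Concatenating with sum

This looks very concise, but it has the same performance trap as repeated string concatenation: every addition builds a new list, so the total cost grows quadratically with the number of sublists. The performance comparison later in this document shows the effect.

nested = [[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]
new = sum(nested, [])
print(nested)
print(new)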

Output:

[[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]
[['A', 1], ['B', 2], ['C', 3], ['D', 4]]

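The reduce function

reduce folds the sequence cumulatively, here with list.__add__. It has the same repeated-concatenation performance trap.

from functools import reduce

nested = [[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]
new = reduce(list.__add__, nested)
print(nested)
print(new)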

Output:

[[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]
[['A', 1], ['B', 2], ['C', 3], ['D', 4]]
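One way to keep the reduce style without the repeated copying is to fold with an in-place concatenation. The sketch below uses operator.iconcat (the function form of +=) with a fresh empty list as the initial value, so every step extends the same accumulator instead of building a new list. This is an alternative, not one of the variants timed in this document.

```python
from functools import reduce
from operator import iconcat

nested = [[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]

# The initial [] is the accumulator; iconcat extends it in place,
# so the sublists in `nested` are left untouched.
new = reduce(iconcat, nested, [])
print(nested)
print(new)
```

Passing the initial [] matters: without it, the first sublist would become the accumulator and would be modified in place.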

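List comprehension

The list comprehension looks a little long and needs two for clauses. Squeezed onto one line, it takes more effort to read and is less intuitive.

nested = [[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]
new = [item for sublist in nested for item in sublist]
print(nested)
print(new)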

Output:

[[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]
[['A', 1], ['B', 2], ['C', 3], ['D', 4]]

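The itertools library

This version uses itertools, which is part of the Python standard library. Compared with the other implementations, it looks fairly elegant.

import itertools

nested = [[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]
new = list(itertools.chain(*nested))
print(nested)
print(new)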

Output:

[[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]
[['A', 1], ['B', 2], ['C', 3], ['D', 4]]
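Calling itertools.chain with a starred argument unpacks the outer list into positional arguments first. A sketch of the closely related itertools.chain.from_iterable, which takes the outer iterable directly and therefore also works with generators (the squares example is purely illustrative):

```python
from itertools import chain

nested = [[['A', 1], ['B', 2]], [['C', 3], ['D', 4]]]

# from_iterable consumes the outer iterable lazily; no argument unpacking.
new = list(chain.from_iterable(nested))
print(new)

# It also accepts a generator of sublists.
squares = list(chain.from_iterable([n, n * n] for n in range(3)))
print(squares)
```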

Performance comparison

(eln35) [eln@localhost waitDoc]$ python -mtimeit -s'l = [[1, 2, 3], [4, 5, 6], [7], [8, 9]] * 99; from functools import reduce' 'reduce(list.__add__,l)'
1000 loops, best of 3: 287 usec per loop

(eln35) [eln@localhost waitDoc]$ python -mtimeit -s'l = [[1, 2, 3], [4, 5, 6], [7], [8, 9]] * 99' 'sum(l, [])'
1000 loops, best of 3: 261 usec per loop

(eln35) [eln@localhost waitDoc]$ python -mtimeit -s'l = [[1, 2, 3], [4, 5, 6], [7], [8, 9]] * 99' '[item for sublist in l for item in sublist]'
10000 loops, best of 3: 27 usec per loop

(eln35) [eln@localhost waitDoc]$ python -mtimeit -s'l = [[1, 2, 3], [4, 5, 6], [7], [8, 9]] * 99; import itertools;' 'list(itertools.chain(*l))'
100000 loops, best of 3: 15.7 usec per loop
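The same comparison can also be run from a script with the timeit module. The sketch below is one way to do it; the call count of 1000 and the candidates table are arbitrary choices, and the absolute numbers depend on the machine and the Python version, so no expected output is shown.

```python
import itertools
import timeit
from functools import reduce

data = [[1, 2, 3], [4, 5, 6], [7], [8, 9]] * 99

# One zero-argument callable per flattening method.
candidates = {
    "reduce": lambda: reduce(list.__add__, data),
    "sum": lambda: sum(data, []),
    "comprehension": lambda: [item for sublist in data for item in sublist],
    "itertools.chain": lambda: list(itertools.chain(*data)),
}

expected = candidates["comprehension"]()
results = {}
for name, func in candidates.items():
    assert func() == expected  # every method must produce the same list
    seconds = timeit.timeit(func, number=1000)
    results[name] = seconds / 1000 * 1e6  # microseconds per call
    print("{:<16} {:8.1f} usec per loop".format(name, results[name]))
```

Checking that all candidates return the same list before timing them guards against comparing methods that do different work.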

Splitting a sequence into groups of N

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
step = 3
b = [a[i:i+step] for i in range(0, len(a), step)]
print(b)

Output:

[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11]]
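Slicing needs a sequence with a known length. For arbitrary iterables such as generators or file objects, a sketch based on itertools.islice is shown below; the helper name chunked is hypothetical.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items; the last one may be shorter."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


b = list(chunked(range(1, 12), 3))
print(b)
```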

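Removing duplicates from a multi-dimensional array

Idea: first convert every element of the list to a tuple, because a list is not hashable but a tuple is; then deduplicate with a set. A set is unordered, so the order of the result is not guaranteed.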

import numpy as np

list1 = [[1, 2], [3, 4], [5, 6], [3, 4], [7, 8], [1, 2]]
list_unique = list(set([tuple(t) for t in list1]))
print(list1)
print(list_unique)
print(np.array(list_unique).tolist())

Output (the order of the deduplicated rows may differ, because a set is unordered):

[[1, 2], [3, 4], [5, 6], [3, 4], [7, 8], [1, 2]]
[(1, 2), (3, 4), (5, 6), (7, 8)]
[[1, 2], [3, 4], [5, 6], [7, 8]]
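When the first-seen order of the rows matters, a set alone is not enough. A minimal sketch that keeps the order by remembering the tuples already seen is given below; dedupe_rows is a hypothetical helper name, and numpy is not needed to turn the rows back into lists.

```python
def dedupe_rows(rows):
    """Return rows without duplicates, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, tuples are hashable
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    return unique


list1 = [[1, 2], [3, 4], [5, 6], [3, 4], [7, 8], [1, 2]]
list_unique = dedupe_rows(list1)
print(list_unique)
```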

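Intersection, union, and difference of 1D arrays

Idea: either use a list comprehension, which is generally faster than an explicit loop, or convert the lists to sets and use the set methods. The comprehension keeps the original order but tests membership against a list, which costs linear time per element; the set methods scale better on large inputs but do not guarantee any order.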

listA = [1, 2, 3, 4, 5]
listB = [3, 4, 5, 6, 7]

# Two ways to compute the intersection
retA = [i for i in listA if i in listB]
retB = list(set(listA).intersection(set(listB)))
print("\nIntersection, two ways:")
print("retA is: ", retA)
print("retB is: ", retB)

# Union
retC = list(set(listA).union(set(listB)))
print("\nUnion:")
print("retC is: ", retC)

# Difference: elements in B but not in A, two ways
retD = list(set(listB).difference(set(listA)))
retE = [i for i in listB if i not in listA]
print("\nDifference (in B but not in A), two ways:")
print("retD is: ", retD)
print("retE is: ", retE)

Output:


Intersection, two ways:
retA is:  [3, 4, 5]
retB is:  [3, 4, 5]

Union:
retC is:  [1, 2, 3, 4, 5, 6, 7]

Difference (in B but not in A), two ways:
retD is:  [6, 7]
retE is:  [6, 7]
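The set methods also have operator forms, and the order-preserving comprehension can be made faster by testing membership against a set instead of a list. A sketch, with setA, setB, and in_b_not_a as illustrative names:

```python
listA = [1, 2, 3, 4, 5]
listB = [3, 4, 5, 6, 7]
setA, setB = set(listA), set(listB)

# Operator forms; sorted() gives a deterministic order for printing.
print("intersection:", sorted(setA & setB))
print("union:", sorted(setA | setB))
print("B - A:", sorted(setB - setA))
print("symmetric difference:", sorted(setA ^ setB))

# Keeps the order of listB, with constant-time membership tests.
in_b_not_a = [i for i in listB if i not in setA]
print("B - A (ordered):", in_b_not_a)
```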

Difference of 2D arrays

import numpy as np

a1 = np.asarray([[1, 2, 3], [3, 4, 5], [4, 5, 6]])
a2 = np.asarray([[1, 2, 3], [3, 2, 5]])

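# View each row as one structured element so that rows are compared as a whole.
a1_rows = a1.view([('', a1.dtype)] * a1.shape[1])
a2_rows = a2.view([('', a2.dtype)] * a2.shape[1])

# Rows only in a1, rows only in a2, then both stacked (the symmetric difference).
diff1 = np.setdiff1d(a1_rows, a2_rows).view(a1.dtype).reshape(-1, a1.shape[1])
diff2 = np.setdiff1d(a2_rows, a1_rows).view(a2.dtype).reshape(-1, a2.shape[1])
diff = np.vstack((diff1, diff2))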

print("a1 : ")
print(a1)
print("a2 : ")
print(a2)

print("\na1_rows : ")
print(a1_rows)
print("a2_rows : ")
print(a2_rows)

print("\ndiff1 : ")
print(diff1)
print("diff2 : ")
print(diff2)

print("\ndiff : ")
print(diff)

Output:

a1 :
[[1 2 3]
 [3 4 5]
 [4 5 6]]
a2 :
[[1 2 3]
 [3 2 5]]

a1_rows :
[[(1, 2, 3)]
 [(3, 4, 5)]
 [(4, 5, 6)]]
a2_rows :
[[(1, 2, 3)]
 [(3, 2, 5)]]

diff1 :
[[3 4 5]
 [4 5 6]]
diff2 :
[[3 2 5]]

diff :
[[3 4 5]
 [4 5 6]
 [3 2 5]]

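For small inputs, the same row-wise comparison can be sketched without structured views by hashing each row as a tuple. This is an alternative formulation, not the numpy approach used in this section: unlike np.setdiff1d, it keeps the original row order and any repeated rows instead of returning sorted unique rows. The helper name rows_only_in is hypothetical.

```python
def rows_only_in(first, second):
    """Return the rows of `first` that do not appear in `second`."""
    exclude = {tuple(row) for row in second}
    return [list(row) for row in first if tuple(row) not in exclude]


a1 = [[1, 2, 3], [3, 4, 5], [4, 5, 6]]
a2 = [[1, 2, 3], [3, 2, 5]]

diff1 = rows_only_in(a1, a2)  # rows only in a1
diff2 = rows_only_in(a2, a1)  # rows only in a2
diff = diff1 + diff2          # both directions combined
print(diff1)
print(diff2)
print(diff)
```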

Merging dicts with matching keys into one object

# Collect the values of matching keys into one dict of lists
tmp = {'a': [5], 'b': [56], 'c': [6]}
objs = {'a': '15', 'b': '156', 'd': '456'}
for k, v in objs.items():
    # setdefault creates an empty list for keys that tmp does not have yet
    tmp.setdefault(k, []).append(v)
print(tmp)  # {'a': [5, '15'], 'b': [56, '156'], 'c': [6], 'd': ['456']}
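When several dicts have to be merged at once, the same grouping can be sketched with collections.defaultdict. The helper name merge_dicts and the dict name first are hypothetical, and, unlike the setdefault loop in this section, the helper starts from plain values and wraps every value in a list rather than extending lists that already exist.

```python
from collections import defaultdict


def merge_dicts(*dicts):
    """Collect the values of each key across all dicts into one list."""
    merged = defaultdict(list)
    for d in dicts:
        for key, value in d.items():
            merged[key].append(value)
    return dict(merged)


first = {'a': 5, 'b': 56, 'c': 6}
objs = {'a': '15', 'b': '156', 'd': '456'}
result = merge_dicts(first, objs)
print(result)
```

Returning a plain dict at the end avoids surprising callers with the defaultdict behaviour of creating keys on lookup.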