Python: writing a multi-column file with multiple threads, and operating on a list from multiple threads

Reposted

mob64ca13f6bbea 2023-09-05 08:54:31


Your code has some problems... but not the one you're asking about.

Since you didn't provide enough to actually run anything, I added some extra pieces:

class Genome(object):
    i = 0
    def __init__(self, newi=None):
        if newi is None:
            newi = Genome.i
            Genome.i += 1
        self.i = newi
    def __repr__(self):
        return 'Genome({})'.format(self.i)
    def children(self):
        return self._children

g1, g2 = Genome(), Genome()
g1._children = [Genome(), Genome()]
g2._children = [Genome(), Genome(), Genome()]
a_list_of_genomes = [g1, g2]
all_genomes = [g1.children()[0], g2.children()[2]]

Now, your algorithm should give genomes 2 and 6. So, let's try your non-threaded code:

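valid_children = set()
for focus_genome in a_list_of_genomes: # a list of objects
    for child in focus_genome.children(): # a list of objects (again)
        if child in all_genomes:
            valid_children.update(set([child]))
print(valid_children)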

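I get {Genome(2), Genome(6)}, which is correct.

Now, your threaded code. I copied and pasted the inner loop body as the function body, to make sure it's exactly the same: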

def threaded_function(focus_genome):
    for child in focus_genome.children(): # a list of objects (again)
        if child in all_genomes:
            valid_children.update(set([child]))

And running it:

import threading

valid_children = set()
for focus_genome in a_list_of_genomes: # a list of objects
    t = threading.Thread(target=threaded_function, args=(focus_genome,))
    t.start()
    t.join()
print(valid_children)

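I get {Genome(2), Genome(6)}, which is not empty as you said it would be, is correct, and is identical to the non-threaded version.

That being said, you're not actually doing anything useful here, and if you were, you'd have a problem.

First, join waits for the background thread to finish. So there's no benefit to starting a thread and joining it right away: the main thread just sits idle until that thread is done, and nothing ever runs concurrently. Instead, you need to start a bunch of threads, then join all of them. For example: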

threads = [threading.Thread(target=threaded_function, args=(focus_genome,))
           for focus_genome in a_list_of_genomes]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

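However, if the threads are only running CPU-bound Python code, this won't help either, because in CPython the Global Interpreter Lock (GIL) ensures that only one thread can run Python code at a time. Threads are great when you spend all your time doing I/O (reading files or HTTP URLs), waiting on user interaction, or calling into libraries like NumPy that are designed with threading in mind. But for running Python code in parallel, they won't speed things up at all. For that, you need processes.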
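To make the I/O-bound case concrete, here is a minimal sketch that uses time.sleep as a stand-in for blocking I/O. That substitution is an assumption for illustration, and fake_io, run_serially, and run_concurrently are made-up names. Sleeping releases the GIL, so the waits can overlap once all threads are started before any of them is joined.

```python
import threading
import time

def fake_io(delay):
    # Hypothetical stand-in for a blocking read or HTTP request.
    time.sleep(delay)

def run_serially(delays):
    # Start each thread and join it immediately: the waits happen one by one.
    start = time.perf_counter()
    for delay in delays:
        t = threading.Thread(target=fake_io, args=(delay,))
        t.start()
        t.join()
    return time.perf_counter() - start

def run_concurrently(delays):
    # Start every thread first, then join them all: the waits overlap.
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(d,)) for d in delays]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

delays = [0.2] * 4
serial = run_serially(delays)
concurrent = run_concurrently(delays)
print('serial: {:.2f}s, concurrent: {:.2f}s'.format(serial, concurrent))
```

On most machines the serial time should come out close to the sum of the delays and the concurrent time close to the longest single delay. A CPU-bound function in place of fake_io would not be expected to show the same gap.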

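Meanwhile, you have multiple threads trying to mutate a shared object with no synchronization at all. That's a race condition, and it can lead to corrupted data. If you want to use threads, you need to protect the shared data with a lock or another synchronization object: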

from threading import Lock

valid_children_lock = Lock()

def threaded_function(focus_genome):
    for child in focus_genome.children(): # a list of objects (again)
        if child in all_genomes:
            # Use the lock object itself as the context manager (no call).
            with valid_children_lock:
                valid_children.update(set([child]))
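As a self-contained illustration of the same locking pattern, the sketch below has several threads update a shared dict with a read-modify-write step, which is the kind of operation that genuinely needs a lock. The names tally, tally_lock, count_items, and batches are hypothetical and not part of the original question.

```python
import threading

tally = {}
tally_lock = threading.Lock()

def count_items(items):
    for item in items:
        # Read-modify-write on shared state: hold the lock for the whole step.
        with tally_lock:
            tally[item] = tally.get(item, 0) + 1

batches = [['a', 'b', 'a'], ['b', 'c'], ['a', 'c', 'c']]
threads = [threading.Thread(target=count_items, args=(batch,))
           for batch in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(tally.items()))  # [('a', 3), ('b', 2), ('c', 3)]
```

Holding the lock only around the update, rather than around the whole loop, keeps the critical section short so the threads spend less time waiting on each other.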

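This kind of mutable shared data gets even worse when you use processes instead of threads. If you try to directly share a set between two processes, it sometimes appears to work on Unix, and never works on Windows, because each process ends up with its own copy of the set rather than the same object.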

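If you can reorganize your logic so that it doesn't use mutable shared data, everything gets much easier. A really simple way to do that is to write everything in terms of tasks that take parameters and return values, that is, functions without side effects. Then you can just use a thread pool or executor to run all of those tasks and hand back the results. This has the added advantage that you run only as many tasks at a time as you have workers, with the rest queued automatically, which is a lot faster than trying to run all of them at once.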
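A minimal sketch of that task-based style, using concurrent.futures.ThreadPoolExecutor from the standard library. The data and the names WANTED, find_wanted, and chunks are invented for illustration. Each task only reads its argument and returns a fresh set, so there is nothing to lock.

```python
from concurrent.futures import ThreadPoolExecutor

WANTED = frozenset({3, 5, 8})  # read-only, so it is safe to share

def find_wanted(chunk):
    # Pure task: takes its input as a parameter and returns a new set.
    return {n for n in chunk if n in WANTED}

chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# Two workers, four tasks: the extra tasks wait in the executor's queue.
with ThreadPoolExecutor(max_workers=2) as executor:
    found = set().union(*executor.map(find_wanted, chunks))

print(sorted(found))  # [3, 5, 8]
```

Only the main thread combines the results, so the final union needs no synchronization either.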

Can we do that here? Maybe. If each task returns the set of valid_children found for a given focus_genome, then we can just union all of those sets together and get the complete result, right? So:

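import multiprocessing

def threaded_task(focus_genome):
    valid_children = set()
    for child in focus_genome.children(): # a list of objects (again)
        if child in all_genomes:
            valid_children.add(child)
    return valid_children

# Arguments and results are pickled between processes, so the membership test
# only matches if Genome defines value-based __eq__ and __hash__.
if __name__ == '__main__':  # needed where workers are spawned, e.g. Windows
    valid_children = set()
    with multiprocessing.Pool() as pool:
        for subset in pool.imap_unordered(threaded_task, a_list_of_genomes):
            valid_children.update(subset)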

We can simplify this further with a couple of comprehensions:

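def threaded_task(focus_genome):
    return {child for child in focus_genome.children() if child in all_genomes}

with multiprocessing.Pool() as pool:
    # set.union needs a set as its receiver, so start from an empty set
    # and unpack the per-task results into it.
    valid_children = set().union(*pool.imap_unordered(threaded_task, a_list_of_genomes))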
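
One caveat is worth checking before relying on the Pool version. Arguments sent to worker processes are pickled, so each worker sees a copy of focus_genome and its children rather than the original objects. Since the Genome class above doesn't define __eq__, the test `child in all_genomes` falls back to object identity and may not match those copies. The sketch below simulates the round trip with pickle inside a single process. PlainGenome, KeyedGenome, and survives_round_trip are hypothetical stand-ins, not code from the question.

```python
import pickle

class PlainGenome(object):
    # Hypothetical stand-in: default equality compares object identity.
    def __init__(self, i):
        self.i = i

class KeyedGenome(object):
    # Hypothetical variant: equality and hashing are based on the id i.
    def __init__(self, i):
        self.i = i
    def __eq__(self, other):
        return isinstance(other, KeyedGenome) and self.i == other.i
    def __hash__(self):
        return hash(self.i)

def survives_round_trip(genome):
    # Pickle and unpickle, roughly what happens to a Pool task argument,
    # then test membership against a list holding the original object.
    copy = pickle.loads(pickle.dumps(genome))
    return copy in [genome]

plain_ok = survives_round_trip(PlainGenome(2))
keyed_ok = survives_round_trip(KeyedGenome(2))
print(plain_ok, keyed_ok)  # False True
```

If your real objects carry a stable identifier, defining __eq__ and __hash__ in terms of it is one way to keep membership tests and set unions meaningful across process boundaries.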
