Python Iterators and Generators

Generators are functions that can be paused and resumed on the fly, returning an object that can be iterated over. Unlike lists, they are lazy and thus produce items one at a time and only when asked. So they are much more memory efficient when dealing with large datasets. This article details how to create generator functions and expressions as well as why you would want to use them in the first place.

Generator Functions

To create a generator, you define a function as you normally would but use the yield statement instead of return, indicating to the interpreter that this function should be treated as an iterator:

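A minimal sketch of such a generator, reconstructed from the prose below (the countdown name, the Starting string, and the num <= 0 condition all come from the surrounding text):

```python
def countdown(num):
    print('Starting')
    while num > 0:
        yield num  # pause here; resume from this point on the next call
        num -= 1
```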

The yield statement pauses the function and saves the local state so that it can be resumed right where it left off.


What happens when you call this function?

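A REPL session along the lines the text describes (the memory address is illustrative):

```python
>>> def countdown(num):
...     print('Starting')
...     while num > 0:
...         yield num
...         num -= 1
...
>>> val = countdown(5)
>>> val
<generator object countdown at 0x10213aee8>
```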

Calling the function does not execute it. We know this because the string Starting did not print. Instead, the function returns a generator object which is used to control execution.


Generator objects execute when next() is called:

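Continuing the session above, the first call runs the body up to the first yield:

```python
>>> next(val)
Starting
5
```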

When calling next() the first time, execution begins at the start of the function body and continues until the next yield statement, where the value to the right of yield is returned. Subsequent calls to next() continue from the yield statement to the end of the loop, then loop around and run until another yield is reached. If yield is never reached (which in our case means we don't enter the while loop because num <= 0), a StopIteration exception is raised:

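Continuing the same session, the generator counts down and is then exhausted:

```python
>>> next(val)
4
>>> next(val)
3
>>> next(val)
2
>>> next(val)
1
>>> next(val)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
```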



Generator Expressions

Just like list comprehensions, generator expressions are written in much the same manner, except they return a generator object rather than a list:

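A sketch of the kind of example this refers to (the list contents are illustrative):

```python
>>> my_list = ['a', 'b', 'c', 'd']
>>> gen_obj = (x for x in my_list)
>>> for val in gen_obj:
...     print(val)
...
a
b
c
d
```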

Take note of the parens on either side of the second line denoting a generator expression, which, for the most part, does the same thing that a list comprehension does, but does it lazily:

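For example (a hypothetical session; the squaring is illustrative), the list comprehension builds all of its values up front, while the generator expression produces them only on demand:

```python
>>> list_comp = [x ** 2 for x in range(5)]   # built eagerly
>>> gen_exp = (x ** 2 for x in range(5))     # evaluated lazily
>>> list_comp
[0, 1, 4, 9, 16]
>>> gen_exp
<generator object <genexpr> at 0x1085e1e10>
>>> next(gen_exp)
0
>>> next(gen_exp)
1
```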

Be careful not to mix up the syntax of a list comprehension with a generator expression – [] vs () – since generator expressions can run slower than list comprehensions (unless you run out of memory, of course):

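A minimal timing sketch with timeit (the sizes and repeat counts are arbitrary, and absolute numbers will vary by machine):

```python
import timeit

# Sum the squares of 10,000 numbers, 1,000 times each way.
# The list comprehension materializes the whole list first;
# the generator expression yields one value at a time.
list_time = timeit.timeit(
    'sum([x * x for x in range(10000)])', number=1000)
gen_time = timeit.timeit(
    'sum(x * x for x in range(10000))', number=1000)

print('list comprehension:   %.3fs' % list_time)
print('generator expression: %.3fs' % gen_time)
```

On a typical CPython build, the generator version comes out slightly slower here, because everything fits comfortably in memory and the list can be built in one go.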

This mix-up is particularly easy to make (even for senior developers) in the above example, since both versions ultimately output the exact same thing.


NOTE: Keep in mind that generator expressions are drastically faster when the size of your data is larger than the available memory.

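One hypothetical way to see the memory side of this is sys.getsizeof (it reports only the container's own size, and the exact numbers vary by Python version):

```python
>>> import sys
>>> sys.getsizeof([x * 2 for x in range(100000)])  # grows with the data
800984
>>> sys.getsizeof(x * 2 for x in range(100000))    # stays constant
112
```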



Use Cases

Generators are perfect for reading a large number of large files since they yield out data a single chunk at a time irrespective of the size of the input stream. They can also result in cleaner code by decoupling the iteration process into smaller components.




Example 1


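A reconstruction of the non-generator version described below (the test/ directory, the .py filter, and the emit_lines name are assumptions):

```python
import os

def emit_lines(pattern):
    lines = []
    # walk the directory tree and collect every matching line
    for dir_path, dir_names, file_names in os.walk('test/'):
        for file_name in file_names:
            if file_name.endswith('.py'):
                for line in open(os.path.join(dir_path, file_name)):
                    if pattern in line:
                        lines.append(line)
    return lines
```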

This function loops through a set of files in the specified directory. It opens each file and then loops through each line to test for the pattern match.


This works fine with a small number of small files. But, what if we’re dealing with extremely large files? And what if there are a lot of them? Fortunately, Python’s open() function is efficient and doesn’t load the entire file into memory. But what if our matches list far exceeds the available memory on our machine?


So, instead of running out of space (large lists) and time (a nearly infinite data stream) when processing large amounts of data, generators are the ideal tool to use, as they yield data one item at a time instead of creating intermediate lists.


Let’s look at the generator version of the above problem and try to understand why generators are apt for such use cases by building a processing pipeline.


We divided our whole process into three different components:


  • Generating set of filenames
  • Generating all lines from all files
  • Filtering out lines on the basis of pattern matching
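A sketch of that pipeline (the function names come from the prose below; the test/ directory and the 'python' pattern are assumptions):

```python
import os

def generate_filenames():
    # component 1: yield the path of each .py file, one at a time
    for dir_path, dir_names, file_names in os.walk('test/'):
        for file_name in file_names:
            if file_name.endswith('.py'):
                yield os.path.join(dir_path, file_name)

def cat_files(filenames):
    # component 2: yield every line from every file, one at a time
    for fname in filenames:
        with open(fname) as f:
            for line in f:
                yield line

def grep_files(lines, pattern):
    # component 3: yield only the lines containing the pattern
    for line in lines:
        if pattern in line:
            yield line

# glue the pipeline together; nothing executes until we iterate
py_files = generate_filenames()
py_lines = cat_files(py_files)
matches = grep_files(py_lines, 'python')
for line in matches:
    print(line)
```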

In the above snippet, we do not use any extra variables to form the list of lines. Instead, we create a pipeline that feeds its components one item at a time via the iteration process. grep_files takes in a generator object of all the lines of the *.py files. Similarly, cat_files takes in a generator object of all the filenames in a directory. So this is how the whole pipeline is glued together via iteration.




Example 2

Generators work great for web scraping and crawling recursively:

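A reconstruction of a simple crawler in that spirit (it assumes the third-party requests library; the URL is a placeholder and the regex-based link extraction is deliberately naive):

```python
import re

import requests  # third-party: pip install requests

def get_pages(link):
    links_to_visit = [link]
    while links_to_visit:
        current_link = links_to_visit.pop(0)
        page = requests.get(current_link)
        # naive link extraction; a real crawler would use an HTML parser
        for url in re.findall(r'<a href="([^"]+)"', page.text):
            if url.startswith('/'):
                url = current_link.rstrip('/') + url
            if re.match(r'https?://', url):
                links_to_visit.append(url)
        # hand back one fetched page before fetching the next
        yield current_link

webpage = get_pages('http://example.com')  # placeholder URL
for result in webpage:
    print(result)
```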

Here, we simply fetch a single page at a time and then perform some sort of action on it each time execution resumes. What would this look like without a generator? Either the fetching and processing would have to happen within the same function (resulting in highly coupled code that's hard to test), or we'd have to fetch all the links before processing a single page.




Conclusion

Translated from: https://www.pybloggers.com/2016/10/introduction-to-python-generators/
