How to read a TXT file line by line in Python and create variables: reading a file line by line in Python

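Reposted

level 2024-08-06 15:21:16

Tags: Python reading large files with insufficient memory, text files, ci, python. Category: Python, backend development

Reading a large text file line by line in Python without loading it into memory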

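I need to read a large file line by line. Let's say the file is larger than 5GB and I need to read each line, but obviously I do not want to use readlines(), because it will create a very large list in memory.

How will the code below work for this case? Does xreadlines itself read the lines into memory one by one? Is the generator expression needed?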

f = (line for line in open("log.txt").xreadlines()) # how much is loaded in memory?
f.next()

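Also, what can I do to read the file in reverse order, just like the Linux tail command?

I found:

[http://code.google.com/p/pytailer/]

and

"python head, tail and backward read by lines of a text file"

Both work very well!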
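For the tail-like part of the question, a minimal standard-library sketch is shown here. It is not how pytailer works; it simply streams the file through a bounded `collections.deque`, so at most `n` lines are held in memory. The function name `tail` and the UTF-8 encoding are assumptions made for illustration.

```python
from collections import deque


def tail(path, n=10):
    """Return the last n lines of a text file.

    The file is still scanned from start to end, but the deque keeps
    at most n lines in memory at any time.
    """
    with open(path, encoding="utf-8") as infile:
        return list(deque(infile, maxlen=n))
```

Calling `reversed(tail("log.txt", 20))` would then walk those lines from last to first. For very large files, seeking backwards from the end avoids the full scan, at the cost of more code.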

13 answers

234 votes

I am providing this answer because Keith's, while succinct, does not close the file explicitly.

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)
John La Rooy answered 2019-04-22T14:02:31Z
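Since the page title asks about creating variables from the lines, here is a sketch of what `do_something_with(line)` might look like in that case. It assumes, purely for illustration, that each line has the form `name=value`; the function name `load_settings` and the comment convention (`#`) are hypothetical and not part of the answer above.

```python
def load_settings(path):
    """Build a dict from lines of the form name=value, one line at a time."""
    settings = {}
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            name, sep, value = line.partition("=")
            if sep:  # ignore lines that contain no "=" sign
                settings[name.strip()] = value.strip()
    return settings
```

Storing the values in a dict rather than creating real variable names dynamically keeps untrusted file content out of the program's namespace.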
43 votes

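All you need to do is use the file object as an iterator.

for line in open("log.txt"):
    do_something_with(line)

Even better is to use a context manager, which recent Python versions support.

with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)

This also closes the file automatically.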

Keith answered 2019-04-22T14:03:17Z
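Because the file object is a lazy iterator, it can be combined with the usual iterator tools. The sketch below uses `itertools.islice` to stop after the first few lines; the function name `first_lines` and the UTF-8 encoding are assumptions.

```python
from itertools import islice


def first_lines(path, n=5):
    """Return the first n lines without iterating over the rest of the file."""
    with open(path, encoding="utf-8") as fileobject:
        return list(islice(fileobject, n))
```

Only the requested lines end up in the returned list, no matter how large the file is.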
14 votes

The old-school approach:

fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
PTBNL answered 2019-04-22T14:03:43Z
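The same readline loop can be written more compactly with the two-argument form of `iter`, which keeps calling a function until it returns a sentinel value. This is a sketch only; counting non-blank lines stands in for "do stuff with line", and the function name is hypothetical.

```python
def count_nonblank(file_name):
    """Count lines that contain something other than whitespace."""
    count = 0
    with open(file_name, "rt", encoding="utf-8") as fh:
        # iter(callable, sentinel) calls fh.readline() until it returns "",
        # which readline() only does at end of file.
        for line in iter(fh.readline, ""):
            if line.strip():
                count += 1
    return count
```

An empty line in the file is read as "\n", not "", so it does not end the loop early.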
11 votes

You are better off using an iterator. Relevant: [http://docs.python.org/library/fileinput.html]

From the docs:

import fileinput
for line in fileinput.input("filename"):
    process(line)

This avoids copying the whole file into memory at once.

Mikola answered 2019-04-22T14:04:29Z
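`fileinput` is most useful when several files should be treated as one stream. A small sketch, assuming a list of paths is passed in (the function name `numbered_lines` is hypothetical):

```python
import fileinput


def numbered_lines(paths):
    """Yield (filename, line number within that file, line) across several files."""
    with fileinput.input(files=paths) as stream:
        for line in stream:
            yield stream.filename(), stream.filelineno(), line.rstrip("\n")
```

The files are opened one after another, so only the current line is held in memory. Note that an empty `paths` list makes `fileinput` fall back to standard input.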
3 votes

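I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So I recreated the cp command using line-by-line reading and writing. It's crazy.

#!/usr/bin/env python3.6
import sys
with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)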
Bruno Bronosky answered 2019-04-22T14:04:57Z
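For files that are not text, or that have no line breaks, the same copy can be done in fixed-size binary chunks. The sketch below relies on `shutil.copyfileobj` from the standard library; the function name `copy_file` and the 1 MiB chunk size are assumptions.

```python
import shutil


def copy_file(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in fixed-size binary chunks instead of line by line."""
    with open(src, "rb") as infile, open(dst, "wb") as outfile:
        shutil.copyfileobj(infile, outfile, chunk_size)
```

Memory use is bounded by `chunk_size` regardless of how long individual lines are.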
3 votes

If your file has no newlines, you can do this:

with open('large_text.txt') as f:
    while True:
        c = f.read(1024)
        if not c:
            break
        print(c)
Ariel Cabib answered 2019-04-22T14:05:23Z
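The fixed-size read loop above can be wrapped in a generator so that callers simply iterate over chunks. A minimal sketch, with the hypothetical name `read_in_chunks`:

```python
from functools import partial


def read_in_chunks(path, size=1024):
    """Yield successive blocks of up to `size` characters from a text file."""
    with open(path, encoding="utf-8") as f:
        # f.read(size) returns "" at end of file, which stops the iteration.
        for chunk in iter(partial(f.read, size), ""):
            yield chunk
```

Usage would look like `for chunk in read_in_chunks("large_text.txt"): print(chunk)`.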
0 votes

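How about this? Divide the file into chunks and then read it line by line, because when you read a file, the operating system caches the data that follows. If you read the file line by line, you are not making efficient use of that cached information.

Instead, divide the file into chunks, load a whole chunk into memory, and then do your processing.

import os

fname = "log.txt"  # placeholder path; the original snippet leaves fname undefined

def chunks(fh, size=1024):
    # Yield (start, length) pairs; every chunk ends on a line boundary.
    while True:
        startat = fh.tell()
        print(startat)  # file object's current position from the start
        fh.seek(size, 1)  # offset from current position --> 1
        data = fh.readline()  # read on to the end of the current line
        yield startat, fh.tell() - startat  # doesn't store the whole list in memory
        if not data:
            break

if os.path.isfile(fname):
    try:
        fh = open(fname, 'rb')
    except IOError as e:  # file --> permission denied
        print("I/O error({0}): {1}".format(e.errno, e.strerror))
    except Exception as e1:  # handle other exceptions such as attribute errors
        print("Unexpected error: {0}".format(e1))
    else:
        for ele in chunks(fh):
            fh.seek(ele[0])  # startat
            data = fh.read(ele[1])  # endat
            print(data)
        fh.close()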
Arohi Gupta answered 2019-04-22T14:06:00Z
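A simpler way to get line-aligned chunks, offered here as an alternative sketch rather than as the answer's method, is the `hint` argument of `readlines()`: it stops reading once roughly that many bytes or characters have been collected, always on a line boundary. The function name `line_batches` is hypothetical.

```python
def line_batches(path, hint=1024 * 1024):
    """Yield lists of complete lines, each list totalling roughly `hint` characters."""
    with open(path, encoding="utf-8") as fh:
        while True:
            batch = fh.readlines(hint)  # stops once about `hint` characters are read
            if not batch:
                break
            yield batch
```

Each batch can then be processed as a unit while the total memory use stays near `hint`.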
0 votes

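Thank you! I recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b", which I guess means it was in binary format. Using "decode(utf-8)" changed it to ascii.

Then I had to remove a "=\n" in the middle of each line.

Then I split the lines at the newline.

# needs `import binascii` at the top of the script, and i initialised (e.g. i = 0) before the loop
b_data=(fh.read(ele[1]))#endat This is one chunk of ascii data in binary format
a_data=((binascii.b2a_qp(b_data)).decode('utf-8')) #Data chunk in 'split' ascii format
data_chunk = (a_data.replace('=\n','').strip()) #Splitting characters removed
data_list = data_chunk.split('\n') #List containing lines in chunk
#print(data_list,'\n')
#time.sleep(1)
for j in range(len(data_list)): #iterate through data_list to get each item
    i += 1
    line_of_data = data_list[j]
    print(line_of_data)

The code above starts just above "print data" in Arohi's code.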

John Haynes answered 2019-04-22T14:06:49Z
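The leading "b" described above is how Python 3 displays a `bytes` object. If the file is known to be UTF-8 text (an assumption), each binary line can be decoded directly, without the quoted-printable round trip. A sketch with the hypothetical name `decoded_lines`:

```python
def decoded_lines(path, encoding="utf-8"):
    """Yield each line of a binary-mode file as str, without the line ending."""
    with open(path, "rb") as fh:
        for raw in fh:  # raw is a bytes object such as b"hello\n"
            yield raw.decode(encoding).rstrip("\r\n")
```

Opening the file in text mode with `open(path, encoding="utf-8")` would achieve the same decoding without the explicit call.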
0 votes

The Blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.

dask.dataframe takes care of chunking internally, supports many parallelisable operations, and lets you easily export slices back to pandas for in-memory operations.

import dask.dataframe as dd
df = dd.read_csv('filename.csv')
df.head(10) # return first 10 rows
df.tail(10) # return last 10 rows
# iterate rows
for idx, row in df.iterrows():
    ...
# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()
# slice by column
df[df.my_field=='XYZ'].compute()
jpp answered 2019-04-22T14:07:23Z
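When dask is not available, the group-by-mean from the answer above can be approximated with a single streaming pass using only the standard library. This is a sketch under the assumption that the CSV has a header row with the columns `my_field` and `value` used above; the function name `mean_by_group` is hypothetical.

```python
import csv
from collections import defaultdict


def mean_by_group(path, group_col="my_field", value_col="value"):
    """Stream a CSV file row by row and return {group: mean of value_col}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            totals[row[group_col]] += float(row[value_col])
            counts[row[group_col]] += 1
    return {key: totals[key] / counts[key] for key in totals}
```

Only the running totals per group are kept in memory, not the rows themselves.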
0 votes

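I demonstrated a parallel, byte-level random access approach in this other question:

Getting number of lines in a text file without readlines

Some of the answers already provided are nice and concise. I like some of them. But it really depends on what you want to do with the data in the file. In my case I just wanted to count lines, as fast as possible, on big text files. My code can of course be modified to do other things too, like any code.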

Geoffrey Anderson answered 2019-04-22T14:08:05Z
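The linked code is not reproduced here. As a much simpler, single-threaded sketch of byte-level line counting (not the parallel approach described above), newline bytes can be counted block by block. The name `count_lines` and the 1 MiB block size are assumptions.

```python
def count_lines(path, block_size=1024 * 1024):
    """Count newline characters by scanning the file in binary blocks."""
    count = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(block_size)
            if not block:
                break
            count += block.count(b"\n")
    return count
```

A final line that lacks a trailing newline is not counted by this approach.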
0 votes

Here is code for loading text files of any size without causing memory issues. It supports gigabyte-sized files.

[https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d]

Download the file data_loading_utils.py and import it into your code.

Usage

import data_loading_utils
file_name = 'file_name.ext'
CHUNK_SIZE = 1000000
def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data, data is one single line of the file
        pass
    else:
        # end of file reached
        pass
data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)

The process_lines function is the callback. It is called for every line, with the data parameter holding one line of the file at a time.

You can adjust the CHUNK_SIZE variable to suit your machine's hardware configuration.

Iyvin Jose answered 2019-04-22T14:09:06Z
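To make the callback contract concrete, here is a hypothetical stand-in that calls a function with the same `(data, eof, file_name)` arguments. It is not the implementation from the gist, which is not shown here; the name `read_lines_with_callback` and the choice to pass `None` as data at end of file are assumptions.

```python
def read_lines_with_callback(file_name, callback):
    """Call callback(data, eof, file_name) for every line, then once more at EOF."""
    with open(file_name, encoding="utf-8") as fh:
        for line in fh:
            callback(line, False, file_name)
    callback(None, True, file_name)  # signal that the end of the file was reached
```

A callback such as `process_lines` above can then be passed in unchanged.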
-1 votes

Please try this:

with open('filename','r',buffering=100000) as f:
    for line in f:
        print(line)
jyoti das answered 2019-04-22T14:09:26Z
-9 votes
f=open('filename','r').read()  # note: read() loads the entire file into memory
f1=f.split('\n')
for i in range (len(f1)):
    do_something_with(f1[i])

Hope this helps.

Sainik Kr Mahata answered 2019-04-22T14:09:48Z

