Reading HDFS files and data with Python

Reposted

幸福的地图 2023-07-14 16:56:41

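In the previous lesson (lesson 4), we used the Java IDE IDEA to create a Maven project that works with files on a Hadoop cluster. In this lesson we use the Python IDE PyCharm to read, write, and upload files.

We cover both approaches because data cleaned by computing frameworks such as Hadoop, Hive, or Spark is stored on HDFS, while crawlers and machine-learning programs are easy to implement in Python or Java. Writing Python or Java programs directly in a Linux environment is less convenient, so we need a read/write channel between Python, Java, and HDFS.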

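First, start PyCharm (my version is JetBrains PyCharm 2019.2 x64) and click File > New Project in the upper-left corner. In the dialog that opens, enter sshdfs as the project location and select the Python version for the interpreter.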

In the newly opened window the sshdfs project is already visible. Right-click the project and create a new Python file named testhdfs.py.

With the empty file created, we need to install pyhdfs, a Python package for working with HDFS. Click Terminal at the bottom of the PyCharm window.

Enter pip install pyhdfs and wait a moment for the installation to finish.

Our test environment is Python 3.8 and Hadoop 2.7.3. Now let's start testing.

1. Read an HDFS file. The address is the cluster address I configured in lesson 3, and the file is the output of our earlier words test.

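from pyhdfs import HdfsClient

# HDFS address (NameNode host and web port)
client = HdfsClient(hosts='master105:50070')
# Path of the file on HDFS, starting from the root directory /
res = client.open('/test/output/part-r-00000')
for r in res:
    # open() returns binary data, so decode each line to a UTF-8 string
    line = r.decode('utf-8')
    print(line)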
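The decoding step above can be factored into a small helper. The following is a minimal sketch, not part of the original article: the name iter_text_lines is hypothetical, and an in-memory io.BytesIO with made-up sample bytes stands in for the object returned by client.open, so it runs without a cluster.

```python
import io
from typing import Iterable, Iterator


def iter_text_lines(stream: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Yield each binary line as text, without its trailing newline."""
    for raw in stream:
        yield raw.decode(encoding).rstrip("\r\n")


# Stand-in for client.open('/test/output/part-r-00000'); the bytes are invented sample data.
sample = io.BytesIO("hello\t2\nworld\t1\n".encode("utf-8"))
decoded = list(iter_text_lines(sample))
print(decoded)
```

Stripping the newline avoids the blank line that print would otherwise add after every record.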

Run testhdfs.py.

The contents of the file are printed to the console, so the test succeeded.

2. Write an HDFS file. Append the following code to testhdfs.py.

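# Write a file
content = 'hello python hdfs'  # named content so the built-in str is not shadowed
# Create a new file and write the string as UTF-8 bytes; this fails if /py.txt already exists
client.create('/py.txt', content.encode('utf-8'))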
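Because create fails when the target already exists, rerunning the script needs some care. Here is a minimal sketch of a wrapper, assuming the overwrite keyword that pyhdfs documents for create; the helper name write_text is hypothetical, and the client is passed in so the function itself needs no cluster.

```python
def write_text(client, path, text, overwrite=False, encoding="utf-8"):
    """Encode text and write it to path through client.create.

    Returns the number of bytes sent. With overwrite=False the call is
    expected to fail if the file already exists.
    """
    data = text.encode(encoding)
    client.create(path, data, overwrite=overwrite)
    return len(data)


# Usage with the client from this lesson (requires a reachable cluster):
# write_text(client, '/py.txt', 'hello python hdfs', overwrite=True)
```

Passing bytes explicitly makes the encoding of non-ASCII text predictable.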

Then run testhdfs.py and refresh the HDFS root directory at http://master105:50070/explorer.html#/ ; py.txt has been created.

If we download and open it, it contains exactly what the program just wrote.
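The same check can be done from Python instead of the browser. This is a sketch under assumptions: read_text is a hypothetical helper name, and it relies on the exists and open methods of the pyhdfs client, which is passed in as a parameter.

```python
def read_text(client, path, encoding="utf-8"):
    """Return the whole file at path as text, or None if it does not exist."""
    if not client.exists(path):
        return None
    with client.open(path) as reader:
        return reader.read().decode(encoding)


# Usage with the client from this lesson (requires a reachable cluster):
# print(read_text(client, '/py.txt'))
```

Reading the whole file at once is only reasonable for small files such as py.txt.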

3. Upload a file. Append the following code to testhdfs.py.

# Upload a file
# Arguments: absolute local path, then the destination path on HDFS (the destination must not already exist)
client.copy_from_local('d:/word0326.txt', '/sshdfs1/word.txt')
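Since the upload fails when the destination already exists, a guard makes the script safe to rerun. The following is a rough sketch, assuming the exists method and the overwrite keyword that pyhdfs forwards from copy_from_local to create; upload_file is a hypothetical helper name.

```python
def upload_file(client, local_path, hdfs_path, overwrite=False):
    """Upload local_path to hdfs_path.

    Returns True if the upload was attempted, or False if it was skipped
    because the destination exists and overwrite is False.
    """
    if client.exists(hdfs_path) and not overwrite:
        return False
    client.copy_from_local(local_path, hdfs_path, overwrite=overwrite)
    return True


# Usage with the client from this lesson (requires a reachable cluster):
# upload_file(client, 'd:/word0326.txt', '/sshdfs1/word.txt')
```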

Then run testhdfs.py and refresh the HDFS root directory at http://master105:50070/explorer.html#/ ; the new directory /sshdfs1 has been created.

4. Read a text file and write it to CSV.

Install the pandas module with pip install pandas. I already have it installed, so I skip this step here.

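Prepare a text file named test0326 containing 2 lines of comma-separated content. The code below reads it from the HDFS root as /test0326.txt.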

Append the following code to testhdfs.py. The earlier write and upload code can be commented out first.

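import pandas as pd

# Read the file from HDFS and decode each line from bytes to text
lines = []
with client.open("/test0326.txt") as reader:
    for line in reader:
        lines.append(line.strip().decode("utf-8"))

# The first line holds the column names (at most 4 columns because of maxsplit=3)
column_str = lines[0]
column_list = column_str.split(',', 3)

# Put the remaining lines into one temporary column
data = {"item_list": lines[1:]}

df = pd.DataFrame(data=data)
# Split the temporary column on commas and assign the parts to the named columns
df[column_list] = df["item_list"].apply(lambda x: pd.Series(x.split(",")))
df = df.drop("item_list", axis=1)  # drop the temporary column
print(df.dtypes)
print(df)
# Write the DataFrame to a local CSV file: target path, encoding, no index column
df.to_csv('d:/test/test0326.csv', encoding='utf-8', index=False)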
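For comparison, the same conversion can be done without pandas. This is a sketch that swaps in the standard-library csv module rather than the article's method; hdfs_lines_to_csv is a hypothetical name, and io.BytesIO with invented sample data stands in for client.open("/test0326.txt").

```python
import csv
import io


def hdfs_lines_to_csv(reader, out, encoding="utf-8"):
    """Copy comma-separated byte lines from reader to out as CSV.

    Returns the number of data rows written (the header is not counted).
    """
    text_lines = (raw.decode(encoding).strip() for raw in reader)
    rows = [row for row in csv.reader(text_lines) if row]  # skip blank lines
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)
    return max(len(rows) - 1, 0)


# Stand-in for the HDFS reader; the content is invented sample data.
source = io.BytesIO(b"name,age\ntom,20\n")
target = io.StringIO()
count = hdfs_lines_to_csv(source, target)
print(target.getvalue())
```

To write to disk instead, out can be a file opened with open(path, 'w', newline='', encoding='utf-8').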

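Then run testhdfs.py. The source file /test0326.txt can be seen in the HDFS root directory at http://master105:50070/explorer.html#/ , and the script writes the converted result to the local file d:/test/test0326.csv.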
