Parquet is an open source file format from Apache, originally built for the Hadoop infrastructure. It has since become very popular well beyond Hadoop, and even cloud service providers such as AWS now support the format, which suggests that Parquet is doing something right. In this post, we'll look at what exactly the Parquet file format is, and then we'll walk through a simple Java example for creating and writing Parquet files.

Intro to the Parquet File Format

In the traditional approach, we store data as rows. Parquet takes a different approach: it organizes the data into columns before storing it. This columnar layout allows for better compression and better query performance, and it also lets the format handle data sets with a very large number of columns.

Most big data projects use the Parquet file format because of these features. Parquet files also reduce the amount of storage space required. Most queries only need certain columns, and the beauty of the format is that all the data for a column is stored adjacently, so those queries run faster.

Because of these optimizations and the popularity of the format, Amazon even provides built-in features to transform incoming streams of data into Parquet files before saving them to S3 (which acts as a data lake). I have used this extensively with Amazon Athena and some Apache services. For more information about the Parquet file format, you can refer to the official documentation.

The Dependencies

Before we start writing the code, we need to take care of the dependencies. Because this is a Spring Boot Maven project, we’ll list all our dependencies in the pom.xml file:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-hadoop</artifactId>
        <version>1.8.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>

As you can see, we are adding the Spring Boot starter package and two Apache dependencies, parquet-hadoop and hadoop-core. For this example, this is all we need.

The Properties

As always, we have an application.properties file where we specify all the properties. For this example, we only need two properties: one specifying the path of the schema file, and the other specifying the path of the output directory. We’ll learn more about the schema a bit later. So, the properties file looks like this:

schema.filePath=
output.directoryPath=

And because this is a Spring Boot application, we’ll be using the @Value annotation to read these values in the code:

@Value("${schema.filePath}")
private String schemaFilePath;
@Value("${output.directoryPath}")
private String outputDirectoryPath;

Schema of the Parquet File

We need to specify the schema of the data we’re going to write in the Parquet file. This is because when a Parquet binary file is created, the data type of each column is retained as well. Based on the schema we provide in a schema file, the code will format the data accordingly before writing it to the Parquet file.

In this example, I’m keeping it simple, as you can see from the schema file below:

message m { 
    required INT64 id; 
    required binary username; 
    required boolean active; 
}

Let me explain what this is. The first field is of type INT64, which is a 64-bit integer, and it is called id. The second field is of type binary, which is essentially a string; we're calling it the username field. The third is a boolean field called active. This is a pretty simple example, but unfortunately, if your data has a hundred columns, you'll have to declare all of them here.

The required keyword before each field declaration is used for validation, to make sure a value is always specified for that field. For fields that are not mandatory, you can use the optional keyword instead, as shown below.
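For example, a hypothetical non-mandatory email field (not part of this example project, just an illustration) could be declared with the optional keyword:

message m {
    required INT64 id;
    required binary username;
    required boolean active;
    optional binary email;
}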

The ParquetWriter

Disclaimer time: I did not write the two classes I'm discussing in this section. A few months back, when I was researching this, I found them on Stack Overflow. I don't know who wrote them, but I've been using these two classes everywhere since. I have, however, renamed the classes to suit the project.

First, the CustomParquetWriter class. This extends the ParquetWriter class that Apache provides. The code for this class is as follows:

public class CustomParquetWriter extends ParquetWriter<List<String>> {
    public CustomParquetWriter(
            Path file,
            MessageType schema,
            boolean enableDictionary,
            CompressionCodecName codecName
    ) throws IOException {
        super(file, new CustomWriteSupport(schema), codecName, DEFAULT_BLOCK_SIZE, DEFAULT_PAGE_SIZE, enableDictionary, false);
    }
}

There's not much to talk about here. Next is CustomWriteSupport, which you can see passed as the second argument to the super() call in the snippet above. This is where most of the work happens. You can check the repo for the complete class and see what it does.

Basically, the class checks the schema to determine the data type of each field and then, using an instance of the RecordConsumer class, writes the data to the file. I won't talk much about these two classes, because a) I didn't write them, and b) the code is simple enough for anybody to understand.
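For reference, here is a minimal sketch of what such a write-support class can look like for this flat schema, based on the description above (the handling of null or empty values and the exact field iteration are my own assumptions, so check the repo for the real class):

import java.util.HashMap;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;

public class CustomWriteSupport extends WriteSupport<List<String>> {

    private final MessageType schema;
    private RecordConsumer recordConsumer;

    public CustomWriteSupport(MessageType schema) {
        this.schema = schema;
    }

    @Override
    public WriteContext init(Configuration configuration) {
        // hand the schema (and optional extra metadata) to the Parquet framework
        return new WriteContext(schema, new HashMap<String, String>());
    }

    @Override
    public void prepareForWrite(RecordConsumer recordConsumer) {
        this.recordConsumer = recordConsumer;
    }

    @Override
    public void write(List<String> values) {
        List<ColumnDescriptor> columns = schema.getColumns();
        recordConsumer.startMessage();
        for (int i = 0; i < columns.size(); i++) {
            String value = values.get(i);
            if (value == null || value.isEmpty()) {
                continue; // skip empty values (assumption: only valid for optional fields)
            }
            String fieldName = columns.get(i).getPath()[0];
            recordConsumer.startField(fieldName, i);
            // convert the string value based on the primitive type declared in the schema
            switch (columns.get(i).getType()) {
                case INT64:
                    recordConsumer.addLong(Long.parseLong(value));
                    break;
                case BOOLEAN:
                    recordConsumer.addBoolean(Boolean.parseBoolean(value));
                    break;
                case BINARY:
                    recordConsumer.addBinary(Binary.fromString(value));
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported type: " + columns.get(i).getType());
            }
            recordConsumer.endField(fieldName, i);
        }
        recordConsumer.endMessage();
    }
}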

Preparing the Data for the Parquet File

Let's get some data ready to write to the Parquet file. A list of strings represents one record for the Parquet file, and each item in the list is the value of the corresponding field in the schema file. For example, let's assume we have a list like the following (the values mirror the sample data we'll generate below):
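// one record: id, username, active
List<String> parquetFileItem = Arrays.asList("1", "Name1", "true");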

Looking at the schema file, we can tell that the first value in the list is the id, the second value is the username, and the third value is the boolean flag for the active field.

So, in our code, we'll have a list of lists of String to represent multiple records. Yes, you read that right: it's a list of lists of strings:

List<List<String>> columns = getDataForFile();

Let’s look at the function to see how we’re generating the data:

private List<List<String>> getDataForFile() {
    List<List<String>> data = new ArrayList<>();
    List<String> parquetFileItem1 = new ArrayList<>();
    parquetFileItem1.add("1");
    parquetFileItem1.add("Name1");
    parquetFileItem1.add("true");
    List<String> parquetFileItem2 = new ArrayList<>();
    parquetFileItem2.add("2");
    parquetFileItem2.add("Name2");
    parquetFileItem2.add("false");
    data.add(parquetFileItem1);
    data.add(parquetFileItem2);
    return data;
}

That’s pretty easy, right? Let’s move on then.

Getting the Schema File

As we already discussed, we have a schema file. We need to get that schema into the code, specifically, as an instance of the MessageType class. Let’s see how to do that:

MessageType schema = getSchemaForParquetFile();
...
private MessageType getSchemaForParquetFile() throws IOException {
    File resource = new File(schemaFilePath);
    String rawSchema = new String(Files.readAllBytes(resource.toPath()));
    return MessageTypeParser.parseMessageType(rawSchema);
}

As you can see, we're just reading the file as a string and then parsing that string using the parseMessageType() method of the MessageTypeParser class provided by the Apache library.

Getting the Parquet Writer

This is almost the last step in the process. We just have to get an instance of the CustomParquetWriter class that we discussed earlier. Here, we also provide the path of the output file to which the writer will write. The code for this is also pretty simple:

CustomParquetWriter writer = getParquetWriter(schema);
...
private CustomParquetWriter getParquetWriter(MessageType schema) throws IOException {
    String outputFilePath = outputDirectoryPath + "/" + System.currentTimeMillis() + ".parquet";
    File outputParquetFile = new File(outputFilePath);
    Path path = new Path(outputParquetFile.toURI().toString());
    return new CustomParquetWriter(
            path, schema, false, CompressionCodecName.SNAPPY
    );
}

Writing Data to the Parquet File

This is the last step: we just have to write the data to the file. We'll loop over the list of lists that we created and write each inner list to the file using the writer we created in the previous step:

// each inner list is one record (row) in the Parquet file
for (List<String> column : columns) {
    writer.write(column);
}
logger.info("Finished writing Parquet file.");
writer.close();
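
Putting the pieces together, the overall flow looks roughly like this (just a sketch: the writeParquetFile method name is hypothetical, the other methods are the ones defined above, and the enclosing Spring component, logger, and exception handling are assumed):

private void writeParquetFile() throws IOException {
    // prepare the sample records, the schema, and the writer (see the methods above)
    List<List<String>> columns = getDataForFile();
    MessageType schema = getSchemaForParquetFile();
    CustomParquetWriter writer = getParquetWriter(schema);

    // write each record, then close the writer so the Parquet footer is written
    for (List<String> column : columns) {
        writer.write(column);
    }
    writer.close();
}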

That’s pretty much it. You can go to the output directory and check the file created. For example, this is what I got after running this project:

[Screenshot: the generated .parquet file in the output directory]

If you want to start directly with the working example, you can find the Spring Boot project in my GitHub repo. And if you have any doubts or queries, feel free to ask me in the comments.

Translated from: https://towardsdatascience.com/how-to-generate-parquet-files-in-java-64cc5824a3ce