Flink YARN Client: A Comprehensive Guide

Apache Flink is an open-source framework for stateful stream and batch processing. It can analyze and process large datasets in real time. Flink can be deployed on various cluster managers, including Apache Hadoop YARN. In this article, we will explore how to use the Flink YARN client to deploy and run Flink applications on a YARN cluster.

Understanding Flink YARN Client

The Flink YARN client is a command-line tool that provides an interface to launch Flink applications on a YARN cluster. It simplifies the process of deploying Flink applications by abstracting away the complexities of interacting with YARN directly. The YARN client allows users to specify the desired parallelism, memory, and other configuration parameters for their Flink applications.

Setting up the Environment

Before we dive into using the Flink YARN client, let's set up our development environment. We assume that you have a working installation of Apache Flink and Apache Hadoop YARN. If not, please follow the official documentation to install them.

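Once you have a running YARN cluster, make sure the Flink client can find the Hadoop configuration and libraries, typically by setting the HADOOP_CLASSPATH (and, if needed, HADOOP_CONF_DIR) environment variable on the machine you submit from. Flink has no separate "YARN mode" switch to turn on: the deployment target is normally chosen on the command line at submission time. If you prefer a default, you can set the execution.target property in the flink-conf.yaml file located in the conf directory of your Flink installation. The accepted values depend on your Flink version; for the per-job submission used in this article it would look like this:

execution.target: yarn-per-job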

Writing a Flink Application

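Next, let's write a simple Flink application that we can deploy using the YARN client. For this example, we will use the Flink DataSet API to count the occurrences of words in a text file. Note that the DataSet API is a legacy API in recent Flink releases, which favor the DataStream API for both batch and streaming jobs, so check that your Flink version still ships it. Here's the code: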

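package com.example;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        env.readTextFile("input.txt")
                // Emit a (word, 1) pair for every word on the line.
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split(" ")) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                })
                // Lambdas lose their generic types to erasure, so declare the output type explicitly.
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .groupBy(0)  // group by the word (tuple field 0)
                .sum(1)      // add up the counts (tuple field 1)
                .print();    // triggers execution and prints the result
    }
}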

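In this example, we read a text file, split each line into words, and count the occurrences of each word. The call to print() triggers execution and writes the result to the console of the submitting client. When the job runs on a cluster, a relative path such as input.txt must be readable from every TaskManager, so in practice you would usually point it at a shared location such as HDFS.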
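To make the logic of the pipeline easier to follow, here is a minimal plain-Java sketch of the same computation that runs without Flink. It is only an illustration: the class name LocalWordCount and the sample sentences are hypothetical, and unlike the Flink job it splits on any whitespace and returns the counts sorted by word.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Plain-Java illustration of the word-count logic; no Flink involved. */
final class LocalWordCount {

    /** Splits each line on whitespace and counts how often each word occurs. */
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // Corresponds to flatMap in the Flink job: one line becomes many words.
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // Blank lines produce a single empty token; drop it.
                .filter(word -> !word.isEmpty())
                // Corresponds to groupBy(0).sum(1): group by word and count the members.
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(countWords(lines));
    }
}
```

The Flink version expresses the same two steps, but distributes the grouping and summing across the task slots of the cluster.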

Packaging the Application

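To deploy our Flink application on a YARN cluster, we need to package the application code and its third-party dependencies into a single JAR file. The Flink core libraries are already part of the Flink distribution that gets shipped to the cluster, so they are normally declared with Maven's provided scope and left out of the JAR; bundling them again makes the artifact larger and can cause class conflicts.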

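To create such a JAR file, we can use the Maven Shade plugin, the Maven Assembly plugin, or any other build tool of our choice. For simplicity, let's assume we have a Maven project and add the following Assembly plugin configuration to our pom.xml file. Running mvn package then produces an additional JAR in the target directory whose name ends in -jar-with-dependencies.jar: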

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.3.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.example.WordCount</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Deploying with the YARN Client

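Now that our application is packaged into a JAR file, we can use the Flink YARN client to deploy it on our YARN cluster. The example below uses the older -m yarn-cluster syntax with the y-prefixed YARN options; recent Flink releases prefer selecting the deployment target with the -t option, so check ./bin/flink run --help for the form your version supports. Here's an example command to submit our application: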

./bin/flink run -m yarn-cluster -ynm WordCount -ytm 1024 -ys 2 -c com.example.WordCount my-application.jar

Let's break down the command:

  • ./bin/flink run: Invokes the run action of the Flink command-line client, which submits a job.
  • -m yarn-cluster: Tells the client to start a dedicated Flink cluster on YARN for this job (per-job mode).
  • -ynm WordCount: Sets the name of the YARN application, as shown in the YARN ResourceManager UI.
  • -ytm 1024: Sets the memory of each TaskManager container (in megabytes).
  • -ys 2: Sets the number of task slots per TaskManager.
  • -c com.example.WordCount: Specifies the entry point class for our application.
  • my-application.jar: The path to our application JAR file.
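If the submission is driven from a script or a deployment tool, it can help to assemble the command from its parts rather than hard-coding the string. The following is a small sketch of that idea, not part of Flink itself: the class FlinkYarnCommand and its build method are hypothetical helpers, and they only use the flags from the command above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that assembles the submission command discussed above. */
final class FlinkYarnCommand {

    static List<String> build(String appName, int taskManagerMemoryMb,
                              int slotsPerTaskManager, String mainClass, String jarPath) {
        List<String> cmd = new ArrayList<>();
        cmd.addAll(Arrays.asList("./bin/flink", "run"));
        cmd.addAll(Arrays.asList("-m", "yarn-cluster"));                        // per-job cluster on YARN
        cmd.addAll(Arrays.asList("-ynm", appName));                             // YARN application name
        cmd.addAll(Arrays.asList("-ytm", Integer.toString(taskManagerMemoryMb))); // memory per TaskManager (MB)
        cmd.addAll(Arrays.asList("-ys", Integer.toString(slotsPerTaskManager)));  // slots per TaskManager
        cmd.addAll(Arrays.asList("-c", mainClass));                             // entry point class
        cmd.add(jarPath);                                                       // application JAR
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("WordCount", 1024, 2, "com.example.WordCount", "my-application.jar");
        System.out.println(String.join(" ", cmd));
        // To actually submit, the list could be passed to ProcessBuilder from the Flink home directory:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping each flag next to its value in the builder makes it harder to mix up options such as -ytm and -ys when the values come from configuration.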

Understanding the Workflow

To better understand the workflow of deploying a Flink application with the YARN client, let's visualize the steps involved:

flowchart TD
    A[Write Flink Application] --> B[Package Application into JAR]
    B --> C[Submit Application using YARN Client]
    C --> D[Allocate Containers on YARN]
    D --> E[Launch JobManager in ApplicationMaster Container]
    E --> F[Launch Task Managers]
    F --> G[Execute Application]
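
The container allocation step can be estimated with simple arithmetic. The sketch below assumes Flink's default slot sharing, where a job needs as many slots as its highest operator parallelism, and it counts one extra container for the JobManager, which runs in the YARN ApplicationMaster container. The class ContainerEstimate and the parallelism of 5 are hypothetical and chosen only for illustration.

```java
/** Back-of-the-envelope container estimate for a Flink job on YARN. */
final class ContainerEstimate {

    /** TaskManagers needed so that the job's parallelism fits into the slots (ceiling division). */
    static int taskManagers(int parallelism, int slotsPerTaskManager) {
        if (parallelism < 1 || slotsPerTaskManager < 1) {
            throw new IllegalArgumentException("parallelism and slots must be positive");
        }
        return (parallelism + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    /** TaskManager containers plus one container for the JobManager (ApplicationMaster). */
    static int totalContainers(int parallelism, int slotsPerTaskManager) {
        return taskManagers(parallelism, slotsPerTaskManager) + 1;
    }

    public static void main(String[] args) {
        int parallelism = 5;          // hypothetical job parallelism
        int slotsPerTaskManager = 2;  // matches -ys 2 from the example command
        // Prints: TaskManagers: 3, containers: 4
        System.out.println("TaskManagers: " + taskManagers(parallelism, slotsPerTaskManager)
                + ", containers: " + totalContainers(parallelism, slotsPerTaskManager));
    }
}
```

With -ytm 1024, this rough estimate would translate into three TaskManager containers of about 1 GB each, plus the JobManager container, subject to how YARN rounds container sizes on your cluster.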