Kafka Learning Notes | Quick Start

This tutorial assumes you are starting fresh, with no existing Kafka or ZooKeeper data. Because the Kafka control scripts differ between Unix and Windows, on Windows use bin\windows\ instead of bin/ and change the script extension to .bat.

Step 1: Download the code

Download the 0.10.2.0 release and un-tar it.

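For example, assuming the Scala 2.11 build of 0.10.2.0 (adjust the archive name to match the file you actually downloaded):

> tar -xzf kafka_2.11-0.10.2.0.tgz
> cd kafka_2.11-0.10.2.0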

Step 2: Start the server

Kafka uses ZooKeeper, so if you don't already have a ZooKeeper server, you need to start one first. You can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance.

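The bundled script with its default configuration looks like this (ZooKeeper listens on port 2181 by default):

> bin/zookeeper-server-start.sh config/zookeeper.properties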

Then start the Kafka server:

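For example:

> bin/kafka-server-start.sh config/server.properties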

Step 3: Create a topic

Let's create a topic named "test" with a single partition and only one replica:

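For example, assuming ZooKeeper is running on localhost:2181:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test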

We can now see that topic if we run the list topic command:

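For example, which should print the topic we just created:

> bin/kafka-topics.sh --list --zookeeper localhost:2181
test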

Alternatively, instead of creating topics manually, you can configure your brokers to auto-create topics when a non-existent topic is published to.

Step 4: Send some messages

Kafka ships with a command line client that takes input from a file or from standard input and sends it out as messages to the Kafka cluster. By default, each line is sent as a separate message. Run the producer, then type a few messages into the console to send to the server.

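For example, typing a couple of lines, each of which becomes its own message:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message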

Step 5: Start a consumer

Kafka also has a command line consumer that dumps the messages to standard output.

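For example, which should print back the messages typed into the producer:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a message
This is another message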

If you run each of the above commands in a different terminal, you should now be able to type messages into the producer terminal and see them appear in the consumer terminal.

All of the command line tools have additional options; running a command with no arguments will display detailed usage information.

Step 6: Setting up a multi-broker cluster

So far we have been running against a single broker, but that's not very interesting. For Kafka, a single broker is just a cluster of size one, so nothing much changes other than starting a few more broker instances. Let's expand our cluster to three nodes (still all on our local machine).

First, we make a config file for each of the brokers (on Windows use the copy command instead):

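On Unix, for instance:

> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties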

Now edit these new files and set the following properties:

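Along these lines (the ports 9093/9094 and the log directories are just example values; any free port and writable directory will do):

config/server-1.properties:
    broker.id=1
    listeners=PLAINTEXT://:9093
    log.dir=/tmp/kafka-logs-1

config/server-2.properties:
    broker.id=2
    listeners=PLAINTEXT://:9094
    log.dir=/tmp/kafka-logs-2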

The broker.id property is the unique and permanent name of each node in the cluster. We have to override the port and log directory only because we are running these all on the same machine, and we want to keep the brokers from all trying to register on the same port or overwrite each other's data.

We already have ZooKeeper running and our single node started, so we just need to start the two new nodes:

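For example, running them in the background:

> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &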

Now create a new topic with a replication factor of three:

 

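For example, calling it my-replicated-topic (the name is just the example used in the rest of this walkthrough):

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic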

Okay but now that we have a cluster how can we know which broker is doing what? To see that run the "describe topics" command:

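For example; the output will look something like the following, though the exact broker ids will vary from run to run:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic  PartitionCount:1  ReplicationFactor:3  Configs:
    Topic: my-replicated-topic  Partition: 0  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0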

Here is an explanation of the output. The first line gives a summary of all the partitions; each additional line gives information about one partition. Since we have only one partition for this topic there is only one line.

- "leader" is the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly selected portion of the partitions.

- "replicas" is the list of nodes that replicate the log for this partition, regardless of whether they are the leader or even whether they are currently alive.

- "isr" is the set of "in-sync" replicas. This is the subset of the replicas list that is currently alive and caught up to the leader.

Note that in my example node 1 is the leader for the only partition of the topic.

We can run the same command on the original topic we created to see where it is:

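For example:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Topic:test  PartitionCount:1  ReplicationFactor:1  Configs:
    Topic: test  Partition: 0  Leader: 0  Replicas: 0  Isr: 0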

So there is no surprise there—the original topic has no replicas and is on server 0, the only server in our cluster when we created it.

Let's publish a few messages to our new topic:

 

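For example:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
my test message 1
my test message 2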

Now let's consume these messages:

 

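For example:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
my test message 1
my test message 2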

Now let's test out fault-tolerance. Broker 1 was acting as the leader so let's kill it:

 

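One way is to find the PID of the broker that was started with config/server-1.properties and kill it (7564 is just an example; use whatever PID ps reports on your machine):

> ps aux | grep server-1.properties
> kill -9 7564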

On Windows use:

 

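A rough equivalent on Windows is to locate the java process that was launched with server-1.properties and terminate it by PID, for example with wmic and taskkill (644 is again just an example PID):

> wmic process get processid,caption,commandline | find "java.exe" | find "server-1.properties"
> taskkill /pid 644 /f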

Leadership has switched to one of the slaves and node 1 is no longer in the in-sync replica set:

 

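Running describe again shows something like the following (the exact ids will vary, but broker 1 should have dropped out of the Isr list):

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic  PartitionCount:1  ReplicationFactor:3  Configs:
    Topic: my-replicated-topic  Partition: 0  Leader: 2  Replicas: 1,2,0  Isr: 2,0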

But the messages are still available for consumption even though the leader that took the writes originally is down:

 

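For example:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
my test message 1
my test message 2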

Step 7: Use Kafka Connect to import/export data

Writing data from the console and writing it back out to the console is a convenient place to start, but you will probably want to use data from other sources or export data from Kafka to other systems. For many systems, instead of writing custom integration code you can use Kafka Connect to import or export data.

Kafka Connect is a tool included with Kafka that imports and exports data to Kafka. It is an extensible tool that runs connectors, which implement the custom logic for interacting with an external system. In this quickstart we'll see how Kafka Connect runs with simple connectors that import data from a file to a Kafka topic and export data from a Kafka topic to a file.

First, we'll start by creating some seed data to test with:

> echo -e "foo\nbar" > test.txt

Next, we'll start two connectors running in standalone mode, which means they run in a single, local, dedicated process. We provide three configuration files as parameters. The first is always the configuration for the Kafka Connect process, containing common configuration such as the Kafka brokers to connect to and the serialization format for data. The remaining configuration files each specify a connector to create. These files include a unique connector name, the connector class to instantiate, and any other configuration required by the connector.

> bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties

These sample configuration files, included with Kafka, use the default local cluster configuration you started earlier and create two connectors: the first is a source connector that reads lines from an input file and publishes each one to a Kafka topic, and the second is a sink connector that reads messages from a Kafka topic and writes each one as a line in an output file.

During startup you will see a number of log messages, including some indicating that the connectors are being instantiated. Once the Kafka Connect process has started, the source connector should start reading lines from test.txt and publishing them to the topic connect-test, and the sink connector should start reading messages from the topic connect-test and writing them to the file test.sink.txt. We can verify the data has been delivered through the entire pipeline by examining the contents of the output file:

> cat test.sink.txt
foo
bar

Note that the data is being stored in the Kafka topic connect-test, so we can also run a console consumer to see the data in the topic (or use custom consumer code to process it):

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
{"schema":{"type":"string","optional":false},"payload":"foo"}
{"schema":{"type":"string","optional":false},"payload":"bar"}
...

The connectors continue to process data, so we can add data to the file and see it move through the pipeline:

> echo "Another line" >> test.txt

You should see the line appear in the console consumer output and in the sink file.

Step 8: Use Kafka Streams to process data

Kafka Streams is a client library of Kafka for real-time stream processing and analyzing data stored in Kafka brokers. This quickstart example will demonstrate how to run a streaming application coded in this library. Here is the gist of the WordCountDemo example code (converted to use Java 8 lambda expressions for easy reading).

// Serializers/deserializers (serde) for String and Long types
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

// Construct a `KStream` from the input topic "streams-file-input", where message values
// represent lines of text (for the sake of this example, we ignore whatever may be stored
// in the message keys).
KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "streams-file-input");

KTable<String, Long> wordCounts = textLines
    // Split each text line, by whitespace, into words.
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
    // Group the text words as message keys
    .groupBy((key, value) -> value)
    // Count the occurrences of each word (message key).
    .count("Counts");

// Store the running counts as a changelog stream to the output topic.
wordCounts.to(stringSerde, longSerde, "streams-wordcount-output");

It implements the WordCount algorithm, which computes a word occurrence histogram from the input text. However, unlike other WordCount examples you might have seen before that operate on bounded data, the WordCount demo application behaves slightly differently because it is designed to operate on an infinite, unbounded stream of data. Similar to the bounded variant, it is a stateful algorithm that tracks and updates the counts of words. However, since it must assume potentially unbounded input data, it will periodically output its current state and results while continuing to process more data because it cannot know when it has processed "all" the input data.

As the first step, we will prepare input data to a Kafka topic, which will subsequently be processed by a Kafka Streams application.

> echo -e "all streams lead to kafka\nhello kafka streams\njoin kafka summit" > file-input.txt

Or on Windows:

> echo all streams lead to kafka> file-input.txt
> echo hello kafka streams>> file-input.txt
> echo|set /p=join kafka summit>> file-input.txt

Next, we send this input data to the input topic named streams-file-input using the console producer, which reads the data from STDIN line-by-line and publishes each line as a separate Kafka message to the topic, with a null key and the value encoded as a string (in practice, stream data will likely be flowing continuously into Kafka while the application is up and running):

> bin/kafka-topics.sh --create \
--zookeeper localhost:2181 \
--replication-factor 1 \
--partitions 1 \
--topic streams-file-input
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-file-input < file-input.txt

We can now run the WordCount demo application to process the input data:

> bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

The demo application will read from the input topic streams-file-input, perform the computations of the WordCount algorithm on each of the read messages, and continuously write its current results to the output topic streams-wordcount-output. Hence there won't be any STDOUT output except log entries, as the results are written back into Kafka. The demo will run for a few seconds and then, unlike typical stream processing applications, terminate automatically.

We can now inspect the output of the WordCount demo application by reading from its output topic:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
--topic streams-wordcount-output \
--from-beginning \
--formatter kafka.tools.DefaultMessageFormatter \
--property print.key=true \
--property print.value=true \
--property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
--property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

with the following output data being printed to the console:

all     1
lead    1
to      1
hello   1
streams 2
join    1
kafka   3
summit  1

Here, the first column is the Kafka message key in java.lang.String format, and the second column is the message value in java.lang.Long format. Note that the output is actually a continuous stream of updates, where each data record (i.e. each line in the original output above) is an updated count of a single word (the record key, e.g. "kafka"). For multiple records with the same key, each later record is an update of the previous one.

The two diagrams below illustrate what is essentially happening behind the scenes. The first column shows the evolution of the current state of the KTable<String, Long> that is counting word occurrences for count. The second column shows the change records that result from state updates to the KTable and that are being sent to the output Kafka topic streams-wordcount-output.

 

First the text line “all streams lead to kafka” is being processed. The KTable is being built up as each new word results in a new table entry (highlighted with a green background), and a corresponding change record is sent to the downstream KStream.

When the second text line “hello kafka streams” is processed, we observe, for the first time, that existing entries in the KTable are being updated (here: for the words “kafka” and for “streams”). And again, change records are being sent to the output topic.

And so on (we skip the illustration of how the third line is being processed). This explains why the output topic has the contents we showed above, because it contains the full record of changes.

Looking beyond the scope of this concrete example, what Kafka Streams is doing here is to leverage the duality between a table and a changelog stream (here: table = the KTable, changelog stream = the downstream KStream): you can publish every change of the table to a stream, and if you consume the entire changelog stream from beginning to end, you can reconstruct the contents of the table.

Now you can write more input messages to the streams-file-input topic and observe additional messages added to streams-wordcount-output topic, reflecting updated word counts (e.g., using the console producer and the console consumer, as described above).

You can stop the console consumer via Ctrl-C.