Apache Kafka is a distributed data storage designed for real-time input and processing of streaming data. Streaming data is information that is continuously generated by thousands of data sources, all of which transmit data records in at the same time. A streaming platform must be able to cope with the constant inflow of data and process it sequentially and progressively.
What is Multithreading and why do we need it?
The ability of a central processing unit (CPU) (or a single core in a multi-core processor) to provide many threads of execution concurrently, supported by the operating system, is referred to as multithreading. Multithreading can be used to improve application speed in instances when the work can be broken into smaller units that can operate in parallel without compromising data consistency. Kafka allows you to grow your distributed system by using partitions, which are ordered subsets of messages in a topic.
It has recently noticed a trend where developers, rather than ensuring that a computation can efficiently process data from a single partition, take the easy route of expanding the partitions/vms to get the needed throughput. It’s the equivalent of throwing money at the problem.
Kafka topics divide records into smaller parts called partitions, which can be processed individually without compromising the accuracy of the findings, laying the groundwork for parallel processing. This is commonly accomplished by scaling, which involves using many consumers within the same group, each processing data from a subset of topic partitions and operating in a single thread.
Because reading and processing messages in a single thread is sufficient for most Kafka use cases, the Apache Kafka consumer threading paradigm is widely utilized. The poll loop works smoothly when processing does not require I/O activities.
Kafka consumers
Consumers who buy Kafka usually do so as part of a group. When many consumers subscribe to a topic and are members of the same consumer group, each consumer receives messages from a subset of the subject’s partitions.
Adding extra consumers to a consumer group is the most common technique to scale data consumption from a Kafka topic. Consumers of Kafka frequently perform high-latency actions like writing to a database or performing a time-consuming computation on the data. When a single consumer can’t keep up with the rate at which data flows into a topic, we scale by adding more consumers who share the load by having each consumer own only a subset of the partitions and messages.
Benefits of multithreading
Multithreading allows many pieces of a programme to run at the same time. Threads are lightweight processes available within the process. Multithreading allows multitasking to make the most of the CPU.
Following are some of the advantages of multithreaded programming:
Sharing Resources
A process’s resources, including memory, data, and files, are shared among all threads. Using resource sharing, a single programme can have several threads in the same address space.
Responsiveness
Program responsiveness allows a programme to continue to operate even if a portion of it is halted due to multithreading. If the process is doing a lengthy operation, this can be done as well.
Multiprocessor Architecture
Multithreading allows each thread in a multiprocessor architecture to run on a different processor in parallel. This improves the system’s concurrency. In a single processor system, just one process or thread can run at a time.
What is thread per consumer model
Each thread is instantiated and connected to the Kafka broker in the thread per consumer model. The partitions whose messages will be sent to these threads are assigned by the kafka broker.
A single thread connects to Kafka in multi-threaded consumer mode and can acquire data from several / single partitions (s). Once the data has been provided to the thread, the thread may distribute the messages to other pools of threads for processing in parallel. In this method, the consumer thread determines which child thread will handle which types of messages. However, in this circumstance, offset management becomes extremely difficult.
Spring may easily generate several threads to connect to Kafka. Let’s see how the two behave differently. We have a single test-topic with ten partitions and a single VM running a single concurrent spring application.
Thread per consumer model
/**
* Consumer configuration for email topics
*
* @return
*/
@Bean
public ConsumerFactory<String, String> consumerFactory()
{
Map<String, Object> props = new HashMap<>();
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, EMAIL_STATUS_CONSUMER_GROUP);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class);
return new DefaultKafkaConsumerFactory<>(props);
}
/**
* Sets Concurrency for kafka listener
*
* @return
*/
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory()
{
ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory());
factory.setConcurrency(1);
return factory;
}
This division is being listened to by the consumer group spring-group. The following is how single concurrency behaves:
GROUP TOPIC PARTITION CONSUMER-ID HOST CLIENT-ID
spring-group test-topic 8 consumer-1-01a5779b-940b-44cf-b8c6-2e414aa38eb1 /172.22.0.1 consumer-1
spring-group test-topic 2 consumer-1-01a5779b-940b-44cf-b8c6-2e414aa38eb1 /172.22.0.1 consumer-1
spring-group test-topic 1 consumer-1-01a5779b-940b-44cf-b8c6-2e414aa38eb1 /172.22.0.1 consumer-1
spring-group test-topic 4 consumer-1-01a5779b-940b-44cf-b8c6-2e414aa38eb1 /172.22.0.1 consumer-1
spring-group test-topic 5 consumer-1-01a5779b-940b-44cf-b8c6-2e414aa38eb1 /172.22.0.1 consumer-1
spring-group test-topic 6 consumer-1-01a5779b-940b-44cf-b8c6-2e414aa38eb1 /172.22.0.1 consumer-1
spring-group test-topic 3 consumer-1-01a5779b-940b-44cf-b8c6-2e414aa38eb1 /172.22.0.1 consumer-1
spring-group test-topic 7 consumer-1-01a5779b-940b-44cf-b8c6-2e414aa38eb1 /172.22.0.1 consumer-1
spring-group test-topic 9 consumer-1-01a5779b-940b-44cf-b8c6-2e414aa38eb1 /172.22.0.1 consumer-1
spring-group test-topic 0 consumer-1-01a5779b-940b-44cf-b8c6-2e414aa38eb1 /172.22.0.1 consumer-1
If you look closely at the above output, you’ll notice that the application’s consumer ID is the same for all 10 partitions, indicating that it’s a single thread that connects them all.
Let’s look at what happens when the concurrency is increased to 2,
GROUP TOPIC PARTITION CONSUMER-ID HOST CLIENT-ID
spring-group test-topic 8 consumer-2-8ab0213d-683c-4f92-b3c8-767701905994 /172.22.0.1 consumer-2
spring-group test-topic 5 consumer-2-8ab0213d-683c-4f92-b3c8-767701905994 /172.22.0.1 consumer-2
spring-group test-topic 6 consumer-2-8ab0213d-683c-4f92-b3c8-767701905994 /172.22.0.1 consumer-2
spring-group test-topic 7 consumer-2-8ab0213d-683c-4f92-b3c8-767701905994 /172.22.0.1 consumer-2
spring-group test-topic 9 consumer-2-8ab0213d-683c-4f92-b3c8-767701905994 /172.22.0.1 consumer-2
spring-group test-topic 4 consumer-1-886f1a6e-f316-4e17-90d2-599a582682e4 /172.22.0.1 consumer-1
spring-group test-topic 2 consumer-1-886f1a6e-f316-4e17-90d2-599a582682e4 /172.22.0.1 consumer-1
spring-group test-topic 3 consumer-1-886f1a6e-f316-4e17-90d2-599a582682e4 /172.22.0.1 consumer-1
spring-group test-topic 1 consumer-1-886f1a6e-f316-4e17-90d2-599a582682e4 /172.22.0.1 consumer-1
spring-group test-topic 0 consumer-1-886f1a6e-f316-4e17-90d2-599a582682e4 /172.22.0.1 consumer-1
As you can see in the screenshot above, there are now two threads, each with five partitions.
Kafka will try to distribute partitions evenly among threads belonging to the same consumer group. We’ll have a dedicated thread for each partition if we create ten concurrent threads.
Conclusion
In this article we understand a few things about multithreading and learn the threading model.