Kafka introduction

Why is Kafka fast?

There are several major reasons why Kafka is much faster than most of its competitors:

  1. Avoid Random Disk Access
    Kafka writes everything to disk in order, and consumers fetch data in the same order.
    Disk access is therefore always sequential rather than random, and sequential access is much faster than random access on an HDD.

When the disk is accessed sequentially, reads are not much slower than random memory access. Sequential memory access is still much faster than sequential disk access, though.
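
The access pattern described above can be sketched as a toy append-only log. This is only an illustration, not Kafka's actual on-disk format: the `AppendOnlyLog` class, the 4-byte length prefix, and the file name are all assumptions made for the example. Writes always go to the end of the file, and a reader seeks once and then reads forward.

```python
import os
import shutil
import tempfile


class AppendOnlyLog:
    """Toy log (hypothetical): records are appended and read back in write order."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist yet

    def append(self, record: bytes) -> int:
        """Append a length-prefixed record and return its byte position."""
        position = os.path.getsize(self.path)
        with open(self.path, "ab") as f:
            f.write(len(record).to_bytes(4, "big"))
            f.write(record)
        return position

    def read_from(self, position: int):
        """Seek once, then yield every record from that position onward."""
        with open(self.path, "rb") as f:
            f.seek(position)
            while True:
                header = f.read(4)
                if len(header) < 4:
                    return
                size = int.from_bytes(header, "big")
                yield f.read(size)


log_dir = tempfile.mkdtemp()
log = AppendOnlyLog(os.path.join(log_dir, "segment.log"))
positions = [log.append(r) for r in (b"a", b"bb", b"ccc")]
replayed = list(log.read_from(positions[1]))  # consume from the second record
shutil.rmtree(log_dir)
```

A consumer in this sketch only remembers a position and reads forward from it, so the disk never has to jump around.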

Why does Kafka choose disk access instead of memory access?

Because Kafka runs on top of the JVM:

1. The memory overhead of objects is very high, often doubling the size of the stored data.
2. Garbage collection becomes slower and more frequent as the in-heap data grows, because more time is needed to collect unused data. Keeping large amounts of data as objects in memory is therefore very expensive.

  2. Memory Mapped File (MMAP)
    MMAP maps a file's contents from the disk into memory.
    When data is written to the mapped memory, the OS flushes the changes to disk later, so the write itself is a memory operation.

Why use MMAP to write data to disk when the file is mapped into memory anyway? Why not write the data into memory directly?

Kafka runs on top of the JVM, so writing a huge amount of data directly into heap memory would make the memory overhead high and cause GC to happen frequently. Mapped memory lives outside the JVM heap, so it avoids both problems.
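
The mapping behaviour can be tried out with a small sketch. It uses Python's standard `mmap` module rather than the JVM's mapped buffers, and `SEGMENT_SIZE` and the temporary file are assumptions made for the example. The write is a plain memory assignment, and `flush()` asks the OS to write the dirty pages out immediately instead of waiting for its own schedule.

```python
import mmap
import os
import tempfile

SEGMENT_SIZE = 4096  # hypothetical size of the mapped file, in bytes

# Pre-allocate a file of the desired size so it can be mapped.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.truncate(SEGMENT_SIZE)

record = b"hello kafka"
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), SEGMENT_SIZE) as mapped:
        mapped[0:len(record)] = record  # looks like a memory write
        mapped.flush()                  # force the changed pages to disk now

# Read the file through the normal file API to confirm the data is there.
with open(path, "rb") as f:
    on_disk = f.read(len(record))
os.remove(path)
```

Without the explicit `flush()`, the OS would still write the pages back eventually, which is the "flush later" behaviour described above.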
  3. Zero Copy

Data is copied directly within kernel space, from the page cache to the network socket, without passing through user space.
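
As a rough illustration of the idea, the sketch below sends a file over a socket with Python's `socket.sendfile`, which uses `os.sendfile` where the platform provides it and otherwise falls back to ordinary `send` calls. The helper name `send_file_zero_copy`, the payload, and the local socket pair are assumptions for the example; Kafka itself does this on the JVM.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send the whole file over sock and return the number of bytes sent."""
    with open(path, "rb") as f:
        # The file's bytes are never read into a buffer in this program.
        return sock.sendfile(f)


payload = b"kafka-record\n" * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# A connected local socket pair stands in for the broker-to-consumer connection.
sender, receiver = socket.socketpair()
sent = send_file_zero_copy(path, sender)
sender.close()

received = b""
while True:
    chunk = receiver.recv(4096)
    if not chunk:
        break
    received += chunk
receiver.close()
os.remove(path)
```

The payload is kept small on purpose so that it fits in the socket buffer and the single-threaded demo does not block.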

  4. Batch Data
    The producer groups records into batches. When the buffered data reaches batch.size, Kafka sends it all in one request.
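
The batching idea can be sketched with a toy accumulator. `BatchBuffer` and its `send` callback are hypothetical names, and the logic is deliberately simplified: the real producer also sends a batch once linger.ms has elapsed, and it handles batch boundaries more carefully than this.

```python
class BatchBuffer:
    """Toy accumulator (hypothetical): sends records as one batch once
    their total size reaches batch_size bytes."""

    def __init__(self, batch_size, send):
        self.batch_size = batch_size
        self.send = send  # callable that receives a list of records
        self.records = []
        self.pending_bytes = 0

    def add(self, record: bytes):
        self.records.append(record)
        self.pending_bytes += len(record)
        if self.pending_bytes >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered, even if the batch is not full."""
        if self.records:
            self.send(self.records)
            self.records = []
            self.pending_bytes = 0


sent_batches = []
buffer = BatchBuffer(batch_size=10, send=sent_batches.append)
for record in (b"aaaa", b"bbbb", b"cccc", b"dddd"):
    buffer.add(record)
buffer.flush()  # send the remainder
```

Four records result in two sends here instead of four, which is where the saving in network round trips comes from.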

How to monitor the disk

After deploying Kafka, it's important to monitor the server's resource usage. Sysstat (sar) is a good tool for this.

Install sysstat:
apt install sysstat

Configure sysstat:

vim /etc/default/sysstat

ENABLED="true"

Change the collection interval from every 10 minutes to every minute:

vim /etc/cron.d/sysstat

change
5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
to
*/1 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1

Then restart the service:

service sysstat restart

If sysstat is not configured properly, you may get an error like:

Cannot open /var/log/sysstat/sa13: No such file or directory
Please check if data collecting is enabled

Use the tool

Report the system CPU statistics 3 times at 1-second intervals:

$ sar 1 3
Linux 4.15.0-66-generic (xps) 	11/13/2019 	_x86_64_	(4 CPU)

09:35:22 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:35:23 AM     all      2.54      0.00      2.79      0.00      0.00     94.67
09:35:24 AM     all      4.08      0.00      2.55      0.00      0.00     93.37
09:35:25 AM     all      2.56      0.00      2.05      0.26      0.00     95.14
Average:        all      3.06      0.00      2.46      0.08      0.00     94.39

Report overall I/O activity statistics 3 times at 1-second intervals:

sar -b 1 3

Linux 4.15.0-66-generic (xps) 	11/13/2019 	_x86_64_	(4 CPU)

09:39:28 AM       tps      rtps      wtps   bread/s   bwrtn/s
09:39:29 AM      0.00      0.00      0.00      0.00      0.00
09:39:30 AM     12.00      0.00     12.00      0.00   1320.00
09:39:31 AM      2.00      2.00      0.00     24.00      0.00
Average:         4.67      0.67      4.00      8.00    440.00

Report vital network device statistics for eth0, etc.:

sar -n DEV 1 1

I created some Kafka data transmission test examples on GitHub.