Kafka introduction

Why is Kafka fast?
There are some major reasons why Kafka is much faster than most of its competitors:
- Avoid Random Disk Access
Kafka appends everything to disk in order, and consumers fetch the data in the same order.
So disk access is always sequential instead of random, and sequential access is much faster than random access on an HDD.
When the disk is accessed sequentially, reads are not much slower than random memory access, although sequential memory access is still much faster than sequential disk access.
Why does Kafka choose disk access instead of memory access?
Because Kafka runs on top of the JVM:
1. The memory overhead of objects is very high, often doubling the size of the stored data
2. GC becomes more expensive as the in-heap data grows, because more time is needed to collect unused objects
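The append-and-read-in-order pattern described above can be sketched in a few lines. This is only a toy illustration, not Kafka's actual storage format: the class name AppendOnlyLog, the 4-byte length prefix, and the file name are all assumptions made for the example.

```python
import os
import struct


class AppendOnlyLog:
    """Toy log file: records are only appended and are read back in write order."""

    HEADER = struct.Struct(">I")  # hypothetical 4-byte big-endian length prefix

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist

    def append(self, payload: bytes) -> int:
        """Append one record at the end and return the byte offset where it starts."""
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)
            f.write(self.HEADER.pack(len(payload)) + payload)
            return offset

    def read_from(self, offset: int = 0):
        """Yield records sequentially, starting at a byte offset returned by append()."""
        with open(self.path, "rb") as f:
            f.seek(offset)
            while True:
                header = f.read(self.HEADER.size)
                if len(header) < self.HEADER.size:
                    return  # reached the end of the log
                (length,) = self.HEADER.unpack(header)
                yield f.read(length)
```

A reader only needs one seek to its starting offset; after that every read continues from where the previous one ended, which is the access pattern that keeps the disk head moving sequentially.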
- Memory Mapped File (MMAP)
MMAP maps a file's contents from the disk into memory.
When data is written to the mapped memory, the OS flushes the changes to disk later, so the write itself only touches memory.
Why use MMAP to write data to a file that is then mapped into memory? Why not write the data into memory directly?
Kafka runs on top of the JVM, so writing a huge amount of data directly into heap memory would make the memory overhead high and GC would happen frequently.
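To make the idea concrete, here is a minimal sketch of writing through a memory-mapped file, using Python's standard mmap module rather than the JVM API. The function name write_via_mmap and the 4096-byte file size are assumptions chosen for the example.

```python
import mmap


def write_via_mmap(path: str, payload: bytes, size: int = 4096) -> None:
    """Write payload at the start of a file through a memory mapping."""
    if len(payload) > size:
        raise ValueError("payload does not fit into the mapped region")
    # A mapping cannot be larger than the file, so pre-size the file first.
    with open(path, "wb") as f:
        f.truncate(size)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size) as mm:
            # This assignment only modifies the mapped memory pages.
            mm[:len(payload)] = payload
            # Without an explicit flush the OS writes the dirty pages back later.
            mm.flush()
```

The slice assignment looks like a plain in-memory write; the data reaches the file either when flush() is called or whenever the OS decides to write the pages back.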
- Zero Copy
Copy data directly from one kernel context to another (for example from the page cache to the socket buffer), without copying it through user space.
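As a rough illustration of the idea, the sketch below sends a file over a socket with Python's socket.sendfile(), which uses the os.sendfile system call where the platform supports it and falls back to ordinary send() otherwise. The function name send_file_zero_copy is an assumption for this example, and this is not Kafka's own code.

```python
import socket


def send_file_zero_copy(sock: socket.socket, path: str) -> int:
    """Send a whole file over a connected stream socket; return the bytes sent."""
    with open(path, "rb") as f:
        # Where os.sendfile is available, the kernel moves the file data to the
        # socket itself, so the bytes are never read into a Python buffer.
        return sock.sendfile(f)
```

Compared with a read() followed by send(), the application never holds the file contents, which saves the copies into and out of user space.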
- Batch Data
When the pending data reaches batch.size, Kafka sends it all at once.
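The batching behaviour can be sketched as a small accumulator that buffers records and hands them over in one call once a size threshold is reached. This is a simplified illustration, not the Kafka producer implementation: the class BatchAccumulator and its send callback are hypothetical, and only the batch_size threshold corresponds to the batch.size setting mentioned above.

```python
from typing import Callable, List


class BatchAccumulator:
    """Buffer records and send them together once batch_size bytes are pending."""

    def __init__(self, batch_size: int, send: Callable[[List[bytes]], None]):
        self.batch_size = batch_size  # threshold in bytes, like batch.size
        self.send = send              # hypothetical callback that ships one batch
        self._records: List[bytes] = []
        self._pending_bytes = 0

    def add(self, record: bytes) -> None:
        """Queue a record; flush automatically when the threshold is reached."""
        self._records.append(record)
        self._pending_bytes += len(record)
        if self._pending_bytes >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is pending as a single batch."""
        if self._records:
            self.send(self._records)
            self._records = []
            self._pending_bytes = 0
```

One network request per batch instead of one per record is where the saving comes from; the explicit flush() covers the case where the last records never fill a whole batch.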
How to monitor the disk
After deploying Kafka, it's important to monitor the server resource usage. Sysstat (SAR) is a good tool for this.
install sysstat
apt install sysstat
configure sysstat:
vim /etc/default/sysstat
ENABLED="true"
change the collection interval from every 10 minutes to every 1 minute
vim /etc/cron.d/sysstat
change
5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
to
*/1 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
then restart the service
service sysstat restart
If sysstat is not configured properly, you may get an error like:
Cannot open /var/log/sysstat/sa13: No such file or directory
Please check if data collecting is enabled
use the tool
report the system CPU statistics 3 times with a 1 second interval
sar 1 3
Linux 4.15.0-66-generic (xps) 11/13/2019 _x86_64_ (4 CPU)
09:35:22 AM CPU %user %nice %system %iowait %steal %idle
09:35:23 AM all 2.54 0.00 2.79 0.00 0.00 94.67
09:35:24 AM all 4.08 0.00 2.55 0.00 0.00 93.37
09:35:25 AM all 2.56 0.00 2.05 0.26 0.00 95.14
Average: all 3.06 0.00 2.46 0.08 0.00 94.39
report overall I/O activity statistics 3 times with a 1 second interval
sar -b 1 3
Linux 4.15.0-66-generic (xps) 11/13/2019 _x86_64_ (4 CPU)
09:39:28 AM tps rtps wtps bread/s bwrtn/s
09:39:29 AM 0.00 0.00 0.00 0.00 0.00
09:39:30 AM 12.00 0.00 12.00 0.00 1320.00
09:39:31 AM 2.00 2.00 0.00 24.00 0.00
Average: 4.67 0.67 4.00 8.00 440.00
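If the numbers need to be checked by a script rather than by eye, the report can be parsed with a few lines of Python. This is a sketch under assumptions: the function name parse_sar_rows is made up for the example, it expects the plain-text layout shown above (a header line containing tps, then one row per sample), and it assumes decimal points rather than locale-specific commas.

```python
def parse_sar_rows(text):
    """Parse `sar -b` style text into (label, {column: value}) tuples."""
    columns = None
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if "tps" in tokens:
            # Header line: everything from "tps" onwards is a column name.
            columns = tokens[tokens.index("tps"):]
            continue
        if columns is None or len(tokens) < len(columns) + 1:
            continue  # banner line, blank line, or anything before the header
        try:
            values = [float(t) for t in tokens[-len(columns):]]
        except ValueError:
            continue  # not a data row
        # Whatever precedes the values is the timestamp or "Average:".
        label = " ".join(tokens[:-len(columns)])
        rows.append((label, dict(zip(columns, values))))
    return rows
```

For the sample report above, the last row is labelled Average: and its bwrtn/s value could be compared against a threshold to raise an alert.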
report vital network device statistics for eth0, etc.
sar -n DEV 1 1
I created some Kafka data transmission test examples on GitHub.