My Personal Compute Cluster recently had a failure in which only one of my nodes disassociated from the cluster and the 2.5 Gbps high-speed Ethernet link that I had set up through a USB dongle became unresponsive. Investigating the problem, I saw in the system log on that node that the kernel thought it was getting a SYN flood through the 2.5 Gbps Ethernet link. Basically, the kernel turned off that networking link because it thought it was getting a DDoS attack.
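To look for the same symptom on a node, one option is to search the system log for the kernel's SYN flood warning. This is a minimal sketch: the function name is made up for illustration, and it assumes the log is at /var/log/syslog (the Ubuntu default) and that the kernel message contains the phrase "Possible SYN flooding".

```shell
# Count kernel SYN-flood warnings in a log file.
# Assumptions: the log lives at /var/log/syslog unless a path is given,
# and the kernel warning contains the phrase "Possible SYN flooding".
count_syn_flood() {
    log_file="${1:-/var/log/syslog}"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'Possible SYN flooding' "$log_file"
}

# Example: count_syn_flood /var/log/syslog
```

A non-zero count on a node that sits on a private cluster network suggests a queue that is too small rather than a real attack.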
Clearly there wasn’t a true DDoS attack happening since my cluster is on its own network. I researched what would cause this and learned that the standard Linux kernel networking configuration is tuned for 1 Gbps ethernet links. Basically, the intense data transfer between my Spark nodes over the 2.5 Gbps ethernet links filled the kernel’s network queue. To fix the problem, I had to increase the size of the queue.
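One way to check whether the kernel's receive queue is overflowing is to read /proc/net/softnet_stat, where each row is one CPU and the second hexadecimal field counts packets dropped because the input backlog was full. The sketch below sums that field across CPUs; the function name is hypothetical.

```shell
# Sum the "dropped" column of /proc/net/softnet_stat across all CPUs.
# Each row is one CPU; the second hexadecimal field counts packets dropped
# because the input backlog (net.core.netdev_max_backlog) was full.
softnet_dropped() {
    stat_file="${1:-/proc/net/softnet_stat}"
    total=0
    while read -r _processed dropped _rest; do
        # Prefix with 0x so the shell treats the field as hexadecimal.
        total=$((total + 0x$dropped))
    done < "$stat_file"
    echo "$total"
}

# Example: softnet_dropped
```

If the number keeps growing during a heavy Spark shuffle, the backlog is likely too small for the link.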
To make the needed improvements to how the kernel handles high network load over fast networks, I added the following lines to the /etc/sysctl.d/99-sysctl.conf file on each of my nodes:
#
# Networking changes to help handle bursts with >1Gbps ethernet
#
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_max_syn_backlog = 200000
After rebooting the cluster, I have had no similar networking problems since.
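To confirm that the new values are live on a node, the settings can be read back from /proc/sys. The helper names below are made up for illustration; the sketch assumes the two keys from the configuration above.

```shell
# Map a sysctl key to its file under /proc/sys, for example
# net.core.netdev_max_backlog -> /proc/sys/net/core/netdev_max_backlog
sysctl_path() {
    echo "/proc/sys/$(echo "$1" | tr '.' '/')"
}

# Print the live value of each setting changed in this post.
show_backlog_settings() {
    for key in net.core.netdev_max_backlog net.ipv4.tcp_max_syn_backlog; do
        printf '%s = %s\n' "$key" "$(cat "$(sysctl_path "$key")")"
    done
}

# Example: show_backlog_settings
# The settings can also be loaded without a reboot:
#   sudo sysctl --system
```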
Here are some good references on the changes I made:
Driver for USB Ethernet Dongle
It turns out that Ubuntu 18.04 doesn’t fully support the Realtek 8156 chip in my 2.5G Ethernet dongles. While it “worked”, the syslog was spammed with USB disconnect messages. Some research indicated that this was because the built-in driver was insufficient. So I needed to download and install the official driver from Realtek.
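To see which driver a dongle is currently using and how noisy the log is, something like the following can help. This is a sketch with hypothetical helper names: it assumes `ethtool` is installed, that the kernel's messages contain the phrase "USB disconnect", and that the log is at /var/log/syslog.

```shell
# Extract the driver name from `ethtool -i <interface>` output read on stdin.
driver_name() {
    sed -n 's/^driver: *//p'
}

# Count USB disconnect messages in a log file (default: Ubuntu's syslog).
count_usb_disconnects() {
    grep -c 'USB disconnect' "${1:-/var/log/syslog}"
}

# Example (eth1 is a placeholder for the dongle's interface name):
#   ethtool -i eth1 | driver_name
#   count_usb_disconnects
```

Running both before and after installing the new driver gives a simple way to compare.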
You can download the Realtek driver for Ethernet dongles based on its chips here: RealTek 2.5G Gigabit Ethernet Driver. Note that Realtek requires an annoying captcha in order to download the software, so you can’t download it from the command line. Furthermore, this driver needs to be compiled. While you could probably write a script to compile it on one computer and distribute the result to the other nodes, I found it easier just to compile the driver on each node, since my cluster only has a few nodes.
To do that, first copy the driver download to each node. Then, on each node, uncompress and untar the package and cd into the driver’s directory. Once there:
sudo apt-get update
sudo apt-get install build-essential -y
make && sudo make install
sudo depmod -a
sudo update-initramfs -u
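For a cluster with more nodes, the per-node steps could be generated rather than typed. The sketch below only prints the commands so they can be reviewed before running them by hand. The function name, the tarball name, the source directory, and the node names are all placeholders, and it assumes ssh access to each node.

```shell
# Print (not run) the copy and build commands for each node.
# Arguments: <tarball> <directory the tarball unpacks into> <node>...
# All of these are placeholders supplied by the caller.
print_build_commands() {
    tarball="$1"
    src_dir="$2"
    shift 2
    # The build steps are the same commands listed above, chained with &&
    # so that a failure stops the sequence on that node.
    steps="sudo apt-get update && sudo apt-get install build-essential -y"
    steps="$steps && make && sudo make install"
    steps="$steps && sudo depmod -a && sudo update-initramfs -u"
    for node in "$@"; do
        echo "scp $tarball $node:"
        echo "ssh -t $node 'tar xf $tarball && cd $src_dir && $steps'"
    done
}

# Example: print_build_commands driver.tar.bz2 driver-src node1 node2
```

Printing instead of executing keeps the sketch safe to try and leaves the sudo password prompts to an interactive session.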
Furthermore, for good measure, I wanted to turn off the Linux kernel’s “autosuspend” feature for USB, which powers down a USB device if the kernel thinks it can. To do that, I edited the line with the GRUB_CMDLINE_LINUX_DEFAULT setting in the /etc/default/grub file to look like:
GRUB_CMDLINE_LINUX_DEFAULT="usbcore.autosuspend=-1"
Once edited, issue these commands:
sudo update-grub
sudo reboot
Repeat this for each node in the cluster and the USB 2.5G Ethernet dongles will not spam syslog and should be more reliable.
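After a node comes back up, it is worth checking that the kernel actually booted with the autosuspend setting. The sketch below tests the kernel command line for a parameter; the function name is hypothetical.

```shell
# Succeed if the kernel command line contains the given parameter.
# Arguments: <parameter> [file to read, default /proc/cmdline]
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Pad with spaces so only whole, space-separated parameters match.
    case " $(cat "$cmdline_file") " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example after the reboot:
#   has_kernel_param usbcore.autosuspend=-1 && echo "autosuspend disabled at boot"
# The live value can also be read from sysfs and should be -1:
#   cat /sys/module/usbcore/parameters/autosuspend
```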