[warning: this post contains lots of technical details, not safe for use while operating heavy machinery]
Summary: I needed to distribute work to multiple local processes with a single IP:port target. I chose to modify the Linux Virtual Server module to support multiple local targets.
I started from an existing patch from 2005. This was a good start, but it had the following problems for me:
1. packet checksums were not being calculated correctly because of hardware checksumming
2. having stats enabled would crash the kernel in the stats update function
3. the kernel would occasionally crash in the packet receive function (about once per week)
The fix for #1 was pretty easy: I just followed what the regular UDP and TCP output routines do. The packets need to be re-checksummed because LVS changes the local port with NAT. The existing code assumes that the packets already have correct UDP and TCP checksums, which is not true for locally generated packets when hardware checksumming is in use.
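To make the checksum issue concrete, here is a small userspace sketch (not the patch itself, and not kernel code) showing why rewriting a port invalidates a UDP checksum until it is recomputed. The buffer layout (a 12-byte pseudo-header followed by the UDP header and payload) and all function names are assumptions made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 12-byte pseudo-header, then the 8-byte UDP header. */
enum { UDP_SPORT_OFF = 12, UDP_CSUM_OFF = 18, UDP_MIN_LEN = 20 };

/* RFC 1071 Internet checksum over big-endian 16-bit words. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)                      /* pad a trailing odd byte with zero */
        sum += (uint32_t)(data[len - 1] << 8);
    while (sum >> 16)                 /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Zero the checksum field, compute the checksum, and store it. */
static void udp_fill_checksum(uint8_t *buf, size_t len)
{
    if (len < UDP_MIN_LEN)
        return;
    buf[UDP_CSUM_OFF] = 0;
    buf[UDP_CSUM_OFF + 1] = 0;
    uint16_t csum = inet_checksum(buf, len);
    if (csum == 0)                    /* UDP transmits 0 as 0xffff */
        csum = 0xffff;
    buf[UDP_CSUM_OFF] = (uint8_t)(csum >> 8);
    buf[UDP_CSUM_OFF + 1] = (uint8_t)(csum & 0xff);
}

/* A valid packet sums to zero when the stored checksum is included. */
static int udp_checksum_ok(const uint8_t *buf, size_t len)
{
    return len >= UDP_MIN_LEN && inet_checksum(buf, len) == 0;
}

/* NAT-style source port rewrite; optionally fix up the checksum afterwards. */
static void nat_rewrite_sport(uint8_t *buf, size_t len, uint16_t port,
                              int recompute)
{
    if (len < UDP_MIN_LEN)
        return;
    buf[UDP_SPORT_OFF] = (uint8_t)(port >> 8);
    buf[UDP_SPORT_OFF + 1] = (uint8_t)(port & 0xff);
    if (recompute)
        udp_fill_checksum(buf, len);
}
```

Calling nat_rewrite_sport() with recompute set to 0 leaves a packet that fails udp_checksum_ok(); calling it with 1 yields a packet that passes. With hardware checksumming the situation is worse still, because a locally generated packet does not carry a finished checksum at the point where LVS rewrites it.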
The fixes for #2 and #3 were harder, but the two problems turned out to be related. Most kernels have interrupt handling split into two parts: the top interrupt handler (also called the "hardware interrupt handler"), and the bottom interrupt handler (also called the "software interrupt handler"). The top handler must not block, sleep, or take a long time to complete. It runs with interrupts disabled. Typically it copies data from/to the hardware and exits without processing it. The top handler also usually schedules the bottom handler to run. In networking, the bottom handler takes the packet, does a routing lookup, runs it through the firewall rules, and either forwards it on or puts it in the local process's TCP/UDP buffer. These things can require locks, which can block the bottom handler. While a bottom handler is running, hardware interrupts are typically enabled, but other bottom handlers on that CPU are disabled.
Since the LVS code does almost all of its work with forwarded packets, the code primarily runs inside the bottom handler. The exception to that is the one local node you're allowed to load balance to. If you use a local node, you have to configure the VIP as a local IP address and have the process listening on the VIP port. This way, the kernel doesn't need to NAT outgoing packets (or modify them in any way). It just needs to keep track of where to send the incoming packets.
When I changed the local process's output path to do the required NAT work, I ran into problems #2 and #3 listed above. This is because it created a race condition between local process output (run in normal kernel context) and packet input from the network (run in bottom handler context). The local process would acquire a lock on the stats or IPVS structures, an interrupt from the network would arrive on that CPU and run the top handler, the top handler would run the bottom handler, and the bottom handler would try to acquire the same lock. Since the interrupted kernel context could not resume until the bottom handler finished, it would never release the lock. So any packet-received interrupt that happened right as that CPU was trying to send a packet would end in a deadlock. The solution was to disable bottom handlers (local_bh_disable) while a local process was sending a packet.
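The deadlock and its fix can be imitated in userspace with POSIX signals. This is only an analogy, not the patch and not kernel code: a signal handler stands in for the bottom handler, ordinary code stands in for local process output, and blocking the signal with sigprocmask() plays the role of local_bh_disable(). All function and variable names here are made up for the sketch, and the handler uses a try-lock so the demonstration can count the would-be deadlock instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

static atomic_flag stats_lock = ATOMIC_FLAG_INIT;
static volatile sig_atomic_t stats_packets;   /* protected by stats_lock */
static volatile sig_atomic_t would_deadlock;  /* handler found the lock held */

/* Stands in for the bottom handler (packet input from the network). */
static void rx_handler(int signo)
{
    (void)signo;
    if (atomic_flag_test_and_set(&stats_lock)) {
        /* The interrupted code holds the lock and cannot run again until
         * this handler returns. A real spinlock would spin here forever. */
        would_deadlock++;
        return;
    }
    stats_packets++;
    atomic_flag_clear(&stats_lock);
}

static int install_rx_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = rx_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Stands in for local process output (normal context). */
static void send_packet(int mask_handler)
{
    sigset_t block, saved;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);

    if (mask_handler)                 /* analogous to local_bh_disable() */
        sigprocmask(SIG_BLOCK, &block, &saved);

    while (atomic_flag_test_and_set(&stats_lock))
        ;                             /* take the lock */
    raise(SIGUSR1);                   /* a packet arrives mid-update */
    stats_packets++;
    atomic_flag_clear(&stats_lock);

    if (mask_handler)                 /* analogous to local_bh_enable(): */
        sigprocmask(SIG_SETMASK, &saved, NULL); /* the pending handler runs now */
}
```

After install_rx_handler(), calling send_packet(0) makes the handler run while the lock is held, which is the case that hung the kernel. Calling send_packet(1) defers the handler until the lock has been released, so both updates go through.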
I have patches for:
- RH5/CentOS5 2.6.18-92.1.17.el5 (tested)
- Stock 2.6.27.7 (untested)