I've spent the past couple of days dealing with a major security incident on the SRCF's main server, pip (almost exactly 18 months since our last such incident). Now that things have calmed down a bit, I thought I'd write something about it for those who expressed an interest in the details.
The attacker gained root during the afternoon of Tuesday 21st April, via a recent vulnerability in udev (USN 758-1). We had installed the updated udev packages shortly after they were published by Ubuntu, but missed the footnote in the advisory stating that a reboot is necessary to effect the update. (Which is interesting, as actually it's not - further investigation suggests that one only needs to restart udevd to secure the system; whilst Debian's version of the udev packages does this on upgrade, Ubuntu's does not. This leaves me somewhat baffled, but anyway.)
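One way to catch this class of mistake is to look for processes that are still running a binary which has since been replaced on disk. The following is only a minimal sketch, assuming a Linux /proc filesystem and that the package manager replaces the binary by renaming a new file over the old one (in which case the kernel reports the old executable path with a " (deleted)" suffix). The function names are mine, and it will not catch daemons that merely map an upgraded shared library.

```python
import os

# Suffix the Linux kernel appends to /proc/<pid>/exe when the file that
# was executed has been unlinked or replaced since the process started.
DELETED_SUFFIX = " (deleted)"


def is_stale_exe_link(link_target):
    """True if a /proc/<pid>/exe target refers to a replaced binary."""
    return link_target.endswith(DELETED_SUFFIX)


def stale_processes(proc_root="/proc"):
    """Return (pid, exe target) pairs for processes running replaced binaries."""
    stale = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return stale  # no /proc here (not Linux, or not mounted)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            target = os.readlink(os.path.join(proc_root, entry, "exe"))
        except OSError:
            continue  # kernel thread, exited process, or no permission
        if is_stale_exe_link(target):
            stale.append((int(entry), target))
    return stale


if __name__ == "__main__":
    for pid, target in stale_processes():
        print(pid, target)
```

Run as root after an upgrade, anything it prints is a candidate for a restart; a udevd left running from before the upgrade would be expected to show up in such a list.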
At 16:26 on Tuesday 21st April, the attacker installed version 2.3 of the Phalanx rootkit. This is a fun little program which hooks into the syscall table to subvert a variety of different things - most notably, it logs any data sent to or from a tty, thus very effectively harvesting passwords for any service connected to interactively via a shell on the SRCF. We don't know whether the attacker ever retrieved the log, but we must assume that he or she did. (Passwords to SRCF accounts themselves were not compromised, unless they were typed again after login - for example to a terminal locking application, email client, onward SSH client, etc.)
Phalanx also has other abilities: according to the documentation for an older version, Phalanx beta 6 (I can't find any more recent versions online) it opens up a root shell backdoor running with a magic group ID, and any program started as that group will be hidden from process lists. Similarly, any network connection started as that group will be hidden from netstat, and so on. The directory into which it was installed (/usr/lib/zupzz.p2) was also hidden from directory listings, but could be referred to explicitly by name.
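The fact that the directory was hidden from listings yet still reachable by name suggests a crude check: ask for a known name explicitly and see whether it is missing from the parent's listing. This is a sketch only. The helper name is made up, it relies on already knowing which names to probe, and a rootkit that also hooks the stat family of calls would defeat it, so inspecting the disk from clean media remains the trustworthy approach.

```python
import os


def hidden_from_listing(parent, candidate_names):
    """Return candidates that exist under `parent` but are absent from its listing."""
    listed = set(os.listdir(parent))
    return [
        name
        for name in candidate_names
        if name not in listed and os.path.lexists(os.path.join(parent, name))
    ]


if __name__ == "__main__":
    # The directory name is the one reported in this incident.
    print(hidden_from_listing("/usr/lib", ["zupzz.p2"]))
```

On a clean system this should print an empty list, because anything that exists also appears in the listing.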
The attacker set up a cron job (in /etc/cron.d/zupzzplaceholder) which attempted to reinstall the rootkit every minute, just in case the system was rebooted or the rootkit was disabled by some other means:
* * * * * root /usr/lib/zupzz.p2/.p-2.3d i &> /dev/null
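An entry like this is easy to spot mechanically, since few legitimate jobs run as root every single minute. The sketch below assumes the /etc/cron.d line format (five schedule fields, a user, then the command); the function name is hypothetical and the heuristic is deliberately narrow, so treat its output as a list of things to look at rather than a verdict.

```python
import glob


def every_minute_root_jobs(crontab_text):
    """Return commands that a cron.d-style file runs as root every minute."""
    suspicious = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 6)
        if len(fields) < 7:
            continue  # environment assignment, @reboot-style shortcut, or malformed
        schedule, user, command = fields[:5], fields[5], fields[6]
        if schedule == ["*"] * 5 and user == "root":
            suspicious.append(command)
    return suspicious


if __name__ == "__main__":
    for path in sorted(glob.glob("/etc/cron.d/*")):
        try:
            with open(path) as handle:
                text = handle.read()
        except OSError:
            continue  # unreadable file or directory
        for command in every_minute_root_jobs(text):
            print(path, command)
```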
Funnily enough, the redirection to /dev/null broke for reasons unknown when I rebooted early on Thursday morning (for what at the time I thought was an unrelated reason; more below). Unfortunately I did not check my log mailbox before going to bed after the reboot; if I had, I would have noticed the output from the rootkit installation application being emailed to the sysadmins once a minute. This is what alerted the other sysadmins to the presence of the rootkit when they awoke; mad_tigger took the server offline shortly after 10am.
Rewind slightly: on Tuesday and Wednesday we had been seeing unusual behaviour from the Apache web server. Occasional threads would cause a kernel "Oops" and lock up:
Apr 21 16:30:30 pip kernel: [828799.968957] Unable to handle kernel paging request at 00007fa5d06cc0c8 RIP:
Apr 21 16:30:30 pip kernel: [828799.973676] []
Apr 21 16:30:30 pip kernel: [828799.979587] PGD 2ff230067 PUD 22bd8a067 PMD 344d17067 PTE 0
Apr 21 16:30:30 pip kernel: [828799.985360] Oops: 0002 [1] SMP
Apr 21 16:30:30 pip kernel: [828799.988651] CPU 0
Apr 21 16:30:30 pip kernel: [828799.990788] Modules linked in: af_packet jfs xt_multiport binfmt_misc nfsd auth_rpcgss exportfs autofs4 nfs lockd nfs_acl sunrpc iptable_filter ip_tables x_tables ext3 jbd mbcache ipmi_devintf ipmi_si ipmi_msghandler parport_pc lp parport ipv6 psmouse serio_raw i5000_edac iTCO_wdt iTCO_vendor_support button shpchp edac_core pci_hotplug evdev pcspkr xfs sr_mod cdrom pata_acpi sg sd_mod ata_piix ata_generic ahci ehci_hcd uhci_hcd libata scsi_mod usbcore e1000 raid10 raid456 async_xor async_memcpy async_tx xor raid1 raid0 multipath linear md_mod dm_mirror dm_snapshot dm_mod thermal processor fan fbcon tileblit font bitblit softcursor fuse
Apr 21 16:30:30 pip kernel: [828800.049054] Pid: 1035, comm: apache2 Not tainted 2.6.24-23-server #1
Apr 21 16:30:30 pip kernel: [828800.055518] RIP: 0010:[] []
Apr 21 16:30:30 pip kernel: [828800.061658] RSP: 0018:ffff8101889f3ec8 EFLAGS: 00010293
Apr 21 16:30:30 pip kernel: [828800.067075] RAX: 00007fa5d06cc0c8 RBX: 0000000000002710 RCX: 0000000000000000
Apr 21 16:30:30 pip kernel: [828800.074330] RDX: 000000000000002f RSI: 0000000000000296 RDI: ffff810100a1a6e4
Apr 21 16:30:30 pip kernel: [828800.081579] RBP: ffff8101889f3f78 R08: 0000ba5d3eef3329 R09: 0000000000000000
Apr 21 16:30:30 pip kernel: [828800.088827] R10: ffff810001044c60 R11: 0000000000000001 R12: 00000000002dc6c0
Apr 21 16:30:30 pip kernel: [828800.096077] R13: 0000000000003d2f R14: ffffffff00000001 R15: 000000004e254078
Apr 21 16:30:30 pip kernel: [828800.103336] FS: 000000004e254950(0063) GS:ffffffff805c5000(0000) knlGS:0000000000000000
Apr 21 16:30:30 pip kernel: [828800.111545] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 21 16:30:30 pip kernel: [828800.117396] CR2: 00007fa5d06cc0c8 CR3: 0000000337974000 CR4: 00000000000006e0
Apr 21 16:30:30 pip kernel: [828800.124645] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 21 16:30:30 pip kernel: [828800.131894] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 21 16:30:30 pip kernel: [828800.139150] Process apache2 (pid: 1035, threadinfo ffff8101889f2000, task ffff8101152447f0)
Apr 21 16:30:30 pip kernel: [828800.147619] Stack: 62696c2f7273752f 6666666770757a2f fffffffffffffffd 00003d2f00000000
Apr 21 16:30:30 pip kernel: [828800.155844] 312f636f72702f2f ffffff0033363635 ffffffff802b6110 ffffffff8024b4e0
Apr 21 16:30:30 pip kernel: [828800.163447] ffffffff802489b0 ffffffff802128f0 ffffffff8029b070 ffffffff802b91d0
Apr 21 16:30:30 pip kernel: [828800.170835] Call Trace:
Apr 21 16:30:30 pip kernel: [828800.173603] [sys_write+0x0/0x90] sys_write+0x0/0x90
Apr 21 16:30:30 pip kernel: [828800.178674] [sys_kill+0x0/0x1a0] sys_kill+0x0/0x1a0
Apr 21 16:30:30 pip kernel: [828800.183748] [sys_getgid+0x0/0x10] sys_getgid+0x0/0x10
Apr 21 16:30:30 pip kernel: [828800.188910] [sys_mmap+0x0/0x140] sys_mmap+0x0/0x140
Apr 21 16:30:30 pip kernel: [828800.193985] [sys_munmap+0x0/0x80] sys_munmap+0x0/0x80
Apr 21 16:30:30 pip kernel: [828800.199145] [sys_newstat+0x0/0x50] sys_newstat+0x0/0x50
Apr 21 16:30:30 pip kernel: [828800.204406] [system_call+0x7e/0x83] system_call+0x7e/0x83
Apr 21 16:30:30 pip kernel: [828800.209742]
Apr 21 16:30:30 pip kernel: [828800.211337]
Apr 21 16:30:30 pip kernel: [828800.211337] Code: 88 10 48 83 45 f8 01 48 83 45 f0 01 83 45 e8 01 8b 45 e8 3b
Apr 21 16:30:30 pip kernel: [828800.220799] RIP []
Apr 21 16:30:30 pip kernel: [828800.224588] RSP
Apr 21 16:30:30 pip kernel: [828800.228183] CR2: 00007fa5d06cc0c8
Apr 21 16:30:30 pip kernel: [828800.232220] ---[ end trace b66b557d288d7a85 ]---
As we had previously had issues with the kernel taking a sudden dislike to Apache until rebooted, I tried rebooting the machine at 02:10 on Thursday morning. Of course, this problem turned out to be due to the rootkit, and the Oopses resumed shortly after the reboot, as soon as the attacker's cron job ran. Not having noticed the rootkit by this point, I tried a variety of different things to unstick Apache (including running tests on the server's RAM, as for a while I suspected a hardware issue), and came to the conclusion that the bug was threading-related - installing the traditional "prefork" Apache multi-processing module stopped the Oopses. Looking back on this, I think I had discovered that the rootkit's sys_write hook wasn't thread-safe! Anyway, I left the prefork MPM installed as a temporary fix and went to bed stumped.
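As an aside, the "Stack:" lines of the Oops above can be read as text. x86-64 is little-endian, so each 64-bit word holds its bytes in reverse order relative to how it is printed. The sketch below decodes the first six stack words from the log; the interpretation (that path strings were sitting on the kernel stack at the time of the fault) is my reading of the output, not something the log states.

```python
def decode_stack_words(hex_words):
    """Render 64-bit little-endian words as ASCII, with '.' for non-printable bytes."""
    raw = b"".join(int(word, 16).to_bytes(8, "little") for word in hex_words)
    return "".join(chr(byte) if 32 <= byte < 127 else "." for byte in raw)


# First six words of the "Stack:" section in the Oops above.
STACK_WORDS = [
    "62696c2f7273752f",
    "6666666770757a2f",
    "fffffffffffffffd",
    "00003d2f00000000",
    "312f636f72702f2f",
    "ffffff0033363635",
]

if __name__ == "__main__":
    print(decode_stack_words(STACK_WORDS))
```

The first word decodes to "/usr/lib" and the fifth to "//proc/1", which is at least consistent with something manipulating file paths inside a system call, as a rootkit hiding files and processes would.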
Worryingly, whilst juggling kernels, the server logged three machine check exceptions. I haven't determined the precise cause but hopefully they were false alarms thrown as a result of the rootkit meddling with kernel memory.
Fast-forward again to 11:00 on Thursday. Sysadmins are running around like headless chickens... er, acting in a calm and orderly fashion, trying to figure out what has happened. Term had just started for undergraduates, which meant I was tied up doing Java practical assessments for first-year computer scientists from 11:50 until 16:15. (That also went horribly wrong at first due to the course organisers misreading an email, but that's another story.) So for the first few hours of the servers being offline, I could do little other than offer advice via IRC, during the short gaps between students, to djs203 and doismellburning who were on-site (due in doismellburning's case to a very generous employer!) and to the rest of the team who were intermittently prodding things via the serial console (network connectivity having been cut by me at the switch as soon as I became aware of what had happened). Their initial forensic efforts were invaluable in determining the state of the system and the extent of the damage.
At 17:00 I rebooted the server from known-clean media (I am frequently thankful for the bootable USB flash drive I keep on my keyring :-) ), took a snapshot of the server's contents and started reinstalling the OS from scratch - a tedious process but one I'm getting somewhat used to, having set up that server originally and reinstalled it once already after the 2007 compromise. In fact I reinstalled it from scratch twice because debootstrap disagreed with my assumption of what would be a sensible distribution to install... (it installed a 32-bit distribution despite running on a 64-bit kernel - in hindsight I should have realised that it would do this as it was running in a 32-bit userspace, but I was tired). At 02:14 I brought the server back up (with freshly-generated SSH keys and SSL certificates, just in case - incidentally, ProntoSSL were very helpful in getting the certificate reissued promptly and without charge). Two users, who shall remain nameless, logged in despite the changed SSH keys before I had announced the change...
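The debootstrap surprise is the sort of thing a quick check of the rescue environment would flag before starting. This is a rough sketch, assuming an x86 machine and a Python interpreter built for the rescue system's own userspace; the function name is made up. If it reports a mismatch, passing the target architecture explicitly via debootstrap's --arch option (for example --arch amd64) is the usual way to avoid inheriting the host userspace's architecture.

```python
import platform
import struct


def is_32bit_userspace_on_64bit_kernel(machine, pointer_bits):
    """True when the kernel reports x86_64 but userspace pointers are 32 bits wide."""
    return machine in ("x86_64", "amd64") and pointer_bits == 32


if __name__ == "__main__":
    machine = platform.machine()            # architecture as reported by the kernel
    pointer_bits = struct.calcsize("P") * 8  # pointer width of this interpreter
    if is_32bit_userspace_on_64bit_kernel(machine, pointer_bits):
        print("32-bit userspace on a 64-bit kernel: specify the target architecture explicitly")
    else:
        print("kernel reports %s, userspace is %d-bit" % (machine, pointer_bits))
```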
For the rest of the night I set about dealing with a few rough edges where the fresh installation didn't quite match the behaviour of the old one, and also checking our other servers for signs of intrusion (I took our second public server, cyclone, offline at the same time as pip as a precaution but thankfully it and our internal servers all escaped the attacker's notice despite being vulnerable to the same attack). I made an announcement to our users at 05:58, and finally got to bed sometime around 07:45. This of course meant that the rest of the support team dealt with the bulk of the user support aftermath throughout the day, for which I am very grateful. :-)
Finally, I would like to express my gratitude towards CamCERT (the Computer Emergency Response Team of the Cambridge University Computing Service) for their ongoing support and in particular for their help in identifying the vulnerability used to compromise our system based on their knowledge of similar incidents elsewhere.