Find out how a simple system update brought DreamHost down for nearly two days, and how the MS Updater Trojan works.
PLUS: We answer frequently asked DNS questions, and a war story you’ll never forget!
All that and more, on this week’s TechSNAP!
Super special savings for TechSNAP viewers only. Get a .co domain for only $7.99 (regular $29.99, previously $17.99). Use the GoDaddy Promo Code cofeb8 before February 29, 2012 to secure your own .co domain name for the same price as a .com.
Pick your code and save:
cofeb8: .co domain for $7.99
techsnap7: $7.99 .com
techsnap10: 10% off
techsnap20: 20% off 1, 2, 3 year hosting plans
techsnap40: $10 off $40
techsnap25: 25% off new Virtual DataCenter plans
Deluxe Hosting for the Price of Economy (12+ mo plans)
Dates: Feb 1-29
Direct Download Links:
Subscribe via RSS and iTunes:
- The research provides detailed analysis of the ‘MSUpdater Trojan’
- The trojan was mostly spread using targeted spear phishing attacks, emailing people who would have access to sensitive information
- The goal of the remote administration trojan was to steal sensitive or classified information about aerospace or defense designs
- The trojan changed rapidly to avoid detection, and used a variety of methods to infect computers, including zero-day PDF exploits, fake conference invitations (usually specifically targeted to the recipient area of interest, including ISSNIP, IEEE Aerospace Conference, and an Iraq Peace Conference)
- Communications between the infected machines and the C&C servers often took the form of HTTP traffic using the URL structure of Microsoft Windows Update (where the trojan got its name) and Windows Error Reporting likely to avoid detection by some IDSs and manual traffic analysis. Other versions of the trojan included fake google searches with encoded parameters
- The trojan dropped was able to detect that it was being run in a virtual machine, and if so would not attempt to infect the machine. This allowed it to go on undetected for a longer period of time and until discovered, hampered its analysis by researchers
- Outline by Researchers
- Research and Analysis of the Trojan
- Research paper on detecting Virtual Machines
- DreamHost had a policy where they would automatically install the latest packages from the their repository on all of their machines, including VPS and Dedicated servers rented to customers
- Something in one or more of these packages caused some dependencies to be uninstalled resulting in Apache, the FTP server and in some instances, MySQL being uninstalled or unable to start properly
- DreamHost is a very large attack target due to the number of servers and domains that they host, they must work diligently to ensure updates are applied to prevent massive numbers of machines from becoming compromised
- DreamHost has to manually resolve many of the dependencies was unable to fix the issue in an automated fashion, requiring hands on admin time on each individual server and VPS
- DreamHost has now changed their policy regarding updates, where they will now test all of the packages from Debian extensively before they are pushed to all customer servers
A: I personally use Nagios + NagiosGraph for my monitoring, although I have also experimented with Zabbix recently. We discussed a number of monitoring applications in TechSNAP 20 – Keeping it up . Nagios configures each host/service from files, but supports extensive templating and host/service groups, allowing you to quickly configure servers that are nearly identical. Zabbix is powered by a database, which is both a pro and a con, but the main advantage I gave to NagiosGraph was that the historical data is stored in RRD files rather than a database, meaning it is aged to require less space. Zabbix by default deleted old data to avoid accumulating massive amounts of data.
A: If the CNAME is inside the same domain, the authoritative server will usually return the result with the response for the CNAME. For example, if static.example.com is a CNAME to www.example.com, the A record for www.example.com will be included in the response. However if the CNAME is for something like example.cdn.scaleengine.net then a 2nd lookup is required. To answer the second part of your question, it is not possible to do an HTTP redirect at the DNS level, so NGINX is the best place to do it, if done correctly this redirect can be cached by Varnish to avoid any additional latency. You could hard-code the redirect in to Varnish as well. I applaud your use of a cookieless domain for your static content.
This week’s war story is sent in by Irish_Darkshadow (the other other Alan)
IBM has essentially two “faces”, one is the commercial side that deals with all of the clients and the other is a completely internal organisation called the IGA (IBM Global Account) that provides IT infrastructure and support to all parts of IBM engaged with commercial business.
The events described here took place in early 2005.
There is an IBM location in Madrid, Spain which was stafffed by about two thousand people at the time of this war story. The call centre in Dublin was tasked with supporting the users in that site and every single one of them had been trained in what I called “Criticial Situations – Connectivity Testing”. The training took about 4 hours to deliver and was followed up with some practical tests over the next two weeks to ensure the content was sinking in. There was also some random call recording done to detect the techniques being used on live calls too.
Early one morning a call came in to the Spanish support line from a user who had arrived to work late and was unable to get access to her email server. The agent immediately started to drill into the specifics of the problem and realised that the user simply had no network connectivity to her email. The next step in the training says to establish whether the user actually has partial connectivity or a complete loss. The agent began with a simple IPCONFIG /ALL and noticed right away that the user had a 192.168.x.x IP address. This is quite an unusual thing to get on a call from an internal IBM user and the agent didn’t know what to do next and started to get some empirical data before escalating the issue. The key question was – are you the only user affected? The user confirmed that everyone around her was working away with no issues.
The team leader for the Spanish support desk picked up on the call and decided to call my team for some troubleshooting tips. I dropped over to the call and started listening in (which was useless as it was all in frickin’ Spanish) in the hopes of catching something “weird” from the call. The 192 address piqued my curiosity so I had the agent check for a statically assigned IP address…the XP based computer the user was operating was set to use DHCP. Hmmmm…
While this call really started to gain my interested I started hearing of other calls beginning to come in from other users in the same building with the same problem. The agents on those calls were able to confirm to me that these users were on different floors than the original user. So I now had a building on my hands that was slowly losing connectivity to these 192 addresses and the only possibility was a rogue DHCP server.
I suspected that the network topology and physical structure was about to play an important part in isolating the problem so I called up the onsite technicians and managed to get one who knew the building and the network inside out. Each floor of this 20+ floor building has a comms room where 24 / 48 port switches were used to supply each area of the floors. The best part was that this guy actually had a map of which ports were patched to which desks for every floor.
Now that I was firmly into Sherlock Holmes mode I asked the onsite guy to arrange some teams for me. For each of the know affected floors I needed a tech in the comms room and another testing computers. We had hatched a plan to start from the original floor that was affected by unpatching one switch at a time from the building network and doing a release / renew on a PC in that newly unpatched section to see if we got a 169.254.x.x address. If that happened then we knew that the rogue DHCP server was not in that specific section (clever eh? what do you mean no? well screw you, you werent’ there man…it was a warzone!). We repeated this pattern for five floors with no success so we expanded one floor up and one floor down. Eventually one of the techs ran the test and the PC picked up a new 192.168.x.x lease…..we had the root of the problem within our grasp and it was time to close the net (too much? I’m trying to make this sound all actiony….it my head it has AWESOME danger music).
The onsite guys managed to check every PC in the suspect floor area and the rogue server was still not found. They yanked the cable from every PC in the area and while the rest of the building was recovering, we knew that if we repatched this section that the problem would spread again. When all the PCs were disconnected, I asked the onsite guy to check the switch for activity and there was still one port showing traffic. Despite having all the PCs on the floor disconnect…the rogue was still operational. I questioned if there were any meeting rooms or offices on the floor and there was one. AHA! Upon closer inspection, the empty office had a laptop on the desk that was showing activity on the NIC lights. They yanked the cable and tested a PC on the floor…..169.254.x.x…SUCCESS. The switch was repatched to the building network and all of the PCs recovered. The technician I had called originally started to cackle maniacally over the phone. Perhaps it was better described as derisive laughter. Apparently the door to the office that housed the rogue DHCP laptop had a sign on it that read – IT Manager!!!
When we managed to get a full post mortem / lessons learned done it turned out that the IT Manager had arrived to the building about an hour after most users start work and half an hour prior to the arrival of the original caller to the Dublin support centre. So every user who worked normal hours had arrived to work and gotten a valid IP lease. Then the IT Manager showed up, connected his laptop and buggered off to a meeting. 192.168.x.x addresses started getting issued. At that point the original user arrives to work, gets a bad IP and calls the support desk. It turned out that over the weekend the IT Manager had enabled Internet Connection Sharing so that his daughter could get online through the broadband on the laptop from her home PC. He hibernated the laptop, forgot all about the ICS being enabled and just connected it up at work that morning without even thinking about it .
Sometimes, late at night….I can still hear that derisive laughter and it makes me sad when I think of all those IT Managers out there who can do stupid shit like this and yet retain their positions!
It just goes to show, that the methodical approach may not always be the fastest approach, but because it solves the problem every single time, it usually results in a faster resolution and a better understanding of what the issue was.