UPDATE 11:10 3 June 2015 UTC: The Linode status page regarding the datacenter failure incident has been updated with the official postmortem, including the RFO (Reason For Outage) from the colocation provider (Hurricane Electric).  As noted below in the original post, that status page can be found here.


 

UPDATE 00:00 31 May 2015 UTC: LizardNet Minecraft server c1 has been successfully restored from a 6-day-old backup, and all LizardNet Minecraft servers are now fully operational.  Likewise, I have re-enabled job execution on Jenkins, so a round of web map generation should begin shortly.


 

At around 01:30 UTC on 30 May 2015, Hurricane Electric’s sjc2 (also known as fmt2) datacenter in Fremont, California, suffered a critical power failure.  This coincided with the failure of one of the datacenter’s standby emergency generators, causing a power outage that brought down core networking hardware as well as most, if not all, of Linode’s servers in the datacenter (along with equipment belonging to other customers).  This likewise brought down all three of LizardNet’s servers (ridley, phazon, and minecraft1), as they are all hosted in the same datacenter for architectural reasons.

According to Linode’s incident report (which is still being updated as of this writing):

We have received word from our colocation provider that there has been a power event in a section of the Fremont datacenter. The affected space is where a critical part of the datacenter network is located….

At approximately 6:30PM PDT [01:30 AM UTC] , the Fremont datacenter experienced a power utility outage. One out of eight generators also experienced an electromechanical failure.

Despite initial estimates that service would be restored by 04:30 UTC, service was not actually restored until later, with servers coming up between 06:15 and 06:30 UTC, for a total downtime of about five hours.  In addition, it quickly became apparent that the servers had suffered hard shutdowns, despite some indications that the power failure was limited to networking equipment only.
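For anyone who wants to double-check the timeline arithmetic, here is a minimal sketch in Python.  It assumes PDT is a fixed UTC-7 offset and takes 06:30 UTC (the end of the range given above) as the point at which servers were back; the variable names are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT modelled as a fixed UTC-7 offset (no tz database lookup).
PDT = timezone(timedelta(hours=-7), "PDT")

# 6:30 PM PDT on 29 May 2015, as quoted in Linode's incident report.
outage_start_local = datetime(2015, 5, 29, 18, 30, tzinfo=PDT)
outage_start_utc = outage_start_local.astimezone(timezone.utc)

# Assumption: use the later end of the 06:15-06:30 UTC recovery range.
servers_back_utc = datetime(2015, 5, 30, 6, 30, tzinfo=timezone.utc)

downtime = servers_back_utc - outage_start_utc

print("Outage began (UTC):", outage_start_utc.isoformat())
print("Total downtime:", downtime)
```

This confirms that 6:30 PM PDT on 29 May corresponds to 01:30 UTC on 30 May, and that the gap to 06:30 UTC is five hours.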

Since then, I have been working to restore LizardNet services.  LizardNet Minecraft seems to have suffered the most from this: all of the CraftBukkit-powered servers (c1, s1, and s2) suffered map data corruption due to the hard shutdown.  s1 and s2 have since been restored from copies made on phazon by the LizardNet Minecraft web maps generator.  c1, unfortunately, as an unmapped server, has no recent backups on phazon, so I have opted to spin up a new Linode server to act as a temporary target for restoring ridley’s Linode Backups, from which I will grab c1’s data for restoration.  The backup restore is in progress as of this writing, and this blog post will be updated when c1’s status changes.  The non-CraftBukkit servers (c2, m1, and m2) seem to have come out of this unscathed and are operating normally.  All Minecraft servers except c1 are currently up and running.

LizardIRC was extremely hard hit by this outage, which is ironic as the planned merge of the JDNet IRC network into LizardIRC is upcoming and in the preparation stages.  Before continuing, it is worth noting that the merge, once complete, will add servers in different datacenters and geographical locations to the LizardIRC network, so this should never happen again to the IRC network even if another event like this one occurs.  Because LizardIRC currently consists of only two IRC servers running on phazon and ridley, the entire network was taken offline by the datacenter outage.  Since the servers were shut down hard, all IRCd state information was lost.  This effectively means that all data related to unregistered IRC channels (with the exception of ChanFix records) and all non-ChanServ-stored data related to registered IRC channels (for example, modes not set using /cs SET MLOCK, including bans and invexes) have been lost.  Note that if you registered your channel, all data saved in ChanServ, such as the channel topic and the access lists, is still there, and that it appears no services data (such as NickServ nickname registrations) has been lost or corrupted.  I would like to again emphasize at this point that we are in the process of adding new servers to the network (outside of the HE sjc2 datacenter) to prevent this from happening again; indeed, we were in the process of doing this before the disaster tonight.  Furthermore, it’s also worth noting that this was an extraordinary disaster, a veritable one-in-a-million event that, due to bad luck or fate, happened to occur.

As for the servers themselves, checks with fsck and tripwire seem to indicate that there was no corruption to any critical system files.  Some extremely minor corruption was noted in the syslog file and other system log files, but no data loss.  All services, with the exception of LizardNet Minecraft server c1 as noted above, have been restored as of this writing, and there is no indication of corruption occurring elsewhere on the system.  All MySQL databases on all servers have been verified as being in the “OK” state.  However, if you have SSH (shell) access to phazon or ridley, please take some time in the next couple of days to examine any critical files you have in your home directory and ensure everything is running smoothly.  If you notice any strange behavior or data corruption, contact me immediately for investigation and restoration.  Since Linode Backups are cycled out after a period of time, one week is the maximum time I can guarantee that pre-downtime backups will be available.
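If you would like a starting point for checking your home directory, the following is a rough sketch in Python, not an official LizardNet tool.  It assumes that the files most at risk are those that are empty or that were last modified in the hour before the power loss; the function name, the one-hour window, and the reason labels are all my own illustrative choices.  Keep in mind that empty files are often perfectly legitimate, so treat the output as a list of things to look at rather than a list of damaged files.

```python
import os
from datetime import datetime, timezone

# Power was lost at around 01:30 UTC on 30 May 2015 (from the post above).
OUTAGE_START = datetime(2015, 5, 30, 1, 30, tzinfo=timezone.utc)
# Assumption: files written in the hour before the outage deserve a look.
WINDOW_SECONDS = 3600


def find_suspect_files(root, outage_start=OUTAGE_START, window=WINDOW_SECONDS):
    """Return a sorted list of (path, reason) for files worth inspecting."""
    outage_ts = outage_start.timestamp()
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                # Covers broken symlinks and files that cannot be stat'ed.
                suspects.append((path, "unreadable"))
                continue
            if st.st_size == 0:
                suspects.append((path, "empty"))
            elif outage_ts - window <= st.st_mtime <= outage_ts:
                suspects.append((path, "modified just before outage"))
    return sorted(suspects)


if __name__ == "__main__":
    for path, reason in find_suspect_files(os.path.expanduser("~")):
        print(f"{reason}: {path}")
```

Run as a script, it walks your home directory and prints one line per flagged file.  Anything it reports should be opened and checked by hand; a file that was rewritten after the outage will no longer match the time window, so this does not replace looking over your critical files directly.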

About the author

Amateur radio operator (Technician-class), motorcycle rider, Wikipedia editor, computer programmer and tech, sysadmin, photographer, laser enthusiast, Minecraft addict and server operator.


Unexpected downtime post-mortem: All servers, 30 May 2015 / LizardBlog by FastLizard4 is licensed under an Attribution-ShareAlike CC BY-SA license.