{"id":111,"date":"2015-05-30T03:48:42","date_gmt":"2015-05-30T10:48:42","guid":{"rendered":"http:\/\/fastlizard4.org\/blog\/?p=111"},"modified":"2015-06-03T04:11:16","modified_gmt":"2015-06-03T11:11:16","slug":"unexpected-downtime-post-mortem-all-servers-30-may-2015","status":"publish","type":"post","link":"https:\/\/fastlizard4.org\/blog\/2015\/05\/30\/unexpected-downtime-post-mortem-all-servers-30-may-2015\/","title":{"rendered":"Unexpected downtime post-mortem: All servers, 30 May 2015"},"content":{"rendered":"<p>&nbsp;<\/p>\n<p><strong>UPDATE 11:10 3 June 2015 UTC:<\/strong> The Linode status page regarding the datacenter failure incident has been updated with the official postmortem, including the RFO (Reason For Outage) from the colocation provider (Hurricane Electric). \u00a0As noted below in the original post, that status page can be found\u00a0<a href=\"http:\/\/status.linode.com\/incidents\/2rm9ty3q8h3x\">here<\/a>.<\/p>\n<hr \/>\n<p>&nbsp;<\/p>\n<p><strong>UPDATE 00:00 31 May 2015 UTC:<\/strong> LizardNet Minecraft server c1 has been successfully restored from a 6-day-old backup, and all LizardNet Minecraft servers are now fully operational. \u00a0Likewise, I have re-enabled job execution on Jenkins, so a round of webmap generation should begin shortly.<\/p>\n<hr \/>\n<p>&nbsp;<\/p>\n<p>At around 01:30 UTC on 30 May 2015, Hurricane Electric&#8217;s sjc2 (also known as fmt2) datacenter in Fremont, California, suffered a critical power failure. \u00a0This coincided with the failure of one of the datacenter&#8217;s standby emergency generators, which caused a power outage that brought down core networking hardware as well as most, if not all, of Linode&#8217;s servers in the datacenter (along with other affected customers). 
\u00a0This likewise brought down all three of LizardNet&#8217;s servers (ridley, phazon, and minecraft1), as they are all hosted in the same datacenter for architectural reasons.<\/p>\n<p>According to Linode&#8217;s\u00a0<a href=\"http:\/\/status.linode.com\/incidents\/2rm9ty3q8h3x\">incident report<\/a>\u00a0(which is still being updated as of this writing):<\/p>\n<p style=\"padding-left: 30px;\"><span class=\"Apple-style-span\"><span class=\"Apple-style-span\">We have received word from our colocation provider that there has been a power event in a section of the Fremont datacenter. The affected space is where a critical part of the datacenter network is located&#8230;.<\/span><\/span><\/p>\n<p style=\"padding-left: 30px;\">&#8230;<span class=\"Apple-style-span\"><span class=\"Apple-style-span\">At approximately 6:30PM PDT [01:30 AM UTC] , the Fremont datacenter experienced a power utility outage. One out of eight generators also experienced an electromechanical failure.<\/span><\/span><\/p>\n<p>Despite initial estimates that service would be restored by 04:30 UTC, service restoration occurred with servers coming up between 06:15 and 06:30 UTC, for a total downtime of about five hours. \u00a0In addition, it quickly became apparent that the servers had suffered hard shutdowns, despite some indications that the power failure was limited to networking equipment only.<\/p>\n<p>Since then, I have been working to restore LizardNet services. \u00a0LizardNet Minecraft seems to have suffered the most from this &#8211; all of the CraftBukkit-powered servers (c1, s1, and s2) suffered map data corruption due to the hard shutdown. \u00a0s1 and s2 have since been restored from copies made on phazon by the\u00a0<a href=\"https:\/\/mcmaps.fastlizard4.org\">LizardNet Minecraft web maps generator<\/a>. 
\u00a0c1, unfortunately, as an unmapped server, has no recent backups on phazon, so I have opted to spin up a new Linode server to act as a temporary target for restoring ridley&#8217;s Linode Backups, from which I will grab c1&#8217;s data for restoration. \u00a0The backup restore is in progress as of this writing, and this blog post will be updated when c1&#8217;s status changes. \u00a0The non-CraftBukkit servers (c2, m1, and m2) seem to have come out of this unscathed and are operating normally. \u00a0All Minecraft servers except c1 are currently up and running.<\/p>\n<p>LizardIRC was extremely hard hit by this outage, which is ironic as the <a href=\"https:\/\/fastlizard4.org\/wiki\/LizardIRC-JDNet_network_merge\">planned merge of the JDNet IRC network into LizardIRC<\/a>\u00a0is upcoming and in the preparation stages. \u00a0Before continuing, it is worth noting that the merge, once complete, will add servers in different datacenters and geographical locations to the LizardIRC network, so this should never happen again to the IRC network even if another event like this one occurs. \u00a0Because LizardIRC currently consists of only two IRC servers running on phazon and ridley, the entire network was taken offline by the datacenter outage. \u00a0Since the servers were shut down hard, all IRCd state information was lost. \u00a0This effectively means that all data related to\u00a0unregistered\u00a0IRC channels (with the exception of ChanFix records) and all non-ChanServ-stored data related to registered IRC channels (for example, modes not set using <code>\/cs SET MLOCK<\/code>, including bans and invexes) have been lost. \u00a0Note that if you registered your channel, all data saved in ChanServ, such as the channel topic and the access lists, is still there, and that it appears no services data (such as NickServ nickname registrations) has been lost or corrupted. 
\u00a0I would like to again emphasize at this point that we are in the process of adding new servers to the network (outside of the HE sjc2 datacenter) to prevent an\u00a0occurrence\u00a0like this from happening again; indeed, we were in the process of doing this before the disaster tonight. \u00a0Furthermore, it&#8217;s also worth noting that this was an extraordinary disaster,\u00a0a veritable one-in-a-million event that, due to bad luck or fate, happened to occur.<\/p>\n<p>As for the servers themselves, checks with fsck and tripwire seem to indicate that there was no corruption of any critical system files. \u00a0Some extremely minor corruption was noted in the syslog file and other system log files, but no data loss. \u00a0All services, with the exception of LizardNet Minecraft server c1 as noted above, have been restored as of this writing, and there is no indication of corruption\u00a0occurring\u00a0elsewhere on the system. \u00a0All MySQL databases on all servers have been verified as being in the &#8220;OK&#8221; state. \u00a0However, if you have SSH (shell) access to phazon or ridley, please take some time in the next couple of days to examine any critical files you have in your home directory and ensure everything is running smoothly. \u00a0If you notice any strange behavior or data corruption, contact me immediately for investigation and restoration. 
\u00a0Since Linode Backups are cycled out after a period of time, one week is the maximum time I can guarantee that pre-downtime backups will be available.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>&nbsp; UPDATE 11:10 3 June 2015 UTC: The Linode status page regarding the datacenter failure incident has been updated with the official postmortem, including the RFO (Reason For Outage) from the colocation provider (Hurricane Electric). \u00a0As noted below in the original post, that status page can be found\u00a0here. 
&nbsp; UPDATE 00:00 31 May 2015 UTC: <a href=\"https:\/\/fastlizard4.org\/blog\/2015\/05\/30\/unexpected-downtime-post-mortem-all-servers-30-may-2015\/\"><b>&#8230;Read the Rest<\/b><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":[]},"categories":[16],"tags":[19,20,17],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1rJy3-1N","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/posts\/111"}],"collection":[{"href":"https:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fastlizard4.org\/blog\/wp-json
\/wp\/v2\/comments?post=111"}],"version-history":[{"count":4,"href":"https:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/posts\/111\/revisions"}],"predecessor-version":[{"id":115,"href":"https:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/posts\/111\/revisions\/115"}],"wp:attachment":[{"href":"https:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/media?parent=111"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/categories?post=111"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/tags?post=111"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}