{"id":76,"date":"2013-04-26T13:34:46","date_gmt":"2013-04-26T20:34:46","guid":{"rendered":"http:\/\/fastlizard4.org\/blog\/?p=76"},"modified":"2013-04-28T22:46:05","modified_gmt":"2013-04-29T05:46:05","slug":"unexpected-downtime-postmortem-ridley-fastlizard4-org-26-april-2013","status":"publish","type":"post","link":"http:\/\/fastlizard4.org\/blog\/2013\/04\/26\/unexpected-downtime-postmortem-ridley-fastlizard4-org-26-april-2013\/","title":{"rendered":"Unexpected Downtime Postmortem: ridley.fastlizard4.org, 26 April 2013"},"content":{"rendered":"<p>Today, ridley (the server, not the villain from\u00a0<em>Metroid<\/em>) had its first major unexpected downtime incident. \u00a0Following is a reconstruction of events. \u00a0All times are given in <a title=\"Coordinated Universal Time - Wikipedia, the free encyclopedia\" href=\"https:\/\/en.wikipedia.org\/wiki\/Coordinated_Universal_Time\">UTC<\/a>.<em><br \/>\n<\/em><\/p>\n<ul>\n<li><strong>26 April 2013, approximately 18:01<\/strong> &#8211; All connections between ridley and the outside world mysteriously drop. \u00a0Icinga notices this and reports it as issues connecting to phazon for service and host checks.<\/li>\n<li><strong>18:05<\/strong> &#8211; The Icinga event log suddenly cuts, and I can only interpret this as the Linode server that hosts ridley (Fremont586) and other systems either crashing or shutting down. 
\u00a0Either way, ridley was hard-stopped.<\/li>\n<li><strong>18:06<\/strong> &#8211; I notice the issue and attempt to connect to ridley by SSH and eventually through <a title=\"LInode SHell\" href=\"https:\/\/library.linode.com\/troubleshooting\/using-lish-the-linode-shell\">LISH<\/a>, but to no avail &#8211; I can&#8217;t even contact the host server.<\/li>\n<li><strong>approximately 18:15<\/strong>\u00a0&#8211; I connect to the #linode IRC channel and discover that this is a large-scale problem affecting many, many customers with Linodes in Fremont.<\/li>\n<li><strong>18:20<\/strong> &#8211; Linode posts their first update to their <a title=\"Linode Status\" href=\"http:\/\/status.linode.com\/\">Status Blog<\/a>: &#8220;We&#8217;ve been alerted to a network connectivity issue affecting the Fremont facility at this time. We are currently in the process of investigating and will provide more information as it becomes available.&#8221;<\/li>\n<li><strong>18:28<\/strong> &#8211; I post the first details (<a title=\"Twitter \/ LizardWiki\" href=\"https:\/\/twitter.com\/LizardWiki\/status\/327851601362558976\">1<\/a>, <a title=\"Twitter \/ LizardWiki\" href=\"https:\/\/twitter.com\/LizardWiki\/status\/327851719197335553\">2<\/a>, <a title=\"Twitter \/ LizardWiki\" href=\"https:\/\/twitter.com\/LizardWiki\/status\/327851772087529472\">3<\/a>) to @LizardWiki&#8217;s Twitter.<\/li>\n<li><strong>approximately 18:30<\/strong> &#8211; Users in #linode report intermittent connectivity to their Fremont Linodes, but ridley is still completely unreachable &#8211; 100% packet loss from phazon since I first noticed my connections drop.<\/li>\n<li><strong>18:46<\/strong> &#8211; Pings start arriving from ridley to phazon. \u00a0Icinga reports that it cold-started at 18:46:29. \u00a0The server cold-boots, and according to the Linode Manager, the boot is &#8220;host initiated&#8221;<span style=\"line-height: 13px;\">. 
\u00a0I begin damage control.<\/span><\/li>\n<li><strong>18:50<\/strong> &#8211; I <a title=\"Twitter \/ LizardWiki\" href=\"https:\/\/twitter.com\/LizardWiki\/status\/327857225022443520\">report<\/a> on @LizardWiki&#8217;s Twitter that the server now appears to be up.<\/li>\n<li><strong>19:20<\/strong> &#8211; Linode Status reports, &#8220;We&#8217;re still working on resolving the connectivity issues being experienced at our Fremont facility. At this time there is no ETA for full resolution. Once more information is available we&#8217;ll be providing an update here.&#8221;<\/li>\n<li><b>19:45<\/b> &#8211; Linode Status finally reports, &#8220;The networking issue should be resolved at this time. If you continue to experience any problems please open a support ticket from within the Linode Manager.&#8221;<\/li>\n<li><strong>20:01<\/strong> &#8211; I <a title=\"Twitter \/ LizardWiki\" href=\"https:\/\/twitter.com\/LizardWiki\/status\/327874978601111552\">report<\/a>\u00a0on @LizardWiki&#8217;s Twitter that the server is stable and the downtime is over.<\/li>\n<\/ul>\n<div id=\"attachment_77\" style=\"width: 310px\" class=\"wp-caption alignright\"><a href=\"http:\/\/fastlizard4.org\/blog\/wp-content\/uploads\/2013\/04\/processes-day.png\"><img aria-describedby=\"caption-attachment-77\" data-attachment-id=\"77\" data-permalink=\"http:\/\/fastlizard4.org\/blog\/2013\/04\/26\/unexpected-downtime-postmortem-ridley-fastlizard4-org-26-april-2013\/processes-day\/\" data-orig-file=\"http:\/\/fastlizard4.org\/blog\/wp-content\/uploads\/2013\/04\/processes-day.png\" data-orig-size=\"497,364\" data-comments-opened=\"1\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"Munin &#8220;Processes By Day&#8221; Graph\" data-image-description=\"&lt;p&gt;Munin &#8220;Processes by day&#8221; graph.  Note the gap around 18:00 UTC indicating the downtime period when no data could be accounted (since the server was down).&lt;\/p&gt;\n\" data-image-caption=\"&lt;p&gt;Munin &#8220;Processes by day&#8221; graph.  Note the gap around 18:00 UTC indicating the downtime period when no data could be accounted (since the server was down).&lt;\/p&gt;\n\" data-medium-file=\"http:\/\/fastlizard4.org\/blog\/wp-content\/uploads\/2013\/04\/processes-day-300x219.png\" data-large-file=\"http:\/\/fastlizard4.org\/blog\/wp-content\/uploads\/2013\/04\/processes-day.png\" decoding=\"async\" loading=\"lazy\" class=\"size-medium wp-image-77\" alt=\"Munin &quot;Processes by day&quot; graph.  Note the gap around 18:00 UTC indicating the downtime period when no data could be accounted (since the server was down).\" src=\"http:\/\/fastlizard4.org\/blog\/wp-content\/uploads\/2013\/04\/processes-day-300x219.png\" width=\"300\" height=\"219\" srcset=\"http:\/\/fastlizard4.org\/blog\/wp-content\/uploads\/2013\/04\/processes-day-300x219.png 300w, http:\/\/fastlizard4.org\/blog\/wp-content\/uploads\/2013\/04\/processes-day.png 497w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><p id=\"caption-attachment-77\" class=\"wp-caption-text\">Munin &#8220;Processes by day&#8221; graph. 
Note the gap around 18:00 UTC indicating the downtime period when no data could be recorded (since the server was down).<\/p><\/div>\n<p>According to Pingdom, the downtime lasted a total of approximately 45 minutes, consistent with the figure shown to the right.<\/p>\n<p>There appears to have been no serious permanent damage caused by the downtime, although the server was hard-halted when the host system was shut down or crashed (unclear which). \u00a0There was some damage to the MySQL databases, but `mysqlcheck` handled that with no issues. \u00a0Munin had some issues with plugins on ridley after the reboot, but those have been fixed, as has a strange error with Rav3nZNC related to the ident module. \u00a0Gerrit was extremely slow to start up after the reboot, causing the control application to erroneously signal failure, but it did in fact start up correctly with no errors, and a restart of Gerrit confirmed this with an OK signal. \u00a0At this time, all services on ridley appear to be up and stable, and Icinga reports no problems with ridley. \u00a0 Although the downtime was caused by factors well beyond my (and probably everyone&#8217;s) control, I apologize for the inconvenience this has caused. \u00a0As always, if you have any questions, please feel free to post a comment here or contact me directly. \u00a0Users of ridley should contact me directly if they notice any damage to their files or any services not working properly.<\/p>\n<p><del>Linode has not yet posted a detailed postmortem, but this post will be updated if\/when one becomes available.<\/del><\/p>\n<p><strong>UPDATE, 5:45 29 April 2013 (UTC):<\/strong> Linode has released their postmortem on the situation, and you can find it, along with their timeline of events, <a title=\"RESOLVED - Fremont Connectivity Issues\" href=\"http:\/\/status.linode.com\/2013\/04\/fremont-connectivity-issues.html\">here<\/a>. 
\u00a0In addition, I received a support ticket in my Linode manager that ridley&#8217;s host was rebooted for emergency maintenance, explaining the cold boot of ridley.<\/p>\n<p>Thank you for flying LizardNet!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today, ridley (the server, not the villain from\u00a0Metroid) had its first major unexpected downtime incident. \u00a0Following is a reconstruction of events. \u00a0All times are given in UTC. 26 April 2013, approximately 18:01 &#8211; All connections between ridley and the outside world mysteriously drop. 
\u00a0Icinga notices this and reports it as issues connecting to phazon for <a href=\"http:\/\/fastlizard4.org\/blog\/2013\/04\/26\/unexpected-downtime-postmortem-ridley-fastlizard4-org-26-april-2013\/\"><b>&#8230;Read the Rest<\/b><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":[]},"categories":[16],"tags":[19,18,17],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1rJy3-1e","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"http:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/posts\/76"}],"collection":[{"href":"http:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/comments?post=76"}],"version-history":[{"count":5,"href":"http:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/posts\/76\/revisions"}],"predecessor-version":[{"id":82,"href":"http:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/posts\/76\/revisions\/82"}],"wp:attachment":[{"href":"http:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/media?parent=76"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/categories?post=76"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/fastlizard4.org\/blog\/wp-json\/wp\/v2\/tags?post=76"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}