Outage post-mortem
A module on a network switch connecting our servers to the internet failed at 9:50pm EST. After several unsuccessful attempts to get the switch to operate normally, we replaced the switch module. Phanfare was available again at 10:59pm.
Upon analysis, we could have avoided the downtime by further replicating some of our network and firewall infrastructure so that traffic has multiple redundant network paths. We are evaluating the cost-effectiveness of adding the required hardware to avoid this type of outage in the future.
To give some perspective, being down one hour per month provides us with 99.86% uptime. That is not the 99.999% that Ma Bell used to provide, but it is pretty good. It is believed that Amazon, for example, strives for 99.99% uptime for its systems.
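For anyone who wants to check the arithmetic, here is a rough sketch in Python. It assumes a 30-day month (720 hours), and the function names are purely illustrative, not part of any Phanfare tooling.

```python
# Assumption: a "month" is 30 days, i.e. 720 hours.
HOURS_PER_MONTH = 30 * 24


def uptime_percent(downtime_hours, period_hours=HOURS_PER_MONTH):
    """Uptime as a percentage of the period, given hours of downtime."""
    return 100.0 * (1 - downtime_hours / period_hours)


def allowed_downtime_minutes(target_percent, period_hours=HOURS_PER_MONTH):
    """Minutes of downtime per period permitted by an uptime target."""
    return (1 - target_percent / 100.0) * period_hours * 60


if __name__ == "__main__":
    # One hour down per month
    print(f"1 hour down per month: {uptime_percent(1):.2f}% uptime")
    # Downtime budgets for the targets mentioned above
    for target in (99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.2f} minutes down per month")
```

Under the same 30-day assumption, a 99.99% target leaves a little over four minutes of downtime per month, and 99.999% leaves under half a minute.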
Sorry for the disruption in service!





