Views from Phanfare CEO and Co-founder Andrew Erlichson

Link Outage post-mortem

A module on a network switch connecting our servers to the internet failed at 9:50pm EST. After several attempts to get the switch to operate normally, the switch module was replaced. Phanfare was available again at 10:59pm.

Upon analysis, we could have avoided the downtime by replicating more of our network and firewall infrastructure so that there are multiple redundant network paths. We are evaluating the cost effectiveness of adding the required hardware to avoid this type of outage in the future.

To give some perspective, being down one hour per month works out to 99.86% uptime (1 hour out of the roughly 720 in a 30-day month). That is not the 99.999% that Ma Bell used to provide, but it is pretty good. It is believed that Amazon, for example, strives for 99.99% uptime for their systems.
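For readers who want to check the arithmetic, here is a minimal sketch of the calculation. It assumes a 30-day month, and the function names are made up for illustration; it is not anything we run in production.

```python
# Assumption: a "month" is 30 days, i.e. 720 hours.
HOURS_PER_MONTH = 30 * 24


def uptime_percent(downtime_hours: float, period_hours: float) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_minutes(uptime_pct: float, period_hours: float) -> float:
    """Minutes of downtime a given uptime percentage permits in the period."""
    return (1.0 - uptime_pct / 100.0) * period_hours * 60.0


if __name__ == "__main__":
    # One hour down in a 30-day month.
    print(round(uptime_percent(1, HOURS_PER_MONTH), 2))  # 99.86

    # Monthly downtime budget for the targets mentioned in the post.
    for target in (99.86, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target, HOURS_PER_MONTH)
        print(f"{target}% uptime allows about {minutes:.2f} minutes down per month")
```

Under the same assumption, four nines leaves a budget of a little over four minutes per month, and five nines leaves under half a minute.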

Sorry for the disruption in service!

  • anon

    “strives for” being the key word — they aren’t hitting that. 99.99 is very hard actually.

  • Brad

    4 nines is a waste of time and money for a photo site. If you are getting 3 you are knocking it out of the park. As long as you aren’t losing data (other than any potential losses of new data during the downtime), 2 nines is plenty for me. Most downtime is scheduled for off-hours anyway.

  • Ken Donoghue

    Customer satisfaction and service availability are understandably Andrew’s concerns (kudos, BTW, for explaining the outage). Brad’s input, as a user of Phanfare, is worth considering though, because it’s users who often define what an acceptable level of availability is for an application or service. The gap between the two is not nearly as wide as it once was (and usually the service expectations are reversed! Kudos to Brad). A quick web search will yield a number of solutions that are neither complex nor expensive and can deliver up to four nines … some better and some with virtualization included. This end of the availability spectrum is changing at an incredible pace, for the better.

  • Andrew Erlichson

    We are most concerned about the durability of the data versus the availability. Our data is stored at Amazon S3, and they provide excellent durability and very good availability.

    We actually don’t need Amazon to be up to serve most of our traffic. In fact, during the big Amazon S3 outage, most Phanfare users were not aware of the issues.

  • ricky

    Comments were being made such as: “and this is why you have to setup a fail-safe”, “the s3 service is great but this just proves you can’t rely on it”, “My business is effectively closed right now because Amazon did something wrong. I’ll have to reconsider using the service now.”, “Errors happen, but there MUST be a fail-safe way of reporting them.”

  • Brad

    If you are using Amazon and are serving up critical information then you have to at least have your own cache if not your own complete copy. That can be prohibitively expensive, especially for something like personal photos where durability is far more important than dealing with a few hours of outage once or twice a year.

  • Andrew Erlichson

    We actually do cache the recent stuff on our own servers so that Amazon can be down and Phanfare can still be mostly up. We cache web-sized photos, and that is only 10% of the data. We can actually cache the entire Phanfare data set in web-sized renditions at just 10% of the storage cost of storing the originals; less, in fact, because our cache does not need to be durable (replicated across datacenters and geographies).

  • Brad

    So you take uploads into your datacenter, process them and then send them to Amazon? It sounds like you are paying for upload bandwidth 3 times: customer to you, and you to Amazon twice (once to your bandwidth provider and once to Amazon). Or do you only do that when Amazon is offline?

  • Andrew Erlichson

    We pay for the bandwidth 2x in our mind. Once when it hits our datacenter and once when we move it to Amazon. Our datacenter, like many, is more focused on how many bytes you serve than how many bytes you absorb since the bulk of the consumed bandwidth is outbound.

© 2007-8 Phanfare, Inc.