Views from Phanfare CEO and Co-founder Andrew Erlichson

Danger data loss gives hosted services a bad name

Last week we learned that Danger, a subsidiary of Microsoft, has lost huge amounts of customer data. Danger makes the Sidekick smartphone and offers a service that synchronizes the phone's data (contacts, photos, etc.) to hosted servers, a.k.a. the cloud. The critics wanted to know “why was there no backup?” And of course then there was the inevitable refrain that if you want to keep your data, you should be backing it up yourself and not relying on cloud services.

I think this is entirely wrong. Using a cloud service should free the consumer from having to do backup. Most times when you use a cloud service, backup is not even possible. How do you back up your Gmail account? How about your Facebook account?

The whole reason to use a hosted service is to free you of having to deal with the muck of running your own servers and doing backup. It makes sense precisely because building a reliable service is so difficult.

I am not on the inside at Danger, but I think I know how this happened. While people think that Danger lost a lot of data, the truth is, they lost very little. The type of data they hosted (contacts, text emails) is small compared to photo and video data, of which they had relatively little. My guess is that they lost under 10 terabytes of data. It might have been under a terabyte. And when you don’t host huge amounts of data, you might be tempted to just put it on RAID’ed servers and try to do nightly backups. Turns out, it’s very hard not to lose all your data when you use RAID.

RAID pretty much requires you to run nightly backups or rely on a proprietary replication scheme. RAID is sold as being completely reliable but anyone who has used RAID knows this is far from true. Double disk failures are more common than expected, especially when drives are from the same lot. Sometimes you lose the whole RAID chain. Replacement of disks is a manual process and sometimes people replace the wrong disk. Corruption of a RAID volume is not unheard of.
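To make the double-failure point concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name, the linear failure model, the example failure rate, and the `same_lot_factor` multiplier that stands in for correlated failures among drives from the same lot. It is not a model of any particular RAID product.

```python
def second_failure_probability(disks, annual_failure_rate, rebuild_hours,
                               same_lot_factor=1.0):
    """Estimate the chance that another disk fails while one is rebuilding.

    Assumes each surviving disk fails independently, with a failure
    probability that grows linearly over the rebuild window.
    same_lot_factor is a hypothetical multiplier (> 1) for drives that
    share a manufacturing lot and therefore tend to fail together.
    """
    hours_per_year = 8760
    per_disk = annual_failure_rate * same_lot_factor * rebuild_hours / hours_per_year
    per_disk = min(per_disk, 1.0)
    survivors = disks - 1
    # Probability that at least one surviving disk fails during the rebuild.
    return 1 - (1 - per_disk) ** survivors


independent = second_failure_probability(8, 0.03, 24)
same_lot = second_failure_probability(8, 0.03, 24, same_lot_factor=10)
print(f"independent drives: {independent:.4%}")
print(f"same-lot drives:    {same_lot:.4%}")
```

Under these assumed numbers the risk looks tiny when failures are independent, and grows roughly in proportion to the correlation factor, which is one way to read "more common than expected."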

Then there is the backup window, which becomes longer each day until you finally start spending more hours backing up the data than there are hours in the day (been there, done that). And when backups are occurring, the performance of the RAID is significantly degraded. In sum, RAID does not scale. And any service that gets large enough eventually abandons RAID for some distributed solution that scales better, and coincidentally, is a lot more resistant to losing all the data at once.
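The backup-window problem is simple arithmetic. The sketch below assumes a full backup every night, steady linear data growth, and fixed backup throughput; the function name and all the numbers are made up for illustration.

```python
def first_day_backup_exceeds_24h(initial_gb, daily_growth_gb, throughput_gb_per_hour):
    """Return the first day on which a full backup no longer fits in 24 hours.

    Assumes the whole data set is copied each night at a constant rate
    and that the data set grows by a fixed amount per day.
    """
    if daily_growth_gb <= 0:
        raise ValueError("daily_growth_gb must be positive")
    # The most data that can be copied in one day at this throughput.
    max_gb_per_day = 24 * throughput_gb_per_hour
    day = 0
    size_gb = initial_gb
    while size_gb <= max_gb_per_day:
        day += 1
        size_gb = initial_gb + day * daily_growth_gb
    return day


# Hypothetical service: 1,000 GB today, growing 50 GB/day, backed up at 100 GB/hour.
print(first_day_backup_exceeds_24h(1000, 50, 100))
```

With those assumed figures the backup fits exactly on day 28 and overruns on day 29. Faster hardware pushes the date out but does not change the shape of the problem.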

Phanfare uses Amazon S3 for storage of photos and videos. Amazon is fairly vague on how it works, and we are under NDA, but the basic story is that it works much like other modern distributed file systems. It keeps multiple copies on multiple servers, geographically distributed, and has a scheme for replicating data when it programmatically detects that a copy of an object has been lost.
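The general idea of "keep several copies and re-replicate when one goes missing" can be shown with a toy in-memory model. This is emphatically not how S3 is implemented (those details are not public); the class, its methods, and the replica count of 3 are all hypothetical.

```python
import random


class ToyObjectStore:
    """Toy model of a replicated store. Not S3; every name here is invented."""

    def __init__(self, node_names, replicas=3, seed=0):
        self.nodes = {name: {} for name in node_names}
        self.replicas = replicas
        self.rng = random.Random(seed)

    def put(self, key, data):
        # Write the object to `replicas` distinct nodes.
        for name in self.rng.sample(sorted(self.nodes), self.replicas):
            self.nodes[name][key] = data

    def holders(self, key):
        return [name for name, objs in self.nodes.items() if key in objs]

    def fail_node(self, name):
        # Simulate a server losing everything it stored.
        self.nodes[name] = {}

    def repair(self):
        """Copy under-replicated objects to other nodes; return copies made."""
        keys = {key for objs in self.nodes.values() for key in objs}
        copies_made = 0
        for key in keys:
            have = self.holders(key)
            missing = self.replicas - len(have)
            if missing <= 0:
                continue
            data = self.nodes[have[0]][key]
            spare = [name for name in self.nodes if name not in have]
            for name in spare[:missing]:
                self.nodes[name][key] = data
                copies_made += 1
        return copies_made


store = ToyObjectStore(["n1", "n2", "n3", "n4", "n5"])
store.put("photo.jpg", b"...")
store.fail_node(store.holders("photo.jpg")[0])
print(len(store.holders("photo.jpg")), "copies after failure")
print(store.repair(), "copy re-created")
```

The point of the toy is the contrast with RAID plus nightly backup: losing one server costs one copy, and the system heals itself without waiting for a restore.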

As such, there are no backups of Phanfare. Yup, that’s right. We don’t back up the image and video data. It’s on Amazon S3 and that system uses an approach to persistence that is fundamentally different from the approach that bit Danger in the you know what.

Truth is, backups serve two purposes in a modern system. They do help assure that you don’t lose data to a system problem. And they serve as a checkpoint against the human error of deliberately deleting data.

The problem with S3 is, when you give it the command to delete a file, it gets deleted, reliably. There is no going back to last night’s checkpoint. To combat this issue, we don’t really run deletes when end users delete their images. We wait a while. And we have a trash can system to make absolutely sure you want to delete data. Waiting on the deletes is really to protect against a systemic failure on our part (rogue code that deletes files).
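A deferred-delete scheme of this kind might look roughly like the sketch below. It is an illustration only, not Phanfare's code: the class name, the `backend_delete` callback (standing in for whatever actually removes the object from storage), and the grace period are all assumptions.

```python
import time


class DeferredDeleter:
    """Hold deletes in a trash can and only run them after a grace period."""

    def __init__(self, backend_delete, grace_seconds):
        self.backend_delete = backend_delete  # hypothetical callable: key -> None
        self.grace_seconds = grace_seconds
        self.trash = {}  # key -> time the user asked for deletion

    def delete(self, key, now=None):
        # Record the request; nothing is removed from storage yet.
        self.trash[key] = time.time() if now is None else now

    def restore(self, key):
        # Undo a pending delete. Returns True if the key was in the trash.
        return self.trash.pop(key, None) is not None

    def purge(self, now=None):
        # Run the real deletes for everything past the grace period.
        now = time.time() if now is None else now
        expired = [k for k, t in self.trash.items()
                   if now - t >= self.grace_seconds]
        for key in expired:
            self.backend_delete(key)
            del self.trash[key]
        return expired


really_deleted = []
deleter = DeferredDeleter(really_deleted.append, grace_seconds=7 * 24 * 3600)
deleter.delete("album/1.jpg", now=0)
deleter.delete("album/2.jpg", now=0)
deleter.restore("album/2.jpg")          # the user changed their mind
deleter.purge(now=8 * 24 * 3600)        # eight days later
print(really_deleted)
```

Because the real delete runs only in `purge`, a bug that issues a burst of bad deletes leaves a window in which they can still be undone.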

We still use some RAID storage at Phanfare for some relational database systems holding metadata. The web service caches this data using memcache. (Another rule of large scale systems is that relational databases don’t scale either.) At some point, we will scale past being able to use RAID and caching for that. Until that point, we do have to perform old school backups of the relational database to a secondary data center. And I worry a lot more about those than I do about the image and video data at Amazon S3.
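Caching database reads usually follows a cache-aside pattern: check the cache first, fall back to the database on a miss, and store the result. The sketch below uses a plain dictionary in place of memcache and a hypothetical `load_from_db` function in place of a real query, so it shows the pattern rather than any actual Phanfare code.

```python
class MetadataCache:
    """Cache-aside reads. A dict stands in for memcache in this sketch."""

    def __init__(self, load_from_db):
        self.load_from_db = load_from_db  # hypothetical callable: key -> value
        self.cache = {}
        self.db_reads = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]        # cache hit: the database is not touched
        self.db_reads += 1
        value = self.load_from_db(key)    # cache miss: read from the database
        self.cache[key] = value
        return value

    def invalidate(self, key):
        # Call after a write so the next read goes back to the database.
        self.cache.pop(key, None)


fake_db = {"album:42": {"title": "Vacation", "photos": 120}}
albums = MetadataCache(fake_db.__getitem__)
albums.get("album:42")
albums.get("album:42")
print(albums.db_reads, "database read for two lookups")
```

The cache takes read load off the database, but it does nothing for durability, which is why the relational data still needs conventional backups.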

The whole Danger incident sends the wrong message. Companies are much better than consumers at keeping data reliably. That Danger dropped the ball should not indict the whole industry. Instead, consumers should demand that companies be more transparent about their approaches to keeping data reliably.

In recognition that the ultimate risk is always that you do not know all the risks, we also offer a DVD subscription service that returns your data to you incrementally over time, automatically, so that both we and you have it.

  • Brad Murray

    I wouldn't call S3 geographically distributed. They have east (DC – I believe) and west (Seattle) clusters of datacenters (and at least one in Europe if you're willing to pay more to store your stuff). If your request routes to the west coast it is stored only there (in multiple local datacenters). I was under the delusion that they kept at least 2 copies where it's uploaded and sent another one to the other side of the country, but that is not the case. Granted – a large-scale terrorist attack in the Seattle area would probably cripple Amazon anyway, but it would have a high likelihood of destroying a lot of data and I think a lot of their customers would be shocked.

  • erlichson

    I don’t know the details but my recollection, from the days where I worked in banking, is that most institutions try to keep data centers approximately 20 miles away from each other to satisfy the requirements of disaster recovery. Amazon considers the exact details of their implementation to be highly proprietary and does not disclose them to the public to my knowledge.

  • Brad Murray

    They don't give the details, but they have confirmed to me that they don't keep S3 objects on both coasts. The only way to do that is to store multiple copies in S3 yourself and make sure that the transfers are routed correctly to get the objects on the correct coast. They may have made the destination configurable in the request, but I haven't seen it.

  • rlieving

    The most likely risk of losing data at Phanfare is business risk, not the risk of terrorist attack or technical failure. I like your analysis about what happened at Microsoft/Danger and agree that it might send the wrong signals about the trustworthiness of the cloud.

    On the other hand, I think it is irresponsible to imply that a person should not make multiple backups of their data. In the age of large, blue chip, publicly traded companies going out of business, it isn't unreasonable to think that the same thing could happen to a small company like Phanfare or even a big company like Amazon.

    I personally keep a copy of my pictures on my hard drive, back up to JungleDisk AND use Phanfare. As Brad pointed out below, I have a slight redundancy risk in that JungleDisk and Phanfare use Amazon as their persistence method. But I think the risk is minimal that S3 AND the vendors I deal with will all go out of business at the same time that I have a hard drive failure. And if that does happen, perhaps the zombies will be rising out of the ground and I will have bigger issues to deal with.

    Here's the lesson I took away from all this: my analysis that the Sidekick is weak was correct. Get an iPhone, Nokia, Blackberry (or even Windows Mobile) and personally back up your data.

  • erlichson

    I agree that business risk is a greater risk than technology when storing data at Phanfare, but I don’t think the business risk is much of a risk because we would need to disappear overnight, which is not going to happen. Even if for some reason Phanfare were to discontinue the service, that event would not happen overnight. There would be time to get back your data.

    I will grant you, however, that retrieving your data from any online service all at once can be difficult. If your collection is over 10 GB, you would likely find it easier to have us send it on DVD, which we offer today.

    Personally, I like our DVD subscription service. That way you have it and we have it, and you get it back in a form that benefits from the organizational effort you expended with your Phanfare account.

  • Desktop Hard Drives

    I strongly recommend that you back up all your important files. This is the only way to retrieve them.

© 2007-8 Phanfare, Inc.