Friday, December 14, 2012

Cloud + Local imagery storage

The Department of Homeland Security has said that it wants imagery delivered in the cloud. Several counties and states have expressed the same desire. Amazon S3 prices are quite reasonable for delivering final imagery data, especially compared to the outrageous prices that imagery vendors have been charging for what amounts to a well-backed-up on-site web server with static content. I've heard of hundreds of thousands of dollars for serving tens of terabytes.

Everyone wants the reliability, size, and performance scalability of cloud storage. No matter how popular your imagery becomes (e.g. you happen to have the only pre-Sandy imagery of Ocean City), and no matter how crappy your local internet link happens to be, folks all around the world can see your imagery. And many customers are more confident in Amazon or Google backups than their local IT backups.

But everyone also wants the reliability of local storage. E911 services have to stay up even when the internet connection goes down. So they need their own local storage. This also helps avoid some big bandwidth bills on the one site you know is going to hammer the server all the time.

So really, imagery customers want their data in both places. But that presents a small problem: you want to ensure that the data in both stores is the same. This is the cache consistency problem. If you have many writers frequently updating your data and need transaction semantics, this problem forces an expensive solution. But if, like most imagery consumers, you have an imagery database that is updated every couple of months by one of a few vendors, with no contention and no need for transaction semantics, then you don't need an expensive solution.
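For the infrequent-update case, consistency checking can be as simple as comparing checksums of the two copies. A minimal sketch (the function names are mine, and it assumes single-part S3 uploads, where the object's ETag is just the MD5 of its contents; multipart uploads break that assumption):

```python
import hashlib


def local_md5(data: bytes) -> str:
    """MD5 hex digest of a local object's bytes."""
    return hashlib.md5(data).hexdigest()


def in_sync(local_data: bytes, s3_etag: str) -> bool:
    """True if the local copy matches the S3 copy.

    S3 returns the ETag wrapped in double quotes, so strip them
    before comparing against the local digest.
    """
    return local_md5(local_data) == s3_etag.strip('"')
```

Run that over each tile after a vendor update lands and you know whether the two stores have converged, without needing anything resembling distributed transactions.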

NetApp has a solution for this problem which involves TWO seriously expensive pieces of NetApp hardware, one at the customer site and one in a colo site with a fat pipe to Amazon. The two NetApp machines keep their data stores synchronized, and Amazon's elastic cloud accesses data stored at the colo over the fat pipe. This is... not actually cloud storage. This is really the kind of expensive shoehorned solution that probably pisses off customers even more than it does me, because they're the ones writing the checks.

The right answer (for infrequently-updated imagery) is probably a few local servers with separate UPSes running Ceph and something like rsync to keep updates synchronized to S3. Clients in the call center fail over from the local servers to S3, clients outside the call center just use S3 directly.
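The client side of that failover is not complicated either. A sketch of the idea (the fetcher interface here is hypothetical, just callables that return tile bytes or raise on failure):

```python
from typing import Callable, List


def fetch_with_failover(
    key: str,
    local_fetchers: List[Callable[[str], bytes]],
    s3_fetcher: Callable[[str], bytes],
) -> bytes:
    """Try each local imagery server in order; fall back to S3.

    Call-center clients pass their local servers in local_fetchers;
    clients outside the call center pass an empty list and go
    straight to S3.
    """
    for fetch in local_fetchers:
        try:
            return fetch(key)
        except Exception:
            continue  # local server down; try the next one
    return s3_fetcher(key)
```

The same tile keys work against both stores, so the client doesn't care which copy it got, as long as the sync job keeps them identical.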

I feel sure there must be some Linux vendor who would be happy to ship the Nth system they've built to do exactly this, for a reasonable markup on the underlying hardware.