Thursday, September 9, 2010

Google App Engine: Scaling cost

I recently released an application for the iPhone called Device Locator. This application has an interesting pricing model: we provide hosting for users' recent location data for a one-time fee. Currently the application runs on a LAMP stack on a shared server. The overall cost is reasonable. This solution, however, will not be viable for long. Every new user adds about 33 new requests per day, and these requests are going to come in no matter what. To keep this service running in the future, we will need to write a scalable service.

Choosing a Service

There are many services out there in the cloud computing realm that we can utilize to help us scale. I will not go into comparing the different services, but simply talk about my approach to solving the problem.

My initial gut feeling was that Google App Engine should be reasonable. I reasoned that I know Java (later you will see why this didn't matter), Google does crazy stuff, and, heck, I have taken a graduate-level course in distributed computing that talked specifically about Google's scaling infrastructure. One of the other major factors in choosing a cloud computing service was that I would not have to manually run any services. I do not need to worry about administering my virtual servers. Google's App Engine seems to provide everything I can ask for (except running an Apple Push client).

The Problem

The problem is simple. The service should keep metadata about a device. When it receives a location update, it should add the location to a queue and delete the oldest location if the number of locations is greater than a specific constant.

With the scaling problem simplified to its guts, we can now try to approach the problem.
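
To make that concrete, here is a minimal sketch in plain Python. It keeps everything in memory rather than in a real datastore, and the names Device, add_location, and MAX_LOCATIONS are made up for illustration. The cap of 100 matches the queue size we used.

```python
from collections import deque

MAX_LOCATIONS = 100  # hypothetical name for the "specific constant"


class Device:
    """Metadata about one device plus its most recent locations."""

    def __init__(self, device_id, name):
        self.device_id = device_id
        self.name = name
        # A deque with maxlen silently drops the oldest entry
        # once the cap is reached.
        self.locations = deque(maxlen=MAX_LOCATIONS)

    def add_location(self, latitude, longitude, timestamp):
        self.locations.append((latitude, longitude, timestamp))


device = Device("abc123", "My iPhone")
for t in range(150):
    device.add_location(45.0, -75.0, t)

print(len(device.locations))   # 100
print(device.locations[0][2])  # 50, the oldest surviving timestamp
```

The logic itself is trivial. The cost comes from doing the same thing against a datastore on every request.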

Getting a Feel for the App Engine

Before investing heavily in App Engine, we needed to determine the real cost of running this service on it. Our solution was to simply forward all locations from our current service to Google App Engine and see how well it handled the requests.

At first we implemented the service in Java using proper classes and domain models. We had a Device object with an array of DeviceLocations. The array was sorted using the annotations provided by JDO. So when we got a new location, we simply removed the last location and added the new one at the end.

This solution was simple but costly. It took 500 ms to process each request with our location queue size of 100. This processing time was simply unacceptable. We were being charged $0.10 for each CPU hour. With the number of users we had, it would easily add up.

I searched far and wide for what could be causing such high CPU usage. I ran into some discussions explaining that the CPU time was not just the web server's CPU time but also the datastore's. I read some more and realized that Google's calculation of CPU time was not exact; it was an estimate. Their CPU time is measured at a fixed gigahertz rate, so if they use faster CPUs you can burn through your time much quicker. This was a little unsettling.

To debug what was causing the CPU usage, I stripped the code down to the bare minimum. I found that just pulling an entity and updating a value took 150 ms. After some more research I found out that applications written in Python could be 50% faster.

So I took to writing the service in Python. I quickly realized that Python gave me much more control over the datastore. It was really easy to pick which fields to index and which fields to ignore. In Java, all fields were indexed. This might have added to the CPU usage. Python also had a better implementation of the different data types. There was even a data type for storing a 1 MB string.

In the end I decided to use only one entity type: Device. To store the locations, I serialized the array to JSON, kept it in a single field, and deserialized it on each update. I reasoned that since Google does not calculate CPU time exactly, I wanted to reduce the amount of time spent in I/O. My new implementation would fetch the locations using an ID, manipulate the field, and save it.
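
Roughly, the update path looks like the sketch below. This is not the App Engine code itself: a plain dict (DATASTORE) stands in for the datastore so the example runs anywhere, and the names record_location and locations_json are hypothetical. The point is the shape: one read by ID, one in-memory edit, one write.

```python
import json

MAX_LOCATIONS = 100

# Stand-in for the datastore: device ID -> entity (a plain dict here).
DATASTORE = {}


def record_location(device_id, location):
    """Load one Device entity, update its JSON location list, save it back."""
    entity = DATASTORE.get(device_id)  # single read by ID
    if entity is None:
        entity = {"id": device_id, "locations_json": "[]"}

    locations = json.loads(entity["locations_json"])
    locations.append(location)
    # Keep only the newest MAX_LOCATIONS entries.
    locations = locations[-MAX_LOCATIONS:]

    entity["locations_json"] = json.dumps(locations)
    DATASTORE[device_id] = entity  # single write per update
    return len(locations)


for t in range(120):
    count = record_location("abc123", {"lat": 45.0, "lon": -75.0, "t": t})

print(count)  # 100
```

The trade-off is that individual locations can no longer be queried or indexed by the datastore, which was fine for us since we only ever read them per device.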

To my surprise, the exact same functionality used only 116 ms on average. This was more reasonable.

Although I am still not impressed, it is reasonable.
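
To put the two timings in perspective, here is a back-of-the-envelope calculation. It assumes the $0.10 rate applies per CPU hour, that every request is billed for its full measured time, and that each user really sends 33 requests a day. None of this accounts for free quotas, so treat it as a rough estimate only.

```python
REQUESTS_PER_USER_PER_DAY = 33
PRICE_PER_CPU_HOUR = 0.10  # assumption: the $0.10 rate is per CPU hour


def yearly_cost_per_user(ms_per_request):
    """Estimated CPU cost in dollars for one user over one year."""
    cpu_seconds_per_day = REQUESTS_PER_USER_PER_DAY * ms_per_request / 1000.0
    cpu_hours_per_year = cpu_seconds_per_day * 365 / 3600.0
    return cpu_hours_per_year * PRICE_PER_CPU_HOUR


for label, ms in [("Java, JDO", 500), ("Python, JSON field", 116)]:
    print("%-20s $%.3f per user per year" % (label, yearly_cost_per_user(ms)))
```

Under those assumptions the Java version comes to about $0.167 per user per year and the Python version to about $0.039. That matters when each user only pays once.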

Going by Moore's Law, costs should be recalculated every two years. I do not think Google has done that yet.

TL;DR: On Google App Engine, use Python. You do not want to scale a costly service.