Location-based mobile app foursquare appears to have stabilized after two long outages: the first started at around 11 a.m. ET Monday and lasted some 11 hours, and the second knocked the service out for about six hours Tuesday night, from around 6:30 p.m. to 12:30 a.m. Wednesday.
A foursquare outage doesn’t hit as hard as Facebook or Twitter going down, since the typical user experience consists of quick visits rather than long stretches on the site or app. Still, it may have cost a certain WebNewser editor the chance to become Mayor of Bowl Rite Bowling Lanes in Union City, N.J.
An indication of what kind of night it must have been at foursquare headquarters Tuesday was this update at 10:28 p.m.: “Running on pizza and Red Bull. Another long night.”
From the foursquare Blog:
So, that was a bummer.
Yesterday (Monday), we experienced a very long downtime. All told, we were down for about 11 hours, which is unacceptably long. It sucked for everyone (including our team — we all check in every day, too). We know how frustrating this was for all of you because many of you told us how much you’ve come to rely on foursquare when you’re out and about. For the 32 of us working here, that’s quite humbling. We’re really sorry.
This blog post is a bit technical. It has the details of what happened, and what we’re doing to make sure it doesn’t happen again in the future.
What happened: The vast bulk of the data we store is from user check-in histories. Our databases are structured so that data is spread evenly across multiple database “shards,” each of which can only store so many check-ins. Starting around 11 a.m. ET yesterday, we noticed that one of these shards was performing poorly because a disproportionate share of check-ins was being written to it. For the next hour and a half, until about 12:30 p.m., we tried various measures to restore a proper load balance. None of them worked. As a next step, we introduced a new shard, intending to move some of the data from the overloaded shard to this new one.
We wanted to move this data in the background while the site remained up. For reasons that are not entirely clear to us right now, though, the addition of this shard caused the entire site to go down. In addition, moving the data over to the new shard did not free up as much space as anticipated (partially because of data fragmentation, partially because our database is partitioned by user ID). We spent the next five hours trying different approaches to migrating data to this new shard and then restarting the site, but each time we encountered the same problem of overloading the initial shard, keeping the site down.
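To make the failure mode above concrete, here is a minimal, hypothetical sketch of range partitioning by user ID. The shard sizes and user IDs are our own illustration, not foursquare’s actual configuration; the point is that because a user’s check-ins always go to the shard that owns their ID range, a cluster of unusually active users in one range makes that shard “hot,” and simply adding a new shard does not change the mapping for existing users until their data is migrated.

```python
# Hypothetical sketch of user-ID range partitioning (illustrative numbers,
# not foursquare's real setup).

USERS_PER_SHARD = 200  # assumed range width

def shard_for(user_id: int, num_shards: int) -> int:
    """Assign a user's check-ins to a shard by user-ID range."""
    # Users 0..199 -> shard 0, 200..399 -> shard 1, and so on.
    return (user_id // USERS_PER_SHARD) % num_shards

# If the most active users happen to share one ID range, their shard
# absorbs a disproportionate share of writes -- a "hot" shard.
active_users = [5, 17, 42, 101, 150]  # hypothetical heavy check-in users
writes_per_shard: dict[int, int] = {}
for uid in active_users:
    s = shard_for(uid, num_shards=3)
    writes_per_shard[s] = writes_per_shard.get(s, 0) + 1

print(writes_per_shard)  # every write lands on shard 0
```

Note that raising `num_shards` only changes where *future* ranges map; the check-ins already written to the hot shard stay put until they are explicitly migrated, which matches the post’s account of why adding a shard did not immediately relieve the overload.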
At 6:30 p.m. ET, we determined that the most effective course of action was to reindex the shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours. At 11:30 p.m., the site was brought back up. Because of our safeguards and extensive backups, no data was lost.
What we’ll be doing differently — technically speaking: So we now have more shards and no danger of overloading in the short- to medium-term. There are three general technical things we’re investigating to prevent this type of error from happening in the future:
The makers of MongoDB — the system that powers our databases — are working very closely with us to better deal with the problems we encountered.
We’re making changes to our operational procedures to prevent overloading, and to ensure that future occurrences have safeguards so foursquare stays up.
Currently, we’re also looking at things like artful degradation to help in these situations. There may be times in the future when we’re overloaded, and it would obviously be better to turn off certain functionality than to have the whole site go down.
We have a number of improvements we’ll be making in the coming weeks, and we’ll detail those in a future post.
What we’re doing differently — in terms of process: So, in addition to our technical stumble, we also learned that we need a better process to keep all of you posted when something like this happens.
During these outages, regular updates (at least hourly) will be tweeted from @4sqsupport.
We’ve created a new status blog at status.foursquare.com, which will have the latest updates.
A more useful error page: instead of a static graphic saying we’re upgrading our servers (which was not completely accurate), it will show a more descriptive status update. Of course we hope not to see the pouty princess in the future …
Hopefully, this makes what happened clear and will help lead to a more reliable foursquare in the future. We feel tremendous responsibility to our community and yesterday’s outage was both disappointing and embarrassing for us. We’re sorry.
From the foursquare Status Blog:
6:39 p.m.: We are actively investigating the issue.
UPDATE Oct. 5 6:59 p.m.: We’ve identified the issue and are working to resolve it.
UPDATE Oct. 5 8:01 p.m.: Our server team is still working to resolve the problem. The issue is related to yesterday’s outage.
UPDATE Oct. 5 9:09 p.m.: Still plugging away at HQ. Hoping to resolve the issue soon!
UPDATE Oct. 5 10:28 p.m.: Running on pizza and Red Bull. Another long night.
UPDATE Oct. 6 12:30 a.m.: And we’re back. Web, API, and apps are up and running.