
Joshua Linden's post-1.18.5 server deployment post-mortem

Joshua Linden of Linden Lab has posted a comprehensive post-mortem of the issues surrounding the 1.18.5 server release on the Second Life grid. Yes, the viewer itself is only up to version 1.18.4; a previous effort called 'Message Liberation' provided the first steps in allowing different server and viewer versions to coexist with each other.

If you've ever managed or been involved with a large-scale grid deployment, you'll probably sit back and chuckle ruefully at this one, having experienced similar things first-hand, or even accidentally caused them.

As always, when you're gearing up for something big, third-party omens loom large. In this case it started with a network outage at the Dallas colocation facility; the ISP suspects hardware problems.

When the central-systems changes began rolling out on Tuesday morning, login services went awry. It wasn't apparent at the time, but the culprit appeared to be an overload in a pool system that was designed to handle high loads better than the older system it replaced. Only it wasn't. The pool of allocated resources was too small, and the reason why didn't become apparent until much later in the week. The allocation was increased and everything was rescheduled.

The pool system is essentially supposed to allocate a fixed set of resources up front, rather than spending time allocating and deallocating them on the fly. Incoming requests grab one of these resources, and when a request finishes its resource becomes available for the next one. The pattern is common in high-performance web servers and similar systems: it saves setup time and keeps higher loads manageable.
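
The post doesn't include Linden Lab's actual code, so here is a minimal sketch of the general pattern; the pool size, `make_connection`, and `handle_request` are all illustrative stand-ins rather than anything from the real services:

```python
import queue

POOL_SIZE = 32  # illustrative; sizing this pool is exactly what went wrong here


def make_connection():
    # Stand-in for whatever expensive resource the pool holds
    # (a database handle, a backend socket, a worker process, ...).
    return object()


class ResourcePool:
    """Pre-allocates a fixed set of resources; requests borrow and return them."""

    def __init__(self, size=POOL_SIZE):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(make_connection())

    def acquire(self, timeout=5.0):
        # Blocks if every resource is checked out -- which is exactly what an
        # "overloaded pool" looks like from the outside.
        return self._free.get(timeout=timeout)

    def release(self, resource):
        self._free.put(resource)


pool = ResourcePool()


def handle_request(work):
    conn = pool.acquire()
    try:
        return work(conn)
    finally:
        pool.release(conn)  # holding on too long starves every other request
```

The appeal is that the setup cost is paid once at start-up; the catch is that the fixed size becomes a hard ceiling, so anything that holds a resource longer than expected looks exactly like an overload.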

Thursday saw things go awry again. The asset cluster, a central cluster of database servers storing vast amounts of content, didn't do what it was supposed to when new hardware was plugged in and old hardware was removed.

Apparently it sat there weeping for its lost loved ones for approximately half an hour, rather than switching over to the new equipment, until cajolery, bribery and threats convinced it to behave.

Once the new code rolls out, the real problem behind the pool-system overload starts to become apparent: two sets of services communicating with each other, but each one thinking that the other is going to close the data connection first. The connection eventually times out and closes on its own, but until then each one ties up a unit of those pre-allocated resources. A communication/documentation error, essentially.
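
Linden's post doesn't spell out the exact protocol, but the failure mode is a classic one. The sketch below uses generic Python sockets, not the actual services, to show how two endpoints that each expect the other to close first both sit blocked until a timeout finally reclaims the connection, and the pool slot with it:

```python
import socket


def caller(sock: socket.socket, payload: bytes) -> bytes:
    """Sends a request, then reads until the *peer* closes the connection."""
    sock.sendall(payload)
    chunks = []
    while True:
        data = sock.recv(4096)   # blocks until the other side closes...
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)


def service(sock: socket.socket, response: bytes) -> None:
    """Sends a response, then also waits for the *peer* to close."""
    sock.sendall(response)
    while sock.recv(4096):       # ...but this side is waiting for the same thing
        pass
    sock.close()

# With both ends written this way, neither recv() returns until an idle
# timeout finally gives up, and for that whole time the connection keeps a
# pre-allocated pool slot checked out.
```

Neither side is buggy in isolation; the two contracts simply don't match, which is why it reads as a documentation problem as much as a code one.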

A hasty, minimum-effort code change is made so that one side actually blinks first.

There are signs at this point that other things are going wrong, but with the update only partway through, not all of the reporting and monitoring systems are speaking the same language.

By the end of the restart it's apparent that a bunch of sims are crashing more often than they ought. A lot more often.

Friday morning, a quick attempt at a fix introduces a typo which blows things up. After fixing that, the problem is traced to a move from UDP to TCP connections for some data. TCP is more reliable, but consumes more system resources under poor network conditions: it makes heroic efforts to deliver data even after hope is lost, which means the data can stick around longer than the user it was intended for.
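
The post doesn't name the specific messages involved, but the trade-off is easy to show with plain sockets. This sketch is illustrative only; the endpoint, the timeouts, and the Linux-only TCP_USER_TIMEOUT option are assumptions, not anything from the actual simulator code:

```python
import socket

ADDR = ("sim.example.net", 13000)   # hypothetical endpoint, purely illustrative


def send_udp(payload: bytes) -> None:
    # Fire and forget: if the datagram is lost, nothing retries and nothing
    # lingers -- a stale update simply never arrives.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, ADDR)


def send_tcp(payload: bytes) -> None:
    # Reliable delivery: the kernel queues the data and keeps retransmitting
    # it, by default for minutes, even if the recipient has long since left.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(5.0)  # bound how long connect()/sendall() may block
        if hasattr(socket, "TCP_USER_TIMEOUT"):  # Linux-only knob
            # Give up on unacknowledged data after 10 seconds instead of minutes.
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10_000)
        s.connect(ADDR)
        s.sendall(payload)
```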

The only problem was that switching back to UDP caused a crash, so another rolling restart was required late on Friday.

Bonus networking problems between the colocation facilities again on Saturday: this time, an expired security certificate for the private network.
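
An expired certificate is cheap to catch before it bites. As a hedged example (the host name is hypothetical, this uses Python's standard ssl module rather than whatever Linden Lab runs, and a private-network CA would need to be loaded into the context), a check like this run from cron can warn well before the handshake starts failing:

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_cert_expiry(host: str, port: int = 443) -> float:
    """Connects, fetches the peer certificate, and returns days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400


if __name__ == "__main__":
    remaining = days_until_cert_expiry("inter-colo.example.net")  # hypothetical host
    if remaining < 30:
        print(f"certificate expires in {remaining:.0f} days -- renew it")
```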

That's the overview. Joshua Linden explains it all in detail.