
EVE Online devblog addresses recent server issues

EVE Online's server is a complex beast, holding the title of biggest supercomputer in the gaming industry. The main server cluster is housed in London and serviced by a team of IBM engineers. In addition to constant hardware upgrades to take advantage of the newest technology, CCP's network programmers work around the clock to improve performance and track down bugs that affect the game. EVE is no stranger to lag or network issues, and older players know all too well that server troubles are expected around patch days.

When the Dominion expansion was released, there were far more complaints of server issues than could be attributed to the usual "patch day blues". Now, several months down the line, we're still hearing horror stories of fleet battles lagging unbearably with only a few hundred players involved. The last few months have seen an increasing number of node deaths and database failovers, in some cases causing unscheduled server reboots. Read on to find out what CCP is doing to combat the issue.

In their latest devblog "Restoring tranquility to Tranquility", CCP Valar explains what CCP is doing to track down the current server issues and restore the EVE server to normality. Because the server runs a complex system of concurrent threads, rare "race conditions" can arise, in which operations from different threads execute in an order the system wasn't designed to handle. The current EVE server woes stem from a bug in Tranquility's network manager, and CCP has been working with their support vendors to track it down. This is a high-priority problem and CCP is using every resource they can to get it fixed.
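CCP hasn't published the code behind the bug, but the failure mode is easy to show in miniature. In the sketch below (plain Python threads, nothing EVE-specific; all names are purely illustrative), several threads perform an unsynchronised read-modify-write on a shared counter, and depending on how the scheduler interleaves them, updates are silently lost:

```python
import threading
import time

counter = 0  # shared state, deliberately unprotected

def increment(n):
    global counter
    for _ in range(n):
        value = counter       # read the shared state
        time.sleep(0)         # yield: modern CPython rarely switches threads
                              # mid-loop on its own, so make the window explicit
        counter = value + 1   # write back, possibly clobbering another
                              # thread's increment

workers = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(counter)  # expected 40000; under contention it often comes up short
```

Some runs finish with the correct total, which is the treacherous part: whether the bug fires depends on thread timing, not on which code path runs.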

As any programmer will tell you, the problem with concurrency bugs is that they're often remarkably hard to reproduce and thus difficult to fix. When one occurs on the live server, it's often impossible to tell what caused it, and if the failure can't be reproduced reliably, a solution can't be engineered. CCP's programmers are currently working on test scripts which should hopefully cause the issue to occur on the test server. For now, however, they've installed a workaround procedure that has decreased the rate of database failovers and should help keep the server stable while they hunt for the elusive bug.
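CCP hasn't described what their test scripts look like, but a reproduction harness for this class of bug usually amounts to brute force: wrap the suspect operation in a trial that checks an invariant, then repeat the trial under heavy concurrency until the rare interleaving fires. A hypothetical sketch, reusing the racy update pattern from above:

```python
import threading
import time

def racy_update(state, n):
    # Stand-in for the suspect operation. A real race fires far less
    # often, which is exactly why the harness repeats the trial.
    for _ in range(n):
        value = state["count"]       # read shared state
        time.sleep(0)                # yield so another thread can interleave
        state["count"] = value + 1   # write back, possibly losing an update

def trial(threads=4, per_thread=1_000):
    """Run one concurrent trial; return True if no updates were lost."""
    state = {"count": 0}
    workers = [threading.Thread(target=racy_update, args=(state, per_thread))
               for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return state["count"] == threads * per_thread

if __name__ == "__main__":
    # A failure that appears once in a thousand runs is invisible in a
    # single test pass but shows up reliably when repeated under load.
    failures = sum(1 for _ in range(100) if not trial())
    print(f"{failures}/100 trials lost updates")
```

Once a harness like this triggers the failure on demand, a candidate fix, for instance serialising the update with a threading.Lock, can be validated the same way: the failure count should drop to zero and stay there across runs.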