In my last Tech Tuesday, I described how we were going to save all of the offline ships in the world and then delete them all in a subsequent update.
Well, it turns out that the PvE deployment in particular has lately been really struggling with the sheer number of ships in the world, so strange bugs such as ships not appearing have started to happen much more frequently.
Because of this, we’ve decided to bring forward the work of deleting all of the offline ships and respawning them when registered players log in. I completed the work to delete offline ships today and the results are even better than we expected, so here are some graphs!
First, here’s the graph showing the number of entities in the world:
It peaks at 3 million entities, as it takes a while for SpatialOS to load the massive snapshot. Over the course of an hour, we save all of the ships and then delete them, hence the linear drop in the number of entities. We end up at just 43 thousand entities, an absolutely massive reduction.
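To illustrate why the drop looks linear, here is a minimal sketch of one way to get that shape: a sweeper that saves and then deletes a fixed number of ships per tick. This is not the actual worker code; `OfflineShipSweeper`, `budgetPerTick` and the in-memory `saved` map are all made up for the example, and the real save and delete calls would go through SpatialOS rather than a queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: none of these names come from the game's code or SpatialOS.
class OfflineShipSweeper {
    private final Queue<Long> pending = new ArrayDeque<>();  // ids of ships still in the world
    private final Map<Long, String> saved = new HashMap<>(); // stand-in for persistent storage
    private final int budgetPerTick;                         // assumed fixed work budget

    OfflineShipSweeper(List<Long> offlineShipIds, int budgetPerTick) {
        this.pending.addAll(offlineShipIds);
        this.budgetPerTick = budgetPerTick;
    }

    /** Saves, then deletes, at most budgetPerTick ships; returns how many remain. */
    int tick() {
        for (int i = 0; i < budgetPerTick && !pending.isEmpty(); i++) {
            long shipId = pending.poll();             // removing the id stands in for deleting the entity
            saved.put(shipId, "blueprint-" + shipId); // the ship is recorded before it is gone for good
        }
        return pending.size();
    }

    int savedCount() {
        return saved.size();
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 0; id < 1000; id++) {
            ids.add(id);
        }
        OfflineShipSweeper sweeper = new OfflineShipSweeper(ids, 100);
        int remaining = ids.size();
        while (remaining > 0) {
            remaining = sweeper.tick();
            System.out.println("remaining=" + remaining); // falls by the same amount every tick
        }
    }
}
```

With a constant budget, the remaining count falls by the same amount every tick, which is the straight line you would see on an entity-count graph.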
This has had a huge impact on the CPU usage of our servers!
Firstly, the resting CPU has decreased pretty much linearly with the reduction in the number of entities. This is surprising to us because profiling showed that the offline ships weren’t directly doing a lot of work. What we think is happening is that they were indirectly causing everything to be more expensive just by being in lists of things to process as well as making garbage collection more expensive.
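As a rough illustration of that theory (not a measurement, and not the game's code), the sketch below uses a made-up `Ship` class and a made-up per-tick loop. Even when offline ships do no work themselves, every loop that walks the full list still pays to visit them, and a tracing garbage collector still has to walk them as live objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: Ship and the tick loop are invented for illustration only.
class TickCostSketch {
    static final class Ship {
        final boolean online;

        Ship(boolean online) {
            this.online = online;
        }
    }

    /** Walks every ship but only "works" on online ones; returns {visited, worked}. */
    static int[] tick(List<Ship> ships) {
        int visited = 0;
        int worked = 0;
        for (Ship ship : ships) {
            visited++;         // paid for every ship in the list, online or not
            if (ship.online) {
                worked++;      // the only place real work would happen
            }
        }
        return new int[] {visited, worked};
    }

    /** Builds a world with the given numbers of online and offline ships. */
    static List<Ship> world(int online, int offline) {
        List<Ship> ships = new ArrayList<>();
        for (int i = 0; i < online; i++) {
            ships.add(new Ship(true));
        }
        for (int i = 0; i < offline; i++) {
            ships.add(new Ship(false));
        }
        return ships;
    }

    public static void main(String[] args) {
        int[] before = tick(world(10, 990)); // offline ships still in the list
        int[] after = tick(world(10, 0));    // offline ships deleted
        System.out.println("before: visited=" + before[0] + " worked=" + before[1]);
        System.out.println("after:  visited=" + after[0] + " worked=" + after[1]);
    }
}
```

The useful work is identical in both cases, but the number of objects touched per tick is not, which is the kind of indirect cost a profiler would attribute to the loops rather than to the ships.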
The other thing to note is that the CPU spikes from taking snapshots are effectively gone. We suspected that these spikes may have been translating to large in-game lag spikes, so fingers crossed there will be a big improvement there.
There is something quite mysterious going on here though. About 10 minutes after the last ship is destroyed, CPU usage shoots up temporarily. We think this may be explained by the next graph – memory usage:
Strangely, memory usage initially falls, but then begins to increase again as more and more ships get deleted. Then, once it reaches its previous peak, we get the huge CPU spike and the memory usage drops. My theory is that the heap is massively fragmented at that point and it reaches a threshold at which the Java Virtual Machine decides to defragment the memory. I would expect to see garbage collection CPU usage increase at that point, but it doesn’t. However, that could just be because profiling only measures the sweep phase and not the defrag phase of garbage collection.
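One way to check a theory like this would be to log heap usage next to the JVM's own garbage collection counters, which come from the standard `java.lang.management` API rather than from a CPU profiler. The sketch below is only an assumption about how such a probe could look. `GcProbe` is a made-up name, and which collectors show up, and what their accumulated time covers, depends on the JVM and the garbage collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical probe: the class name and the way it is used are assumptions.
class GcProbe {
    /** Sum of collection counts over all collectors; undefined (-1) values are skipped. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Sum of approximate accumulated collection time in milliseconds over all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Heap bytes currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        System.out.println("usedHeapBytes=" + usedHeapBytes());
    }
}
```

Sampling these values every few seconds and plotting them against CPU would show whether the spike lines up with a jump in accumulated collection time. If it does, that would support the defragmentation theory. If it doesn't, the cause is probably somewhere else.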
So what does this mean for latency? Here’s a graph of the current PvE deployment:
You can ignore the green line, which is player latency, but the next line below shows the latency of the worst server. It’s 180ms, which is pretty crazy considering the machines are located in the same data center! Here’s the new deployment with ships being removed:
Granted, there are no actual players on this deployment, but you can see the physics server latency dropping all the way down to 50ms as the ships are removed, with no sign of that enormous CPU spike! It’s going to be really interesting to see how this change impacts player latency.
Overall, we expect this to fix a lot of technical issues with the game. It also brings other cool benefits: new features based on the ship saving (blueprints), and other random things, like the fact that it now only takes 2 minutes to clean the world snapshot, so maintenance should also be much faster!
(Screenshot by Spacebar Jazz on the Worlds Adrift Discord (https://discord.gg/worldsadrift))