Monday, February 13, 2023

Linden Lab Explains "February 1st Outage"

 
Given the scale of the disruptions on Wednesday, February 1st, Linden Lab was likely to feel obligated to explain to its customer base, the residents, what had happened. As it turns out, the Lab published a statement about the "Recent February 1st Outage" the following day in the "Tools and Technology" section of the official blog.

This Wednesday, February 1st, was a rocky one for Second Life, especially for regions on our Release Candidate (RC) channels. As with many large-scale services, we experience service interruptions from time to time, but yesterday's problems were unique and noteworthy enough that they deserve a sincere apology and more thorough explanation. 
 
The Lab stated that the main cause was "a bug in object inventory behavior that was introduced by new code," adding that the bug "caused some objects' scripts to enter a non-running state."

RC rolls are regularly scheduled for 7AM Pacific (SLT) every Wednesday. While conducting one of these deployments, we received initial reports of issues relating to script behavior at 8:30AM. We immediately started investigating, and by 9AM it was clear that we needed to stop the roll, because the issue was pervasive and clearly related to the new changes. As we do in these situations, we declared an “Incident” and an Incident Commander took charge. 

By 9:30 we started a rollback, reverting the affected simulators to their previous version. We also evaluated what additional actions we would need to take, as it was unclear that a rollback alone would start the scripts which had stopped. This quickly proved to be the case, and we came up with a new plan - one that would ensure that scripts would perform as expected going forward, but possibly undo the changes our residents had been making since the time we had introduced the bug. Although this meant more downtime, it prevented further content loss, and proved to be the best way to put the grid back in order. The team came up with a quick, clean, and efficient way to achieve this and get everyone back on track. 

By 12PM the decision was made to take this direction. By 1:15PM the code was complete, by 1:40PM we confirmed that it worked. By 2:25PM all regions were brought back up.  

The report in full, signed by Grumpity Linden "and the Second Life team," can be found in the "Tools and Technology" section of the official blog.

"I’m grateful to know that while we may make mistakes, we will not sweep them under the rug, nor look for someone to blame - we will come together and make it right."

1 comment:

  1. However, I and almost all my friends (on various viewers), are STILL experiencing numerous lag-related problems to this day, two weeks after Feb 1. Is Linden Lab unaware of all this? Help!!
