On Friday, February 7 2020, Eclipse.org suffered a severe service disruption to many of its web properties when our primary authentication server and file server suffered a hardware failure.
For 90 minutes, our main website, www.eclipse.org, was mostly available, as was our Bugzilla bug tracking tool, but logging in was not possible. Wiki, Eclipse Marketplace and other web properties were degraded. Git and Gerrit were both completely offline for 2 hours and 18 minutes. Authenticated access to Jiro, our Jenkins+Kubernetes-based CI system, was not possible, and builds that relied on Git access failed during that time.
There was no data loss, but there were data inconsistencies. A dozen Git repositories and Gerrit code changes were in an inconsistent state due to replication schedules, but thanks to the distributed nature of Git, the code commits were still in local developer Git repositories, as well as on the failed server, which we were eventually able to revive (in an offline environment). Data inconsistencies were more severe in our LDAP accounts database, where dozens of users were unable to log in, and in some isolated cases, users reported that their account had reverted to data from years earlier.
In hindsight, we feel this outage could have, and should have, been avoided. We’ve identified many measures we must enact to prevent such unplanned outages in the future. Furthermore, our communication and incident handling processes proved to be flawed, and will be scrutinized and improved to ensure our community is better informed during unplanned incidents.
Lastly, we’ve identified aging hardware and Single Points of Failure (SPoF) that must be addressed.
File server & authentication setup
At the center of the Eclipse infra is a pair of servers that handle two specific tasks: file serving (NFS) and authentication (LDAP).
The server pair consists of a primary system, which handles all the traffic, and a hot spare. Both servers are configured identically for production service, but the spare server sits idle and receives data periodically from the primary. This specific architecture was originally implemented in 2005, with periodic hardware upgrades over time.
Timeline of events
Friday Feb 7 - 12:33pm EST: Fred Gurr (Eclipse Foundation IT/Releng team) reports on the Foundation’s internal Slack channel that something is happening to the Infra. Denis observes many “Flaky” status reports on https://status.eclipse.org but is in transit and cannot investigate further. Webmaster Matt Ward investigates.
12:43pm: Matt confirms that our primary nfs/ldap server is not responding, and activates “Plan A: assess and fix”.
12:59pm: Denis reaches a computer and activates “Plan B: prepare for Failover” while Matt works on Plan A. The “Sorry, we are down” page is served for all Flaky services except www.eclipse.org, which continues to be served successfully by our nginx cache.
1:18pm: The standby server is ready to assume the “primary” role.
1:29pm: Matt makes the call for failover, as the severity of the hardware failure is not known and the primary does not appear to be easily recoverable.
1:49pm: www.eclipse.org, Bugzilla, Marketplace, Wiki return to stable service on the new primary.
2:18pm: Git and Gerrit return to stable service.
2:42pm: Our Kubernetes/OpenShift cluster is updated to the latest patchlevel and all CI services restarted.
4:47pm: All legacy JIPP servers are restarted, and all other remaining services report functional. At this time, we are not aware of any issues.
During the weekend, Matt continues to monitor the infra. Authentication issues crop up over the weekend, which are caused by duplicated accounts and are fixed by Matt.
Monday, 4:49am EST: Mikaël Barbero (Eclipse Foundation IT/Releng team) reports that there are more duplicate users in LDAP that cannot log into our systems. This is now a substantial issue. They are fixed systematically with an LDAP duplicate finder, but the process is very slow.
10:37am: First Foundation broadcast on the cross-project mailing list that there is an issue with authentication.
Tuesday, 9:51am: Denis blogs about the incident and posts a message to the eclipse.org-committers mailing list about the ongoing authentication issues. The message, however, is held for moderation and is not distributed until many hours later.
Later that day: Most duplicated accounts have been removed, and just about everything is stabilized. We do not yet understand the source of the duplicates.
Wednesday: duplicate removals continue, as well as investigation into the cause.
Thursday 9:52am: We file a dozen bugs against projects whose Git and Gerrit repos may be out of sync. Some projects had already re-pushed or rebased their missing code patches and resolved the issue as FIXED.
Friday, 2:58pm: All remaining duplicates are removed. Our LDAP database is fully cleaned. The failed server re-enters production as the hot standby - even though its hardware is not reliable. New hardware is sourced and ordered.
The physical servers hosting our NAS/LDAP setup are server-class hardware: 2U chassis with redundant power supplies, ECC (error checking and correction) memory, and RAID-5 disk arrays with battery-backed RAID controller memory. Both primary and standby servers were put into production in 2011.
On February 7, the primary server experienced a kernel crash from the RAID controller module. The RAID controller detected an unrecoverable ECC memory error. The entire server became unresponsive.
As originally designed in 2005, periodic (batched) data updates from the primary to the hot spare were simple to set up and maintain. This method also had a distinct advantage over live replication: rapid recovery in case of erasure (accidental or malicious) or data tampering. Of course, this came at the cost of possible data loss for anything written between two batches. However, it was deemed that the critical data susceptible to loss during that short window (in our case, source code) was also available on developer workstations.
Failover and return to stability
As the standby server was prepared for production service, the reasons for the crash on the primary server were investigated. We assessed the possibility of continuing service on the primary; that course of action would have provided the fastest recovery with the fewest surprises later on.
As the nature of the hardware failure remained unknown, failover was the only option. We confirmed that some data replication tasks had run less than one hour prior to failure, and all data replication was completed no later than 3 hours prior. IP addresses were updated, and one by one, services that depended on NFS and authentication were restarted to flush caches and minimize any potential for an inconsistent state.
At about 4:30pm, or four hours after the failure, both webmasters were confident that the failover was successful, and that few residual issues would surface over the weekend.
Throughout the weekend, we had a few reports of authentication issues -- which were expected, since we failed over to a standby authentication source that was at least 12 hours behind the primary. These issues were fixed as they were reported, and nothing seemed out of place.
On Monday morning, Feb 10th, the Foundation’s Releng team reported that several committers had issues authenticating to the CI systems. We then suspected that something else was at play with our authentication database, but it was not clear to us what had happened, or what the magnitude was. The common issue was duplicate accounts: some users had an account in two separate containers simultaneously, which prevented them from authenticating. These duplicates were removed as rapidly as we could, and we wrote scripts to identify old duplicates and purge them, but with more than 450,000 accounts, it was time-consuming.
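The scripts we used are not reproduced here. As a minimal sketch of the idea, assuming each account is identified by a `uid` and lives under a single parent container, a duplicate finder only needs to group entries by `uid` and report any that appear under more than one container. The DNs, the container names and the `find_duplicates` helper below are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def find_duplicates(dns):
    """Return {uid: [containers]} for every uid found in more than one container.

    `dns` is an iterable of distinguished names of the (assumed) form
    "uid=<name>,<parent container>". Entries whose first RDN is not a uid
    are ignored.
    """
    containers_by_uid = defaultdict(set)
    for dn in dns:
        rdn, _, parent = dn.partition(",")
        attr, _, uid = rdn.partition("=")
        if attr.strip().lower() != "uid":
            continue
        containers_by_uid[uid.strip().lower()].add(parent.strip().lower())
    return {
        uid: sorted(containers)
        for uid, containers in containers_by_uid.items()
        if len(containers) > 1
    }


# Hypothetical sample data: alice exists in two containers, bob in one.
entries = [
    "uid=alice,ou=community,dc=example,dc=org",
    "uid=alice,ou=committers,dc=example,dc=org",
    "uid=bob,ou=community,dc=example,dc=org",
]
duplicates = find_duplicates(entries)
```

A single pass over the entries keeps the work linear in the number of accounts. In practice, the slow part is more likely to be reading and deleting entries on the directory itself than the grouping step.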
At that time, we got so wrapped up in trying to understand and resolve the issue that we completely underestimated its impact on the community, and we were absolutely silent about it.
On Friday afternoon, February 14, we were able to finally clean up all the duplicate accounts and understand why they existed in the first place.
Prior to December, 2011, our LDAP database only contained committer accounts. In December 2011, we imported all the non-committer accounts from Bugzilla and Wiki into an LDAP container we named “Community”. This allowed us to centralize authentication around a single source of truth: LDAP.
All new accounts were, and are, created in the Community container, and are moved into the Committer container if and when the account holder becomes an Eclipse Committer.
Our primary->secondary LDAP sync mechanism was altered, at that time, to sync the Community container as well -- but it was purely additive. Once you had an account in Community, it was there for life on the standby server, even if you later became a committer or changed your email address. This was the source of the duplicate accounts on the standby server.
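The difference between an additive sync and a mirroring sync can be shown with a small model. This is only a sketch of the behavior described above, not our actual sync code: directory entries are reduced to hypothetical `(container, uid)` pairs, and the function names are made up for illustration.

```python
def additive_sync(primary, standby):
    """Copy every primary entry to the standby, but never delete anything.

    This models the flawed behavior: entries that were moved or removed
    on the primary stay on the standby forever.
    """
    return standby | primary


def mirror_sync(primary, standby):
    """Make the standby an exact copy of the primary, dropping stale entries."""
    return set(primary)


# Step 1: alice signs up, so her account is created in the Community container.
primary = {("community", "alice")}
standby = additive_sync(primary, set())

# Step 2: alice becomes a committer, so the primary moves her account.
primary = {("committers", "alice")}

# The additive sync leaves the old Community entry behind on the standby.
standby_additive = additive_sync(primary, standby)

# A mirroring sync would have removed it.
standby_mirror = mirror_sync(primary, standby)
```

After step 2, the additively synced standby holds alice in both containers, which is the duplicate state that blocked logins after failover. The mirrored standby holds only the Committer entry. A full mirror gives up part of the protection against accidental deletion that batched replication was chosen for, so a real fix would have to weigh the two.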
A new server pair was ordered on February 14, 2020. These servers will be put into production service as soon as possible, and the old hardware will be recommissioned to clustered service. With these new machines, we believe our existing architecture and configuration can continue to serve us well over the coming months and years.
Take-aways and proposed improvements
Although the outage didn’t last incredibly long (2 hours from failure to the beginning of restored service), we feel it shouldn’t have occurred in the first place. Furthermore, we’ve identified key areas where our processes can be improved - notably, in how we communicate with you.
Here are the action items we’re committed to implementing in the near term, to improve our handling of such incidents:
Communication: Improved Service Status page. https://status.eclipse.org gives a picture of what’s going on, but with an improved service, we can communicate the nature of outages, the impact, and estimated time until service is restored.
Communication: Internally, we will improve communication within our team and establish a maintenance log, whereby members of the team can discover the work that has been done.
Staffing: we will explore the possibility of an additional IT hire, thus enhancing our collective skillset and allowing more time to be spent on the quality and reliability of the infra.
Aging Hardware: we will place top priority on resolving aging SPoF, and be stricter about not running hardware devices past their reasonable life expectancy.
In the longer term, we will continue our investment in replacing SPoF with more robust technologies. This applies to authentication, storage, databases and networking.
Process and procedures: we will allocate more time to testing our disaster recovery and business continuity procedures. Such tests would likely have revealed the LDAP sync bug.
We believe that these steps will significantly reduce unplanned outages such as the one that occurred on February 7. They will also help us ensure that, should a failure occur, we recover and return to a state of stability more rapidly. Finally, they will help you understand what is happening, and what the timelines to restore service are, so that you can plan your work tasks and remain productive.