We wanted to share this timeline of yesterday's outage so that teams can better understand how their systems were affected during these periods.
At 19:33 UTC on 10/31/2019, our internal monitoring systems detected that the backend storage system used for our near-head data performance cache was degraded, and our on-call team was notified.
At 20:00 UTC on 10/31/2019, our on-call team completed the failover to backup infrastructure, which began serving mainnet traffic, and the incident was resolved.
At 00:14 UTC on 11/1/2019, our internal monitoring systems detected that the backup storage system for our near-head data was degraded, and our on-call team was notified.
At 00:54 UTC on 11/1/2019, our on-call team completed the failover to backup infrastructure, which began serving mainnet traffic but was running in a degraded state (users may have seen cross-method block height inconsistencies; see the sketch after this timeline).
At 08:19 UTC on 11/1/2019, a new storage system came online and began serving mainnet traffic, and the incident was resolved.
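For context on what that degraded state could look like from a client's perspective, here is a minimal sketch of how a user might check for cross-method block height inconsistency against a standard Ethereum JSON-RPC endpoint. The endpoint URL is a placeholder and the script is purely illustrative; it is not part of our internal tooling.

```python
import requests

# Placeholder endpoint URL -- substitute your own provider URL.
RPC_URL = "https://mainnet.example.com"

def rpc(method, params=None):
    """Send a single JSON-RPC request and return the 'result' field."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    resp = requests.post(RPC_URL, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["result"]

# Height reported directly by eth_blockNumber (hex-encoded).
height_a = int(rpc("eth_blockNumber"), 16)

# Height of the block returned by eth_getBlockByNumber("latest", ...).
latest_block = rpc("eth_getBlockByNumber", ["latest", False])
height_b = int(latest_block["number"], 16)

# During the degraded window, these two values could briefly disagree.
if height_a != height_b:
    print(f"Inconsistency: eth_blockNumber={height_a}, eth_getBlockByNumber={height_b}")
else:
    print(f"Heights agree at block {height_a}")
```

A small difference between the two values can also occur transiently under normal operation as new blocks propagate, so a check like this is most useful when the gap is large or persistent.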
As we mentioned in our previous post-mortem, our team is actively working to move away from our current storage system as part of our upcoming architecture update. While that work is in progress, we are improving our monitoring and alerting capabilities and optimizing our backup failover process and tools.