Stale mainnet performance cache
Incident Report for Infura
Postmortem

We are sharing this timeline of yesterday's outage so that teams can better understand how their systems were affected.

At 19:33 UTC 10/31/2019, our internal monitoring systems detected that the backend storage system used for our near-head data performance cache was degraded, and our on-call team was notified.

At 20:00 UTC 10/31/2019, our on-call team completed the failover to backup infrastructure, which began serving mainnet traffic, and the incident was resolved.

At 00:14 UTC 11/1/2019, our internal monitoring systems detected that the backup storage system for our near-head data was degraded, and our on-call team was notified.

At 00:54 UTC 11/1/2019, our on-call team completed the failover to backup infrastructure, which began serving mainnet traffic but ran in a degraded state (users could see inconsistent block heights across RPC methods).

At 08:19 UTC 11/1/2019, a new storage system came online and began serving mainnet traffic, and the incident was resolved.

As we mentioned in our previous post-mortem, our team is actively working on moving away from our current storage system in our next architecture update, which is coming soon. While this work is ongoing, we are improving our monitoring and alerting capabilities and optimizing our backup failover process and tools.
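
For teams that want to detect the cross-method block height inconsistency described in the timeline from their own side, the following is a minimal client-side sketch, not a description of our internal tooling. It compares the height returned by eth_blockNumber with the number of the block returned by eth_getBlockByNumber("latest"), both standard Ethereum JSON-RPC methods. The helper names (make_http_rpc, head_height_gap, is_consistent) and the max_gap threshold are hypothetical and chosen for illustration. A gap of one block can occur normally when a new block arrives between the two calls.

```python
import json
import urllib.request
from typing import Any, Callable, List

# A callable that performs one JSON-RPC request: (method, params) -> result.
RpcCall = Callable[[str, List[Any]], Any]


def make_http_rpc(url: str, timeout: float = 10.0) -> RpcCall:
    """Return a function that POSTs JSON-RPC 2.0 requests to `url`.

    `url` is whatever endpoint you already use; none is assumed here.
    """
    def call(method: str, params: List[Any]) -> Any:
        payload = json.dumps(
            {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
        ).encode("utf-8")
        request = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = json.load(response)
        if "error" in body:
            raise RuntimeError(f"JSON-RPC error: {body['error']}")
        return body["result"]

    return call


def head_height_gap(call: RpcCall) -> int:
    """Height from eth_blockNumber minus the height of the 'latest' block."""
    reported = int(call("eth_blockNumber", []), 16)
    latest = call("eth_getBlockByNumber", ["latest", False])
    if latest is None:
        raise RuntimeError("eth_getBlockByNumber('latest') returned no block")
    return reported - int(latest["number"], 16)


def is_consistent(call: RpcCall, max_gap: int = 1) -> bool:
    """True if the two methods agree to within `max_gap` blocks.

    max_gap is an illustrative tolerance, not a recommended value.
    """
    return abs(head_height_gap(call)) <= max_gap
```

A caller would build the transport once with make_http_rpc(endpoint_url) and pass it to is_consistent before relying on data that combines results from several methods. Injecting the transport as a function keeps the check easy to exercise against a stub.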

Posted Nov 02, 2019 - 02:08 UTC

Resolved
This incident has been resolved.
Posted Nov 01, 2019 - 08:32 UTC
Update
We have confirmed that the fix for the near-head performance cache has resolved the initial errors. Some users may see increased data propagation latency between RPC method types until our backup systems come fully online.
Posted Nov 01, 2019 - 01:11 UTC
Monitoring
We are seeing elevated errors related to near-head RPC traffic. Our team has implemented and rolled out a fix, and we are continuing to monitor error rates, which are beginning to return to normal.
Posted Nov 01, 2019 - 00:52 UTC
This incident affected: Mainnet (JSON-RPC API).