Stale mainnet performance cache
Incident Report for Infura
Postmortem

At Infura, we always strive for 100% data accuracy and service availability. When we fall short of that, we want our users to be aware and to understand how we are learning from previous incidents so that the quality of our service continues to improve.

During the 33 minutes of this incident, our API was returning stale block data, with most requests stalled at block 8795342.
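
From a client's perspective, the stall would have looked like repeated eth_blockNumber calls returning the same height. Below is a minimal, hypothetical client-side staleness check in Python; the endpoint URL and thresholds are placeholders, not Infura recommendations:

import json
import time
import urllib.request

RPC_URL = "https://mainnet.infura.io/v3/<your-project-id>"  # placeholder endpoint
MAX_STALL_SECONDS = 60  # mainnet blocks normally arrive every ~13-15 seconds

def latest_block(url: str) -> int:
    # Ask the endpoint for the current head block via eth_blockNumber.
    payload = json.dumps({
        "jsonrpc": "2.0",
        "method": "eth_blockNumber",
        "params": [],
        "id": 1,
    }).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return int(json.load(resp)["result"], 16)

def watch(url: str = RPC_URL) -> None:
    # Warn if the reported head stops advancing for longer than expected.
    last_height = latest_block(url)
    last_change = time.time()
    while True:
        time.sleep(15)
        height = latest_block(url)
        if height > last_height:
            last_height, last_change = height, time.time()
        elif time.time() - last_change > MAX_STALL_SECONDS:
            print(f"head appears stalled at block {last_height}")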

We use a high-performance datastore to hold near-head data, optimizing for data retrieval latency and consistency. This datastore serves as a common backend for frequently used request methods such as eth_blockNumber and eth_getBlockByNumber, as well as state-accessing calls such as eth_call. In the event of a degradation of our primary datastore, we have hot-standby infrastructure in place, but the current process for promoting the standby requires manual approval steps. Because of this, our time to recovery is not as quick as it could be.
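
As a rough illustration of that read path, the sketch below assumes a simple in-memory near-head cache sitting in front of a full-node client. The class and method names are hypothetical and do not describe our actual implementation:

from typing import Optional

class NearHeadCache:
    """Low-latency store holding blocks (and state) close to the chain head."""
    def __init__(self) -> None:
        self._blocks: dict[int, dict] = {}

    def put_block(self, number: int, block: dict) -> None:
        self._blocks[number] = block

    def get_block(self, number: int) -> Optional[dict]:
        return self._blocks.get(number)

class RpcBackend:
    """Answers common RPC methods from the cache, falling back to a full node."""
    def __init__(self, cache: NearHeadCache, node_client) -> None:
        self._cache = cache
        self._node = node_client

    def get_block_by_number(self, number: int) -> dict:
        cached = self._cache.get_block(number)
        if cached is not None:
            return cached                      # fast path: performance cache hit
        return self._node.get_block(number)    # fallback: full-node lookup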

Going forward, we are going to prioritize work on automating this failover process. While failures occur in any system, automated remediation should shield our users from being affected when these events occur.
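
As a minimal sketch of what that automation could look like, assume a health check that compares the primary cache's reported head against the live chain head and promotes the standby when the gap grows too large. The interfaces, names, and thresholds below are illustrative assumptions, not a description of our production system:

import time

MAX_HEAD_LAG_BLOCKS = 5   # promote the standby if the primary falls this far behind
CHECK_INTERVAL_SECONDS = 15

def monitor_and_failover(primary, standby, chain, router) -> None:
    """Promote the hot standby automatically instead of waiting for manual approval."""
    while True:
        lag = chain.head_block() - primary.head_block()
        if lag > MAX_HEAD_LAG_BLOCKS:
            router.route_reads_to(standby)  # shift read traffic to the standby datastore
            page_oncall(f"primary cache lagging by {lag} blocks; standby promoted")
            return
        time.sleep(CHECK_INTERVAL_SECONDS)

def page_oncall(message: str) -> None:
    print(message)  # stand-in for a real paging/alerting integration

The important difference from today's process is that traffic shifts as soon as the health check trips, rather than after a manual approval step.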

Additionally, we will strive to share updates in real time as an incident is occurring, so you have as much information as possible to keep your team and your users informed.

As we roll out improvements to our infrastructure, we will keep you informed via our blog.

Posted Oct 24, 2019 - 00:54 UTC

Resolved
At 07:17 UTC our internal monitoring systems detected that the backend storage system used for our near-head performance cache had degraded, and our on-call team was notified.

At 07:22 UTC our team began triaging the incident and, after confirming that a failover to the backup infrastructure was appropriate, began routing traffic to our redundant infrastructure.

At 07:23 UTC requests for some RPC methods began stalling at block 8795342.

At 07:57 UTC our on-call team completed the failover to the backup infrastructure, which began serving mainnet traffic, and the incident was resolved.
Posted Oct 23, 2019 - 21:50 UTC