Increased HTTPS error rates
Incident Report for Infura
Postmortem

On 2020-10-15 at 20:35 UTC we deployed a change to our JSONRPC routing system, which is used primarily by the HTTPS Ethereum endpoints. This change had an unexpected impact on the autoscaling subsystems that decide how many resources we dedicate to JSONRPC routing, which led to a gradual reduction of routing capacity over the next 39 minutes. At 21:14 UTC this reduction in capacity started to affect end user connections, and at 21:17 UTC our on-call engineers were alerted to the issue. At 21:21 UTC the issue was identified as a lack of compute capacity. At 21:29 UTC the root cause was determined to be related to the change deployed at 20:35 UTC, and a rollback to the previous state was initiated. By 21:34 UTC the rollback was complete and compute capacity was restored, and all alarms regarding user-facing issues had recovered by 21:36 UTC.
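
For readers who want to check the intervals in the timeline above, the following is a minimal Python sketch that tallies them. The timestamps come from this report; the event labels and the helper names (`utc`, `TIMELINE`, `minutes_between`) are our own illustration and do not describe any Infura tooling. Treating 21:14 to 21:36 UTC as the user-facing impact window is likewise an assumption based on the times stated above.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp on 2020-10-15 as a timezone-aware UTC datetime."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2020, 10, 15, hour, minute, tzinfo=timezone.utc)


# Timestamps are taken from the report; the event labels are hypothetical.
TIMELINE = {
    "deploy": utc("20:35"),
    "user_impact": utc("21:14"),
    "alerted": utc("21:17"),
    "identified": utc("21:21"),
    "rollback_started": utc("21:29"),
    "rollback_complete": utc("21:34"),
    "alarms_recovered": utc("21:36"),
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two labeled timeline events."""
    delta = TIMELINE[end] - TIMELINE[start]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("deploy to user impact:", minutes_between("deploy", "user_impact"))
    print("user impact to alert:", minutes_between("user_impact", "alerted"))
    print("rollback duration:", minutes_between("rollback_started", "rollback_complete"))
    print("user impact to recovery:", minutes_between("user_impact", "alarms_recovered"))
```

Under these assumptions, the gap between the deployment and the first user impact matches the 39 minutes stated above, the alert fired 3 minutes after user impact began, and the user-facing impact lasted roughly 22 minutes.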

During our initial triage, we believed this issue only affected Ethereum JSONRPC over HTTP. However, upon further review, we have determined that the lack of capacity also affected Ethereum JSONRPC WebSocket connections that were actively sending JSONRPC requests during this time period. Sending a JSONRPC request over such a connection during the incident could result in a WebSocket disconnect. WebSockets that were used only for receiving existing subscription notifications during this time period should not have been impacted.

The root issue with the original deployment has also been identified, and additional safeguards have been put in place to avoid similar unforeseen interactions between our routing and autoscaling subsystems in the future.

Posted Oct 16, 2020 - 16:27 UTC

Resolved
This issue has been resolved. During the affected time period, an issue with erroneous autoscaling rules resulted in a loss of production capacity for our Ethereum HTTPS endpoints. The autoscaling change was reverted, and the root cause has been identified for future rollouts.
Posted Oct 15, 2020 - 21:50 UTC
Identified
We have identified an issue with increased HTTPS error rates on our Ethereum endpoints and are rolling out a fix.
Posted Oct 15, 2020 - 21:29 UTC
This incident affected: Infura Ethereum API (Mainnet HTTPS JSON-RPC API).