Summary

S2 - https://harmonyone.pagerduty.com/incidents/Q1WXHGZB39IFPC - 11 min down

S1 - https://harmonyone.pagerduty.com/incidents/Q039HJKM2PHRMH - 47 min down

Both incidents happened at the same time, at the epoch change. Both shards eventually recovered without any special intervention, simply by waiting for the validators to catch up their beacon shard DB.

During S1 troubleshooting, the internal validator's beacon chain block was at 24739505 and the next epoch block was at 24739840. Network voting power was at 59%, which explains why we lost consensus. Once the block height hit the epoch block and we saw that consensus was back, voting power was at 98%. In addition to the internal nodes, external nodes were definitely having the same issue.
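The figures above can be checked with a short calculation. This is a minimal sketch, not Harmony code: the function names `has_quorum` and `blocks_until` are hypothetical, and treating the threshold as "strictly more than two thirds" is an assumption based on the 66.66% figure quoted in this report.

```python
from fractions import Fraction

# Assumed quorum threshold: strictly more than 2/3 of total voting power
# (the 66.66% figure referenced in this report).
QUORUM = Fraction(2, 3)


def has_quorum(voting_power_pct):
    """Return True if the given voting power (in percent) exceeds 2/3."""
    return Fraction(str(voting_power_pct)) / 100 > QUORUM


def blocks_until(current_block, target_block):
    """Return how many blocks remain before target_block (never negative)."""
    return max(target_block - current_block, 0)


# Values observed during S1 troubleshooting.
print(has_quorum(59))                    # voting power during the outage
print(has_quorum(98))                    # voting power after recovery
print(blocks_until(24739505, 24739840))  # gap to the next epoch block
```

With the observed values, 59% falls short of the two-thirds threshold, 98% clears it, and the internal validator was 335 blocks away from the next epoch block.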

Root Cause Analysis - 5 Why’s

S1 and S2 consensus loss

Why did S1 and S2 lose consensus?

Why did shard voting power go below 66.66%?

Why were validator nodes lagging behind?

Action items

  1. Communication
  1. Network spam
  1. Improve sync
  1. Monitoring