I would like to discuss the uptime and scaling of Harmony's services.
Since the launch of the Harmony protocol in June 2019, Harmony has grown from its infancy (June 2019 to June 2020), through a little-kid period (June 2020 to June 2021), and into its teenage period (June 2021 to now). At each age we faced different issues: consensus getting stuck, the network going down, p2p spamming attacks, and network recovery taking too long. Today, the most prominent issue is growing pains. With so many users, an ever-growing set of ecosystem projects, faster block times, and a higher number of transactions, our web2 infrastructure sometimes cannot keep up with the scaling demands. We have to provide higher uptime and scale our service infrastructure in order to keep serving our users and ecosystem projects smoothly. Of course, protocol-level enhancement cannot stop, as it is the foundation of our technology. But without a scalable web2 service infrastructure, end users will have a poor perception of the blockchain and will be disappointed. It can become the weakest link in the entire protocol.
Daniel's document shows the impact and importance of the web2 infrastructure.
First, I would like to share a few principles (tenets).
The first is scalability. Our infrastructure has to scale to meet the upcoming 10x or 100x growth in blockchain applications and users. Patching the existing codebase will not work, because it was designed as a single-node solution; a new framework has to be adopted. We have to learn the lessons of web2 scaling solutions.
The second is security. Blockchain is an open ledger technology and carries digital assets worth millions of dollars. Security should always be the first priority, and any potential risk should be treated as a blocker. The service infrastructure also needs to account for security measures on existing services and on any new ones.
The third is frugality. Frugality doesn't mean we can't spend money. It means we need to get the most out of the money we spend. Again, simply adding more nodes may solve the problem only temporarily. For the longer term, we need to think outside the box. The explorer v2 design is a perfect example of how a new architecture can solve an issue that has been plaguing us.
There are a few urgent issues in our service infrastructure that we should discuss.
I proposed a new architecture back in July. Here is the link to the proposal.
In short, the solution is to separate DB syncing from the RPC service nodes, using separate writer and reader instances so that read operations can scale independently. The writer keeps syncing and writing to the cloud DB, while the readers read from the DB and serve all the RPC requests.
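To make the writer/reader split concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the design from the proposal: an in-memory `SharedBlockStore` stands in for the cloud DB, and the class names, the method names `blockNumber` and `getBlockByNumber`, and the round-robin `ReaderPool` are all hypothetical.

```python
import itertools
import threading


class SharedBlockStore:
    """In-memory stand-in for the shared cloud DB (hypothetical)."""

    def __init__(self):
        self._blocks = {}
        self._lock = threading.Lock()

    def put_block(self, number, block):
        with self._lock:
            self._blocks[number] = block

    def get_block(self, number):
        with self._lock:
            return self._blocks.get(number)

    def latest_number(self):
        with self._lock:
            return max(self._blocks) if self._blocks else None


class Writer:
    """The only component that syncs the chain and writes to the store."""

    def __init__(self, store):
        self.store = store

    def sync(self, blocks):
        for block in blocks:
            self.store.put_block(block["number"], block)


class Reader:
    """RPC server that never syncs; it only reads from the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, method, params):
        # Method names here are illustrative, not Harmony's real RPC names.
        if method == "getBlockByNumber":
            return self.store.get_block(params[0])
        if method == "blockNumber":
            return self.store.latest_number()
        raise ValueError(f"unsupported method: {method}")


class ReaderPool:
    """Round-robin dispatch; adding readers adds read capacity."""

    def __init__(self, readers):
        self._cycle = itertools.cycle(readers)

    def handle(self, method, params=()):
        reader = next(self._cycle)
        return reader.name, reader.handle(method, params)


store = SharedBlockStore()
# One writer syncs three fake blocks into the shared store.
Writer(store).sync({"number": n, "hash": f"0x{n:064x}"} for n in range(3))
# Two readers serve requests from the same store.
pool = ReaderPool([Reader(f"reader-{i}", store) for i in range(2)])
```

Because the readers hold no chain state of their own, more of them can be added behind the pool without touching the writer, which is the property the split is meant to provide.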