Erigon Quick Fix Leads to Over 1,000% RPS Improvement!

A story of how our developer helped make Erigon better for everyone.
Written by Jiří Makovský
August 3, 2024
4 min. read

Introduction

While working to improve our product, one of our developers, Tomáš Holcman, ran into a problem with Erigon, a high-performance client for Ethereum, Polygon, and more. He noticed that the RPC daemon handled only a very low number of requests per second (RPS) even under high concurrency, which pointed to a deeper issue in how the system handles parallel operations.

Because work related to blockchain technology is a collaborative effort, he took the issue to GitHub, where he and others started working on diagnostics and solutions. This is the story of how it went and how we helped make Ethereum better for everyone.

The Issue, Diagnosis, and Solution

Tomáš began by running a series of performance tests against Erigon's eth_getTransactionReceipt RPC call, requesting random transaction receipts from the chain's history. Despite adjusting various parameters, including --db.read.concurrency, throughput did not scale as expected with increasing read concurrency. CPU utilization and IOPS usage stayed minimal, so the hardware was clearly not the limiting factor.

--db.read.concurrency 1 => 22.04 RPS
--db.read.concurrency 2 => 38.58 RPS (+75%)
--db.read.concurrency 5 => 43.99 RPS (+14%)
--db.read.concurrency 20000 => 53.56 RPS
--db.read.concurrency 200000 => 58.56 RPS

Some of the growth above is due to caching, so the numbers are not completely accurate, but they clearly show that there is an issue. Values of 20,000 and 200,000 make no sense for normal operation; support recommended them only to help with diagnosis.
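
To put the figures above in perspective, the short Go sketch below divides each measured RPS by the single-reader baseline. The numbers are copied from the test run; the `sample` type and `speedup` helper are names made up for this illustration.

```go
package main

import "fmt"

// sample pairs a --db.read.concurrency value with the RPS measured for it.
type sample struct {
	concurrency int
	rps         float64
}

// speedup reports how many times faster s is than the baseline run.
func speedup(base, s sample) float64 {
	return s.rps / base.rps
}

func main() {
	// Measurements copied from the test run above.
	samples := []sample{
		{1, 22.04},
		{2, 38.58},
		{5, 43.99},
		{20000, 53.56},
		{200000, 58.56},
	}
	base := samples[0]
	for _, s := range samples[1:] {
		fmt.Printf("concurrency %6d: %.2fx the single-reader throughput\n",
			s.concurrency, speedup(base, s))
	}
}
```

Five readers deliver only about twice the throughput of one, and even the absurdly large settings stay below three times the baseline, which is what pointed to contention rather than a hardware limit.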

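Multiple tests were run. The results showed that increasing concurrency improved throughput only slightly while significantly slowing down response times, with many requests failing due to timeouts. In the run below, at a target rate of 100 requests per second with a 1-second timeout, only 6.83% of requests succeeded. The bottleneck was somewhere within the Erigon codebase.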

$ cat ... | vegeta attack ... -rate=100 -timeout 1s
Requests      [total, rate, throughput]         3000, 100.03, 6.61
Latencies     [min, mean, 50, 90, 95, 99, max]  848.307µs, 990.396ms, 1s, 1.001s, 1.001s, 1.001s, 1.002s
Success       [ratio]                           6.83%
Status Codes  [code:count]                      0:2795  200:205
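
The request stream piped into the load tester is elided above. For orientation only, here is a minimal Go sketch of what a single eth_getTransactionReceipt request body looks like as a JSON-RPC 2.0 call. The `rpcRequest` type, the `receiptRequest` helper, and the all-zero hash are placeholders for illustration, not the requests used in the actual test.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// rpcRequest is a JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string   `json:"jsonrpc"`
	ID      int      `json:"id"`
	Method  string   `json:"method"`
	Params  []string `json:"params"`
}

// receiptRequest builds the body of an eth_getTransactionReceipt call
// for the given transaction hash.
func receiptRequest(id int, txHash string) ([]byte, error) {
	return json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "eth_getTransactionReceipt",
		Params:  []string{txHash},
	})
}

func main() {
	// Placeholder hash (32 zero bytes), not a real transaction.
	hash := "0x" + strings.Repeat("00", 32)
	body, err := receiptRequest(1, hash)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

A test like the one described sends many such bodies with different, randomly chosen hashes, so that each request has to read a different part of the chain's history.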

To find the cause, we dove into the code. We discovered that the View() function in RoSnapshots acquired an exclusive read-write lock even though it was only reading files. Because an exclusive lock admits one holder at a time, concurrent reads were forced to run one after another, which caused the bottleneck.

The fix was to use read locks (RLock) instead of exclusive read-write locks, which allows multiple reads to proceed concurrently and immensely boosted performance. The locking was then narrowed further: each read locks only the specific file type it needs rather than all of them, and the lock is released immediately after the read to shorten the time it is held. These changes improved throughput even more.
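
The difference between the two kinds of lock is easy to reproduce outside Erigon. The following is a minimal sketch using Go's standard `sync.RWMutex`; the `runReaders` and `compare` helpers and the sleep that stands in for a file read are assumptions made for illustration, not Erigon code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runReaders starts n goroutines that each hold the lock for d while
// "reading", and returns the total wall-clock time. acquire and release
// select which side of the RWMutex is used.
func runReaders(n int, d time.Duration, acquire, release func()) time.Duration {
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			acquire()
			defer release()
			time.Sleep(d) // stand-in for reading a snapshot file
		}()
	}
	wg.Wait()
	return time.Since(start)
}

// compare runs the same workload under the exclusive and the shared side
// of one RWMutex and returns both wall-clock times.
func compare(readers, holdMillis int) (exclusive, shared time.Duration) {
	var mu sync.RWMutex
	hold := time.Duration(holdMillis) * time.Millisecond
	exclusive = runReaders(readers, hold, mu.Lock, mu.Unlock)
	shared = runReaders(readers, hold, mu.RLock, mu.RUnlock)
	return exclusive, shared
}

func main() {
	exclusive, shared := compare(8, 10)
	fmt.Println("exclusive Lock:", exclusive)
	fmt.Println("shared RLock:  ", shared)
}
```

With Lock, the eight readers queue up and the total time grows with the number of readers. With RLock, they overlap and the total stays close to the duration of a single read. That is the same effect, in miniature, as the flat RPS curve seen in the tests.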

Before the change:

func (r *BlockReader) blockWithSenders(...) (...) {
	// Acquire RW locks for all file types
	view := r.sn.View()
	// Read Headers
	// other code, where we don't need the files to be locked
	// Read Bodies
	// more code, where we don't need the files to be locked
	// Read Transactions
	// a lot of code, where we don't need the files to be locked
	// Release all locks just before returning the result
	view.Close()
	return block, senders, nil
}

After the change:

func (r *BlockReader) blockWithSenders(...) (...) {
	// Acquire read only lock specifically for Headers
	seg, ok, release := r.sn.ViewSingleFile(coresnaptype.Headers, blockHeight)
	// Read the Headers
	// Immediately release the Headers lock
	release()
	// Other code, where we don't need the files to be locked
	// Acquire lock for Bodies
	bodySeg, ok, release := r.sn.ViewSingleFile(coresnaptype.Bodies, blockHeight)
	...
	release()
	...
	// Acquire lock for Transactions
	txnSeg, ok, release := r.sn.ViewSingleFile(coresnaptype.Transactions, blockHeight)
	...
	release()
	...
	return block, senders, nil
}
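
The pattern in the snippet above, one read lock per file type plus a release function handed back to the caller, can be sketched in a self-contained form. Everything here (the `store` and `segments` types, `viewSingleFile`, and the file names) is a hypothetical toy model for illustration and does not reproduce Erigon's actual RoSnapshots implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// fileType is a hypothetical stand-in for Erigon's snapshot types.
type fileType int

const (
	headers fileType = iota
	bodies
	transactions
)

// segments guards the files of one type with its own RWMutex.
type segments struct {
	mu    sync.RWMutex
	files []string
}

// store is a toy model of a snapshot store.
type store struct {
	byType map[fileType]*segments
}

func newStore() *store {
	return &store{byType: map[fileType]*segments{
		headers:      {files: []string{"headers-0.seg"}},
		bodies:       {files: []string{"bodies-0.seg"}},
		transactions: {files: []string{"transactions-0.seg"}},
	}}
}

// viewSingleFile takes a read lock on one file type only and returns the
// file at idx, whether it exists, and a function that drops the lock.
// The release function is always safe to call, even when ok is false.
func (s *store) viewSingleFile(t fileType, idx int) (file string, ok bool, release func()) {
	seg, found := s.byType[t]
	if !found {
		return "", false, func() {}
	}
	seg.mu.RLock()
	if idx < 0 || idx >= len(seg.files) {
		seg.mu.RUnlock()
		return "", false, func() {}
	}
	return seg.files[idx], true, seg.mu.RUnlock
}

func main() {
	s := newStore()
	for _, t := range []fileType{headers, bodies, transactions} {
		file, ok, release := s.viewSingleFile(t, 0)
		if ok {
			fmt.Println("reading", file)
		}
		release() // drop the lock before doing any unrelated work
	}
}
```

Returning the release function keeps the lock scope visible at the call site: the caller decides exactly when the read is finished, and other file types are never locked at all.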

Throughput went from 75 RPS to over 850 RPS, roughly an 11-fold increase, measured on our standardized test suite with the same node configuration, hardware, and conditions. CPU utilization reached 75% and read IOPS peaked at 25k, so the hardware was finally being used as it should be for these operations. As is typical with open-source projects, Tomáš then reached out to the community for feedback and further refinement.

Collaborative Efforts

A fellow collaborator confirmed the necessity of using RLock and suggested additional tests. Together, they refined the changes and prepared them for wider adoption, opening the relevant pull requests for the community to review.

The pull requests will be reviewed and then included in future updates. That is how Tatum made Ethereum a slightly better experience for everyone developing or testing on it. Erigon, the node affected by the issue, is used for BSC and Polygon too. For detailed information and the latest developments, visit the ErigonTech GitHub repository.
