statediff: Use a worker pool #43

Merged
telackey merged 10 commits from 41-statediff-workerpool into v1.9.24-statediff 2020-11-25 10:29:32 +00:00
Member

Changes the statediff service to spawn a pool of workers, each running an independent `WriteLoop` (a rough sketch of the fan-out pattern follows below).

  • The number of workers is configured with a new `statediff.workers` CLI flag.
  • Expands the lastBlock cache into a map of hashes to blocks, with a maximum size limited to the number of workers.
  • Geth uses an internal cache for receipts and other data; this causes a data race to be detected inside `receipts.DeriveFields()` when used during indexing, though I didn't debug deeply enough to find out why the same receipts data would ever be touched by goroutines that should only be working on separate blocks.
  • There is still a data race issue in the Trie data access, ~~probably due to the `state.Database` hitting its cache limit~~ (a cache miss shouldn't cause this).

Resolves https://github.com/vulcanize/go-ethereum/issues/41
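As a rough illustration of the fan-out described above (not the service's actual code), one way the worker pool could be wired is sketched below; `ChainEvent`, `writeLoop`, and `numWorkers` are placeholder names standing in for the real types and flag value:

```go
package main

import (
	"fmt"
	"sync"
)

// ChainEvent stands in for core.ChainEvent; only the block number matters here.
type ChainEvent struct {
	BlockNumber uint64
}

// writeLoop is the per-worker loop; the real WriteLoop would build and write
// the state diff for each block it receives.
func writeLoop(id int, events <-chan ChainEvent, wg *sync.WaitGroup) {
	defer wg.Done()
	for ev := range events {
		fmt.Printf("worker %d processing block %d\n", id, ev.BlockNumber)
	}
}

func main() {
	numWorkers := 4 // would come from the statediff.workers CLI flag
	events := make(chan ChainEvent, numWorkers)

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go writeLoop(i, events, &wg)
	}

	// Stand-in for the chain event subscription feeding the workers.
	for n := uint64(1); n <= 10; n++ {
		events <- ChainEvent{BlockNumber: n}
	}
	close(events)
	wg.Wait()
}
```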

Author
Member

Running this on a fresh full chain, I'm not able to reproduce the trie node errors I was seeing before, so they may not be directly related - removing the WIP tag.

i-norden requested changes 2020-11-24 08:47:05 +00:00
i-norden left a comment
Member

Looks great! A few comments but only one or two changes. Running locally everything seems to be working (ran with 4 and 8 workers, around block 80,000). I do see some errors on shutdown that I think are new but also probably inconsequential:

WARN [11-24|02:40:55.972] Error from chain event subscription error=nil
WARN [11-24|02:40:56.031] Error from chain event subscription error=nil
WARN [11-24|02:40:56.032] Error from chain event subscription error=nil

Are you still seeing a Trie data race issue on your end?

I haven't benchmarked, but it would be good to get a rough estimate of the performance impact.
Member

I know you can't do it with rlp encoding/decoding since it only [encodes the consensus fields](https://github.com/vulcanize/go-ethereum/blob/master/core/types/receipt.go#L130), but I wonder if you could copy by marshaling the receipt to JSON bytes and then unmarshalling into a new `types.Receipt`.

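A hedged, untested sketch of that JSON round-trip idea; it relies on `types.Receipt`'s generated JSON marshalling, and only fields included in that encoding would survive the copy:

```go
package statediff // illustrative package name

import (
	"encoding/json"

	"github.com/ethereum/go-ethereum/core/types"
)

// copyReceiptJSON deep-copies a receipt by marshalling it to JSON and
// unmarshalling the bytes into a fresh types.Receipt.
func copyReceiptJSON(rct *types.Receipt) (*types.Receipt, error) {
	data, err := json.Marshal(rct)
	if err != nil {
		return nil, err
	}
	cpy := new(types.Receipt)
	if err := json.Unmarshal(data, cpy); err != nil {
		return nil, err
	}
	return cpy, nil
}
```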
@@ -105,3 +116,3 @@
  SubscriptionTypes map[common.Hash]Params
  // Cache the last block so that we can avoid having to lookup the next block's parent
- lastBlock lastBlockCache
+ BlockCache blockCache
Member

I like the params getting passed together like this!

Member

I'm wondering if the cache is still beneficial with the concurrent approach given the additional complexity.

Member

I may be missing it, but I think we still need to cache the parentBlock returned by `bc.GetBlockByHash`?

Member

When integrating with the metrics we might want to have an intermediate `select` loop that logs the metric off the primary `core.ChainEvent` channel before passing the event along to a secondary worker queue channel here, if we want to avoid logging metrics from within worker goroutines.

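For concreteness, a minimal sketch of such an intermediate loop; the channel and counter names are placeholders rather than the service's actual types:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ChainEvent stands in for core.ChainEvent.
type ChainEvent struct{ BlockNumber uint64 }

// eventsReceived stands in for a real metrics counter.
var eventsReceived uint64

// forwardEvents records the metric once per event, off the primary channel,
// and then hands the event to the worker queue.
func forwardEvents(primary <-chan ChainEvent, workerQueue chan<- ChainEvent, quit <-chan struct{}) {
	defer close(workerQueue)
	for {
		select {
		case ev, ok := <-primary:
			if !ok {
				return
			}
			atomic.AddUint64(&eventsReceived, 1) // log/record the metric here, in one place
			workerQueue <- ev                    // then fan out to the workers
		case <-quit:
			return
		}
	}
}

func main() {
	primary := make(chan ChainEvent)
	workerQueue := make(chan ChainEvent, 4)
	quit := make(chan struct{})

	go forwardEvents(primary, workerQueue, quit)

	primary <- ChainEvent{BlockNumber: 1}
	fmt.Println((<-workerQueue).BlockNumber, atomic.LoadUint64(&eventsReceived))
	close(quit)
}
```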
Member

Brackets and comment can go I think

Member

This is simpler and appears to be working when I run locally:

```go
// Shallow-copy the receipt into a fresh struct before appending.
newRct := new(types.Receipt)
*newRct = *rct
receipts = append(receipts, newRct)
```
roysc reviewed 2020-11-24 13:54:09 +00:00
Author
Member

I checked and the hit/access ratio is very close to 1, so it does still seem to be helping, I can verify further with benchmarks.

roysc reviewed 2020-11-24 13:58:16 +00:00
Author
Member

My reasoning was that two blocks shouldn't be processed with the same parent block, so we just cache the current one so it's available when its child block gets processed. I did some logging to verify this - blocks are accessed no more than once. Also, the number of misses is so low that hardly any parents would get added to the cache (since we would only add them on a miss).

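For illustration, a bounded parent-block cache along these lines might look like the sketch below; the type and method names are hypothetical, not the PR's exact implementation:

```go
package blockcache

import (
	"sync"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

// blockCache is a mutex-guarded map keyed by hash, capped at the worker count.
type blockCache struct {
	sync.Mutex
	blocks  map[common.Hash]*types.Block
	maxSize int // limited to the number of workers
}

func newBlockCache(workers int) *blockCache {
	return &blockCache{blocks: make(map[common.Hash]*types.Block, workers), maxSize: workers}
}

// getParent returns the cached parent of block, falling back to lookup (e.g.
// BlockChain.GetBlockByHash) on a miss, then caches the current block so it
// is available when its child is processed.
func (bc *blockCache) getParent(block *types.Block, lookup func(common.Hash) *types.Block) *types.Block {
	bc.Lock()
	defer bc.Unlock()

	parent, ok := bc.blocks[block.ParentHash()]
	if !ok {
		parent = lookup(block.ParentHash())
	}
	// Evict an arbitrary entry if at capacity, then cache the current block.
	if len(bc.blocks) >= bc.maxSize {
		for h := range bc.blocks {
			delete(bc.blocks, h)
			break
		}
	}
	bc.blocks[block.Hash()] = block
	return parent
}
```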
roysc reviewed 2020-11-24 14:04:34 +00:00
Author
Member

oh yeah, meant to clean it up

roysc reviewed 2020-11-24 16:05:47 +00:00
Author
Member

It's odd - I was getting a data race detected at various points in [this loop](https://github.com/vulcanize/go-ethereum/blob/v1.9.24-statediff/core/types/receipt.go#L306), but now I can't reproduce it even on the commit where it was happening. That was the reason for the deep copy before, but now it seems even the shallow copy is not necessary, so I can revert those changes.

roysc reviewed 2020-11-24 16:08:58 +00:00
Author
Member

Done, after rebasing on [this branch](https://github.com/vulcanize/go-ethereum/pull/44).

i-norden reviewed 2020-11-25 02:03:05 +00:00
Member

Oh that's awesome, thanks for checking that!

i-norden reviewed 2020-11-25 02:11:50 +00:00
Member

Ah yeah that makes sense, and is generally true. The only time this wouldn't be true is when there are reorgs while syncing at the head of the chain (we'll never see it happen during chain import, e.g. during any of our local tests). I think in that case it is still better to reach past the cache to retrieve the block than to waste cache space on the off chance.

i-norden approved these changes 2020-11-25 02:17:54 +00:00
i-norden left a comment
Member

LGTM! Feel free to merge when you are ready.

Author
Member

The chain event subscription error seems to be from the workers receiving the close of the error channel before the quit channel. Like you said, doesn't seem to indicate a real problem, though.

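A minimal illustration of that ordering (not the service's actual code): if the subscription's error channel is closed before the quit channel, the worker's `select` can receive the error channel's zero value first, which would produce the benign `error=nil` warning:

```go
package main

import "fmt"

// worker mimics the shutdown path: receiving from a closed error channel
// yields the zero value (nil), which is what produces the error=nil warning.
func worker(errCh <-chan error, quitCh <-chan struct{}) {
	select {
	case err := <-errCh:
		fmt.Println("Error from chain event subscription, error =", err) // err is nil when errCh was closed
	case <-quitCh:
		fmt.Println("quitting statediff worker")
	}
}

func main() {
	errCh := make(chan error)
	quitCh := make(chan struct{})

	done := make(chan struct{})
	go func() {
		worker(errCh, quitCh)
		close(done)
	}()

	// Close the error channel before the quit channel; the worker will most
	// likely take the error case and log a nil error.
	close(errCh)
	close(quitCh)
	<-done
}
```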
Reference: cerc-io/go-ethereum#43