Write state diffs directly to Postgres #30

Merged
telackey merged 19 commits from statediff-into-postgres into statediff_at_anyblock-1.9.11 2020-11-11 16:26:25 +00:00
Member

This provides a new RPC method statediff_WriteStateDiffAt to write diffs directly into PostgreSQL upon client request (per https://github.com/vulcanize/eth-statediff-service/issues/11). To do this, I copied over and refactored some of the tooling from https://github.com/vulcanize/ipld-eth-indexer.

@i-norden How should we merge the code hash & storage output from https://github.com/vulcanize/go-ethereum/pull/27? I suppose it depends on how that data is consumed.

More minor questions:

  • I added three new CLI parameters to pass in the DB connection URI, node ID, and client name. It may be cleaner to pass these as a single comma separated string or even a separate toml file.
  • Do we still want to configure the max open/idle/lifetime DB parameters, or leave these at default for now?
  • "Transformer" is not really appropriate anymore, maybe DBWriter?
  • I'm not sure it will be possible to perform writes concurrently and still wrap them in an atomic transaction - I suspect not, but haven't found a definitive answer.
This provides a new RPC method `statediff_WriteStateDiffAt` to write diffs directly into PostgreSQL upon client request (per https://github.com/vulcanize/eth-statediff-service/issues/11). To do this, I copied over and refactored some of the tooling from https://github.com/vulcanize/ipld-eth-indexer. @i-norden How should we merge the code hash & storage output from https://github.com/vulcanize/go-ethereum/pull/27? I suppose it depends on how that data is consumed. More minor questions: * I added three new CLI parameters to pass in the DB connection URI, node ID, and client name. It may be cleaner to pass these as a single comma separated string or even a separate toml file. * Do we still want to configure the max open/idle/lifetime DB parameters, or leave these at default for now? * "Transformer" is not really appropriate anymore, maybe DBWriter? * I'm not sure it will be possible to perform writes concurrently and still wrap them in an atomic transaction - I suspect not, but haven't found a definitive answer.
i-norden reviewed 2020-10-22 03:23:51 +00:00
i-norden left a comment
Member

I still need a chance to take a closer look at the code but so far this looks great! In regards to those questions:

  • One option is to go ahead and rebase this on top of the latest state of statediff_at_anyblock-1.9.11, which shouldn't be too messy since we touched mostly different sections of code, and then merge this and apply the code=>codehash update in a separate PR.
  • I think the CLI params like that is good. If anything I'd recommend splitting the URI out into separate CLI flags (hostname, port, database, user, password).
  • I think leaving open/idle/lifetime as default is fine for now, and we can add stuff like that in other PRs to keep this one more focused (already a very significant change w/ large diff!)
  • Agree on Transformer name- DBWriter makes more sense. Or what do think about renaming the current Indexer to DBWriter/PGWriter and the Transformer to Indexer?
  • Yeah it's not possible to write to a single sql.Tx from within separate goroutines, unfortunately. So it would require some additional rework to be able to use the concurrent iterator, but I think it's still possible e.g. we could have all the separate goroutines of the concurrent builder/iterator send their processed StateNode objects out a single channel where a lone worker receives and writes them into a single transaction.
I still need a chance to take a closer look at the code but so far this looks great! In regards to those questions: * One option is to go ahead and rebase this on top of the latest state of statediff_at_anyblock-1.9.11, which shouldn't be _too_ messy since we touched mostly different sections of code, and then merge this and apply the code=>codehash update in a separate PR. * I think the CLI params like that is good. If anything I'd recommend splitting the URI out into separate CLI flags (hostname, port, database, user, password). * I think leaving open/idle/lifetime as default is fine for now, and we can add stuff like that in other PRs to keep this one more focused (already a very significant change w/ large diff!) * Agree on Transformer name- DBWriter makes more sense. Or what do think about renaming the current Indexer to DBWriter/PGWriter and the Transformer to Indexer? * Yeah it's not possible to write to a single sql.Tx from within separate goroutines, unfortunately. So it would require some additional rework to be able to use the concurrent iterator, but I think it's still possible e.g. we could have all the separate goroutines of the concurrent builder/iterator send their processed StateNode objects out a single channel where a lone worker receives and writes them into a single transaction.
Author
Member
  • Roger on rebasing for code/hash stuff
  • I'll leave the CLI like it is then, but if we change it I think I would vote for using a separate toml file just in the name of cleanliness of config handling (and allowing to reuse config files for the indexer service)
  • Agreed on naming, and in that case I may also want to rename the indexer's methods from index*CID to merge*CID or upsert*CID, to be more direct
  • Yeah, consuming thread + channels should work, and still improve performance. I would also really prefer to merge this and the standalone statediff service into a single codebase once we get to that point.
* Roger on rebasing for code/hash stuff * I'll leave the CLI like it is then, but if we change it I think I would vote for using a separate toml file just in the name of cleanliness of config handling (and allowing to reuse config files for the indexer service) * Agreed on naming, and in that case I may also want to rename the indexer's methods from `index*CID` to `merge*CID` or `upsert*CID`, to be more direct * Yeah, consuming thread + channels should work, and still improve performance. I would also really prefer to merge this and the standalone statediff service into a single codebase once we get to that point.
Owner

Yeah, consuming thread + channels should work, and still improve performance. I would also really prefer to merge this and the standalone statediff service into a single codebase once we get to that point.

Sooner rather than later, please.

>Yeah, consuming thread + channels should work, and still improve performance. I would also really prefer to merge this and the standalone statediff service into a single codebase once we get to that point. Sooner rather than later, please.
Member

Will try to finish up my review today @roysc

In regards to the CLI vs .toml config- I don't have a strong preference but for the sake of usability inside of a docker container in staging environments it is often more convenient to be able to pass config params in by CLI flag or env variable than by .toml file.

I might challenge the idea we should merge eth-statediff-service into here, as the standalone service binary is much smaller and quicker to build and it allows us to provide focused top-level README documentation specific to that service. But I think I could be convinced of the advantages of merging in here.

Will try to finish up my review today @roysc In regards to the CLI vs .toml config- I don't have a strong preference but for the sake of usability inside of a docker container in staging environments it is often more convenient to be able to pass config params in by CLI flag or env variable than by .toml file. I might challenge the idea we should merge eth-statediff-service into here, as the standalone service binary is much smaller and quicker to build and it allows us to provide focused top-level README documentation specific to that service. But I think I could be convinced of the advantages of merging in here.
Author
Member

Alright, I incorporated the SQL output tests from ipld-eth-indexer and otherwise this is pretty much covered by the existing test suite since it's behind the original Builder interface.

What remains -

  • handling the code & code hash, we could just copy over how it's handled in the indexer, but if that output is significant enough it can also be done iteratively
  • we will need to trigger these write calls from somewhere - I have some basic code in the ipld-eth-indexer to test it
  • update the metrics output

@AFDudley I can start on concurrency as soon as we merge this.

Alright, I incorporated the SQL output tests from `ipld-eth-indexer` and otherwise this is pretty much covered by the existing test suite since it's behind the original Builder interface. What remains - * handling the code & code hash, we could just copy over how it's handled in the indexer, but if that output is significant enough it can also be done iteratively * we will need to trigger these write calls from somewhere - I have some basic code in the `ipld-eth-indexer` to test it * update the metrics output @AFDudley I can start on concurrency as soon as we merge this.
i-norden requested changes 2020-10-26 13:04:09 +00:00
i-norden left a comment
Member

Looks great! The main thing is that we need to add a new service loop that automatically writes diffs at the head of chain. Only a couple small comments aside from that.

Eventually we'll want to figure out a better way of sharing packages between this and ipld-eth-indexer (use a 3rd repo for some stuff).

And eventually we should convert these ported tests away from ginkgo/gomega since go-ethereum uses the standard lib test package.

Looks great! The main thing is that we need to add a new service loop that automatically writes diffs at the head of chain. Only a couple small comments aside from that. Eventually we'll want to figure out a better way of sharing packages between this and ipld-eth-indexer (use a 3rd repo for some stuff). And eventually we should convert these ported tests away from ginkgo/gomega since go-ethereum uses the standard lib test package.
Member

I think it'd be better to define a new StateDiffParams struct or pass the three arguments explicitly rather than this dbParams array pointer, especially since it shows up as an argument in more than one function.

I think it'd be better to define a new StateDiffParams struct or pass the three arguments explicitly rather than this dbParams array pointer, especially since it shows up as an argument in more than one function.
@ -0,0 +22,4 @@
// Env variables
const (
DATABASE_NAME = "DATABASE_NAME"
Member

Ideally we would pass individual db connection string components in by CLI flag in a manner that is overridable by these env variables

Ideally we would pass individual db connection string components in by CLI flag in a manner that is overridable by these env variables
Member

We need another method that we turn on by CLI param, which automatically writes statediffs to Postgres as they stream off the chainEvents channel of the syncing chain. This will look almost exactly like Loop() method on line 154, except instead of sending the diff to subscribers it writes directly to Postgres.

We need another method that we turn on by CLI param, which automatically writes statediffs to Postgres as they stream off the chainEvents channel of the syncing chain. This will look almost exactly like `Loop()` method on line 154, except instead of sending the diff to subscribers it writes directly to Postgres.
i-norden reviewed 2020-10-26 13:04:54 +00:00
@ -0,0 +22,4 @@
// Env variables
const (
DATABASE_NAME = "DATABASE_NAME"
Member

Note: this doesn't need to be resolved now

Note: this doesn't need to be resolved now
roysc reviewed 2020-10-26 14:33:25 +00:00
Author
Member

Definitely - this was a bit of a punt, meant to write a TODO

Definitely - this was a bit of a punt, meant to write a TODO
roysc reviewed 2020-10-26 14:34:51 +00:00
Author
Member

Makes sense - would that just be triggered via CLI or also toggleable by an RPC call?

Makes sense - would that just be triggered via CLI or also toggleable by an RPC call?
i-norden reviewed 2020-10-26 16:53:37 +00:00
Member

Cool cool yeah this doesn't need to be done now

Cool cool yeah this doesn't need to be done now
i-norden reviewed 2020-10-26 16:53:47 +00:00
Member

I think just turn it on by CLI is fine for now

I think just turn it on by CLI is fine for now
Member

@roysc Btw I don't think we should try to port the concurrency stuff into here at this time, there will be data races with the regular ongoing sync processes plus the statediff service needs to operate through the same StateDB object as the BlockChain to utilize it's in-memory cache. This is another reason to maintain the separate repos IMO.

Rather, I think we should port these changes into eth-statediff-service and in doing so utilize the concurrent node iterator used there.

@roysc Btw I don't think we should try to port the concurrency stuff into here at this time, there will be data races with the regular ongoing sync processes plus the statediff service needs to operate through the same StateDB object as the BlockChain to utilize it's in-memory cache. This is another reason to maintain the separate repos IMO. Rather, I think we should port these changes into eth-statediff-service and in doing so utilize the concurrent node iterator used there.
Member

Hey @roysc, tested this locally and it appears to be working!

Using
./build/bin/geth --syncmode=full --gcmode=archive --statediff --statediff.writing --statediff.db=postgres://localhost:5432/vulcanize_testing?sslmode=disable --statediff.dbnodeid=yeee --statediff.dbclientname=yooo

Seeing results in every cid table (including uncles and receipts)

Also working without --gcmode=archive

Although there is this error when shutting it down in this mode: ERROR[11-04|11:06:19.592] Dangling trie nodes after full cleanup , which I think may actually be appropriate in our case since we purposely prevent dereferencing of nodes we have not diffed yet.

Hey @roysc, tested this locally and it appears to be working! Using `./build/bin/geth --syncmode=full --gcmode=archive --statediff --statediff.writing --statediff.db=postgres://localhost:5432/vulcanize_testing?sslmode=disable --statediff.dbnodeid=yeee --statediff.dbclientname=yooo` Seeing results in every cid table (including uncles and receipts) Also working without `--gcmode=archive` Although there is this error when shutting it down in this mode: `ERROR[11-04|11:06:19.592] Dangling trie nodes after full cleanup `, which I think may actually be appropriate in our case since we purposely prevent dereferencing of nodes we have not diffed yet.
i-norden approved these changes 2020-11-04 16:51:33 +00:00
i-norden left a comment
Member

Noting some changes that don't necessarily need to be made in this PR.

Noting some changes that don't necessarily need to be made in this PR.
Member

We should change this to the Debug level

We should change this to the Debug level
Member

We should update all the ported packages to use Geth's log package

We should update all the ported packages to use Geth's log package
i-norden reviewed 2020-11-04 16:52:52 +00:00
Member

Similarly, we should update any tests we ported over to use the default test pkg that the rest of geth uses instead of ginko/gomega

Similarly, we should update any tests we ported over to use the default test pkg that the rest of geth uses instead of ginko/gomega
Author
Member

@i-norden Refactored the logging and tests - (btw. noticed we have the gomega module vendored in, guessing we can remove it).

@i-norden Refactored the logging and tests - (btw. noticed we have the gomega module [vendored in](https://github.com/vulcanize/go-ethereum/tree/statediff_at_anyblock-1.9.11/vendor/github.com/onsi/gomega), guessing we can remove it).
Member

@roysc yes please get rid of the vendor/ dir

@roysc yes please get rid of the vendor/ dir
i-norden approved these changes 2020-11-11 16:25:57 +00:00
i-norden left a comment
Member

LGTM. I think we should go ahead and merge this so we can cut a release and apply the rebase on top. Running it locally things look good and I have confirmed the codehash=>code is being stored correctly. There is a pretty big lag between the head in the Postgres db and the head of the current block import but I'm not surprised considering how quickly it imports chunks of blocks while catching up to head and the limiting resources of my computer.

LGTM. I think we should go ahead and merge this so we can cut a release and apply the rebase on top. Running it locally things look good and I have confirmed the codehash=>code is being stored correctly. There is a pretty big lag between the head in the Postgres db and the head of the current block import but I'm not surprised considering how quickly it imports chunks of blocks while catching up to head and the limiting resources of my computer.
Sign in to join this conversation.
No reviewers
No Milestone
No project
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: cerc-io/go-ethereum#30
No description provided.