Investigate using semantically meaningful UUIDs as primary keys for all of our tables #148

New Issue

i-norden · 2021-11-12T23:14:18Z

i-norden commented

2021-11-12 23:14:18 +00:00

Aka for tables where CID is unique, we can use the CID as the primary key. For tables where CID is not unique we need to be a bit more clever (e.g. for state_cids we can use the CID of the header_cid record we FK reference + the path of the state node).

The primary motivation is that these can serve as universally unique ids, enabling us to painlessly merge databases that were constructed in parallel.

Other pros:

More elegant.
Keys are deterministic.
IDs are constructed client side, offloading some work from Postgres which is becoming the bottleneck. This also makes it easier to batch relationally complex writes as we don't need to map relations from a returned ID result set.
Since serial primary keys are stored physically next to each-other, we might experience more contention as many different services are writing to the same database page. Whereas these CIDs/CID-derived keys are highly entropic (while still being deterministic) and so there could be less contention by spreading the writes all over the DB.

Cons:

Will result in mild increase in disk usage, since these will be larger than a serial and since Postgres FKs are complete copies of the referenced key not a more compact pointer. But relative to the immense amount of data we are storing it should be negligible and worth the advantages.
On the flip-side of the last "pro" above: serial int keys mean that IDs are laid out in the B-tree index in chronological order (for a single service anyways). So when you insert a new row into a big table, the only index blocks that generally need to be loaded into memory are the ones along the right-hand edge of the B-tree.
Increased code complexity.
Dev hours.

Aka for tables where CID is unique, we can use the CID as the primary key. For tables where CID is not unique we need to be a bit more clever (e.g. for state_cids we can use the CID of the header_cid record we FK reference + the path of the state node). The primary motivation is that these can serve as universally unique ids, enabling us to painlessly merge databases that were constructed in parallel. Other pros: * More elegant. * Keys are deterministic. * IDs are constructed client side, offloading some work from Postgres which is becoming the bottleneck. This also makes it easier to batch relationally complex writes as we don't need to map relations from a returned ID result set. * Since serial primary keys are stored physically next to each-other, we might experience more contention as many different services are writing to the same database page. Whereas these CIDs/CID-derived keys are highly entropic (while still being deterministic) and so there could be less contention by spreading the writes all over the DB. Cons: * Will result in mild increase in disk usage, since these will be larger than a serial and since Postgres FKs are complete copies of the referenced key not a more compact pointer. But relative to the immense amount of data we are storing it should be negligible and worth the advantages. * On the flip-side of the last "pro" above: serial int keys mean that IDs are laid out in the B-tree index in chronological order (for a single service anyways). So when you insert a new row into a big table, the only index blocks that generally need to be loaded into memory are the ones along the right-hand edge of the B-tree. * Increased code complexity. * Dev hours.

i-norden commented

2021-11-12 23:25:08 +00:00

Proposed scheme:

header_cids: use CID
uncles_cids: use CID
transaction_cids: use CID (or alternatively, use referenced header CID + index)
access_list_element: use referenced tx primary key (defined above) + index
receipt_cids: use FK reference to tx primary key (defined above) (mapping is one-to-one so we can use FK as primary)
log_cids: use referenced rct primary key (defined above) + index
state_cids: use referenced header primary key (defined above) + path
state_accounts: use FK reference to state primary key (defined about) (mapping is one-to-one so we can use FK as primary)
storage_cids: use referenced state primary key (defined above) + path

Proposed scheme: header_cids: use CID uncles_cids: use CID transaction_cids: use CID (or alternatively, use referenced header CID + index) access_list_element: use referenced tx primary key (defined above) + index receipt_cids: use FK reference to tx primary key (defined above) (mapping is one-to-one so we can use FK as primary) log_cids: use referenced rct primary key (defined above) + index state_cids: use referenced header primary key (defined above) + path state_accounts: use FK reference to state primary key (defined about) (mapping is one-to-one so we can use FK as primary) storage_cids: use referenced state primary key (defined above) + path

i-norden commented

2021-11-12 23:39:31 +00:00

Moved it here https://github.com/vulcanize/ipld-eth-db/issues/16

Sign in to join this conversation.

No Milestone

No project

No Assignees

1 Participants

Notifications

Due Date

The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: cerc-io/go-ethereum#148