
Lotus

Lotus is an implementation of the Filecoin Distributed Storage Network. A Lotus node syncs blockchains that follow the Filecoin protocol, validating the blocks and state transitions. The specification for the Filecoin protocol can be found here.

For information on how to setup and operate a Lotus node, please follow the instructions here.

Components

At a high level, a Lotus node comprises the following components:

FIXME: No mention of block production here, cross-reference with schomatis's miner doc

  • The Syncer, which manages the process of syncing the blockchain
  • The State Manager, which can compute the state at any given point in the chain
  • The Virtual Machine (VM), which executes messages
  • The Repository, where all data is stored
  • P2P stuff (FIXME missing libp2p listed under other PL dependencies)? allows hello, blocksync, retrieval, storage
  • API / CLI (FIXME missing, in scratchpad)
  • Other Filecoin dependencies (specs actors, proofs, storage, etc., FIXME missing)
  • Is the Builder worth its own component?
  • Other PL dependencies (IPFS, libp2p, IPLD? FIXME, missing)
  • External libraries used by Lotus and other deps (FIXME, missing)

Preliminaries

We discuss some key Filecoin concepts here, aiming to explain them by contrasting them with analogous concepts in other well-known blockchains like Ethereum. We only provide brief descriptions here; elaboration can be found in the spec.

Tipsets

Unlike in Ethereum, a block can have multiple parents in Filecoin. We thus refer to the parent set of a block, instead of a single parent. A tipset is any set of blocks that share the same parent set.

There is no concept of "block difficulty" in Filecoin. Instead, the weight of a tipset is simply the number of blocks in the chain that ends in that tipset. Note that a longer chain can have less weight than a shorter chain with more blocks per tipset.

We also allow for "null" tipsets, which include zero blocks. This allows miners to "skip" a round, and build on top of an imaginary empty tipset if they want to.

We call the heaviest tipset in a chain the "head" of the chain.
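For intuition, here is a minimal sketch of the simplified weight notion described above, using made-up Go types rather than the real Lotus ones (the real Lotus types and weight calculation are richer):

```go
// Illustrative types only; not the actual Lotus TipSet/BlockHeader definitions.
type BlockHeader struct{ Miner string }

type TipSet struct {
	Blocks []BlockHeader // every block in the set has the same parent set
	Parent *TipSet       // nil for the genesis tipset; empty Blocks for a "null" tipset
}

// chainWeight is the simplified weight described above: the number of blocks
// in the chain that ends in ts.
func chainWeight(ts *TipSet) int {
	w := 0
	for cur := ts; cur != nil; cur = cur.Parent {
		w += len(cur.Blocks)
	}
	return w
}
```

Note how, under this definition, a chain with fewer tipsets but more blocks per tipset can outweigh a longer chain of mostly single-block tipsets, which is the point made above.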

Actors and Messages

An Actor is analogous to a smart contract in Ethereum. Filecoin does not allow users to define their own actors, but comes with several builtin actors, which can be thought of as pre-compiled contracts.

A Message is analogous to a transaction in Ethereum.

Sync

Sync refers to the process by which a Lotus node synchronizes to the heaviest chain being advertised by its peers. At a high-level, Lotus syncs in a manner similar to most other blockchains; a Lotus node listens to the various chains its peers claim to be at, picks the heaviest one, requests the blocks in the chosen chain, and validates each block in that chain, running all state transitions along the way.

The majority of the sync functionality happens in the Syncer, internally managed by a SyncManager.

We now discuss the various stages of the sync process.

Sync setup

When a Lotus node connects to a new peer, we exchange the head of our chain with the new peer through the hello protocol. If the peer's head is heavier than ours, we try to sync to it. Note that we do NOT update our chain head at this stage.

Fetching and Persisting Block Headers

Note: The API refers to these stages as StageHeaders and StagePersistHeaders.

We proceed in the sync process by requesting block headers from the peer, moving back from their head, until we reach a tipset that we have in common (such a common tipset must exist, though it may simply be the genesis block). The functionality can be found in Syncer::collectHeaders().

If the common tipset is our head, we treat the sync as a "fast-forward", else we must drop part of our chain to connect to the peer's head (referred to as "forking").
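A sketch of that walk-back, with hypothetical helpers passed in as parameters and an opaque tipset identifier (the real Syncer::collectHeaders() also handles bad blocks, fork limits, and other checks):

```go
// tipsetKey is an opaque, illustrative identifier for a tipset.
type tipsetKey string

// collectHeadersSketch walks back from the peer's head, newest first, until it
// finds a tipset we already know, returning the fetched headers.
func collectHeadersSketch(
	peerHead tipsetKey,
	knownLocally func(tipsetKey) bool, // do we already have this tipset?
	fetchParentSet func(tipsetKey) (tipsetKey, error), // ask the peer for the parent tipset's headers
) ([]tipsetKey, error) {
	var headers []tipsetKey
	cur := peerHead
	for !knownLocally(cur) {
		headers = append(headers, cur)
		parent, err := fetchParentSet(cur)
		if err != nil {
			return nil, err
		}
		cur = parent
	}
	// cur is now the common tipset: if it is our current head this sync is a
	// fast-forward; otherwise we will have to drop part of our chain (a fork).
	return headers, nil
}
```

The loop is guaranteed to terminate because, in the worst case, the common tipset is the genesis block.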

FIXME: This next para might be best replaced with a link to the validation doc Some of the possible causes of failure in this stage include:

  • The chain is linked to a block that we have previously marked as bad, and stored in a BadBlockCache.
  • The beacon entries in a block are inconsistent (FIXME: more details about what is validated here wouldn't be bad).
  • Switching to this new chain would involve a chain reorganization beyond the allowed threshold (SPECK-CHECK).

Fetching and Validating Blocks

Note: The API refers to this stage as StageMessages.

Having acquired the headers and found a common tipset, we then move forward, requesting the full blocks, including the messages.

For each block, we first confirm the syntactic validity of the block (SPECK-CHECK), which includes the syntactic validity of messages included in the block. We then apply the messages, running all the state transitions, and compare the state root we calculate with the provided state root.

FIXME: This next para might be best replaced with a link to the validation doc Some of the possible causes of failure in this stage include:

  • a block is syntactically invalid (including potentially containing syntactically invalid messages)
  • the computed state root after applying the block doesn't match the block's state root
  • FIXME: Check what's covered by syntactic validity, and add anything important that isn't (like proof validity, future checks, etc.)

The core functionality can be found in Syncer::ValidateTipset(), with Syncer::checkBlockMessages() performing syntactic validation of messages.
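In outline, the per-block check of this stage looks roughly like this (illustrative types and helpers, not the actual Syncer::ValidateTipset() code):

```go
// fullBlock is an illustrative stand-in for a fetched block: it declares the
// state root that applying its parent tipset should produce.
type fullBlock struct {
	CID             string
	ParentStateRoot string
}

// validateBlocksSketch mirrors the checks described above: syntactic validity
// first (which covers the block's messages), then agreement between the state
// root the block declares and the one we computed from its parent tipset.
// It reports the CID of the first failing block, if any.
func validateBlocksSketch(
	blocks []fullBlock,
	computedParentState string, // state root we computed by applying the parent tipset
	syntacticallyValid func(fullBlock) bool,
) (badBlock string, ok bool) {
	for _, b := range blocks {
		if !syntacticallyValid(b) {
			return b.CID, false // syntactically invalid block (or invalid messages)
		}
		if b.ParentStateRoot != computedParentState {
			return b.CID, false // computed state root does not match the declared one
		}
	}
	return "", true
}
```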

Setting the head

Note: The API refers to this stage as StageSyncComplete.

If all validations pass, we set that head as our heaviest tipset in the ChainStore. We already have the full state, since we calculated it during the sync process.

FIXME (aayush) I don't fuilly understand the next 2 paragraphs, but it seems important. Confirm and polish. Relevant issue in IPFS: https://github.com/ipfs/ipfs-docs/issues/264

It is important to note at this point that, similar to the IPFS architecture of addressing by content rather than by location/address (FIXME: check and link to IPFS docs), the "actual" chain stored in the node repo is relative to which CID we look up. We always have stored a series of Filecoin blocks pointing to other blocks, each a potential chain in itself when followed through its parent reference, its parent's parent, and so on up to the genesis block. (FIXME: We need a diagram here, one of the Filecoin blog entries might have something similar to what we are describing here.) Which chain we see depends only on where (at which location) we start looking. The only address/location reference we hold to the chain, a relative reference, is the heaviest pointer. This is reflected in the fact that we don't store it in the Blockstore under a fixed, absolute CID derived from its contents, since those contents change each time we sync to a new head (FIXME: link to the immutability IPFS doc that I need to write).

FIXME: Create a further reading appendix, move this next para to it, along with other extraneous content. The head pointer is one of the few items we store in the Datastore by key (that is, by location), allowing its contents to change on every sync. This is reflected in the (*ChainStore) writeHead() function (called by takeHeaviestTipSet() above), where we reference the pointer by the explicit chainHeadKey address (the string "head", not a hash embedded in a CID), and similarly in (*ChainStore).Load() when we start the node and create the ChainStore. Compare this to a Filecoin block or message, which is immutable: once created it never changes, and it is stored in the Blockstore by CID.
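A standalone illustration of that distinction using the go-datastore library (the key name and the value encoding are simplified stand-ins; older go-datastore releases do not take a context.Context in Put/Get):

```go
package main

import (
	"context"
	"fmt"

	"github.com/ipfs/go-datastore"
)

func main() {
	ctx := context.Background()
	ds := datastore.NewMapDatastore() // in-memory stand-in for the repo's Badger datastore

	// The chain head lives under a fixed, location-style key. Its value is
	// rewritten every time we adopt a heavier tipset, unlike blocks and
	// messages, which are stored under their CID and never change.
	headKey := datastore.NewKey("head")

	_ = ds.Put(ctx, headKey, []byte("cids of the current heaviest tipset"))
	// ... later, after syncing to a heavier head:
	_ = ds.Put(ctx, headKey, []byte("cids of the new heaviest tipset"))

	v, _ := ds.Get(ctx, headKey)
	fmt.Printf("%s -> %s\n", headKey, v) // always whatever was last written under the same key
}
```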

Keeping up with the chain

A Lotus node also listens for new blocks broadcast by its peers over the gossipsub channel (see FIXME for more). If we have validated such a block's parent tipset, and adding it to our tipset at its height would lead to a heavier head, then we validate and add this block. The validation described is identical to that invoked during the sync process (indeed, it's the same codepath).
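Sketched with hypothetical helpers, the decision for an incoming gossip block is simply:

```go
// considerGossipBlockSketch: a block received over gossipsub (identified here
// just by its CID) is only validated and added if we have already validated
// its parent tipset and adopting it at its height could yield a heavier head.
// Validation itself is the same codepath used during sync.
func considerGossipBlockSketch(
	blockCID string,
	parentTipsetValidated func(blockCID string) bool,
	leadsToHeavierHead func(blockCID string) bool,
	validateAndAdd func(blockCID string) error,
) error {
	if !parentTipsetValidated(blockCID) || !leadsToHeavierHead(blockCID) {
		return nil // ignore it for now
	}
	return validateAndAdd(blockCID)
}
```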

State

In Filecoin, the chain state at any given point is a collection of data stored under a root CID encapsulated in the StateTree, and accessed through the StateManager. The state at the chain's head is thus easily tracked and updated in a state root CID. (FIXME: Talk about CIDs somewhere; we might want to explain some of the modify/flush/update-root mechanism here.)

Calculating a Tipset State

Recall that a tipset is a set of blocks that have identical parents (that is, that are built on top of the same tipset). The genesis tipset comprises the genesis block(s), and has some state corresponding to it.

The methods TipSetState() and computeTipSetState() in StateManager are responsible for computing the state that results from applying a tipset. This involves applying all the messages included in the tipset, and performing implicit operations like awarding block rewards.

Any valid block built on top of a tipset ts should have its Parent State Root equal to the result of calculating the tipset state of ts. Note that this means that all blocks in a tipset must have the same Parent State Root (which is to be expected, since they have the same parent tipset).

Preparing to apply a tipset

When StateManager::computeTipSetState() is called with a tipset, ts, it retrieves the parent state root of the blocks in ts. It also creates a list of BlockMessages, which wraps the BLS and SecP messages in a block along with the miner that produced the block.

Control then flows to StateManager::ApplyBlocks(), which builds a VM to apply the messages given to it. The VM is initialized with the parent state root of the blocks in ts. We apply the blocks in ts in order (see FIXME for ordering of blocks in a tipset).

Applying a block

For each block, we prepare to apply the ordered messages (first BLS, then SecP). Before applying a message, we check if we have already applied a message with that CID within the scope of this method. If so, we simply skip that message; this is how duplicate messages included in the same tipset are skipped (with only the miner of the "first" block to include the message getting the reward). For the actual process of message application, see FIXME (need an internal link here); for now, we simply assume that the outcome of the VM applying a message is either an error, or a MessageReceipt and some other information.

We treat an error from the VM as a showstopper; there is no recovery, and no meaningful state can be computed for ts. Given a successful receipt, we add the rewards and penalties to what the miner has earned so far. Once all the messages included in a block have been applied (or skipped if they're a duplicate), we use an implicit message to call the Reward Actor. This awards the miner their reward for having won a block, and also awards / penalizes them based on the message rewards and penalties we tracked.

We then proceed to apply the next block in ts, using the same VM. This means that the state changes that result from applying a message are visible when applying all subsequent messages, even if they are included in a different block.
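A sketch of this loop with hypothetical types and helpers (the real code lives in StateManager::ApplyBlocks()); the final Cron Actor call is described in the next section:

```go
// Illustrative types: a message identified by its CID, and a block's messages
// together with the miner that produced the block.
type msgSketch struct{ CID string }

type blockMessagesSketch struct {
	Miner    string
	Messages []msgSketch // BLS messages first, then SecP, already ordered
}

// applyTipSetSketch applies the blocks of a tipset in order with a single VM,
// skipping any message whose CID has already been applied within this tipset.
func applyTipSetSketch(
	blocks []blockMessagesSketch,
	applyMessage func(msgSketch) error, // runs the message through the VM
	rewardMiner func(miner string) error, // implicit message to the Reward Actor
	runCron func() error, // implicit end-of-epoch message to the Cron Actor
) error {
	applied := make(map[string]bool) // message CID -> already applied in this tipset?
	for _, blk := range blocks {
		for _, m := range blk.Messages {
			if applied[m.CID] {
				continue // duplicate: only the first block including it earns the reward
			}
			applied[m.CID] = true
			if err := applyMessage(m); err != nil {
				return err // a VM error is a showstopper: no state for this tipset
			}
		}
		if err := rewardMiner(blk.Miner); err != nil { // block reward plus tracked message rewards/penalties
			return err
		}
	}
	return runCron() // the state after Cron is the computed tipset state
}
```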

Finishing up

Having applied all the blocks, we send one more implicit message, to the Cron Actor, which handles operations that must be performed at the end of every epoch (see FIXME for more). The resulting state after calling the Cron Actor is the computed state of the tipset.

Virtual Machine

The Virtual Machine (VM) is responsible for executing messages. The Lotus Virtual Machine invokes the appropriate methods in the builtin actors, and provides a Runtime interface to the builtin actors that exposes their state, allows them to take certain actions, and meters their gas usage. The VM also performs balance transfers, creates new account actors as needed, and tracks the gas reward, penalty, return value, and exit code.

Applying a Message

The primary entrypoint of the VM is the ApplyMessage() method. This method should not return an error unless something goes unrecoverably wrong.

The first thing this method does is assess if the message provided meets any of the penalty criteria. If so, a penalty is issued, and the method returns. Next, the entire gas cost of the message is transferred to a temporary gas holder account. It is from this gas holder that gas will be deducted; if it runs out of gas, the message fails. Any unused gas in this holder will be refunded to the message's sender at the end of message execution.

The VM then increments the sender's nonce, takes a snapshot of the state, and invokes VM::send().

The send() method creates a Runtime for the subsequent message execution. It then transfers the message's value to the recipient, creating a new account actor if needed.
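The flow can be summarized as follows; the interface below merely names the steps with illustrative methods and types, it is not the real VM API:

```go
// vmStepsSketch names the steps ApplyMessage() walks through, as described above.
type vmStepsSketch interface {
	checkPenalties(msg vmMsgSketch) (penalty int64, penalize bool)
	withholdGas(msg vmMsgSketch) (gasHolder int64) // move the full gas cost to a temporary holder
	incrementNonce(sender string)
	snapshot() int
	revert(snapshot int)
	send(msg vmMsgSketch) (ret []byte, err error) // build a Runtime, transfer value, invoke the method
	finish(gasHolder int64, ret []byte, err error) vmRetSketch // charge gas, refund the rest, set the miner reward
}

type vmMsgSketch struct {
	From, To string
	Value    int64
}

type vmRetSketch struct {
	Return  []byte
	Penalty int64
}

func applyMessageSketch(vm vmStepsSketch, msg vmMsgSketch) vmRetSketch {
	if penalty, bad := vm.checkPenalties(msg); bad {
		return vmRetSketch{Penalty: penalty} // penalty criteria met: penalize and stop
	}
	holder := vm.withholdGas(msg) // gas is deducted from this holder; running out fails the message
	vm.incrementNonce(msg.From)
	snap := vm.snapshot()

	ret, err := vm.send(msg)
	if err != nil {
		vm.revert(snap) // roll back to the snapshot taken before send()
	}
	return vm.finish(holder, ret, err) // unused gas in the holder is refunded to the sender
}
```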

Method Invocation

We use reflection to translate a Filecoin message for the VM to an actual Go function, relying on the VM's invoker structure. Each actor has its own set of method numbers, defined in specs-actors/actors/builtin/methods.go. The invoker structure maps the builtin actors' CIDs to a list of invokeFuncs (one per exported method), each of which takes the Runtime (for state manipulation) and the serialized input parameters.

FIXME (aayush) Polish this next para.

The basic layout (without reflection details) of (*invoker).transform() is as follows. From each actor registered in NewInvoker() we take its Exports() methods and convert them to invokeFuncs. Each actual method is wrapped in another function that takes care of decoding the serialized parameters and the runtime; this wrapper is passed to shimCall(), which runs the actor code inside a deferred recover() so that panics can be caught (actor code signals failure by panicking, to unwind the stack). The return values are then (CBOR) marshaled and returned to the VM.
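A toy version of that wrapping, substituting a generic helper (Go 1.18+) for reflection and JSON for CBOR to keep it short; all names here are illustrative, not the real invoker API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runtimeSketch stands in for the Runtime handed to actor code.
type runtimeSketch struct{}

// invokeFuncSketch is what the invoker maps each exported actor method to.
type invokeFuncSketch func(rt *runtimeSketch, params []byte) ([]byte, error)

// shimCallSketch runs actor code under a deferred recover(): actor code
// signals failure by panicking, and the panic is turned into an error here.
func shimCallSketch(f func() []byte) (ret []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("actor code failed: %v", r)
		}
	}()
	return f(), nil
}

// wrapMethodSketch turns a strongly typed actor method into an
// invokeFuncSketch by decoding the serialized parameters before the call.
func wrapMethodSketch[P any](method func(rt *runtimeSketch, params P) []byte) invokeFuncSketch {
	return func(rt *runtimeSketch, raw []byte) ([]byte, error) {
		var params P
		if err := json.Unmarshal(raw, &params); err != nil {
			return nil, err
		}
		return shimCallSketch(func() []byte { return method(rt, params) })
	}
}

func main() {
	// Register a hypothetical method and invoke it with serialized parameters.
	type sendParams struct{ To string }
	fn := wrapMethodSketch(func(rt *runtimeSketch, p sendParams) []byte {
		return []byte("sent to " + p.To)
	})
	out, err := fn(&runtimeSketch{}, []byte(`{"To":"f01234"}`))
	fmt.Println(string(out), err)
}
```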

Returning from the VM

Once method invocation is complete (including any subcalls), we return to ApplyMessage(), which receives the serialized response and the ActorError. The sender will be charged the appropriate amount of gas for the returned response, which gets put into the MessageReceipt.

The method then refunds any unused gas to the sender, sets up the gas reward for the miner, and wraps all of this into an ApplyRet, which is returned.
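Roughly, the shape of what ApplyMessage() hands back can be pictured as the struct below; it is an illustrative approximation, and the real ApplyRet carries more information (consult the VM package for its exact definition):

```go
// applyRetSketch approximates what ApplyMessage() returns to its caller.
type applyRetSketch struct {
	Receipt struct {
		ExitCode int64  // zero on success
		Return   []byte // serialized return value from the invoked method
		GasUsed  int64
	}
	ActorErr error // non-nil if the invoked actor code failed
	Penalty  int64 // penalty assessed, if any
	MinerTip int64 // gas reward set aside for the miner
}
```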

Building a Lotus node

When we launch a Lotus node with the command ./lotus daemon (see here for more), the node is created through dependency injection. This relies on reflection, which makes some of the references hard to follow. The node sets up all of the subsystems it needs to run, such as the repository, the network connections, the chain sync service, etc. This setup is orchestrated through calls to the node.Override function. The structure of each call indicates the type of component it will set up (many defined in node/modules/dtypes/), and the function that will provide it. The dependency is implicit in the argument of the provider function.

As an example, consider the modules.ChainStore() function that provides the ChainStore structure. It takes as one of its parameters the ChainBlockstore type, which becomes one of its dependencies. For the node to be built successfully the ChainBlockstore will need to be provided before ChainStore, a requirement that is made explicit in another Override() call that sets the provider of that type as the ChainBlockstore() function.
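Lotus's Override() calls are a thin layer over a reflection-based dependency-injection container (go.uber.org/fx at the time of writing). Below is a minimal standalone example of the same idea, with illustrative types standing in for ChainBlockstore and ChainStore:

```go
package main

import (
	"fmt"

	"go.uber.org/fx"
)

type blockstoreSketch struct{}
type chainStoreSketch struct{ bs *blockstoreSketch }

func newBlockstoreSketch() *blockstoreSketch { return &blockstoreSketch{} }

// The parameter is the dependency: the container must construct a
// *blockstoreSketch before it can construct a *chainStoreSketch, just as
// ChainBlockstore must be provided before ChainStore.
func newChainStoreSketch(bs *blockstoreSketch) *chainStoreSketch {
	return &chainStoreSketch{bs: bs}
}

func main() {
	var cs *chainStoreSketch
	app := fx.New(
		fx.Provide(newBlockstoreSketch, newChainStoreSketch),
		fx.Populate(&cs), // forces construction of the chain store and, transitively, its dependencies
	)
	if err := app.Err(); err != nil {
		panic(err)
	}
	fmt.Printf("built %T from its dependencies\n", cs)
}
```

The registration order of the providers does not matter: the container works out the construction order from the provider signatures, which is exactly the property the Override() calls rely on.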

The Repository

The repo is the directory where all of a node's information is stored. The node is entirely defined by its repo, which makes it easy to port to another location. This one-to-one relationship means we can speak of the node as the repo it is associated with, instead of the daemon process that runs from that repo.

Only one daemon can be running against a given repo at a time. A process signals that it is running a node associated with a particular repo by creating and acquiring a repo.lock.

lsof ~/.lotus/repo.lock
# COMMAND   PID
# lotus   52356

Trying to launch a second daemon hooked to the same repo leads to a repo is already locked (lotus daemon already running) error.

The node.Repo() function (node/builder.go) contains most of the dependencies (specified as Override() calls) needed to properly set up the node's repo. We list the most salient ones here.

Datastore

Datastore and ChainBlockstore: Data related to the node state is saved in the repo's Datastore, an IPFS interface defined here. Lotus creates this interface from a Badger DB in FsRepo. Every piece of data is fundamentally a key-value pair in the datastore directory of the repo. There are several abstractions laid on top of it that appear throughout the code depending on how we access it, but it is important to remember that we're always accessing it from the same place.

FIXME: Maybe mention the Batching interface as the developer will stumble upon it before reaching the Datastore one.

Blocks

FIXME: IPFS blocks vs Filecoin blocks ideally happens before this / here

The Blockstore interface structures the key-value pair into the CID format for the key and the Block interface for the value. The Block value is just a raw string of bytes addressed by its hash, which is included in the CID key.

ChainBlockstore creates a Blockstore in the repo under the /blocks namespace. Every key stored there will have the blocks prefix so that it does not collide with other stores that use the same repo.
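A small, self-contained example of these layers using the go-datastore and go-block-format libraries (the key derivation is simplified: the real Blockstore keys entries by the CID's multihash, and older go-datastore releases do not take a context.Context):

```go
package main

import (
	"context"
	"fmt"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-datastore"
	"github.com/ipfs/go-datastore/namespace"
)

func main() {
	ctx := context.Background()

	// In-memory stand-in for the repo's Badger datastore.
	base := datastore.NewMapDatastore()

	// Everything written through this wrapper is prefixed with /blocks, so it
	// cannot collide with other users of the same repo (e.g. /metadata).
	blockDS := namespace.Wrap(base, datastore.NewKey("/blocks"))

	// An IPFS block is just raw bytes; its CID is derived from their hash.
	blk := blocks.NewBlock([]byte("some chain data"))

	// Simplified: key by the CID string rather than its multihash.
	key := datastore.NewKey(blk.Cid().String())
	if err := blockDS.Put(ctx, key, blk.RawData()); err != nil {
		panic(err)
	}

	// Reading through the underlying datastore shows the namespaced key.
	v, _ := base.Get(ctx, datastore.NewKey("/blocks/"+blk.Cid().String()))
	fmt.Printf("/blocks/%s -> %d bytes\n", blk.Cid(), len(v))
}
```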

FIXME: Link to IPFS documentation about DAG, CID, and related, especially we need a diagram that shows how do we wrap each datastore inside the next layer (datastore, batching, block store, gc, etc).

Metadata

modules.Datastore() creates a dtypes.MetadataDS, which is an alias for the basic Datastore interface. Metadata is stored here under the /metadata prefix. (FIXME: Explain what is metadata in contrast with the block store, namely we store the pointer to the heaviest chain, we might just link to that unwritten section here later.)

FIXME: Explain the key store related calls (maybe remove, per Schomatis)

LockedRepo

LockedRepo(): This method doesn't create or initialize any new structures, but rather registers an OnStop hook that will close the locked repository associated with it on shutdown.

Repo types / Node types

FIXME: This section needs to be clarified / corrected...I don't fully understand the config differences (what do they have in common, if anything?)

At the end of the Repo() function we see two mutually exclusive configuration calls based on the RepoType (node/repo/fsrepo.go).

			ApplyIf(isType(repo.FullNode), ConfigFullNode(c)),
			ApplyIf(isType(repo.StorageMiner), ConfigStorageMiner(c)),

As we said, the repo fully identifies the node, so a repo type is also a node type, in this case a full node or a miner. (FIXME: What is the difference between the two, does full imply miner?) Here the daemon command will create a FullNode; this is specified in the command logic itself in main.DaemonCmd(), and the FsRepo created (and passed to node.Repo()) will be initialized with that type (see (*FsRepo).Init(t RepoType)).

Online

FIXME: Much of this might need to be subsumed into the p2p section

The node.Online() configuration function (node/builder.go) initializes components that involve connecting to, or interacting with, the Filecoin network. These connections are managed through the libp2p stack (FIXME link to this section when it exists). We discuss some of the components found in the full node type (that is, those included in the ApplyIf(isType(repo.FullNode), ...) call).

Chainstore

modules.ChainStore() creates the store.ChainStore that wraps the stores previously instantiated in Repo(). It is the main point of entry for the node to all chain-related data (FIXME: this is incorrect, we sometimes access its underlying block store directly, and probably shouldn't). It also holds the crucial heaviest pointer, which indicates the current head of the chain.

ChainExchange and ChainBlockservice

ChainExchange() and ChainBlockservice() establish a BitSwap connection (FIXME libp2p link) to exchange chain information in the form of blocks.Blocks stored in the repo. (See the sync section for more details; the Filecoin blocks and messages are backed by these raw IPFS blocks, which together form the different structures that define the state of the current/heaviest chain.)

Incoming handlers

HandleIncomingBlocks() and HandleIncomingMessages() start the services in charge of processing new Filecoin blocks and messages from the network (see <undefined> for more information about the topics the node is subscribed to, FIXME: should that be part of the libp2p section or should we expand on gossipsub separately?).

Hello

RunHello(): starts the services to both send ((*Service).SayHello()) and receive ((*Service).HandleStream(), node/hello/hello.go) hello messages. When nodes establish a new connection with each other, they exchange these messages to share chain-related information (namely their genesis block and their heaviest tipset).
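The information exchanged is small; the illustrative struct below sketches its shape (consult node/hello/hello.go for the actual HelloMessage definition, which uses CIDs and big-integer weights rather than the simplified types here):

```go
// helloMessageSketch shows the chain-related information peers exchange when
// they connect.
type helloMessageSketch struct {
	HeaviestTipSetCids   []string // the blocks making up our current head
	HeaviestTipSetHeight int64    // its epoch
	HeaviestTipSetWeight string   // its weight, as a big integer
	GenesisHash          string   // our genesis block; peers on a different network are ignored
}
```

If the advertised weight exceeds our current head's, the advertised head is handed to the sync machinery, as described in the Sync setup section above.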

Syncer

NewSyncer() creates the Syncer structure and starts the services related to the chain sync process (FIXME link).

Ordering the dependencies

We can establish the dependency relations by looking at the parameters that each function needs and by understanding the architecture of the node and how the different components relate to each other (the chief purpose of this document).

As an example, the sync mechanism depends on the node being able to exchange different IPFS blocks with the network, so as to be able to request the "missing pieces" needed to construct the chain. This dependency is reflected by NewSyncer() having a blocksync.BlockSync parameter, which in turn depends on ChainBlockservice() and ChainExchange(). The chain exchange service further depends on the chain store to save and retrieve chain data, which is reflected in ChainExchange() having ChainGCBlockstore as a parameter (which is just a wrapper around ChainBlockstore capable of garbage collection).

This block store is the same store underlying the chain store, which is an indirect dependency of NewSyncer() (through the StateManager). (FIXME: This last line is flaky, we need to resolve the hierarchy better, we sometimes refer to the chain store and sometimes to its underlying block store. We need a diagram to visualize all the different components just mentioned otherwise it is too hard to follow. We probably even need to skip some of the connections mentioned.)