diff --git a/docs/rfc/README.md b/docs/rfc/README.md index 24887ac3f6..d861c91ad9 100644 --- a/docs/rfc/README.md +++ b/docs/rfc/README.md @@ -36,3 +36,5 @@ sections. ## Table of Contents + +* [RFC-002: Zero Copy Encoding](./rfc-002-zero-copy-encoding.md) diff --git a/docs/rfc/rfc-002-zero-copy-encoding.md b/docs/rfc/rfc-002-zero-copy-encoding.md new file mode 100644 index 0000000000..aeef7308de --- /dev/null +++ b/docs/rfc/rfc-002-zero-copy-encoding.md @@ -0,0 +1,406 @@ +# RFC 002: Zero-Copy Encoding + +## Changelog + +* 2022-03-08: Initial draft + +## Background + +When the SDK originally migrated to [protobuf state encoding](./../architecture/adr-019-protobuf-state-encoding.md), +zero-copy encodings such as [Cap'n Proto](https://capnproto.org/) +and [FlatBuffers](https://google.github.io/flatbuffers/) +were considered. We considered how a zero-copy encoding could be beneficial for interoperability with modules +and scripts in other languages and VMs. However, protobuf was still chosen because the maturity of its ecosystem and +tooling was much higher and the client experience and performance were considered the highest priorities. + +In [ADR 033: Protobuf-based Inter-Module Communication](./../architecture/adr-033-protobuf-inter-module-comm.md), the +idea of cross-language/VM inter-module +communication was considered again. And in the discussions surrounding [ADR 054: Semver Compatible SDK Modules](./../architecture/adr-054-semver-compatible-modules.md), +it was determined that multi-language/VM support in the SDK is a near term priority. + +While we could do cross-language/VM inter-module communication with protobuf binary or even JSON, the performance +overhead is deemed to be too high because: +* we are proposing replacing keeper calls with inter-module message calls and the overhead of even the inter-module +routing checks has come into question by some SDK users without even considering the possible overhead of encoding. +Effectively we would be replacing function calls with encoding. One of the SDK's primary objectives currently is +improving performance, and we want to avoid inter-module calls from becoming a big step backward. +* we want Rust code to be able to operate in highly resource constrained virtual machines so whatever we can do to +reduce performance overhead as well as the size of generated code will make it easier and more feasible to deploy +first-class integrations with these virtual machines. + +Thus, the agreement when the [ADR 054](./../architecture/adr-054-semver-compatible-modules.md) working group concluded +was to pursue a performant zero-copy encoding which is suitable for usage in highly resource constrained environments. + +## Proposal + +This RFC proposes a zero-copy encoding that is derived from the schema definitions defined in .proto files in the SDK +and all app chains. This would result in a new code generator for that supports both this zero-copy encoding as well as +the existing protobuf binary and JSON encodings as well as the google.golang.org/protobuf API. To make this zero-copy +encoding work, a number of changes are needed to how we manage the versioning of protobuf messages that should +address other concerns raised in [ADR 054](./../architecture/adr-054-semver-compatible-modules.md). The API for using +protobuf in golang would also change and this will be described in the [code generation](#generated-code) section +along with a proposed Rust code generator. + +An alternative approach to building a zero-copy encoding based on protobuf schemas would be to switch to FlatBuffers +or Cap'n Proto directly. However, this would require a complete rewrite of the SDK and all app chains. Places this +burden on the ecosystem would not be a wise choice when creating a zero-copy encoding compatible with all our +existing types and schemas is feasible. In the future, we may consider a native schema language for this encoding +that is more natural and succinct for its rules, but for now we are assuming that it is best to continue supporting +the existing protobuf based workflow. + +Also, we are not proposing a new encoding for transactions or gRPC query servers. From a client API perspective nothing +would change. The SDK would be capable of marshaling any message to and from protobuf binary and this zero-copy encoding +as needed. + +Furthermore, migrating to the **new golang generated code would be 100% opt-in** because the inter-module router will +simply marshal existing gogo proto generated types to/from the zero-copy encoding when needed. So migrating to the new +code generator would provide a performance benefit, but would not be required. + +In addition to supporting first-class Cosmos SDK modules defined in other languages and VMs, this encoding is intended +to be useful for user-defined code executing in a VM. To satisfy this, this encoding is designed to enable proper bounds +checking on all memory access at the expense of introducing some error return values in generated code. + +### New Protobuf Linting and Breaking Change Rules + +This zero-copy encoding places some additional requirements on the definition and maintenance of protobuf schemas. + +#### No New Fields Can Be Added To Existing Messages + +The biggest change is that it will be invalid to add a new field to an existing message and a breaking change detector +will need to be created which augments [buf breaking](https://docs.buf.build/breaking/overview) to detect this. + +The reasons for this are two-fold: + +1) from an API compatibility perspective, adding a new field to an existing message is actually a state machine breaking + change which in [ADR 020](../architecture/adr-020-protobuf-transaction-encoding.md) required us to add an unknown + field detector. Furthermore, in [ADR 054](../architecture/adr-054-semver-compatible-modules.md) this "feature" of protobuf + poses one of the biggest problems for correct forward compatibility between different versions of the same module. +2) not allowing new fields in existing messages makes the generated code in languages like Rust (which is currently our + highest priority target), much simpler and more performant because we can assume a fixed size struct gets allocated. + If new fields can be added to existing messages, we need to encode the number of fields into the message and then + do runtime checks. So this both increases memory layers and requires another layout of indirection. With the encoding + proposed below, "plain old Rust structs" (used with some special field types) can be used. + +Instead of adding new fields to existing messages, APIs can add new messages to existing packages or create new packages +with new versions of the messages. Also, we are not restricting the addition of cases to `oneof`s or values to `enum`s. +All of these cases are easier to detect at runtime with standard `switch` statements than the addition of new fields. + +#### Additional Linting Rules + +The following additional rules will be enforced by a linter that +complements [buf lint](https://docs.buf.build/lint/overview): + +* all message fields must be specified in continuous ascending order starting from `1` +* all enums must be specified in continuous ascending order starting from `0` - otherwise it is too complex to check at + runtime whether an enum value is unknown. An alternative would be to make adding new values to existing enums breaking +* all enum values must be `<= 255`. Any enum in a blockchain application which needs more than 256 values is probably + doing something very wrong. +* all oneof's must be the *only* element in their containing message and must start at field number `1` and be added in + continuous ascending order - this makes it possible to quickly check for unknown values +* all `oneof` field numbers must be `<= 255`. Any `oneof` which needs more field cases is probably doing something very + wrong. + +These requirements make the encoding and generated code simpler. + +### Encoding + +#### Buffers and Memory Management + +By default, this encoding attempts to use a single fixed size encoding buffer of 64kb. This imposes a limit on the +maximum size of a message that can be encoded. In the context of a message passing protocol for blockchains, this +is generally a reasonable limit and the only known valid use case for exceeding it is to store user-uploaded byte +code for execution in VMs. To accommodate this, large `string` and `bytes` values can be encoded in additional +standalone buffers if needed. Still, the body of a message included all scalar and message fields +must fit inside the 64kb buffer. + +While this design decision greatly simplifies the encoding and decoding logic, as well as the complexity of +generated code, it does mean that APIs will need to do proper bounds checking when writing data that is not fixed +size and return errors. + +The term `Root` is used to refer to the main 64kb buffer plus any additional large `string`/`bytes` buffers that are +allocated. + +#### Scalar Encoding + +* `bool`s are encoded as 1 byte - `0` or `1` +* `uint32`, `int32`, `sint32`, `fixed32`, `sfixed32` are encoded as 4 bytes by default +* `uint64`, `int64`, `sint64`, `fixed64`, `sfixed64` are encoded as 8 bytes by default +* `enum`s are encoded as 1 byte and values *MUST* be in the range of `0` to `255`. +* all scalars declared as `optional` are prefixed with 1 additional byte whose value is `0` or `1` to indicate presence + +All multibyte integers are encoded as little-endian which is by far the most common native byte order for modern +CPUs. Signed integers always use two's complement encoding. + +#### Message Encoding + +By default, messages field are encoded inline as structs. Meaning that if a message struct takes 8 bytes then its inline +field in another struct will add 8 bytes to that struct size. + +`optional` message fields will be prefixed by 1 byte to indicate presence. (Alternatively, we could encode optional +message fields as pointers (see below) if the desire is to save memory when they are rarely used needed.) + +#### Oneof’s + +`oneof`s are encoded as a combination of a `uint8` discriminant field and memory that is as large as the largest member +field. `oneof` field numbers *MUST* be between `1` and `255`. + +```protobuf +message Foo { + oneof sum { + bool x = 1; + int32 y = 2; + } +} +``` + +A discriminant of `0` indicates that the field is not set. + +#### Pointer Types: Bytes and Strings and Repeated fields + +A pointer is an 16-bit unsigned integer that points to an offset in the current memory buffer or to another memory +buffer. If the bit mask `0xFF00` on the is unset, then the pointer points to an offset in the main 64kb memory buffer. +If that bit mask is set, then the pointer points to a large `string` or `bytes` buffer. Up to 256 such buffers +can be referenced in a single `Root`. The pointer `0` indicates that a field is not defined. + +`bytes`, `string`s and repeated fields are encoded as pointers to a memory location that is prefixed with the +length of the `bytes`, `string` or repeated field value. If the referenced memory location is in the main 64kb memory +buffer, then this length prefix will be a 16-bit unsigned integer. If the referenced memory location is a large +`string` or `bytes` buffer, then this length prefix will be a 32-bit unsigned integer. + +#### `Any`s + +`Any`s are encoded as a pointer to the type URL string and a pointer to the start of the message +specified by the type URL. + +#### Maps + +Maps are not supported. + +#### Extended Encoding Options + +We may choose to allow customizing the encoding of fields so that they take up less space. + +For example, we could allow 8-bit or 16-bit integers: +`int32 x = 1 [(cosmos_proto.int16) = true]` would indicate that the field only needs 2 bytes + +Or we could allow `string`, `bytes` or `repeated` fields to have a fixed size rather than being encoding as +pointers to a variable-length value: +`string y = 2 [(cosmos_proto.fixed_size) = 3]` could indicate that this is a fixed width 3 byte string + +If we choose to enable these encoding options, changing these options would be a breaking change that needs to be +prevented by the breaking change detector. + +### Generated Code + +We will describe the generated Go and Rust code using this example protobuf file: + +```protobuf +message Foo { + int32 x = 1; + optional uint32 y = 2; + string z = 3; + Bar bar = 4; + repeated Bar bars = 5; +} + + +message Bar { + ABC abc = 1; + Baz baz = 2; + repeated uint32 xs = 3; +} + +message Baz { + oneof sum { + uint32 x = 1; + string y = 2; + } +} + +enum ABC { + A = 0; + B = 1; + C = 2; + D = 3; +} +``` + +#### Go + +In golang, the generated code would not expose any exported struct fields, but rather getters and setters as an +interface +or struct methods, ex: + +```go +type Foo interface { + X() int32 + SetX(int32) + Y() zpb.Option[uint32] + SetY(zpb.Option[uint32]) + Z() (string, error) + SetZ(string) error + Bar() Bar + Bars() (zpb.Array[Bar], error) +} + +type Bar interface { + Abc() ABC + SetAbc(ABC) Bar + Baz() Baz + Xs() (zpb.ScalarArray[uint32], error) +} + +type Baz interface { + Case() Baz_case + GetX() uint32 + SetX(uint32) + GetY() (string, error) + SetY(string) +} + +type Baz_case int32 +const ( + Baz_X Baz_case = 0 + Baz_Y Baz_case = 1 +) + +type ABC int32 +const ( + ABC_A ABC = 0 + ABC_B ABC = 1 + ABC_C ABC = 2 + ABC_D ABC = 3 +) +``` + +Special types `zpb.Option`, `zpb.Array` and `zpb.ScalarArray` are used to represent `optional` and repeated fields +respectively. These types would be included in the runtime library (called `zpb` here for zero-copy protobuf) and would +have an API like this: + +```go +type Option[T] interface { + IsSet() bool + Value() T +} + +type Array[T] interface { + InitWithLength(int) error + Len() int + Get(int) T +} + +type ScalarArray[T] interface { + Array[T] + Set(int, T) +} +``` + +Arrays in particular would not be resizable, but would be initialized with a fixed length. This is to ensure that arrays +can be written to the underlying buffer in a linear way. + +In golang, buffers would be managed transparently under the hood by the first message initialized, and usage of this +generated code might look like this: + +```go +foo := NewFoo() +foo.SetX(1) +foo.SetY(zpb.NewOption[uint32](2)) +err := foo.SetZ("hello") +if err != nil { + panic(err) +} + +bar := foo.Bar() +bar.Baz().SetX(3) + +xs, err = bar.Xs() +if err != nil { + panic(err) +} +xs.InitWithLength(2) +xs.Set(0, 0) +xs.Set(1, 2) + +bars, err = foo.Bars() +if err != nil { + panic(err) +} +bars.InitWithLength(3) +bars.Get(0).Baz().SetY("hello") +bars.Get(1).SetAbc(ABC_B) +bars.Get(2).Baz().SetX(4) +``` + +Under the hood the generated code would manage memory buffers on its own. The usage of `oneof`s is a bit easier than +the existing go generated code (as with `bar.Baz()` above). And rather than using setters on embedded messages, we +simply get the field (already allocated) and set its fields (as in the case of `foo.Bar()` above or the repeated +field `foo.Bars()`). Whenever a field is stored with a pointer (`string`, `bytes`, and `repeated` fields), there is +always an error returned on the getter to do proper bounds checking on the buffer. + +#### Rust + +This encoding should allow generating native structs in Rust that are annotated with `#[repr(C, align(1))]`. It should +be fairly natural to use from Rust with a key difference that memory buffers (called `Root`s) must be manually allocated +and passed into any pointer type. + +Here is some example code that uses library types `Option`, `Enum`, `String`, `OneOf` and `Repeated` +as well as little-endian integer types from [rend](https://lib.rs/crates/rend): + +```rust! +#[repr(C, align(1))] +struct Foo { + x: rend:i32_le, + y: cosmos_proto::Option, + z: cosmos_proto::String, // String wraps a pointer to a string + bar: Bar +} + +#[repr(C, align(1))] +struct Bar { + abc: cosmos_proto::Enum, // the Enum wrapper allows us to distinguish undefined and defined values of ABC at runtime. 3 is specified as the max value of ABC. + baz: cosmos_proto::OneOf, // the OneOf wrapper allows distinguished undefined values of Baz at runtime. 2 is specified as the max field value of Baz. + xs: cosmos_proto::Repeated // Repeated wraps a pointer to repeated fields +} + +#[repr(u8)] +enum ABC { + A = 0, + B = 1, + C = 2, + D = 3, +} + +#[repr(C, u8)] +enum Baz { + Empty, // all oneof's have a case for Empty if they are unset + X(rend::u32_le), + Y(cosmos_proto::String) +} +``` + +Example usage (which does the exact same thing as the go example above) would be: + +```rust! +let mut root = Root::new(); +let mut foo = root.get_mut(); +foo.x = 1.into(); +foo.y = Some(2.into()); +foo.z.set(root.new_string("hello")?); // could return an allocation error + +foo.bar.baz = Baz::X(3.into()); + +foo.bar.xs.init_with_size(&mut root, 2)?; // could return an allocation error +foo.bar.xs[0] = 0.into(); +foo.bar.xs[1] = 2.into(); + +foo.bars.init_with_size(&mut root, 3)?; // could return an allocation error +foo.bars[0].baz = Baz::Y(root.new_string("hello")?); // could return an allocation error +foo.bars[1].abc = ABC::B; +foo.bars[2].baz = Baz::X(4.into()); +``` + +## Abandoned Ideas (Optional) + +## References + +## Discussion