Update to new assembly specification.

2023-10-03 13:03:40 +00:00 · 2017-01-03 15:19:14 +01:00 · 2017-01-03 15:19:14 +01:00 · 9683cfea6d
commit 9683cfea6d
parent 49ac2a1ee5
1 changed files with 378 additions and 28 deletions
--- a/docs/assembly.rst
+++ b/docs/assembly.rst
@ -27,6 +27,12 @@ arising when writing manual assembly by the following features:
 * assembly-local variables: ``let x := add(2, 3)  let y := mload(0x40)  x := add(x, y)``
 * access to external variables: ``function f(uint x) { assembly { x := sub(x, 1) } }``
 * labels: ``let x := 10  repeat: x := sub(x, 1) jumpi(repeat, eq(x, 0))``
+* loops: ``for { let i := 0 } lt(i, x) { i := add(i, 1) } { y := mul(2, y) }``
+* switch statements: ``switch x case 0: { y := mul(x, 2) } default: { y := 0 }``
+* function calls: ``function f(x) -> (y) { switch x case 0: { y := 1 } default: y := mul(x, f(sub(x, 1))) }``
+
+.. note::
+    Of the above, loops, function calls and switch statements are not yet implemented.

 We now want to describe the inline assembly language in detail.

@ -91,40 +97,165 @@ you really know what you are doing.
 Standalone Assembly
 ===================

-Grammar
-------
+This assembly language tries to achieve several goals:

-The assembly lexer follows the one defined by Solidity itself.
+1. Programs written in it should be readable, even if the code is generated by a compiler from Solidity.
+2. The translation from assembly to bytecode should contain as few "surprises" as possible.
+3. Control flow should be easy to detect to help in formal verification and optimization.

-Whitespace is used to delimit tokens and it consists of the characters
-Space, Tab and Linefeed. Comments as defined below, are interpreted in the
-same way as Whitespace.
-Furthermore, the following tokens exist:
+In order to achieve the first and last goal, assembly provides high-level constructs
+like ``for`` loops, ``switch`` statements and function calls. It should be possible
+to write assembly programs that do not make use of explicit ``SWAP``, ``DUP``,
+``JUMP`` and ``JUMPI`` statements, because the first two obfuscate the data flow
+and the last two obfuscate control flow. Furthermore, functional statements of
+the form ``mul(add(x, y), 7)`` are preferred over pure opcode statements like
+``7 x y add mul`` because in the first form, it is much easier to see which
+operand is used for which opcode.

-TODO: escapes inside strings, decimal literals, hex literals, hex string literals
+The second goal is achieved by introducing a desugaring phase that only removes
+the higher level constructs in a very regular way and still allows inspecting
+the generated low-level assembly code. The only non-local operation performed
+by the assembler is name lookup of user-defined identifiers (functions, variables, ...),
+which follow very simple and regular scoping rules and cleanup of local variables from the stack.

-``OneLineComment := "//" [^\n]*`
-``MultiLineComment := "/*" .*? "*/"`` 
+Scoping: An identifier that is declared (label, variable, function, assembly)
+is only visible in the block where it was declared (including nested blocks
+inside the current block). It is not legal to access local variables across
+function borders, even if they would be in scope. Shadowing is allowed, but
+two identifiers with the same name cannot be declared in the same block.
+Local variables cannot be accessed before they were declared, but labels,
+functions and assemblies can. Assemblies are special blocks that are used
+for e.g. returning runtime code or creating contracts. No identifier from an
+outer assembly is visible in a sub-assembly.

-``String := '"' [^"]* '"' | "'" [^']* "'"`` 
-``Identifier := [_$a-zA-Z][_$a-zA-Z0-9]*``
-``Opcodes :=
-"add" | "addmod" | "address" | "and" | "balance" | "blockhash" | "byte" | "call" |
-"callcode" | "calldatacopy" | "calldataload" | "calldatasize" | "caller" | "callvalue" |
-"codecopy" | "codesize" | "coinbase" | "create" | "delegatecall" | "difficulty" |
-"div" | "dup1" | "dup2" | "dup3" | "dup4" | "dup5" | "dup6" | "dup7" | "dup8" | "dup9" |
-"dup10" | "dup11" | "dup12" | "dup13" | "dup14" | "dup15" | "dup16" | "eq" | "exp" |
-"extcodecopy" | "extcodesize" | "gas" | "gaslimit" | "gasprice" | "gt" | "iszero" |
-"jump" | "jumpi" | "log0" | "log1" | "log2" | "log3" | "log4" | "lt" | "mload" | "mod" |
-"msize" | "mstore" | "mstore8" | "mul" | "mulmod" | "not" | "number" | "or" | "origin" |
-"pc" | "pop" | "return" | "sdiv" | "selfdestruct" | "sgt" | "sha3" | "signextend" |
-"sload" | "slt" | "smod" | "sstore" | "stop" | "sub" | "swap1" | "swap2" | "swap3" |
-"swap4" | "swap5" | "swap6" | "swap7" | "swap8" | "swap9" | "swap10" | "swap11" |
-"swap12" | "swap13" | "swap14" | "swap15" | "swap16" | "timestamp" | "xor"``
+If control flow passes over the end of a block, pop instructions are inserted
+that match the number of local variables declared in that block, unless the
+``}`` is directly preceded by an opcode that does not have a continuing control
+flow path. The stack height is reduced by the number of local variables
+regardless of that. This mean that labels in the next block will have the
+same height as before the block that just ended.

-TODO: Define functional instruction, label, assignment, functional assignment,
-variable declaration, ...
+If at the end of a block, the stack is not balanced, a warning is issued,
+unless the last instruction in the block did not have a continuing control flow path.

+Why do we use higher-level constructs like ``switch``, ``for`` and functions:
+
+Using ``switch``, ``for`` and functions, it should be possible to write
+complex code without using ``jump`` or ``jumpi`` manually. This makes it much
+easier to analyze the control flow, which allows for improved formal
+verification and optimization.
+
+Furthermore, if manual jumps are allowed, computing the stack height is rather complicated.
+The position of all local variables on the stack needs to be known, otherwise
+neither references to local variables nor removing local variables automatically
+from the stack at the end of a block will work properly. Because of that,
+every label that is preceded by an instruction that ends or diverts control flow
+should be annotated with the current stack layout. This annotation is performed
+automatically during the desugaring phase.
+
+Example:
+
+We will follow an example compilation from Solidity to desugared assembly.
+We consider the runtime bytecode of the following Solidity program::
+
+    contract C {
+      function f(uint x) returns (uint y) {
+        y = 1
+        for (uint i = 0; i < x; i++)
+          y = 2 * y;
+      }
+    }
+
+The following assembly will be generated::
+
+    {
+      mstore(0x40, 0x60) // store the "free memory pointer"
+      // function dispatcher
+      switch div(calldataload(0), exp(2, 226))
+        case 0xb3de648b: {
+          let (r,) = f(calldataload(4))
+          let ret := $allocate(0x20)
+          mstore(ret, r)
+          return(ret, 0x20)
+        }
+        default: { jump(invalidJumpLabel) }
+      // memory allocator
+      function $allocate(size) -> (pos) {
+        pos := mload(0x40)
+        mstore(0x40, add(pos, size))
+      }
+      // the contract function
+      function f(x) -> (y) {
+        y := 1
+        for { let i := 0 } lt(i, x) { i := add(i, 1) } {
+          y := mul(2, y)
+        }
+      }
+    }
+
+After the desugaring phase it looks as follows::
+
+    {
+      mstore(0x40, 0x60)
+      {
+        let $0 := div(calldataload(0), exp(2, 226))
+        jumpi($case1, eq($0, 0xb3de648b))
+        jump($caseDefault)
+        $case1:
+        {
+          // the function call - we put return label and arguments on the stack
+          $ret1 calldataload(4) jump($fun_f)
+          $ret1 [r]: // a label with a [...]-annotation resets the stack height
+                    // to "current block + number of local variables". It also
+                    // introduces a variable, r:
+                    // r is at top of stack, $0 is below (from enclosing block)
+          $ret2 0x20 jump($fun_allocate)
+          $ret2 [ret]: // stack here: $0, r, ret (top)
+          mstore(ret, r)
+          return(ret, 0x20)
+          // although it is useless, the jump is automatically inserted,
+          // since the desugaring process does not analyze control-flow
+          jump($endswitch)
+        }
+        $caseDefault:
+        {
+          jump(invalidJumpLabel)
+          jump($endswitch)
+        }
+        $endswitch:
+      }
+      jump($afterFunction)
+      $fun_allocate:
+      {
+        $start[$retpos, size]:
+        let pos := 0
+        {
+          pos := mload(0x40)
+          mstore(0x40, add(pos, size))
+        }
+        swap1 pop swap1 jump
+      }
+      $fun_f:
+      {
+        start [$retpos, x]:
+        let y := 0
+        {
+          let i := 0
+          $for_begin:
+          jumpi($for_end, iszero(lt(i, x)))
+          {
+            y := mul(2, y)
+          }
+          $for_continue:
+          { i := add(i, 1) }
+          jump($for_begin)
+          $for_end:
+        } // Here, a pop instruction is inserted for i
+        swap1 pop swap1 jump
+      }
+      $afterFunction:
+      stop
+    }

 Syntax
 ------
@ -159,6 +290,8 @@ In the following, ``mem[a...b)`` signifies the bytes of memory starting at posit

 The opcodes ``pushi`` and ``jumpdest`` cannot be used directly.

+In the grammar, opcodes are represented as pre-defined identifiers.
+
 +-------------------------+------+-----------------------------------------------------------------+
 | stop                    + `-`  | stop execution, identical to return(0,0)                        |
 +-------------------------+------+-----------------------------------------------------------------+
@ -508,3 +641,220 @@ first slot of the array and then only the array elements follow.
    Statically-sized memory arrays do not have a length field, but it will be added soon
    to allow better convertibility between statically- and dynamically-sized arrays, so
    please do not rely on that.
+
+
+Specification
+=============
+
+Assembly happens in four stages:
+
+1. Parsing
+2. Desugaring (removes switch, for and functions)
+3. Opcode stream generation
+4. Bytecode generation
+
+
+Parsing / Grammar
+-----------------
+
+The tasks of the parser are the following:
+
+- Turn the byte stream into a token stream, discarding C++-style comments
+  (a special comment exists for source references, but we will not explain it here).
+- Turn the token stream into an AST according to the grammar below
+- Register identifiers with the block they are defined in (annotation to the
+  AST node) and note from which point on, variables can be accessed.
+
+The assembly lexer follows the one defined by Solidity itself.
+
+Whitespace is used to delimit tokens and it consists of the characters
+Space, Tab and Linefeed. Comments are regular JavaScript/C++ comments and
+are interpreted in the same way as Whitespace.
+
+Grammar::
+
+    AssemblyBlock = '{' AssemblyItem* '}'
+    AssemblyItem =
+        Identifier |
+        AssemblyBlock |
+        FunctionalAssemblyExpression |
+        AssemblyLocalDefinition |
+        FunctionalAssemblyAssignment |
+        AssemblyAssignment |
+        LabelDefinition |
+        AssemblySwitch |
+        AssemblyFunctionDefinition |
+        AssemblyFor |
+        'break' | 'continue' |
+        SubAssembly | 'dataSize' '(' Identifier ')' |
+        LinkerSymbol |
+        'errorLabel' | 'bytecodeSize' |
+        NumberLiteral | StringLiteral | HexLiteral
+    Identifier = [a-zA-Z_$] [a-zA-Z_0-9]*
+    FunctionalAssemblyExpression = Identifier '(' ( AssemblyItem ( ',' AssemblyItem )* )? ')'
+    AssemblyLocalDefinition = 'let' IdentifierOrList ':=' FunctionalAssemblyExpression
+    FunctionalAssemblyAssignment = IdentifierOrList ':=' FunctionalAssemblyExpression
+    IdentifierOrList = Identifier | '(' IdentifierList ')'
+    IdentifierList = Identifier ( ',' Identifier)*
+    AssemblyAssignment = '=:' Identifier
+    LabelDefinition = Identifier ( '[' ( IdentifierList | NumberLiteral ) ']' )? ':'
+    AssemblySwitch = 'switch' FunctionalAssemblyExpression AssemblyCase*
+        ( 'default' ':' AssemblyBlock )?
+    AssemblyCase = 'case' FunctionalAssemblyExpression ':' AssemblyBlock
+    AssemblyFunctionDefinition = 'function' Identifier '(' IdentifierList? ')' '->'
+        ( '(' IdentifierList ')' AssemblyBlock
+    AssemblyFor = 'for' ( AssemblyBlock | FunctionalAssemblyExpression)
+        FunctionalAssemblyExpression ( AssemblyBlock | FunctionalAssemblyExpression) AssemblyBlock
+    SubAssembly = 'assembly' Identifier AssemblyBlock
+    LinkerSymbol = 'linkerSymbol' '(' StringLiteral ')'
+    NumberLiteral = HexNumber | DecimalNumber
+    HexLiteral = 'hex' ('"' ([0-9a-fA-F]{2})* '"' | '\'' ([0-9a-fA-F]{2})* '\'')
+    StringLiteral = '"' ([^"\r\n\\] | '\\' .)* '"'
+    HexNumber = '0x' [0-9a-fA-F]+
+    DecimalNumber = [0-9]+
+
+
+Desugaring
+----------
+
+An AST transformation removes for, switch and function constructs. The result
+is still parseable by the same parser, but it will not use certain constructs.
+If jumpdests are added that are only jumped to and not continued at, information
+about the stack content is added, unless no local variables of outer scopes are
+accessed or the stack height is the same as for the previous instruction.
+
+Pseudocode::
+
+    desugar item: AST -> AST =
+    match item {
+    AssemblyFunctionDefinition('function' name '(' arg1, ..., argn ')' '->' ( '(' ret1, ..., retm ')' body) ->
+      <name>:
+      {
+        $<name>_start [$retPC, $argn, ..., arg1]:
+        let ret1 := 0 ... let retm := 0
+        { desugar(body) }
+        swap and pop items so that only ret1, ... retn, $retPC are left on the stack
+        jump 
+      }
+    AssemblyFor('for' { init } condition post body) ->
+      {
+        init // cannot be its own block because we want variable scope to extend into the body
+        // find I such that there are no labels $forI_*
+        $forI_begin:
+        jumpi($forI_end, iszero(condition))
+        { body }
+        $forI_continue:
+        { post }
+        jump($forI_begin)
+        $forI_end:
+      }
+    'break' ->
+      {
+        // find nearest enclosing scope with label $forI_end
+        pop all local variables that are defined at the current point
+        but not at $forI_end
+        jump($forI_end)
+      }
+    'continue' ->
+      {
+        // find nearest enclosing scope with label $forI_continue
+        pop all local variables that are defined at the current point
+        but not at $forI_continue
+        jump($forI_continue)
+      }
+    AssemblySwitch(switch condition cases ( default: defaultBlock )? ) ->
+      {
+        // find I such that there is no $switchI* label or variable
+        let $switchI_value := condition
+        for each of cases match {
+          case val: -> jumpi($switchI_caseJ, eq($switchI_value, val))
+        }
+        if default block present: ->
+          { defaultBlock jump($switchI_end) }
+        for each of cases match {
+          case val: { body } -> $switchI_caseJ: { body jump($switchI_end) }
+        }
+        $switchI_end:
+      }
+    FunctionalAssemblyExpression( identifier(arg1, arg2, ..., argn) ) ->
+      {
+        if identifier is function <name> with n args and m ret values ->
+          {
+            // find I such that $funcallI_* does not exist
+            $funcallI_return argn  ... arg2 arg1 jump(<name>)
+            if the current context is `let (id1, ..., idm) := f(...)` ->
+              $funcallI_return [id1, ..., idm]:
+            else ->
+              $funcallI_return[m - n - 1]:
+              turn the functional expression that leads to the function call
+              into a statement stream
+          }
+        else -> desugar(children of node)
+      }
+    default node ->
+      desugar(children of node)
+    }
+
+Opcode Stream Generation
+------------------------
+
+During opcode stream generation, we keep track of the current stack height,
+so that accessing stack variables by name is possible.
+
+Pseudocode::
+
+    codegen item: AST -> opcode_stream =
+    match item {
+    AssemblyBlock({ items }) ->
+      join(codegen(item) for item in items)
+      if last generated opcode has continuing control flow:
+        POP for all local variables registered at the block (including variables
+        introduced by labels)
+        warn if the stack height at this point is not the same as at the start of the block
+    Identifier(id) ->
+      lookup id in the syntactic stack of blocks
+      match type of id
+        Local Variable ->
+          DUPi where i = 1 + stack_height - stack_height_of_identifier(id)
+        Label ->
+          // reference to be resolved during bytecode generation
+          PUSH<bytecode position of label>
+        SubAssembly ->
+          PUSH<bytecode position of subassembly data>
+    FunctionalAssemblyExpression(id ( arguments ) ) ->
+      join(codegen(arg) for arg in arguments.reversed())
+      id (which has to be an opcode, might be a function name later)
+    AssemblyLocalDefinition(let (id1, ..., idn) := expr) ->
+      register identifiers id1, ..., idn as locals in current block at current stack height
+      codegen(expr) - assert that expr returns n items to the stack
+    FunctionalAssemblyAssignment((id1, ..., idn) := expr) ->
+      lookup id1, ..., idn in the syntactic stack of blocks, assert that they are variables
+      codegen(expr)
+      for j = n, ..., i:
+      SWAPi where i = 1 + stack_height - stack_height_of_identifier(idj)
+      POP
+    AssemblyAssignment(=: id) ->
+      look up id in the syntactic stack of blocks, assert that it is a variable
+      SWAPi where i = 1 + stack_height - stack_height_of_identifier(id)
+      POP
+    LabelDefinition(name [id1, ..., idn] :) ->
+      JUMPDEST
+      // register new variables id1, ..., idn and set the stack height to
+      // stack_height_at_block_start + number_of_local_variables
+    LabelDefinition(name [number] :) ->
+      JUMPDEST
+      // adjust stack height by +number (can be negative)
+    NumberLiteral(num) ->
+      PUSH<num interpreted as decimal and right-aligned>
+    HexLiteral(lit) ->
+      PUSH32<lit interpreted as hex and left-aligned>
+    StringLiteral(lit) ->
+      PUSH32<lit utf-8 encoded and left-aligned>
+    SubAssembly(assembly <name> block) ->
+      append codegen(block) at the end of the code
+    dataSize(<name>) ->
+      assert that <name> is a subassembly ->
+      PUSH32<size of code generated from subassembly <name>>
+    linkerSymbol(<lit>) ->
+      PUSH32<zeros> and append position to linker table
+    }