Update to new assembly specification.

This commit is contained in:
chriseth 2017-01-03 15:19:14 +01:00
parent 49ac2a1ee5
commit 9683cfea6d

View File

@ -27,6 +27,12 @@ arising when writing manual assembly by the following features:
* assembly-local variables: ``let x := add(2, 3) let y := mload(0x40) x := add(x, y)``
* access to external variables: ``function f(uint x) { assembly { x := sub(x, 1) } }``
* labels: ``let x := 10 repeat: x := sub(x, 1) jumpi(repeat, eq(x, 0))``
* loops: ``for { let i := 0 } lt(i, x) { i := add(i, 1) } { y := mul(2, y) }``
* switch statements: ``switch x case 0: { y := mul(x, 2) } default: { y := 0 }``
* function calls: ``function f(x) -> (y) { switch x case 0: { y := 1 } default: y := mul(x, f(sub(x, 1))) }``
.. note::
Of the above, loops, function calls and switch statements are not yet implemented.
We now want to describe the inline assembly language in detail.
@ -91,40 +97,165 @@ you really know what you are doing.
Standalone Assembly
===================
Grammar
-------
This assembly language tries to achieve several goals:
The assembly lexer follows the one defined by Solidity itself.
1. Programs written in it should be readable, even if the code is generated by a compiler from Solidity.
2. The translation from assembly to bytecode should contain as few "surprises" as possible.
3. Control flow should be easy to detect to help in formal verification and optimization.
Whitespace is used to delimit tokens and it consists of the characters
Space, Tab and Linefeed. Comments as defined below, are interpreted in the
same way as Whitespace.
Furthermore, the following tokens exist:
In order to achieve the first and last goal, assembly provides high-level constructs
like ``for`` loops, ``switch`` statements and function calls. It should be possible
to write assembly programs that do not make use of explicit ``SWAP``, ``DUP``,
``JUMP`` and ``JUMPI`` statements, because the first two obfuscate the data flow
and the last two obfuscate control flow. Furthermore, functional statements of
the form ``mul(add(x, y), 7)`` are preferred over pure opcode statements like
``7 x y add mul`` because in the first form, it is much easier to see which
operand is used for which opcode.
TODO: escapes inside strings, decimal literals, hex literals, hex string literals
The second goal is achieved by introducing a desugaring phase that only removes
the higher level constructs in a very regular way and still allows inspecting
the generated low-level assembly code. The only non-local operation performed
by the assembler is name lookup of user-defined identifiers (functions, variables, ...),
which follow very simple and regular scoping rules and cleanup of local variables from the stack.
``OneLineComment := "//" [^\n]*`
``MultiLineComment := "/*" .*? "*/"``
Scoping: An identifier that is declared (label, variable, function, assembly)
is only visible in the block where it was declared (including nested blocks
inside the current block). It is not legal to access local variables across
function borders, even if they would be in scope. Shadowing is allowed, but
two identifiers with the same name cannot be declared in the same block.
Local variables cannot be accessed before they were declared, but labels,
functions and assemblies can. Assemblies are special blocks that are used
for e.g. returning runtime code or creating contracts. No identifier from an
outer assembly is visible in a sub-assembly.
``String := '"' [^"]* '"' | "'" [^']* "'"``
``Identifier := [_$a-zA-Z][_$a-zA-Z0-9]*``
``Opcodes :=
"add" | "addmod" | "address" | "and" | "balance" | "blockhash" | "byte" | "call" |
"callcode" | "calldatacopy" | "calldataload" | "calldatasize" | "caller" | "callvalue" |
"codecopy" | "codesize" | "coinbase" | "create" | "delegatecall" | "difficulty" |
"div" | "dup1" | "dup2" | "dup3" | "dup4" | "dup5" | "dup6" | "dup7" | "dup8" | "dup9" |
"dup10" | "dup11" | "dup12" | "dup13" | "dup14" | "dup15" | "dup16" | "eq" | "exp" |
"extcodecopy" | "extcodesize" | "gas" | "gaslimit" | "gasprice" | "gt" | "iszero" |
"jump" | "jumpi" | "log0" | "log1" | "log2" | "log3" | "log4" | "lt" | "mload" | "mod" |
"msize" | "mstore" | "mstore8" | "mul" | "mulmod" | "not" | "number" | "or" | "origin" |
"pc" | "pop" | "return" | "sdiv" | "selfdestruct" | "sgt" | "sha3" | "signextend" |
"sload" | "slt" | "smod" | "sstore" | "stop" | "sub" | "swap1" | "swap2" | "swap3" |
"swap4" | "swap5" | "swap6" | "swap7" | "swap8" | "swap9" | "swap10" | "swap11" |
"swap12" | "swap13" | "swap14" | "swap15" | "swap16" | "timestamp" | "xor"``
If control flow passes over the end of a block, pop instructions are inserted
that match the number of local variables declared in that block, unless the
``}`` is directly preceded by an opcode that does not have a continuing control
flow path. The stack height is reduced by the number of local variables
regardless of that. This mean that labels in the next block will have the
same height as before the block that just ended.
TODO: Define functional instruction, label, assignment, functional assignment,
variable declaration, ...
If at the end of a block, the stack is not balanced, a warning is issued,
unless the last instruction in the block did not have a continuing control flow path.
Why do we use higher-level constructs like ``switch``, ``for`` and functions:
Using ``switch``, ``for`` and functions, it should be possible to write
complex code without using ``jump`` or ``jumpi`` manually. This makes it much
easier to analyze the control flow, which allows for improved formal
verification and optimization.
Furthermore, if manual jumps are allowed, computing the stack height is rather complicated.
The position of all local variables on the stack needs to be known, otherwise
neither references to local variables nor removing local variables automatically
from the stack at the end of a block will work properly. Because of that,
every label that is preceded by an instruction that ends or diverts control flow
should be annotated with the current stack layout. This annotation is performed
automatically during the desugaring phase.
Example:
We will follow an example compilation from Solidity to desugared assembly.
We consider the runtime bytecode of the following Solidity program::
contract C {
function f(uint x) returns (uint y) {
y = 1
for (uint i = 0; i < x; i++)
y = 2 * y;
}
}
The following assembly will be generated::
{
mstore(0x40, 0x60) // store the "free memory pointer"
// function dispatcher
switch div(calldataload(0), exp(2, 226))
case 0xb3de648b: {
let (r,) = f(calldataload(4))
let ret := $allocate(0x20)
mstore(ret, r)
return(ret, 0x20)
}
default: { jump(invalidJumpLabel) }
// memory allocator
function $allocate(size) -> (pos) {
pos := mload(0x40)
mstore(0x40, add(pos, size))
}
// the contract function
function f(x) -> (y) {
y := 1
for { let i := 0 } lt(i, x) { i := add(i, 1) } {
y := mul(2, y)
}
}
}
After the desugaring phase it looks as follows::
{
mstore(0x40, 0x60)
{
let $0 := div(calldataload(0), exp(2, 226))
jumpi($case1, eq($0, 0xb3de648b))
jump($caseDefault)
$case1:
{
// the function call - we put return label and arguments on the stack
$ret1 calldataload(4) jump($fun_f)
$ret1 [r]: // a label with a [...]-annotation resets the stack height
// to "current block + number of local variables". It also
// introduces a variable, r:
// r is at top of stack, $0 is below (from enclosing block)
$ret2 0x20 jump($fun_allocate)
$ret2 [ret]: // stack here: $0, r, ret (top)
mstore(ret, r)
return(ret, 0x20)
// although it is useless, the jump is automatically inserted,
// since the desugaring process does not analyze control-flow
jump($endswitch)
}
$caseDefault:
{
jump(invalidJumpLabel)
jump($endswitch)
}
$endswitch:
}
jump($afterFunction)
$fun_allocate:
{
$start[$retpos, size]:
let pos := 0
{
pos := mload(0x40)
mstore(0x40, add(pos, size))
}
swap1 pop swap1 jump
}
$fun_f:
{
start [$retpos, x]:
let y := 0
{
let i := 0
$for_begin:
jumpi($for_end, iszero(lt(i, x)))
{
y := mul(2, y)
}
$for_continue:
{ i := add(i, 1) }
jump($for_begin)
$for_end:
} // Here, a pop instruction is inserted for i
swap1 pop swap1 jump
}
$afterFunction:
stop
}
Syntax
------
@ -159,6 +290,8 @@ In the following, ``mem[a...b)`` signifies the bytes of memory starting at posit
The opcodes ``pushi`` and ``jumpdest`` cannot be used directly.
In the grammar, opcodes are represented as pre-defined identifiers.
+-------------------------+------+-----------------------------------------------------------------+
| stop + `-` | stop execution, identical to return(0,0) |
+-------------------------+------+-----------------------------------------------------------------+
@ -508,3 +641,220 @@ first slot of the array and then only the array elements follow.
Statically-sized memory arrays do not have a length field, but it will be added soon
to allow better convertibility between statically- and dynamically-sized arrays, so
please do not rely on that.
Specification
=============
Assembly happens in four stages:
1. Parsing
2. Desugaring (removes switch, for and functions)
3. Opcode stream generation
4. Bytecode generation
Parsing / Grammar
-----------------
The tasks of the parser are the following:
- Turn the byte stream into a token stream, discarding C++-style comments
(a special comment exists for source references, but we will not explain it here).
- Turn the token stream into an AST according to the grammar below
- Register identifiers with the block they are defined in (annotation to the
AST node) and note from which point on, variables can be accessed.
The assembly lexer follows the one defined by Solidity itself.
Whitespace is used to delimit tokens and it consists of the characters
Space, Tab and Linefeed. Comments are regular JavaScript/C++ comments and
are interpreted in the same way as Whitespace.
Grammar::
AssemblyBlock = '{' AssemblyItem* '}'
AssemblyItem =
Identifier |
AssemblyBlock |
FunctionalAssemblyExpression |
AssemblyLocalDefinition |
FunctionalAssemblyAssignment |
AssemblyAssignment |
LabelDefinition |
AssemblySwitch |
AssemblyFunctionDefinition |
AssemblyFor |
'break' | 'continue' |
SubAssembly | 'dataSize' '(' Identifier ')' |
LinkerSymbol |
'errorLabel' | 'bytecodeSize' |
NumberLiteral | StringLiteral | HexLiteral
Identifier = [a-zA-Z_$] [a-zA-Z_0-9]*
FunctionalAssemblyExpression = Identifier '(' ( AssemblyItem ( ',' AssemblyItem )* )? ')'
AssemblyLocalDefinition = 'let' IdentifierOrList ':=' FunctionalAssemblyExpression
FunctionalAssemblyAssignment = IdentifierOrList ':=' FunctionalAssemblyExpression
IdentifierOrList = Identifier | '(' IdentifierList ')'
IdentifierList = Identifier ( ',' Identifier)*
AssemblyAssignment = '=:' Identifier
LabelDefinition = Identifier ( '[' ( IdentifierList | NumberLiteral ) ']' )? ':'
AssemblySwitch = 'switch' FunctionalAssemblyExpression AssemblyCase*
( 'default' ':' AssemblyBlock )?
AssemblyCase = 'case' FunctionalAssemblyExpression ':' AssemblyBlock
AssemblyFunctionDefinition = 'function' Identifier '(' IdentifierList? ')' '->'
( '(' IdentifierList ')' AssemblyBlock
AssemblyFor = 'for' ( AssemblyBlock | FunctionalAssemblyExpression)
FunctionalAssemblyExpression ( AssemblyBlock | FunctionalAssemblyExpression) AssemblyBlock
SubAssembly = 'assembly' Identifier AssemblyBlock
LinkerSymbol = 'linkerSymbol' '(' StringLiteral ')'
NumberLiteral = HexNumber | DecimalNumber
HexLiteral = 'hex' ('"' ([0-9a-fA-F]{2})* '"' | '\'' ([0-9a-fA-F]{2})* '\'')
StringLiteral = '"' ([^"\r\n\\] | '\\' .)* '"'
HexNumber = '0x' [0-9a-fA-F]+
DecimalNumber = [0-9]+
Desugaring
----------
An AST transformation removes for, switch and function constructs. The result
is still parseable by the same parser, but it will not use certain constructs.
If jumpdests are added that are only jumped to and not continued at, information
about the stack content is added, unless no local variables of outer scopes are
accessed or the stack height is the same as for the previous instruction.
Pseudocode::
desugar item: AST -> AST =
match item {
AssemblyFunctionDefinition('function' name '(' arg1, ..., argn ')' '->' ( '(' ret1, ..., retm ')' body) ->
<name>:
{
$<name>_start [$retPC, $argn, ..., arg1]:
let ret1 := 0 ... let retm := 0
{ desugar(body) }
swap and pop items so that only ret1, ... retn, $retPC are left on the stack
jump
}
AssemblyFor('for' { init } condition post body) ->
{
init // cannot be its own block because we want variable scope to extend into the body
// find I such that there are no labels $forI_*
$forI_begin:
jumpi($forI_end, iszero(condition))
{ body }
$forI_continue:
{ post }
jump($forI_begin)
$forI_end:
}
'break' ->
{
// find nearest enclosing scope with label $forI_end
pop all local variables that are defined at the current point
but not at $forI_end
jump($forI_end)
}
'continue' ->
{
// find nearest enclosing scope with label $forI_continue
pop all local variables that are defined at the current point
but not at $forI_continue
jump($forI_continue)
}
AssemblySwitch(switch condition cases ( default: defaultBlock )? ) ->
{
// find I such that there is no $switchI* label or variable
let $switchI_value := condition
for each of cases match {
case val: -> jumpi($switchI_caseJ, eq($switchI_value, val))
}
if default block present: ->
{ defaultBlock jump($switchI_end) }
for each of cases match {
case val: { body } -> $switchI_caseJ: { body jump($switchI_end) }
}
$switchI_end:
}
FunctionalAssemblyExpression( identifier(arg1, arg2, ..., argn) ) ->
{
if identifier is function <name> with n args and m ret values ->
{
// find I such that $funcallI_* does not exist
$funcallI_return argn ... arg2 arg1 jump(<name>)
if the current context is `let (id1, ..., idm) := f(...)` ->
$funcallI_return [id1, ..., idm]:
else ->
$funcallI_return[m - n - 1]:
turn the functional expression that leads to the function call
into a statement stream
}
else -> desugar(children of node)
}
default node ->
desugar(children of node)
}
Opcode Stream Generation
------------------------
During opcode stream generation, we keep track of the current stack height,
so that accessing stack variables by name is possible.
Pseudocode::
codegen item: AST -> opcode_stream =
match item {
AssemblyBlock({ items }) ->
join(codegen(item) for item in items)
if last generated opcode has continuing control flow:
POP for all local variables registered at the block (including variables
introduced by labels)
warn if the stack height at this point is not the same as at the start of the block
Identifier(id) ->
lookup id in the syntactic stack of blocks
match type of id
Local Variable ->
DUPi where i = 1 + stack_height - stack_height_of_identifier(id)
Label ->
// reference to be resolved during bytecode generation
PUSH<bytecode position of label>
SubAssembly ->
PUSH<bytecode position of subassembly data>
FunctionalAssemblyExpression(id ( arguments ) ) ->
join(codegen(arg) for arg in arguments.reversed())
id (which has to be an opcode, might be a function name later)
AssemblyLocalDefinition(let (id1, ..., idn) := expr) ->
register identifiers id1, ..., idn as locals in current block at current stack height
codegen(expr) - assert that expr returns n items to the stack
FunctionalAssemblyAssignment((id1, ..., idn) := expr) ->
lookup id1, ..., idn in the syntactic stack of blocks, assert that they are variables
codegen(expr)
for j = n, ..., i:
SWAPi where i = 1 + stack_height - stack_height_of_identifier(idj)
POP
AssemblyAssignment(=: id) ->
look up id in the syntactic stack of blocks, assert that it is a variable
SWAPi where i = 1 + stack_height - stack_height_of_identifier(id)
POP
LabelDefinition(name [id1, ..., idn] :) ->
JUMPDEST
// register new variables id1, ..., idn and set the stack height to
// stack_height_at_block_start + number_of_local_variables
LabelDefinition(name [number] :) ->
JUMPDEST
// adjust stack height by +number (can be negative)
NumberLiteral(num) ->
PUSH<num interpreted as decimal and right-aligned>
HexLiteral(lit) ->
PUSH32<lit interpreted as hex and left-aligned>
StringLiteral(lit) ->
PUSH32<lit utf-8 encoded and left-aligned>
SubAssembly(assembly <name> block) ->
append codegen(block) at the end of the code
dataSize(<name>) ->
assert that <name> is a subassembly ->
PUSH32<size of code generated from subassembly <name>>
linkerSymbol(<lit>) ->
PUSH32<zeros> and append position to linker table
}