mirror of
https://github.com/ethereum/solidity
synced 2023-10-03 13:03:40 +00:00
Update to new assembly specification.
This commit is contained in:
parent
49ac2a1ee5
commit
9683cfea6d
@ -27,6 +27,12 @@ arising when writing manual assembly by the following features:
|
||||
* assembly-local variables: ``let x := add(2, 3) let y := mload(0x40) x := add(x, y)``
|
||||
* access to external variables: ``function f(uint x) { assembly { x := sub(x, 1) } }``
|
||||
* labels: ``let x := 10 repeat: x := sub(x, 1) jumpi(repeat, eq(x, 0))``
|
||||
* loops: ``for { let i := 0 } lt(i, x) { i := add(i, 1) } { y := mul(2, y) }``
|
||||
* switch statements: ``switch x case 0: { y := mul(x, 2) } default: { y := 0 }``
|
||||
* function calls: ``function f(x) -> (y) { switch x case 0: { y := 1 } default: y := mul(x, f(sub(x, 1))) }``
|
||||
|
||||
.. note::
|
||||
Of the above, loops, function calls and switch statements are not yet implemented.
|
||||
|
||||
We now want to describe the inline assembly language in detail.
|
||||
|
||||
@ -91,40 +97,165 @@ you really know what you are doing.
|
||||
Standalone Assembly
|
||||
===================
|
||||
|
||||
Grammar
|
||||
-------
|
||||
This assembly language tries to achieve several goals:
|
||||
|
||||
The assembly lexer follows the one defined by Solidity itself.
|
||||
1. Programs written in it should be readable, even if the code is generated by a compiler from Solidity.
|
||||
2. The translation from assembly to bytecode should contain as few "surprises" as possible.
|
||||
3. Control flow should be easy to detect to help in formal verification and optimization.
|
||||
|
||||
Whitespace is used to delimit tokens and it consists of the characters
|
||||
Space, Tab and Linefeed. Comments as defined below, are interpreted in the
|
||||
same way as Whitespace.
|
||||
Furthermore, the following tokens exist:
|
||||
In order to achieve the first and last goal, assembly provides high-level constructs
|
||||
like ``for`` loops, ``switch`` statements and function calls. It should be possible
|
||||
to write assembly programs that do not make use of explicit ``SWAP``, ``DUP``,
|
||||
``JUMP`` and ``JUMPI`` statements, because the first two obfuscate the data flow
|
||||
and the last two obfuscate control flow. Furthermore, functional statements of
|
||||
the form ``mul(add(x, y), 7)`` are preferred over pure opcode statements like
|
||||
``7 x y add mul`` because in the first form, it is much easier to see which
|
||||
operand is used for which opcode.
|
||||
|
||||
TODO: escapes inside strings, decimal literals, hex literals, hex string literals
|
||||
The second goal is achieved by introducing a desugaring phase that only removes
|
||||
the higher level constructs in a very regular way and still allows inspecting
|
||||
the generated low-level assembly code. The only non-local operation performed
|
||||
by the assembler is name lookup of user-defined identifiers (functions, variables, ...),
|
||||
which follow very simple and regular scoping rules and cleanup of local variables from the stack.
|
||||
|
||||
``OneLineComment := "//" [^\n]*`
|
||||
``MultiLineComment := "/*" .*? "*/"``
|
||||
Scoping: An identifier that is declared (label, variable, function, assembly)
|
||||
is only visible in the block where it was declared (including nested blocks
|
||||
inside the current block). It is not legal to access local variables across
|
||||
function borders, even if they would be in scope. Shadowing is allowed, but
|
||||
two identifiers with the same name cannot be declared in the same block.
|
||||
Local variables cannot be accessed before they were declared, but labels,
|
||||
functions and assemblies can. Assemblies are special blocks that are used
|
||||
for e.g. returning runtime code or creating contracts. No identifier from an
|
||||
outer assembly is visible in a sub-assembly.
|
||||
|
||||
``String := '"' [^"]* '"' | "'" [^']* "'"``
|
||||
``Identifier := [_$a-zA-Z][_$a-zA-Z0-9]*``
|
||||
``Opcodes :=
|
||||
"add" | "addmod" | "address" | "and" | "balance" | "blockhash" | "byte" | "call" |
|
||||
"callcode" | "calldatacopy" | "calldataload" | "calldatasize" | "caller" | "callvalue" |
|
||||
"codecopy" | "codesize" | "coinbase" | "create" | "delegatecall" | "difficulty" |
|
||||
"div" | "dup1" | "dup2" | "dup3" | "dup4" | "dup5" | "dup6" | "dup7" | "dup8" | "dup9" |
|
||||
"dup10" | "dup11" | "dup12" | "dup13" | "dup14" | "dup15" | "dup16" | "eq" | "exp" |
|
||||
"extcodecopy" | "extcodesize" | "gas" | "gaslimit" | "gasprice" | "gt" | "iszero" |
|
||||
"jump" | "jumpi" | "log0" | "log1" | "log2" | "log3" | "log4" | "lt" | "mload" | "mod" |
|
||||
"msize" | "mstore" | "mstore8" | "mul" | "mulmod" | "not" | "number" | "or" | "origin" |
|
||||
"pc" | "pop" | "return" | "sdiv" | "selfdestruct" | "sgt" | "sha3" | "signextend" |
|
||||
"sload" | "slt" | "smod" | "sstore" | "stop" | "sub" | "swap1" | "swap2" | "swap3" |
|
||||
"swap4" | "swap5" | "swap6" | "swap7" | "swap8" | "swap9" | "swap10" | "swap11" |
|
||||
"swap12" | "swap13" | "swap14" | "swap15" | "swap16" | "timestamp" | "xor"``
|
||||
If control flow passes over the end of a block, pop instructions are inserted
|
||||
that match the number of local variables declared in that block, unless the
|
||||
``}`` is directly preceded by an opcode that does not have a continuing control
|
||||
flow path. The stack height is reduced by the number of local variables
|
||||
regardless of that. This mean that labels in the next block will have the
|
||||
same height as before the block that just ended.
|
||||
|
||||
TODO: Define functional instruction, label, assignment, functional assignment,
|
||||
variable declaration, ...
|
||||
If at the end of a block, the stack is not balanced, a warning is issued,
|
||||
unless the last instruction in the block did not have a continuing control flow path.
|
||||
|
||||
Why do we use higher-level constructs like ``switch``, ``for`` and functions:
|
||||
|
||||
Using ``switch``, ``for`` and functions, it should be possible to write
|
||||
complex code without using ``jump`` or ``jumpi`` manually. This makes it much
|
||||
easier to analyze the control flow, which allows for improved formal
|
||||
verification and optimization.
|
||||
|
||||
Furthermore, if manual jumps are allowed, computing the stack height is rather complicated.
|
||||
The position of all local variables on the stack needs to be known, otherwise
|
||||
neither references to local variables nor removing local variables automatically
|
||||
from the stack at the end of a block will work properly. Because of that,
|
||||
every label that is preceded by an instruction that ends or diverts control flow
|
||||
should be annotated with the current stack layout. This annotation is performed
|
||||
automatically during the desugaring phase.
|
||||
|
||||
Example:
|
||||
|
||||
We will follow an example compilation from Solidity to desugared assembly.
|
||||
We consider the runtime bytecode of the following Solidity program::
|
||||
|
||||
contract C {
|
||||
function f(uint x) returns (uint y) {
|
||||
y = 1
|
||||
for (uint i = 0; i < x; i++)
|
||||
y = 2 * y;
|
||||
}
|
||||
}
|
||||
|
||||
The following assembly will be generated::
|
||||
|
||||
{
|
||||
mstore(0x40, 0x60) // store the "free memory pointer"
|
||||
// function dispatcher
|
||||
switch div(calldataload(0), exp(2, 226))
|
||||
case 0xb3de648b: {
|
||||
let (r,) = f(calldataload(4))
|
||||
let ret := $allocate(0x20)
|
||||
mstore(ret, r)
|
||||
return(ret, 0x20)
|
||||
}
|
||||
default: { jump(invalidJumpLabel) }
|
||||
// memory allocator
|
||||
function $allocate(size) -> (pos) {
|
||||
pos := mload(0x40)
|
||||
mstore(0x40, add(pos, size))
|
||||
}
|
||||
// the contract function
|
||||
function f(x) -> (y) {
|
||||
y := 1
|
||||
for { let i := 0 } lt(i, x) { i := add(i, 1) } {
|
||||
y := mul(2, y)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
After the desugaring phase it looks as follows::
|
||||
|
||||
{
|
||||
mstore(0x40, 0x60)
|
||||
{
|
||||
let $0 := div(calldataload(0), exp(2, 226))
|
||||
jumpi($case1, eq($0, 0xb3de648b))
|
||||
jump($caseDefault)
|
||||
$case1:
|
||||
{
|
||||
// the function call - we put return label and arguments on the stack
|
||||
$ret1 calldataload(4) jump($fun_f)
|
||||
$ret1 [r]: // a label with a [...]-annotation resets the stack height
|
||||
// to "current block + number of local variables". It also
|
||||
// introduces a variable, r:
|
||||
// r is at top of stack, $0 is below (from enclosing block)
|
||||
$ret2 0x20 jump($fun_allocate)
|
||||
$ret2 [ret]: // stack here: $0, r, ret (top)
|
||||
mstore(ret, r)
|
||||
return(ret, 0x20)
|
||||
// although it is useless, the jump is automatically inserted,
|
||||
// since the desugaring process does not analyze control-flow
|
||||
jump($endswitch)
|
||||
}
|
||||
$caseDefault:
|
||||
{
|
||||
jump(invalidJumpLabel)
|
||||
jump($endswitch)
|
||||
}
|
||||
$endswitch:
|
||||
}
|
||||
jump($afterFunction)
|
||||
$fun_allocate:
|
||||
{
|
||||
$start[$retpos, size]:
|
||||
let pos := 0
|
||||
{
|
||||
pos := mload(0x40)
|
||||
mstore(0x40, add(pos, size))
|
||||
}
|
||||
swap1 pop swap1 jump
|
||||
}
|
||||
$fun_f:
|
||||
{
|
||||
start [$retpos, x]:
|
||||
let y := 0
|
||||
{
|
||||
let i := 0
|
||||
$for_begin:
|
||||
jumpi($for_end, iszero(lt(i, x)))
|
||||
{
|
||||
y := mul(2, y)
|
||||
}
|
||||
$for_continue:
|
||||
{ i := add(i, 1) }
|
||||
jump($for_begin)
|
||||
$for_end:
|
||||
} // Here, a pop instruction is inserted for i
|
||||
swap1 pop swap1 jump
|
||||
}
|
||||
$afterFunction:
|
||||
stop
|
||||
}
|
||||
|
||||
Syntax
|
||||
------
|
||||
@ -159,6 +290,8 @@ In the following, ``mem[a...b)`` signifies the bytes of memory starting at posit
|
||||
|
||||
The opcodes ``pushi`` and ``jumpdest`` cannot be used directly.
|
||||
|
||||
In the grammar, opcodes are represented as pre-defined identifiers.
|
||||
|
||||
+-------------------------+------+-----------------------------------------------------------------+
|
||||
| stop + `-` | stop execution, identical to return(0,0) |
|
||||
+-------------------------+------+-----------------------------------------------------------------+
|
||||
@ -508,3 +641,220 @@ first slot of the array and then only the array elements follow.
|
||||
Statically-sized memory arrays do not have a length field, but it will be added soon
|
||||
to allow better convertibility between statically- and dynamically-sized arrays, so
|
||||
please do not rely on that.
|
||||
|
||||
|
||||
Specification
|
||||
=============
|
||||
|
||||
Assembly happens in four stages:
|
||||
|
||||
1. Parsing
|
||||
2. Desugaring (removes switch, for and functions)
|
||||
3. Opcode stream generation
|
||||
4. Bytecode generation
|
||||
|
||||
|
||||
Parsing / Grammar
|
||||
-----------------
|
||||
|
||||
The tasks of the parser are the following:
|
||||
|
||||
- Turn the byte stream into a token stream, discarding C++-style comments
|
||||
(a special comment exists for source references, but we will not explain it here).
|
||||
- Turn the token stream into an AST according to the grammar below
|
||||
- Register identifiers with the block they are defined in (annotation to the
|
||||
AST node) and note from which point on, variables can be accessed.
|
||||
|
||||
The assembly lexer follows the one defined by Solidity itself.
|
||||
|
||||
Whitespace is used to delimit tokens and it consists of the characters
|
||||
Space, Tab and Linefeed. Comments are regular JavaScript/C++ comments and
|
||||
are interpreted in the same way as Whitespace.
|
||||
|
||||
Grammar::
|
||||
|
||||
AssemblyBlock = '{' AssemblyItem* '}'
|
||||
AssemblyItem =
|
||||
Identifier |
|
||||
AssemblyBlock |
|
||||
FunctionalAssemblyExpression |
|
||||
AssemblyLocalDefinition |
|
||||
FunctionalAssemblyAssignment |
|
||||
AssemblyAssignment |
|
||||
LabelDefinition |
|
||||
AssemblySwitch |
|
||||
AssemblyFunctionDefinition |
|
||||
AssemblyFor |
|
||||
'break' | 'continue' |
|
||||
SubAssembly | 'dataSize' '(' Identifier ')' |
|
||||
LinkerSymbol |
|
||||
'errorLabel' | 'bytecodeSize' |
|
||||
NumberLiteral | StringLiteral | HexLiteral
|
||||
Identifier = [a-zA-Z_$] [a-zA-Z_0-9]*
|
||||
FunctionalAssemblyExpression = Identifier '(' ( AssemblyItem ( ',' AssemblyItem )* )? ')'
|
||||
AssemblyLocalDefinition = 'let' IdentifierOrList ':=' FunctionalAssemblyExpression
|
||||
FunctionalAssemblyAssignment = IdentifierOrList ':=' FunctionalAssemblyExpression
|
||||
IdentifierOrList = Identifier | '(' IdentifierList ')'
|
||||
IdentifierList = Identifier ( ',' Identifier)*
|
||||
AssemblyAssignment = '=:' Identifier
|
||||
LabelDefinition = Identifier ( '[' ( IdentifierList | NumberLiteral ) ']' )? ':'
|
||||
AssemblySwitch = 'switch' FunctionalAssemblyExpression AssemblyCase*
|
||||
( 'default' ':' AssemblyBlock )?
|
||||
AssemblyCase = 'case' FunctionalAssemblyExpression ':' AssemblyBlock
|
||||
AssemblyFunctionDefinition = 'function' Identifier '(' IdentifierList? ')' '->'
|
||||
( '(' IdentifierList ')' AssemblyBlock
|
||||
AssemblyFor = 'for' ( AssemblyBlock | FunctionalAssemblyExpression)
|
||||
FunctionalAssemblyExpression ( AssemblyBlock | FunctionalAssemblyExpression) AssemblyBlock
|
||||
SubAssembly = 'assembly' Identifier AssemblyBlock
|
||||
LinkerSymbol = 'linkerSymbol' '(' StringLiteral ')'
|
||||
NumberLiteral = HexNumber | DecimalNumber
|
||||
HexLiteral = 'hex' ('"' ([0-9a-fA-F]{2})* '"' | '\'' ([0-9a-fA-F]{2})* '\'')
|
||||
StringLiteral = '"' ([^"\r\n\\] | '\\' .)* '"'
|
||||
HexNumber = '0x' [0-9a-fA-F]+
|
||||
DecimalNumber = [0-9]+
|
||||
|
||||
|
||||
Desugaring
|
||||
----------
|
||||
|
||||
An AST transformation removes for, switch and function constructs. The result
|
||||
is still parseable by the same parser, but it will not use certain constructs.
|
||||
If jumpdests are added that are only jumped to and not continued at, information
|
||||
about the stack content is added, unless no local variables of outer scopes are
|
||||
accessed or the stack height is the same as for the previous instruction.
|
||||
|
||||
Pseudocode::
|
||||
|
||||
desugar item: AST -> AST =
|
||||
match item {
|
||||
AssemblyFunctionDefinition('function' name '(' arg1, ..., argn ')' '->' ( '(' ret1, ..., retm ')' body) ->
|
||||
<name>:
|
||||
{
|
||||
$<name>_start [$retPC, $argn, ..., arg1]:
|
||||
let ret1 := 0 ... let retm := 0
|
||||
{ desugar(body) }
|
||||
swap and pop items so that only ret1, ... retn, $retPC are left on the stack
|
||||
jump
|
||||
}
|
||||
AssemblyFor('for' { init } condition post body) ->
|
||||
{
|
||||
init // cannot be its own block because we want variable scope to extend into the body
|
||||
// find I such that there are no labels $forI_*
|
||||
$forI_begin:
|
||||
jumpi($forI_end, iszero(condition))
|
||||
{ body }
|
||||
$forI_continue:
|
||||
{ post }
|
||||
jump($forI_begin)
|
||||
$forI_end:
|
||||
}
|
||||
'break' ->
|
||||
{
|
||||
// find nearest enclosing scope with label $forI_end
|
||||
pop all local variables that are defined at the current point
|
||||
but not at $forI_end
|
||||
jump($forI_end)
|
||||
}
|
||||
'continue' ->
|
||||
{
|
||||
// find nearest enclosing scope with label $forI_continue
|
||||
pop all local variables that are defined at the current point
|
||||
but not at $forI_continue
|
||||
jump($forI_continue)
|
||||
}
|
||||
AssemblySwitch(switch condition cases ( default: defaultBlock )? ) ->
|
||||
{
|
||||
// find I such that there is no $switchI* label or variable
|
||||
let $switchI_value := condition
|
||||
for each of cases match {
|
||||
case val: -> jumpi($switchI_caseJ, eq($switchI_value, val))
|
||||
}
|
||||
if default block present: ->
|
||||
{ defaultBlock jump($switchI_end) }
|
||||
for each of cases match {
|
||||
case val: { body } -> $switchI_caseJ: { body jump($switchI_end) }
|
||||
}
|
||||
$switchI_end:
|
||||
}
|
||||
FunctionalAssemblyExpression( identifier(arg1, arg2, ..., argn) ) ->
|
||||
{
|
||||
if identifier is function <name> with n args and m ret values ->
|
||||
{
|
||||
// find I such that $funcallI_* does not exist
|
||||
$funcallI_return argn ... arg2 arg1 jump(<name>)
|
||||
if the current context is `let (id1, ..., idm) := f(...)` ->
|
||||
$funcallI_return [id1, ..., idm]:
|
||||
else ->
|
||||
$funcallI_return[m - n - 1]:
|
||||
turn the functional expression that leads to the function call
|
||||
into a statement stream
|
||||
}
|
||||
else -> desugar(children of node)
|
||||
}
|
||||
default node ->
|
||||
desugar(children of node)
|
||||
}
|
||||
|
||||
Opcode Stream Generation
|
||||
------------------------
|
||||
|
||||
During opcode stream generation, we keep track of the current stack height,
|
||||
so that accessing stack variables by name is possible.
|
||||
|
||||
Pseudocode::
|
||||
|
||||
codegen item: AST -> opcode_stream =
|
||||
match item {
|
||||
AssemblyBlock({ items }) ->
|
||||
join(codegen(item) for item in items)
|
||||
if last generated opcode has continuing control flow:
|
||||
POP for all local variables registered at the block (including variables
|
||||
introduced by labels)
|
||||
warn if the stack height at this point is not the same as at the start of the block
|
||||
Identifier(id) ->
|
||||
lookup id in the syntactic stack of blocks
|
||||
match type of id
|
||||
Local Variable ->
|
||||
DUPi where i = 1 + stack_height - stack_height_of_identifier(id)
|
||||
Label ->
|
||||
// reference to be resolved during bytecode generation
|
||||
PUSH<bytecode position of label>
|
||||
SubAssembly ->
|
||||
PUSH<bytecode position of subassembly data>
|
||||
FunctionalAssemblyExpression(id ( arguments ) ) ->
|
||||
join(codegen(arg) for arg in arguments.reversed())
|
||||
id (which has to be an opcode, might be a function name later)
|
||||
AssemblyLocalDefinition(let (id1, ..., idn) := expr) ->
|
||||
register identifiers id1, ..., idn as locals in current block at current stack height
|
||||
codegen(expr) - assert that expr returns n items to the stack
|
||||
FunctionalAssemblyAssignment((id1, ..., idn) := expr) ->
|
||||
lookup id1, ..., idn in the syntactic stack of blocks, assert that they are variables
|
||||
codegen(expr)
|
||||
for j = n, ..., i:
|
||||
SWAPi where i = 1 + stack_height - stack_height_of_identifier(idj)
|
||||
POP
|
||||
AssemblyAssignment(=: id) ->
|
||||
look up id in the syntactic stack of blocks, assert that it is a variable
|
||||
SWAPi where i = 1 + stack_height - stack_height_of_identifier(id)
|
||||
POP
|
||||
LabelDefinition(name [id1, ..., idn] :) ->
|
||||
JUMPDEST
|
||||
// register new variables id1, ..., idn and set the stack height to
|
||||
// stack_height_at_block_start + number_of_local_variables
|
||||
LabelDefinition(name [number] :) ->
|
||||
JUMPDEST
|
||||
// adjust stack height by +number (can be negative)
|
||||
NumberLiteral(num) ->
|
||||
PUSH<num interpreted as decimal and right-aligned>
|
||||
HexLiteral(lit) ->
|
||||
PUSH32<lit interpreted as hex and left-aligned>
|
||||
StringLiteral(lit) ->
|
||||
PUSH32<lit utf-8 encoded and left-aligned>
|
||||
SubAssembly(assembly <name> block) ->
|
||||
append codegen(block) at the end of the code
|
||||
dataSize(<name>) ->
|
||||
assert that <name> is a subassembly ->
|
||||
PUSH32<size of code generated from subassembly <name>>
|
||||
linkerSymbol(<lit>) ->
|
||||
PUSH32<zeros> and append position to linker table
|
||||
}
|
||||
|
Loading…
Reference in New Issue
Block a user