Rough Interpreter Design

Oct 16, 2008 19:30

Been doing too much posting and not enough coding. I'm off to make some more progress on the compiler, which means I need to lay out the bytecode design, which in turn means I needed to summarize how the interpreter was going to run. So here are the notes on the interpreter; I'll be back when it's starting to work.



-------------------------------------------------------------------------------
INTERPRETER
-------------------------------------------------------------------------------

So the compiler takes the original script--just a text file--and pre-digests
it into a bytecode form. The interpreter reads that bytecode and executes
it, performing the steps of the script.

An application can naturally include both portions in its runtime. In the
app I'm working on at the moment, I only need the interpreter linked into
the final program; the compiling will happen early, and the compiled scripts
will be provided as add-on content for enriching the app. Lots of ways to
put all this together.

The compiler is, ultimately, just a means to an end: its output is this nice
pretty bytecode, that the interpreter uses as input. When the interpreter
runs, it keeps track of variables, namespaces, objects and so on; it follows
the bytecode in order to evaluate expressions and call methods. Importantly,
your application can provide additional methods that the script can invoke
just as it could invoke script-supplied methods. That's actually essential,
since those application-supplied methods are presumably very fast (being
native code, after all) while the script itself will be slow. Thus the
script provides the logic, but your application provides the power.
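A minimal sketch of what that host-side binding could look like (the `Host` registry and `NativeFn` signature here are illustrative assumptions, not Dualscript's actual API):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical registry: the application provides fast native methods
// under names the script can invoke like any script-supplied method.
using NativeFn = std::function<long(const std::vector<long>&)>;

struct Host {
    std::map<std::string, NativeFn> fns;
    void provide(const std::string& name, NativeFn f) { fns[name] = f; }
    long call(const std::string& name, const std::vector<long>& args) {
        return fns.at(name)(args);  // throws if the script names an unknown method
    }
};
```

The script stays in charge of the logic; the host function does the heavy lifting in native code.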

-------------------------------------------------------------------------------
STACKS AND PILES
-------------------------------------------------------------------------------

The difference between a stack and a pile is mostly one of interpretation. For
a stack, think of a spring-loaded plate holder at a buffet restaurant: put a
new plate on the top, and the whole thing sinks a bit; pull a plate off the
top and it raises a bit to keep that top plate at the same spot. The position
of an object in the stack is always measured relative to the top of the stack:
this plate is top-of-stack (TOS), the one below that is next-on-stack (NOS),
and although there's some other plate ten plates down, we don't care
about it because we can't reach it. When a new object is added to a
stack, every object's position in the stack changes.

A pile is mechanically just the same, but without the spring. In a pile we
measure from the fixed bottom of the pile upwards: that plate is on the
bottom and so has position zero, the next plate is at position one right above
that one, and so on. Adding more stuff to the pile doesn't change anything
else's location within the pile, and since the pile isn't hidden inside a
spring-loaded chamber we can actually reach any pile element we want at any
time--unlike a stack where we really only care about what's at or near the top.

-------------------------------------------------------------------------------
RUNTIME STACK THEORY
-------------------------------------------------------------------------------

The Dualscript interpreter uses a stack-based expression evaluator. That kind
of interpreter uses an execution stack where a modern computer would use
registers instead. It's slow, but much simpler to write than a
register-based compiler and CPU emulator.

For example, let's say a compiler hits this expression:

A = (5 + 3) * 2;

The compiler emits instructions like this to match:

push address-of-A
push 5
push 3
math add        // pops 3 then 5, adds them, pushes 8
push 2
math multiply   // pops 2 then 8, multiplies them, pushes 16
assign          // pops 16 then address-of-A, assigns A=16, pushes 16
pop             // expression complete; throw away the 16
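The arithmetic part of that sequence runs on a very small machine. A sketch of the dispatch loop, using made-up opcode names rather than the real Dualscript bytecode:

```cpp
#include <cassert>
#include <stack>
#include <vector>

enum Op { PUSH, ADD, MUL };          // illustrative opcodes only
struct Instr { Op op; long arg; };   // arg is only meaningful for PUSH

long run(const std::vector<Instr>& code) {
    std::stack<long> s;
    for (const Instr& i : code) {
        switch (i.op) {
        case PUSH: s.push(i.arg); break;
        case ADD: { long b = s.top(); s.pop();   // pops 3 then 5...
                    long a = s.top(); s.pop();
                    s.push(a + b); break; }      // ...pushes 8
        case MUL: { long b = s.top(); s.pop();
                    long a = s.top(); s.pop();
                    s.push(a * b); break; }
        }
    }
    return s.top();  // the evaluated expression result
}
```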

How about a harder example?

A = B = 3 * function ( 5 + 8 );

Just for the sake of example, let's say "function" returns 99. I'm also
going to assume that the caller tells the function how many args are on the
stack for it, and the called function is responsible for popping those args
and pushing the function result. So, the compiler would generate this:

push address-of-A   // stack has A
push address-of-B   // stack has B, A
push 3              // stack has 3, B, A
push 5              // stack has 5, 3, B, A
push 8              // stack has 8, 5, 3, B, A
math add            // stack has 13, 3, B, A
push 1              // stack has 1, 13, 3, B, A (1=number of args)
call function       // stack has 99, 3, B, A
math multiply       // stack has 297, B, A
assign              // stack has 297, A (B is now 297)
assign              // stack has 297 (A is now 297 too)
pop                 // stack is empty

As you can see, this usage pattern is clearly stack-like and not pile-like:
at any given moment we're only messing with whatever's at the top of the
stack, and we don't really care how deep the stack is. We'll eventually work
our way back to whatever's further down as we consolidate the values on the top
into simplified expressions.
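The calling convention assumed above--caller pushes the argument count last, callee pops the count and the args and pushes its result--can be sketched like this (with a stand-in "function" that ignores its arguments and returns 99):

```cpp
#include <cassert>
#include <stack>

// The callee pops its own argument count, then that many arguments,
// and hands back a result for the interpreter to push.
long call_function(std::stack<long>& s) {
    long argc = s.top(); s.pop();
    for (long i = 0; i < argc; ++i) s.pop();  // consume the args
    return 99;  // hypothetical result, as in the example above
}
```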

The actual operations up there are the bytecode, and there's a separate doc
to talk about the bytecode operations that Dualscript uses. As I said,
evaluating that bytecode is actually pretty straight-forward; if the compiler
did its job right, then these instructions are each pretty trivial. It's the
object and memory management that's difficult to get right.

-------------------------------------------------------------------------------
DUALSCRIPT VALUES
-------------------------------------------------------------------------------

Every constant and every variable in a script--whether a primitive like 15 or
"hello" or an object created by the script from a user-defined type--is
represented by a value. To the interpreter a value is a pointer to a fixed-
size structure more or less like this (and yes, I know you can't put a
C++ std::string object inside a union):

struct dsvalue {
    string name;
    uint32_t type;
    union {
        dsobject *obj;
        int64 inum;
        double dnum;
        string str;
    } data;
};

Notice that every value has a name--in practice that's the name of a variable,
like "format" in the later example. Those names are useful when printing
error messages, but importantly they're also used during function invocation
since function parameter order is flexible in Dualscript.

The type field is little more than a discriminator for the data area: is
this value an integer, a floating point number, a string, or a user-defined
type (or maybe simply undefined)? If it's a user-defined type, which one?

On the other end of "data.obj" is a dynamically allocated structure that
includes a refcount, and that holds dsvalue structures for data members.
More about all that later.
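Since std::string can't legally live in the union, one workable layout keeps the name outside the union and puts strings behind a pointer to the refcounted block, matching the stringdata structure described below. A sketch (the type codes follow the 0/1/2 scheme from the types section; the as_int helper is my own illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical type codes matching the scheme described later:
// 0 = unknown, 1 = number, 2 = string, higher = user-defined.
enum : uint32_t { T_UNDEF = 0, T_NUM = 1, T_STR = 2 };

struct stringdata { size_t refcount; std::string s; };

struct dsvalue {
    std::string name;     // outside the union, so std::string is legal here
    uint32_t type;        // discriminator for the data area
    union {
        stringdata* str;  // pointer to a refcounted heap block
        int64_t     inum;
        double      dnum;
    } data;
};

// Checking the discriminator before touching the union keeps access safe.
int64_t as_int(const dsvalue& v) {
    assert(v.type == T_NUM);
    return v.data.inum;
}
```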

-------------------------------------------------------------------------------
NAMESPACES
-------------------------------------------------------------------------------

A script can declare a namespace for its content:

namespace movies.fiction;
int svar;                 // script variable
method smethod { ... }    // script method
realgenius {
    global gvar = 12;     // global variable
    ivar = 5;             // instance variable
    method gmethod { ... }
    method imethod requires this { ... }  // instance method
}

A namespace is just a list of names that map to stuff: classes, methods,
variables, or importantly, other namespaces. In this example, the
namespace hierarchy looks like this:

(root) contains {
    "movies" = child namespace
}
"movies" contains {
    "fiction" = child namespace
}
"movies.fiction" contains {
    "svar" = variable
    "smethod" = subroutine
    "realgenius" = namespace
}
"movies.fiction.realgenius" contains {
    (instance) = implicit child
    "gvar" = variable
    "gmethod" = subroutine
}
"movies.fiction.realgenius.(instance)" contains {
    "ivar" = variable
    "imethod" = subroutine
}

The interpreter manages namespaces. When trying to find a name, the
interpreter checks the current namespace first; if it can't find the name,
it walks up one level towards the root and tries again. The "namespace"
command remains in effect until overridden with another command; just
"namespace;" returns to the root namespace.
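That lookup rule is a short loop. A sketch, with an int handle standing in for whatever a namespace entry really maps to:

```cpp
#include <cassert>
#include <map>
#include <string>

struct Namespace {
    Namespace* parent = nullptr;         // null for the root
    std::map<std::string, int> entries;  // name -> handle (stand-in)
};

// Check the current namespace first; on a miss, walk up toward the root.
const int* lookup(const Namespace* ns, const std::string& name) {
    for (; ns != nullptr; ns = ns->parent) {
        auto it = ns->entries.find(name);
        if (it != ns->entries.end()) return &it->second;
    }
    return nullptr;  // not found anywhere up the chain
}
```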

-------------------------------------------------------------------------------
THE INTERPRETER PILE
-------------------------------------------------------------------------------

The interpreter's pile is just a place to store variables. The pile grows
when new variables are defined, and the pile shrinks when the script leaves
the scope where some variables were defined. Globals go on the pile too just
like local variables do--but, since we never leave global scope, they never
go away. (Each class definition likewise gets one global variable to contain
its class-global data members.) Every element in the pile is a full dsvalue
structure, not a pointer to one.
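Scope handling then amounts to remembering the pile depth at scope entry and cutting back to it on exit. A sketch (the `Value` struct is a stand-in for the full dsvalue):

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Value { std::string name; };  // stand-in for the full dsvalue

struct Pile {
    std::vector<Value> v;
    // Remember the depth on scope entry...
    size_t enter_scope() const { return v.size(); }
    // ...and cut back to it on exit. Globals sit below any mark the
    // script ever returns to, so they never go away.
    void leave_scope(size_t mark) { v.resize(mark); }
};
```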

{
format = "Hello, World";
}

This script is going to eventually need two values on the pile: one for the
"Hello, World" constant and another for the local variable "format".

Since these are the first ones mentioned in the script, the pile is empty
before this line is encountered. When the compiler first hits this line,
it reads the word "format" and realizes it's defining a new variable.
(Since the compiler hadn't heard of a variable named "format" before, if
it weren't showing up as the left-hand side of an assignment expression
then the compiler would issue a compile-time error.)

So the compiler hits that word "format" and emits two things into the bytecode
stream: first it emits a command to reserve space on the pile for the new
value; second, since the variable appears as a left-hand side, it emits an
instruction to push the address of this value onto the stack (so, it emits
0--this value sits on the bottom of the pile) and proceeds.

The line causes the
interpreter to add a value for format to the pile, and then a value for the
constant. (I'll draw the first object on the bottom, like a pile should be.)
First, we reserve space for format:

0: { name="format", type=undefined, data=null }

We haven't done any work with the variable yet; it's uninitialized so far, and
therefore has no content associated with it.

Now the compiler hits the = operator, so it recursively evaluates the rest of
the expression and then emits an "assign" opcode. (There's a different doc
about how the recursive-descent parser works.) Evaluating that right-hand-
side, though, encounters the constant string "Hello, World". We need a value
to represent that, and since it's a primitive we know just how to set it up:

1: { name=none, type=string|readonly, data=0x8E239FB0 }
0: { name="format", type=undefined, data=null }

The data=0x8E239FB0 pointer refers to a dynamically allocated, refcounted
structure like this:

struct stringdata {
    size_t refcount = 1;
    string ss = "Hello, World";
};

The reference count indicates that there's exactly one variable in the pile
who points to this thing, which is true right now. The compiler has ended
up emitting a sequence like this:

addpile null                  // tells interpreter to add the format value to the pile
push 0                        // pushes address of format onto runtime stack
addpile string "Hello, World"
push 1                        // now stack has &"Hello, World" on top, &format just below
assign                        // pops both values from stack, assigns, pushes one back
pop                           // semicolon pops that final value
removepile 2                  // removes top two values from the pile

The assignment leaves the following on the pile:

1: { name=none, type=string, data=0x8E239FB0 }
0: { name="format", type=string, data=0x8E239FB0 }
...with 0x8E239FB0's refcount==2

The removepile instruction causes the interpreter to get rid of the top two
items on the pile. Removing each one decrements a reference from the target
stringdata structure; when its refcount drops to zero, the interpreter
immediately deletes the structure itself. (If the structure were for a user-
supplied type, it would invoke the destroy member first.)
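That release step is the usual refcount dance. A sketch, reusing the stringdata shape from above:

```cpp
#include <cassert>
#include <string>

struct stringdata {
    size_t refcount;
    std::string s;
};

// Drop one reference; delete the block the moment the count hits zero.
// (A user-supplied type would run its destroy member first.)
void release(stringdata*& p) {
    if (p && --p->refcount == 0) {
        delete p;
    }
    p = nullptr;
}
```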

-------------------------------------------------------------------------------
DUALSCRIPT TYPES
-------------------------------------------------------------------------------

Starting with something trivial: type=0 means unknown; type=1 means number;
type=2 means string. Anything higher than that means a user-constructed type.

The compiler works on a source-file-by-source-file basis: no prototypes or
externs required. So any given text script is compiled into something self-
contained: it doesn't know what types another script might have exposed.

So, let's say a script defines type Foo. When the compiler hits it, it could
pick identifier 10 to represent type Foo. Two days later the compiler loads
a different file and compiles it separately, and it again picks identifier 10.
It doesn't know any better; every script is self-contained.

So, the interpreter has to resolve the collisions. When the interpreter loads
a compiled script, it has to look through the script's data types and remap
each one to something not currently in use in memory. So, if the interpreter
has already loaded a dozen scripts and is using types 0..50, then when it
loads a new script referring to a new type name, it will remap that name to
type 51 instead of whatever identifier the compiler picked.

The name of the type is relevant here. If two otherwise-unrelated scripts
use the same type name, then it's a cinch they were made to work together
and are talking about the same type. So the interpreter will remap them to
the same constant.
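So the remap is keyed by name. A sketch of that table (the reserved ids 0..2 follow the scheme above; the next_id bookkeeping is my assumption):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Interpreter-wide table of already-loaded type names -> global ids.
struct TypeTable {
    std::map<std::string, uint32_t> ids;
    uint32_t next_id = 3;  // 0..2 reserved for unknown/number/string

    // Scripts using the same type name are assumed to mean the same
    // type, so they remap to the same interpreter-wide constant;
    // a brand-new name gets the next unused id.
    uint32_t remap(const std::string& name) {
        auto it = ids.find(name);
        if (it != ids.end()) return it->second;
        return ids[name] = next_id++;
    }
};
```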

The interpreter also has to keep in mind a lot of other stuff about each
type. Type constructs are intended to be additive: if script A defines
members a1 and a2 in a type, and script B defines members b1 and b2 in
that same type, then the interpreter will consider the type to have 4 members.
It's not secure, but who cares?
