Nathan Smith (white_door@dread.nl)
0. What?
I will show in this document (hopefully) how I made a script compiler, maybe you can learn from it. I'm not great with words, and I tend to forget things, and leave stuff out, so good luck. (You'll need it.)
To make a script compiler in the way I did (this is not the only way) you need to know this stuff:
- Linked Lists
- Stacks (like linked lists only accessed differently)
- some understanding of C++ classes
- optionally you could use hash tables or binary trees for the variable and function storage (faster access time), but linked lists can do the job okay.
You need to have these things:
- a c/c++ compiler
- Flex a gnu program designed to build a lexor.
- Bison a gnu program designed to build a parser.
- debugging skills (takes a while to get to a point where you can test the code, so you get alot of bugs at once [I did, but maybe you won't.])
If you want to run the demo I incuded you will need the allegro game library.
1. Why?
Why did I first try and learn about scripting?
I had a problem. I couldn't get the design of a game to work, and I always gave up on a game before I had even completed the engine for it. I couldn't figure out how to use the engine to create the game.
I found an article, I can't remember where now, but it said you need to divide your game in two like this:
The Engine The Game
Display Art on the screen Artwork
Play the music Sound & Music
Display character speech, and control the movements Story & Plot
The whole point is I needed the actions and speech of the characters to be in the game part and not in the engine part. To help this I decided to find or make a scripting engine that would take my character & plot scripts and run them, so I could change The Game's game source inside the map editor with out touching The Engine's source code.
I was not able to find a single script engine on the internet that both worked in on all the platforms that the rpg would and was easy to learn, understand, and link in to the The Engine. So I decided to write my own. And I did it, and it worked. (sort of. as it was my first there are some problems in the design)
Later on I wrote a script compilier in a weekend for the AI of a robot controlled space war style game. It was far more simple, and far less powerful, BUT it had a far better basic design. After finishing that, I decided to write this article, so the examples I will give here were designed for an arcade game, but I will also talk about how to expand it for a more advanced engine. (like for a rpg)
2. How?
Searching the internet, I found many articles on this subject, the best of all was one onwww.flipcode.comcalled "Implementing A Scripting Engine" by Jan Niestadt. Using his tutorial to learn I went on to build a more complete engine. While my work is based on his tutorial, there are a number of things I have done in a different way, and indeed my overall design is slightly more simple.
Here are the parts of my script engine:
Lexor -> Parser -> Compiler -> Virtual Machine
The lexor reads the script code in it, and converts it to a list of symbols (or tokens) that the parser can use. The parser uses this to output a syntax tree. The syntax tree is a class storing all the parts of the script in a easy to break down form. The syntax tree is sent to the compiler and it outputs the virtual asm. The Virtual Machine thens run this asm code.
The Virtual Machine works like this:
Variable Table <--> Stack <--> Code
/\
\--- Constants
Variables can be pushed or read off the stack, constants can be pushed on the stack, and Code can push, pop, and read from the stack.
so a = 5 translates to:
push 5 // Constant
pop a // Variable
and a = c + 6 to:
push c // Variable
push 6 // Constant
add // Code
pop a // Variable
(in truth there will not be a pop command, we will divide the task in to two halves: gettop <var> and discard)
Now to start it all...
3. Language?
You need to invent the language first.
In terms of a script langauge it's important the code is simple and easy and quick to write. Code speed is not a big deal, as for most games, scripts are short and very high level.
For this article I'm going to write a basic like langauge, but the language itself is flexible and very easy to change, so if you are following this, just change it to whatever you want.
label:
goto label
var=434
if var > 5 then turn(4)
if var < 0 then
...
else
...
endif
var = energy()
That's it, its enough to control a space ship, and remember I did this in a single weekend, so I didn't have time for other features or commands.
Variables are dynamic, so the variable table will add new variables to the table as it comes to them in the run-time, in the examples I will only include floating point numbers, adding more types is not that hard. However it is important to know at compile time what a variable is, so you either have to have declare variables and their types before you use them (like in c/c++ or pascal) or use a special symbol next to the variable (like basic does). I'm not sure of any other ways..
The next step is the lexor:
4. Lexor?
Taken from a dictionary I found on a book shelf in my father's office: A lexicon is 1: alphabetal list of words of a langauge or of a particular subject 2: a dictionary.
The lexor converts the source code into a list of word or symbols. Why? Because having a list of numbers (with each symbol is a number) is easier to deal with than a string of text.
Most lexors are basically the same, so to help write it I used a gnu program called flex. There are versions for windows, dos, and linux. So it shouldn't be to hard to find.
The lex file is layed out like this:
%{
Includes and C defines
%}
Symbol Defines
%%
The list of symbols
%%
Any extra c/c++ functions you need.
And here is the c/c++ includes and defines I used:
%{
#include <allegro.h>
#include <string.h>
#include <assert.h>
#include "syntax.h"
#include "parser.h"
#include "file2pack.h" // converts file to pack_file
extern "C" int yywrap ();
int line_num;
int identifier();
int float_const();
%}
Here are symbol defines:
LETTER [a-zA-Z_]
DIGIT [0-9]
ALPHANUM [a-zA-Z_0-9]
IDENT {LETTER}{ALPHANUM}*
FLOAT {DIGIT}+("."{DIGIT}+)?
WHITESPACE [ \t]+
LINEFEED "\r"
This stuff is simple logic for detecting text patterns, I had to read the flex info manual that came with the djgpp version of flex, to figure most of it out. But hopefully it sort of makes sense? '*' means as many of the last thing as you want (0 or more.) '?' that last thing is either here or not. '+' at least one of the last thing must be here but can be more. '[ ]' means any of the characters inside it.
LETTER checks for a single character in the range of A-Z or a-z or for the character '_'.
DIGIT checks for a single character in the range of 0-9
ALPHANUM checks for a single character that is either a letter or a digit.
IDENT checks for one LETTER and zero or more ALPHANUM after that. (examples: "a" "_p3" "hello_there")
FLOAT is more complex.. First it check for one or more digits then it checks for a optional "." and one or more digits after it. (examples: "434" "3.25")
WHITESPACE check for one or more of either " " (space) or "\t" (tab)
LINEFEED is the return character
So this is the list of symbols for my langauge:
"if" {return IF;}
"then" {return THEN;}
"else" {return ELSE;}
"endif" {return ENDIF;}
"goto" {return GOTO;}
"==" {return EQUAL;}
"=" {return '=';}
"<>" {return NOTEQUAL;}
"not" {return NOT;}
"+" {return '+';}
"-" {return '-';}
"*" {return '*';}
"/" {return '/';}
"^" {return '^';}
">" {return '>';}
"<" {return '<';}
">=" {return GTET;}
"<=" {return LTET;}
"and" {return AND;}
"or" {return OR;}
"(" {return '(';}
")" {return ')';}
":" {return ':';}
"," {return ',';}
{FLOAT} {return float_const();}
{IDENT} {return identifier ();}
{WHITESPACE} {}
{LINEFEED} {line_num++; return LINE_END;}
. {return ERROR_TOKEN;}
For each symbol you must return a integer value for it, you can also return a single 'character', but don't worry about this, as the parser will set up the symbol defines for you.
Also the order is important, the built-in words must come first and symbols like the constants and identifiers that have more complex logic next, finally remove the white space, and check for line feeds. The last line detects a syntax error, if it finds something it doesn't understand.
And lastly here are the extra functions:
int identifier () {
assert(strlen(yytext)>0);
yylval.string = new char[strlen(yytext)+1];
strcpy (yylval.string, yytext);
return ID;
}
int float_const() {
yylval.value=atof(yytext);
return FLOAT;
}
#include "file2pack2.h" // some more stuff to convert file to pack_file
yylval is for storing information for use in the parser. I'll explain it is the parser section. yytext is a string with the current symbol stored in it.
Problems with flex:
It uses basic C file operations to access files, this is okay, but in my game I needed special file functions so I could read a script from inside another file. So I wrote two headers with defines that would convert the file operations to use the allegro game libraries packfile operations. Very messy, but it shows that it can be done. :)
5. Syntax?
We need a tree holding the syntax of the script. The syntax is the structure of the program.
Example:
a = 4 + 5
if a == 5 then print(10) else a=0
tree
||
statement_list
/ \
assign statement_list
// \ / \
a add if_statement empty_statement
/ \ / | \
4 5 equal function assign
/ \ // \ // \
a 5 print 10 a 0
I decided to use a class for this, however a structure plus some functions could also do the trick.
Here is a enum of the different Node types: [ ] are for the branchs it should have ( )
enum NodeType {
// Statements
STMT_LIST, // list of statements [left-part, right-part] (left-part may be another list)
EMPTY_STMT, // empty statement [ ]
LABEL_STMT, // label statement [ ]
GOTO_STMT, // goto statement [ ]
EXPR_STMT, // single expression as statement [expression]
IFTHEN_STMT, // if statement [cond, if-part]
IFTHENELSE_STMT, // if statement with else [cond, if-part, else-part]
ERROR_STMT, // error statement
// Expressions
EQUAL_EXPR, // equality expression [op1, op2]
NOTEQUAL_EXPR, // not equal expression [op1, op2]
AND_EXPR, // and expression [op1,op2]
OR_EXPR, // or expression [op1,op2]
NOT_EXPR, // not expression [op1]
LT_EXPR, // < [op1,op2]
GT_EXPR, // > [op1,op2]
LTET_EXPR, // <= [op1,op2]
GTET_EXPR, // >= [op1,op2]
ASSIGN_EXPR, // assignment expression [variable, op2]
ADD_EXPR, // add expression [op1, op2]
SUB_EXPR, // sub expression [op1, op2]
MUL_EXPR, // multiply expression [op1, op2]
DIV_EXPR, // divide expression [op1, op2]
POW_EXPR, // power expression [op1, op2]
NEG_EXPR, // negitive expression [op1]
FUNCTION_EXPR, // function statement [varaibles passed to it]
EXPR_LIST, // list of expressions to be passed to a function [expr, expr_list]
IDENT_EXPR, // identifier or varaible [ ]
CONST_EXPR // a constant [ ]
};
and here is the class itself:
class TreeNode
{
public:
class TreeNode *child[3];
enum NodeType type;
double value;
char *identifier;
TreeNode(NodeType _type) {
child[0]=NULL;
child[1]=NULL;
child[2]=NULL;
type=_type;
identifier=NULL;
}
TreeNode(NodeType _type,TreeNode *c1) {
type=_type;
child[0]=c1;
child[1]=NULL;
child[2]=NULL;
identifier=NULL;
}
TreeNode(NodeType _type,TreeNode *c1,TreeNode *c2) {
type=_type;
child[0]=c1;
child[1]=c2;
child[2]=NULL;
identifier=NULL;
}
TreeNode(NodeType _type,TreeNode *c1,TreeNode *c2,TreeNode *c3) {
type=_type;
child[0]=c1;
child[1]=c2;
child[2]=c3;
identifier=NULL;
}
~TreeNode() {
if(!this) return;
delete child[0];
delete child[1];
delete child[2];
}
};
This just holds the syntax.. now we must write the parser which will build the tree.
6. Parser?
This converts the list of symbols in to a syntax tree. It parses the file and checks for simple errors. It holds the rules of the language.
For this I used a gnu program called bison. It is quite close in layout to the flex file:
{%
Defines & Headers
%}
symbol and rule defines, and other defines
%%
The rules.
%%
Extra c/c++ functions
Here is my C defines & headers:
%{
#include <allegro.h>
#include <string.h>
#include <malloc.h>
#include <stdio.h>
#include "lexor.h"
#include "syntax.h"
#include "functions.h"
TreeNode *tree; // the syntax tree itself
#define YYDEBUG 1 // this turns debugging on, and makes it as verbose as possible
#define YYERROR_VERBOSE
// Error-reporting function must be defined by the caller
void Error (char *format, ...);
// Forward references
void yyerror (char *msg);
int num_args=0;
%}
Here is the list of symbols or tokens as bison calls them:
%token ERROR_TOKEN
%token IF ELSE THEN GOTO ENDIF
%token LINE_END
%token ':' '(' ')' ','
%token <string> ID
%token <value> FLOAT
%right '='
%left AND OR
%left EQUAL NOTEQUAL LTET GTET '<' '>'
%left '+' '-'
%left '*' '/'
%left NEG NOT
%right '^'
The %token is for normal symbols %token <type> is for symbols that have values.. and %right and %left is used to decide the order the symbols are processed in.
In 2 + 4 * 5 - 4, the parser reads it like this: 4 * 5 first then + 2 then - 4, because *, and / are higher than +, and -.
%right means the operator works like this: a_single_thing <operator> a_list_of_things
%left means the operator works like this: a_list_of_things <operator> a_single_thing
Here are some other defines:
%start program
This sets the starting rule for our language as 'program'.
/* The structure for passing value between lexer and parser */
%union {
char *string;
double value;
TreeNode *tnode;
}
This sets up the different possible return values of each symbol & rule. (this turns in to the yylval struct)
Here are the defines for the rules:
%type <tnode> program statement_list statement
%type <tnode> goto_statement label_statement
%type <tnode> if_statement optional_else_statement optional_else_statement_list
%type <tnode> function_expression expression_list
%type <tnode> compare_expression assign_expression expression
So it work like this:
The program('program') is a list of statements('statement_list'). A single statement can be one of the following: a goto statement, a label statement, an if statement, or a variable assignment statement. An if statement contains an compare expression and it can contain a optional else statement too.
A rule is layed out like this:
program_part_name
: the first rule here {c/c++ code}
| optional rule(s) {c/c++ code}
| ...etc...
;
Here is the first rule:
program
: statement_list {tree=$1;}
;
So we define the program as a statement_list. In the C code we assign the return value of the statement_list to the Syntax tree variable. Each symbol or rule_name can have a return value as defined above. You access it by number, eg. $n where n is which symbol to access.
Here is the next rule:
statement_list
: statement LINE_END statement_list {$$ = new TreeNode (STMT_LIST, $1, $3);}
| label_statement statement_list {$$ = new TreeNode (STMT_LIST, $1, $2);}
| /* empty */ {$$ = new TreeNode (EMPTY_STMT);}
;
$$ is the return value. So the first rule says look for a statement a end of line and then another list of the statements after it, and if that doesn't apply look for a label_statement then a optional statement_list on the same line after it, and if that doesn't apply it is the end of the statement list.
TreeNode is the class for a single node in the syntax tree.
Here is the label statement:
label_statement
: ID ':'
{
$$ = new TreeNode (LABEL_STMT);
$$->identifier = $1;
}
;
This checks for the ID symbol and a ':', and then gets the string value of it as passed here from the identifier() function in the lexor.
So here is a basic statement:
statement
: /* empty */ {$$ = new TreeNode (EMPTY_STMT); }
| if_statement {$$ = $1;}
| assign_expression {$$ = new TreeNode (EXPR_STMT, $1);}
| goto_statement {$$ = $1;}
| function_expression {$$ = new TreeNode (EXPR_STMT, $1);}
| statement ':' statement {$$ = new TreeNode (STMT_LIST, $1, $3);}
| error expression {$$ = new TreeNode (ERROR_STMT);}
| error {$$ = new TreeNode (ERROR_STMT);}
;
A statement must be one of these things.
Note that two of the rules are expressions not pure statement. Expressions push values on the stack that must be removed. Example: int energy(); returns int, but if I say energy(); the return value must be cleared, so a Expression Statement node must be created.
The if_statement and goto_statement just return the child node, and don't need to create a new node on the tree.
Here is the goto statement:
goto_statement
: GOTO ID
{
$$ = new TreeNode (GOTO_STMT);
$$->identifier = $2;
}
;
Note this is almost the same as the label statement, however it checks for "goto identifier" instead of "label:".
Here is a function:
function_expression
: ID '(' expression_list ')'
{
if(!external_functions->is_function($1) {
$$ = new TreeNode (ERROR_STMT);
yyerror("Function not found.");
}
else if(external_functions->get_args($1)!=num_args) {
$$ = new TreeNode (ERROR_STMT);
yyerror("Incorrect number of args.");
}
else {
$$ = new TreeNode(FUNCTION_EXPR,$3);
$$->identifier=$1;
}
num_args=0;
}
: ID '(' ')'
{
if(!external_functions->is_function($1) {
$$ = new TreeNode (ERROR_STMT);
yyerror("Function not found.");
}
else if(external_functions->get_args($1)!=0) {
$$ = new TreeNode (ERROR_STMT);
yyerror("Incorrect number of args.");
}
else {
$$ = new TreeNode(FUNCTION_EXPR);
$$->identifier=$1;
}
num_args=0;
}
;
This is slightly more complex than a goto as we have to check the eternal_functions that the thing we are calling does exist and the right number of variables passed to it.
Here is the expression list: (works like statement list)
expression_list
: expression ',' expression_list {$$ = new TreeNode(EXPR_LIST,$1,$3);num_args++;}
| expression {$$ = $1;num_args++;}
;
Here is the assign expression:
assign_expression
: ID '=' expression
{
$$ = new TreeNode(ASSIGN_EXPR,$3);
$$ -> identifier = $1;
}
;
This allows variables to be assigned values.
Here is the if statement
if_statement
: IF expression THEN LINE_END statement_list optional_else_statement_list ENDIF
{
if ($6 != NULL)
$$ = new TreeNode (IFTHENELSE_STMT, $2, $5, $6);
else
$$ = new TreeNode (IFTHEN_STMT, $2, $5);
}
| IF expression THEN statement optional_else_statement
{
if ($6 != NULL)
$$ = new TreeNode (IFTHENELSE_STMT, $2, $4, $5);
else
$$ = new TreeNode (IFTHEN_STMT, $2, $4);
}
;
optional_else_statement
: /* empty */ {$$=NULL;}
| ELSE statement {$$=$2;}
;
optional_else_statement_list
: /* empty */ {$$=NULL;}
| ELSE statement_list {$$=$2;}
;
These allow either a single line or multi-line if statement and a optional else statement block.
Here are the expression rules:
expression
: assign_expression {$$=$1;}
| compare_expression {$$=$1;}
| function_expression {$$=$1;}
| expression '+' expression {$$=new TreeNode(ADD_EXPR,$1,$3);}
| expression '-' expression {$$=new TreeNode(SUB_EXPR,$1,$3);}
| expression '*' expression {$$=new TreeNode(MUL_EXPR,$1,$3);}
| expression '/' expression {$$=new TreeNode(DIV_EXPR,$1,$3);}
| expression '^' expression {$$=new TreeNode(POW_EXPR,$1,$3);}
| '-' expression %prec NEG {$$=new TreeNode(NEG_EXPR,$2);}
| ID
{
$$ = new TreeNode (IDENT_EXPR);
$$->identifier = $1;
}
| FLOAT
{
$$ = new TreeNode (CONST_EXPR);
$$->value = $1;
}
| '(' expression ')' {$$=$2;}
;
That should be farely straight forward, one thing to point out is %prec NEG, this allows '-' to be used for two different symbols.. either as a = -b or as a = 5 - b, but NEG has higher precedence than a normal subtract does.
And last of all the compare_expression:
compare_expression
: expression EQUAL expression {$$ = new TreeNode (EQUAL_EXPR, $1, $3);}
| expression NOTEQUAL expression {$$ = new TreeNode (NOTEQUAL_EXPR, $1, $3);}
| expression '<' expression {$$ = new TreeNode (LT_EXPR, $1, $3);}
| expression '>' expression {$$ = new TreeNode (GT_EXPR, $1, $3);}
| expression LTET expression {$$ = new TreeNode (LTET_EXPR, $1, $3);}
| expression GTET expression {$$ = new TreeNode (GTET_EXPR, $1, $3);}
| NOT expression {$$ = new TreeNode (NOT_EXPR, $2);}
| expression AND expression {$$ = new TreeNode (AND_EXPR, $1, $3);}
| expression OR expression {$$ = new TreeNode (OR_EXPR, $1, $3);}
;
I could of included this in the expression rules, but I decide it would be easier to understand if seperate.
7. Compiler?
The compiler convert the syntax tree in to the virtual asm code, that will be run on our Virtual Machine.
At this point our syntax tree could be converted in to any form of asm code, I have chosen a simple virtual asm code, based on a virtual stack machine.
Here are the opcodes:
nop // does nothing (my favorite opcode)
push <value or variable> // pushs the operand on to the stack
gettop <variable> // reads the top value on the stack and stores it in the variable
discard // removes the top value from the stack (popping it off)
Those deal with variables, here are some more:
add
sub
mul
div
pow
Those remove the two top values on the stack, does something with them together, then pushs the result on.
neg
This removes the top value on the stack and pushs a negative value of it back on.
jmp label // jumps to the target label
jmpt label // pops the top value off the stack and jumps if it is true
jmpf label // pops the top value off the stack and jumps if it is false
call function // calls an external function
or label // jump if true and don't pop the stack else pop it
and label // jump if false and don't pop the stack else pop it
Those are the jump statements.
equal //pops two top values off the stack and pushs the result of the comparision (a boolean)..
less //pops two top values off the stack and pushs if the first is less than the second or not
greater //pops two top values off the stack and pushs if the first is greater than the second or not
not //pops the top value off the stack and pushs the not value of it
And the compare statements.
finally:
jumptarget label
This is the location of a label, and it not a real opcode.
Here are the asm codes in a enum:
enum Opcode {
OP_NOP,
OP_PUSH,
OP_GETTOP,
OP_DISCARD,
OP_JMP,
OP_JMPF,
OP_JMPT,
OP_CALL,
OP_OR,
OP_AND,
OP_EQUAL,
OP_LESS,
OP_GREATER,
OP_NOT,
OP_ADD,
OP_SUB,
OP_MUL,
OP_DIV,
OP_POW,
OP_NEG,
JUMPTARGET // not an opcode but a jump target;
};
Here is the class I used to store the asm code.
class asmcode
{
public:
asmcode () {opcode=OP_NOP; next=NULL; data_type=0;}
asmcode (Opcode _opcode) {opcode=_opcode; next=NULL; data_type=0;}
asmcode (Opcode _opcode, double val) {opcode=_opcode; next=NULL; data.value=val; data_type=1;}
asmcode (Opcode _opcode, char *val) {opcode=_opcode; next=NULL; data.identifier=val;data_type=2;}
asmcode (Opcode _opcode, int val) {opcode=_opcode; next=NULL; data.target=val; data_type=3;}
~asmcode() {if(!this) return; if(data_type==2) delete []data.identifier;}
int Len()
{
asmcode *i = next;
int cnt = 1;
while (i) {cnt++; i=i->next;}
return cnt;
}
Opcode opcode;
union {
double value;
char *identifier;
int target;
} data;
int data_type;
asmcode *next;
};
Opcode is just a enum of with all the opcodes in it.
to build the source code we step through the syntax and build as we go..
// this joins two blocks of code together.
asmcode *Concatenate (asmcode *blk1, asmcode *blk2) {
asmcode *search = blk1;
while (search->next != NULL) search = search->next;
search->next = blk2;
return blk1;
}
// make a new copy of a string
char *new_string(char *str)
{
char *temp = new char [strlen(str)+1];
strcpy(temp,str);
return temp;
}
// returns the name of a new label and hope it doesn't conflict with any others...
// note: you must make a copy of the string this returns
char *make_new_label()
{
static int i=0;
static char label[20];
sprintf(label,"jmpt%010d",++i);
return label;
}
// the heart of the compiler
asmcode *genasmcode(Treenode *tree)
{
if(!tree) return new asmcode(OP_NOP);
Treenode *root=tree;
asmcode *blk1,*blk2,
*cond,*jump2else,*thenpart,*jump2end,*elsepart,*endif;
char *label1,*label2;
switch(root->type) {
case STMT_LIST: // for statement lists we just join the two children together.
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
return blk1;
case EMPTY_STMT: // a blank line turn in to nop
return new asmcode (OP_NOP);
case LABEL_STMT: // a label is turned in to jumptarget
return new asmcode (JUMPTARGET,root->identifier);
case GOTO_STMT: // a goto is converted to jmp
return new asmcode (OP_JMP,root->identifier);
case EXPR_STMT: // at the end of a expression statement you have to discard the top of the stack
blk1 = genasmcode (root->child[0]);
blk2 = new asmcode (OP_DISCARD);
return Concatenate (blk1, blk2);
case IFTHEN_STMT: // an if statement
// First, create the necessary code parts
label1 = create_new_label(); // create a unique label
cond = genasmcode (root->child[0]);
jump2end = new asmcode (OP_JMPF,new_string(label1));
thenpart = genasmcode (root->child[1]);
endif = new asmcode (JUMPTARGET,new_string(label1));
// Now, concatenate them all
Concatenate (cond, jump2end);
Concatenate (jump2end, thenpart);
Concatenate (thenpart, endif);
return cond;
case IFTHENELSE_STMT: // an if then else statement
// First, create the necessary code parts
label1 = create_new_label();
label2 = create_new_label();
cond = genasmcode (root->child[0]);
jump2else = new asmcode (OP_JMPF,new_string(label1));
thenpart = genasmcode (root->child[1]);
jump2end = new asmcode (OP_JMP,new_string(label2));
elsepart = new asmcode (JUMPTARGET,new_string(label1));
Concatenate (elsepart, GenasmCode (root->child[2]));
endif = new asmcode (JUMPTARGET,new_string(label2));
// Now, concatenate them all
Concatenate (cond, jump2else);
Concatenate (jump2else, thenpart);
Concatenate (thenpart, jump2end);
Concatenate (jump2end, elsepart);
Concatenate (elsepart, endif);
return cond;
case EQUAL_EXPR: // equal and most of the other simple opcodes are the same as this....
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
return Concatenate (blk1, new asmcode (OP_EQUAL));
case NOTEQUAL_EXPR: // for not equal we do equal and then add a not on the end
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
Concatenate (blk1, new asmcode (OP_EQUAL));
return Concatenate (blk1, new asmcode (OP_NOT));
case LT_EXPR: // less than
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
return Concatenate (blk1, new asmcode (OP_LESS));
case LTET_EXPR: // for <= we do a > then a not
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
Concatenate (blk1, new asmcode (OP_GREATER));
return Concatenate (blk1, new asmcode (OP_NOT));
case GT_EXPR: // greater than
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
return Concatenate (blk1, new asmcode (OP_GREATER));
case GTET_EXPR: // for >= we do a < then a not
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
Concatenate (blk1, new asmcode (OP_LESS));
return Concatenate (blk1, new asmcode (OP_NOT));
case AND_EXPR: // and is a lot like an if statement
// First, create the necessary code parts
label1 = create_new_label1();
cond = genasmcode (root->child[0]);
jump2end = new asmcode (OP_AND,new_string(label1));
thenpart = genasmcode (root->child[1]);
endif = new asmcode (JUMPTARGET,new_string(label1));
// Now, concatenate them all
Concatenate (cond, jump2end);
Concatenate (jump2end, thenpart);
Concatenate (thenpart, endif);
return cond;
case OR_EXPR: // or is a lot like an if statement too
// First, create the necessary code parts
label1 = create_new_label1();
cond = genasmcode (root->child[0]);
jump2end = new asmcode (OP_OR,new_string(label1));
thenpart = genasmcode (root->child[1]);
endif = new asmcode (JUMPTARGET,new_string(label1));
// Now, concatenate them all
Concatenate (cond, jump2end);
Concatenate (jump2end, thenpart);
Concatenate (thenpart, endif);
return cond;
case NOT_EXPR: // a not expression
blk1 = genasmcode (root->child[0]);
return Concatenate (blk1, new asmcode (OP_NOT));
case ASSIGN_EXPR: // an assign expression
blk1 = genasmcode (root->child[0]);
blk2 = new asmcode (OP_GETTOP, root->identifier);
return Concatenate (blk1, blk2);
case ADD_EXPR: // math opcode
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
return Concatenate (blk1, new asmcode (OP_ADD));
case SUB_EXPR: // math opcode
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
return Concatenate (blk1, new asmcode (OP_SUB));
case MUL_EXPR: // math opcode
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
return Concatenate (blk1, new asmcode (OP_MUL));
case DIV_EXPR: // math opcode
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
return Concatenate (blk1, new asmcode (OP_DIV));
case POW_EXPR: // math opcode
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
return Concatenate (blk1, new asmcode (OP_POW));
case NEG_EXPR: // negate the prev value
blk1 = genasmcode (root->child[0]);
return Concatenate (blk1, new asmcode (OP_NEG));
case FUNCTION_EXPR: // a function call
blk1=new asmcode (OP_CALL, root->identifier);
if(root->child[0]) {
blk2 = genasmcode (root->child[0]);
return Concatenate (blk2,blk1 );
}
return blk1;
case EXPR_LIST: // a list of expressions
blk1 = genasmcode (root->child[0]);
blk2 = genasmcode (root->child[1]);
Concatenate (blk1, blk2);
return blk1;
case IDENT_EXPR: // pushes the variable on the stack
return new asmcode (OP_PUSH, root->identifier);
case CONST_EXPR: // pushes the constant value
return new asmcode (OP_PUSH, root->value);
case ERROR_STMT: // well.. return nop I guess..
return new asmcode (OP_NOP);
}
// error unknown opcode!
return new asmcode(OP_NOP);
}
This produces a linked list that can be sent to the virtual machine, you could could write it to a file at this time... but I find it compiles the code fast enough to be used dynamically.
8. Virtual?
Before the asm can be run by the virtual machine, there is one more step. All the label statements must be converted to nops, and the label operands on the jmp and other operators need to be converted into numbers (how many intructions to jump by). Also I like to change the linked link into a array.
This class will also allow some very simple task switching, but don't expect much.
class vm
{
private:
asmcode **program;
int length; // length of program
int pc; // program counter
stack *st; // This program's stack
vtable *local; // the local variables
int time; // how many cycles do we have left.
int loop; // does this script loop?
public:
static int speed; // number of cycles each process gets per update.
vm() {
program=NULL;
length=0;
pc=0;
time=0;
st = new stack;
local = new vtable;
}
~vm() {
int i;
if(program) {
for(i=0;i<length;i++) {
delete program[i];
}
delete [] program;
}
delete st;
delete local;
}
void set_loop(int _loop) {loop=_loop;}
void init(asmcode *prog); // use this asmcode and get it ready to be run.
int load(char *filename); // load the file
void reset(); // reset the program to the start
void update(); // run this program for a while.
};
int vm::speed;
// Get the program script ready to be run.
void vm::init(asmcode *prog)
{
int i,j;
if(!prog) return;
length = prog->len();
program = new(asmcode *)[length];
asmcode * temp = prog;
for(i=0;i < length;i++) // set the array
{
program[i] = temp;
temp=temp->next;
}
for(i=0;i < length;i++) // now fix the jumps
if(program[i]->opcode==OP_JMP || program[i]->opcode==OP_JMPT || program[i]->opcode==OP_JMPF || program[i]->opcode==OP_AND || program[i]->opcode==OP_OR )
for(j=0;j<length;j++)
if(program[j]->opcode==JUMPTARGET && strcmp(program[i]->data.identifier, program[j]->data.identifier)==0) {
delete [ ] program[i]->data.identifier;
program[i]->data_type=3;
program[i]->data.target=j-i;
break;
}
}
for(i=0;i < length;i++) // finally convert all the labels in to nop opcodes
{
if(program[i]->opcode==JUMPTARGET) {
delete [ ] program[i]->data.identifier;
program[i]->opcode=OP_NOP;
program[i]->data_type=0;
}
}
}
int vm::load(char *filename) // here we load from the file from the disk.
{
// start the error log
start_error_log("error.log",filename);
p_yyin = pack_fopen (filename, F_READ); // set up the file
pack_fseek(p_yyin,0); // these two lines shouldn't be needed, but I found things worked more smoothly
yyrestart(p_yyin);
yyparse (); // call the parser here
pack_fclose (p_yyin); // close the file
end_error_log();
if(!errors) { // if there were no errors create the asmcode from the syntax tree
asmcode *e = genasmcode (tree);
init(e); // set up the asmcode for running.
}
delete tree; // remove the syntax tree from memory
tree=NULL;
if(program) return 1;
return 0;
}
void vm::reset()
{
st->clear(); // clear the stack out
local->clear(); // remove the local variables
pc=0; // set the program counter back to 0
}
void vm::update()
{
int ipc; // amount to increment the program counter by.
double i,j; // two temp values
int num_args,k;
double args[12]; // up to 12 args can be passed to a function
time+=speed;
while(time>0&&pc<length) {
ipc=1; // set ipc to 1
switch(program[pc]->opcode) {
case OP_NOP: break; // do nothing
case OP_PUSH:
if(program[pc]->data_type==1) { // push a constant on the stack
st->push(program[pc]->data.value);
}
else if(program[pc]->data_type==2) { // push a variable on to the stack, but we have to fetch it first
st->push(local->fetch(program[pc]->data.identifier));
}
else st->push(0);
break;
case OP_GETTOP: // read the top of the stack and store it in a variable
local->set(program[pc]->data.identifier,st->get_top());
break;
case OP_DISCARD: // pop off the top value of the stack and discard it.
st->pop();
break;
case OP_JMP: // set ipc to the amount to jump by
ipc=program[pc]->data.target;
break;
case OP_AND:
if(!int(st->get_top())) // if the top of the stack is not true, then jump otherwise pop it off the stack
ipc = program[pc]->data.target;
else st->pop();
break;
case OP_OR:
if(int(st->get_top())) // same as and, but it checks for a value of true
ipc = program[pc]->data.target;
else st->pop();
break;
case OP_JMPT:
if(int(st->pop())) // pop the top value of the stack and jump if it is true
ipc = program[pc].data.target;
break;
case OP_JMPF:
if(!int(st->pop())) // pop the top value of the stack and jump if is false
ipc = program[pc].data.target;
break;
case OP_LESS: // compare two variables and push the result on the stack
j=st->pop();i=st->pop();
st->push(double(i<j));
break;
case OP_GREATER: // compare two variables and push the result on the stack
j=st->pop();i=st->pop();
st->push(double(i>j));
break;
case OP_EQUAL: // compare two variables and push the result on the stack
j=st->pop();i=st->pop();
st->push(double(i==j));
break;
case OP_NOT: // nots the top of the stack
i=st->pop();
st->push(double(!int(i)));
break;
case OP_ADD: // adds the top two values and pushes the result back on the stack
j=st->pop();i=st->pop();
st->push(i+j);
break;
case OP_SUB: // subtracts the top two values and pushes the result back on the stack
j=st->pop();i=st->pop();
st->push(i-j);
break;
case OP_MUL: // multiplies the top two values and pushes the result back on the stack
j=st->pop();i=st->pop();
st->push(i*j);
break;
case OP_DIV: // divides the top two values and pushes the result back on the stack
j=st->pop();i=st->pop();
st->push(i/j);
break;
case OP_POW: // calculates the power of the top two values and pushes the result back on the stack
j=st->pop();i=st->pop();
st->push(pow(i,j));
break;
case OP_NEG: // negates the top value on the stack and pushes the results back on.
i=st->pop();
st->push(-i);
break;
case OP_CALL: // call a function and return the value by pushing it on the stack.
num_args=functions->get_args(program[i]->data.identifier);
for(k=0;k<num_args;k++)
args[k]=st->pop();
st->push(functions->call(program[i]->data.identifier,args));
case JUMPTARGET: // this shouldn't happen...
break;
}
pc+=ipc; // increment the program counter
if(pc>=length&&loop) reset(); // if looping we reset when we get to the end of the program
time-=1; // I should really grab the amount of time the last opcode took..
// but I assume all the opcodes run at about the same speed. (they don't though)
}
}
That covers the virtual machine itself, now we just need three parts: the variable table, the stack, and the function table.
9. Stack?
A simple stack is all that is required.
struct stack_link {
double val;
stack_link *next;
};
class stack
{
private:
stack_link *top;
public:
stack() {top=NULL;}
~stack() {clear();}
void clear()
{
stack_link *temp;
while(top) {
temp=top->next;
delete top;
top=temp;
}
top=NULL;
}
double get_top() {if(!top) return 0; return top->val;}
double pop()
{
if(!top) return 0;
double r=top->val;
stack_link *temp=top->next;
delete top;
top=temp;
return r;
}
void push(double val)
{
stack_link *temp=new stack_link;
temp->val=val;
temp->next=top;
top=temp;
}
};
clear(); clears the stack
push(); puts a value on the top of the stack
pop(); returns the value of the top of the stack and removes it from the stack
get_top(); returns the value of the top of the stack, but leaves it there.
And that is it.
10. Variable Table
A hash table would be a good thing to use here as it would speed up fetch and store operations, but to make things more simple I'll use a linked list instead.
struct var_link
{
char *name;
double val;
var_link *next;
};
class vtable
{
private:
var_link *top;
public:
vtable() {top=NULL;}
~vtable() {clear();}
void clear()
{
var_link *temp;
while(top) {
temp=top->next;
delete top;
top=temp;
}
}
double fetch(char *name)
{
var_link *temp=top;
while(temp) {
if(strcmp(temp->name,name)==0)
return temp->val;
temp=temp->next;
}
return 0;
}
void set(char *name,double value)
{
var_link *temp=top;
while(temp) {
if(strcmp(temp->name,name)==0) {
temp->val=value;
return;
}
temp=temp->next;
}
temp=new var_link;
temp->next=top;
top=temp;
top->val=value;
top->name=name;
}
};
A class with three important functions clear() - empties the class set() - set a variable to a value and fetch() - returns the value of a variable.
As you can see every time a fetch or set is done, it must search the whole linked list, even a binary would be faster, but the class could easily be changed without changing the rest of the engine.
11. Functions?
Here is the final part of the script compiler. We need a flexible method of plugging in external functions. I don't know everything about this topic, and I'm still trying to get this working better.
First: functions is a global class that holds all the function information.
ftable *functions;
Here is a dummy function (remember only doubles are allowed in this scripting language):
double foo(double bar)
{
return bar+1;
}
Now somewhere we need to register the function, before the script is loaded.
functions->register("foo(bar)",function);
In many ways the function table is the same as the variable table, so you can use your variable storage method of choice. However some the functions are more complex. Indeed I have yet to find a tutorial that says how to do this stuff. It wasn't until I was reading in the source code for the script library Seer that I saw a method of function registering that is quite nice. So thanks to Theur - Przemyslaw Podsiadly for releasing his source code.
Anyway on to the class:
struct func_link
{
char *name;
int num_args;
void *function;
};
class ftable
{
private:
func_link *top;
func_link *find_function(char *name);
public:
ftable() {top=NULL;}
~ftable() {clear();}
void clear()
{
func_link *temp;
while(top) {
temp=top->next;
delete top;
top=temp;
}
}
void register(char *name,void *function); // registers a function
double call(char *name,double *args); // calls a registered function
int is_function(char *name); // returns true if the function is found
int get_args(char *name); // returns the number of args to this function
};
func_link *ftable::find_function(char *name)
{
func_link *search=top;
while(!search) {
if(strcmp(name,search->name)==0)
return search;
search=search->next;
}
return NULL;
}
int ftable::is_function(char *name)
{
return (find_function(name)==NULL);
}
int ftable::get_args(char *name)
{
func_link *temp;
temp = find_function(name);
if(!temp) return -1;
return temp->num_args;
}
char *get_function_name(char *name)
{
static char temp[30];
sscanf(name,"%[A-Za-z0-9_]",temp);
return temp;
}
char *get_function_args(char *name) // gets the number of function args return -1 if the format is bad
{
int i, j, tol;
for(i=0;name[i]&&name[i]!='(';i++);
if(!name[i]) return -1;
i++;
for(j=0;name[j+i]&&name[j+i]!=')';j++)
if(temp[j+i]==',') tol++;
if(!name[j+i]) return -1;
if(temp[j+i]==')' && j>0) tol++;
return tol;
}
void ftable::register(char *name,void *function)
{
char *true_name=get_function_name(name);
ftable *temp = find_function(true_name);
if(temp==NULL) {
temp=new func_link;
temp->next=top;
top=temp;
temp->name=new char[strlen(true_name)+1];
strcpy(temp->name,true_name);
}
temp->function=function;
temp->num_args=get_function_args(name);
}
// Here is a helper designed to aid in calling a function
#define d_func(x) ((double (*)x)function)
double call_func(void *func,double *pars,double *args)
{
if(pars==NULL) return NULL;
if(func==NULL) return NULL;
switch(args) {
case 0: return d_func(())();
case 1: return d_func((double))
(pars[0]);
case 2: return d_func((double,double))
(pars[0],pars[1]);
case 3: return d_func((double,double,double))
(pars[0],pars[1],pars[2]);
case 4: return d_func((double,double,double,double))
(pars[0],pars[1],pars[2],pars[3]);
case 5: return d_func((double,double,double,double,double))
(pars[0],pars[1],pars[2],pars[3],pars[4]);
case 6: return d_func((double,double,double,double,double,double))
(pars[0],pars[1],pars[2],pars[3],pars[4],pars[5]);
case 7: return d_func((double,double,double,double,double,double,double))
(pars[0],pars[1],pars[2],pars[3],pars[4],pars[5],pars[6]);
case 8: return d_func((double,double,double,double,double,double,double,double))
(pars[0],pars[1],pars[2],pars[3],pars[4],pars[5],pars[6],pars[7]);
case 9: return d_func((double,double,double,double,double,double,double,double,double))
(pars[0],pars[1],pars[2],pars[3],pars[4],pars[5],pars[6],pars[7],pars[8]);
case 10: return d_func((double,double,double,double,double,double,double,double,double,double))
(pars[0],pars[1],pars[2],pars[3],pars[4],pars[5],pars[6],pars[7],pars[8],pars[9]);
case 11: return d_func((double,double,double,double,double,double,double,double,double,double,double))
(pars[0],pars[1],pars[2],pars[3],pars[4],pars[5],pars[6],pars[7],pars[8],pars[9],pars[10]);
case 12: return d_func((double,double,double,double,double,double,double,double,double,double,double,double))
(pars[0],pars[1],pars[2],pars[3],pars[4],pars[5],pars[6],pars[7],pars[8],pars[9],pars[10],pars[11]);
default: return NULL;
}
}
double ftable::call(char *name,double *args)
{
func_link *temp=find_function(name);
return call_func(temp->function,args,temp->num_args);
}
12. More?
To be honest a script engine with only support for doubles is quite useless.
In Anthem of Dread's scripting engine I used a more c-like parser, had only three variable types, string: a$="hello"; int: a=5; float: a&=3.1415;
Also the lack of loops is quite painful. Here is a while loop and its asm version:
while a==0
...stuff...
wend
beginloop:
push a
push 0
equal
jmpf endloop
...stuff...
jmp beginloop
endloop:
Also I hate many things about basic's syntax. In a real script engine, a c-like syntax is would be so much easier. :) In fact it is easier to program a c-like parser. Just use ';' instead of LINE_END for your line ends, and allow '{' statement_list '}' to be used in place of a single statement. I do like basics variable system though.. its ideal for scripts where you don't want to write too much code. For more complex stuff like functions you should think carefully about the stack..
Well that is it. Just add a main to link the engine to and you have a script compiler demo.
I've put together a small demo with the script engine in action. It is horrible. Its also giftware... so use it, ignore it, or trash it as you see fit. Its my gift to you, just remember you get what you pay for.
13. End?
Yes, this is the end of the document..
Majorly suggested reading:
The flex & bison manual to figure out how to create the langauge that best fits your needs.
Search the internet for other articles, the more you read the better your overall understanding.
Look at the source code for Seer and some of the other scripting languages floating around for ideas, and other cool stuff.
Stuff you could do with your script compiler for fun:
- Write an optimizer to make the asm code faster.
- Add more variable types. (strings and integers are useful)
- Write a proper multitasker or use the operating system to multi thread the script processes.
- Invent a better way of calling, and registering functions...
- Embed your scripts in to your other data files (like monster or map data)
- Global Variables! (that can be accessed between scripts)
- Allow functions declared inside a script.
- Added classes might be fun too.
- Try to take over the world. (using your script compiler of course.)
There is heaps of other cool stuff you can do. Have fun! And don't forget to email me about how horrible this tutorial was. Oh and check out Anthem of Dread on www.dread.nl, it is good.
Thanks to:
Jan Niestadt for his tutorial on www.flipcode.com
Przemyslaw Podsiadly for his Seer scripting library (and for releasing it open source)
All the people in #allegro on EFnet for helping me with ideas and stuff.