Dino is a high-level, dynamic scripting language that
has been designed for simplicity, uniformity, and expressiveness.
Dino is similar to such well known scripting languages as Perl,
TCL, and Python.
As most programmers know the C language, Dino resembles C where possible.
Dino is an extensible, object oriented language that has garbage collection. It supports parallelism description, exception handling, and dynamic loading of libraries written on other languages. Although Dino is a multiplatform language, its main platform is Linux.
This document is a concise introduction to the new Dino scripting language, but is not a programmer's manual.
Most of us do not remember how programmers wrote programs for old computers that had only a few kilobytes of memory. Long ago I read about an Algol 60 compiler that worked on a computer that had only 420 20-bits words. In an old book "Compiler Construction for Digital Computers", Gries describes an Algol compiler working on 1024 42-bits words. How did they achieve this? One of the ways is to use an interpreter for a specialized language; a program in a specialized language is usually smaller. Let's implement an assembler for syntactic parsers. The assembler output will be a syntactic parser interpreter in C. The assembler instructions have the following format:
Here, the constructions in brackets are optional. For convenience we will allow comments that start with ; and finish at the end of the line.[label:] [code [operand]]
The assembler will have the following instructions:
goto label | Transfer control to the instruction marked label. |
gosub label | Transfer control to the instruction marked label and save the next instruction address. |
return | Transfer control to the instruction following the latest executed gosub instruction. |
skipif symbol | If the current token is symbol, the following instruction is skipped. Otherwise, transfer control to the following instruction. |
match symbol | The current token should be symbol, otherwise a syntax error is set. After matching, the next token is read and become the current token. |
next | The next token is read and become the current token. |
The following assembler fragment recognizes Pascal designators.
; ; Designator = Ident { "." Ident | "[" { expr / ","} "]" | "@" } ; start: Designator: match Ident point: skipif Point goto lbrack next ; skip . match Ident goto point lbrack: skipif LBracket goto at next ; skip [ next: gosub expr skipif Comma goto rbrack next ; skip , goto next rbrack: match RBracket goto point at: skipif At return next ; skip @ goto point
These files are described in detail below.
Line 1 describes an abstract node of an IR. A node of such class has the variable lno (which is the source line of the corresponding assembler instruction). The variable is also a class parameter. That means that you should define its value when creating a class instance or object (see line 7). Inside class irn, classes describing each assembler instruction are defined. Each Dino object (and not only objects) stores information about its context. So if you create an object of class next (see line 40 in file input.d) by calling a class that is accessed through an object of class irn, and then you take value of the variable lno through the object of the class next, you actually take the value of the variable of the object of the class irn. This is a more simple and more general mechanism of implementing a single inheritance.
An object of the class ir (lines 9-13) contains information about the entire program:
Lines 15-23 contain a function to output errors. The function accepts a variable number of parameters whose values will be elements of the vector in the implicitly defined variable args. Any function or class can be called with any number of actual parameters. If the number of formal parameters is more than the number of actual parameters, the rest of formal parameters will have the default value nil. If the number of actual parameters is more than the number of formal parameters, the rest of the actual parameters will be ignored unless the last formal parameter is "...".1. class irn (lno) { 2. class goto (lab) {} class skipif (sym) {} 3. class match (sym) {} class gosub (lab) {} 4. class next () {} class ret () {} 5. } 6. 7. var an = irn (0); 8. 9. class ir () { 10. // all ir nodes, label->node index, node index -> vector of labels. 11. var ns = [], l2i = {}, i2l = {}; 12. var ss = {}, mind, maxd; 13. } 14. 15. func err (...) { 16. var i; 17. fput (stderr, argv[0], ": "); 18. for (i = 0; i ? #args; i++) 19. if (args[i] != nil) 20. fput (stderr, args[i]); 21. fputln (stderr); 22. exit (1); 23. } 24. 25. func tab2vect (tab) { 26. var i, vect = [#tab:""]; 27. for (i in tab) 28. vect [tab {i}] = i; 29. return vect; 30. }
The other elements are:
There are many other predefined functions, classes, and variables in Dino. On line 18 you can see the operator #, which returns the number of elements in a vector or an associative table.
Lines 25-30 contain a function that transforms a table into a vector. The table's keys are a sequence of integers that start with 0. The result is a vector whose elements are the table elements placed in the vector according their keys. First we create a vector of the table size and initialize it with empty strings (line 26). Then we fill each element of the vector, iterating by the keys of the table (lines 27-28).
The first line contains an include-clause that specifies a source file without the suffix .d (all Dino source files should have this suffix). The file is given as a string in the clause; this means that the entire file is inserted in place of the clause. As result, we could check the file by calling the Dino interpreter with input.d on a command line. There are several rules that define which directories are searched for the included file. One such directory is the directory of the file that contains the include-clause. Thus, we can place all the assembler files in that one directory and forget about the other rules.
The file is inserted only once in a given block (this is the construction that starts with { and finishes with }). This is important for our program because each file will contain an inclusion of the file ir.d, and eventually all the files will be included into the main program file. Unconditional inclusion in this case would result in many error messages about repeated definitions. By the way, there is also special form of the include-clause that permits unconditional file inclusion.
On lines 6-13 we define some variables. We use regular expressions to assign them strings that describe the correct assembler lines. The regular expressions are extended regular expressions that are described in POSIX 10003.2. To concatenate the strings (vectors), we use the operator @.
Lines 16 to 52, 53 form a try-block that is used to process exceptional situations in the Dino program. The Dino interpreter can generate a lot of predefined exceptions. A Dino programmer can also describe and generate other exceptions. The exceptions are objects of the predefined class except, or they are objects of a class defined inside the class except. Dino has special constructions (extension blocks) to add something into a class, and functions when the class or the function is already defined. In our example, the exception we catch is "reaching the end of a file", which is generated by the predefined function fgetln (reading a new line from a file). If we do not catch the exception, the program finishes with a diagnostic about reaching the end of the file. In the clause catch, we write a class of exceptions that we want to catch. The value of the predefined variable invcalls is the class invcall, in which class eof is defined. In turn, the class invcall is defined inside the class except. If the exception is of a class given in the catch-clause or of a class defined somewhere inside a class given in the catch-clause, a block corresponding to the catch-clause is executed. The variable e is implicitly defined in the block that contains the exception. The exception is propagated further, unless the catch-clause corresponding to the exception is found.
The predefined function fgetln returns the next line from the file. After this, the line is matched with the pattern on line 20. The predefined function match returns the value nil if the input line does not correspond to the pattern, otherwise it returns a vector of integer pairs. The first pair is the first and the last character indexes in the line. The first pair defines the substring that corresponds to the whole pattern. The following pairs of indexes correspond to constructions in parentheses in the pattern. They define substrings that are matched to the constructions in the parentheses. If a construction is not matched (for example, because an alternative is rejected), the indexes have the value -1.
The statement on line 23 extracts a label. The predefined function subv is used to extract the sub-vectors (sub-strings).
On lines 24 and 25, we use an empty vector to initialize a table element that corresponds to the current assembler instruction. On lines 26-31, we process a label, if it is defined on the line. On lines 27-28, we check that the label is not defined repeatedly. On line 29, we define how to map the label name into number of the assembler instruction to which the label is bound. We make that mapping with the aid of associative table pr.l2i. On line 30, we add the label name to the vector that is the element of associative table pr.i2l that has a key equal to the number of the assembler instruction. Predefined function ins (insertion of element into vector) is used with index -1, which means addition of the element at the vector end. Dino has extensible vectors. There are also predefined functions to delete elements in vectors (and associative tables).
On lines 36-49 we check the current assembler instruction and create the corresponding IR node (an object of a class inside the class ir -- see file ir.d). And finally, we insert the node at the end of the vector pr.ns (line 50).1. include "ir"; 2. 3. func get_ir (f) { 4. var ln, lno = 0, code, lab, op, v; 5. // Patterns 6. var p_sp = "[ \t]*"; 7. var p_code = p_sp @ "(goto|skipif|gosub|match|return|next)"; 8. var p_id = "[a-zA-Z][a-zA-Z0-9]*"; 9. var p_lab = p_sp @ "((" @ p_id @ "):)?"; 10. var p_str = "\"[^\"]*\""; 11. var p_op = p_sp @ "(" @ p_id @ "|" @ p_str @ ")?"; 12. var p_comment = p_sp @ "(;.*)?"; 13. var pattern = "^" @ p_lab @ "(" @ p_code @ p_op @ ")?" @ p_comment @ "$"; 14. 15. var pr = ir (); 16. try { 17. for (;;) { 18. ln = fgetln (f); 19. lno++; 20. v = match (pattern, ln); 21. if (v == nil) 22. err ("syntax error on line ", lno); 23. lab = (v[4] >= 0 ? subv (ln, v[4], v[5] - v[4]) : nil); 24. if (!(#pr.ns in pr.i2l)) 25. pr.i2l {#pr.ns} = []; 26. if (lab != nil) { 27. if (lab in pr.l2i) 28. err ("redefinition lab ", lab, " on line ", lno); 29. pr.l2i {lab} = #pr.ns; 30. ins (pr.i2l {#pr.ns}, lab, -1); 31. } 32. code = (v[8] >= 0 ? subv (ln, v[8], v[9] - v[8]) : nil); 33. if (code == nil) 34. continue; // skip comment or absent code 35. op = (v[10] >= 0 ? subv (ln, v[10], v[11] - v[10]) : nil); 36. var node = irn (lno); 37. if (code == "goto" || code == "gosub") { 38. if (op == nil || match (p_id, op) == nil) 39. err ("invalid or absent lab `", op, "' on line ", lno); 40. node = (code == "goto" ? node.goto (op) : node.gosub (op)); 41. } else if (code == "skipif" || code == "match") { 42. if (op == nil || match (p_id, op) == nil) 43. err ("invalid or absent name `", op, "' on line ", lno); 44. node = (code == "skipif" ? node.skipif (op) : node.match (op)); 45. } else if (code == "return" || code == "next") { 46. if (op != nil) 47. err (" non empty operand `", op, "' on line ", lno); 48. node = (code == "next" ? node.next (op) : node.ret ()); 49. } 50. ins (pr.ns, node, -1); 51. } 52. } catch (invcalls.eof) { 53. } 54. return pr; 55. }
1. include "ir"; 2. 3. func check (pr) { 4. var i; 5. for (i = 0; i ? #pr.ns; i++) { 6. if (inside (pr.ns[i], an.goto) || inside (pr.ns[i], an.gosub)) { 7. if (!(pr.ns[i].lab in pr.l2i)) 8. err ("undefined label `", pr.ns[i].lab, "' on line ", 9. pr.ns[i].lno); 10. if (pr.maxd == nil || pr.maxd ? pr.l2i {pr.ns[i].lab} - i) 11. pr.maxd = pr.l2i {pr.ns[i].lab} - i; 12. if (pr.mind == nil || pr.mind > pr.l2i {pr.ns[i].lab} - i) 13. pr.mind = pr.l2i {pr.ns[i].lab} - i; 14. } else if (inside (pr.ns[i], an.match) 15. || inside (pr.ns[i], an.skipif)) { 16. if (!(pr.ns[i].sym in pr.ss)) 17. pr.ss {pr.ns[i].sym} = #pr.ss; 18. } 19. } 20. }
The generated interpreter requires the external functions yylex and yyerror (line 34). The function yylex is used by the interpreter to read and to get the code of the current token. Function yyerror should output error diagnostics. (The interface is a simplified version of the Yacc Unix Utility interface.)
The compiled assembler program is presented by a C array of chars or short integers with the name program. Each element of the array is an encoded instruction of the source program. On lines 11-15, we evaluate the start code for each kind of assembler instruction and define the type of array elements. On lines 16-33, we output the array program. On lines 37-61, we output the function yyparse. Finally, on lines 62-63 we close the two output files with the aid of the predefined function close.
1. include "ir"; 2. 3. func gen (pr, bname) { 4. var h = open (bname @ ".h", "w"), c = open (bname @ ".c", "w"); 5. var i, vect; 6. vect = tab2vect (pr.ss); 7. for (i = 0; i ? #vect; i++) 8. fputln (h, "#define ", vect[i], "\t", i + 1); 9. fputln (h); 10. fputln (c, "#include \"", bname, ".h\"\n\n"); 11. var match_start = 3, skipif_start = match_start + #pr.ss, 12. goto_start = skipif_start + #pr.ss, 13. gosub_start = goto_start + (pr.maxd - pr.mind) + 1, 14. max_code = gosub_start + (pr.maxd - pr.mind); 15. var t = (max_code ? 256 ? "unsigned char" : "unsigned short"); 16. fputln (c, "\nstatic ", t, " program [] = {"); 17. for (i = 0; i ? #pr.ns; i++) { 18. if (inside (pr.ns[i], an.goto)) 19. fput (c, " ", goto_start + pr.l2i{pr.ns[i].lab} - i - pr.mind, ","); 20. else if (inside (pr.ns[i], an.match)) 21. fput (c, " ", match_start + pr.ss{pr.ns[i].sym}, ","); 22. else if (inside (pr.ns[i], an.next)) 23. fput (c, " 1,"); 24. else if (inside (pr.ns[i], an.ret)) 25. fput (c, " 2,"); 26. else if (inside (pr.ns[i], an.skipif)) 27. fput (c, " ", skipif_start + pr.ss{pr.ns[i].sym}, ","); 28. else if (inside (pr.ns[i], an.gosub)) 29. fput (c, " ", gosub_start + pr.l2i{pr.ns[i].lab} - i - pr.mind, ","); 30. if ((i + 1) % 20 == 0) 31. fputln (c); 32. } 33. fputln (c, " 0, 0\n};\n\n"); 34. fputln (h, "extern int yylex ();\nextern int yyerror ();\n"); 35. fputln (h, "\nextern int yyparse ();\n"); 36. fputln (h, "#ifndef YYSTACK_SIZE\n#define YYSTACK_SIZE 50\n#endif"); 37. fputln (c, "\nint yyparse () {\n int yychar=yylex (), pc=0, code;\n ", 38. t, " call_stack [YYSTACK_SIZE];\n ", t, " *free=call_stack;"); 39. fputln (c, "\n *free++=sizeof (program) / sizeof (program [0]) - 1;"); 40. fputln (c, " while ((code=program [pc]) != 0 ?? yychar > 0) {"); 41. fputln (c, " pc++;\n if (code == 1)\n yychar=yylex ();"); 42. fputln (c, " else if (code == 2) /*return*/\n pc=*--free;"); 43. fputln (c, " else if ((code -= 2) ? ", #pr.ss, ") {/*match*/"); 44. fputln (c, " if (yychar == code)\n pc++;\n else {"); 45. fputln (c, " yyerror (\"Syntax error\");"); 46. fputln (c, " return 1;\n }"); 47. fputln (c, " } else if ((code -= ", #pr.ss, ") ? ", #pr.ss, ") {"); 48. fputln (c, " if (yychar == code)\n pc++; /*skipif*/"); 49. fputln (c, " } else if ((code -= ", #pr.ss, ") ?= ", pr.maxd-pr.mind, 50. ") /*goto*/\n pc += code + ", pr.mind, ";"); 51. fputln (c, " else if ((code -= ", pr.maxd - pr.mind + 1, ") ?= ", 52. pr.maxd - pr.mind, ") { /*gosub*/"); 53. fputln (c, " if (free >= call_stack + YYSTACK_SIZE) {"); 54. fputln (c, " yyerror (\"Call stack overflow\");"); 55. fputln (c, " return 1;\n }\n pc += code + ", pr.mind, 56. ";\n *free++=pc;\n } else {"); 57. fputln (c, " yyerror(\"Internal error\");\n return 1;\n }"); 58. fputln (c, " }\n if (code != 0 || yychar > 0) {"); 59. fputln (c, " if (code != 0)\n yyerror (\"Unexpected EOF\");"); 60. fputln (c, " else\n yyerror(\"Garbage after end of program\");"); 61. fputln (c, " return 1;\n }\n return 0;\n}"); 62. close (h); 63. close (c); 64. }
1. include "ir"; 2. include "input"; 3. include "check"; 4. include "gen"; 5. 6. if (#argv != 1) 7. err ("Usage: sas file"); 8. 9. var pr = get_ir (open (argv[0], "r")); 10. check (pr); 11. gen (pr, sub ("^(.*/)?([^.]*)(\\..*)?$", argv[0], "\\2"));
For comparison, we would have about 15Kb for a YACC generated parser. Not bad. But we could make the parser even less than 1Kb by using short and long goto and gosub instructions. Actually, what we generate is not a parser, it is only a recognizer. But the assembler language could be easily extended to write parsers. Just add the instructions:gcc -c -Os oberon2.c; size oberon2.o text data bss dec hex filename 382 934 0 1316 524 oberon2.o
to call semantic functions for the generation of parsed code. In any case, most of a compiler's code would be in C. To further decrease the compiler size (not only its parser), an interpreter of C that is specialized to the compiler could be written.call C-function-name
Of course, it is not easy work to write a parser on the assembler. So we could generate assembler code from a high-level syntax description, for example, from a Backus-Naur form. Another area for improvement is the implementation of error recovery, but this was not our purpose. Our goal was just to demonstrate the Dino language.
http://www.linuxstart.com/~vladimir_makarov/dinoload.html
http://www.freespeech.org/vmakarov/dinoload.html
Special thanks to Michael Behm (a member of the Cygnus documentation group) for his help in editing this article.