This is our implementation of a compiler for Project D, a dynamically typed, interpreted programming language. The compiler is written in Rust and follows the traditional compiler pipeline: lexing, parsing, semantic analysis, and interpretation.
The compiler reads source code written in the Project D language and executes it directly by building an abstract syntax tree (AST) and walking through it. We've implemented four main phases:
- Lexical Analysis - Breaks source code into tokens
- Parsing - Constructs an abstract syntax tree from tokens
- Semantic Analysis - Checks for errors and optimizes the AST
- Interpretation - Executes the program by traversing the AST
The language supports variables, functions (with closures), arrays, tuples, conditionals, loops, and various operators. It's dynamically typed and uses lexical scoping.
```
dlang/
├── src/
│   ├── lexer.rs        - Tokenizes the input source code
│   ├── parser.rs       - Builds the AST from tokens
│   ├── analyzer.rs     - Semantic checks and optimizations
│   ├── interpreter.rs  - Executes the AST
│   ├── ast.rs          - Defines the AST structure
│   ├── token.rs        - Defines all token types
│   └── main.rs         - Main entry point
├── test_programs/      - Sample programs for testing
└── tests/              - Unit tests
```
You'll need Rust installed. Build the project with:
```
cd dlang
cargo build
```

To run a program from a file:

```
cargo run test_programs/good_program.txt
```

Or run without arguments to see a bunch of demo programs:

```
cargo run
```

The output shows each phase of compilation:
- The original AST
- Semantic analysis results
- Optimized AST (if changes were made)
- Program execution output
To run all unit tests with detailed output, use:

```
cargo test -- --nocapture --test-threads=1
```

This command shows the full output of each test, making it easier to debug and understand test failures. The `--nocapture` flag ensures that all print statements and logs are displayed, and `--test-threads=1` runs tests sequentially for clearer output.
Variables are declared with `var` and assigned with `:=`:

```
var x := 10
var y := 20
print x + y
```

The language supports integers, reals, booleans, strings, and `none` (similar to null).
If/else statements:

```
if age >= 18 then
    print "Adult"
else
    print "Minor"
end
```

While loops:
```
var i := 1
while i <= 5 loop
    print i
    i := i + 1
end
```

For loops can iterate over arrays or ranges:
```
for i in 1..5 loop
    print i
end

var arr := [10, 20, 30]
for num in arr loop
    print num
end
```

Functions are first-class values. They can be defined with an arrow (`=>`) for single expressions:
```
var add := func(x, y) => x + y
print add(5, 3)
```

Or with a block (`is ... end`) for multiple statements:
```
var factorial := func(n) is
    if n <= 1 then
        return 1
    else
        return n * factorial(n - 1)
    end
end
```

Functions capture their lexical environment (closures), so nested functions work as expected.
Arrays are 1-indexed:

```
var arr := [1, 2, 3, 4, 5]
print arr[1]   // prints 1
arr[3] := 99   // modifies the third element
```

Tuples store named or indexed fields:
```
var point := {x := 10, y := 20}
print point.x
print point.y

var t := {a := 1, 2, c := 3}
print t.a  // by name
print t.2  // by index
print t.c  // by name
```

The `is` operator checks types at runtime:
```
var x := 42
if x is int then
    print "x is an integer"
end
```

Other statements:

- `return` - Returns from a function (only allowed inside functions)
- `exit` - Exits from a loop (only allowed inside loops)
- `print` - Prints one or more values to stdout
The lexer (lexer.rs) reads through the source code character by character and produces tokens. It handles:
- Keywords (`var`, `if`, `while`, `for`, `func`, etc.)
- Identifiers and literals (numbers, strings, booleans)
- Operators and punctuation
- Comments (both `//` single-line and `/* */` multi-line)
- Whitespace and newlines
The lexer tracks line and column numbers for error reporting. When it encounters an unexpected character, it produces an error token with location information.
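The position-tracking part of this can be sketched as follows. This is a minimal, self-contained illustration of line/column bookkeeping during scanning; the `Scanner` and `Tok` names are hypothetical and not the actual `lexer.rs` API:

```rust
// Sketch of position-tracked scanning (hypothetical names): the
// scanner consumes one char at a time and bumps `line`/`column`,
// so an error token can carry the location of the bad character.
#[derive(Debug, PartialEq)]
pub enum Tok {
    Int(i64),
    Ident(String),
    Error { ch: char, line: usize, column: usize },
}

pub struct Scanner {
    chars: Vec<char>,
    pos: usize,
    line: usize,
    column: usize,
}

impl Scanner {
    pub fn new(src: &str) -> Self {
        Scanner { chars: src.chars().collect(), pos: 0, line: 1, column: 1 }
    }

    // Consume one character, updating line/column counters.
    fn bump(&mut self) -> Option<char> {
        let c = self.chars.get(self.pos).copied()?;
        self.pos += 1;
        if c == '\n' {
            self.line += 1;
            self.column = 1;
        } else {
            self.column += 1;
        }
        Some(c)
    }

    pub fn next_token(&mut self) -> Option<Tok> {
        loop {
            // Capture the location *before* consuming, so error
            // tokens point at the offending character itself.
            let (line, column) = (self.line, self.column);
            let c = self.bump()?;
            if c.is_whitespace() {
                continue;
            }
            return Some(if c.is_ascii_digit() {
                let mut n = c.to_digit(10).unwrap() as i64;
                while let Some(d) = self.chars.get(self.pos).and_then(|c| c.to_digit(10)) {
                    self.pos += 1;
                    self.column += 1;
                    n = n * 10 + d as i64;
                }
                Tok::Int(n)
            } else if c.is_alphabetic() {
                let mut s = c.to_string();
                while let Some(&c2) = self.chars.get(self.pos) {
                    if !c2.is_alphanumeric() { break; }
                    self.pos += 1;
                    self.column += 1;
                    s.push(c2);
                }
                Tok::Ident(s)
            } else {
                Tok::Error { ch: c, line, column }
            });
        }
    }
}
```

Scanning `"var x\n@"` with this sketch yields two identifiers followed by an error token located at line 2, column 1.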
The parser (parser.rs) uses a recursive descent approach. It builds an AST by consuming tokens according to the language's grammar rules. The parser handles operator precedence and associativity correctly, so expressions like 2 + 3 * 4 are parsed as 2 + (3 * 4).
The parser is pretty forgiving - it skips semicolons, newlines, and comments automatically. This makes the language easier to write without worrying about exact formatting.
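One common way a recursive descent parser gets precedence right is precedence climbing. The sketch below is an illustration of the technique, not code from `parser.rs`, and it evaluates operands directly instead of building an AST to stay short:

```rust
// Precedence climbing sketch: each operator has a binding power,
// and the parser only consumes operators that bind at least as
// tightly as the current minimum precedence.
#[derive(Clone, Copy)]
pub enum Tok { Num(i64), Plus, Star }

fn prec(t: Tok) -> Option<u8> {
    match t {
        Tok::Plus => Some(1),
        Tok::Star => Some(2),
        _ => None,
    }
}

pub fn parse_expr(toks: &[Tok], pos: &mut usize, min_prec: u8) -> i64 {
    let mut lhs = match toks[*pos] {
        Tok::Num(n) => { *pos += 1; n }
        _ => panic!("expected number"),
    };
    while *pos < toks.len() {
        let op = toks[*pos];
        let Some(p) = prec(op) else { break };
        if p < min_prec { break; }
        *pos += 1;
        // Parse the right operand at a strictly higher minimum
        // precedence, which makes these operators left-associative.
        let rhs = parse_expr(toks, pos, p + 1);
        lhs = match op {
            Tok::Plus => lhs + rhs,
            Tok::Star => lhs * rhs,
            Tok::Num(_) => unreachable!(),
        };
    }
    lhs
}
```

With this scheme, `2 + 3 * 4` evaluates to `14` because `*` (precedence 2) binds tighter than `+` (precedence 1).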
The semantic analyzer (analyzer.rs) has two jobs: checking for errors and optimizing the code.
Before optimization, we verify:

- Declarations before usage - Variables and functions must be declared before they're used. We track declarations in a symbol table as we walk the AST.
- Return statement placement - `return` statements can only appear inside function bodies, not in the global scope.
- Array bounds checking - For array literals with constant indexes, we check at compile time if the index is out of bounds.
If any semantic errors are found, we report them and skip optimization.
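The declaration-before-use check can be sketched like this (hypothetical `Stmt` type, simplified to a single flat scope; the real analyzer walks the full AST):

```rust
// Sketch of declaration-before-use checking: walk statements in
// order, recording declarations and flagging any identifier that
// is used before it was declared.
use std::collections::HashSet;

pub enum Stmt {
    Declare(String), // var x := ...
    Use(String),     // any expression mentioning x
}

pub fn check_decl_before_use(stmts: &[Stmt]) -> Vec<String> {
    let mut declared: HashSet<&str> = HashSet::new();
    let mut errors = Vec::new();
    for stmt in stmts {
        match stmt {
            Stmt::Declare(name) => {
                declared.insert(name.as_str());
            }
            Stmt::Use(name) => {
                if !declared.contains(name.as_str()) {
                    errors.push(format!("Variable '{}' used before declaration", name));
                }
            }
        }
    }
    errors
}
```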
We run several optimization passes that modify the AST:

- Constant folding - Evaluates constant expressions at compile time. For example, `5 + 3` becomes `8`. Works for arithmetic, comparisons, and boolean operations.
- Conditional simplification - Removes branches that can never execute. If we see `if true then ... else ... end`, we replace it with just the `then` branch.
- Unreachable code removal - After a `return` or `exit` statement, any code in the same block is removed since it can never execute.
- Unused variable removal - Variables that are declared but never used are removed entirely.
The optimizer runs these passes in a loop until no more changes occur. This means constant folding can enable other optimizations - for example, if constant folding makes a condition evaluate to true, then conditional simplification can remove the else branch.
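The run-until-fixpoint pattern can be sketched on a toy expression type (hypothetical names, not the real `ast.rs`):

```rust
// Fold-until-fixpoint sketch: each pass returns whether it changed
// anything, and the driver keeps sweeping until no pass reports
// progress.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Int(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// One constant-folding pass; returns (new expr, changed?).
pub fn fold(e: Expr) -> (Expr, bool) {
    match e {
        Expr::Add(a, b) => fold_bin(*a, *b, |x, y| x + y, Expr::Add),
        Expr::Mul(a, b) => fold_bin(*a, *b, |x, y| x * y, Expr::Mul),
        leaf => (leaf, false),
    }
}

fn fold_bin(
    a: Expr,
    b: Expr,
    op: fn(i64, i64) -> i64,
    rebuild: fn(Box<Expr>, Box<Expr>) -> Expr,
) -> (Expr, bool) {
    let (a, ca) = fold(a);
    let (b, cb) = fold(b);
    // Both operands constant: replace the node with its value.
    if let (Expr::Int(x), Expr::Int(y)) = (&a, &b) {
        (Expr::Int(op(*x, *y)), true)
    } else {
        (rebuild(Box::new(a), Box::new(b)), ca || cb)
    }
}

/// Run passes until a full sweep reports no changes.
pub fn optimize(mut e: Expr) -> Expr {
    loop {
        let (next, changed) = fold(e);
        e = next;
        if !changed {
            return e;
        }
    }
}
```

In the real optimizer the loop body would run every pass (folding, conditional simplification, dead-code removal, unused-variable removal), which is what lets one pass enable another.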
The interpreter (interpreter.rs) walks the AST and executes statements as it goes. It maintains:
- An environment (symbol table) for variable bindings, with proper scoping
- A stack to track whether we're inside a function or loop (for validating `return`/`exit`)
- Function closures that capture the environment at definition time
Values are represented as a `Value` enum that can hold integers, reals, booleans, strings, arrays, tuples, functions, or `none`. Since the language is dynamically typed, type checking happens at runtime.
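A plausible shape for such a representation, together with runtime-checked operations, looks like this (the actual `Value` enum in `interpreter.rs` may differ; function values are omitted to keep the sketch short):

```rust
// Sketch of a dynamically typed Value: every runtime datum carries
// its type, and operators check types when they execute.
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Real(f64),
    Bool(bool),
    Str(String),
    Array(Vec<Value>),
    Tuple(HashMap<String, Value>),
    None,
}

/// `+` with runtime type checking: matching numeric types add,
/// anything else is a runtime error.
pub fn add(a: &Value, b: &Value) -> Result<Value, String> {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => Ok(Value::Int(x + y)),
        (Value::Real(x), Value::Real(y)) => Ok(Value::Real(x + y)),
        _ => Err(format!("cannot add {:?} and {:?}", a, b)),
    }
}

/// 1-indexed array access, as the language rules require.
pub fn index(arr: &Value, i: i64) -> Result<Value, String> {
    match arr {
        Value::Array(items) => {
            if i < 1 || i as usize > items.len() {
                Err(format!("Array index {} out of bounds (array size: {})", i, items.len()))
            } else {
                Ok(items[(i - 1) as usize].clone())
            }
        }
        _ => Err("not an array".to_string()),
    }
}
```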
The interpreter handles all the language rules:
- Arrays are 1-indexed (valid range is 1 to length)
- Division by zero raises an error
- Array index out of bounds raises an error
- Undefined variables raise an error
- Function argument counts must match
Scoping follows lexical rules - inner scopes can shadow outer variables, and functions capture the environment they were defined in.
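The scope chain can be modeled as environments linked to their parents. The sketch below is simplified to integer values and borrowed parents, which covers block scoping and shadowing; a real interpreter storing closures would typically use `Rc<RefCell<Env>>` so environments can outlive the block that created them:

```rust
// Sketch of lexically scoped environments as a parent chain:
// lookups walk outward, and a child scope can shadow a name
// without disturbing the outer binding.
use std::collections::HashMap;

pub struct Env<'a> {
    vars: HashMap<String, i64>,
    parent: Option<&'a Env<'a>>,
}

impl<'a> Env<'a> {
    pub fn new(parent: Option<&'a Env<'a>>) -> Self {
        Env { vars: HashMap::new(), parent }
    }

    pub fn define(&mut self, name: &str, v: i64) {
        self.vars.insert(name.to_string(), v);
    }

    /// Walk the scope chain from innermost to outermost.
    pub fn lookup(&self, name: &str) -> Option<i64> {
        self.vars
            .get(name)
            .copied()
            .or_else(|| self.parent.and_then(|p| p.lookup(name)))
    }
}
```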
The compiler provides clear error messages at different stages:
- Lexical errors - "Unexpected character '@' at line 5, column 3"
- Parse errors - "Expected identifier, got '42' at line 10, column 5"
- Semantic errors - "Variable 'x' used before declaration"
- Runtime errors - "Division by zero" or "Array index 10 out of bounds (array size: 3)"
When an error occurs, we print a message and stop execution at that phase (semantic errors prevent optimization, runtime errors stop execution).
The `test_programs/` directory contains several examples:

- `good_program.txt` - Basic functionality with constant folding
- `semantic_error.txt` - Demonstrates semantic error detection
- `unreachable_code.txt` - Shows unreachable code removal
- `unused_var.txt` - Tests unused variable removal
- `conditional_simplify.txt` - Demonstrates conditional simplification
Building this compiler helped us understand:
- How lexers break text into tokens
- How parsers build trees from sequences of tokens
- How semantic analysis catches errors that parsing can't
- How optimizations can simplify code before execution
- How interpreters execute programs step by step
- The importance of proper scoping and closures
The code is structured to be maintainable - each phase is separated into its own module, and the AST serves as a clean interface between phases. Error handling is consistent throughout, making debugging easier.