A compiler is a program , Role is Translate one language into another .
for example babel It's a compiler , It will es6 Version of js Translate into es5 Version of js. From this point of view , Software translated into English will also be translated into English .
General procedure ,CPU It can't be executed directly , because CPU Only machine instructions can be recognized . So if you want to execute a program , The first step is to translate the program written in high-level language into assembly code (Java One more step , Translate high-level language into bytecode ), Then translate the assembly code into machine instructions , such CPU To identify and execute .
Because assembly language and machine language correspond one to one , And assembly language is more readable . Therefore, the textbook of computer principle will use assembly language instead of machine language to explain machine instructions .
This paper will write four operations compiler need to be 1 + 1
These four expressions are translated into machine instructions and executed . Please see the example for the specific process :
// CPU Can't recognize
10 + 5
// Translated into assembly language
push 10
push 5
add
// Finally translated into machine instructions , Assembly code corresponds to machine instructions one by one
// Machine instructions are given by 1 and 0 form , The following instructions are not real instructions , Just for demonstration
0011101001010101
1101010011100101
0010100111100001
Four compiler operations , Although the function is very simple , Only four Operational expressions can be compiled . But the front-end part of the principle of compilation almost involves : Lexical analysis 、 Syntax analysis . In addition, there is the code generation of the back-end part of the compilation principle . Whether it's simple 、 Complex compilers , The compilation steps are similar , It's just that complicated compiler implementations are more difficult .
Someone might ask , What's the advantage of learning to compile ?
I think that mastering the internal principles of the compilation process will make you a better advanced programmer . In addition, I quote Zhihu net friend - What you want Answer , More specific :
- It is easier to understand which writing methods are equivalent in a language , What are the differences
- We can compare the differences between different languages more objectively
- It is not easy to be fooled by the propagandist of a particular language
- Learning a new language is more efficient
- In fact, from the language a Switch to language b It's a general requirement , Learning the principles of compilation will make you more comfortable when dealing with such requirements
Okay , Let's take a look at how to write a quad arithmetic compiler .
Lexical analysis
A program is actually a series of characters stored in a text file , The function of lexical analysis is to decompose a series of characters into characters according to some rules (token, Also called terminator ), Ignore spaces and comments .
Example :
// Program code
10 + 5 + 6
// The result of lexical analysis is token
10
+
5
+
6
Terminator
Terminator is the basic element used in language , It can't be broken down anymore .
The terminators in the four operations include symbols and integer constants ( Unary operators and floating point operations are not supported ).
- Symbol :
+ - * / ( )
- Integer constant :12、1000、111...
Lexical analysis code implementation
function lexicalAnalysis(expression) {
const symbol = ['(', ')', '+', '-', '*', '/']
const re = /\d/
const tokens = []
const chars = expression.trim().split('')
let token = ''
chars.forEach(c => {
if (re.test(c)) {
token += c
} else if (c == ' ' && token) {
tokens.push(token)
token = ''
} else if (symbol.includes(c)) {
if (token) {
tokens.push(token)
token = ''
}
tokens.push(c)
}
})
if (token) {
tokens.push(token)
}
return tokens
}
console.log(lexicalAnalysis('100 + 23 + 34 * 10 / 2'))
// ["100", "+", "23", "+", "34", "*", "10", "/", "2"]
The grammatical rules of four operations ( Grammar rules are hierarchical )
x*
, Express x Zero or more timesx | y
, Express x or y Will appear( )
parentheses , Used to group words in language
The following rules look from left to right , The expression on the left can be further subdivided into the expression on the right , It's subdivided until it can't be subdivided again .
- expression: addExpression
- addExpression: mulExpression (op mulExpression)*
- mulExpression: term (op term)*
- term: '(' expression ')' | integerConstant
- op:
+ - * /
addExpression
Corresponding +
-
expression ,mulExpression
Corresponding *
/
expression .
If you don't understand the above rules , Then put it down first , Keep looking down . See how to implement syntax analysis in code .
Syntax analysis
A process in which input text is analyzed according to grammatical rules and its grammatical structure is determined , It's called parsing .
The output of general parsing is abstract syntax tree (AST) Or parse tree (parse tree). But because the four operations are relatively simple , So the solution here is to do code generation and error reporting in real time , In this way, you don't need to save the whole program structure in memory .
Let's take a look at how to analyze a four Operational expression 1 + 2 * 3
.
The first match is expression
, Due to the present expression
There's only one way to go down , namely addExpression
, So it's broken down into addExpression
.
By analogy , The next order is mulExpression
、term
、1
(integerConstant)、+
(op)、mulExpression
、term
、2
(integerConstant)、*
(op)、mulExpression
、term
、3
(integerConstant).
As shown in the figure below :
There may be questions here , Why is an expression so complicated ,expression
There is addExpression
,addExpression
And then there is mulExpression
.
In fact, this is to consider the priority of operators ,mulExpr
Than addExpr
Expression operation level should be high .
1 + 2 * 3
compileExpression
| compileAddExpr
| | compileMultExpr
| | | compileTerm
| | | |_ matches integerConstant push 1
| | |_
| | matches '+'
| | compileMultExpr
| | | compileTerm
| | | |_ matches integerConstant push 2
| | | matches '*'
| | | compileTerm
| | | |_ matches integerConstant push 3
| | |_ compileOp('*') *
| |_ compileOp('+') +
|_
There are many algorithms for building parsing trees , There are only two algorithms .
Recursive descent analysis
Recursive descent analysis , Also known as top-down analysis . Analyze step by step recursively according to the rules of grammar token flow , If you encounter a non Terminator , Then go on to analyze , Until the Terminator .
LL(0) analysis
Recursive descent analysis is a simple and efficient algorithm ,LL(0) On this basis, there is one more step , When the first one token When it's not enough to determine the element type , Take... For the next character “ Check in advance ”, It's possible to solve this uncertainty .
The above is a brief introduction to these two algorithms , See the following code implementation for specific implementation .
Expression code generation
The four Operational expressions we usually use are infix expressions , But for computers, infix expressions are not easy to calculate . So in the code generation phase , To convert infix expression to suffix expression .
Postfix expression
Postfix expression , Also known as reverse Polish , Does not contain brackets , The operator is placed after two operands , All calculations are in the order in which the operators appear , Strictly from left to right ( The precedence rules for operators are no longer considered ).
Example :
Infix expression : 5 + 5
Convert to suffix expression :5 5 +
, Then generate code according to the suffix expression .
// 5 + 5 Convert to 5 5 + Generate code again
push 5
push 5
add
Code implementation
The theoretical knowledge of compiler principles is like the book of heaven , It's often seen in the clouds , But really do it , You'll find that , Actually, it's quite simple .
If the above theoretical knowledge is not very clear , No problem , First look at the code implementation , And then combine it with theoretical knowledge to see .
Be careful : Here we need to introduce the lexical analysis code .
// Assembly code generator
function AssemblyWriter() {
this.output = ''
}
AssemblyWriter.prototype = {
writePush(digit) {
this.output += `push ${digit}\r\n`
},
writeOP(op) {
this.output += op + '\r\n'
},
// Output assembly code
outputStr() {
return this.output
}
}
// parsers
function Parser(tokens, writer) {
this.writer = writer
this.tokens = tokens
// tokens Array index
this.i = -1
this.opMap1 = {
'+': 'add',
'-': 'sub',
}
this.opMap2 = {
'/': 'div',
'*': 'mul'
}
this.init()
}
Parser.prototype = {
init() {
this.compileExpression()
},
compileExpression() {
this.compileAddExpr()
},
compileAddExpr() {
this.compileMultExpr()
while (true) {
this.getNextToken()
if (this.opMap1[this.token]) {
let op = this.opMap1[this.token]
this.compileMultExpr()
this.writer.writeOP(op)
} else {
// There is no corresponding operator on the match There's no match here + -
// take token Index back one bit
this.i--
break
}
}
},
compileMultExpr() {
this.compileTerm()
while (true) {
this.getNextToken()
if (this.opMap2[this.token]) {
let op = this.opMap2[this.token]
this.compileTerm()
this.writer.writeOP(op)
} else {
// There is no corresponding operator on the match There's no match here * /
// take token Index back one bit
this.i--
break
}
}
},
compileTerm() {
this.getNextToken()
if (this.token == '(') {
this.compileExpression()
this.getNextToken()
if (this.token != ')') {
throw ' Missing right bracket :)'
}
} else if (/^\d+$/.test(this.token)) {
this.writer.writePush(this.token)
} else {
throw ' FALSE token: The first ' + (this.i + 1) + ' individual token (' + this.token + ')'
}
},
getNextToken() {
this.token = this.tokens[++this.i]
},
getInstructions() {
return this.writer.outputStr()
}
}
const tokens = lexicalAnalysis('100+10*10')
const writer = new AssemblyWriter()
const parser = new Parser(tokens, writer)
const instructions = parser.getInstructions()
console.log(instructions) // Output generated assembly code
/*
push 100
push 10
push 10
mul
add
*/
Simulation execution
Now let's simulate CPU The execution of machine instructions , Because assembly code corresponds to machine instructions one by one , So we can create a simulator that directly executes assembly code .
Before creating the emulator , Let's first explain the operation of the relevant instructions .
Stack
In memory , The characteristic of stack is that it can only insert and delete at the same end , That is, only push and pop Two kinds of operations .
push
push The purpose of the instruction is to push an operand onto the stack .
pop
pop The function of the instruction is to pop an operand out of the stack .
add
add The function of an instruction is to execute it twice pop operation , Pop up two operands a and b, And then execute a + b, Then the result push Go to the stack .
sub
sub The function of an instruction is to execute it twice pop operation , Pop up two operands a and b, And then execute a - b, Then the result push Go to the stack .
mul
mul The function of an instruction is to execute it twice pop operation , Pop up two operands a and b, And then execute a * b, Then the result push Go to the stack .
div
sub The function of an instruction is to execute it twice pop operation , Pop up two operands a and b, And then execute a / b, Then the result push Go to the stack .
All the instructions of the four operations have been explained , Do you think it's very simple ?
Code implementation
Be careful : Need to introduce lexical analysis and syntax analysis code
function CpuEmulator(instructions) {
this.ins = instructions.split('\r\n')
this.memory = []
this.re = /^(push)\s\w+/
this.execute()
}
CpuEmulator.prototype = {
execute() {
this.ins.forEach(i => {
switch (i) {
case 'add':
this.add()
break
case 'sub':
this.sub()
break
case 'mul':
this.mul()
break
case 'div':
this.div()
break
default:
if (this.re.test(i)) {
this.push(i.split(' ')[1])
}
}
})
},
add() {
const b = this.pop()
const a = this.pop()
this.memory.push(a + b)
},
sub() {
const b = this.pop()
const a = this.pop()
this.memory.push(a - b)
},
mul() {
const b = this.pop()
const a = this.pop()
this.memory.push(a * b)
},
div() {
const b = this.pop()
const a = this.pop()
// Floating point operations are not supported , So round it up here
this.memory.push(Math.floor(a / b))
},
push(x) {
this.memory.push(parseInt(x))
},
pop() {
return this.memory.pop()
},
getResult() {
return this.memory[0]
}
}
const tokens = lexicalAnalysis('(100+ 10)* 10-100/ 10 +8* (4+2)')
const writer = new AssemblyWriter()
const parser = new Parser(tokens, writer)
const instructions = parser.getInstructions()
const emulator = new CpuEmulator(instructions)
console.log(emulator.getResult()) // 1138
A simple four arithmetic compiler has been implemented . Let's write a test function and run it , Let's see if the results are what we expected :
function assert(expression, result) {
const tokens = lexicalAnalysis(expression)
const writer = new AssemblyWriter()
const parser = new Parser(tokens, writer)
const instructions = parser.getInstructions()
const emulator = new CpuEmulator(instructions)
return emulator.getResult() == result
}
console.log(assert('1 + 2 + 3', 6)) // true
console.log(assert('1 + 2 * 3', 7)) // true
console.log(assert('10 / 2 * 3', 15)) // true
console.log(assert('(10 + 10) / 2', 10)) // true
All tests are correct . Also attached Complete source code , It is suggested that the students who don't understand it should read it twice more .
Take it to the next level
For industrial compilers , The four arithmetic compilers belong to toys in toys . But you can't be a fat man at one bite , So it's better to learn compiler principles step by step . Let's introduce a more advanced compiler , This compiler can compile a Jack Language ( class Java Language ), Its grammar is like this :
class Generate {
field String str;
static String str1;
constructor Generate new(String s) {
let str = s;
return this;
}
method String getString() {
return str;
}
}
class Main {
function void main() {
var Generate str;
let str = Generate.new("this is a test");
do Output.printString(str.getString());
return;
}
}
The output of the above code is :this is a test
.
Do you want to implement such a compiler ?
This compiler comes from a Book 《 Computer system elements 》, It starts from the 6 Chapter begins , All the way to 11 Chapter describes the assembly compiler ( Convert assembly language to machine language )、VM compiler ( Will be similar to bytecode VM Language is translated into assembly language )、Jack Language compiler ( High level language Jack Translate into VM Language ). Each chapter has detailed explanation and experiment of knowledge points , As long as you do the experiment step by step , We can finally implement such a compiler .
If the compiler is finished , Finally, where does machine language execute ?
This book has been considered for you , It starts from the 1 The first chapter 5 Chapter , There are five chapters . Teach you to start with logic gates , Step by step, the arithmetic logic unit is formed ALU、CPU、 Memory , And finally build a modern computer . And then let the program that you compile with a compiler run on this computer .
in addition , The first of this book 12 Chapter will teach you how to write various library functions of the operating system , for example Math library ( It involves all kinds of mathematical operations )、Keyboard library ( How to press the keyboard is output to the screen )、 Memory management and so on .
I want to have a look at the whole book 12 What happened after Zhang's experiment ? I'm here to provide a few pictures of this analog computer running program DEMO GIF, For your reference .
The top right corner of these pictures is “ Computer ” The screen of , The other parts are “ Computer ” Stack area and instruction area of .
I have done all the experiments in this book ( Spend every day 3 Hours , It can be done in two months ), The answer is in my github On , If you are interested, you can have a look at .