Compiling principles written for front end

zxg_ God says light 2021-05-04 16:53:39
compiling principles written end

Hao Hao is a front-end engineer , Recently, it involves the field of engineering , Want to know something about compiling . It happens that I studied earlier than he did , So I introduced what I knew to him , So here's the conversation .

What is compilation ?

Haohao : Recently, I want to know something about compiling , light , What is compiler ?

I : Compiling is a kind of transformation technology , From one programming language to another , From high-level language to low-level language , Or from high-level language to high-level language , Such conversion technology .

Haohao : What is advanced language , What is low level language ?

I Low level language It's about machines , When it comes to registers 、cpu Instruction, etc , special “ low ”, Describe the execution process on the machine , Like machine language 、 assembly language 、 Bytecode, etc . High-level language There is no such thing as concrete implementation , It's mainly used to express logic , And it provides the conditions 、 loop 、 function 、 Object oriented and so on , And then through compiling, these described high-level language logic can be automatically converted into low-level language instructions , In this way, we can express logic conveniently , Without affecting the specific implementation . It's not right to say that it doesn't affect execution , Because if you write assembly directly , Write the most efficient code , But if it is a high-level language, it will be automatically converted to a low-level language by compiling , Then it is difficult to guarantee the execution efficiency of the generated code , Various compilation optimizations are required , This is the difficulty in compiling .

In fact, think of , We put ideas in our heads , Translate the plan into high-level language code , Is this process also a transformation , Can it be automated , This involves ai 了 . Now we have the research direction of intelligent technology for understanding requirements document and generating code .


Haohao : How is it converted ?

I : To transform, we must first understand the two sides of the transformation , What is to be converted , To what . Such as high-level language to high-level language , It's a string to convert , Organized in a certain format , These formats are called morphology 、 grammar , The whole is called grammar , The target to be changed , If the target is also a high-level language, understand the format of the target language , If the target is a low-level language , For example, assembly , It's about understanding what each command does . Then we need to transform the semantic equivalence , Pay attention to this “ Semantic equivalence ”, Explain one language to another , You can't lose or add some semantics , It has to be consistent .

Know what both sides of the conversion are , You can do the conversion , First, let the computer understand what to transform , What do you mean “ Computer understanding “ Well ? It's just to put the words that we prescribe 、 The syntax format tells the computer , How to tell ? It's data structure , According to a certain data structure, the results of source string parsing should be organized , The computer can handle it . This process is called parse, First of all, participle , Reconstruct the idiom tree .

In fact, it's not just the compilation field that needs “ understand ”, There are many other fields that need to be “ understand ”:

Full text search engine also need to first search the string through the word segmentation machine segmentation , Then according to these words, use the same word segmentation machine to segment words and index them in the database , Score and sort the matching results of words , This is full text search .

The field of artificial intelligence deals with natural language , He also has to follow the morphology 、 grammar 、 And so on “ understand ”, After becoming a certain data structure , The computer can understand and process , Then there are various processing algorithms .

Word segmentation is based on the state machine ( Finite state machine DFA), What's this for , Why does participle need it , I know you must have questions . Because morphology describes the format of the smallest word , For example, an identifier cannot begin with a number , And then followed by alphanumeric underscores and so on , such , And keywords if、while、continue etc. , These can't be subdivided any more , It doesn't make sense to subdivide . Word segmentation is to change a string into the smallest unit of words that can no longer be split , Also called token, Because different words have different formats , You can't write if else Let's deal with different formats . Actually, it can be ,wenyan Namely if else, Make complaints about it . But when there is 100 The format of words in English should be handled , It's all written in if else, My day, , Can you still read the code . So treat the processing of each word as a state , When we deal with different word formats, we jump to different states , The way to jump is naturally based on the character currently being processed , Process a string from the start state to a different state to process , So that's state automata , Every token After identification, it can be thrown out , The final product is a token Array .

In fact, the state is more than one level , Think about it, like a html The beginning tag of the tag , It can be treated as a state , But in this state, we have to deal with properties 、 Start tagging, etc , This is the secondary state , Attributes can be subdivided into several States to handle , It's a three-level state , This is the idea of partition , Layer by layer .

After word segmentation, we got the words one by one , And then we're going to assemble these words , Generate ast, Why do you have to ast Well ? I know you want to ask . In fact, high-level language code is nested , You see low-level languages like assembly , It's just instructions , Linear structure , But high level languages , Have function 、if、else、while And so on , Blocks can be nested . So it's natural to organize a tree like data structure for the computer to understand , Namely Abtract Syntaxt Tree, Grammar tree 、 And it's abstract , That is, some meaningless separators are ignored , such as html Of <、>、</ Equal character ,js Of { }() [] ; It's the details , Don't care , Comments are also ignored , Notes are just participles that will come out , But don't put it in ast Inside .

How to assemble it , Or nested assembly , Is that recursive assembly , Yes , You're right. You need recursion , It's not just here ast Assembly requires recursion , There is also a lot of recursion in the following processing , Except when it comes to linear code , It's like assembly , What are you doing , There is no nested structure to recurse .

We just analyzed the morphology , It's a string format , Grammar , It's the assembly format , It's the combination of words . That's why we're just going to participle , If you assemble it directly from a string ast, So we're dealing with the string level , And from token It starts at the word level , It's like building a castle out of building blocks , But building blocks also need to be made of mud , How do you make it , You can build the blocks first , And then assemble it into a castle , You can also build blocks and assemble them at the same time . But for cars, you can make building blocks at the same time , Edge assembly , Castle level building blocks are assembled at the same time. Can you figure out what building blocks to build , It's hard , So it still depends . Do it both ways parser Both have , Simple analysis can be side word method , Analyze the hot words and immediately assemble them into ast in , such as html、css such , But like js、c++ such , If you don't segment first , Just start with a string ast, I can only say it's too fierce .

Talking about building blocks and assembly for a long time , So how to assemble it , Processing from left to right token, Meet a token How do you know his grammar , It's like how to know which part a building block belongs to . There are also two ways of thinking , One is to make sure that the building block belongs to that part , And find the drawing of that part , Assemble according to the drawing , The other is that you assemble first , After assembly, let's see what this part is . There are two ways , First, determine which part is based on one or two building blocks , Then assemble the part according to the drawing , This is a ll The way , Assemble first , After assembly, let's see what the parts are , This is a lr The way .ll The way to determine what is assembled ast The nodes need to look down a few , According to the number to see what is assembled, they are LL(1),LL(2) And so on .ll That's recursion descent , This is the easiest way to assemble , Of course, some people think that lr It's also very simple .ll There's a problem that has to be solved lr solve , That is, recursion falls down, and recursion on the left side doesn't reach the end , To eliminate left recursion , That's when you can't assemble according to the drawings , Let's first assemble and then see what's assembled . It's actually very similar to life , One way is to look down two steps and decide what to do right now , Another way is to go first , We'll see where we go . Actually, I belong to the second kind , There's no planning .

After morphology 、 After the analysis of grammar, it came into being ast. Use a tree like data structure to describe the source code , From here on, computers can understand , It can be explained and executed later 、 The transformation can be compiled . Both interpretation and compilation need to be done first parse, That is, let the computer understand what it is first , And then decide what to do with it .

Put the tree in the back ast Convert to another ast, And then print it as a string of object code , This is the translator , hold ast Interpretation and execution or special linear intermediate code and then interpretation and execution , This is the interpreter , hold ast Turn to linear intermediate code , And then generate assembly code , Then do the assembly and link , Generate machine code , This is the compiler .

How does the compiler handle AST Of ?

Haohao : light , How does the compiler handle ast ah ?

I : With ast after , Computers can understand high-level language code , But compilers have to produce low-level languages , For example, assembly code , Directly from ast It's a long way to start . Because one is nested 、 Tree shaped , One is linear 、 Sequential , So! , It needs to be converted to a linear code first , And then generate low-level code . I think ast It's also a tree shape IR,IR yes immediate representation The meaning of the middle . First of all AST It's linear IR, And then generate the assembly 、 Bytecode, etc .


How to translate , How can the tree structure become linear ? Obviously recursion , Recursion according to grammatical structure ast, Translate each node , This is called grammar guided translation , With linearity IR To translate AST Properties of a node . How to translate each node ,if How to translate 、while How to translate can go to see the relevant information , Search for intermediate code generation .

however ast If you can't come up, just turn to the intermediate code .

Haohao : Why? ,ast Can't represent the source code information , Why can't it be translated directly into linear ir?

I : Because we haven't done semantic checking yet , Structure doesn't mean right , It's like “ Haohao is a pig ”, This is grammatically correct , But the semantics is obviously wrong , It's a curse , So we need to do semantic checking first . There is also the need to derive some information to , In order to do the follow-up translation .

Semantic analysis should check out semantic errors , Such as whether the types match 、 Whether the referenced variable exists 、break Whether in while secondary , The main thing to do Scope analysis Reference resolution Type derivation and checking Correctness check etc. .

Scope analysis is analyzing functions 、 Block, etc , What are the variables in these scopes , What is the connection between scopes , In fact, the scope is a tree , From the top scope to the sub scope, a tree data structure can be generated . I remember one who did scope Analytical webpack plug-in unit , He linked the modules as well , Formed a big scope graph, And then do the analysis .

There are various declarations in the scope , We need to sort them out 、 Initial value 、 Access modifiers and other information are recorded , The structure that holds this information is called the symbol table , This is equivalent to a cache , Later, when processing this symbol, just look up the symbol table , Don't start from ast Come looking for .

Reference resolution is to check whether definitions can be found for each symbol , If you can't find it, report an error . You are familiar with the type ,js It's impossible to write all types in the source code of , Many places can be deduced directly , according to ast You can get a declaration of the type , Record it in the symbol table , And then go through ast, Check the type of each node when it takes out the declaration , If you don't agree, you report an error . There are other trivial checks , such as continue、break Can only appear in while Medium and so on .

Haohao : Semantic analysis, I get it , Is to check for errors and record some of the analyzed information to the symbol table , After semantic analysis ?

After semantic analysis, it means that the program has no syntax and semantic errors , It's safe to do all kinds of subsequent conversions , There will be no more developer mistakes . Then translate it into linear IR, And then for linearity IR To optimize , Need to optimize is because automatically generated code inevitably has a lot of redundancy , We need to get rid of all kinds of unnecessary treatment . But keep the semantics unchanged . Like dead code deletion 、 Common subexpression delete 、 Constant propagation and so on .

linear IR The analysis of the flow chart to establish , It's the control flow diagram , control flow It is based on if、while、 Program jump caused by function call, etc , Connecting the code executed in sequence with the code that jumps to is a graph , Sequential execution of code as a whole , It's called basic fast . Then do data flow analysis according to the flow chart , That is to analyze the code that a variable flows through , And then do all kinds of optimizations based on these .

This part is called program analysis , perhaps Static analysis , It's a special direction , Can be used for static checking of code vulnerabilities , Can be used for compilation optimization , This is more difficult . There are fewer doctors studying this . In China, only Peking University and Nanjing University offer program analysis courses .

The optimized linearity IR You can generate assembly code , And then through the assembler into machine code , Link to some standard libraries , such as v8 You can see builtins Catalog , Here are all kinds of compiled machine code files , Can be statically linked into an executable file .

Haohao : Oh , I don't think the front end can touch the assembly and link .

I : Right , because js It's interpreted language , Explain and execute directly from the source code , Don't say js 了 ,java And the bytecode doesn't need static links . image c、c++ Only those who generate executable files need to make code into machine code through assembler, and then link it into a file . And if the target platform has these libraries , So there's no need to link statically together , You can dynamically link . You may have heard of .dll and .so These are respectively windows and linux For runtime dynamic loading to save the machine code file .

You're right , The front-end domain basically doesn't need assembly and linking , Even if it is wasm, Also generate wasm Bytecode , Then explain and execute . The front end is mainly translator .

How does the translator handle AST Of ?

Haohao : The translator is ast And what did you do after that ?

I : The target code of the translator is also a high-level language , It's also a nested structure , So from high-level language to high-level language is from tree structure to tree structure , It's not like translating into a language organized in a low-level way , It has to be translated into linear first IR, High level to high level language conversion , It only needs ast, Yes ast After all kinds of transformations , You can do code generation .

Haohao : I said , I didn't hear about it babel There's a linearity in it IR The concept of .

I : Right , Whether it's cross language conversion , such as ts turn rust, It's the same language conversion js turn js You don't need a linear structure , What kind of linear intermediate code does the conversion of two trees need . So most translators are parsetransformgenerate this 3 Stages .


parse In a broad sense, it includes morphology 、 The analysis of grammar and semantics , The narrow sense parse Grammatical analysis . This doesn't have to be tangled up .

transform That's right ast Addition, deletion and modification of , after generator And then ast Print it as a string , We analyze ast When the time is right []{} () The bisector has been removed ,generate Add the details back when you need to .

In fact, the front-end domain is mainly translator , Because the mainstream js The engine executes the source code , But the source code is not the same as the one we wrote , So many source to source translators at the front end do this kind of conversion , such as babel、typescript、terser、eslint、postcss、prettier etc. .

babel It's a higher version es The code is converted to a lower version , And injection polyfill.typescript It's type checking and turning into js Code .eslint It's checking according to the specifications , but --fix You can also generate the repaired code .prettier Also used to format code , Than eslint More to deal with , It's not limited to js.postcss Mainly dealing with css Of ,posthtml Used for processing html. I believe you have used it a lot .taro This kind of small program translator is based on babel Packaged .

How the interpreter handles AST Of ?

Haohao : Oh , light , I probably know that both compilers and translators are right ast What have you done to deal with , Both of them generate code , And the interpreter ?

I : Yes , First of all, translator is also a kind of compiler , It's just something special , be called transpiler, A general compiler is called compiler. The difference between an interpreter and a compiler is really whether to generate code , A machine code compiled in advance is called AOT compiler , What compiles into machine code at run time is called JIT compiler ,

The interpreter doesn't generate machine code , So how does it work ? I know you have questions .

In fact, the interpreter uses one high-level language to explain another high-level language , such as c++, Commonly used c++ To write an interpreter , Because you can do memory management . use c++ To write js Interpreter , image v8、spidermonkey Are all . We have ast And after semantic analysis, you can traverse ast, And then use c++ To execute different nodes , This is called. tree walker Interpreter , Direct interpretation execution ast,v8 The engine is 17 That's what I did years ago . But in 17 Bytecode was introduced two years later , Because bytecode can be cached , So next time you execute bytecode directly, you don't need to parse 了 . Bytecode is a linear structure , Also do ast To linear ir Transformation , After the vm Execute bytecode on .

General explanation of linear code, such as assembly code 、 Bytecode and other programs are called virtual machine , Because machine code is linear , Actually from ast It can be explained from the beginning , But not vm, I think that's why , The interpreter of linear code, which is similar to machine code, is called vm.

Whether it's explanation ast Good , It's better to translate it into bytecode and explain it again , It's not particularly efficient , Because it uses other high-level languages to execute the code of the current language , So to improve efficiency, we have to compile it into machine code , This kind of runtime compilation is JIT compiler , Compiling is time consuming , So it's not all code JIT, Do heat statistics , You don't do it until you reach the threshold JIT. Then cache the machine code , Of course, it could be cached assembly code , When it is used, it can be converted into machine code by assembler , Because machine code takes up a lot of space .

You can contrast v8 To understand the ,v8 Yes parser、ignation Interpreter 、turbofan compiler , also gc.

ignation The interpreter is to put parse Out of ast Into bytecode , Then explain the execution bytecode , When the heat reaches the threshold, it will be handed over to turbofan Compile into assembly code and generate machine code , To speed up .gc It's independent of memory management .

turbofan It's the turbocharger , The name can reflect JIT The meaning of . but JIT Improved execution speed , There are also shortcomings. , For example, it will make js The engine is bigger , More memory , So lightweight js The engine does not contain jit, That's the speed and the packet size 、 The trade-off between memory space . Architecture design often needs to be done. Both sides are OK , But to make a choice trade off, We call it scheme blending .

When it comes to trade-offs , I think of rn Of js engine hermes Instead, it supports direct bytecode execution , Put... During compilation js Code compiled into bytecode , And then execute the bytecode directly , This is in the cross domain js Engine trade off.

Where does the front-end domain use compilation knowledge ?

Haohao : Oh , light , I understand the interpreter 、 compiler 、 What do translators do , Where does the front-end domain use the knowledge of compiler principles ?

I : In fact, you must have a general understanding , But it's not clear , Let me make a list of what I know .

Various translators in the field of engineering : babel、typescript、eslint、terser、prettier、postcss、posthtml、taro、vue template compiler etc.

js engine : v8、javascriptcore、quickjs、hermes etc.

wasm: llvm Can generate wasm Bytecode , therefore c++、rust Etc. can be converted into llvm ir You can do it in any language wasm Development

ide Of lsp: Syntax of programming languages 、 Smart tips 、 Error checking, etc language service protocol Protocol to communicate , and lsp The server is mainly based on parser Analyze the text being edited

How do you realize a language ?

Haohao : I learned the principle of compilation, can I realize a language ?

I : In fact, programming language is mainly design , If it is realized, it should be realized first parser And semantic analysis , There are two roads in the back , One is interpreter coordination for interpretation and execution JIT The way of compiler , One is compiled into assembly code , Then generate the machine code and link it to the executable file .

parser Part of it is cumbersome , It can be used antlr such parser Generator to generate , Semantic analysis should be written by oneself , This is not too difficult , It's mainly about ast All kinds of treatment . And then if you want to build a compiler , It can be used llvm This kind of general optimizer and code generator ,clang、rust、swift All based on it , So it's very reliable , It can be used directly . If you're an interpreter, you can write tree walker Interpreter , Or to further generate linear bytecode , And then write one vm To explain bytecode .JIT Compilers can also use llvm To do it . To put ast Turn into llvm ir, It's also a tree structure to a linear structure , This is a very common operation in the field of compilation .

In fact, the compiler principle just tells you how to implement it , Language design doesn't care about implementation , A language can be either compiled or interpreted , It can also be made java That kind of compile before explain , You see hermes(react native Realized js engine ) No, it's just putting js Compile into bytecode and then explain the bytecode . Language does not translate , This concept should have ,c There are also interpreters ,js There are also compilers , The main way to explain is to translate or translate .

Programming languages can be divided into GPL and DSL Two kinds of .

GPL It's a universal programming language , It's Turing's complete , That is, it can describe any computable problem , image c++、java、python、go、rust These languages are Turing's complete , So whatever one language can do, another language can do , It's just that it's very difficult to implement . such as go Language built-in coroutine implementation , So it's easy to write high concurrency programs ,java There is no language level coroutine , Then it's up to the top . You may have heard that design patterns are a complement to language defects, which means , Different languages have different design ideas , The built-in things are also different , Sometimes it needs runtime to make up for .、

Programming languages have different design ideas , The big direction is the programming paradigm , Like imperative 、 declarative 、 Functional expression 、 Logic, etc , These big ideas lead to the grammar of language , The built-in implementations are different , The ability to express is also different . This basically sets the tone of the language , It's hard to make up for it later , It's like js It's functional , You can't restrict people from using imperative , It's hard to write pure functional code .

DSL Not Turing complete , But in exchange for a stronger expression ability in a certain field , such as html、css、 Regular expressions ,jq And so on , It's more like pseudo code , Strong expression in specific areas , But it's not Turing's complete and can't describe all computable problems .

Compiling principles are the steps to realize programming languages , There's a lot to learn from the higher level of language design , It's better to be familiar with the features of multiple programming languages .

How can I learn compiler principles ?

Haohao : light , How can I learn compiler principles ?

I : First of all, you need to understand what compiler learns , Look at the compiler above me 、 translation 、 The explanation of popular science may have an impression , Then check the relevant information . When you know what you can do, write first parser, Because whatever you do, you have to parse become ast Can be “ understand ” And subsequent processing , Learn the finite state machine to segment words and recursive descent construction ast. It's recommended to have a look at vue template compiler Of parser, such xml Of parser Relatively simple , Suitable for entry . At the language level parser There are many details , We still have to find one debug see . But I don't think it's necessary , Generally, I just write html parser, If the language is , It can be used antlr Generate . The translator must understand babel, This is a very good translator in the front-end field .

js The engine can try to use babel do parser, Do your own semantic analysis , Explain to perform ast try , After that, it further generates bytecode or other linear ir, And then write one vm To explain bytecode .

You can also learn wasm Related technology , That is related to other languages compiling to wasm Bytecode process .

I'm writing a 《babel Plug in clearance secrets 》 A small volume of , There will be js Interpreter 、 Translator 、type cheker、linter wait , A lot of knowledge about compiler principles , Maybe it can help you get started with compiling principles .

When you're done with compiling principles , You probably know how to implement a programming language , Later, if you want to go deep into language design, you can learn more languages of other programming paradigms , Learn about language features , How to design a strong expressive gpl perhaps dsl.

You can also learn more about the operating system and architecture , Because the compiled code still needs to run as a process on the operating system , So how to design the runtime depends on understanding the operating system . then cpu How instruction sets are implemented in circuits , If you want to go deeper, you can take a look at the computer architecture .

however , Front end engineers don't need to reach that depth , But there's no harm in having an open mind .

本文为[zxg_ God says light]所创,转载请带上原文链接,感谢

  1. Gallop workflow engine design series 01 process element design
  2. VUE移动端音乐APP学习【十六】:播放器歌词显示开发
  3. Vue Mobile Music App learning [16]: player lyrics display development
  4. jquery cookie
  5. jquery cookie
  6. 体面编码之JavaScript
  7. JavaScript for decent coding
  8. React17 系统精讲 结合TS打造旅游电商平台
  9. React17 system combined with TS to build tourism e-commerce platform
  10. 2021-05-04 hot news
  11. HttpSession对象与Cooike的关系 以及 Cookie对象构造函数问题
  12. gRPC-Web:替代REST的gRPC的Javascript库包
  13. The relationship between httpsession object and cooike and the construction of cookie object
  14. Grpc Web: a JavaScript library package to replace rest grpc
  15. Building reactive rest API with Java - kalpa Senanayake
  16. PDF转HTML工具——用springboot包装pdf2htmlEX命令行工具
  17. Pdf to HTML tool -- Wrapping pdf2htmlex command line tool with springboot
  18. PDF转HTML工具——用springboot包装pdf2htmlEX命令行工具
  19. Pdf to HTML tool -- Wrapping pdf2htmlex command line tool with springboot
  20. Vue.js比jQuery更容易学习
  21. Node.js的Reactor模式 与异步编程
  22. Vue. JS is easier to learn than jQuery
  23. Reactor mode of node.js and asynchronous programming
  24. 详解JavaScript中的正则表达式
  25. Explain regular expressions in JavaScript
  26. 详解JavaScript中的正则表达式
  27. Explain regular expressions in JavaScript
  28. JS: closure
  29. Write your own promise in promises / A + specification
  30. Analysis of the core mechanism of webpack from loader, plugin to egg
  31. On the import and export of webpack
  32. Interpretation of lodash source code (2)
  33. Hexo series (5) writing articles
  34. 有人用过JMeter或用HttpUnit写过测试吗????
  35. Has anyone ever used JMeter or written tests in httpUnit????
  36. JavaScript异步编程4——Promise错误处理
  37. Leetcode 1846. Reduce and rearrange the largest element of an array
  38. JavaScript asynchronous programming 4 -- promise error handling
  39. SQLite是一种经典的无服务器Serverless
  40. 通过Spring Boot Webflux实现Reactor Kafka
  41. SQLite is a classic server less
  42. Realization of reactor Kafka through spring boot Webflux
  43. Focus on the basic knowledge of JS
  44. Node.js与Spring Boot比较? - Ryan Gleason
  45. Compare node.js with spring boot- Ryan Gleason
  46. 「HTML+CSS」自定义加载动画【049】
  47. 「HTML+CSS」自定义加载动画【048】
  48. 「HTML+CSS」--自定义加载动画【047】
  49. "HTML + CSS" custom loading animation [049]
  50. "HTML + CSS" custom loading animation [048]
  51. "HTML + CSS" -- custom loading animation [047]
  52. 使用Akka实现Reactive DDD
  53. Prototype与JQuery对比
  54. Using akka to realize reactive DDD
  55. Comparison between prototype and jquery
  56. Please elaborate the diff algorithm of Vue
  57. On the combination of ecarts and Baidu map, in the Intranet environment to develop offline map, to achieve point, line, range value.
  58. 使用Slonik框架基于Node.js和PostgreSQL处理大量数据
  59. Using slonik framework to process large amount of data based on node.js and PostgreSQL
  60. Netflix使用React制作高性能电视用户界面