Introduction: What is a front-end intelligent inference engine, and how do you build and apply one?
Before discussing the front-end intelligent inference engine, let's start with what "on-device intelligence" is.
On-device intelligence (On-Device Machine Learning) means running machine learning applications on the device side. "Device side" here is relative to cloud services: it can be a mobile phone, an IoT device, and so on.
In traditional machine learning, much of the work is done on the server side because of model size and limited device computing power. For example, Amazon AWS has "Amazon Rekognition Service" and Google has "Google Cloud Vision Service". As the computing power of on-device hardware, represented by mobile phones, has improved and model design itself has evolved, smaller and more capable models can increasingly be deployed and run on the device.
-- Quoted from https://www.infoq.cn/article/m5m93qyadscnyil3kprv
Compared with cloud deployment, the app side has more direct access to user characteristics, and it also has the following advantages:
Those are the advantages of on-device intelligence, but it is not a panacea. It still has some limitations:
Similarly, front-end intelligence means running machine learning applications on the front end (web, H5, mini programs, etc.).
So, what is a front-end intelligent inference engine?
As shown in the figure:
The front-end intelligent inference engine is simply the component that uses the front end's computing power to execute a model.
Here are three common inference engines: tfjs (TensorFlow.js), ONNX.js, and WebDNN.
For an on-device inference engine, what matters most? Performance, of course! Better performance means more application scenarios become feasible on the device. Let's look at a performance comparison of these three inference engines:
(The data below was measured with the MobileNetV2 classification model.)
As you can see, in a pure JS environment a single classification takes more than 1500 ms. Imagine a camera that needs to classify the objects it captures in real time (for example, predicting whether the subject is a cat or a dog). If every prediction takes 1500 ms, that performance is intolerable.
In the WASM environment, the best performer, ONNX.js, reached 135 ms, that is, about 7 fps, which is barely usable. tfjs, however, came in at a poor 1501 ms. ONNX.js performs best here because it uses worker-based multithreading for acceleration.
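The conversion from per-inference latency to frame rate used above is simple arithmetic. The sketch below only restates it. The helper names and the 30 fps target are illustrative assumptions, not part of any of the engines.

```typescript
// Convert a per-inference latency (in milliseconds) to frames per second.
function framesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Check whether one inference fits into the time budget of a single frame.
// targetFps is an assumed requirement chosen by the application.
function fitsFrameBudget(latencyMs: number, targetFps: number): boolean {
  return latencyMs <= 1000 / targetFps;
}

console.log(framesPerSecond(135).toFixed(1));  // ONNX.js on WASM: about 7.4 fps
console.log(framesPerSecond(1500).toFixed(2)); // pure JS: well under 1 fps
console.log(fitsFrameBudget(135, 30));         // false: too slow for an assumed 30 fps target
```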
Finally, in the GPU environment, you can see that tfjs and ONNX.js both reach a relatively good level of performance, while WebDNN performs worse.
Besides the three engines above, there are also Baidu's paddle.js and Taobao's mnn.js, among others, which are not discussed here.
Of course, when choosing an inference engine, you also need to consider the ecosystem, engine maintenance, and so on, in addition to performance. All things considered, tfjs is the most suitable front-end inference engine currently available, because it can rely on TensorFlow's powerful ecosystem and on full-time maintenance by Google's official team. By comparison, the ONNX framework is relatively niche, and ONNX.js has not been maintained for nearly a year. WebDNN is not competitive in either performance or ecosystem.
As you can see from the previous section, high-performance computing on the front end is usually done with WASM or with WebGL-based GPU computing. There is also asm.js, which is not discussed here.
WASM should be familiar to most readers, so here is just a brief introduction:
WebAssembly is a new type of code that can run in modern web browsers, and it provides new performance features and effects. It is not primarily intended to be written by hand. Rather, it is designed to be an efficient compilation target for low-level source languages such as C, C++, and Rust.
This has huge implications for the web platform: it provides a way to run code written in multiple languages on the web at near-native speed, enabling client apps that previously could not run on the web.
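To make the idea of a compilation target concrete, here is a minimal sketch that loads a tiny hand-assembled WebAssembly module through the standard `WebAssembly` JavaScript API. The byte array and the names `ADD_MODULE_BYTES` and `wasmAdd` are illustrative only. In practice the binary would be produced by compiling C, C++, or Rust rather than written by hand.

```typescript
// Hand-assembled WebAssembly binary exporting add(a: i32, b: i32): i32.
const ADD_MODULE_BYTES = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d,                               // magic number "\0asm"
  0x01, 0x00, 0x00, 0x00,                               // binary format version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section: one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export section: "add" -> function 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section: one body, no locals
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0, local.get 1, i32.add, end
]);

// Synchronous compilation is acceptable for a module this small.
const addModule = new WebAssembly.Module(ADD_MODULE_BYTES);
const addInstance = new WebAssembly.Instance(addModule);
const wasmAdd = addInstance.exports.add as (a: number, b: number) => number;

console.log(wasmAdd(2, 3)); // prints 5
```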
What? Isn't WebGL for graphics rendering? Isn't it for 3D? Why can it be used for high-performance computing?
Some readers may have heard of the gpgpu.js library, which uses WebGL for general-purpose computation. How exactly does that work? (To follow the rest of this article, please skim this article first: "Using WebGL2 to Implement Web Front-End GPU Computing".)
Okay, we now know two ways to do high-performance computing on the front end. But what if the performance of the existing frameworks (tfjs, ONNX.js) does not meet our needs? How can we improve engine performance further and ship it to production?
The answer: optimize the performance by hand. Yes, it is that simple and crude. Taking tfjs as an example (the other frameworks work the same way in principle), the following introduces several techniques for optimizing engine performance.
At the beginning of last year, our team had an in-depth exchange with Google's tfjs team. Google made it clear that the future direction of tfjs is WASM-first, and that the WebGL backend will be kept in maintenance mode without new features. At this stage, however, browsers and apps do not yet fully support WASM (for example, features such as SIMD and multi-threading), so WASM cannot be deployed to production for the time being. For now, we therefore still need to rely on WebGL's computing power. Unfortunately, the WebGL performance of tfjs on mobile devices is still unsatisfactory. On low-end devices in particular, it cannot meet our business requirements. With no alternative, we had to go in and optimize the engine ourselves. So everything below is about WebGL computing.
Technique 1: Computation vectorization
Computation vectorization means using GLSL's vec2/vec4/matrix data types for computation. The GPU's biggest advantage is parallel computing, and vector operations let us parallelize as much of the work as possible.
For example, one element of a matrix multiplication:
c = a1 * b1 + a2 * b2 + a3 * b3 + a4 * b4;
can be rewritten as
c = dot(vec4(a1, a2, a3, a4), vec4(b1,b2,b3,b4));
Vectorization should also be combined with memory layout optimization.
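The same rewrite can be sketched on the CPU side to show that the two forms compute the same value. This TypeScript is only an emulation of the GLSL idea, not engine code. `dot4` mimics GLSL's built-in `dot(vec4, vec4)`, and all function names here are made up for illustration.

```typescript
type Vec4 = [number, number, number, number];

// CPU stand-in for GLSL's built-in dot(vec4, vec4).
function dot4(a: Vec4, b: Vec4): number {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Scalar form: one multiply-add per loop iteration.
function dotScalar(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Vectorized form: consume four elements per step, as a shader would with vec4.
// A tail shorter than four elements is padded with zeros.
function dotVectorized(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i += 4) {
    const va: Vec4 = [a[i] ?? 0, a[i + 1] ?? 0, a[i + 2] ?? 0, a[i + 3] ?? 0];
    const vb: Vec4 = [b[i] ?? 0, b[i + 1] ?? 0, b[i + 2] ?? 0, b[i + 3] ?? 0];
    sum += dot4(va, vb);
  }
  return sum;
}

console.log(dotScalar([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]));     // prints 130
console.log(dotVectorized([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])); // prints 130
```

On a real GPU the benefit comes from the hardware executing the four multiply-adds of each `vec4` step together. The loop above only demonstrates that the results are equivalent.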
Technique 2: Memory layout optimization
If you have read the article "Using WebGL2 to Implement Web Front-End GPU Computing" mentioned above, you will know that all data on the GPU is stored in textures, and a texture is itself a structure of width n * height m * 4 channels (RGBA). So how do we store a four-dimensional matrix of shape 3 * 224 * 224 * 150? This is where matrix encoding comes in: the high-dimensional matrix is stored in a texture of a specific shape in a certain format, and the data layout inside the texture in turn affects read and write performance during computation. To take a simpler example:
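One possible encoding can be sketched as follows: flatten the tensor in row-major order, put four consecutive values into the RGBA channels of one texel, and arrange the texels in a roughly square texture. This is an assumed scheme chosen for illustration, not the layout tfjs or any other engine actually uses. The names `textureSizeFor` and `texelAddressOf` are hypothetical.

```typescript
type Shape4D = [number, number, number, number];

interface TexelAddress {
  x: number;       // texel column in the texture
  y: number;       // texel row in the texture
  channel: number; // 0 = r, 1 = g, 2 = b, 3 = a
}

// Choose a roughly square texture with enough RGBA texels for every element.
function textureSizeFor(shape: Shape4D): { width: number; height: number } {
  const elementCount = shape[0] * shape[1] * shape[2] * shape[3];
  const texelCount = Math.ceil(elementCount / 4); // four values per RGBA texel
  const width = Math.ceil(Math.sqrt(texelCount));
  const height = Math.ceil(texelCount / width);
  return { width, height };
}

// Map a 4D index to the texel and channel that store it (row-major flattening).
function texelAddressOf(shape: Shape4D, index: Shape4D): TexelAddress {
  const flat =
    ((index[0] * shape[1] + index[1]) * shape[2] + index[2]) * shape[3] + index[3];
  const texel = Math.floor(flat / 4);
  const { width } = textureSizeFor(shape);
  return { x: texel % width, y: Math.floor(texel / width), channel: flat % 4 };
}

// A small tensor: 2 * 3 * 4 * 5 = 120 values -> 30 texels -> a 6 x 5 texture.
console.log(textureSizeFor([2, 3, 4, 5]));
console.log(texelAddressOf([2, 3, 4, 5], [1, 2, 3, 4])); // the last element
```

The same functions accept the 3 * 224 * 224 * 150 shape from the text. Whether the resulting texture fits within a device's maximum texture size is a separate constraint that this sketch does not check.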
With an ordinary memory layout, the computation has to traverse the matrix by row or by column, whereas the GPU cache is tile-based, that is, an n*n block whose n varies from chip to chip. Traversing this way therefore causes frequent cache misses, which become the performance bottleneck. So we need to optimize performance through the memory layout, as in the figure below:
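The effect can be illustrated with a deliberately simplified model. Assume a cache that holds only the most recently touched n*n tile, and count how many tile loads are needed to read one logical row of a matrix under two layouts. Both layouts and all names below are hypothetical. Real GPU caches and the engine's actual layout are more involved.

```typescript
interface Texel {
  x: number;
  y: number;
}

// Linear layout: logical element (row, col) is stored at texel (col, row).
function linearLayout(row: number, col: number): Texel {
  return { x: col, y: row };
}

// Tiled layout: every run of tile*tile consecutive elements of a logical row
// is folded into one tile x tile block. Blocks are placed side by side along x.
function tiledLayout(row: number, col: number, tile: number, cols: number): Texel {
  const blockSize = tile * tile;
  const blocksPerRow = Math.ceil(cols / blockSize);
  const block = row * blocksPerRow + Math.floor(col / blockSize);
  const within = col % blockSize;
  return { x: block * tile + (within % tile), y: Math.floor(within / tile) };
}

// Count tile loads for an access sequence, assuming the cache keeps one tile.
function countTileLoads(texels: Texel[], tile: number): number {
  let loads = 0;
  let lastTileX = -1;
  let lastTileY = -1;
  for (const t of texels) {
    const tileX = Math.floor(t.x / tile);
    const tileY = Math.floor(t.y / tile);
    if (tileX !== lastTileX || tileY !== lastTileY) {
      loads += 1;
      lastTileX = tileX;
      lastTileY = tileY;
    }
  }
  return loads;
}

// Collect the texels touched when reading columns 0..cols-1 of one logical row.
function readRow(cols: number, layout: (col: number) => Texel): Texel[] {
  const texels: Texel[] = [];
  for (let col = 0; col < cols; col++) {
    texels.push(layout(col));
  }
  return texels;
}

const TILE = 4;
const COLS = 32;
const linearLoads = countTileLoads(readRow(COLS, (c) => linearLayout(0, c)), TILE);
const tiledLoads = countTileLoads(readRow(COLS, (c) => tiledLayout(0, c, TILE, COLS)), TILE);
console.log(linearLoads, tiledLoads); // prints 8 2
```

In this toy model the linear layout uses only 4 of the 16 values in each loaded tile, while the tiled layout uses all 16, which is the intuition behind reorganizing the layout around the cache shape.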
Technique 3: Graph optimization
A model is made up of individual operators, and on the GPU each operator is implemented as a WebGL program. Every program switch incurs a significant performance cost. So if we can reduce the number of programs in a model, the performance gain is considerable. As shown in the figure below:
We fuse the nodes of the graph that can be fused (n ops -> 1 op) and implement a new op for each fused node. This greatly reduces the number of ops, and with it the number of programs, which improves inference performance, especially on low-end phones.
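A minimal sketch of such a fusion pass over a linear chain of operators follows. The fusion rules and fused op names are invented for illustration. A real pass would operate on a full graph, check that intermediate results have no other consumers, and emit one shader program per fused op.

```typescript
interface FusionRule {
  pattern: string[]; // consecutive op names to match
  fused: string;     // name of the single op that replaces them
}

// Hypothetical rules; longer patterns come first so they take priority.
const FUSION_RULES: FusionRule[] = [
  { pattern: ["Conv2D", "BiasAdd", "Relu"], fused: "FusedConv2D[BiasAdd+Relu]" },
  { pattern: ["Conv2D", "BiasAdd"], fused: "FusedConv2D[BiasAdd]" },
];

// Check whether the pattern occurs in ops starting at position start.
function matchesAt(ops: string[], start: number, pattern: string[]): boolean {
  if (start + pattern.length > ops.length) return false;
  for (let k = 0; k < pattern.length; k++) {
    if (ops[start + k] !== pattern[k]) return false;
  }
  return true;
}

// Greedily replace each matched run of ops with its fused op (n ops -> 1 op).
function fuseOps(ops: string[], rules: FusionRule[]): string[] {
  const fusedOps: string[] = [];
  let i = 0;
  while (i < ops.length) {
    let matched: FusionRule | null = null;
    for (const rule of rules) {
      if (matchesAt(ops, i, rule.pattern)) {
        matched = rule;
        break;
      }
    }
    if (matched !== null) {
      fusedOps.push(matched.fused);
      i += matched.pattern.length;
    } else {
      fusedOps.push(ops[i]);
      i += 1;
    }
  }
  return fusedOps;
}

const chain = ["Conv2D", "BiasAdd", "Relu", "Conv2D", "BiasAdd", "Softmax"];
console.log(fuseOps(chain, FUSION_RULES)); // six ops (six programs) become three
```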
Technique 4: Mixed-precision computation
All of the computation above is conventional floating-point computation, that is, float32 single-precision. So can mixed-precision computation, for example a mix of float16, float32, and uint8, be implemented on the GPU? The answer is yes. The value of mixed-precision computation on the GPU is that it improves bandwidth. Each pixel of a WebGL texture contains four RGBA channels, and each channel holds at most 32 bits, so we can store as much data as possible in those 32 bits. With float16 precision, each channel can hold two values, so the effective bandwidth is 2 times what it was. Similarly, with uint8 the bandwidth is 4 times what it was. This is a huge performance improvement. The figure illustrates this:
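The packing arithmetic can be sketched with plain bit operations. The helpers below are illustrative only. They treat float16 values as opaque 16-bit patterns and do not show how a shader would encode or decode them.

```typescript
// Number of values one RGBA texel can hold when each channel has 32 bits.
function valuesPerTexel(bitsPerValue: 8 | 16 | 32): number {
  return 4 * (32 / bitsPerValue);
}

// Pack four 8-bit values into one unsigned 32-bit word, lowest byte first.
function packUint8x4(v: [number, number, number, number]): number {
  return (
    ((v[0] & 0xff) | ((v[1] & 0xff) << 8) | ((v[2] & 0xff) << 16) | ((v[3] & 0xff) << 24)) >>> 0
  );
}

function unpackUint8x4(word: number): [number, number, number, number] {
  return [word & 0xff, (word >>> 8) & 0xff, (word >>> 16) & 0xff, (word >>> 24) & 0xff];
}

// Pack two 16-bit patterns (for example float16 bit patterns) into one 32-bit word.
function packUint16x2(lo: number, hi: number): number {
  return ((lo & 0xffff) | ((hi & 0xffff) << 16)) >>> 0;
}

function unpackUint16x2(word: number): [number, number] {
  return [word & 0xffff, (word >>> 16) & 0xffff];
}

console.log(valuesPerTexel(32), valuesPerTexel(16), valuesPerTexel(8)); // prints 4 8 16
console.log(unpackUint8x4(packUint8x4([1, 2, 3, 255])));                // prints [ 1, 2, 3, 255 ]
// 0x3c00 and 0xc000 are the float16 bit patterns of 1.0 and -2.0.
console.log(unpackUint16x2(packUint16x2(0x3c00, 0xc000)));              // prints [ 15360, 49152 ]
```

The 2x and 4x figures in the text correspond to `valuesPerTexel(16) / valuesPerTexel(32)` and `valuesPerTexel(8) / valuesPerTexel(32)`. The cost is the reduced precision and range of the narrower types.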
There are many more optimization techniques, which are not listed one by one here.
At present, the engine built on our deep optimizations has been deployed in many application scenarios across Ant Group and the Alibaba economy. Typical examples are the pet recognition demonstrated at the beginning of the article, as well as card recognition, the broken-screen camera, and so on.
Elsewhere in the industry, there have also been popular applications such as virtual makeup try-on.
Readers of this article can also let their imagination run and discover more interesting intelligent scenarios.
As models are upgraded and the engines on the market are optimized further, I believe tfjs will shine in more interactive scenarios, such as AI-powered front-end games, AR, and VR. What we need to do now is stay focused, stand on the shoulders of giants, keep polishing our engine, and wait for the flowers to bloom.
Author: Green wall
This article is original content from Alibaba Cloud and may not be reproduced without permission.