R&D Solution: Yiche Front-End Monitoring System

Wu Baiqing 2021-02-23 00:28:28
Tags: solution, Yiche, front end, monitoring


Background

Self-developed tools are built to solve internal problems. I hope the following questions resonate with you:

  1. Do you know whether your important business pages are serving users normally?
  2. Can you detect business anomalies quickly, before a problem erupts on a large scale?
  3. Can you see a problem directly without going to the user's computer, and get an overview of the whole project? Can you drill down from macro to micro to quickly locate alarm information in production?
  4. When communicating with other departments, can you present solid evidence: show that an interface was inaccessible during a given period and that the parameters were passed correctly, and so help the server side troubleshoot?
  5. Product and design colleagues want to improve the user experience, and R&D keeps iterating on feature versions. What effect do these optimizations actually have, and how do we measure it?
  6. Which ad slots and which resources are more valuable? How can we hit users' pain points more precisely and empower the business?

All of these questions need the support of data metrics. So we started from the problems themselves, the ones that keep recurring or that we cannot explain to other departments, and built a product that helps us solve them.

This is the scenario in which Yiche Front-End Monitoring came into being.
It is mainly a multi-scenario, multi-dimensional real-time monitoring dashboard that covers the full link of the browser client, helping the team move from after-the-fact tracing and rectification to early warning and fast root-cause identification.

After detailed planning, we divided front-end monitoring into four phases: exception monitoring (phase 1), performance monitoring (phase 2), event tracking (phase 3), and behavior collection (phase 4). R&D officially started on June 23, 2020, and the project is now in phase 2.

Key structures

To meet the requirements above, the monitoring system is divided into four stages: metric collection, metric storage, statistics and analysis, and visualization.

Metric collection stage: an SDK integrated into the front end collects request, performance, and exception metrics, does some simple processing on the client, and then reports the data to the server.
Metric storage stage: receives the collected information reported by the front end; its main purpose is to land the data.
Statistics and analysis stage: in automatic analysis, the program finds problems from data statistics and triggers alarms. In manual analysis, users inspect specific log data through a visual data panel to find the root cause of an anomaly.
Visualization stage: a visual platform presents the metrics (API monitoring, exception monitoring, resource monitoring, performance monitoring) and lets users trace user behavior to locate problems.
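
The automatic-analysis step can be pictured as a threshold check over a window of aggregated counts. The sketch below is only an illustration under stated assumptions: the `WindowStats` shape, the `shouldAlarm` name, the 5% threshold, and the minimum sample size are all hypothetical and are not taken from the Yiche system.

```typescript
// Hypothetical aggregated counters for one time window (names are assumptions).
interface WindowStats {
  total: number;  // total sampled events in the window
  errors: number; // events flagged as abnormal
}

// Returns true when the error rate in the window exceeds the threshold.
// minSamples avoids alarming on windows with too little traffic.
function shouldAlarm(
  stats: WindowStats,
  threshold: number = 0.05,
  minSamples: number = 100,
): boolean {
  if (stats.total < minSamples) {
    return false;
  }
  return stats.errors / stats.total > threshold;
}

const busyWindow = shouldAlarm({ total: 1000, errors: 80 }); // 8% error rate
const calmWindow = shouldAlarm({ total: 1000, errors: 10 }); // 1% error rate
const tinyWindow = shouldAlarm({ total: 10, errors: 10 });   // too few samples
```

A real rule would probably also compare against a historical baseline, but a fixed threshold is enough to show where the alarm decision sits in the pipeline.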

Overall architecture

As statistical requirements grew and more front-end applications were onboarded, the data volume rose from just over 1 million records per day in the early days to about 70 million records per day now. The architecture has gone through three iterations. The latest version is described here; its main flow is handled in six layers.

Acquisition layer: PC and H5 share one SDK, which listens for events to collect metrics and pushes them to Logback through a REST interface. Logback pushes the different types of metric data to the Flume cluster over long-lived connections. The Flume cluster then distributes the data to Kafka topics for storage.
Processing layer: Flink consumes the data in real time. There are three kinds of consumption: offline data landing, real-time ETL, and detail logs.
Storage layer: offline data is stored in HDFS, real-time ETL data is stored in MySQL, and detail logs are written to ES.
Statistics layer: metrics are summarized offline (DW, DM) and in real time (minute level -> ten-minute level -> hour level).
Application layer: the API queries the summary tables and the detail data in ES.
Presentation layer: the front end renders charts, reports, details, links, and other information.
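
The minute -> ten-minute -> hour roll-up in the statistics layer can be sketched as repeated re-bucketing of counts. This is a minimal in-memory illustration, not the Flink job itself; the `Bucket` shape and the `rollUp` function are assumptions introduced for the example.

```typescript
// One aggregated bucket: bucket start (ms since epoch) and an event count.
interface Bucket {
  start: number;
  count: number;
}

// Re-aggregate finer buckets into coarser buckets that are sizeMs wide.
function rollUp(buckets: Bucket[], sizeMs: number): Bucket[] {
  const merged: Record<number, number> = {};
  for (const b of buckets) {
    // Align each bucket to the start of the coarser bucket that contains it.
    const start = Math.floor(b.start / sizeMs) * sizeMs;
    merged[start] = (merged[start] || 0) + b.count;
  }
  return Object.keys(merged)
    .map(Number)
    .sort((a, b) => a - b)
    .map((start) => ({ start, count: merged[start] }));
}

const MINUTE = 60000;
const minuteBuckets: Bucket[] = [
  { start: 0 * MINUTE, count: 3 },
  { start: 5 * MINUTE, count: 2 },
  { start: 12 * MINUTE, count: 4 },
];
const tenMinuteBuckets = rollUp(minuteBuckets, 10 * MINUTE);
const hourBuckets = rollUp(tenMinuteBuckets, 60 * MINUTE);
```

Rolling up from the previous level, rather than from raw events each time, keeps every coarser level cheap to compute.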

Technical solution

Data collection

The original vision for collection was to be non-invasive to the business: business systems should need no modification beyond embedding one snippet of code. All collection is therefore handled automatically by the SDK.

The SDK listens globally for several kinds of events: errors, resource loading exceptions, page performance, and API calls.

These listeners are ultimately consolidated into three kinds of metric collection.
Exception collection: listens for the error and unhandledrejection events to capture exceptions from JS, images, CSS, and so on.
Performance collection: calls the browser's native performance.timing API to capture page performance metrics.
Interface collection: uses Object.defineProperty to proxy the global XHR and capture the browser's XHR/fetch requests.

Collection-side SDK architecture

The SDK is divided into two parts:
Part one: the SDK driver, which contains the entry point, core utilities, and common type definitions.
Part two: the plug-in part (the blue area in the diagram), which implements the collection of the three kinds of metrics above.

Next, we describe part two in detail: the collection scheme for each metric.

Exception collection scheme

By listening for error events (error and unhandledrejection), the SDK can capture all kinds of exceptions: JS errors, image loading, CSS loading, JS loading, Promise rejections, and so on. It also supports capturing seven error types, such as InternalError and ReferenceError.

The key code is as follows.

Listening for events

/**
 * Listen for the error and unhandledrejection events to handle exception information
 *
 * @param {YicheMonitorInstance} instance SDK instance
 */
export default function setupErrorPlugin(instance: YicheMonitorInstance) {
  // JS errors or static resource loading errors
  on('error', (e: Event, url: any, lineno: any) => {
    handleError(instance, e, url, lineno);
  });
  // Unhandled Promise rejections; not supported by IE
  on('unhandledrejection', (e: any) => {
    handleError(instance, e);
  });
}

Determine the type of exception

/**
 * W3C browsers support ErrorEvent, so all exception details are read from the event
 *
 * @param {YicheMonitorInstance} instance SDK instance
 * @param {Event} event Resource error, code error, or unhandled rejection
 */
function handleW3C(instance: YicheMonitorInstance, event: any) {
  switch (event.type) {
    // Decide whether this is a script error or a resource error
    case 'error':
      event instanceof ErrorEvent
        ? reportJSError(instance, event)
        : reportResourceError(instance, event);
      break;
    // A Promise was rejected and the rejection was not caught
    case 'unhandledrejection':
      reportPromiseError(instance, event);
      break;
  }
}

Capture exception data

/**
 * Report a JS exception
 *
 * @param {YicheMonitorInstance} instance SDK instance
 * @param {ErrorEvent} event
 */
export default function reportJSError(
  instance: YicheMonitorInstance,
  event: ErrorEvent,
): void {
  // Build the report payload
  const report = new ReportDataStruct('error', 'js');
  const errorInfo = event.error
    ? event.error.message
    : `Unknown error: ${event.message}`;
  // Set the error fields; this also covers the "Script error" case caused by
  // remote scripts that are loaded without the crossorigin attribute
  report.setData({
    det: errorInfo.substring(0, 2000),
    des: event.error ? event.error.stack : '',
    defn: event.filename,
    deln: event.lineno,
    delc: event.colno,
    rre: 1,
  });
}

Handle IE compatibility

When capturing exceptions, only IE compatibility needs special handling. The IE solution is as follows:

/**
 * IE 8 exposes only a few error fields, so for IE 8 we collect just these:
 *
 * 1. Error message
 * 2. Page where the error occurred
 * 3. Error line number (files are usually minified, so line statistics for IE 8 mean little)
 *
 * @param {string} error Error message
 * @param {string | undefined} url URL where the exception occurred
 * @param {number | undefined} lineno Line number of the exception; IE provides no column
 */
export function handleIE8Error(
  error: string,
  url?: string | undefined,
  lineno?: number | undefined,
) {
  return {
    colno: 0,
    lineno: lineno,
    filename: url,
    message: error,
    error: {
      message: error,
      stack: `IE8 Error:${error}`,
    },
  } as ErrorEvent;
}
/**
 * In IE 9 the error details have to be read from the event target
 *
 * @param { Element | any } error The element that raised the exception in IE 9
 */
export function handleIE9Error(error: any) {
  // Get the underlying event
  const event = error.currentTarget.event;
  return {
    colno: event.errorCharacter,
    lineno: event.errorLine,
    filename: event.errorUrl,
    message: event.errorMessage,
    error: {
      message: event.errorMessage,
      stack: `IE9 Error:${event.errorMessage}`,
    },
  } as ErrorEvent;
}
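
The article does not show how `handleError` chooses between the IE 8, IE 9, and W3C branches. One plausible way, sketched below purely as an assumption, is to dispatch on the shape of the arguments. The `NormalizedError` type and `normalizeError` function are hypothetical names, and the plain objects in the usage lines stand in for real browser events.

```typescript
// Minimal shape shared by all branches (a subset of ErrorEvent's fields).
interface NormalizedError {
  message: string;
  filename?: string;
  lineno?: number;
  colno: number;
  source: 'ie8' | 'ie9' | 'w3c';
}

// Hypothetical dispatcher: picks a branch from the shape of the arguments.
function normalizeError(e: any, url?: string, lineno?: number): NormalizedError {
  // IE 8 style: the handler receives (message, url, lineno) as plain values.
  if (typeof e === 'string') {
    return { message: e, filename: url, lineno: lineno, colno: 0, source: 'ie8' };
  }
  // IE 9 style: the details live on currentTarget.event.
  const legacy = e && e.currentTarget && e.currentTarget.event;
  if (legacy && typeof legacy.errorMessage === 'string') {
    return {
      message: legacy.errorMessage,
      filename: legacy.errorUrl,
      lineno: legacy.errorLine,
      colno: legacy.errorCharacter,
      source: 'ie9',
    };
  }
  // W3C style: the event itself carries message, filename, lineno, and colno.
  return {
    message: e && e.message ? String(e.message) : '',
    filename: e ? e.filename : undefined,
    lineno: e ? e.lineno : undefined,
    colno: e && e.colno ? e.colno : 0,
    source: 'w3c',
  };
}

const fromIE8 = normalizeError('boom', 'a.js', 3);
const fromIE9 = normalizeError({
  currentTarget: {
    event: { errorMessage: 'x', errorUrl: 'b.js', errorLine: 1, errorCharacter: 2 },
  },
});
const fromW3C = normalizeError({ message: 'm', filename: 'c.js', lineno: 4, colno: 5 });
```

Normalizing early means the reporting functions only ever see one shape, whichever browser produced the error.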

Performance acquisition scheme

Browser page loading process

How to obtain performance metrics

Using the browser's native Navigation Timing API, we can obtain the performance data for each step of the page loading process shown above and use it for performance analysis. The timestamps it returns are in milliseconds.

In addition, APIs such as PerformanceObserver are used to measure key metrics such as FCP, LCP, FID, TTI, TBT, and CLS.

Detailed calculation formula

| Metric | Meaning | Formula |
| --- | --- | --- |
| ttfb | Time to first byte | timing.responseStart - timing.requestStart |
| domReady | DOM ready time | timing.domContentLoadedEventEnd - timing.fetchStart |
| pageLoad | Full page load time | timing.loadEventStart - timing.fetchStart |
| dns | DNS lookup time | timing.domainLookupEnd - timing.domainLookupStart |
| tcp | TCP connection time | timing.connectEnd - timing.connectStart |
| ssl | SSL connection time | timing.secureConnectionStart > 0 ? timing.connectEnd - timing.secureConnectionStart : 0 |
| contentDownload | Content transfer time | timing.responseEnd - timing.responseStart |
| domParse | DOM parsing time | timing.domInteractive - timing.responseEnd |
| resourceDownload | Resource loading time | timing.loadEventStart - timing.domContentLoadedEventEnd |
| waiting | Request-to-response time | timing.responseStart - timing.requestStart |
| fpt | White-screen time (legacy metric) | timing.responseEnd - timing.fetchStart |
| tti | Time to first interactive | timing.domInteractive - timing.fetchStart |
| firstByte | First packet time | timing.responseStart - timing.domainLookupStart |
| domComplete | DOM completion time | timing.domComplete - timing.domLoading |
| fp | White-screen time (new metric) | performance.getEntriesByType('paint')[0] |
| fcp | First contentful paint | performance.getEntriesByType('paint')[1] |
| lcp | Largest contentful paint time | PerformanceObserver observing 'largest-contentful-paint' entries |
| Fast-load ratio | Share of fast page loads | Sampled PV with full page load time ≤ a threshold (e.g. 2s) / total sampled PV * 100% |
| Slow-load ratio | Share of slow page loads | Sampled PV with full page load time ≥ a threshold (e.g. 5s) / total sampled PV * 100% |
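
Several of the formulas above can be checked with a small pure function. The sketch below takes any object shaped like `performance.timing`; the `TimingLike` interface, the `computeMetrics` and `fastLoadRatio` names, and the sample timestamps are assumptions made for illustration. In a browser you would pass `performance.timing` itself, which is deprecated in favour of `PerformanceNavigationTiming` but is the API the table is written against.

```typescript
// Subset of the performance.timing fields used by the formulas in the table.
interface TimingLike {
  fetchStart: number;
  domainLookupStart: number;
  domainLookupEnd: number;
  connectStart: number;
  connectEnd: number;
  secureConnectionStart: number;
  requestStart: number;
  responseStart: number;
  responseEnd: number;
  domInteractive: number;
  domContentLoadedEventEnd: number;
  loadEventStart: number;
}

// Apply the table's formulas; all results are durations in milliseconds.
function computeMetrics(t: TimingLike) {
  return {
    ttfb: t.responseStart - t.requestStart,
    domReady: t.domContentLoadedEventEnd - t.fetchStart,
    pageLoad: t.loadEventStart - t.fetchStart,
    dns: t.domainLookupEnd - t.domainLookupStart,
    tcp: t.connectEnd - t.connectStart,
    ssl: t.secureConnectionStart > 0 ? t.connectEnd - t.secureConnectionStart : 0,
    contentDownload: t.responseEnd - t.responseStart,
    domParse: t.domInteractive - t.responseEnd,
    resourceDownload: t.loadEventStart - t.domContentLoadedEventEnd,
  };
}

// Share of sampled page views whose full load time is <= limitMs, in percent.
function fastLoadRatio(pageLoads: number[], limitMs: number): number {
  if (pageLoads.length === 0) {
    return 0;
  }
  const fast = pageLoads.filter((ms) => ms <= limitMs).length;
  return (fast / pageLoads.length) * 100;
}

// Made-up timestamps for one page view.
const sampleTiming: TimingLike = {
  fetchStart: 1000,
  domainLookupStart: 1010,
  domainLookupEnd: 1030,
  connectStart: 1030,
  connectEnd: 1080,
  secureConnectionStart: 1050,
  requestStart: 1080,
  responseStart: 1180,
  responseEnd: 1230,
  domInteractive: 1500,
  domContentLoadedEventEnd: 1600,
  loadEventStart: 2200,
};
const sampleMetrics = computeMetrics(sampleTiming);
const sampleFastRatio = fastLoadRatio([1200, 1800, 2500, 6000], 2000);
```

With these sample values the page load time is 1200 ms, and two of the four sampled page views fall under the 2 s limit, giving a fast-load ratio of 50%.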

Network request collection scheme

Network requests are captured by proxying XHR with Object.defineProperty. The key code is as follows.

Rewriting XMLHttpRequest

For this part, you can refer directly to how ajax-hook is implemented.

export function hook(proxy) {
  window[realXhr] = window[realXhr] || XMLHttpRequest
  XMLHttpRequest = function () {
    const xhr = new window[realXhr];
    for (let attr in xhr) {
      let type = "";
      try {
        type = typeof xhr[attr]
      } catch (e) {
      }
      if (type === "function") {
        this[attr] = hookFunction(attr);
      } else {
        Object.defineProperty(this, attr, {
          get: getterFactory(attr),
          set: setterFactory(attr),
          enumerable: true
        })
      }
    }
    const that = this;
    xhr.getProxy = function () {
      return that
    }
    this.xhr = xhr;
  }
  return window[realXhr];
}
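
The core trick in the code above is defining a getter and setter on a wrapper for every non-function property of the real object. The sketch below isolates that idea with a plain object standing in for an XMLHttpRequest; `proxyProperties`, `Dict`, `Hook`, and `fakeXhr` are hypothetical names for this example and are not part of ajax-hook.

```typescript
type Dict = Record<string, unknown>;
type Hook = (key: string, value: unknown) => void;

// Build a wrapper whose data properties forward to target through hooks.
function proxyProperties(target: Dict, onGet: Hook, onSet: Hook): Dict {
  const wrapper: Dict = {};
  for (const key in target) {
    const value = target[key];
    if (typeof value === 'function') {
      // Methods are forwarded with `this` bound to the real object.
      wrapper[key] = value.bind(target);
      continue;
    }
    Object.defineProperty(wrapper, key, {
      get() {
        const current = target[key];
        onGet(key, current);
        return current;
      },
      set(next: unknown) {
        onSet(key, next);
        target[key] = next;
      },
      enumerable: true,
    });
  }
  return wrapper;
}

// Usage with a plain object standing in for a real XMLHttpRequest.
const accessLog: string[] = [];
const fakeXhr: Dict = { status: 0, timeout: 0 };
const proxiedXhr = proxyProperties(
  fakeXhr,
  (key) => { accessLog.push('get ' + key); },
  (key, value) => { accessLog.push('set ' + key + '=' + String(value)); },
);
proxiedXhr['timeout'] = 3000;            // goes through the setter
const readStatus = proxiedXhr['status']; // goes through the getter
```

Because every read and write passes through the hooks, a monitoring SDK can observe fields such as the status or the response without the calling code noticing the wrapper.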

Intercept all requests

A page normally requests multiple interfaces, say 20 requests.
We want to wait until all requests in a given phase have finished, then merge them into one record and report it. This effectively reduces the number of concurrent reporting requests.

The key code is as follows :

import { proxy } from 'ajax-hook';

/**
 * Ajax request plug-in
 *
 * @author wubaiqing <wubaiqing@vip.qq.com>
 */
// All pending request records, plus one entry per request to track the total
let allRequestRecordArray: any = [];
let allRequestRecordCount: any = [];
// Successful requests (2xx and 304)
let allRequestData: any = [];
// Failed requests (timeouts, 405, and so on)
let errorData: any = [];
/**
 * Listen for Ajax request information
 *
 * @param {YicheMonitorInstance} instance SDK instance
 */
export default function setupAjaxPlugin(instance: YicheMonitorInstance) {
  let id = 0;
  proxy({
    onRequest: (config, handler) => {
      // Skip requests to Tingyun, Sherlock Holmes, and other APM domains
      if (filterDomain(config)) {
        // Add the request record to the queue
        allRequestRecordArray.push({
          id,
          timeStamp: new Date().getTime(), // Request start time, used to compute duration
          config, // Request URL, body, and so on
          handler, // XHR handler
        });
        // Record the total number of requests
        allRequestRecordCount.push(1);
        id++;
      }
      handler.next(config);
    },
    // Fires once when a request fails
    onError: (err, handler) => {
      if (allRequestRecordArray.length === 0) {
        handler.next(err);
        return;
      }
      for (let i = 0; i < allRequestRecordArray.length; i++) {
        // Current record
        const currentData = allRequestRecordArray[i];
        if (
          currentData.handler.xhr.status === 0 && // No response received
          currentData.handler.xhr.readyState === 4
        ) {
          errorData.push(
            JSON.stringify(handleReportDataStruct(instance, currentData)),
          );
          allRequestRecordArray.splice(i, 1);
          i--; // Step back so the record shifted into slot i is not skipped
        }
      }
      sendAllRequestData(instance);
      handler.next(err);
    },
    onResponse: (response, handler) => {
      // Nothing is being tracked, so pass the response straight through
      if (allRequestRecordArray.length === 0) {
        handler.next(response);
        return;
      }
      for (let i = 0; i < allRequestRecordArray.length; i++) {
        // Current record
        const currentData = allRequestRecordArray[i];
        // Once a request has finished loading, success or failure, it counts as one request
        if (currentData.handler.xhr.readyState === 4) {
          // A normal request
          if (
            (currentData.handler.xhr.status >= 200 &&
              currentData.handler.xhr.status < 300) ||
            currentData.handler.xhr.status === 304
          ) {
            allRequestData.push(
              JSON.stringify(handleReportDataStruct(instance, currentData)),
            );
          } else {
            if (currentData.handler.xhr.status > 0) {
              // A failed request that returned a status code
              errorData.push(
                JSON.stringify(handleReportDataStruct(instance, currentData)),
              );
            }
          }
          // Remove the current record from the array
          allRequestRecordArray.splice(i, 1);
          i--; // Step back so the record shifted into slot i is not skipped
        }
      }
      // Send the data
      sendAllRequestData(instance);
      handler.next(response);
    },
  });
}
function sendAllRequestData(instance) {
  // Report only when every recorded request has finished
  if (
    allRequestData.length + errorData.length ===
    allRequestRecordCount.length
  ) {
    // Report the merged record for all requests
    if (allRequestData.length > 0 || errorData.length > 0) {
      handleAllRequestData(instance);
    }
    // Report failed requests
    if (errorData.length > 0) {
      handleErrorData(instance);
    }
    // Reset the pending records and the total
    allRequestRecordArray = [];
    allRequestRecordCount = [];
    // Reset successful requests (2xx and 304)
    allRequestData = [];
    // Reset failed requests (timeouts, 405, and so on)
    errorData = [];
  }
}

Probe loading scheme

There are two ways to load the probe, each with advantages and disadvantages:
Synchronous loading: the collection SDK is placed in the head, before all other JS requests. Load order matters here: if the SDK is placed after other JS, errors in the earlier JS cannot be captured. Because the SDK's JS resource has to be loaded up front, it has some impact on performance.
Asynchronous loading: the collection SDK is injected into the page by JS at runtime. If the first-screen JS can be guaranteed to be free of errors, the SDK can be loaded asynchronously, which helps first-screen optimization.

We currently use the first approach, synchronous loading.

Product screenshots

Home page

The home page shows all application information, so abnormal data for each application can be spotted at a glance.

Dashboard page

To inspect a specific application, go to that application's dashboard page.

It mainly shows the application's key front-end data for the last hour.
Currently the main measures are page performance, resource exceptions, JS exceptions, API success rate, and other key metrics.

Details page

On the details page you can see the application's detailed data. It helps the team move from after-the-fact tracing and rectification to early warning and fast root-cause identification.

Problems encountered

After the SDK collects metrics and reports the data, the server originally performed some pre-processing before the data landed, such as:

  • Filtering out blacklisted entries.
  • Peak shaving and valley filling for metrics.
  • Converting application information.
  • Obtaining the client IP.
  • Validating the token.

Pre-processing has a drawback: the server has to run the parsing and conversion on every report. When the data volume reached about 70 million records per day, the reporting server could no longer keep up.
So we changed the pre-processing into post-processing that runs after the data has landed. Post-processing happens during data cleaning, where blacklisted entries and abnormal metrics are filtered out. This reduces the pressure on the reporting server.
The data warehouse also keeps all the raw data, so if something goes wrong we can trace the source and recover the data.
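
The post-processing idea can be sketched as a filter that runs over records that have already landed and leaves the raw data untouched. This is a minimal sketch under stated assumptions: the `RawRecord` fields, the `cleanRecords` name, the application-level blacklist, and the value limit are all hypothetical, since the article does not describe the actual cleaning rules.

```typescript
// Hypothetical raw record as it might land in storage (field names are assumptions).
interface RawRecord {
  appId: string;
  metric: string;
  value: number;
}

// Post-processing step: drop blacklisted apps and abnormal values during cleaning.
// The raw input array is not modified, so the original data can still be traced.
function cleanRecords(
  raw: RawRecord[],
  blacklistedApps: string[],
  maxValue: number,
): RawRecord[] {
  return raw.filter(
    (r) =>
      blacklistedApps.indexOf(r.appId) === -1 &&
      isFinite(r.value) &&
      r.value >= 0 &&
      r.value <= maxValue,
  );
}

const landedRecords: RawRecord[] = [
  { appId: 'shop', metric: 'pageLoad', value: 1200 },
  { appId: 'spam', metric: 'pageLoad', value: 900 },
  { appId: 'shop', metric: 'pageLoad', value: -5 },
  { appId: 'news', metric: 'pageLoad', value: 999999 },
];
const cleanedRecords = cleanRecords(landedRecords, ['spam'], 60000);
```

Keeping the filter out of the reporting path means the reporting server only has to accept and persist data, which is the load reduction the section describes.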

Overall planning

The plan has four phases, and the project is currently in phase 2, performance monitoring.

| Plan | Goal | Priority | Supported platforms | Main problems to solve |
| --- | --- | --- | --- | --- |
| Phase 1 | Exception monitoring | High | PC, Mobile, Mini Program | User impact of exceptions; awareness of resource loading exceptions; awareness of network request exceptions; awareness of code errors; detailed analysis of code errors (SourceMap) |
| Phase 2 | Performance monitoring | High | | Performance values (first byte, DOMReady, full page load, redirect, DNS, TCP, request response, etc.); API monitoring (success rate, time taken on success, failure count, etc.); statistics of resources referenced by the page and their share (JS, CSS, images, fonts, iFrame, Ajax, etc.); percentile comparison (95% of users, 99% of users, average user) |
| Phase 3 | Event tracking | Medium | | Operating system, resolution, browser; event classification (click events, scroll events); specific event types (e.g. clicking a banner image); event time; position where the event was triggered (mouse X and Y, from which heat maps can be generated); visitor identifier; user ID; link collection |
| Phase 4 | Behavior collection | Low | | Entering a page; leaving a page; clicking an element; scrolling a page; operation path; custom events (e.g. clicking an ad slot image); a Chrome plug-in for viewing tracking points directly |

Other

A self-developed APM system is easy to connect and integrate with internal systems. For example, SourceMap files can be pushed directly when an application is released, and page performance can be analyzed automatically after a release goes live.
If building such a system is not necessary at your current stage but the business needs the capability, you can also consider third-party products.

Commercial product analysis

| Feature | Yiche | Tingyun | Alibaba Cloud ARMS | Fundebug | Yueying | FrontJS |
| --- | --- | --- | --- | --- | --- | --- |
| Page performance monitoring | Full | Basic | Full | Weak | Full | Full |
| Exception monitoring | Basic | Basic | Full | Full | Full | Full |
| API monitoring | Full | Basic | Full | Basic | Basic | Basic |
| Page loading waterfall | None | Full | Basic | None | None | Full |
| Interactivity | Good | Average | Good | Unclear | Good | Good |

Key metrics compared with Alibaba Cloud ARMS

We compared several key metrics between Yiche Front-End Monitoring and Alibaba Cloud ARMS; the mean values differ by roughly 5%-8%.


Copyright notice
This article was written by [Wu Baiqing]. Please include a link to the original when reposting. Thank you.
https://qdmana.com/2021/02/20210222163236924O.html
