Front-end crawler framework: Introduction to Puppeteer (1)

Treat you as before 2020-11-12 22:02:41



I started learning this technology because I wanted to build a movie resource website similar to Renren film and television, so I wanted to learn web crawling in order to collect the relevant movie resources for download.

Most of what I had heard about crawlers was that they are written in Python. Because I'm busy at work and don't have much time to learn a new language, I searched online to see whether there was a front-end crawler framework.

Most of the recommendations online point to a Node library: puppeteer.

What is Puppeteer?

Puppeteer is a Node library that provides a set of APIs for controlling Chrome. Put simply, it drives a headless Chrome browser (you can also configure it to run with a UI; by default there is none). Since it is a real browser, Puppeteer can do anything we can do by hand in the browser, such as mouse and keyboard operations.

What can Puppeteer do?

1. Generate screenshots and PDFs of web pages
2. Crawling (the most common use): it can crawl pages that load their content asynchronously (basically, anything you can see in the browser can be crawled)
3. Simulate user operations (for example: mouse clicks, submitting forms, opening / closing / logging in to web pages)
4. Implement automated UI testing and help analyze site performance
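
As a small illustration of item 3, simulating a form submission might look like the sketch below. The helper name fillAndSubmit and the selectors in the usage note are made up for this example; page.type, page.click and page.waitForNavigation are Puppeteer Page methods. It assumes that clicking the submit button triggers a page navigation.

```javascript
// Hypothetical helper: type a value into each field, then click the submit
// button and wait for the navigation that the click is assumed to trigger.
async function fillAndSubmit(page, fields, submitSelector) {
  for (const [selector, value] of Object.entries(fields)) {
    await page.type(selector, value); // simulated keyboard input
  }
  // Start waiting for navigation before clicking, so the event is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click(submitSelector), // simulated mouse click
  ]);
}
```

With a real page object this could be called as fillAndSubmit(page, { "#user": "alice", "#password": "secret" }, "#submit"), where the selectors depend entirely on the target page.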

Operating environment and installation

Most operations in puppeteer are asynchronous, so the documentation and examples make heavy use of the async and await syntax (standardized in ES2017).
The official requirements at the moment are:

 Before puppeteer v1.18.1, Node needs to be at least v6.4.0.
From v1.18.1 to v2.1.0, Node needs to be at least v8.9.0.
From v3.0.0 onward, Node needs to be at least v10.18.1.
And if you want to use async/await, Node needs to be at least v7.6.0.
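
As a rough illustration, the version rules above can be written as a small lookup. The function names (compareVersions, minNodeVersionFor, nodeSatisfies) are made up for this sketch. It only covers the ranges listed above, assumes that releases between v2.1.0 and v3.0.0 follow the v8.9.0 rule, and does not know about newer Puppeteer releases, which may require newer Node versions.

```javascript
// Compare two dotted version strings such as "1.18.1".
// Returns a negative number, zero, or a positive number.
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Minimum Node version for a given puppeteer version, per the list above.
// Assumption: versions in the gap between 2.1.0 and 3.0.0 need 8.9.0 too.
function minNodeVersionFor(puppeteerVersion) {
  if (compareVersions(puppeteerVersion, "1.18.1") < 0) return "6.4.0";
  if (compareVersions(puppeteerVersion, "3.0.0") < 0) return "8.9.0";
  return "10.18.1";
}

// Check a Node version (the running one by default) against that minimum.
function nodeSatisfies(puppeteerVersion, nodeVersion = process.versions.node) {
  return compareVersions(nodeVersion, minNodeVersionFor(puppeteerVersion)) >= 0;
}
```

For example, nodeSatisfies("3.0.0", "8.9.0") is false, because v3.0.0 needs at least Node v10.18.1.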

You do not need to install a browser yourself: when you run npm install puppeteer, it also downloads a recent version of Chromium that is known to work with that version of the API.

Install puppeteer with npm, cnpm, or yarn:
npm install puppeteer --save
cnpm install puppeteer --save
yarn add puppeteer (installing with yarn may sometimes fail)

A simple example (taking a screenshot)

Once puppeteer is installed, we can write a simple example and start learning.

// 1. Import puppeteer
const puppeteer = require("puppeteer");
// 2. Launch puppeteer, which starts the browser
puppeteer
  .launch({
    ignoreHTTPSErrors: true,
    headless: false,
    slowMo: 250,
    defaultViewport: {
      width: 1920,
      height: 1080,
    },
    timeout: 0,
  })
  .then(async (browser) => {
    // 3. Create a new browser page
    let newPage = await browser.newPage();
    // 4. Navigate the page to the target URL (left empty here)
    await newPage.goto("");
    // 5. Take a screenshot of the page
    await newPage.screenshot({
      type: "jpeg",
      path: "../index.jpg",
      fullPage: true,
    });
    // 6. Close the browser
    await browser.close();
  });
Demo result

The screenshot we wanted has been saved in the project's top-level directory.

Code walkthrough (based on the source code above)

1. puppeteer.launch(options)

 This method starts the Chrome browser. It returns a Promise; use then (or await) to get the browser instance, which you can then use to operate the browser.
Parameters options (some common ones):
(1) ignoreHTTPSErrors <Boolean>: whether to ignore HTTPS errors during navigation. Defaults to false.
(2) headless <Boolean>: whether to run the browser in headless mode. Defaults to true. Headless simply means the browser runs without a visible window; set it to false to watch the operations in a browser UI.
(3) slowMo <Number>: slows down each puppeteer operation by the specified number of milliseconds, so you can see what each operation does. This is very useful when debugging.
(4) defaultViewport <Object>:
width: width of the page in pixels. Defaults to 800.
height: height of the page in pixels. Defaults to 600.
(5) timeout <Number>: maximum time in milliseconds to wait for the Chrome instance to start. Defaults to 30000 (30 seconds). Pass 0 to disable the timeout.
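
The defaults listed above can be collected in one place and overridden per call. The sketch below is only an illustration: DEFAULT_LAUNCH_OPTIONS, buildLaunchOptions and launchBrowser are names invented here, the slowMo default of 0 (no delay) is not stated in the list above, and the option names are the ones described in this article (they may differ in later Puppeteer releases).

```javascript
// Defaults as described in the parameter list above (slowMo: 0 means no delay).
const DEFAULT_LAUNCH_OPTIONS = {
  ignoreHTTPSErrors: false,
  headless: true,
  slowMo: 0,
  defaultViewport: { width: 800, height: 600 },
  timeout: 30000,
};

// Hypothetical helper: merge caller overrides onto the defaults,
// including a partial defaultViewport such as { width: 1920 }.
function buildLaunchOptions(overrides = {}) {
  return {
    ...DEFAULT_LAUNCH_OPTIONS,
    ...overrides,
    defaultViewport: {
      ...DEFAULT_LAUNCH_OPTIONS.defaultViewport,
      ...(overrides.defaultViewport || {}),
    },
  };
}

// Hypothetical helper: the async/await form of puppeteer.launch(options).
// puppeteer is required lazily, so buildLaunchOptions works without it.
async function launchBrowser(overrides) {
  const puppeteer = require("puppeteer");
  return puppeteer.launch(buildLaunchOptions(overrides));
}
```

Calling launchBrowser({ headless: false, slowMo: 250 }) inside an async function would then open a visible browser with every operation slowed by 250 ms, while keeping the other defaults.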

2. browser.newPage()

 This method returns a Promise that resolves to a new Page object, which represents a new page (tab) created in the browser.

3. newPage.goto(url, options)

 This method sets the URL in the new page's address bar and navigates to that address.
Parameters:
(1) url <String>: the address to navigate to. It should include the http or https protocol, for example: https://
(2) options:
timeout <Number>: maximum time to wait for the navigation, in milliseconds. Defaults to 30 seconds. Set it to 0 to disable the timeout and wait indefinitely.
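
A navigation can fail, for instance when the timeout is exceeded, in which case the Promise returned by goto is rejected. The sketch below shows one way to check the URL first and report a failure instead of throwing. The helper names hasHttpScheme and tryGoto, and the shape of the returned object, are assumptions made for this example.

```javascript
// Hypothetical helper: true only for absolute http:// or https:// URLs.
function hasHttpScheme(url) {
  try {
    const { protocol } = new URL(url);
    return protocol === "http:" || protocol === "https:";
  } catch (err) {
    return false; // not an absolute URL at all, e.g. "example.com"
  }
}

// Hypothetical helper: navigate and return a result object.
// timeoutMs is passed to goto as the timeout option (0 disables it).
async function tryGoto(page, url, timeoutMs = 30000) {
  if (!hasHttpScheme(url)) {
    return { ok: false, reason: "URL must start with http:// or https://" };
  }
  try {
    await page.goto(url, { timeout: timeoutMs });
    return { ok: true };
  } catch (err) {
    // goto rejects on timeout or when the navigation itself fails
    return { ok: false, reason: err.message };
  }
}
```

Inside an async function, const result = await tryGoto(newPage, "https://example.com", 10000) would wait at most 10 seconds; the address is only a placeholder.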

4. newPage.screenshot(options)

 This method takes a screenshot of the open page. It returns a Promise that resolves to a buffer containing the screenshot.
Parameters options:
(1) path <String>: the file path to save the screenshot to. The image type is inferred from the file extension. A relative path is resolved relative to the current working directory (a relative path is recommended here). If no path is given, the image is not saved to disk.
(2) type <String>: the screenshot type, jpeg or png. Defaults to png.
(3) quality <Number>: image quality, from 0 to 100. Not available for the png format.
(4) fullPage <Boolean>: if true, captures the complete page, including the parts that require scrolling. Defaults to false.
(5) clip <Object>: the region of the page to capture:
x <Number>: x coordinate of the region's upper-left corner, relative to the page origin (0, 0)
y <Number>: y coordinate of the region's upper-left corner, relative to the page origin (0, 0)
width <Number>: width of the region
height <Number>: height of the region
(6) omitBackground <Boolean>: hides the default white background so that the screenshot can have a transparent background (very useful for the png format). Defaults to false.
(7) encoding <String>: the image encoding, either base64 or binary. Defaults to binary. Converting the image encoding is very helpful when uploading and downloading images.
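
To illustrate how path, type and quality interact, here is a minimal sketch that builds the options object from a file path and passes quality only for jpeg. The helper names screenshotOptionsFor and screenshotAsBase64 are invented for this example, and treating every extension other than .jpg or .jpeg as png is a simplification.

```javascript
const path = require("path");

// Hypothetical helper: derive screenshot options from the target file path.
function screenshotOptionsFor(filePath, { quality, fullPage = false } = {}) {
  const ext = path.extname(filePath).toLowerCase();
  const type = ext === ".jpg" || ext === ".jpeg" ? "jpeg" : "png";
  const options = { path: filePath, type, fullPage };
  // quality is not available for png, so only pass it along for jpeg
  if (type === "jpeg" && quality !== undefined) {
    options.quality = quality;
  }
  return options;
}

// Hypothetical helper: with encoding "base64" and no path, the Promise
// resolves to a base64 string and nothing is written to disk.
async function screenshotAsBase64(page) {
  return page.screenshot({ type: "png", encoding: "base64" });
}
```

For example, await newPage.screenshot(screenshotOptionsFor("../index.jpg", { quality: 80, fullPage: true })) would save a full-page jpeg at quality 80.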

5. browser.close()

 Closes Chromium and all of its pages (if any are open). After this, the Browser object itself is considered disposed and cannot be used again; to continue, you have to launch a new browser.
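
If an error is thrown before browser.close() is reached, the browser process can be left running. One common way to avoid that is a try/finally wrapper, sketched below. The helper name withBrowser and its launch parameter are assumptions made for this example, not part of Puppeteer.

```javascript
// Hypothetical helper: run a task with a browser and always close it after,
// whether the task succeeds or throws.
// `launch` is any function that returns a Promise of a browser,
// e.g. () => require("puppeteer").launch().
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}
```

Inside an async function it could be used as await withBrowser(() => require("puppeteer").launch(), async (browser) => { const page = await browser.newPage(); /* work with the page */ }).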
This article was written by [Treat you as before]. If you repost it, please include a link to the original. Thank you.
