Common Java techniques: network requests with HttpClient + web page parsing with jsoup (hands-on example)

zyqok 2020-11-13 10:12:58


【Preface】

Have you ever envied the Python experts who can pull off something like this:

They casually run a bit of code and scrape a large amount of the data they want in a loop.

There's no need to envy them too much: this isn't something only Python can do. It's just as easy in Java.

Without further ado, let's get straight to it:

【1】 Create the project

(1.1) In IDEA (Eclipse works the same way), create a new Maven project. I named mine zyqok; name yours whatever you like.

(1.2) Add a <dependencies> element to pom.xml.

(1.3) Create a Test class. The project is now set up.

【2】 Making network requests with HttpClient

(2.1) What is HttpClient?

HttpClient is an Apache subproject: a Java client toolkit for making network requests.

Put simply, it is a JAR package. With it, our Java programs can make network requests.

(2.2) Copy the following HttpClient dependency into pom.xml.

<!-- HttpClient core package -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

(2.3) Create an HttpTool class dedicated to network request methods.

(2.4) To avoid infringing on other websites, I'll use a page from my personal site as the example (http://www.zyqok.cn/material/index). Let's grab all the images on this page.

(2.5) As you can see, this is a GET request and the response is an HTML page. So we add the following method stub to the HttpTool class:

/**
 * Performs a GET request.
 * @param url the request URL
 * @return the page content
 */
public static String doGet(String url) {
    return null;
}

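(2.6) Copy the following code into the method body to implement the GET request:

// Needs imports: java.io.ByteArrayOutputStream, java.io.InputStream,
// org.apache.http.client.methods.CloseableHttpResponse,
// org.apache.http.client.methods.HttpGet,
// org.apache.http.impl.client.CloseableHttpClient,
// org.apache.http.impl.client.HttpClients

// Build the GET request
HttpGet get = new HttpGet(url);
// Create the client and execute the request; try-with-resources
// closes the client, the response, and the stream when done
try (CloseableHttpClient client = HttpClients.createDefault();
     CloseableHttpResponse response = client.execute(get);
     InputStream in = response.getEntity().getContent()) {
    // Collect the raw bytes first and decode once at the end, so a
    // multi-byte UTF-8 character split across two reads is not corrupted
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] b = new byte[8192];
    int length;
    while ((length = in.read(b)) != -1) {
        out.write(b, 0, length);
    }
    // Return the page content
    return out.toString("UTF-8");
} catch (Exception e) {
    e.printStackTrace();
    return null;
}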
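When a response stream is read by hand, it matters where the bytes are turned into text. The following is a minimal JDK-only sketch (the StreamText class and its method names are hypothetical, not part of HttpTool) that contrasts decoding every chunk separately with collecting all bytes and decoding once. A tiny buffer is used on purpose so that a multi-byte UTF-8 character is split across two reads; with a large buffer the same split can still happen whenever a network read happens to end in the middle of a character.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class StreamText {

    // Decodes each chunk on its own; a character split across two reads
    // is replaced by U+FFFD replacement characters.
    static String decodePerChunk(InputStream in, int bufferSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[bufferSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Collects all bytes first and decodes them once at the end.
    static String decodeOnce(InputStream in, int bufferSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[bufferSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "café": the last character takes two bytes in UTF-8
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // A 4-byte buffer cuts the two-byte character in half
        System.out.println(decodePerChunk(new ByteArrayInputStream(data), 4));
        System.out.println(decodeOnce(new ByteArrayInputStream(data), 4));
    }
}
```

For pages that contain Chinese text, where most characters take three bytes in UTF-8, the second approach is the safer choice.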

(2.7) OK, the network request implementation is written. Let's test it by adding the following code to the main method of the Test class:

 String html = HttpTool.doGet("http://www.zyqok.cn/material/index");
System.out.println(html);

(2.8) Run the program and check the result. You can see that the request went through and we got back the content of the web page.

【3】 Parsing web pages with jsoup

In 【2】 we got the page's data, but what we want is the images on the page, not this messy raw page data. So what do we do? Simple: we bring in another technology, jsoup.

(3.1) What is jsoup?

Here is the definition given by Baidu: jsoup is a Java HTML parser that can directly parse a URL or HTML text. It provides a very convenient API for extracting and manipulating data through DOM, CSS, and jQuery-like methods (source: Baidu).

In my own words: jsoup is used to process HTML pages and XML data. We can use it to process the HTML page returned in 【2】.

(3.2) Add the jsoup dependency

Add the following dependency to pom.xml:

<!-- jsoup core package -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>

(3.3) Before using jsoup, we need to analyze the HTML page in the response. The point of the analysis is to work out how to locate and filter the data we need.

Copy the page response from 【2】 into a text file, and you can see that every image sits inside a div, and that div has a class named material-div.

(3.4) Based on this analysis, we first need to get all the divs that contain images, so we change the code in the main method as follows:

String html = HttpTool.doGet("http://www.zyqok.cn/material/index");
// Parse the HTML page into a Document object
Document doc = Jsoup.parse(html);
// Select all div elements with class material-div
Elements elements = doc.select("div.material-div");
for (Element div : elements) {
    System.out.println(div.toString());
}

Note: the argument to doc.select() is a selector. It works essentially the same way as jQuery selectors, so anyone who knows jQuery will find writing them easy. If you don't, there's no need to worry about writing selectors: the jsoup usage guide covers them (link: official jsoup usage guide).

(3.5) Run the code and again copy the output into a text file.

You can see that this time the output contains only the image-related div elements. But that's not the final result we want: we want all the images.

So we need to continue the analysis: how do we get the link and name of each image?

(3.6) Every image's div has the same structure, so we can pick any one to analyze. Taking the first div, the structure is as follows:

<div align="center" style="padding: 10px;" class="material-div">
<div style="width: 80px; height: 80px; margin-bottom: 3px; display: flex; align-items: center; justify-content: center">
<img class="fangda image" src="https://zyqok.oss-cn-chengdu.aliyuncs.com/20200414220946131_ Big tree sunset .jpg">
<input type="hidden" class="materialId" value="121">
</div>
<font style="font-size: 5px"> Big tree sunset .jpg</font><br>
<font style="font-size: 5px">2020-04-14 22:09:46</font>
</div>

(3.7) We can see that the whole structure contains only one img tag, so the src attribute of the first img element is the image link. Likewise, the text of the first font element is the image name.

(3.8) So we can change the code inside the loop as follows:

// Get the first img element
Element img = div.selectFirst("img");
// Get the first font element
Element font = div.selectFirst("font");
// The src attribute of the img element is the image link
String url = img.attr("src");
// The text of the font element is the image name
String name = font.text();
System.out.println(name + ": " + url);
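On the example page the src attributes are already absolute URLs, so they can be requested as-is. On other pages an img tag may carry a relative path instead, which has to be resolved against the page URL before it can be downloaded. A minimal JDK-only sketch, assuming the src value contains no characters that are illegal in a URI (the ImageUrls class and the image paths in the usage note are hypothetical):

```java
import java.net.URI;

// Hypothetical helper class for illustration only.
class ImageUrls {

    // Resolves an img src value against the URL of the page it came from.
    // Absolute src values are returned unchanged.
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }
}
```

For instance, ImageUrls.resolve("http://www.zyqok.cn/material/index", "/img/a.jpg") gives "http://www.zyqok.cn/img/a.jpg". jsoup can also do this itself: if the page is parsed with Jsoup.parse(html, baseUri), then img.absUrl("src") returns the resolved link.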

(3.9) Run the code above and we get the following result.

As you can see, we have successfully captured every image URL and name on this page.

【4】 Saving the images locally

In step 【3】 all we got was links to the images; we haven't downloaded them yet. So next we need to download the images to the local machine to finish the job.

(4.1) Since we're downloading locally, let's first pick a place to store the images.

For example, I'll download all the images to D:\imgs (the imgs folder on drive D).
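Writing into a folder only works if the folder exists; a FileOutputStream pointed at a missing directory fails with a FileNotFoundException. One way to avoid creating the folder by hand is to create it from code before writing. A minimal JDK-only sketch (the ImageStore class and the demo file name are hypothetical; the demo writes into a temporary folder rather than D:\imgs so it runs on any machine):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class for illustration only.
class ImageStore {

    // Creates the target folder if needed, then writes the stream into it.
    static Path save(InputStream in, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir); // does nothing if the folder already exists
        Path target = dir.resolve(fileName);
        // Overwrite a file left over from an earlier run instead of failing
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("imgs-demo").resolve("imgs");
        byte[] fakeImage = {1, 2, 3}; // stand-in for real image bytes
        Path saved = save(new ByteArrayInputStream(fakeImage), dir, "demo.jpg");
        System.out.println("Saved " + Files.size(saved) + " bytes to " + saved);
    }
}
```

With the article's location, the dir argument would be Paths.get("D:\\imgs") and the stream would be the response body of the image request.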

(4.2) Add a method to the HttpTool class that saves an image locally. The code is as follows:

/**
 * Saves an image to the local disk.
 * @param src the image URL
 * @param name the image file name
 */
public static void saveImg(String src, String name) {
    // Also needs the import java.io.FileOutputStream.
    // The D:\imgs folder must already exist.
    // Build the GET request
    HttpGet get = new HttpGet(src);
    // try-with-resources closes the client, the response, and both
    // streams, even when an exception is thrown
    try (CloseableHttpClient client = HttpClients.createDefault();
         CloseableHttpResponse response = client.execute(get);
         InputStream in = response.getEntity().getContent();
         FileOutputStream fos = new FileOutputStream("D:\\imgs\\" + name)) {
        // Copy the response body to the file
        byte[] bytes = new byte[1024];
        int length;
        while ((length = in.read(bytes)) != -1) {
            fos.write(bytes, 0, length);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
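One thing to watch for in image links: HttpGet(String) parses its argument as a URI, so a link containing a raw space is likely to be rejected before any request is sent. A minimal JDK-only sketch that percent-encodes such a link first, assuming the link is absolute and not yet percent-encoded (an existing % would be encoded again). The SafeUrl class and the example.com links are hypothetical:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class for illustration only.
class SafeUrl {

    // Percent-encodes illegal characters (spaces, non-ASCII text) in the
    // path and query of an absolute URL that is not yet encoded.
    static String encode(String raw) {
        int schemeEnd = raw.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("Not an absolute URL: " + raw);
        }
        // Split into scheme, authority (host[:port]), path, and query
        int pathStart = raw.indexOf('/', schemeEnd + 3);
        String scheme = raw.substring(0, schemeEnd);
        String authority = pathStart < 0
                ? raw.substring(schemeEnd + 3)
                : raw.substring(schemeEnd + 3, pathStart);
        String rest = pathStart < 0 ? "" : raw.substring(pathStart);
        int q = rest.indexOf('?');
        String path = q < 0 ? rest : rest.substring(0, q);
        String query = q < 0 ? null : rest.substring(q + 1);
        try {
            // The multi-argument URI constructor quotes illegal characters;
            // toASCIIString() then encodes any remaining non-ASCII characters
            return new URI(scheme, authority, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode URL: " + raw, e);
        }
    }
}
```

For example, SafeUrl.encode("https://example.com/imgs/big tree.jpg") returns "https://example.com/imgs/big%20tree.jpg", which can then be passed to new HttpGet(...).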

(4.3) The final code of the main method in the Test class is as follows:

public static void main(String[] args) throws Exception {
    String html = HttpTool.doGet("http://www.zyqok.cn/material/index");
    // Parse the HTML page into a Document object
    Document doc = Jsoup.parse(html);
    // Select all div elements with class material-div
    Elements elements = doc.select("div.material-div");
    for (int i = 0; i < elements.size(); i++) {
        Element div = elements.get(i);
        // Get the first img element
        Element img = div.selectFirst("img");
        // Get the first font element
        Element font = div.selectFirst("font");
        // The src attribute of the img element is the image link
        String src = img.attr("src");
        // The text of the font element is the image name
        String name = font.text();
        if (!name.contains(".")) {
            name += ".jpg";
        }
        HttpTool.saveImg(src, i + name);
        System.out.println("Image " + i + " downloaded successfully! Image name: " + name);
    }
    System.out.println("All images downloaded!");
}
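The image name comes straight from the page text, so it may contain characters that Windows does not allow in file names, such as : or ?. A minimal JDK-only sketch of building the local file name more defensively (the FileNames class is hypothetical, and the "_" separator between index and name is a choice made here; the code above concatenates them directly):

```java
// Hypothetical helper class for illustration only.
class FileNames {

    // Builds a local file name from the loop index and the scraped name.
    static String localName(int index, String displayName) {
        String name = displayName.trim();
        // Replace characters Windows forbids in file names, plus control characters
        name = name.replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", "_");
        if (name.isEmpty()) {
            name = "image";
        }
        // Same fallback as the main method: default to .jpg when there is no extension
        if (!name.contains(".")) {
            name += ".jpg";
        }
        return index + "_" + name;
    }
}
```

In the loop, the call would become HttpTool.saveImg(src, FileNames.localName(i, font.text())). Replacing the path separators also keeps a crafted name from pointing outside the target folder.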

  

(4.4) Run the code. The output is printed as shown below. Doesn't the result look a bit like the one at the start of the article?

Finally, we just need to check the local folder to see whether all the images were saved. If the images are there, we've succeeded.

(4.5) Open the imgs folder on drive D and you can see that all the images from the page have been saved locally.

【5】 Conclusion

Through this hands-on example of batch-scraping images from the web, we can see that HttpClient and jsoup can do more than grab data in bulk; they can support many other tasks too.

For example: logging in to websites, transferring data between distributed servers, integrating with third-party platform APIs, filtering and saving useful data, data processing, and so on.

Finally, here is an HttpTool utility class written by the author: Java utility class: HttpTool (make HTTP requests and get responses)

Copyright notice
This article was written by [zyqok]. Please include a link to the original when reposting. Thank you.
