PDF to HTML tool: wrapping the pdf2htmlEX command-line tool with Spring Boot

Endless travel 2021-05-04 08:25:51


Convert PDF to HTML without losing text or format.

This project uses Spring Boot to wrap the pdf2htmlEX command-line tool as a web service, which makes converting PDF to HTML more convenient.

For pdf2htmlEX, see:
https://github.com/pdf2htmlEX/pdf2htmlEX

For the pdf2html-service source code, see:
https://github.com/iflyendless/pdf2html-service

Quick start

# Pull the image
docker pull iflyendless/pdf2html-service:1.0.1
# Start the container
docker run --name pdf2html -p 8686:8686 -d --rm iflyendless/pdf2html-service:1.0.1

Usage:

curl -o html.zip --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/example.pdf'

Note: /pdfs/example.pdf is the absolute path of the PDF file.

Unzip html.zip in the current directory and you will see the converted HTML files and 000-task.txt.
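
If you would rather check the archive from code than unzip it by hand, the following is a minimal sketch that lists the entries of html.zip using only the JDK. The class name HtmlZipInspector and the method entryNames are made up for this illustration and are not part of pdf2html-service.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of pdf2html-service.
public class HtmlZipInspector {

    /** Returns the names of all entries in a zip archive read from the given stream. */
    public static List<String> entryNames(InputStream zipStream) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(zipStream, StandardCharsets.UTF_8)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumes html.zip was saved to the current directory by the curl command above.
        try (InputStream in = Files.newInputStream(Paths.get("html.zip"))) {
            entryNames(in).forEach(System.out::println);
        }
    }
}
```

A quick look at 000-task.txt in that listing, and then at its contents, tells you whether any PDF failed to convert.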

Build the image

# Download the code
git clone https://github.com/iflyendless/pdf2html-service.git
# Enter the project directory
cd pdf2html-service
# Package, skipping unit tests
mvn clean package -DskipTests
# Build the Docker image
docker build -t pdf2html-service:1.0.1 .

If the image build fails, check that the JDK version available at https://enos.itcollege.ee/~jpoial/allalaadimised/jdk8/ matches the version downloaded in the Dockerfile.

Start the container

docker run --name pdf2html -p 8686:8686 -d --rm pdf2html-service:1.0.1

If you need to set extra parameters, pass them with -e when starting the container:

# Maximum number of child processes running at the same time; set it according to the system's resources (default 15)
-e PDF2HTML_MAX_PROCESS=15
# Maximum timeout for executing the /usr/local/bin/pdf2htmlEX command, in seconds (default 600s)
-e PDF2HTML_COMMAND_TIMEOUT=600s

For example:

docker run --name pdf2html -p 8686:8686 -e PDF2HTML_MAX_PROCESS=10 -e PDF2HTML_COMMAND_TIMEOUT=60s -d --rm pdf2html-service:1.0.1

For more configuration options, see the application.yml file in the resources directory.

HTTP API

(1) View the version

curl http://localhost:8686/api/version

(2) View the configuration

curl http://localhost:8686/api/config

(3) Upload multiple PDFs and download the zipped HTML

curl -o html.zip --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/001.pdf' --form 'files=@/pdfs/002.pdf' --form 'files=@/pdfs/003.pdf'

Note: /pdfs/001.pdf is the absolute path of the PDF file.
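
The same upload can be done from Java instead of curl. The following is only a sketch under a few assumptions: it needs Java 11 or later for java.net.http (the service itself targets Java 8), the class Pdf2HtmlClient and its helper methods are invented for this example, and the URL and the form field name "files" are taken from the curl command above.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical client, not part of pdf2html-service.
public class Pdf2HtmlClient {

    /** Encodes one file as a multipart/form-data part under the form field "files". */
    public static byte[] filePart(String boundary, String fileName, byte[] content) {
        String header = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"files\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        byte[] head = header.getBytes(StandardCharsets.UTF_8);
        byte[] crlf = "\r\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(content, 0, content.length);
        out.write(crlf, 0, crlf.length);
        return out.toByteArray();
    }

    /** Closing delimiter that terminates the multipart body. */
    public static byte[] closingDelimiter(String boundary) {
        return ("--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String boundary = "----pdf2html" + System.nanoTime();
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (String arg : args) { // each argument is the path of one PDF
            Path pdf = Paths.get(arg);
            byte[] part = filePart(boundary, pdf.getFileName().toString(), Files.readAllBytes(pdf));
            body.write(part, 0, part.length);
        }
        byte[] end = closingDelimiter(boundary);
        body.write(end, 0, end.length);

        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8686/api/pdf2html"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body.toByteArray()))
                .build();
        // Save the zipped HTML to html.zip in the current directory
        HttpResponse<Path> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofFile(Paths.get("html.zip")));
        System.out.println("HTTP " + response.statusCode() + ", saved to " + response.body());
    }
}
```

Run it with the PDF paths as arguments, for example: java Pdf2HtmlClient.java /pdfs/001.pdf /pdfs/002.pdf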

(4) Query the metrics exposed by the program

curl http://localhost:8686/api/metric

Troubleshooting

# Enter the container
docker exec -it pdf2html bash
# Go to the log directory
cd /opt/pdf2html-service/logs
# Go to the directory holding the PDFs that failed to convert
cd /tmp/pdf2html-service/failed-pdfs
# Convert a PDF manually with the pdf2htmlEX command
pdf2htmlEX --help

Implementation

Calling the pdf2htmlEX command-line tool by hand every time is inconvenient, so wrapping it in a web service makes it easier to use. The complete source is at:
https://github.com/iflyendless/pdf2html-service

Approach

Because the pdf2htmlEX command-line tool has complex dependencies and is troublesome to compile, the simplest route is to install a JDK into the official Docker image. Spring Boot is then used to quickly write a web application that receives the user's HTTP request, calls the pdf2htmlEX command-line tool in the background to convert the uploaded PDFs to HTML, compresses the resulting HTML into a zip package, and lets the user download it.

The Dockerfile is as follows:

# pdf2htmlex image
FROM pdf2htmlex/pdf2htmlex:0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64
ENV TZ='CST-8'
ENV LANG C.UTF-8
# apt
RUN sed -i s@/archive.ubuntu.com/@/mirrors.aliyun.com/@g /etc/apt/sources.list
RUN apt-get clean && apt-get update
RUN apt-get install -y vim curl htop net-tools
# vim
RUN echo "set fileencodings=utf-8,ucs-bom,gb18030,gbk,gb2312,cp936" >> /etc/vim/vimrc
RUN echo "set termencoding=utf-8" >> /etc/vim/vimrc
RUN echo "set encoding=utf-8" >> /etc/vim/vimrc
# jdk
ADD https://enos.itcollege.ee/~jpoial/allalaadimised/jdk8/jdk-8u291-linux-x64.tar.gz /tmp/
RUN tar -zxf /tmp/jdk-*.tar.gz -C /opt/ && rm -f /tmp/jdk-*.tar.gz && mv /opt/jdk* /opt/jdk
ENV JAVA_HOME /opt/jdk
ENV PATH ${JAVA_HOME}/bin:$PATH
# pdf2html-service
COPY target/pdf2html-service-*.tar.gz /tmp/
RUN tar -zxf /tmp/pdf2html-service-*.tar.gz -C /opt/ && rm -f /tmp/pdf2html-service-*.tar.gz
ENTRYPOINT [""]
WORKDIR /opt/pdf2html-service
CMD ["bash","-c","./start.sh && tail -f /dev/null"]

Add the dependencies

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <java.version>1.8</java.version>
    <maven.build.timestamp.format>yyyyMMdd</maven.build.timestamp.format>
    <hutool.version>5.6.3</hutool.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-configuration-processor</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-exec</artifactId>
        <version>1.3</version>
    </dependency>
    <dependency>
        <groupId>cn.hutool</groupId>
        <artifactId>hutool-all</artifactId>
        <version>${hutool.version}</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
    </dependency>
</dependencies>

This is a Spring Boot application:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.context.properties.ConfigurationPropertiesScan;

@SpringBootApplication
@ConfigurationPropertiesScan
public class Pdf2HtmlService {

    public static void main(String[] args) {
        // Pass args through so command-line options reach Spring Boot
        SpringApplication.run(Pdf2HtmlService.class, args);
    }
}

Application configuration

application.yml is as follows:

server:
  port: ${APP_PORT:8686}
  servlet.context-path: /
pdf2html:
  # /usr/local/bin/pdf2htmlEX --zoom 1.3
  command: ${PDF2HTML_COMMAND:/usr/local/bin/pdf2htmlEX --zoom 1.3 --quiet 1}
  command-timeout: ${PDF2HTML_COMMAND_TIMEOUT:600s}
  work-dir: ${PDF2HTML_WORK_DIR:/tmp/pdf2html-service}
  max-process: ${PDF2HTML_MAX_PROCESS:15}
spring:
  application:
    name: pdf2html-service

The corresponding Pdf2HtmlProperties is as follows:

import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import java.time.Duration;

@Data
@ConfigurationProperties(prefix = "pdf2html")
public class Pdf2HtmlProperties {

    private String command;
    private String workDir;
    private Duration commandTimeout;
    // Maximum number of child processes running at the same time; set it according to the performance of the system
    private int maxProcess;
}

Here is a brief explanation of these settings:

  • command: the exact command used to call the pdf2htmlEX command-line tool. For the detailed parameters, see pdf2htmlEX --help.
  • command-timeout: the command line is invoked asynchronously with Apache's commons-exec toolkit, which lets a maximum timeout be set. For commons-exec, see: https://commons.apache.org/proper/commons-exec/tutorial.html
  • work-dir: the working directory of the web application. When a user request arrives, the PDF files are first written to a subdirectory of this directory. The HTML generated by pdf2htmlEX also goes there by default, and the HTML files are then compressed and written to the response. Note also that PDFs that fail to convert are written to failed-pdfs under work-dir, which makes problems easier to reproduce and troubleshoot.
  • max-process: because this implementation invokes the command-line tool asynchronously, the number of commands started at the same time must be limited. Otherwise a large number of child processes could be spawned in a short time, which not only hurts program performance badly but can also make the system freeze. This setting therefore caps the number of child processes that may run at once, and it should be set according to the performance of the system. The limit is enforced with the JDK's own java.util.concurrent.Semaphore.
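
To make the max-process idea concrete, here is a small JDK-only sketch of the pattern: acquire a permit before starting each asynchronous job and release it in the completion path. It is an illustration rather than the service's code. A thread pool with a short sleep stands in for the pdf2htmlEX child processes, and the class BoundedLauncher and its run method are invented for this example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of pdf2html-service.
public class BoundedLauncher {

    /**
     * Starts {@code tasks} asynchronous jobs but lets at most {@code maxProcess}
     * of them run at once. Returns the highest concurrency that was observed.
     */
    public static int run(int maxProcess, int tasks) {
        Semaphore semaphore = new Semaphore(maxProcess);
        CountDownLatch latch = new CountDownLatch(tasks);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int i = 0; i < tasks; i++) {
                semaphore.acquire(); // blocks the submitting thread while all permits are taken
                pool.execute(() -> {
                    try {
                        int now = running.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(20); // stand-in for the child process doing its work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                        semaphore.release(); // the completion path hands the permit back
                        latch.countDown();
                    }
                });
            }
            latch.await(); // wait until every job has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + run(3, 12));
    }
}
```

The permit is released only after the running counter has been decremented, so the observed peak can never exceed maxProcess. The controller follows the same shape: acquire in the request loop, release in the commons-exec result callback.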

API implementation

The implementation is not complicated, and comments have been added at the key points. It is as follows:

import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.io.FileTypeUtil;
import cn.hutool.core.io.FileUtil;
import cn.hutool.core.util.ArrayUtil;
import cn.hutool.core.util.CharsetUtil;
import cn.hutool.core.util.IdUtil;
import cn.hutool.core.util.ZipUtil;
import com.github.iflyendless.config.Pdf2HtmlProperties;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.exec.*;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServletResponse;
import java.io.File;
import java.io.FileFilter;
import java.net.URLEncoder;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
@Slf4j
@RestController
@RequestMapping("/api")
public class Pdf2HtmlController {

    private static final String PDF = "pdf";
    private static final String FAILED_PDF_DIR = "failed-pdfs";
    private static final String TASK_FILE = "000-task.txt";

    @Resource
    private Pdf2HtmlProperties pdf2HtmlProperties;

    // Limits the number of pdf2htmlEX child processes started at the same time
    private static Semaphore semaphore;

    // PDFs that fail to convert to HTML are written to this directory for later manual conversion
    private static File failedPdfDir;

    @PostConstruct
    public void init() {
        semaphore = new Semaphore(pdf2HtmlProperties.getMaxProcess());
        failedPdfDir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), FAILED_PDF_DIR));
    }

    @GetMapping("/version")
    public Object version() {
        return "1.0.1";
    }

    @GetMapping("/config")
    public Object config() {
        return pdf2HtmlProperties;
    }

    @GetMapping("/metric")
    public Object metric() {
        Map<String, Object> semaphoreMap = new LinkedHashMap<>();
        semaphoreMap.put("availablePermits", semaphore.availablePermits());
        semaphoreMap.put("queueLength", semaphore.getQueueLength());
        Map<String, Object> metricMap = new LinkedHashMap<>();
        metricMap.put("semaphore", semaphoreMap);
        return metricMap;
    }

    @PostMapping("/pdf2html")
    public void pdf2html(@RequestParam("files") MultipartFile[] files,
                         HttpServletResponse response) {
        if (ArrayUtil.isEmpty(files)) {
            log.warn("No files were uploaded");
            return;
        }
        File dir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), IdUtil.simpleUUID()));
        try (ServletOutputStream outputStream = response.getOutputStream()) {
            List<File> fileList = new ArrayList<>(files.length);
            for (MultipartFile f : files) {
                if (f == null || f.isEmpty()) {
                    continue;
                }
                // Write the upload to the local working directory
                File localFile = FileUtil.writeFromStream(f.getInputStream(), FileUtil.file(dir, f.getOriginalFilename()));
                // Only process PDF files
                if (isPdf(localFile)) {
                    fileList.add(localFile);
                }
            }
            if (CollUtil.isEmpty(fileList)) {
                return;
            }
            long start = System.currentTimeMillis();
            int size = fileList.size();
            CountDownLatch latch = new CountDownLatch(size);
            // Collects the PDFs that failed to convert
            Map<String, Throwable> failedMap = new ConcurrentHashMap<>();
            for (File file : fileList) {
                // Limit the number of processes started here: the call below is asynchronous,
                // so without a limit a large number of child processes could be spawned at once
                semaphore.acquire();
                // Call the pdf2htmlEX command-line tool asynchronously
                invokeCommand(dir, file, latch, failedMap);
            }
            // Wait for all child processes to finish
            latch.await();
            log.info("pdf2html took {}ms in total, number of PDFs: {}", System.currentTimeMillis() - start, size);
            // Write the statistics to 000-task.txt and copy the PDFs that failed to convert to a fixed directory
            recordTaskResult(size, failedMap, dir, fileList);
            // Set the response headers before writing the body: headers added after the
            // response has been committed are ignored
            response.addHeader("Content-Disposition",
                    "attachment;filename=" + URLEncoder.encode(dir.getName() + ".zip", "UTF-8"));
            response.addHeader("Content-Type", "application/zip");
            // Compress the generated HTML files and 000-task.txt and write them to the response
            ZipUtil.zip(outputStream, CharsetUtil.CHARSET_UTF_8, true, new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    if (pathname.isDirectory()) {
                        return true;
                    }
                    String name = pathname.getName().toLowerCase();
                    return name.endsWith(".html") || name.endsWith(".txt");
                }
            }, dir);
        } catch (Throwable e) {
            log.error("pdf2html error", e);
        } finally {
            FileUtil.del(dir);
        }
    }

    /**
     * Runs the pdf2htmlEX command-line tool with Apache commons-exec.
     * For details, see: https://commons.apache.org/proper/commons-exec/tutorial.html
     */
    public void invokeCommand(File workDir, File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
        String filePath = file.getAbsolutePath();
        String line = String.format("%s --dest-dir %s %s", pdf2HtmlProperties.getCommand(), workDir.getAbsolutePath(), filePath);
        CommandLine commandLine = CommandLine.parse(line);
        // Kills the process if the command exceeds the timeout
        ExecuteWatchdog watchdog = new ExecuteWatchdog(pdf2HtmlProperties.getCommandTimeout().toMillis());
        // Callback invoked when the command finishes
        ResultHandler resultHandler = new ResultHandler(file, latch, failedMap);
        Executor executor = new DefaultExecutor();
        executor.setExitValue(0);
        executor.setWatchdog(watchdog);
        try {
            executor.execute(commandLine, resultHandler);
        } catch (Throwable e) {
            // The process could not be started, so the callback will never run: clean up here
            semaphore.release();
            String fileName = file.getName();
            if (!failedMap.containsKey(fileName)) {
                failedMap.put(fileName, e);
            }
            latch.countDown();
            log.error("invokeCommand failed, command: {}", line, e);
        }
    }

    public static boolean isPdf(File file) {
        try {
            return PDF.equalsIgnoreCase(FileTypeUtil.getType(file));
        } catch (Exception e) {
            log.error("Failed to detect the PDF type, file: {}", file.getAbsolutePath(), e);
            return false;
        }
    }

    public static void recordTaskResult(int total, Map<String, Throwable> failedMap, File workDir, List<File> pdfs) {
        List<String> list = new ArrayList<>();
        list.add("total:" + total);
        list.add("success:" + (total - failedMap.size()));
        list.add("failed:" + failedMap.size());
        list.add("");
        list.add("failed-pdfs:");
        list.add("");
        Set<String> failedNames = failedMap.keySet();
        list.addAll(failedNames);
        // Record a summary of how the task went
        FileUtil.writeLines(list, FileUtil.file(workDir, TASK_FILE), CharsetUtil.CHARSET_UTF_8);
        // Copy the PDFs that failed to convert to another directory, since they may need further processing later
        if (CollUtil.isNotEmpty(failedNames)) {
            for (File pdf : pdfs) {
                String name = pdf.getName();
                if (failedNames.contains(name)) {
                    File dest = FileUtil.file(failedPdfDir, name);
                    if (dest.exists()) {
                        dest = FileUtil.file(failedPdfDir, IdUtil.simpleUUID() + "-" + name);
                    }
                    FileUtil.copyFile(pdf, dest);
                }
            }
        }
    }

    /**
     * Adapt this to your own business logic; here it only records the failure and logs the error.
     */
    public static class ResultHandler implements ExecuteResultHandler {

        private final File file;
        private final CountDownLatch latch;
        private final Map<String, Throwable> failedMap;

        @Getter
        private int exitValue = -8686;

        public ResultHandler(File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
            this.file = file;
            this.latch = latch;
            this.failedMap = failedMap;
        }

        @Override
        public void onProcessComplete(int exitValue) {
            semaphore.release();
            this.latch.countDown();
            this.exitValue = exitValue;
        }

        @Override
        public void onProcessFailed(ExecuteException e) {
            semaphore.release();
            this.failedMap.put(this.file.getName(), e);
            this.latch.countDown();
            log.error("pdf2html failed, file: {}", this.file.getAbsolutePath(), e);
        }
    }
}
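
One detail worth keeping in mind: invokeCommand joins the command, the destination directory and the PDF path into one string and then lets CommandLine.parse split it again, so a path containing a space (for example an uploaded file called "annual report.pdf") would presumably be split into two arguments. The sketch below shows one possible way around this, building the argument list element by element. The class CommandArgs and its build method are invented for this illustration and use only the JDK; commons-exec's CommandLine.addArgument could serve the same purpose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of pdf2html-service.
public class CommandArgs {

    /**
     * Builds the argument vector for one conversion. The configured command is split
     * on whitespace, while the directory and the PDF path are appended as single arguments.
     */
    public static List<String> build(String command, String destDir, String pdfPath) {
        List<String> argv = new ArrayList<>(Arrays.asList(command.trim().split("\\s+")));
        argv.add("--dest-dir");
        argv.add(destDir);
        argv.add(pdfPath);
        return argv;
    }

    public static void main(String[] args) {
        List<String> argv = build("/usr/local/bin/pdf2htmlEX --zoom 1.3 --quiet 1",
                "/tmp/pdf2html-service/abc",
                "/tmp/pdf2html-service/abc/annual report.pdf");
        // Each list element is one argument, so the space in the file name is preserved
        argv.forEach(System.out::println);
        // The list could then be handed to a process API as-is, e.g. new ProcessBuilder(argv)
    }
}
```

This assumes the configured command itself contains no quoted arguments with spaces; if it does, a real tokenizer is needed for that part.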

Closing remarks

Because I am not familiar with front-end development, I have not had time to build a simple page for this tool. If you know front-end development and are interested in this tool, you are welcome to contribute a page. Also, if you know of better tools or implementations for converting PDF to HTML, please leave a comment!

Notes like these are handy for both you and me.

Copyright notice
This article was written by [Endless travel]. Please include a link to the original when reposting. Thank you.
https://qdmana.com/2021/05/20210504082502360s.html
