PDF-to-HTML tool: wrapping the pdf2htmlEX command-line tool with Spring Boot

行无际 2021-05-04 08:25:19


Convert PDF to HTML without losing text or format.

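This project uses Spring Boot to wrap the pdf2htmlEX command-line tool as a web service, making PDF-to-HTML conversion more convenient.

For details on the pdf2htmlEX command-line tool, see:
https://github.com/pdf2htmlEX/pdf2htmlEX

The pdf2html-service source code is at: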
https://github.com/iflyendless/pdf2html-service

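Quick start

# Pull the image
docker pull iflyendless/pdf2html-service:1.0.1
# Start the container
docker run --name pdf2html -p 8686:8686 -d --rm iflyendless/pdf2html-service:1.0.1

Usage:

curl -o html.zip --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/example.pdf'

Note: /pdfs/example.pdf is the absolute path of the PDF file.

Unzip html.zip in the current directory to see the converted HTML files along with 000-task.txt, which summarizes how many PDFs were converted and which ones failed.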
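
If you want to check the result of a batch from code rather than by eye, 000-task.txt is easy to parse. The sketch below assumes the layout written by the recordTaskResult method shown later in this post (total:, success: and failed: counters, then a failed-pdfs: marker followed by file names). TaskReport is a hypothetical helper of my own, not part of pdf2html-service.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of pdf2html-service.
class TaskReport {
    final int total;
    final int success;
    final int failed;
    final List<String> failedPdfs;

    TaskReport(int total, int success, int failed, List<String> failedPdfs) {
        this.total = total;
        this.success = success;
        this.failed = failed;
        this.failedPdfs = failedPdfs;
    }

    // Parses the lines of 000-task.txt: three "key:value" counters, then the
    // names of the failed PDFs listed after the "failed-pdfs:" marker.
    static TaskReport parse(List<String> lines) {
        int total = 0;
        int success = 0;
        int failed = 0;
        List<String> failedPdfs = new ArrayList<>();
        boolean inFailedList = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (inFailedList) {
                failedPdfs.add(line);
            } else if (line.equals("failed-pdfs:")) {
                inFailedList = true;
            } else if (line.startsWith("total:")) {
                total = Integer.parseInt(line.substring("total:".length()));
            } else if (line.startsWith("success:")) {
                success = Integer.parseInt(line.substring("success:".length()));
            } else if (line.startsWith("failed:")) {
                failed = Integer.parseInt(line.substring("failed:".length()));
            }
        }
        return new TaskReport(total, success, failed, failedPdfs);
    }

    public static void main(String[] args) {
        TaskReport report = parse(Arrays.asList(
                "total:3", "success:2", "failed:1", "", "failed-pdfs:", "", "002.pdf"));
        System.out.println(report.total + " total, " + report.failed + " failed: " + report.failedPdfs);
    }
}
```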

Building the image

# Download the code
git clone https://github.com/iflyendless/pdf2html-service.git
# Enter the project
cd pdf2html-service
# Package, skipping unit tests
mvn clean package -DskipTests
# Build the Docker image
docker build -t pdf2html-service:1.0.1 .

If the image build fails, check whether the JDK version available at https://enos.itcollege.ee/~jpoial/allalaadimised/jdk8/ still matches the version downloaded in the Dockerfile.

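Starting the service

docker run --name pdf2html -p 8686:8686 -d --rm pdf2html-service:1.0.1

To set additional parameters, pass them with -e when starting the container:

# Maximum number of child processes running at once; set it according to the available system resources (default 15)
-e PDF2HTML_MAX_PROCESS=15
# Maximum time the /usr/local/bin/pdf2htmlEX command may run; the unit s means seconds (default 600s)
-e PDF2HTML_COMMAND_TIMEOUT=600s

That is:

docker run --name pdf2html -p 8686:8686 -e PDF2HTML_MAX_PROCESS=10 -e PDF2HTML_COMMAND_TIMEOUT=60s -d --rm pdf2html-service:1.0.1

For more settings, see the application.yml file under the resources directory.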
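
These environment variables work because application.yml (shown later) uses placeholders such as ${PDF2HTML_COMMAND_TIMEOUT:600s}: the value after the colon is the default used when the variable is not defined, and Spring Boot converts a string like 600s into a java.time.Duration. The following is only a rough stdlib sketch of that behaviour, not Spring's actual code; EnvDefaults is a hypothetical name, and the parser handles only the ms, s, m and h suffixes.

```java
import java.time.Duration;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "${NAME:default}" resolution plus simple duration parsing.
class EnvDefaults {
    // A number followed by an optional unit; only a subset of the units Spring Boot accepts.
    private static final Pattern SIMPLE = Pattern.compile("^(\\d+)(ms|s|m|h)?$");

    // Returns the variable's value when it is defined, otherwise the default.
    static String resolve(Map<String, String> env, String name, String defaultValue) {
        String value = env.get(name);
        return value != null ? value : defaultValue;
    }

    // Parses values such as "600s" or "500ms"; a bare number is read as milliseconds.
    static Duration parseDuration(String text) {
        Matcher m = SIMPLE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported duration: " + text);
        }
        long amount = Long.parseLong(m.group(1));
        String unit = m.group(2) == null ? "ms" : m.group(2);
        switch (unit) {
            case "ms":
                return Duration.ofMillis(amount);
            case "s":
                return Duration.ofSeconds(amount);
            case "m":
                return Duration.ofMinutes(amount);
            default:
                return Duration.ofHours(amount);
        }
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        int maxProcess = Integer.parseInt(resolve(env, "PDF2HTML_MAX_PROCESS", "15"));
        Duration timeout = parseDuration(resolve(env, "PDF2HTML_COMMAND_TIMEOUT", "600s"));
        System.out.println("max-process=" + maxProcess + ", command-timeout=" + timeout);
    }
}
```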

HTTP API

(1) Show the version

curl http://localhost:8686/api/version

(2) Show the configuration

curl http://localhost:8686/api/config

(3) Upload multiple PDFs and download the HTML archive

curl -o html.zip --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/001.pdf' --form 'files=@/pdfs/002.pdf' --form 'files=@/pdfs/003.pdf'

Note: /pdfs/001.pdf is the absolute path of the PDF file.

(4) Query the metrics exposed by the program

curl http://localhost:8686/api/metric
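
The multi-file upload in (3) can also be done without curl. Below is a minimal sketch, using only the JDK, of how the multipart/form-data body with repeated files parts might be assembled and posted. MultipartBody and its methods are hypothetical names of mine; the URL and the files field name come from the curl command above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client helper, not part of pdf2html-service.
class MultipartBody {
    // Builds one part per file, all under the same form field name.
    static byte[] build(String boundary, String fieldName, Map<String, byte[]> files) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            String header = "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"" + fieldName
                    + "\"; filename=\"" + e.getKey() + "\"\r\n"
                    + "Content-Type: application/pdf\r\n\r\n";
            append(out, header.getBytes(StandardCharsets.UTF_8));
            append(out, e.getValue());
            append(out, "\r\n".getBytes(StandardCharsets.UTF_8));
        }
        // Closing delimiter that ends the multipart body
        append(out, ("--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    private static void append(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }

    // Posts the files and saves the returned archive to zipTarget.
    static void upload(String url, Map<String, byte[]> files, Path zipTarget) throws IOException {
        String boundary = "----pdf2html" + System.nanoTime();
        byte[] body = build(boundary, "files", files);
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=" + boundary);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body);
        }
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, zipTarget, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        for (String arg : args) {
            Path p = Paths.get(arg);
            files.put(p.getFileName().toString(), Files.readAllBytes(p));
        }
        if (files.isEmpty()) {
            System.out.println("usage: java MultipartBody.java /pdfs/001.pdf [/pdfs/002.pdf ...]");
            return;
        }
        upload("http://localhost:8686/api/pdf2html", files, Paths.get("html.zip"));
    }
}
```

The whole body is buffered in memory here, which keeps the sketch short; for large PDFs you would want to stream the parts instead.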

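Troubleshooting

# Enter the container
docker exec -it pdf2html bash
# Go to the log directory
cd /opt/pdf2html-service/logs
# Go to the directory holding PDFs that failed to convert
cd /tmp/pdf2html-service/failed-pdfs
# Run pdf2htmlEX by hand to convert a PDF
pdf2htmlEX --help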

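Implementation

Invoking the pdf2htmlEX command-line tool by hand every time is inconvenient, so I wrapped it in a web service to make it easier to use. The full source code is at:
https://github.com/iflyendless/pdf2html-service

Approach

pdf2htmlEX has fairly complex dependencies and is troublesome to compile, so the simplest route is to install a JDK in the official Docker image and write a small Spring Boot web application on top of it. The application receives the user's HTTP request, invokes pdf2htmlEX in the background to convert each PDF to HTML, compresses the generated HTML into a zip archive, and returns the archive for download.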
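
The last step of that pipeline, packing the generated files into a zip, can be sketched with the JDK alone. The real service uses hutool's ZipUtil (see the controller later in this post); the version below is a simplified stand-in under my own assumptions: HtmlZipper is a hypothetical name, and unlike the service it does not put the files under a top-level directory inside the archive.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical stand-in for the zip step, not the service's actual code.
class HtmlZipper {
    // Zips every .html and .txt file under dir into target and returns the entry names.
    static List<String> zipHtml(Path dir, OutputStream target) {
        try (ZipOutputStream zip = new ZipOutputStream(target);
             Stream<Path> walk = Files.walk(dir)) {
            List<Path> wanted = walk
                    .filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase();
                        return name.endsWith(".html") || name.endsWith(".txt");
                    })
                    .sorted()
                    .collect(Collectors.toList());
            List<String> names = new ArrayList<>();
            for (Path p : wanted) {
                // Zip entry names always use forward slashes
                String name = dir.relativize(p).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(p, zip);
                zip.closeEntry();
                names.add(name);
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (OutputStream out = Files.newOutputStream(Paths.get("html.zip"))) {
            System.out.println("zipped: " + zipHtml(dir, out));
        }
    }
}
```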

The Dockerfile is as follows:

# pdf2htmlex image
FROM pdf2htmlex/pdf2htmlex:0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64
ENV TZ='CST-8'
ENV LANG C.UTF-8
# apt
RUN sed -i s@/archive.ubuntu.com/@/mirrors.aliyun.com/@g /etc/apt/sources.list
RUN apt-get clean && apt-get update
RUN apt-get install -y vim curl htop net-tools
# vim
RUN echo "set fileencodings=utf-8,ucs-bom,gb18030,gbk,gb2312,cp936" >> /etc/vim/vimrc
RUN echo "set termencoding=utf-8" >> /etc/vim/vimrc
RUN echo "set encoding=utf-8" >> /etc/vim/vimrc
# jdk
ADD https://enos.itcollege.ee/~jpoial/allalaadimised/jdk8/jdk-8u291-linux-x64.tar.gz /tmp/
RUN tar -zxf /tmp/jdk-*.tar.gz -C /opt/ && rm -f /tmp/jdk-*.tar.gz && mv /opt/jdk* /opt/jdk
ENV JAVA_HOME /opt/jdk
ENV PATH ${JAVA_HOME}/bin:$PATH
# pdf2html-service
COPY target/pdf2html-service-*.tar.gz /tmp/
RUN tar -zxf /tmp/pdf2html-service-*.tar.gz -C /opt/ && rm -f /tmp/pdf2html-service-*.tar.gz
ENTRYPOINT [""]
WORKDIR /opt/pdf2html-service
CMD ["bash","-c","./start.sh && tail -f /dev/null"]

Adding dependencies

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <java.version>1.8</java.version>
    <maven.build.timestamp.format>yyyyMMdd</maven.build.timestamp.format>
    <hutool.version>5.6.3</hutool.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-configuration-processor</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-exec</artifactId>
        <version>1.3</version>
    </dependency>
    <dependency>
        <groupId>cn.hutool</groupId>
        <artifactId>hutool-all</artifactId>
        <version>${hutool.version}</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
    </dependency>
</dependencies>

This is a Spring Boot application:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.context.properties.ConfigurationPropertiesScan;

@SpringBootApplication
@ConfigurationPropertiesScan
public class Pdf2HtmlService {

    public static void main(String[] args) {
        // Pass args through so command-line properties are honored
        SpringApplication.run(Pdf2HtmlService.class, args);
    }
}

Configuration

application.yml looks roughly like this:

server:
  port: ${APP_PORT:8686}
  servlet.context-path: /
pdf2html:
  # /usr/local/bin/pdf2htmlEX --zoom 1.3
  command: ${PDF2HTML_COMMAND:/usr/local/bin/pdf2htmlEX --zoom 1.3 --quiet 1}
  command-timeout: ${PDF2HTML_COMMAND_TIMEOUT:600s}
  work-dir: ${PDF2HTML_WORK_DIR:/tmp/pdf2html-service}
  max-process: ${PDF2HTML_MAX_PROCESS:15}
spring:
  application:
    name: pdf2html-service

The corresponding Pdf2HtmlProperties class is:

import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;

import java.time.Duration;

@Data
@ConfigurationProperties(prefix = "pdf2html")
public class Pdf2HtmlProperties {

    private String command;

    private String workDir;

    private Duration commandTimeout;

    // Maximum number of child processes running at once; tune it to the machine's capacity
    private int maxProcess;
}

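A brief explanation of what these settings mean:

  • command: the exact command used to invoke pdf2htmlEX; see pdf2htmlEX --help for the full list of options.
  • command-timeout: the service uses the Apache commons-exec library to invoke the command asynchronously, and this setting caps how long a single command may run. For details on commons-exec, see: https://commons.apache.org/proper/commons-exec/tutorial.html
  • work-dir: the working directory of the web application. When a request arrives, the PDF files are first written to a subdirectory of this directory; pdf2htmlEX also writes its HTML output there, and the generated HTML files are then compressed and written to the response. Note that PDFs that fail to convert are copied to failed-pdfs under work-dir, which makes problems easier to reproduce and investigate.
  • max-process: because this implementation invokes the command-line tool fully asynchronously, the number of commands running at once must be capped. Otherwise a burst of uploads could spawn a large number of child processes in a short time, which not only seriously degrades performance but can also freeze the system. This setting limits the maximum number of concurrent child processes and should be tuned to the machine's capacity. The limit is enforced with the JDK's java.util.concurrent.Semaphore.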
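
To make the max-process idea concrete, here is a small stdlib-only sketch of the same pattern: acquire a permit before starting each asynchronous job, release it in the completion path, and use a CountDownLatch to wait for the whole batch. It is an illustration under my own assumptions rather than the service code: BoundedLauncher is a hypothetical name, and a sleeping thread stands in for the pdf2htmlEX child process.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo of bounding asynchronous work with a Semaphore.
class BoundedLauncher {
    // Runs `tasks` simulated jobs with at most `maxProcess` in flight
    // and returns the highest concurrency that was observed.
    static int run(int tasks, int maxProcess) {
        Semaphore semaphore = new Semaphore(maxProcess);
        CountDownLatch latch = new CountDownLatch(tasks);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int i = 0; i < tasks; i++) {
                // Blocks the submitting thread while maxProcess jobs are in flight
                semaphore.acquire();
                pool.execute(() -> {
                    try {
                        int now = running.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(20); // stands in for the external process
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                        semaphore.release(); // the completion path frees the slot
                        latch.countDown();
                    }
                });
            }
            // Wait until every job has finished
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + run(10, 3));
    }
}
```

The permit has to be released on every completion path, success or failure, otherwise the pool of permits slowly drains and later requests block forever.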

API implementation

The implementation is not complicated, and the key parts are commented. It is as follows:

import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.io.FileTypeUtil;
import cn.hutool.core.io.FileUtil;
import cn.hutool.core.util.ArrayUtil;
import cn.hutool.core.util.CharsetUtil;
import cn.hutool.core.util.IdUtil;
import cn.hutool.core.util.ZipUtil;
import com.github.iflyendless.config.Pdf2HtmlProperties;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.exec.*;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServletResponse;
import java.io.File;
import java.io.FileFilter;
import java.net.URLEncoder;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;

@Slf4j
@RestController
@RequestMapping("/api")
public class Pdf2HtmlController {

    private static final String PDF = "pdf";
    private static final String FAILED_PDF_DIR = "failed-pdfs";
    private static final String TASK_FILE = "000-task.txt";

    @Resource
    private Pdf2HtmlProperties pdf2HtmlProperties;

    // Limits the number of pdf2htmlEX child processes running at once
    private static Semaphore semaphore;

    // PDFs that fail to convert are copied here so they can be converted by hand to find the cause
    private static File failedPdfDir;

    @PostConstruct
    public void init() {
        semaphore = new Semaphore(pdf2HtmlProperties.getMaxProcess());
        failedPdfDir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), FAILED_PDF_DIR));
    }

    @GetMapping("/version")
    public Object version() {
        return "1.0.1";
    }

    @GetMapping("/config")
    public Object config() {
        return pdf2HtmlProperties;
    }

    @GetMapping("/metric")
    public Object metric() {
        Map<String, Object> semaphoreMap = new LinkedHashMap<>();
        semaphoreMap.put("availablePermits", semaphore.availablePermits());
        semaphoreMap.put("queueLength", semaphore.getQueueLength());
        Map<String, Object> metricMap = new LinkedHashMap<>();
        metricMap.put("semaphore", semaphoreMap);
        return metricMap;
    }

    @PostMapping("/pdf2html")
    public void pdf2html(@RequestParam("files") MultipartFile[] files,
                         HttpServletResponse response) {
        if (ArrayUtil.isEmpty(files)) {
            log.warn("No files were uploaded");
            return;
        }
        File dir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), IdUtil.simpleUUID()));
        try (ServletOutputStream outputStream = response.getOutputStream()) {
            List<File> fileList = new ArrayList<>(files.length);
            for (MultipartFile f : files) {
                if (f == null || f.isEmpty()) {
                    continue;
                }
                // Write the upload to the local working directory
                File localFile = FileUtil.writeFromStream(f.getInputStream(), FileUtil.file(dir, f.getOriginalFilename()));
                // Only PDF files are processed
                if (isPdf(localFile)) {
                    fileList.add(localFile);
                }
            }
            if (CollUtil.isEmpty(fileList)) {
                return;
            }
            long start = System.currentTimeMillis();
            int size = fileList.size();
            CountDownLatch latch = new CountDownLatch(size);
            // Collects the PDFs that failed to convert
            Map<String, Throwable> failedMap = new ConcurrentHashMap<>();
            for (File file : fileList) {
                // Limit the number of child processes here: the call below is
                // asynchronous, so without this a burst of processes could be spawned
                semaphore.acquire();
                // Invoke the pdf2htmlEX command-line tool asynchronously
                invokeCommand(dir, file, latch, failedMap);
            }
            // Wait for all child processes to finish
            latch.await();
            log.info("pdf2html took {}ms in total for {} PDF(s)", System.currentTimeMillis() - start, size);
            // Write the statistics to 000-task.txt and copy failed PDFs to the failed-pdfs directory
            recordTaskResult(size, failedMap, dir, fileList);
            // Set the headers before the body is written; headers added after the response is committed are ignored
            response.addHeader("Content-Disposition",
                    "attachment;fileName=" + URLEncoder.encode(dir.getName() + ".zip", "UTF-8"));
            response.addHeader("Content-type", "application/zip");
            // Zip the generated HTML files together with 000-task.txt and write the archive to the response
            ZipUtil.zip(outputStream, CharsetUtil.CHARSET_UTF_8, true, new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    if (pathname.isDirectory()) {
                        return true;
                    }
                    String name = pathname.getName().toLowerCase();
                    return name.endsWith(".html") || name.endsWith(".txt");
                }
            }, dir);
        } catch (Throwable e) {
            log.error("pdf2html error", e);
        } finally {
            FileUtil.del(dir);
        }
    }

    /**
     * Uses Apache commons-exec to run the pdf2htmlEX command-line tool.
     * See: https://commons.apache.org/proper/commons-exec/tutorial.html
     */
    public void invokeCommand(File workDir, File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
        String filePath = file.getAbsolutePath();
        // Note: the command is split on whitespace, so paths containing spaces would need quoting
        String line = String.format("%s --dest-dir %s %s", pdf2HtmlProperties.getCommand(), workDir.getAbsolutePath(), filePath);
        CommandLine commandLine = CommandLine.parse(line);
        // Kills the process if it runs longer than the configured timeout (in milliseconds)
        ExecuteWatchdog watchdog = new ExecuteWatchdog(pdf2HtmlProperties.getCommandTimeout().toMillis());
        // Callback invoked when the command finishes
        ResultHandler resultHandler = new ResultHandler(file, latch, failedMap);
        Executor executor = new DefaultExecutor();
        executor.setExitValue(0);
        executor.setWatchdog(watchdog);
        try {
            executor.execute(commandLine, resultHandler);
        } catch (Throwable e) {
            // The process never started, so the callback will not run: clean up here
            semaphore.release();
            failedMap.putIfAbsent(file.getName(), e);
            latch.countDown();
            log.error("invokeCommand failed, command: {}", line, e);
        }
    }

    public static boolean isPdf(File file) {
        try {
            return PDF.equalsIgnoreCase(FileTypeUtil.getType(file));
        } catch (Exception e) {
            log.error("Failed to detect PDF type, file: {}", file.getAbsolutePath(), e);
            return false;
        }
    }

    public static void recordTaskResult(int total, Map<String, Throwable> failedMap, File workDir, List<File> pdfs) {
        List<String> list = new ArrayList<>();
        list.add("total:" + total);
        list.add("success:" + (total - failedMap.size()));
        list.add("failed:" + failedMap.size());
        list.add("");
        list.add("failed-pdfs:");
        list.add("");
        Set<String> failedNames = failedMap.keySet();
        list.addAll(failedNames);
        // Record a summary of how the task went
        FileUtil.writeLines(list, FileUtil.file(workDir, TASK_FILE), CharsetUtil.CHARSET_UTF_8);
        // Copy PDFs that failed to convert to a separate directory for later investigation
        if (CollUtil.isNotEmpty(failedNames)) {
            for (File pdf : pdfs) {
                String name = pdf.getName();
                if (failedNames.contains(name)) {
                    File dest = FileUtil.file(failedPdfDir, name);
                    // Avoid overwriting an earlier failure with the same file name
                    if (dest.exists()) {
                        dest = FileUtil.file(failedPdfDir, IdUtil.simpleUUID() + "-" + name);
                    }
                    FileUtil.copyFile(pdf, dest);
                }
            }
        }
    }

    /**
     * Adapt this to your own business logic; here it just logs the error.
     */
    public static class ResultHandler implements ExecuteResultHandler {

        private final File file;
        private final CountDownLatch latch;
        private final Map<String, Throwable> failedMap;

        @Getter
        private int exitValue = -8686;

        public ResultHandler(File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
            this.file = file;
            this.latch = latch;
            this.failedMap = failedMap;
        }

        @Override
        public void onProcessComplete(int exitValue) {
            semaphore.release();
            this.latch.countDown();
            this.exitValue = exitValue;
        }

        @Override
        public void onProcessFailed(ExecuteException e) {
            semaphore.release();
            this.failedMap.put(this.file.getName(), e);
            this.latch.countDown();
            log.error("pdf2html failed, file: {}", this.file.getAbsolutePath(), e);
        }
    }
}
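
A note on the isPdf check: FileTypeUtil decides the type from the leading bytes of the file rather than from its extension, which is why a renamed non-PDF upload is skipped. The sketch below shows the same idea with the JDK only, assuming the usual %PDF- signature at the very start of the file; PdfSniffer is a hypothetical name, and this is not how hutool is implemented internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper illustrating a magic-number check for PDF content.
class PdfSniffer {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the stream starts with the "%PDF-" signature.
    static boolean looksLikePdf(InputStream in) {
        try {
            byte[] head = new byte[MAGIC.length];
            int read = 0;
            // A single read() may return fewer bytes than requested, so loop
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    return false; // stream ended before the signature was complete
                }
                read += n;
            }
            return Arrays.equals(head, MAGIC);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(new ByteArrayInputStream(sample)));
    }
}
```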

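Afterword

Since I am not very familiar with front-end development, I did not spend time building even a simple page. If you know front-end development and find this tool interesting, feel free to write one; that would be even better! Also, if you know of a better tool or implementation for PDF-to-HTML conversion, please leave a comment!

Jotted down casually, for everyone's convenience.

Copyright notice
This article was written by [行无际]. Please include a link to the original when reposting. Thank you.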
https://www.cnblogs.com/iflyendless/p/pdf2html.html
