Convert PDF to HTML without losing text or format.

This project uses Spring Boot to wrap the pdf2htmlEX command-line tool as a web service, which makes converting PDF to HTML more convenient.

For pdf2htmlEX, see:

For the pdf2html-service source code, see:

Quick start

# Pull the image
docker pull iflyendless/pdf2html-service:1.0.1
# Start the container
docker run --name pdf2html -p 8686:8686 -d --rm iflyendless/pdf2html-service:1.0.1

Usage:

curl -o --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/example.pdf'

Note: /pdfs/example.pdf is the absolute path of the PDF file.

Unzip the downloaded archive in the current directory to see the converted HTML files and 000-task.txt.

Build the image

# Download the code
git clone
# Enter the project directory
cd pdf2html-service
# Package, skipping unit tests
mvn clean package -DskipTests
# Build the Docker image
docker build -t pdf2html-service:1.0.1 .

If the image build fails, check whether the JDK version available at the download site matches the version downloaded in the Dockerfile.


docker run --name pdf2html -p 8686:8686 -d --rm pdf2html-service:1.0.1

If you need to set extra parameters, pass them with -e when starting the container:

# Maximum number of child processes started at the same time; set it according to the system's resources (default 15)
-e PDF2HTML_MAX_PROCESS=15
# Maximum timeout for executing the /usr/local/bin/pdf2htmlEX command; the unit s means seconds (default 600s)
-e PDF2HTML_COMMAND_TIMEOUT=600s

For example:

docker run --name pdf2html -p 8686:8686 -e PDF2HTML_MAX_PROCESS=10 -e PDF2HTML_COMMAND_TIMEOUT=60s -d --rm pdf2html-service:1.0.1

For more configuration options, see the application.yml file in the resources directory.

HTTP API

(1) View version

curl http://localhost:8686/api/version

(2) Check the configuration

curl http://localhost:8686/api/config

(3) Upload multiple PDFs and download the HTML archive

curl -o --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/001.pdf' --form 'files=@/pdfs/002.pdf' --form 'files=@/pdfs/003.pdf'

Note: /pdfs/001.pdf is the absolute path of the PDF file.

(4) Query the metrics exposed by the program

curl http://localhost:8686/api/metric

Troubleshooting

# Enter the container
docker exec -it pdf2html bash
# Go to the log directory
cd /opt/pdf2html-service/logs
# View the PDFs that failed to convert
cd /tmp/pdf2html-service/failed-pdfs
# Call pdf2htmlEX manually to convert a PDF
pdf2htmlEX --help


Calling the pdf2htmlEX command-line tool by hand every time is not very convenient, so it is wrapped here as a web service that is easier to use. For the complete source, see:


Because the pdf2htmlEX command-line tool has complex dependencies and is troublesome to compile, the approach is to start from the official Docker image and install a JDK in it, then use Spring Boot to quickly write a web application. The application receives the user's HTTP request, calls the pdf2htmlEX command-line tool in the background to convert multiple PDFs to HTML, compresses the resulting HTML into a zip package, and lets the user download it.
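
The final step of that flow, compressing the generated files before returning them, can be sketched with the JDK alone. This is only an illustration under assumptions: the service itself uses Hutool's ZipUtil, while the sketch below swaps in java.util.zip, and the class HtmlZipper with its methods zipResults and demo are hypothetical names that do not exist in pdf2html-service.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of pdf2html-service.
class HtmlZipper {

    // Zips every .html and .txt file directly under dir into out and returns the entry names.
    // Closing the ZipOutputStream also closes the underlying stream.
    static List<String> zipResults(Path dir, OutputStream out) {
        try (ZipOutputStream zip = new ZipOutputStream(out, StandardCharsets.UTF_8);
             Stream<Path> files = Files.list(dir)) {
            List<Path> selected = files
                    .filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase();
                        return name.endsWith(".html") || name.endsWith(".txt");
                    })
                    .sorted()
                    .collect(Collectors.toList());
            List<String> names = new ArrayList<>();
            for (Path p : selected) {
                String name = p.getFileName().toString();
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(p, zip);   // stream the file content into the current entry
                zip.closeEntry();
                names.add(name);
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a throwaway working directory with one HTML file, one task file and one PDF,
    // then zips it; the PDF is expected to be left out of the archive.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("pdf2html-demo");
            Files.write(dir.resolve("example.html"), "<html></html>".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("000-task.txt"), "total:1".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("example.pdf"), new byte[]{1, 2, 3});
            return zipResults(dir, new ByteArrayOutputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a servlet, the OutputStream argument would be the response's output stream, so the archive is streamed to the client without first being written to disk.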

The Dockerfile is as follows:

# pdf2htmlex image
FROM pdf2htmlex/pdf2htmlex:0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64

ENV TZ='CST-8'
ENV LANG C.UTF-8

# apt
RUN sed -i s@/ /etc/apt/sources.list
RUN apt-get clean && apt-get update
RUN apt-get install -y vim curl htop net-tools

# vim
RUN echo "set fileencodings=utf-8,ucs-bom,gb18030,gbk,gb2312,cp936" >> /etc/vim/vimrc
RUN echo "set termencoding=utf-8" >> /etc/vim/vimrc
RUN echo "set encoding=utf-8" >> /etc/vim/vimrc

# jdk
ADD /tmp/
RUN tar -zxf /tmp/jdk-*.tar.gz -C /opt/ && rm -f /tmp/jdk-*.tar.gz && mv /opt/jdk* /opt/jdk

ENV JAVA_HOME /opt/jdk
ENV PATH ${JAVA_HOME}/bin:$PATH

# pdf2html-service
COPY target/pdf2html-service-*.tar.gz /tmp/
RUN tar -zxf /tmp/pdf2html-service-*.tar.gz -C /opt/ && rm -f /tmp/pdf2html-service-*.tar.gz

ENTRYPOINT [""]
WORKDIR /opt/pdf2html-service
CMD ["bash","-c","./ && tail -f /dev/null"]

Introduce dependencies

</properties>

<dependencies>
    <dependency>
    </dependency>
    <dependency>
    </dependency>
    <dependency>
    </dependency>
    <dependency>
    </dependency>
    <dependency>
    </dependency>
</dependencies>

This is a Spring Boot application:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class Pdf2HtmlService {

    public static void main(String[] args) {
        // Start the embedded web server and the Spring context
        SpringApplication.run(Pdf2HtmlService.class, args);
    }
}

Application configuration

application.yml is as follows:

server:
  port: ${APP_PORT:8686}
  servlet.context-path: /

pdf2html:
  # /usr/local/bin/pdf2htmlEX --zoom 1.3
  command: ${PDF2HTML_COMMAND:/usr/local/bin/pdf2htmlEX --zoom 1.3 --quiet 1}
  command-timeout: ${PDF2HTML_COMMAND_TIMEOUT:600s}
  work-dir: ${PDF2HTML_WORK_DIR:/tmp/pdf2html-service}
  max-process: ${PDF2HTML_MAX_PROCESS:15}

spring:
  application:
    name: pdf2html-service
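
Each value above has the form ${NAME:default}: if the environment variable NAME is set (for example through docker run -e), its value is used, otherwise the text after the colon applies. The following is a rough sketch of that fallback rule only. It is not how Spring resolves placeholders internally, and the class PlaceholderDemo and its resolve method are hypothetical names introduced for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spring or pdf2html-service.
class PlaceholderDemo {

    // Matches ${NAME:default}; group 1 is the name, group 2 the default value.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^:}]+):([^}]*)\\}");

    // Replaces every ${NAME:default} with env.get(NAME), or with the default when NAME is absent.
    static String resolve(String value, Map<String, String> env) {
        Matcher m = PLACEHOLDER.matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = env.getOrDefault(m.group(1), m.group(2));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        env.put("PDF2HTML_MAX_PROCESS", "10");
        // Variable set: the environment value is used.
        System.out.println(resolve("${PDF2HTML_MAX_PROCESS:15}", env));
        // Variable not set: the default after the colon is used.
        System.out.println(resolve("${PDF2HTML_COMMAND_TIMEOUT:600s}", Collections.emptyMap()));
    }
}
```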

The corresponding Pdf2HtmlProperties class is as follows:

import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;

import java.time.Duration;

@Data
@ConfigurationProperties(prefix = "pdf2html")
public class Pdf2HtmlProperties {

    private String command;

    private String workDir;

    private Duration commandTimeout;

    // Maximum number of child processes started at the same time; set it according to the performance of the system
    private int maxProcess;
}

Here is a brief explanation of these settings:

  • command: the full pdf2htmlEX command line to invoke. For the available parameters, see pdf2htmlEX --help.
  • command-timeout: the command line is invoked asynchronously with Apache's commons-exec library, which allows a maximum timeout to be set. For commons-exec, see:
  • work-dir: the working directory of the web application. When a request arrives, the PDF files are first written to a subdirectory of this directory. The HTML generated by pdf2htmlEX is also written there by default, and the HTML files are then compressed and written to the response. Note also that PDFs that fail to convert are written to failed-pdfs under work-dir, which makes problems easier to reproduce and troubleshoot.
  • max-process: because the implementation invokes the command-line tool asynchronously, the number of commands running at the same time must be limited. Otherwise a large number of child processes can be spawned in a short time, which not only hurts program performance badly but can also make the system freeze. This setting caps the number of child processes that can run at the same time and should be set according to the performance of the system. The limit is enforced with the JDK's built-in java.util.concurrent.Semaphore.
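
To make the max-process idea concrete, here is a minimal sketch of capping asynchronous work with a Semaphore. It relies on assumptions: a short sleep stands in for a running pdf2htmlEX child process, CompletableFuture replaces commons-exec, and the class BoundedLauncher with its launch and demo methods is a hypothetical name that does not appear in the service.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of pdf2html-service.
class BoundedLauncher {

    private final Semaphore semaphore;
    private final AtomicInteger running = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    BoundedLauncher(int maxProcess) {
        this.semaphore = new Semaphore(maxProcess);
    }

    // Blocks the caller until a permit is free, then starts the task asynchronously.
    void launch(Runnable task, CountDownLatch latch) {
        semaphore.acquireUninterruptibly();
        CompletableFuture.runAsync(() -> {
            // Track the highest number of tasks that were running at once
            peak.accumulateAndGet(running.incrementAndGet(), Math::max);
            try {
                task.run();
            } finally {
                running.decrementAndGet();
                semaphore.release();   // free the slot before signalling completion
                latch.countDown();
            }
        });
    }

    // Runs `tasks` short jobs with at most `maxProcess` at once.
    // Returns {peak concurrency observed, permits available afterwards}.
    static int[] demo(int maxProcess, int tasks) {
        BoundedLauncher launcher = new BoundedLauncher(maxProcess);
        CountDownLatch latch = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            launcher.launch(() -> {
                try {
                    Thread.sleep(20);  // stands in for a running child process
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, latch);
        }
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[]{launcher.peak.get(), launcher.semaphore.availablePermits()};
    }

    public static void main(String[] args) {
        int[] result = demo(3, 10);
        System.out.println("peak concurrency: " + result[0] + ", permits left: " + result[1]);
    }
}
```

The permit is acquired on the calling thread and released in the completion path, so the loop that submits work slows down on its own once the limit is reached.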

API implementation

The API implementation is not complicated, and comments have been added at the key points. It is as follows:

import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.io.FileTypeUtil;
import cn.hutool.core.io.FileUtil;
import cn.hutool.core.util.ArrayUtil;
import cn.hutool.core.util.CharsetUtil;
import cn.hutool.core.util.IdUtil;
import cn.hutool.core.util.ZipUtil;
import com.github.iflyendless.config.Pdf2HtmlProperties;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.exec.*;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServletResponse;
import java.io.File;
import java.io.FileFilter;
import java.net.URLEncoder;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;

@Slf4j
@RestController
@RequestMapping("/api")
public class Pdf2HtmlController {

    private static final String PDF = "pdf";
    private static final String FAILED_PDF_DIR = "failed-pdfs";
    private static final String TASK_FILE = "000-task.txt";

    @Resource
    private Pdf2HtmlProperties pdf2HtmlProperties;

    // Limits the number of pdf2htmlEX child processes running at the same time
    private static Semaphore semaphore;
    // PDFs that fail to convert are written to this directory for later manual conversion
    private static File failedPdfDir;

    @PostConstruct
    public void init() {
        semaphore = new Semaphore(pdf2HtmlProperties.getMaxProcess());
        failedPdfDir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), FAILED_PDF_DIR));
    }

    @GetMapping("/version")
    public Object version() {
        return "1.0.1";
    }

    @GetMapping("/config")
    public Object config() {
        return pdf2HtmlProperties;
    }

    @GetMapping("/metric")
    public Object metric() {
        Map<String, Object> semaphoreMap = new LinkedHashMap<>();
        semaphoreMap.put("availablePermits", semaphore.availablePermits());
        semaphoreMap.put("queueLength", semaphore.getQueueLength());
        Map<String, Object> metricMap = new LinkedHashMap<>();
        metricMap.put("semaphore", semaphoreMap);
        return metricMap;
    }

    @PostMapping("/pdf2html")
    public void pdf2html(@RequestParam("files") MultipartFile[] files,
                         HttpServletResponse response) {
        if (ArrayUtil.isEmpty(files)) {
            log.warn("The number of files is 0");
            return;
        }
        File dir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), IdUtil.simpleUUID()));
        try (ServletOutputStream outputStream = response.getOutputStream()) {
            List<File> fileList = new ArrayList<>(files.length);
            for (MultipartFile f : files) {
                if (f == null || f.isEmpty()) {
                    continue;
                }
                // Write to the local working directory
                File localFile = FileUtil.writeFromStream(f.getInputStream(), FileUtil.file(dir, f.getOriginalFilename()));
                // Only process PDF files
                if (isPdf(localFile)) {
                    fileList.add(localFile);
                }
            }
            if (CollUtil.isEmpty(fileList)) {
                return;
            }
            long start = System.currentTimeMillis();
            int size = fileList.size();
            CountDownLatch latch = new CountDownLatch(size);
            // Collects the PDFs that failed to convert
            Map<String, Throwable> failedMap = new ConcurrentHashMap<>();
            for (File file : fileList) {
                // Limit the number of processes started here: the call below is
                // asynchronous, so this prevents a burst of child processes
                semaphore.acquire();
                // Call the pdf2htmlEX command-line tool asynchronously
                invokeCommand(dir, file, latch, failedMap);
            }
            // Wait for all child processes to finish
            latch.await();
            log.info("pdf2html took {}ms in total, number of PDFs: {}", System.currentTimeMillis() - start, size);
            // Write the statistics to 000-task.txt and copy failed PDFs to a fixed directory
            recordTaskResult(size, failedMap, dir, fileList);
            // Set the headers before any body bytes are written, otherwise they may be ignored
            response.addHeader("Content-Disposition",
                    "attachment;fileName=" + URLEncoder.encode(dir.getName() + ".zip", "UTF-8"));
            response.addHeader("Content-type", "application/zip");
            // Compress the generated HTML files and the task file, and write them to the response
            ZipUtil.zip(outputStream, CharsetUtil.CHARSET_UTF_8, true, new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    if (pathname.isDirectory()) {
                        return true;
                    }
                    String name = pathname.getName().toLowerCase();
                    return name.endsWith(".html") || name.endsWith(".txt");
                }
            }, dir);
        } catch (Throwable e) {
            log.error("pdf2html error", e);
        } finally {
            // Remove the per-request working directory; failed PDFs were already copied to failedPdfDir
            FileUtil.del(dir);
        }
    }

    /**
     * Uses Apache commons-exec to run the pdf2htmlEX command-line tool.
     * For details, see:
     */
    public void invokeCommand(File workDir, File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
        String filePath = file.getAbsolutePath();
        String line = String.format("%s --dest-dir %s %s", pdf2HtmlProperties.getCommand(), workDir.getAbsolutePath(), filePath);
        CommandLine commandLine = CommandLine.parse(line);
        // Command-line timeout handling
        ExecuteWatchdog watchdog = new ExecuteWatchdog(pdf2HtmlProperties.getCommandTimeout().toMillis());
        // Callback invoked when the command finishes
        ResultHandler resultHandler = new ResultHandler(file, latch, failedMap);
        Executor executor = new DefaultExecutor();
        executor.setWatchdog(watchdog);
        try {
            executor.execute(commandLine, resultHandler);
        } catch (Throwable e) {
            String fileName = file.getName();
            if (!failedMap.containsKey(fileName)) {
                failedMap.put(fileName, e);
            }
            latch.countDown();
            semaphore.release();
            log.error("invokeCommand failed, command: {}", line, e);
        }
    }

    public static boolean isPdf(File file) {
        try {
            return PDF.equalsIgnoreCase(FileTypeUtil.getType(file));
        } catch (Exception e) {
            log.error("Failed to detect PDF type, file: {}", file.getAbsolutePath(), e);
            return false;
        }
    }

    public static void recordTaskResult(int total, Map<String, Throwable> failedMap, File workDir, List<File> pdfs) {
        List<String> list = new ArrayList<>();
        list.add("total:" + total);
        list.add("success:" + (total - failedMap.size()));
        list.add("failed:" + failedMap.size());
        list.add("");
        list.add("");
        Set<String> failedNames = failedMap.keySet();
        list.addAll(failedNames);
        // Record the overall result of the task
        FileUtil.writeLines(list, FileUtil.file(workDir, TASK_FILE), CharsetUtil.CHARSET_UTF_8);
        // Copy failed PDFs to another directory; they may need further processing later
        if (CollUtil.isNotEmpty(failedNames)) {
            for (File pdf : pdfs) {
                String name = pdf.getName();
                if (failedNames.contains(name)) {
                    File dest = FileUtil.file(failedPdfDir, name);
                    if (dest.exists()) {
                        dest = FileUtil.file(failedPdfDir, IdUtil.simpleUUID() + "-" + name);
                    }
                    FileUtil.copyFile(pdf, dest);
                }
            }
        }
    }

    /**
     * Adapt this to your own business logic; here it records the failure and prints an error log.
     */
    public static class ResultHandler implements ExecuteResultHandler {

        private final File file;
        private final CountDownLatch latch;
        private final Map<String, Throwable> failedMap;
        @Getter
        private int exitValue = -8686;

        public ResultHandler(File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
            this.file = file;
            this.latch = latch;
            this.failedMap = failedMap;
        }

        @Override
        public void onProcessComplete(int exitValue) {
            this.exitValue = exitValue;
            semaphore.release();
            this.latch.countDown();
        }

        @Override
        public void onProcessFailed(ExecuteException e) {
            this.failedMap.put(this.file.getName(), e);
            semaphore.release();
            this.latch.countDown();
            log.error("pdf2html failed, file: {}", this.file.getAbsolutePath(), e);
        }
    }
}
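
The waiting and bookkeeping pattern used by the controller (a CountDownLatch that every callback counts down, plus a concurrent map of failures that feeds the task summary) can be tried in isolation. The sketch below is illustrative only: conversions are simulated, any file name ending in bad.pdf is made to fail on purpose, a plain thread pool replaces commons-exec, and TaskSummary and convertAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper, not part of pdf2html-service.
class TaskSummary {

    // Runs one simulated conversion per file name and returns the summary lines.
    static List<String> convertAll(List<String> names) {
        CountDownLatch latch = new CountDownLatch(names.size());
        Map<String, Throwable> failed = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String name : names) {
            pool.execute(() -> {
                try {
                    if (name.endsWith("bad.pdf")) {
                        throw new IllegalStateException("simulated conversion failure");
                    }
                } catch (Throwable e) {
                    failed.put(name, e);      // remember which file failed and why
                } finally {
                    latch.countDown();        // always count down, on success or failure
                }
            });
        }
        try {
            latch.await();                    // wait until every task has reported back
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        List<String> lines = new ArrayList<>();
        lines.add("total:" + names.size());
        lines.add("success:" + (names.size() - failed.size()));
        lines.add("failed:" + failed.size());
        lines.addAll(new TreeSet<>(failed.keySet()));   // failed file names, sorted
        return lines;
    }

    public static void main(String[] args) {
        convertAll(Arrays.asList("a.pdf", "bad.pdf", "c.pdf")).forEach(System.out::println);
    }
}
```

Counting down in a finally block matters: if a failure path skipped it, the request thread would wait on the latch forever.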

Closing remarks

Because I am not familiar with front-end development, I have not had time to build even a simple page. If you know front-end development and are interested in this tool, you are welcome to contribute a page. Also, if you know of better tools or approaches for converting PDF to HTML, please leave a comment!

I take notes for my own convenience and for yours.

