Convert PDF to HTML without losing text or format.

This project uses Spring Boot to wrap the pdf2htmlEX command-line tool as a web service, which makes converting PDF to HTML more convenient.

For pdf2htmlEX, see:

For the pdf2html-service source code, see:

Quick start

# Pull the image
docker pull iflyendless/pdf2html-service:1.0.1
# Start the container
docker run --name pdf2html -p 8686:8686 -d --rm iflyendless/pdf2html-service:1.0.1

Usage:

curl -o --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/example.pdf'

Note: /pdfs/example.pdf is the absolute path of the PDF file.

Unzip the downloaded archive in the current directory to see the converted HTML files and 000-task.txt.

Build the image

# Download the code
git clone
# Enter the project directory
cd pdf2html-service
# Package, skipping unit tests
mvn clean package -DskipTests
# Build the Docker image
docker build -t pdf2html-service:1.0.1 .

If the image build fails, check whether the JDK version available at the download site is the same as the version downloaded in the Dockerfile.


docker run --name pdf2html -p 8686:8686 -d --rm pdf2html-service:1.0.1

If you need to set extra parameters, pass them with -e when starting the container:

# Maximum number of child processes started at the same time; set it according to the system's resources (default 15)
-e PDF2HTML_MAX_PROCESS=15
# Maximum timeout for running the /usr/local/bin/pdf2htmlEX command; the unit s means seconds (default 600s)
-e PDF2HTML_COMMAND_TIMEOUT=600s

For example:

docker run --name pdf2html -p 8686:8686 -e PDF2HTML_MAX_PROCESS=10 -e PDF2HTML_COMMAND_TIMEOUT=60s -d --rm pdf2html-service:1.0.1

For more configuration options, see the application.yml file in the resources directory.

HTTP API

(1) View version

curl http://localhost:8686/api/version

(2) Check the configuration

curl http://localhost:8686/api/config

(3) Upload multiple PDFs and download the zipped HTML

curl -o --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/001.pdf' --form 'files=@/pdfs/002.pdf' --form 'files=@/pdfs/003.pdf'

Note: /pdfs/001.pdf is the absolute path of the PDF file.
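The same upload can also be done from Java. The following is a minimal sketch, assuming Java 11 or later and the JDK's built-in java.net.http client. The class name Pdf2HtmlClientSketch, the helper names, and the output file passed on the command line are hypothetical and not part of pdf2html-service. Only the endpoint and the files form field are taken from the curl call above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

public class Pdf2HtmlClientSketch {

    /** Builds a multipart/form-data body with one "files" part per PDF. */
    static byte[] multipartBody(String boundary, String[] names, byte[][] contents) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < names.length; i++) {
            String header = "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"files\"; filename=\"" + names[i] + "\"\r\n"
                    + "Content-Type: application/pdf\r\n\r\n";
            out.writeBytes(header.getBytes(StandardCharsets.UTF_8));
            out.writeBytes(contents[i]);
            out.writeBytes("\r\n".getBytes(StandardCharsets.UTF_8));
        }
        // The closing boundary marks the end of the form.
        out.writeBytes(("--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** Posts the PDFs to the service and saves the returned zip to the target path. */
    static void upload(String url, Path target, Path... pdfs) throws IOException, InterruptedException {
        String boundary = "----pdf2html" + UUID.randomUUID().toString().replace("-", "");
        String[] names = new String[pdfs.length];
        byte[][] contents = new byte[pdfs.length][];
        for (int i = 0; i < pdfs.length; i++) {
            names[i] = pdfs[i].getFileName().toString();
            contents[i] = Files.readAllBytes(pdfs[i]);
        }
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(multipartBody(boundary, names, contents)))
                .build();
        HttpResponse<Path> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofFile(target));
        System.out.println("HTTP " + response.statusCode() + ", saved to " + response.body());
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: Pdf2HtmlClientSketch <output.zip> <file.pdf>...");
            return;
        }
        Path[] pdfs = new Path[args.length - 1];
        for (int i = 1; i < args.length; i++) {
            pdfs[i - 1] = Paths.get(args[i]);
        }
        upload("http://localhost:8686/api/pdf2html", Paths.get(args[0]), pdfs);
    }
}
```

The sketch reads each PDF fully into memory, which is fine for a handful of files; for very large uploads a streaming body publisher would be a better fit.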

(4) Query the metrics exposed by the program

curl http://localhost:8686/api/metric

Troubleshooting

# Enter the container
docker exec -it pdf2html bash
# Go to the log directory
cd /opt/pdf2html-service/logs
# View the PDFs that failed to convert
cd /tmp/pdf2html-service/failed-pdfs
# Call the pdf2htmlEX command manually to convert a PDF
pdf2htmlEX --help


Calling the pdf2htmlEX command-line tool by hand every time is inconvenient, so it is wrapped in a web service that is easier to use. For the complete source, see:


Because the pdf2htmlEX command-line tool has complex dependencies and is troublesome to compile, the service starts from the official Docker image and installs a JDK in it. A small Spring Boot web application then receives the user's HTTP request, calls the pdf2htmlEX command-line tool in the background to convert each uploaded PDF to HTML, compresses the resulting HTML into a zip package, and lets the user download it.

The Dockerfile is as follows:

# pdf2htmlex image
FROM pdf2htmlex/pdf2htmlex:0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64
ENV TZ='CST-8'
ENV LANG C.UTF-8
# apt
RUN sed -i s@/ /etc/apt/sources.list
RUN apt-get clean && apt-get update
RUN apt-get install -y vim curl htop net-tools
# vim
RUN echo "set fileencodings=utf-8,ucs-bom,gb18030,gbk,gb2312,cp936" >> /etc/vim/vimrc
RUN echo "set termencoding=utf-8" >> /etc/vim/vimrc
RUN echo "set encoding=utf-8" >> /etc/vim/vimrc
# jdk
ADD /tmp/
RUN tar -zxf /tmp/jdk-*.tar.gz -C /opt/ && rm -f /tmp/jdk-*.tar.gz && mv /opt/jdk* /opt/jdk
ENV JAVA_HOME /opt/jdk
ENV PATH ${JAVA_HOME}/bin:$PATH
# pdf2html-service
COPY target/pdf2html-service-*.tar.gz /tmp/
RUN tar -zxf /tmp/pdf2html-service-*.tar.gz -C /opt/ && rm -f /tmp/pdf2html-service-*.tar.gz
ENTRYPOINT [""]
WORKDIR /opt/pdf2html-service
CMD ["bash","-c","./ && tail -f /dev/null"]

Add the dependencies

</properties>
<dependencies>
    <dependency>
    </dependency>
    <dependency>
    </dependency>
    <dependency>
    </dependency>
    <dependency>
    </dependency>
    <dependency>
    </dependency>
</dependencies>

This is a Spring Boot application:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class Pdf2HtmlService {
    public static void main(String[] args) {
        SpringApplication.run(Pdf2HtmlService.class, args);
    }
}

Application configuration

application.yml is as follows:

server:
  port: ${APP_PORT:8686}
  servlet.context-path: /

pdf2html:
  # /usr/local/bin/pdf2htmlEX --zoom 1.3
  command: ${PDF2HTML_COMMAND:/usr/local/bin/pdf2htmlEX --zoom 1.3 --quiet 1}
  command-timeout: ${PDF2HTML_COMMAND_TIMEOUT:600s}
  work-dir: ${PDF2HTML_WORK_DIR:/tmp/pdf2html-service}
  max-process: ${PDF2HTML_MAX_PROCESS:15}

spring:
  application:
    name: pdf2html-service
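Each value of the form ${NAME:default} is read from the environment variable NAME and falls back to the default after the colon, which is why the -e flags on docker run can override the settings. The following is a rough sketch of that lookup in plain Java, together with a simplified parser for timeout values such as 600s. It is an illustration only, not Spring's actual implementation. The class and method names are hypothetical, and the parser handles only the seconds suffix used in this article.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

public class ConfigDefaultsSketch {

    /** Mimics ${NAME:default}: use the environment value if it is set, otherwise the default. */
    static String resolve(Map<String, String> env, String name, String defaultValue) {
        return env.getOrDefault(name, defaultValue);
    }

    /** Parses values such as "600s" or "60s"; only the seconds suffix is handled here. */
    static Duration parseSeconds(String text) {
        String trimmed = text.trim();
        if (!trimmed.endsWith("s")) {
            throw new IllegalArgumentException("expected a value like 600s but got: " + text);
        }
        return Duration.ofSeconds(Long.parseLong(trimmed.substring(0, trimmed.length() - 1)));
    }

    public static void main(String[] args) {
        // Simulates: docker run -e PDF2HTML_COMMAND_TIMEOUT=60s ... (max-process left at its default)
        Map<String, String> env = new HashMap<>();
        env.put("PDF2HTML_COMMAND_TIMEOUT", "60s");

        int maxProcess = Integer.parseInt(resolve(env, "PDF2HTML_MAX_PROCESS", "15"));
        Duration timeout = parseSeconds(resolve(env, "PDF2HTML_COMMAND_TIMEOUT", "600s"));

        System.out.println("max-process=" + maxProcess + ", command-timeout=" + timeout.toMillis() + "ms");
    }
}
```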

The corresponding Pdf2HtmlProperties class is as follows:

import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import java.time.Duration;

@Data
@ConfigurationProperties(prefix = "pdf2html")
public class Pdf2HtmlProperties {
    private String command;
    private String workDir;
    private Duration commandTimeout;
    // Maximum number of child processes started at the same time; set it according to the performance of the system
    private int maxProcess;
}

A brief explanation of these settings:

  • command: The full pdf2htmlEX command line that the service invokes. For the detailed parameters, see pdf2htmlEX --help.
  • command-timeout: The service uses Apache commons-exec to call the command line asynchronously, and a maximum timeout can be set. For commons-exec, see:
  • work-dir: The working directory of the web application. When a request arrives, the PDF files are first written to a subdirectory of this directory. The HTML generated by pdf2htmlEX also goes to that subdirectory by default, and the HTML files are then compressed and written to the response. Note also that PDFs that fail to convert are written to failed-pdfs under work-dir, which makes problems easier to reproduce and troubleshoot.
  • max-process: Because the implementation invokes the command-line tool asynchronously, the number of commands started at the same time must be limited. Otherwise a large number of child processes could be spawned in a short time, which not only seriously degrades performance but can also make the system freeze. This setting therefore caps the number of child processes that run at the same time, and should be set according to the performance of the system. The limit is enforced with the JDK's built-in java.util.concurrent.Semaphore.
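The max-process limit described above can be sketched with plain JDK classes. This is a minimal illustration, not the service's code: a thread that sleeps stands in for a pdf2htmlEX child process, and the class name ProcessLimitSketch and the method runLimited are hypothetical. The submitting thread takes a permit before starting each job, every job returns its permit when it ends, and a CountDownLatch lets the caller wait for all of them.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class ProcessLimitSketch {

    /**
     * Runs the given number of simulated jobs with at most maxProcess running
     * at once and returns the highest concurrency that was observed.
     */
    static int runLimited(int tasks, int maxProcess) {
        Semaphore semaphore = new Semaphore(maxProcess);
        CountDownLatch latch = new CountDownLatch(tasks);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int i = 0; i < tasks; i++) {
                // Block the submitting thread until a permit is free.
                semaphore.acquire();
                pool.execute(() -> {
                    try {
                        int now = running.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(20); // stands in for a pdf2htmlEX child process
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                        // Always return the permit and count down, even on failure.
                        semaphore.release();
                        latch.countDown();
                    }
                });
            }
            // Wait until every job has reported completion.
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + runLimited(10, 3));
    }
}
```

Releasing the permit in a finally block matters: a job that fails without returning its permit would permanently lower the number of conversions that can run at once.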

API implementation

The API implementation is not complicated, and comments have been added at the key points. It is as follows:

import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.io.FileTypeUtil;
import cn.hutool.core.io.FileUtil;
import cn.hutool.core.util.ArrayUtil;
import cn.hutool.core.util.CharsetUtil;
import cn.hutool.core.util.IdUtil;
import cn.hutool.core.util.ZipUtil;
import com.github.iflyendless.config.Pdf2HtmlProperties;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.exec.*;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServletResponse;
import java.io.File;
import java.io.FileFilter;
import java.net.URLEncoder;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;

@Slf4j
@RestController
@RequestMapping("/api")
public class Pdf2HtmlController {

    private static final String PDF = "pdf";
    private static final String FAILED_PDF_DIR = "failed-pdfs";
    private static final String TASK_FILE = "000-task.txt";

    @Resource
    private Pdf2HtmlProperties pdf2HtmlProperties;

    // Limits the number of pdf2htmlEX child processes started at the same time
    private static Semaphore semaphore;

    // PDFs that fail to convert are written to this directory, so they can be converted manually later
    private static File failedPdfDir;

    @PostConstruct
    public void init() {
        semaphore = new Semaphore(pdf2HtmlProperties.getMaxProcess());
        failedPdfDir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), FAILED_PDF_DIR));
    }

    @GetMapping("/version")
    public Object version() {
        return "1.0.1";
    }

    @GetMapping("/config")
    public Object config() {
        return pdf2HtmlProperties;
    }

    @GetMapping("/metric")
    public Object metric() {
        Map<String, Object> semaphoreMap = new LinkedHashMap<>();
        semaphoreMap.put("availablePermits", semaphore.availablePermits());
        semaphoreMap.put("queueLength", semaphore.getQueueLength());

        Map<String, Object> metricMap = new LinkedHashMap<>();
        metricMap.put("semaphore", semaphoreMap);

        return metricMap;
    }

    @PostMapping("/pdf2html")
    public void pdf2html(@RequestParam("files") MultipartFile[] files,
                         HttpServletResponse response) {
        if (ArrayUtil.isEmpty(files)) {
            log.warn("The number of files is 0");
            return;
        }

        File dir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), IdUtil.simpleUUID()));

        try (ServletOutputStream outputStream = response.getOutputStream()) {
            List<File> fileList = new ArrayList<>(files.length);
            for (MultipartFile f : files) {
                if (f == null || f.isEmpty()) {
                    continue;
                }
                // Write to the local working directory
                File localFile = FileUtil.writeFromStream(f.getInputStream(), FileUtil.file(dir, f.getOriginalFilename()));
                // Only process PDF files
                if (isPdf(localFile)) {
                    fileList.add(localFile);
                }
            }

            if (CollUtil.isEmpty(fileList)) {
                return;
            }

            long start = System.currentTimeMillis();

            int size = fileList.size();
            CountDownLatch latch = new CountDownLatch(size);
            // Statistics on the PDFs that failed to convert
            Map<String, Throwable> failedMap = new ConcurrentHashMap<>();

            for (File file : fileList) {
                // Limit the number of processes started here:
                // the calls below are asynchronous, so this prevents a burst of child processes
                semaphore.acquire();
                // Call the pdf2htmlEX command-line tool asynchronously
                invokeCommand(dir, file, latch, failedMap);
            }

            // Wait for all child processes to finish
            latch.await();
            log.info("pdf2html took {}ms in total, number of PDFs: {}", System.currentTimeMillis() - start, size);

            // Write the statistics to 000-task.txt and copy the PDFs that failed to convert to a fixed directory
            recordTaskResult(size, failedMap, dir, fileList);

            // Set the headers before writing the body, so they are sent even if the response is committed while streaming
            response.addHeader("Content-Disposition",
                    "attachment;fileName=" + URLEncoder.encode(dir.getName() + ".zip", "UTF-8"));
            response.addHeader("Content-type", "application/zip");

            // Compress the generated HTML files and 000-task.txt, and write the archive to the response
            ZipUtil.zip(outputStream, CharsetUtil.CHARSET_UTF_8, true, new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    if (pathname.isDirectory()) {
                        return true;
                    }
                    String name = pathname.getName().toLowerCase();
                    return name.endsWith(".html") || name.endsWith(".txt");
                }
            }, dir);
        } catch (Throwable e) {
            log.error("pdf2html error", e);
        }
    }

    /**
     * Uses Apache commons-exec to run the pdf2htmlEX command-line tool.
     * For details, see:
     */
    public void invokeCommand(File workDir, File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
        String filePath = file.getAbsolutePath();

        String line = String.format("%s --dest-dir %s %s", pdf2HtmlProperties.getCommand(), workDir.getAbsolutePath(), filePath);
        CommandLine commandLine = CommandLine.parse(line);

        // Command-line timeout handling
        ExecuteWatchdog watchdog = new ExecuteWatchdog(1000 * pdf2HtmlProperties.getCommandTimeout().getSeconds());
        // Callback invoked when the command finishes
        ResultHandler resultHandler = new ResultHandler(file, latch, failedMap);

        Executor executor = new DefaultExecutor();
        executor.setWatchdog(watchdog);

        try {
            executor.execute(commandLine, resultHandler);
        } catch (Throwable e) {
            String fileName = file.getName();
            if (!failedMap.containsKey(fileName)) {
                failedMap.put(fileName, e);
            }
            // The callback never runs in this case, so release the permit taken in pdf2html() here
            semaphore.release();
            latch.countDown();

            log.error("invokeCommand failed, command: {}, error:{}", line, e);
        }
    }

    public static boolean isPdf(File file) {
        try {
            return PDF.equalsIgnoreCase(FileTypeUtil.getType(file));
        } catch (Exception e) {
            log.error("Failed to detect PDF type, file: {}, error: {}", file.getAbsolutePath(), e);
            return false;
        }
    }

    public static void recordTaskResult(int total, Map<String, Throwable> failedMap, File workDir, List<File> pdfs) {
        List<String> list = new ArrayList<>();
        list.add("total:" + total);
        list.add("success:" + (total - failedMap.size()));
        list.add("failed:" + failedMap.size());

        list.add("");
        list.add("");

        Set<String> failedNames = failedMap.keySet();
        list.addAll(failedNames);

        // Record the overall result of the task
        FileUtil.writeLines(list, FileUtil.file(workDir, TASK_FILE), CharsetUtil.CHARSET_UTF_8);

        // Copy the PDFs that failed to convert to another directory; they may need further processing later
        if (CollUtil.isNotEmpty(failedNames)) {
            for (File pdf : pdfs) {
                String name = pdf.getName();
                if (failedNames.contains(name)) {
                    File dest = FileUtil.file(failedPdfDir, name);
                    if (dest.exists()) {
                        dest = FileUtil.file(failedPdfDir, IdUtil.simpleUUID() + "-" + name);
                    }
                    FileUtil.copyFile(pdf, dest);
                }
            }
        }
    }

    /**
     * Implement this according to your own business logic; here only the error log is printed.
     */
    public static class ResultHandler implements ExecuteResultHandler {

        private final File file;
        private final CountDownLatch latch;
        private final Map<String, Throwable> failedMap;

        @Getter
        private int exitValue = -8686;

        public ResultHandler(File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
            this.file = file;
            this.latch = latch;
            this.failedMap = failedMap;
        }

        @Override
        public void onProcessComplete(int exitValue) {
            // Release the permit taken in pdf2html()
            semaphore.release();
            this.latch.countDown();

            this.exitValue = exitValue;
        }

        @Override
        public void onProcessFailed(ExecuteException e) {
            this.failedMap.put(this.file.getName(), e);
            // Release the permit taken in pdf2html()
            semaphore.release();
            this.latch.countDown();

            log.error("pdf2html failed, file: {}, error:{}", this.file.getAbsolutePath(), e);
        }
    }
}
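Going by recordTaskResult, 000-task.txt starts with three counter lines (total, success, failed), followed by two blank lines and then one failed file name per line. A client that unzips the result could read it back with something like the sketch below. The class TaskFileSketch and its members are hypothetical names used only for illustration, and the parser assumes exactly that layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaskFileSketch {

    /** Counters and failed file names read from 000-task.txt. */
    static final class TaskResult {
        final int total;
        final int success;
        final int failed;
        final List<String> failedNames;

        TaskResult(int total, int success, int failed, List<String> failedNames) {
            this.total = total;
            this.success = success;
            this.failed = failed;
            this.failedNames = failedNames;
        }
    }

    /** Reads the number after the first ':' in a line such as "total:3". */
    static int counter(String line) {
        return Integer.parseInt(line.substring(line.indexOf(':') + 1).trim());
    }

    /** Parses the lines of 000-task.txt, assuming the layout described above. */
    static TaskResult parse(List<String> lines) {
        if (lines.size() < 3) {
            throw new IllegalArgumentException("expected total, success and failed lines");
        }
        List<String> failedNames = new ArrayList<>();
        // Everything after the three counters is either a blank separator or a file name.
        for (String line : lines.subList(3, lines.size())) {
            if (!line.trim().isEmpty()) {
                failedNames.add(line);
            }
        }
        return new TaskResult(counter(lines.get(0)), counter(lines.get(1)), counter(lines.get(2)), failedNames);
    }

    public static void main(String[] args) {
        TaskResult result = parse(Arrays.asList("total:3", "success:2", "failed:1", "", "", "002.pdf"));
        System.out.println(result.total + " total, " + result.failed + " failed: " + result.failedNames);
    }
}
```

In a real client the lines would come from the extracted file, for example via java.nio.file.Files.readAllLines.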

Closing notes

I am not familiar with front-end development and have not had time to build even a simple page. If you know front-end development and are interested in this tool, you are welcome to contribute a page. Also, if you know of a better tool or implementation for converting PDF to HTML, please leave a comment.

Taking notes is convenient for both you and me.
