Convert PDF to HTML without losing text or format.

This project uses Spring Boot to wrap the pdf2htmlEX command-line tool as a web service, which makes converting PDF to HTML more convenient.

For pdf2htmlEX, see:

https://github.com/pdf2htmlEX/pdf2htmlEX

For the pdf2html-service source code, see:

https://github.com/iflyendless/pdf2html-service

Quick start

# Pull the image
docker pull iflyendless/pdf2html-service:1.0.1
# Start the container
docker run --name pdf2html -p 8686:8686 -d --rm iflyendless/pdf2html-service:1.0.1

Usage:

curl -o html.zip --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/example.pdf'

Note: /pdfs/example.pdf is the absolute path of the PDF file.

Unzip html.zip in the current directory to see the converted HTML files and 000-task.txt.
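Judging from the recordTaskResult method shown later in this post, 000-task.txt holds three counters (total, success, failed) followed by a "failed-pdfs:" list of file names. As a rough sketch only, a client could read that summary like this. TaskSummary and parse are hypothetical names chosen for illustration and are not part of pdf2html-service.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of pdf2html-service.
final class TaskSummary {
    final int total;
    final int success;
    final int failed;
    final List<String> failedPdfs;

    private TaskSummary(int total, int success, int failed, List<String> failedPdfs) {
        this.total = total;
        this.success = success;
        this.failed = failed;
        this.failedPdfs = Collections.unmodifiableList(failedPdfs);
    }

    /** Parses the lines of 000-task.txt, assuming the layout written by recordTaskResult. */
    static TaskSummary parse(List<String> lines) {
        int total = 0;
        int success = 0;
        int failed = 0;
        List<String> failedPdfs = new ArrayList<>();
        boolean inFailedList = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank separator lines carry no information
            }
            if (inFailedList) {
                // Everything after "failed-pdfs:" is a file name.
                failedPdfs.add(line);
            } else if (line.equals("failed-pdfs:")) {
                inFailedList = true;
            } else if (line.startsWith("total:")) {
                total = Integer.parseInt(line.substring("total:".length()));
            } else if (line.startsWith("success:")) {
                success = Integer.parseInt(line.substring("success:".length()));
            } else if (line.startsWith("failed:")) {
                failed = Integer.parseInt(line.substring("failed:".length()));
            }
        }
        return new TaskSummary(total, success, failed, failedPdfs);
    }
}
```

After unzipping, something like TaskSummary.parse(Files.readAllLines(Paths.get("000-task.txt"))) would give the counters and the names of the PDFs that need another attempt.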

Build the image

# Download the code
git clone https://github.com/iflyendless/pdf2html-service.git
# Enter the project directory
cd pdf2html-service
# Package, skipping unit tests
mvn clean package -DskipTests
# Build the Docker image
docker build -t pdf2html-service:1.0.1 .

If the image build fails, check that the JDK version available at https://enos.itcollege.ee/~jpoial/allalaadimised/jdk8/ matches the version downloaded in the Dockerfile.

Start

docker run --name pdf2html -p 8686:8686 -d --rm pdf2html-service:1.0.1

If you need to set extra parameters, pass them with -e when starting the container:

# Maximum number of child processes running at the same time; set it according to the system's resources (default 15)
-e PDF2HTML_MAX_PROCESS=15
# Maximum timeout for executing the /usr/local/bin/pdf2htmlEX command; the s suffix means seconds (default 600s)
-e PDF2HTML_COMMAND_TIMEOUT=600s

For example:

docker run --name pdf2html -p 8686:8686 -e PDF2HTML_MAX_PROCESS=10 -e PDF2HTML_COMMAND_TIMEOUT=60s -d --rm pdf2html-service:1.0.1

For more configuration options, see the application.yml file in the resources directory.

HTTP API

(1) View version

curl http://localhost:8686/api/version

(2) Check the configuration

curl http://localhost:8686/api/config

(3) Upload multiple PDFs and download the HTML as a zip archive

curl -o html.zip --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/001.pdf' --form 'files=@/pdfs/002.pdf' --form 'files=@/pdfs/003.pdf'

Note: /pdfs/001.pdf is the absolute path of the PDF file.

(4) Query the metrics exposed by the program

curl http://localhost:8686/api/metric

Troubleshooting

# Enter the container
docker exec -it pdf2html bash
# Go to the log directory
cd /opt/pdf2html-service/logs
# Go to the directory holding the PDFs that failed to convert
cd /tmp/pdf2html-service/failed-pdfs
# Run the pdf2htmlEX command manually to convert a PDF
pdf2htmlEX --help

Implementation

Calling the pdf2htmlEX command-line tool by hand every time is not very convenient, so wrapping it in a web service makes it easier to use. For the complete source, see:

https://github.com/iflyendless/pdf2html-service

Approach

The pdf2htmlEX command-line tool has complex dependencies and is troublesome to compile. The approach is therefore to start from the official Docker image, install a JDK in it, and use Spring Boot to quickly write a web application. The application receives the user's HTTP request, calls the pdf2htmlEX command-line tool in the background to convert all the uploaded PDFs to HTML, compresses the resulting HTML into a zip archive, and lets the user download it.

The Dockerfile is as follows:

# pdf2htmlex image
FROM pdf2htmlex/pdf2htmlex:0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64

ENV TZ='CST-8'
ENV LANG C.UTF-8

# apt
RUN sed -i s@/archive.ubuntu.com/@/mirrors.aliyun.com/@g /etc/apt/sources.list
RUN apt-get clean && apt-get update
RUN apt-get install -y vim curl htop net-tools

# vim
RUN echo "set fileencodings=utf-8,ucs-bom,gb18030,gbk,gb2312,cp936" >> /etc/vim/vimrc
RUN echo "set termencoding=utf-8" >> /etc/vim/vimrc
RUN echo "set encoding=utf-8" >> /etc/vim/vimrc

# jdk
ADD https://enos.itcollege.ee/~jpoial/allalaadimised/jdk8/jdk-8u291-linux-x64.tar.gz /tmp/
RUN tar -zxf /tmp/jdk-*.tar.gz -C /opt/ && rm -f /tmp/jdk-*.tar.gz && mv /opt/jdk* /opt/jdk

ENV JAVA_HOME /opt/jdk
ENV PATH ${JAVA_HOME}/bin:$PATH

# pdf2html-service
COPY target/pdf2html-service-*.tar.gz /tmp/
RUN tar -zxf /tmp/pdf2html-service-*.tar.gz -C /opt/ && rm -f /tmp/pdf2html-service-*.tar.gz

ENTRYPOINT [""]
WORKDIR /opt/pdf2html-service
CMD ["bash","-c","./start.sh && tail -f /dev/null"]

Add the dependencies

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <java.version>1.8</java.version>
    <maven.build.timestamp.format>yyyyMMdd</maven.build.timestamp.format>
    <hutool.version>5.6.3</hutool.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-configuration-processor</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-exec</artifactId>
        <version>1.3</version>
    </dependency>
    <dependency>
        <groupId>cn.hutool</groupId>
        <artifactId>hutool-all</artifactId>
        <version>${hutool.version}</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
    </dependency>
</dependencies>

This is a Spring Boot application:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.context.properties.ConfigurationPropertiesScan;

@SpringBootApplication
@ConfigurationPropertiesScan
public class Pdf2HtmlService {

    public static void main(String[] args) {
        SpringApplication.run(Pdf2HtmlService.class);
    }
}

Application configuration

application.yml is as follows:

server:
  port: ${APP_PORT:8686}
  servlet.context-path: /

pdf2html:
  # /usr/local/bin/pdf2htmlEX --zoom 1.3
  command: ${PDF2HTML_COMMAND:/usr/local/bin/pdf2htmlEX --zoom 1.3 --quiet 1}
  command-timeout: ${PDF2HTML_COMMAND_TIMEOUT:600s}
  work-dir: ${PDF2HTML_WORK_DIR:/tmp/pdf2html-service}
  max-process: ${PDF2HTML_MAX_PROCESS:15}

spring:
  application:
    name: pdf2html-service

The corresponding Pdf2HtmlProperties class is as follows:

import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;

import java.time.Duration;

@Data
@ConfigurationProperties(prefix = "pdf2html")
public class Pdf2HtmlProperties {

    private String command;

    private String workDir;

    private Duration commandTimeout;

    // Maximum number of child processes running at the same time; set it according to the performance of the system
    private int maxProcess;
}

A brief explanation of these settings:

  • command: The exact command used to call the pdf2htmlEX command-line tool. For the available parameters, see pdf2htmlEX --help.
  • command-timeout: The service uses Apache commons-exec to call the command line asynchronously, and a maximum timeout can be set. For commons-exec, see: https://commons.apache.org/proper/commons-exec/tutorial.html
  • work-dir: The working directory of the web application. When a request arrives, the uploaded PDF files are first written to a subdirectory of this directory. The HTML generated by pdf2htmlEX goes to the same subdirectory, and the HTML files are then compressed and written to the response. Note also that PDFs that fail to convert are copied to failed-pdfs under work-dir, which makes problems easier to reproduce and troubleshoot.
  • max-process: Because this implementation calls the command-line tool asynchronously, the number of commands running at the same time must be limited. Otherwise a large number of child processes could be spawned in a short time, which not only hurts the program's performance badly but can also make the whole system freeze. This setting caps the number of child processes that can run at the same time and should be chosen according to the performance of the system. The limit is enforced with the JDK's java.util.concurrent.Semaphore.
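The max-process idea can be illustrated with a small JDK-only sketch: a Semaphore caps how many jobs run at once, and a CountDownLatch lets the caller wait for all of them. This is not the project's code. BoundedLauncher and runAll are hypothetical names, and a short Thread.sleep stands in for the external pdf2htmlEX process.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of pdf2html-service.
final class BoundedLauncher {

    /**
     * Runs taskCount simulated jobs asynchronously, never more than maxProcess at once,
     * and returns the highest number of jobs observed running at the same time.
     */
    static int runAll(int taskCount, int maxProcess) {
        Semaphore semaphore = new Semaphore(maxProcess);
        CountDownLatch latch = new CountDownLatch(taskCount);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int i = 0; i < taskCount; i++) {
                // Blocks the submitting thread until a permit is free.
                semaphore.acquire();
                pool.execute(() -> {
                    try {
                        int now = running.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(20); // stand-in for the external command
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                        semaphore.release(); // free the slot for the next job
                        latch.countDown();   // signal that this job is done
                    }
                });
            }
            latch.await(); // wait for every job to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }
}
```

Calling BoundedLauncher.runAll(10, 3) starts ten jobs but should never report more than three running together. The permit is acquired on the submitting thread and released in the job's finally block, so a failing job cannot leak a slot.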

Interface implementation

The interface implementation is not complicated, and comments have been added at the key points. It is as follows:

import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.io.FileTypeUtil;
import cn.hutool.core.io.FileUtil;
import cn.hutool.core.util.ArrayUtil;
import cn.hutool.core.util.CharsetUtil;
import cn.hutool.core.util.IdUtil;
import cn.hutool.core.util.ZipUtil;
import com.github.iflyendless.config.Pdf2HtmlProperties;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.exec.*;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServletResponse;
import java.io.File;
import java.io.FileFilter;
import java.net.URLEncoder;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;

@Slf4j
@RestController
@RequestMapping("/api")
public class Pdf2HtmlController {

    private static final String PDF = "pdf";
    private static final String FAILED_PDF_DIR = "failed-pdfs";
    private static final String TASK_FILE = "000-task.txt";

    @Resource
    private Pdf2HtmlProperties pdf2HtmlProperties;

    // Limits the number of pdf2htmlEX child processes running at the same time
    private static Semaphore semaphore;

    // PDFs that fail to convert are copied to this directory so they can be converted manually later
    private static File failedPdfDir;

    @PostConstruct
    public void init() {
        semaphore = new Semaphore(pdf2HtmlProperties.getMaxProcess());
        failedPdfDir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), FAILED_PDF_DIR));
    }

    @GetMapping("/version")
    public Object version() {
        return "1.0.1";
    }

    @GetMapping("/config")
    public Object config() {
        return pdf2HtmlProperties;
    }

    @GetMapping("/metric")
    public Object metric() {
        Map<String, Object> semaphoreMap = new LinkedHashMap<>();
        semaphoreMap.put("availablePermits", semaphore.availablePermits());
        semaphoreMap.put("queueLength", semaphore.getQueueLength());

        Map<String, Object> metricMap = new LinkedHashMap<>();
        metricMap.put("semaphore", semaphoreMap);

        return metricMap;
    }

    @PostMapping("/pdf2html")
    public void pdf2html(@RequestParam("files") MultipartFile[] files,
                         HttpServletResponse response) {
        if (ArrayUtil.isEmpty(files)) {
            log.warn("No files were uploaded");
            return;
        }

        File dir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), IdUtil.simpleUUID()));

        try (ServletOutputStream outputStream = response.getOutputStream()) {
            List<File> fileList = new ArrayList<>(files.length);
            for (MultipartFile f : files) {
                if (f == null || f.isEmpty()) {
                    continue;
                }
                // Write the upload to the local working directory
                File localFile = FileUtil.writeFromStream(f.getInputStream(), FileUtil.file(dir, f.getOriginalFilename()));
                // Only PDF files are processed
                if (isPdf(localFile)) {
                    fileList.add(localFile);
                }
            }

            if (CollUtil.isEmpty(fileList)) {
                return;
            }

            long start = System.currentTimeMillis();

            int size = fileList.size();
            CountDownLatch latch = new CountDownLatch(size);
            // Collects the PDFs that failed to convert
            Map<String, Throwable> failedMap = new ConcurrentHashMap<>();

            for (File file : fileList) {
                // Limit the number of running processes here: the call below is asynchronous,
                // so without the limit a large number of child processes could be spawned at once
                semaphore.acquire();
                // Call the pdf2htmlEX command-line tool asynchronously
                invokeCommand(dir, file, latch, failedMap);
            }

            // Wait for all child processes to finish
            latch.await();

            log.info("pdf2html took {} ms in total, number of PDFs: {}", System.currentTimeMillis() - start, size);

            // Write the statistics to 000-task.txt and copy the PDFs that failed to convert to a fixed directory
            recordTaskResult(size, failedMap, dir, fileList);

            // Set the headers before writing the body; headers added after the response is committed are ignored
            response.addHeader("Content-Disposition",
                    "attachment;fileName=" + URLEncoder.encode(dir.getName() + ".zip", "UTF-8"));
            response.addHeader("Content-type", "application/zip");

            // Compress the generated HTML files and 000-task.txt and write them to the response
            ZipUtil.zip(outputStream, CharsetUtil.CHARSET_UTF_8, true, new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    if (pathname.isDirectory()) {
                        return true;
                    }
                    String name = pathname.getName().toLowerCase();
                    return name.endsWith(".html") || name.endsWith(".txt");
                }
            }, dir);
        } catch (Throwable e) {
            log.error("pdf2html error", e);
        } finally {
            FileUtil.del(dir);
        }
    }

    /**
     * Uses Apache commons-exec to run the pdf2htmlEX command-line tool.
     * For details, see: https://commons.apache.org/proper/commons-exec/tutorial.html
     */
    public void invokeCommand(File workDir, File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
        String filePath = file.getAbsolutePath();

        String line = String.format("%s --dest-dir %s %s", pdf2HtmlProperties.getCommand(), workDir.getAbsolutePath(), filePath);
        CommandLine commandLine = CommandLine.parse(line);

        // Timeout handling for the command
        ExecuteWatchdog watchdog = new ExecuteWatchdog(1000 * pdf2HtmlProperties.getCommandTimeout().getSeconds());
        // Callback invoked when the command finishes
        ResultHandler resultHandler = new ResultHandler(file, latch, failedMap);

        Executor executor = new DefaultExecutor();
        executor.setExitValue(0);
        executor.setWatchdog(watchdog);

        try {
            executor.execute(commandLine, resultHandler);
        } catch (Throwable e) {
            semaphore.release();
            String fileName = file.getName();
            if (!failedMap.containsKey(fileName)) {
                failedMap.put(fileName, e);
            }
            latch.countDown();

            // The trailing Throwable is logged with its stack trace, so it needs no placeholder
            log.error("invokeCommand failed, command: {}", line, e);
        }
    }

    public static boolean isPdf(File file) {
        try {
            return PDF.equalsIgnoreCase(FileTypeUtil.getType(file));
        } catch (Exception e) {
            log.error("Failed to detect the PDF type, file: {}", file.getAbsolutePath(), e);
            return false;
        }
    }

    public static void recordTaskResult(int total, Map<String, Throwable> failedMap, File workDir, List<File> pdfs) {
        List<String> list = new ArrayList<>();
        list.add("total:" + total);
        list.add("success:" + (total - failedMap.size()));
        list.add("failed:" + failedMap.size());

        list.add("");
        list.add("failed-pdfs:");
        list.add("");

        Set<String> failedNames = failedMap.keySet();
        list.addAll(failedNames);

        // Record a summary of the task
        FileUtil.writeLines(list, FileUtil.file(workDir, TASK_FILE), CharsetUtil.CHARSET_UTF_8);

        // Copy the PDFs that failed to convert to another directory; they may need further processing later
        if (CollUtil.isNotEmpty(failedNames)) {
            for (File pdf : pdfs) {
                String name = pdf.getName();
                if (failedNames.contains(name)) {
                    File dest = FileUtil.file(failedPdfDir, name);
                    if (dest.exists()) {
                        dest = FileUtil.file(failedPdfDir, IdUtil.simpleUUID() + "-" + name);
                    }
                    FileUtil.copyFile(pdf, dest);
                }
            }
        }
    }

    /**
     * Adapt this to your own business logic; here it only prints an error log.
     */
    public static class ResultHandler implements ExecuteResultHandler {

        private final File file;
        private final CountDownLatch latch;
        private final Map<String, Throwable> failedMap;

        @Getter
        private int exitValue = -8686;

        public ResultHandler(File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
            this.file = file;
            this.latch = latch;
            this.failedMap = failedMap;
        }

        @Override
        public void onProcessComplete(int exitValue) {
            semaphore.release();
            this.latch.countDown();

            this.exitValue = exitValue;
        }

        @Override
        public void onProcessFailed(ExecuteException e) {
            semaphore.release();
            this.failedMap.put(this.file.getName(), e);
            this.latch.countDown();

            log.error("pdf2html failed, file: {}", this.file.getAbsolutePath(), e);
        }
    }
}
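The controller above relies on Hutool's ZipUtil with a FileFilter so that only .html and .txt files end up in the archive. As a rough JDK-only illustration of the same filtering idea (an alternative sketch, not the project's code), the following uses java.util.zip directly. HtmlZipper and zipHtmlAndTxt are hypothetical names, and unlike the controller's call this sketch stores paths relative to the directory rather than including the directory itself.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; not part of pdf2html-service.
final class HtmlZipper {

    /**
     * Writes every .html and .txt file under dir to out as a zip archive
     * and returns the entry names. Note that out is closed when this returns.
     */
    static List<String> zipHtmlAndTxt(Path dir, OutputStream out) {
        List<String> names = new ArrayList<>();
        try (ZipOutputStream zip = new ZipOutputStream(out);
             Stream<Path> walk = Files.walk(dir)) {
            List<Path> files = walk.filter(Files::isRegularFile)
                    .filter(HtmlZipper::wanted)
                    .sorted()
                    .collect(Collectors.toList());
            for (Path file : files) {
                // Zip entry names use forward slashes on every platform.
                String name = dir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);
                zip.closeEntry();
                names.add(name);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Same rule as the controller's FileFilter: keep only HTML and text files.
    private static boolean wanted(Path file) {
        String name = file.getFileName().toString().toLowerCase();
        return name.endsWith(".html") || name.endsWith(".txt");
    }
}
```

Source PDFs and any other intermediate files in the working directory are skipped, so the download contains only the converted pages and the task summary.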

Closing notes

I am not familiar with front-end development and have not had time to build even a simple page, so the service has no UI. If you know front-end development and are interested in this tool, a contributed page would be very welcome! Also, if you know of better tools or implementations for converting PDF to HTML, please leave a comment!

I keep notes like this for your convenience and mine.
