How to convert PDF to text with format kept on Linux?

Much of the formatting in a PDF cannot be represented in plain text, but the relative positions of the text should be preserved where possible. For example, table columns should stay aligned.

The pdftotext tool can convert PDF to text pretty well:

pdftotext – Portable Document Format (PDF) to text converter

with the -layout option:

-layout

Maintain (as best as possible) the original physical layout of the text. The default is to 'undo' physical layout (columns, hyphenation, etc.) and output the text in reading order.

$ pdftotext -layout file.pdf file.txt

After this, file.txt will contain the text content of the PDF, with the layout preserved as closely as possible.
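The same command can be applied to many files at once. The following is a minimal sketch, assuming pdftotext is installed and the PDFs all sit in one directory with a lowercase .pdf extension. The function names txt_name_for and convert_all are hypothetical helpers made up for this example, not part of pdftotext.

```shell
#!/bin/sh
# Hypothetical helper: derive the output name by swapping a trailing .pdf for .txt.
txt_name_for() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Hypothetical helper: convert every PDF in a directory (default: the current
# directory), writing each .txt file next to its source PDF.
convert_all() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "$pdf" ] || continue
        pdftotext -layout "$pdf" "$(txt_name_for "$pdf")"
    done
}
```

Once the functions are defined in the shell, running convert_all with a directory argument converts each PDF found there. Quoting the variables keeps file names with spaces intact.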

One Comment

  1. For Linux users, nothing works better than using Calibre to convert PDF files to docx (or any number of other formats). After conversion, clean up the docx using LibreOffice Writer with the Advanced Search and Replace plug-in installed. https://calibre-ebook.com/download_linux
