catdoc (1) Linux Manual Page
NAME
catdoc – reads MS-Word file and puts its content as plain text on standard output
SYNOPSIS
catdoc [-vlu8btawxV] [-m number] [-s charset] [-d charset] [-f output-format] file
DESCRIPTION
catdoc behaves much like cat(1), but it reads an MS-Word file and produces human-readable text on standard output. Optionally, it can use latex(1) escape sequences for characters that have special meaning in LaTeX. It also makes some effort to recognize MS-Word tables, although it never tries to write correct headers for the LaTeX tabular environment. Additional output formats, such as HTML, can be easily defined.
catdoc doesn’t attempt to extract formatting information other than tables from an MS-Word document, so the different output modes mainly determine which characters are escaped and how characters missing from the output charset are represented. See CHARACTER SUBSTITUTION below.
catdoc uses an internal unicode(4) representation of text, so it can convert text when the charset of the source document doesn’t match the charset of the target system. See CHARACTER SETS below.
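As an illustration of the conversion described above, the following Python sketch assembles a catdoc command line that names the source and destination charsets explicitly with the -s and -d options. The helper name charset_args, the file name report.doc, and the charset names cp1252 and utf-8 are assumptions made for the example; catdoc -l lists the names actually available on a given system.

```python
def charset_args(source=None, dest=None):
    """Return catdoc options selecting the source (-s) and destination (-d) charsets."""
    args = []
    if source is not None:
        # -s only matters when the document does not contain UTF-16 text.
        args += ["-s", source]
    if dest is not None:
        # Without -d, catdoc falls back to its default destination charset.
        args += ["-d", dest]
    return args


# Hypothetical invocation: the charset names and file name are placeholders.
argv = ["catdoc", *charset_args(source="cp1252", dest="utf-8"), "report.doc"]
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run() unchanged, which avoids shell quoting issues with unusual file names.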
If no file names are supplied, catdoc processes its standard input unless it is a terminal. It is unlikely that somebody would type a Word document from the keyboard, so if catdoc is invoked without arguments and stdin is not redirected, it prints a brief usage message and exits. Processing of standard input (even among other files) can be forced by using a dash ‘-’ as a file name.
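The standard-input behavior can be exercised from a script. The following is a minimal Python sketch, assuming catdoc is installed and on PATH; the helper names stdin_argv and convert_bytes are hypothetical and not part of catdoc.

```python
import shutil
import subprocess


def stdin_argv(other_files=()):
    """Build a catdoc argv in which '-' forces reading of standard input."""
    return ["catdoc", "-", *other_files]


def convert_bytes(data):
    """Pipe an in-memory Word document through catdoc and return its raw output."""
    if shutil.which("catdoc") is None:
        raise FileNotFoundError("catdoc was not found on PATH")
    # stdin is a pipe here, not a terminal, so catdoc reads the document from it.
    done = subprocess.run(stdin_argv(), input=data,
                          stdout=subprocess.PIPE, check=True)
    return done.stdout
```

The output is returned as bytes because its encoding depends on the destination charset in effect.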
By default, catdoc wraps lines that are more than 72 characters long and separates paragraphs with blank lines. This behavior can be turned off with the -w switch. In wide mode, catdoc prints each paragraph as one long line, suitable for import into word processors that perform word wrapping themselves.
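The choice between the default margin, a custom margin (the -m option listed under OPTIONS), and wide mode can be sketched as a small Python helper. The name wrap_args and the file name report.doc are assumptions made for the example.

```python
def wrap_args(margin=None, wide=False):
    """Return the catdoc options that control line wrapping."""
    if wide:
        return ["-w"]              # each paragraph becomes one long line
    if margin is None:
        return []                  # keep the default right margin of 72
    if margin < 0:
        raise ValueError("margin must not be negative")
    return ["-m", str(margin)]     # -m 0 is equivalent to -w


# Hypothetical invocation for import into a word processor.
argv = ["catdoc", *wrap_args(wide=True), "report.doc"]
```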
OPTIONS
-a – shortcut for -f ascii. Produces ASCII text as output. Separates table columns with TAB.
-b – process broken MS-Word file. Normally, catdoc checks whether the first 8 bytes of the file are the Microsoft OLE signature. If so, it processes the file; otherwise it just copies it to standard output. This is intended to allow catdoc to be used as a filter for viewing all files with a .doc extension.
-d charset – specifies the destination charset name. The charset file has the format described in CHARACTER SETS below and should have a .txt extension and reside in the catdoc library directory (/usr/lib64/catdoc). By default, the current locale charset is used if langinfo support is compiled in.
-f format – specifies the output format as described in CHARACTER SUBSTITUTION below. catdoc comes with two output formats, ascii and tex. You can add your own if you wish.
-l – causes catdoc to list the names of available charsets to stdout and exit successfully.
-m number – specifies the right margin for text (default 72). -m 0 is equivalent to -w.
-s charset – specifies the source charset (the one used in the Word document), if the Word document doesn’t contain UTF-16 text. When reading RTF documents, it is typically not necessary, because RTF documents contain an ansicpg specification. But it can be set wrong by Word (I’ve seen RTF documents in Russian where cp1252 was specified). In this case, this option takes precedence over the charset specified in the document. The source_charset statement in the configuration file, however, has lower priority than the charset in the document.
-t – shortcut for -f tex. Converts all printable characters that have special meaning for LaTeX(1) into appropriate control sequences. Separates table columns with &.
-u – declares that the Word document contains a UNICODE (UTF-16) representation of text (as some Word-97 documents do). If catdoc fails to correctly process a Word document with the default charset, try this option.
-8 – declares that the Word document is 8-bit, in case catdoc recognizes the file format incorrectly.
-w – disables word wrapping. By default, catdoc output is split into lines no longer than 72 characters (or the number specified by the -m option), and paragraphs are separated by a blank line. With this option, each paragraph is one long line.
-x – causes catdoc to output unknown UNICODE character as
