Catdoc (1) Linux Manual Page

NAME

catdoc – reads an MS-Word file and writes its content as plain text to standard output

SYNOPSIS

catdoc [-vlu8btawxV] [-m number] [-s charset] [-d charset] [-f output-format] file

DESCRIPTION

catdoc behaves much like cat(1), but it reads an MS-Word file and produces human-readable text on standard output. Optionally, it can use latex(1) escape sequences for characters that have special meaning in LaTeX. It also makes some effort to recognize MS-Word tables, although it never tries to write correct headers for the LaTeX tabular environment. Additional output formats, such as HTML, can be easily defined.

catdoc doesn’t attempt to extract formatting information other than tables from an MS-Word document, so the output modes differ mainly in which characters are escaped and in how characters missing from the output charset are represented. See CHARACTER SUBSTITUTION below.

catdoc uses an internal unicode(4) representation of text, so it can convert text when the charset of the source document doesn’t match the charset of the target system. See CHARACTER SETS below.
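
As an illustration of the conversion described above, the following shell sketch wraps the -s and -d options listed in the SYNOPSIS. The function name convert_doc and the charset names in the usage note are assumptions made for this example; catdoc -l shows which charsets an installation actually provides.

```shell
# convert_doc SRC DST FILE: hypothetical helper, not part of catdoc.
# Asks catdoc to read FILE assuming source charset SRC and to write
# the extracted text in destination charset DST.
convert_doc() {
    catdoc -s "$1" -d "$2" "$3"
}
```

For example, convert_doc cp1251 utf-8 letter.doc would request conversion from cp1251 to utf-8, assuming both charsets are available and letter.doc exists.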

If no file names are supplied, catdoc processes its standard input unless it is a terminal. It is unlikely that anybody would type a Word document from the keyboard, so if catdoc is invoked without arguments and stdin is not redirected, it prints a brief usage message and exits. Processing of standard input (even among other files) can be forced by using a dash ‘-’ as a file name.
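
The input-selection rule above can be sketched in shell roughly as follows. This is only an illustration of the described behavior, not catdoc's actual source; the function name choose_input and the three result words are invented for the example.

```shell
# choose_input [ARGS...]: hypothetical sketch of the rule described above.
# Prints "files" when file names (including "-") are given,
# "usage" when there are no arguments and stdin is a terminal,
# and "stdin" when there are no arguments and stdin is redirected.
choose_input() {
    if [ "$#" -gt 0 ]; then
        echo "files"
    elif [ -t 0 ]; then
        echo "usage"
    else
        echo "stdin"
    fi
}
```

The test [ -t 0 ] is the usual shell way to ask whether standard input is a terminal, which is the condition the paragraph above describes.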

By default, catdoc wraps lines that are more than 72 characters long and separates paragraphs with blank lines. This behavior can be turned off with the -w switch. In wide mode, catdoc prints each paragraph as one long line, suitable for import into word processors that perform word wrapping themselves.
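
A minimal shell sketch of choosing between wrapped and wide output, using the -w and -m options from the SYNOPSIS. The wrapper name to_text and its argument order are assumptions made for this example.

```shell
# to_text FILE [MARGIN]: hypothetical wrapper around catdoc.
# MARGIN defaults to 72, the documented default right margin.
# A MARGIN of 0 selects wide mode (one long line per paragraph).
to_text() {
    file=$1
    margin=${2:-72}
    if [ "$margin" -eq 0 ]; then
        catdoc -w "$file"
    else
        catdoc -m "$margin" "$file"
    fi
}
```

With this sketch, to_text report.doc 60 would wrap at 60 characters, and to_text report.doc 0 would produce unwrapped paragraphs for import into a word processor.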

OPTIONS

-a: shortcut for -f ascii. Produces ASCII text as output. Separates table columns with TAB.
-b: process broken MS-Word file. Normally, catdoc checks whether the first 8 bytes of the file are the Microsoft OLE signature. If so, it processes the file; otherwise it just copies it to stdout. This check is what allows catdoc to be used as a filter for viewing all files with the .doc extension.
-d charset: specifies the destination charset name. The charset file has the format described in CHARACTER SETS below, should have a .txt extension, and should reside in the catdoc library directory (/usr/lib64/catdoc). By default, the current locale charset is used if langinfo support is compiled in.
-f format: specifies the output format as described in CHARACTER SUBSTITUTION below. catdoc comes with two output formats, ascii and tex. You can add your own if you wish.
-l: causes catdoc to list the names of available charsets on stdout and exit successfully.
-m number: specifies the right margin for text (default 72). -m 0 is equivalent to -w.
-s charset: specifies the source charset (the one used in the Word document), if the Word document doesn’t contain UTF-16 text. When reading RTF documents, this is typically not necessary, because RTF documents contain an ansicpg specification. But it can be set wrong by Word (I’ve seen RTF documents in Russian where cp1252 was specified). In this case, this option takes precedence over the charset specified in the document. The source_charset statement in the configuration file, however, has lower priority than the charset in the document.
-t: shortcut for -f tex. Converts all printable characters that have special meaning for LaTeX(1) into appropriate control sequences. Separates table columns with &.
-u: declares that the Word document contains a UNICODE (UTF-16) representation of text (as some Word-97 documents do). If catdoc fails to read a Word document correctly with the default charset, try this option.
-8: declares that the Word document is 8-bit, in case catdoc recognizes the file format incorrectly.
-w: disables word wrapping. By default, catdoc output is split into lines no longer than 72 characters (or the number specified by the -m option), and paragraphs are separated by a blank line. With this option, each paragraph is one long line.
-x: causes catdoc to output unknown UNICODE character as
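
The signature check mentioned under -b can be approximated in shell as shown below. This is a sketch under stated assumptions, not catdoc's implementation: the function name is_ole is invented, and the byte values D0 CF 11 E0 A1 B1 1A E1 are the standard OLE compound file header, which the text above does not spell out.

```shell
# is_ole FILE: hypothetical helper, not part of catdoc.
# Succeeds if the first 8 bytes of FILE are the OLE compound file
# signature D0 CF 11 E0 A1 B1 1A E1, and fails otherwise.
is_ole() {
    # Dump the first 8 bytes as hex and strip spaces and newlines.
    sig=$(od -An -tx1 -N 8 "$1" 2>/dev/null | tr -d ' \n')
    [ "$sig" = "d0cf11e0a1b11ae1" ]
}
```

A viewer script could call is_ole "$f" to decide whether a .doc file is worth passing to catdoc or should simply be shown as is.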
