Converting HTML to Plain Text on Linux
You have several options for converting HTML files to plain text, each with different strengths depending on your use case.
html2text
The most straightforward tool is html2text, available on most distributions:
html2text input.html
The converted text prints to stdout. Redirect to a file if needed:
html2text input.html > output.txt
Install it via your package manager:
# Debian/Ubuntu
sudo apt install html2text
# Fedora/RHEL
sudo dnf install html2text
# Arch
sudo pacman -S html2text
Useful options include the following. Two different programs ship under the name html2text (a C++ tool and a Python tool), and their option names differ, so check html2text --help on your system before relying on any of these:
- -o output.txt: Write directly to a file instead of using redirection
- --body-width 100: Set text width (useful for pages that render strangely with the default width)
- --ignore-links: Remove hyperlinks from output
- --ignore-images: Remove image references
- --unicode-snob: Use Unicode characters instead of ASCII alternatives
- --bypass-tables: Leave tables as raw HTML instead of converting them to text
Example with multiple options:
html2text --body-width 100 --ignore-links input.html > output.txt
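When a whole directory needs the same treatment, the command above can be wrapped in a loop. The following is a minimal sketch, not part of html2text itself: batch_convert is a hypothetical helper name, and it assumes the converter takes the input file as its last argument and writes to stdout.

```shell
# Hypothetical helper: convert every .html file in a directory,
# writing NAME.txt next to each NAME.html.
# Usage: batch_convert DIR [CONVERTER [ARGS...]]
batch_convert() {
  dir=$1
  shift
  # Default to plain html2text when no converter command is given.
  [ "$#" -gt 0 ] || set -- html2text
  for src in "$dir"/*.html; do
    [ -e "$src" ] || continue   # the glob matched nothing
    "$@" "$src" > "${src%.html}.txt"
  done
}
```

For example, batch_convert ./pages html2text --body-width 100 --ignore-links applies the options shown above to every file in ./pages (a placeholder directory name).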
lynx
lynx is a text-based web browser that can dump HTML to plain text:
lynx -dump -nolist input.html > output.txt
Key flags:
- -dump: Output formatted text and exit
- -nolist: Omit the list of links at the end of the dump
- -display_charset=utf-8: Handle Unicode properly
- -assume_charset=utf-8: Assume input is UTF-8
Install with:
# Debian/Ubuntu
sudo apt install lynx
# Fedora/RHEL
sudo dnf install lynx
# Arch
sudo pacman -S lynx
w3m
Another text browser option:
w3m -dump input.html > output.txt
Install with:
# Debian/Ubuntu
sudo apt install w3m
# Fedora/RHEL
sudo dnf install w3m
# Arch
sudo pacman -S w3m
pandoc
For more sophisticated conversions (especially when converting between multiple formats), use pandoc:
pandoc input.html -t plain -o output.txt
Pandoc is particularly useful if you need to convert to other formats later (Markdown, reStructuredText, etc.), and it handles complex HTML better than the simpler tools. Install with:
sudo apt install pandoc # Debian/Ubuntu
sudo dnf install pandoc # Fedora/RHEL
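To illustrate the point about other output formats, here is a small sketch that writes plain-text, Markdown, and reStructuredText versions of one HTML file. The function name html_to_formats and the output file naming are assumptions made for this example; the format names plain, markdown, and rst are pandoc's own.

```shell
# Hypothetical helper: write NAME.txt, NAME.md and NAME.rst next to NAME.html.
html_to_formats() {
  src=$1
  base=${src%.html}
  if ! command -v pandoc >/dev/null 2>&1; then
    echo "pandoc is not installed" >&2
    return 1
  fi
  # Stop at the first conversion that fails.
  pandoc -f html -t plain    -o "$base.txt" "$src" &&
  pandoc -f html -t markdown -o "$base.md"  "$src" &&
  pandoc -f html -t rst      -o "$base.rst" "$src"
}
```

Call it as html_to_formats input.html. Passing -f html explicitly means the input is read as HTML even if the file extension is unusual.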
Command-line processing with sed/awk
For quick one-off conversions or integrating into scripts, strip HTML tags with basic text processing:
sed 's/<[^>]*>//g' input.html > output.txt
This removes all HTML tags but doesn’t handle entities (&nbsp;, &lt;, etc.). For better handling:
sed 's/<[^>]*>//g; s/&nbsp;/ /g; s/&lt;/</g; s/&gt;/>/g; s/&amp;/\&/g' input.html
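For use inside scripts, the sed approach can be wrapped in a function that reads from stdin. This is only a sketch: strip_html is a hypothetical name, and the entity list covers a few common cases rather than the full HTML set.

```shell
# Hypothetical helper: strip tags and decode a handful of common entities.
# Reads HTML on stdin, writes text on stdout.
strip_html() {
  # Tags are removed first. &amp; is decoded last so that "&amp;lt;"
  # ends up as the literal text "&lt;" rather than "<".
  sed -e 's/<[^>]*>//g' \
      -e 's/&nbsp;/ /g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&quot;/"/g' \
      -e 's/&amp;/\&/g'
}
```

Usage: strip_html < input.html > output.txt. Because sed works one line at a time, a tag that is split across two lines is not removed, and the contents of script and style elements are kept as text.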
Choosing a tool
- html2text: Best for general use, handles markup reasonably well
- lynx/w3m: Good for web pages where you want browser-like rendering of the layout; slower than html2text
- pandoc: Best for complex documents or when you need output flexibility
- sed/awk: For scripting or when you just need tags stripped quickly
For most cases, html2text is the right choice. It’s fast, handles most HTML correctly, and doesn’t require interpreting JavaScript or rendering complex layouts.
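If a script has to run on machines where you don't know which tool is installed, one option is to try them in order. The sketch below is an illustration only: html_to_text is a hypothetical name, html2text comes first because it is the default recommendation here, and the order of the remaining tools is an arbitrary assumption.

```shell
# Hypothetical wrapper: convert one HTML file to text on stdout using
# the first converter found on PATH.
html_to_text() {
  src=$1
  if command -v html2text >/dev/null 2>&1; then
    html2text "$src"
  elif command -v lynx >/dev/null 2>&1; then
    lynx -dump -nolist "$src"
  elif command -v w3m >/dev/null 2>&1; then
    w3m -dump "$src"
  elif command -v pandoc >/dev/null 2>&1; then
    pandoc "$src" -t plain
  else
    # Last resort: strip tags only; entities are left undecoded.
    sed 's/<[^>]*>//g' "$src"
  fi
}
```

Usage: html_to_text input.html > output.txt. The output formatting differs from tool to tool, so this suits cases where any readable text is acceptable. The input file should keep its .html extension, since the text browsers use it to recognize the content type.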