Converts DOCX files to Markdown files, keeping images, tables, headings, and sections intact.
DOCX to Markdown Converter
This Python script converts DOCX files to Markdown format, preserving formatting such as headings, bold, italic, underline, strikethrough, and highlight. It also extracts images from the DOCX file and saves them in an images directory.
Features
- Converts DOCX to Markdown format
- Preserves text formatting (headings, bold, italic, underline, strikethrough, highlight)
- Extracts images and saves them in an `images` directory
- Processes tables and converts them to Markdown format
- Command-line interface for specifying input and output files
Requirements
- Python 3.x
- python-docx library
Install the required dependencies with:
pip install python-docx
Usage
python docx_to_md.py <input.docx> [output_directory]
Examples
# Convert a DOCX file to Markdown (output to current directory)
python docx_to_md.py document.docx
# Convert a DOCX file to Markdown with a specific output directory
python docx_to_md.py document.docx /path/to/output/directory
# If not specified, the output directory defaults to the current directory
python docx_to_md.py document.docx
The output Markdown file will have the same name as the input DOCX file, but with a .md extension.
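As a rough illustration of the naming rule above, the output path could be derived like this. This is a minimal sketch, not the script's actual code: `resolve_output_path` is a hypothetical helper name, and the default of `"."` mirrors the documented fallback to the current directory.

```python
from pathlib import Path


def resolve_output_path(input_docx: str, output_dir: str = ".") -> Path:
    """Return the Markdown path for a DOCX input (hypothetical helper)."""
    source = Path(input_docx)
    # Keep the input's base name and swap the extension for .md.
    return Path(output_dir) / source.with_suffix(".md").name


# Output lands next to wherever the command is run when no directory is given.
print(resolve_output_path("document.docx"))
# Only the file name is reused; the input's own folder is ignored.
print(resolve_output_path("reports/document.docx", "/path/to/output/directory"))
```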
How It Works
- The script reads the DOCX file using the `python-docx` library
- It extracts all images from the document and saves them in an `images` subdirectory
- It processes paragraphs, preserving formatting:
  - Headings are converted to Markdown headings (#, ##, ###, etc.)
  - Bold text is wrapped in `**`
  - Italic text is wrapped in `*`
  - Underlined text is wrapped in `*` (the same marker as italic)
  - Strikethrough text is wrapped in `~~`
  - Highlighted text is wrapped in `**` (the same marker as bold)
- Tables are converted to Markdown table format
- The output is written to the specified Markdown file
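The formatting and table steps above can be sketched with plain data structures. This is an illustrative approximation under stated assumptions, not the code in `docx_to_md.py`: the `Run` dataclass is a hypothetical stand-in for a formatted text span, the helper names are invented for the example, and the first table row is assumed to be the header because Markdown tables require one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """Hypothetical stand-in for one formatted span of paragraph text."""
    text: str
    bold: bool = False
    italic: bool = False
    underline: bool = False
    strike: bool = False
    highlight: bool = False


def format_run(run: Run) -> str:
    """Wrap the run's text in the markers listed above."""
    text = run.text
    if not text.strip():
        return text  # leave whitespace-only runs unwrapped
    if run.strike:
        text = f"~~{text}~~"
    if run.italic or run.underline:
        text = f"*{text}*"  # underline shares the italic marker
    if run.bold or run.highlight:
        text = f"**{text}**"  # highlight shares the bold marker
    return text


def heading_line(text: str, level: int) -> str:
    """Render a heading; Markdown supports levels 1 through 6."""
    level = max(1, min(level, 6))
    return f"{'#' * level} {text}"


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows of cell text as a Markdown table (first row = header)."""
    if not rows:
        return ""

    def line(cells: List[str]) -> str:
        # Escape pipes and flatten newlines so each row stays on one line.
        cleaned = [c.replace("|", "\\|").replace("\n", " ") for c in cells]
        return "| " + " | ".join(cleaned) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator] + [line(r) for r in body])


print(heading_line("Introduction", 2))
print(format_run(Run("important", bold=True)) + " and " + format_run(Run("removed", strike=True)))
print(table_to_markdown([["Name", "Value"], ["alpha", "1"]]))
```

Because underline and highlight reuse the italic and bold markers, that formatting cannot be told apart from italic or bold in the resulting Markdown.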
Output Structure
The script creates the following structure:
output.md        # The main Markdown file
images/          # Directory containing extracted images
    image_1.png
    image_2.png
    ...
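One way the `images/` directory shown above could be populated is sketched below. The function names are hypothetical, and the fixed `.png` extension is an assumption taken from the example tree (a real document may embed JPEG or other formats). `iter_docx_image_blobs` relies on python-docx's package relationships and is only one possible way to reach the embedded image bytes; the script itself may do this differently.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def save_images(blobs: Iterable[bytes], output_dir: str) -> List[Path]:
    """Write each blob to <output_dir>/images/image_<n>.png (assumed naming)."""
    images_dir = Path(output_dir) / "images"
    images_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, blob in enumerate(blobs, start=1):
        target = images_dir / f"image_{index}.png"
        target.write_bytes(blob)
        saved.append(target)
    return saved


def iter_docx_image_blobs(docx_path: str) -> Iterator[bytes]:
    """Yield the raw bytes of each embedded image (requires python-docx)."""
    from docx import Document  # lazy import: save_images works without it

    document = Document(docx_path)
    for rel in document.part.rels.values():
        # Skip linked (external) images; only embedded parts carry bytes.
        if not rel.is_external and "image" in rel.reltype:
            yield rel.target_part.blob


# Example wiring, assuming document.docx exists in the current directory:
# save_images(iter_docx_image_blobs("document.docx"), ".")
```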
License
This project is licensed under the MIT License.