MarkItDown - Microsoft's Python tool for turning documents into Markdown
When you want an LLM to read a document, handing it a raw PDF or Word file is awkward. If you can extract text and convert it to Markdown while preserving structure, processing becomes much easier. Microsoft’s MarkItDown is built for exactly that, and its GitHub star count has already pushed past 80,000.
What MarkItDown is
MarkItDown is a Python utility that converts many file formats into Markdown. It is developed by the team behind AutoGen, Microsoft’s multi-agent framework.
There are similar tools like textract, but MarkItDown focuses on preserving document structure as Markdown, including headings, lists, tables, and links.
Supported formats
The range is broad.
| Format | Notes |
|---|---|
| PDF | Uses pdfminer |
| Word (.docx) | Uses python-docx |
| PowerPoint (.pptx) | Uses python-pptx |
| Excel (.xlsx, .xls) | Uses openpyxl |
| HTML | Uses BeautifulSoup |
| Images | EXIF extraction, optional LLM-based OCR |
| Audio | EXIF extraction, optional LLM-based transcription |
| CSV, JSON, XML | Treated as plain text |
| ZIP | Processes internal files recursively |
| EPUB | E-books |
| Jupyter Notebook | .ipynb |
| Outlook MSG | |
| YouTube URL | Subtitle extraction |
| RSS, Wikipedia, Bing SERP | Web-related sources |
Installation
Python 3.10 or later is required.
# Full support for all formats (recommended)
pip install 'markitdown[all]'
# Only selected formats
pip install 'markitdown[pdf,docx,pptx]'
It also runs fine on Apple Silicon Macs (M1/M2). The library is almost entirely pure Python, with very few native binary dependencies.
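Before installing into an existing environment, it can help to check the two prerequisites mentioned above. The following is a minimal sketch, not part of MarkItDown: the `preflight` helper is a hypothetical name, and it uses only the standard library to check the Python version and whether the `markitdown` package can be imported.

```python
import importlib.util
import sys


def preflight(version_info=sys.version_info, package="markitdown"):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration, not a MarkItDown API.
    """
    problems = []
    # The article states that Python 3.10 or later is required.
    if tuple(version_info[:2]) < (3, 10):
        problems.append(
            f"Python 3.10+ required, found {version_info[0]}.{version_info[1]}"
        )
    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec(package) is None:
        problems.append(
            f"{package} is not installed; try: pip install '{package}[all]'"
        )
    return problems
```

Calling `preflight()` with no arguments checks the running interpreter. Printing the returned list gives a quick readiness report.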
Basic usage
CLI
# Convert a file
markitdown document.pdf > output.md
# Write to a file directly
markitdown document.pdf -o output.md
# Pipe input
cat document.pdf | markitdown
Python API
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.xlsx")
print(result.text_content)
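The same two calls (`convert` and `text_content`) are enough to process a whole folder. Here is a rough sketch under stated assumptions: `convert_tree`, its parameters, and the default suffix list are illustrative names, not part of MarkItDown. The converter is injectable so the helper can be tried with a stub before pointing it at real documents.

```python
from pathlib import Path


def convert_tree(src_dir, out_dir, convert=None,
                 suffixes=(".pdf", ".docx", ".pptx", ".xlsx")):
    """Convert matching files under src_dir into .md files under out_dir.

    Hypothetical helper; `convert` maps a Path to a Markdown string.
    """
    if convert is None:
        # Imported lazily so the helper also works with a stub converter.
        from markitdown import MarkItDown
        md = MarkItDown()

        def convert(path):
            return md.convert(str(path)).text_content

    src_dir, out_dir = Path(src_dir), Path(out_dir)
    written = []
    for path in sorted(src_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Mirror the source layout and keep the original suffix in the name,
        # so report.pdf and report.docx do not overwrite each other.
        target = out_dir / path.relative_to(src_dir)
        target = target.with_name(target.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(path), encoding="utf-8")
        written.append(target)
    return written
```

For example, `convert_tree("docs", "docs_md")` would write `docs_md/report.pdf.md` for `docs/report.pdf`, assuming MarkItDown is installed with the matching extras.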
LLM-powered image descriptions
If you pass an OpenAI-compatible client, MarkItDown can add descriptions for images and PowerPoint slides.
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
)
result = md.convert("presentation.pptx")
print(result.text_content)
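Since the LLM client is optional, one possible pattern is to enable it only when an API key is present, and otherwise fall back to the plain `MarkItDown()` shown earlier. The sketch below is an assumption about how you might wire this up; `llm_options` is a hypothetical helper, and only the `llm_client` and `llm_model` arguments come from the example above.

```python
import os


def llm_options(env=os.environ, model="gpt-4o"):
    """Return keyword arguments for MarkItDown, or {} when no key is set.

    Hypothetical helper for illustration, not a MarkItDown API.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # No key: the caller gets a converter without image descriptions.
        return {}
    # Imported only when needed, so openai stays an optional dependency.
    from openai import OpenAI
    return {"llm_client": OpenAI(api_key=api_key), "llm_model": model}
```

It would be used as `md = MarkItDown(**llm_options())`, so the same script runs with or without credentials.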
MCP server support
Using the markitdown-mcp package, Claude Desktop and Claude Code can call document conversion over MCP.
pip install markitdown-mcp
Example .mcp.json configuration:
{
  "mcpServers": {
    "markitdown": {
      "type": "stdio",
      "command": "markitdown-mcp"
    }
  }
}
That means you can simply ask Claude to “read this PDF”, and the Markdown conversion happens under the hood.
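If a project already has a `.mcp.json` with other servers, the entry can be merged in rather than pasted over the file. This is a small sketch using only the standard library; `add_markitdown_server` is a hypothetical helper name, and the entry it writes is the one from the configuration example above.

```python
import json
from pathlib import Path


def add_markitdown_server(config_path=".mcp.json"):
    """Add the markitdown entry to an MCP config, keeping other servers.

    Hypothetical helper for illustration only.
    """
    path = Path(config_path)
    # Start from the existing config if there is one, else an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["markitdown"] = {"type": "stdio", "command": "markitdown-mcp"}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Running it twice leaves the file unchanged after the first run, because the entry is keyed by name.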
Related: when to use OCR instead
MarkItDown is best at extracting text from structured documents. For scanned PDFs or image OCR, it depends on an external service such as the OpenAI API.
If you want OCR that stays local, you need a different approach:
- PaddleOCR-VL - strong at structure recognition with a VLM
- NDLOCR - high-accuracy Japanese OCR
- VLM-based OCR - hybrid approaches such as DeepSeek-OCR
MarkItDown can also serve as a post-processing step after those OCR pipelines.
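One way to decide which route a PDF should take is to run MarkItDown first and check how much text comes back. The heuristic below is an assumption, not something MarkItDown provides: `looks_scanned` and its threshold of 50 characters per page are illustrative and would need tuning on real documents.

```python
def looks_scanned(text, page_count=1, min_chars_per_page=50):
    """Guess whether extracted text is too sparse to be a born-digital file.

    Hypothetical heuristic: counts non-whitespace characters and compares
    them against an arbitrary per-page minimum.
    """
    visible = "".join(text.split())  # drop all whitespace
    return len(visible) < min_chars_per_page * max(page_count, 1)
```

A pipeline could call this on `result.text_content` and send only the files that look scanned to a local OCR tool, keeping the fast path for structured documents.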
Limitations
- Not for layout fidelity: it is meant for LLM and text analysis, not pixel-perfect document reproduction
- Image OCR and speech transcription: require external APIs such as OpenAI
- Breaking changes in v0.1.0: dependencies were reorganized into optional feature groups
Repository info
- GitHub: https://github.com/microsoft/markitdown
- License: MIT
- Latest stable release: v0.1.4 (December 2025)
- Stars: 86,000+