MarkItDown - Microsoft's Python tool for turning documents into Markdown

When you want an LLM to read a document, handing it a raw PDF or Word file is awkward. If you can extract text and convert it to Markdown while preserving structure, processing becomes much easier. Microsoft’s MarkItDown is built for exactly that, and its GitHub star count has already pushed past 80,000.

What MarkItDown is

MarkItDown is a Python utility that converts many file formats into Markdown. It is developed by Microsoft's AutoGen team, the group behind the AutoGen multi-agent framework.

Similar tools such as textract exist, but MarkItDown focuses on carrying document structure over into Markdown: headings, lists, tables, and links.

Supported formats

The range is broad.

Format	Notes
PDF	Uses `pdfminer.six`
Word (`.docx`)	Uses `mammoth`
PowerPoint (`.pptx`)	Uses `python-pptx`
Excel (`.xlsx`, `.xls`)	Uses `pandas` with `openpyxl` (`.xlsx`) and `xlrd` (`.xls`)
HTML	Uses BeautifulSoup
Images	EXIF extraction, optional LLM-based OCR
Audio	Metadata extraction, optional speech transcription
CSV, JSON, XML	Treated as plain text
ZIP	Processes internal files recursively
EPUB	E-books
Jupyter Notebook	`.ipynb`
Outlook MSG	Email
YouTube URL	Subtitle extraction
RSS, Wikipedia, Bing SERP	Web-related sources

Installation

Python 3.10 or later is required.

# Full support for all formats (recommended)
pip install 'markitdown[all]'

# Only selected formats
pip install 'markitdown[pdf,docx,pptx]'

It also runs fine on Apple Silicon (M1/M2) Macs. The library itself is pure Python, and very few of its dependencies need native binaries.

Basic usage

CLI

# Convert a file
markitdown document.pdf > output.md

# Write to a file directly
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown

Python API

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.xlsx")
print(result.text_content)
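
For more than one file, the same call can be wrapped in a small loop. The following is a minimal sketch, not part of MarkItDown itself: `convert_tree`, `markitdown_to_text`, and the default suffix list are hypothetical names chosen for illustration. The converter is passed in as a plain function so the directory walk can be tried out without MarkItDown installed.

```python
from pathlib import Path
from typing import Callable, Iterable

# Hypothetical default: a small subset of formats, chosen only for illustration.
DEFAULT_SUFFIXES = (".pdf", ".docx", ".pptx", ".xlsx")


def convert_tree(
    src_dir: Path,
    out_dir: Path,
    convert: Callable[[str], str],
    suffixes: Iterable[str] = DEFAULT_SUFFIXES,
) -> list[Path]:
    """Convert every matching file under src_dir and mirror it as .md under out_dir."""
    wanted = {s.lower() for s in suffixes}
    written = []
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in wanted:
            continue
        # Keep the relative layout; append ".md" to the full name so that
        # report.pdf and report.docx do not overwrite each other.
        dest = out_dir / src.relative_to(src_dir)
        dest = dest.with_name(dest.name + ".md")
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(convert(str(src)), encoding="utf-8")
        written.append(dest)
    return written


def markitdown_to_text(path: str) -> str:
    """Adapter around the convert() call shown above (imported lazily)."""
    from markitdown import MarkItDown

    # A new instance per file keeps the sketch short; reuse one in real code.
    return MarkItDown().convert(path).text_content
```

Calling `convert_tree(Path("docs"), Path("markdown"), markitdown_to_text)` would then write one Markdown file per source document, with `docs` and `markdown` standing in for your own directories.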

LLM-powered image descriptions

If you pass an OpenAI-compatible client, MarkItDown can add descriptions for images and PowerPoint slides.

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)
result = md.convert("presentation.pptx")
print(result.text_content)
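
If the same script has to run both with and without an API key, one option is to build the constructor arguments conditionally. This is only a sketch under assumptions: `markitdown_kwargs` and its `client_factory` parameter are hypothetical helpers, and it relies on the OpenAI client reading its key from the `OPENAI_API_KEY` environment variable.

```python
import os
from typing import Any, Callable, Mapping, Optional


def markitdown_kwargs(
    env: Mapping[str, str] = os.environ,
    model: str = "gpt-4o",
    client_factory: Optional[Callable[[], Any]] = None,
) -> dict:
    """Return keyword arguments for MarkItDown(); empty when no API key is set."""
    if not env.get("OPENAI_API_KEY"):
        # No key: fall back to plain conversion without image descriptions.
        return {}
    if client_factory is None:
        from openai import OpenAI  # imported lazily, only when actually needed

        client_factory = OpenAI
    return {"llm_client": client_factory(), "llm_model": model}
```

With this in place, `MarkItDown(**markitdown_kwargs())` behaves like the plain `MarkItDown()` call when no key is configured and like the LLM-enabled version otherwise.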

MCP server support

With the `markitdown-mcp` package, MCP clients such as Claude Desktop and Claude Code can call document conversion over the Model Context Protocol (MCP).

pip install markitdown-mcp

Example .mcp.json configuration:

{
  "mcpServers": {
    "markitdown": {
      "type": "stdio",
      "command": "markitdown-mcp"
    }
  }
}

That means you can simply ask Claude to “read this PDF”, and the Markdown conversion happens under the hood.
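
If a project already has an `.mcp.json` with other servers, pasting the snippet by hand risks overwriting them. A minimal sketch of a merge using only the standard library follows; `add_markitdown_server` is a hypothetical helper, and the entry it writes is the same one shown in the configuration above.

```python
import json
from pathlib import Path


def add_markitdown_server(config_path: Path) -> dict:
    """Add the markitdown entry to an .mcp.json file, keeping other servers intact."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    # Create the "mcpServers" object if the file did not have one yet.
    servers = config.setdefault("mcpServers", {})
    servers["markitdown"] = {"type": "stdio", "command": "markitdown-mcp"}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Running `add_markitdown_server(Path(".mcp.json"))` from the project root creates the file if it is missing and leaves any existing server entries untouched.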

Relationship to OCR

MarkItDown is best at extracting text from structured documents. For scanned PDFs or image OCR, it depends on an external service such as the OpenAI API.

If you want OCR that stays local, you need a different approach:

PaddleOCR-VL - strong at structure recognition with a VLM
NDLOCR - high-accuracy Japanese OCR
VLM-based OCR - hybrid approaches such as DeepSeek-OCR

MarkItDown can also serve as a post-processing step after those OCR pipelines.
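
One way to decide when a document should go to an OCR pipeline instead is to look at how much text the plain conversion produced. The sketch below is a rough heuristic, not something MarkItDown provides: `looks_scanned` is a hypothetical helper, the threshold of 50 characters per page is an arbitrary assumption, and the page count has to come from the caller.

```python
def looks_scanned(text: str, page_count: int, min_chars_per_page: int = 50) -> bool:
    """Heuristic: very little extracted text per page suggests an image-only document."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only visible characters so blank lines and padding do not mask an empty result.
    visible = sum(1 for ch in text if not ch.isspace())
    return visible / page_count < min_chars_per_page


def choose_pipeline(text: str, page_count: int) -> str:
    """Return "ocr" when the extracted text looks too thin, otherwise "markitdown"."""
    return "ocr" if looks_scanned(text, page_count) else "markitdown"
```

A scanned PDF typically yields almost no text, so `choose_pipeline` would route it to OCR, while a born-digital document stays on the MarkItDown path. The threshold needs tuning for documents that are mostly figures.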

Limitations

Not for layout fidelity: it is meant for LLM and text analysis, not pixel-perfect document reproduction
Image OCR and speech transcription: require external APIs such as OpenAI
Breaking changes in v0.1.0: dependencies were reorganized into optional feature groups, so a plain `pip install markitdown` no longer installs every converter

Repository info

GitHub: https://github.com/microsoft/markitdown
License: MIT
Latest stable release: v0.1.4 (December 2025)
Stars: 86,000+