
MarkItDown - Microsoft's Python tool for turning documents into Markdown

Ikesan

When you want an LLM to read a document, handing it a raw PDF or Word file is awkward. If you can extract text and convert it to Markdown while preserving structure, processing becomes much easier. Microsoft’s MarkItDown is built for exactly that, and its GitHub star count has already pushed past 80,000.

What MarkItDown is

MarkItDown is a Python utility that converts many file formats into Markdown. It is developed by the team behind AutoGen, Microsoft’s multi-agent framework.

There are similar tools like textract, but MarkItDown focuses on preserving document structure as Markdown, including headings, lists, tables, and links.

Supported formats

The range is broad.

| Format | Notes |
| --- | --- |
| PDF | Uses pdfminer |
| Word (.docx) | Uses python-docx |
| PowerPoint (.pptx) | Uses python-pptx |
| Excel (.xlsx, .xls) | Uses openpyxl |
| HTML | Uses BeautifulSoup |
| Images | EXIF extraction, optional LLM-based OCR |
| Audio | EXIF extraction, optional LLM-based transcription |
| CSV, JSON, XML | Treated as plain text |
| ZIP | Processes internal files recursively |
| EPUB | E-books |
| Jupyter Notebook | .ipynb |
| Outlook MSG | Email |
| YouTube URL | Subtitle extraction |
| RSS, Wikipedia, Bing SERP | Web-related sources |

Installation

Python 3.10 or later is required.

# Full support for all formats (recommended)
pip install 'markitdown[all]'

# Only selected formats
pip install 'markitdown[pdf,docx,pptx]'

It is almost entirely Python, with very few native binary dependencies, so it also works fine on Apple Silicon (M1/M2) Macs.
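If you prefer the selective install, it helps to work out which feature groups a folder of files actually needs. The sketch below builds the pip command from a list of file names. The extension-to-group mapping is an assumption based on the extras named in the project README at the time of writing, and `pip_command` is a hypothetical helper, not part of MarkItDown.

```python
from pathlib import Path

# Assumed mapping from file extension to a markitdown optional-dependency
# group. Check these names against the version you actually install.
EXTRA_FOR_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".msg": "outlook",
}


def pip_command(paths):
    """Return a pip command that installs only the extras these files need."""
    suffixes = {Path(p).suffix.lower() for p in paths}
    extras = sorted(EXTRA_FOR_SUFFIX[s] for s in suffixes if s in EXTRA_FOR_SUFFIX)
    if not extras:
        # Formats such as HTML or CSV are assumed to need no extra group.
        return "pip install markitdown"
    return f"pip install 'markitdown[{','.join(extras)}]'"


print(pip_command(["report.pdf", "notes.DOCX", "slides.pptx", "data.csv"]))
```

When in doubt, `markitdown[all]` remains the simpler choice.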

Basic usage

CLI

# Convert a file
markitdown document.pdf > output.md

# Write to a file directly
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown

Python API

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.xlsx")
print(result.text_content)
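In practice you often want to convert a whole folder rather than one file. Here is a minimal sketch of that, assuming the `convert()` call and `text_content` attribute shown above. `convert_tree` and `markitdown_converter` are hypothetical helper names, and the suffix list is only an example.

```python
from pathlib import Path
from typing import Callable


def convert_tree(src_dir: Path, out_dir: Path,
                 convert: Callable[[Path], str],
                 suffixes=(".pdf", ".docx", ".pptx", ".xlsx")) -> list[Path]:
    """Convert every matching file under src_dir and write .md files to out_dir."""
    written = []
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in suffixes:
            continue
        # Mirror the source layout; keep the original name and append .md
        # so report.pdf and report.docx do not overwrite each other.
        dst = out_dir / src.relative_to(src_dir).parent / (src.name + ".md")
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_text(convert(src), encoding="utf-8")
        written.append(dst)
    return written


def markitdown_converter() -> Callable[[Path], str]:
    """Return a converter backed by one shared MarkItDown instance."""
    # Imported lazily so convert_tree can be used with any converter function.
    from markitdown import MarkItDown
    md = MarkItDown()
    return lambda path: md.convert(str(path)).text_content
```

A call would look like `convert_tree(Path("docs"), Path("docs_md"), markitdown_converter())`, where both directory names are placeholders. Passing the converter in as a function keeps the file-walking logic easy to test without real documents.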

LLM-powered image descriptions

If you pass an OpenAI-compatible client, MarkItDown can add descriptions for images and PowerPoint slides.

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)
result = md.convert("presentation.pptx")
print(result.text_content)
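Because the LLM step is opt-in and involves calls to an external API, a script may want to fall back to plain conversion when no client is configured. The following is a small sketch of that idea, assuming only the `llm_client` and `llm_model` parameters shown above. `markitdown_kwargs` and `make_converter` are hypothetical helper names.

```python
def markitdown_kwargs(client=None, model="gpt-4o"):
    """Build keyword arguments for MarkItDown(), enabling the LLM only if given."""
    if client is None:
        # No client: plain conversion, no image descriptions, no API calls.
        return {}
    return {"llm_client": client, "llm_model": model}


def make_converter(client=None, model="gpt-4o"):
    """Create a MarkItDown instance with or without LLM support."""
    # Imported lazily so the helper above can be used and tested on its own.
    from markitdown import MarkItDown
    return MarkItDown(**markitdown_kwargs(client, model))
```

With this, `make_converter()` gives the plain converter and `make_converter(OpenAI())` gives the LLM-assisted one, so the rest of the script does not need to care which mode is active.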

MCP server support

Using the markitdown-mcp package, Claude Desktop and Claude Code can call document conversion over MCP.

pip install markitdown-mcp

Example .mcp.json configuration:

{
  "mcpServers": {
    "markitdown": {
      "type": "stdio",
      "command": "markitdown-mcp"
    }
  }
}

That means you can simply ask Claude to “read this PDF”, and the Markdown conversion happens under the hood.
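If a project already has a .mcp.json with other servers in it, editing the file by hand is easy to get wrong. A rough sketch of merging the entry above into an existing config follows. It uses only the standard library, and `add_markitdown_server` is a hypothetical helper name.

```python
import json
from pathlib import Path


def add_markitdown_server(config_path: Path) -> dict:
    """Add the markitdown entry to an MCP config, keeping other servers."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    # Create the mcpServers section if missing, then set only our entry.
    servers = config.setdefault("mcpServers", {})
    servers["markitdown"] = {"type": "stdio", "command": "markitdown-mcp"}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `add_markitdown_server(Path(".mcp.json"))` creates the file if it does not exist and leaves any other server entries untouched.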

MarkItDown is best at extracting text from structured documents. For scanned PDFs or image OCR, it depends on an external service such as the OpenAI API.

If you want OCR that stays local, you need a different approach.

MarkItDown can also serve as a post-processing step after a local OCR pipeline.

Limitations

  • Not for layout fidelity: it is meant for LLM and text analysis, not pixel-perfect document reproduction
  • Image OCR and speech transcription: require external APIs such as OpenAI
  • Breaking changes in v0.1.0: dependencies were reorganized into optional feature groups, so installing markitdown[all] is the way to keep the earlier install-everything behavior
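If you maintain scripts that must work across the v0.1.0 boundary, one option is to check the installed version first. This is a rough sketch using the standard library's `importlib.metadata`. The version parsing is deliberately simple, and `uses_feature_groups` and `installed_markitdown_version` are hypothetical helper names.

```python
import re
from importlib import metadata


def uses_feature_groups(version: str) -> bool:
    """Rough check: True if this version string is 0.1.0 or newer."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    numbers += [0] * (3 - len(numbers))  # pad a short version like "1.0"
    return tuple(numbers) >= (0, 1, 0)


def installed_markitdown_version():
    """Return the installed markitdown version, or None if it is absent."""
    try:
        return metadata.version("markitdown")
    except metadata.PackageNotFoundError:
        return None
```

A script could then warn when `installed_markitdown_version()` returns a version for which `uses_feature_groups` is true but a required converter is missing.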

Repository info