Convert HTML to Markdown: How to Strip the Tags and Keep What Matters

HTML to Markdown is one of the most common format conversions in modern content work, and most people doing it are doing it for the same handful of reasons. They are migrating a website from one CMS to another. They are extracting an article to feed into an AI tool. They are archiving content that they want stored as plain text rather than as a rendered webpage. They are building documentation pipelines that consume Markdown but receive HTML from upstream tools.

The conversion sounds simple. Strip the tags, keep the structure. The reality is messier. HTML has constructs that Markdown cannot represent directly. Modern websites nest <div> and <span> tags so deeply that naive converters produce nested garbage. Tables, code blocks, and images each have their own conversion quirks. The right tool for a one-off paste is different from the right tool for a CMS migration with 10,000 pages.

This guide covers the four serious approaches. The fastest path for a single page is to paste the HTML into Scrubadoc and download the cleaned Markdown. The rest of the post explains when that is the right move and when one of the heavier tools earns its complexity.

What “clean Markdown” actually means

A clean Markdown conversion preserves three things and discards the rest.

It preserves structure. Headings stay headings. Lists stay lists. Tables stay tables. Bold and italic survive. Hyperlinks survive with their text and their destination intact.

It preserves meaningful inline elements. Code spans stay as code. Blockquotes stay as >. Strikethrough survives where the target Markdown variant supports it. GitHub Flavored Markdown adds strikethrough, tables, and task lists on top of CommonMark; plain CommonMark has none of them.

It discards visual styling. Inline style attributes go. Color and font choices go. Custom class names go unless your downstream system actually uses them. The point of converting to Markdown is to escape the visual layer of HTML and end up with portable text.
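A minimal sketch of the "discard visual styling" step, assuming you want to pre-clean HTML before handing it to a converter. The function name and the attribute list are illustrative, not taken from any tool in this guide, and a regex is used only to keep the example dependency-free. A production pipeline would use a real HTML parser.

```javascript
// Attributes treated as purely presentational in this sketch (an assumption;
// trim the list if your downstream system relies on class names).
const PRESENTATIONAL_ATTRS = ["style", "class", "color", "face", "bgcolor", "align"];

function stripPresentationalAttributes(html, attrs = PRESENTATIONAL_ATTRS) {
  const names = attrs.join("|");
  // Matches ` name="..."`, ` name='...'`, or ` name=bare` inside a tag.
  const attrPattern = new RegExp(
    `\\s+(?:${names})\\s*=\\s*(?:"[^"]*"|'[^']*'|[^\\s>]+)`,
    "gi"
  );
  // Only rewrite opening tags, so text content is never touched.
  return html.replace(/<[a-zA-Z][^>]*>/g, (tag) => tag.replace(attrPattern, ""));
}

const sample = '<p style="color:red" class="lead">Hi <a href="/x" class="btn">link</a></p>';
console.log(stripPresentationalAttributes(sample));
// prints: <p>Hi <a href="/x">link</a></p>
```

The link destination survives because href is not in the list. Pass a shorter list as the second argument to keep class names your downstream system uses.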

Tools that fail to make these distinctions either preserve too much (and produce Markdown so cluttered it might as well still be HTML) or preserve too little (and lose the structure you actually wanted). The four tools below get the distinction right.

Method 1: A browser-based cleaner

For a one-off conversion, a browser-based cleaner is the right tool. You paste the HTML in, you copy the Markdown out, you move on.

Scrubadoc handles this workflow. You can paste HTML directly or upload an .html file. The conversion runs in your browser; the content never crosses a network boundary. The output is clean Markdown you can copy or download as .md.

The structure that survives includes headings (h1 through h6), ordered and unordered lists, hyperlinks, blockquotes, code spans, code blocks, tables, and basic emphasis. The structure that gets stripped includes inline styles, class attributes that do not map to Markdown, empty tags, and HTML comments.

When this method falls short: full-site migrations (you have thousands of pages), batch jobs that need to integrate into a build pipeline, and HTML that contains unusual structures the tool was not designed for.

Method 2: Turndown (JavaScript)

Turndown is the most widely used JavaScript library for HTML-to-Markdown conversion. It runs in Node, in the browser, and inside Electron applications. It is the engine behind several browser-based converter tools.

A minimal Node example:

```js
const TurndownService = require("turndown");
const turndown = new TurndownService();

const html = "<h1>Hello</h1><p>This is <strong>bold</strong>.</p>";
const markdown = turndown.turndown(html);
console.log(markdown);
```

Turndown’s design lets you customize how specific HTML elements convert to Markdown. If your input HTML uses a non-standard tag for code blocks, you can register a custom rule that handles it. The Turndown documentation covers the options.

For tables specifically, an optional plugin package called turndown-plugin-gfm adds GitHub Flavored Markdown support. Without it, Turndown has no table rule, so it keeps the cell text but loses the table structure. With it, tables convert correctly.
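Here is a hedged sketch of both ideas: a custom rule and the GFM plugin. The markup it targets is an assumption made for illustration: an upstream tool that emits code as <pre data-lang="js"> with no inner <code>, which Turndown's default code-block rules do not fence. The rule name and the data-lang attribute are hypothetical. The turndown and turndown-plugin-gfm packages are loaded inside the builder function so the rule itself can be read and tested without them.

```javascript
// A Turndown rule is a plain object with a `filter` and a `replacement`.
const preWithLangRule = {
  // Match <pre> elements that carry the (assumed) data-lang attribute.
  filter: (node) => node.nodeName === "PRE" && Boolean(node.getAttribute("data-lang")),
  // Emit a fenced block with the language as the info string.
  replacement: (content, node) => {
    const fence = "`".repeat(3);
    const lang = node.getAttribute("data-lang");
    // Use the raw text, not `content`, so code is not Markdown-escaped.
    const code = node.textContent.replace(/\n+$/, "");
    return `\n\n${fence}${lang}\n${code}\n${fence}\n\n`;
  },
};

// Requires: npm install turndown turndown-plugin-gfm
function buildConverter() {
  const TurndownService = require("turndown");
  const { gfm } = require("turndown-plugin-gfm");
  const service = new TurndownService({ codeBlockStyle: "fenced" });
  service.use(gfm); // adds tables, strikethrough, and task list items
  service.addRule("preWithLang", preWithLangRule);
  return service;
}

// Usage, once the packages are installed:
//   const markdown = buildConverter().turndown(html);
```

The rule matches on <pre> rather than a custom tag because Turndown collapses whitespace outside <pre> elements, which would flatten multi-line code.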

When this method makes sense: you build a JavaScript application that converts HTML to Markdown on the fly, you need fine-grained control over the conversion rules, or you want the conversion to run in the browser.

Method 3: Pandoc on the command line

Pandoc handles HTML-to-Markdown alongside its many other conversions. Its output is among the cleanest available, and it scripts cleanly into automation pipelines.

The basic command:

pandoc input.html -o output.md

For better results, specify the input and output formats explicitly:

pandoc input.html -f html -t gfm -o output.md

Pandoc handles tables, code blocks, footnotes, and embedded images. Its user guide covers every option in detail. For batch processing, a one-line shell script handles a folder of HTML files:

for f in *.html; do pandoc "$f" -o "${f%.html}.md"; done
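If the batch job lives inside a Node build script rather than a shell, the same loop can be sketched as below. It assumes pandoc is on the PATH and uses the explicit format flags shown earlier. The function names are illustrative. Unlike the one-liner, it collects the files that failed instead of stopping silently.

```javascript
const fs = require("node:fs");
const path = require("node:path");
const { execFileSync } = require("node:child_process");

// Build the pandoc arguments for one file: HTML in, GFM out, same base name.
function pandocArgs(inputPath) {
  const outputPath = inputPath.replace(/\.html?$/i, ".md");
  return [inputPath, "-f", "html", "-t", "gfm", "-o", outputPath];
}

// Convert every .html/.htm file in a directory; return the inputs that failed.
function convertDirectory(dir) {
  const failures = [];
  for (const name of fs.readdirSync(dir)) {
    if (!/\.html?$/i.test(name)) continue;
    const input = path.join(dir, name);
    try {
      execFileSync("pandoc", pandocArgs(input), { stdio: "inherit" });
    } catch (err) {
      failures.push(input); // pandoc missing or exited with an error
    }
  }
  return failures;
}

// Usage: const failed = convertDirectory("./export");
```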

Pandoc is also the right choice when the input HTML is unusually structured. MediaWiki, Confluence exports, and old static-site templates often produce HTML that other tools choke on. Pandoc’s parser is more forgiving than most.

When this method makes sense: you have a large set of HTML files to convert, you want the conversion scripted, or you need to integrate the conversion into a build pipeline.

Method 4: Language-specific libraries

If you build software in a language other than JavaScript, your ecosystem has a Markdown conversion library.

For Go, html-to-markdown by Johannes Kaufmann is the standard. It handles tables (with the GFM plugin), nested lists, and most edge cases out of the box.

For Python, markdownify and html2text are both widely used. Markdownify is more recently maintained and produces cleaner output for modern HTML.

For Ruby, ReverseMarkdown covers most cases.

For PHP, league/html-to-markdown is the canonical implementation.

The choice between them comes down to which language your application uses and which Markdown variant you target. All of them produce reasonable output for normal HTML. They diverge on edge cases.

The edge cases that trip every tool

Tables are the first place tools fail. Markdown tables are limited. They cannot represent merged cells, column-spanning headers, or row-spanning data. If your source HTML uses any of these, the conversion has to lose information. Pandoc handles this with the most grace; most other tools either drop the table or produce a flattened version.
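Before a large migration, it helps to know which pages contain tables that cannot survive the trip. The sketch below flags cells with a colspan or rowspan greater than 1. It is a regex heuristic with an illustrative function name, not part of any converter, and it assumes reasonably well-formed tags.

```javascript
// Report table cells whose colspan or rowspan is greater than 1.
function findSpanningCells(html) {
  const cellPattern = /<(td|th)\b([^>]*)>/gi;
  const spanPattern = /\b(colspan|rowspan)\s*=\s*["']?(\d+)/gi;
  const findings = [];
  for (const cell of html.matchAll(cellPattern)) {
    for (const span of cell[2].matchAll(spanPattern)) {
      const value = Number(span[2]);
      if (value > 1) {
        findings.push({
          tag: cell[1].toLowerCase(),
          attribute: span[1].toLowerCase(),
          value,
        });
      }
    }
  }
  return findings;
}

const table =
  '<table><tr><th colspan="2">Q1</th></tr>' +
  "<tr><td rowspan='3'>a</td><td colspan=\"1\">b</td></tr></table>";
console.log(findSpanningCells(table).length); // prints: 2
```

An empty result means the table is a plain grid. Anything else is a page to review by hand or to keep as raw HTML.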

Code blocks are the second place. HTML uses <pre> and <code> tags. Markdown uses triple-backtick fences. The conversion is straightforward when the HTML uses the conventional pattern. The conversion is messy when the HTML wraps each line in a separate <span> for syntax highlighting, which most modern code-rendering systems do.
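One way to tame highlighted code is to remove the <span> wrappers inside each <pre> before converting. The sketch below assumes the highlighter keeps real newlines between lines. Highlighters that put each line in a block element, or rely on CSS for line breaks, need more work. The function name is illustrative.

```javascript
// Remove <span> wrappers inside <pre> blocks, leaving the code text intact.
function flattenHighlightedPre(html) {
  return html.replace(/<pre\b[^>]*>[\s\S]*?<\/pre>/gi, (block) =>
    block.replace(/<\/?span\b[^>]*>/gi, "")
  );
}

const highlighted =
  '<pre><code><span class="line"><span class="kw">const</span> x = 1;</span>\n' +
  '<span class="line">x++;</span></code></pre>';
console.log(flattenHighlightedPre(highlighted));
// prints:
// <pre><code>const x = 1;
// x++;</code></pre>
```

Spans outside <pre> are left alone, so inline formatting elsewhere in the page is unaffected.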

Nested lists trip naive converters. A list inside a list item inside another list is valid HTML and valid Markdown. Some converters flatten the nesting; some preserve it. Test on a representative document before committing.
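A rough way to run that test is to compare list depth before and after conversion. The sketch below is a heuristic under stated assumptions: it counts open and close tags for the HTML side and distinct indentation widths of list items for the Markdown side, so it can overcount when one document mixes indentation styles. Both function names are illustrative.

```javascript
// Deepest <ul>/<ol> nesting in the HTML, found by scanning open and close tags.
function htmlListDepth(html) {
  let depth = 0;
  let max = 0;
  for (const m of html.matchAll(/<(\/?)(?:ul|ol)\b[^>]*>/gi)) {
    if (m[1] === "/") {
      depth = Math.max(0, depth - 1);
    } else {
      depth += 1;
      max = Math.max(max, depth);
    }
  }
  return max;
}

// Estimated nesting in the Markdown: number of distinct list-item indents.
function markdownListDepth(markdown) {
  const indents = new Set();
  for (const line of markdown.split("\n")) {
    const m = line.match(/^(\s*)(?:[-*+]|\d+[.)])\s+\S/);
    if (m) indents.add(m[1].length);
  }
  return indents.size;
}

const html = "<ul><li>a<ul><li>b<ol><li>c</li></ol></li></ul></li><li>d</li></ul>";
const flattened = "- a\n- b\n- c\n- d";
console.log(htmlListDepth(html), markdownListDepth(flattened)); // prints: 3 1
```

A Markdown depth lower than the HTML depth is a sign the converter flattened something.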

Images need special handling. Markdown represents an image as ![alt text](url). If the source HTML embeds the image as a base64 data URL, the resulting Markdown will contain a multi-megabyte string in the middle of the file. Most tools handle this acceptably; some require configuration.
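If your converter passes data URLs straight through, a post-processing step can pull them out of the Markdown. This is a minimal sketch: it swaps each base64 image for a short placeholder and returns the extracted data so the caller can write it to files. The placeholder naming is hypothetical.

```javascript
// Replace base64 data-URL images with short placeholders and collect the data.
function externalizeDataUrlImages(markdown) {
  const images = [];
  const pattern =
    /!\[([^\]]*)\]\((data:image\/[a-z0-9.+-]+;base64,[A-Za-z0-9+/=]+)\)/gi;
  const cleaned = markdown.replace(pattern, (match, alt, dataUrl) => {
    images.push(dataUrl);
    return `![${alt}](embedded-image-${images.length})`;
  });
  return { markdown: cleaned, images };
}

const input =
  "Intro ![logo](data:image/png;base64,iVBORw0KGgo=) end " +
  "![pic](https://example.com/a.png)";
console.log(externalizeDataUrlImages(input).markdown);
// prints: Intro ![logo](embedded-image-1) end ![pic](https://example.com/a.png)
```

Ordinary image URLs are left untouched. Only the embedded payloads move out of the file.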

Inline HTML inside Markdown is the last decision. CommonMark allows raw HTML inside Markdown documents. Some converters use this as an escape hatch for HTML structures Markdown cannot represent. Others strip the HTML entirely. Decide which behavior you want before you start, and test for it.

The pivot most HTML-to-Markdown posts miss

The conventional advice is “use Turndown or Pandoc.” That works. Here is what most guides skip.

The biggest source of bad Markdown is not the converter. It is the source HTML.

Modern websites wrap their content in layout <div> and <span> tags that have nothing to do with the content itself. CMS-generated HTML often embeds tracking pixels, navigation, sidebars, and ad placements in the same document tree as the article. A naive converter that sees all of this will produce Markdown that captures all of it.

The fix is to extract the content before you convert. The common pattern is to use Mozilla’s Readability library, which powers Firefox’s Reader View. Readability finds the main content of a webpage and discards the chrome. You feed the cleaned HTML into the converter, not the raw page source.

For a one-off paste, a tool that handles this for you saves the step. Scrubadoc cleans junk from pasted HTML before converting. For programmatic conversion, run Readability on the input first, then convert with Turndown or Pandoc. The Markdown that comes out is cleaner because the HTML that went in was cleaner.
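The extract-then-convert order can be sketched as a two-step function. The wiring below assumes the jsdom, @mozilla/readability, and turndown packages are installed. They are loaded lazily, so the pipeline can also be exercised with stand-in steps. The names pageToMarkdown and defaultSteps are illustrative.

```javascript
// Real wiring. Requires: npm install jsdom @mozilla/readability turndown
function defaultSteps() {
  const { JSDOM } = require("jsdom");
  const { Readability } = require("@mozilla/readability");
  const TurndownService = require("turndown");
  const turndown = new TurndownService({ headingStyle: "atx" });
  return {
    // Readability needs a DOM; parse() returns null when no article is found.
    extract: (html, url) => {
      const dom = new JSDOM(html, { url });
      const article = new Readability(dom.window.document).parse();
      return article ? article.content : null;
    },
    convert: (html) => turndown.turndown(html),
  };
}

// Extract the main content first, then convert only that.
function pageToMarkdown(html, url, steps = defaultSteps()) {
  const mainContent = steps.extract(html, url);
  // Fall back to the raw page if extraction finds nothing.
  return steps.convert(mainContent ?? html);
}

// Usage: const markdown = pageToMarkdown(rawHtml, "https://example.com/post");
```

Passing the page URL lets relative links in the extracted content resolve against the right base.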

Pick a tool, then commit

For a one-off conversion, the browser cleaner. Scrubadoc handles paste-in HTML and .html file uploads, runs client-side for privacy, and produces clean output without an install.

For batch jobs, Pandoc and a shell script.

For a JavaScript application that converts on the fly, Turndown.

For a server-side application in another language, the canonical library for that language.

The biggest mistake is using a heavyweight tool for a lightweight job. If you need to convert a single article right now, the browser tool will be done before Pandoc finishes installing.

How is the HTML you need to convert reaching you: as files, as paste-in text, or as URLs to fetch? That answer changes which tool fits your workflow.
