Convert Word to HTML Without the Hidden Junk

Microsoft Word’s “Save as Web Page” feature has a deserved reputation. It produces some of the worst HTML you will encounter outside of a 1998 GeoCities archive. Anyone who has ever pasted a Word document into a CMS and watched the layout fall apart knows the problem. The HTML works, technically. It is also full of <o:p> tags, MSO-prefixed CSS, and empty <span> elements that override your site’s styling.

The good news is that converting a .docx to clean HTML is genuinely fast in 2026. The native Word export is the worst option. Three other approaches give you publishable output in under a minute. This guide walks through all four, ranks them by use case, and explains where each one fails on edge cases and why.

If you want the fastest path: paste your Word content into Scrub-a-Doc and copy the cleaned HTML. It runs in your browser, processes the document client-side, and costs nothing. The rest of this post explains when that is the right call and when one of the heavier options earns its complexity.

Why Word’s HTML is bad on purpose

Word stores documents in the Office Open XML format, defined by the ECMA-376 specification. A .docx file is a ZIP archive containing several XML files plus the embedded media. The XML inside is dense and verbose because it captures everything Word needs to recreate the document exactly: the layout, the tracked-changes history, the styles, the section breaks, the floating-image anchors, and the comments.
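You can verify the ZIP claim yourself: ZIP archives with at least one entry begin with the bytes PK\x03\x04, so a genuine .docx should too. Below is a minimal Node sketch. The helper names looksLikeZip and fileLooksLikeZip are hypothetical, not part of any library.

```javascript
const fs = require("fs");

// ZIP local file header signature: "PK" followed by 0x03 0x04.
const ZIP_SIGNATURE = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

// Hypothetical helper: true if the buffer starts with the ZIP signature.
function looksLikeZip(buffer) {
  return buffer.length >= 4 && buffer.subarray(0, 4).equals(ZIP_SIGNATURE);
}

// Hypothetical helper: reads only the first four bytes of a file and checks them.
function fileLooksLikeZip(filePath) {
  const fd = fs.openSync(filePath, "r");
  try {
    const head = Buffer.alloc(4);
    const bytesRead = fs.readSync(fd, head, 0, 4, 0);
    return bytesRead === 4 && looksLikeZip(head);
  } finally {
    fs.closeSync(fd);
  }
}
```

A legacy binary .doc that someone renamed to .docx fails this check, which is a cheap way to catch it before handing the file to a converter.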

When you save a Word document as HTML, Word translates that XML into HTML that tries to preserve the visual layout. The translation includes a lot of metadata and styling that has no business in a webpage. You get inline styles like mso-fareast-font-family, conditional comments wrapped around Outlook-specific blocks, and nested <span> tags that exist only to mark where Word’s internal style runs began and ended.

Microsoft documents these limitations and offers a “Filtered” HTML option that strips the worst of the Office-specific tags. Even Filtered HTML is not clean. It is just less dirty.

The deeper problem is that Word’s HTML export is designed to round-trip back into Word. Microsoft optimized for the case where someone exports HTML, opens it again in Word, and expects the same document. That requirement forces Word to preserve information that no website needs. The HTML is verbose because the format prioritizes Word’s own use case, not yours.

Method 1: Use a browser-based cleaner

For most users, a browser-based cleaner is the right tool. You paste content or upload the .docx, you copy the cleaned HTML, you ship it.

The advantages are real. No install. No CLI. No learning curve. A good browser tool processes the document client-side, which means your content never leaves your machine. That privacy property matters when the document contains anything you would not want logged on a third-party server.

Scrub-a-Doc handles this workflow specifically. You can paste content directly from Word, Word Online, or Google Docs, or upload a .docx file. The tool strips Word’s hidden formatting while preserving the structure you actually want: headings, lists, hyperlinks, tables, and basic emphasis. The output is HTML you can copy to your clipboard or download as a file. Other browser-based options exist (some run server-side, which has tradeoffs for confidential material), but the workflow is the same.

When this method falls short: large documents (thousands of pages), batch jobs (you have 200 .docx files to process), or workflows that need to integrate into a build pipeline.

Method 2: Save as Filtered HTML inside Word

Word’s built-in option is the slowest of the four methods, but it works without any external tool. It is also the only one of the four that runs entirely inside Microsoft’s product, which matters if your IT policy prohibits uploading documents to outside services or installing third-party tools.

The steps:

1. Open the document in Word.
2. Click File → Save As.
3. Choose “Web Page, Filtered” from the format dropdown.
4. Click Save.

The filtered output strips Office-specific tags but still produces verbose HTML. Expect to do a cleanup pass after the export. Common things to remove or simplify include the inline style attributes, the empty <span> tags, and the class attributes that reference Word styles your CSS does not know about.
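The cleanup pass described above can be roughed out in a few lines of Node. This is a sketch that assumes regex-level cleanup is good enough for your document. The function name cleanFilteredHtml is hypothetical, and a real HTML parser would be more robust on unusual markup.

```javascript
// Hypothetical helper: a regex-based cleanup pass for Word's Filtered HTML.
function cleanFilteredHtml(html) {
  let out = html
    // Drop inline style attributes (double- or single-quoted).
    .replace(/\sstyle=("[^"]*"|'[^']*')/gi, "")
    // Drop class attributes that reference Word styles such as MsoNormal.
    .replace(/\sclass=("?)Mso[\w-]*\1/gi, "");

  // Unwrap spans that have no attributes and contain no other span.
  // Looping lets outer spans unwrap once their inner spans are gone.
  const bareSpan = /<span>((?:(?!<\/?span[\s>]).)*)<\/span>/gis;
  let previous;
  do {
    previous = out;
    out = out.replace(bareSpan, "$1");
  } while (out !== previous);

  return out;
}
```

Run it on a copy of the export and diff the result. The pass is deliberately conservative: a span that still carries an attribute, such as lang, is left alone.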

Microsoft’s file-formats reference lists every Save As variant. Use Filtered HTML, never the unfiltered version, unless you are intentionally preserving everything for a round-trip back into Word.

When this method makes sense: small documents, a one-off conversion, and environments where uploading to an external tool is not permitted.

Method 3: Pandoc on the command line

Pandoc is the gold-standard document converter. Installing it takes a few minutes on any operating system. The learning curve is one afternoon. After that, you have a tool that converts between dozens of formats, scripts cleanly into automation, and produces HTML that is among the cleanest you can get out of .docx.

The basic command is straightforward:

pandoc input.docx -o output.html

For better results, name the input and output formats explicitly. Add the --standalone flag to produce a complete HTML document with a <head>, or omit it for a fragment you can drop into an existing template:

pandoc input.docx -f docx -t html5 --standalone -o output.html

Pandoc handles tables, footnotes, hyperlinks, and embedded images correctly. Its user guide documents every option, including how to extract embedded media to a separate folder, how to apply a custom template, and how to control which HTML elements get used for which Word styles.

When this method makes sense: you process more than a handful of documents per week, you want the conversion to be scriptable, or you need to convert documents as part of a build pipeline. Pandoc also supports custom Lua filters if you need to transform the document during conversion. The Pandoc filter reference covers the API.
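For the batch case, a small Node wrapper is one way to script it. The sketch below assumes pandoc is installed and on your PATH. The function names and the folder layout (one HTML file and one media folder per document) are illustrative choices, not anything Pandoc prescribes.

```javascript
const { execFileSync } = require("child_process");
const fs = require("fs");
const path = require("path");

// Hypothetical helper: builds the pandoc argument list for one .docx file.
function pandocArgs(inputPath, outputDir) {
  const base = path.parse(inputPath).name; // file name without extension
  return [
    inputPath,
    "-f", "docx",
    "-t", "html5",
    // Write embedded images to a per-document folder instead of dropping them.
    `--extract-media=${path.join(outputDir, base + "-media")}`,
    "-o", path.join(outputDir, base + ".html"),
  ];
}

// Hypothetical helper: converts every .docx in inputDir, one pandoc run per file.
function convertFolder(inputDir, outputDir) {
  fs.mkdirSync(outputDir, { recursive: true });
  for (const name of fs.readdirSync(inputDir)) {
    if (!name.toLowerCase().endsWith(".docx")) continue;
    execFileSync("pandoc", pandocArgs(path.join(inputDir, name), outputDir), {
      stdio: "inherit", // show pandoc's warnings in the terminal
    });
  }
}
```

Calling convertFolder("drafts", "site/html") would process the whole folder in one go. Using execFileSync with an argument array, rather than a shell string, avoids quoting problems with file names that contain spaces.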

The downside is that you are running a separate program. For a single document, the browser tool is faster.

Method 4: Mammoth for developers

If you build software, Mammoth is the library worth knowing. It exists to convert .docx to clean HTML programmatically and ships in two flavors: Mammoth.js for JavaScript and python-mammoth for Python.

Mammoth’s design philosophy is the opposite of Word’s HTML export. It throws out Word-specific styling and produces semantic HTML by default. A paragraph in Word’s Heading 1 style becomes an <h1>. A bulleted list becomes a <ul>. A bold run becomes a <strong>. The library exposes a style map so you can override the defaults if your document uses non-standard Word styles.

A minimal Node example:

const mammoth = require("mammoth");

mammoth.convertToHtml({path: "input.docx"})
  .then(result => {
    console.log(result.value); // the HTML
    console.log(result.messages); // any warnings
  })
  .catch(error => {
    console.error(error); // unreadable file or invalid .docx
  });
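The style map mentioned above is passed as a second argument to convertToHtml. Here is a sketch, assuming your document uses two custom paragraph styles named “Section Title” and “Pull Quote”. Those names are made up for illustration, so swap in the style names your template actually uses.

```javascript
// Assumed custom Word style names; replace them with the ones in your document.
const styleMap = [
  "p[style-name='Section Title'] => h1:fresh",
  "p[style-name='Pull Quote'] => blockquote > p:fresh",
];

// Hypothetical wrapper around mammoth.convertToHtml that applies the map.
async function convertWithStyleMap(docxPath) {
  // Loaded here so the map can be inspected without the package installed.
  const mammoth = require("mammoth");
  const result = await mammoth.convertToHtml({ path: docxPath }, { styleMap });
  return { html: result.value, warnings: result.messages };
}
```

Check the returned warnings after a run. Mammoth reports styles it did not recognize there, which tells you what to add to the map next.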

Mammoth is what you reach for when your application needs to accept Word documents from end users and render them as HTML in a webpage. It is also the engine behind several browser-based converter tools, because the same JavaScript that runs in Node also runs in the browser.

When this method makes sense: you build a product, you accept Word uploads, and you need the conversion to happen inside your code rather than outside it.

Edge cases that some tools may not cover

Tables. Word lets you do nested tables, merged cells, and column-spanning headers. Most converters handle simple tables well and stumble on the complex ones. If your document uses heavy tables, test the output before committing to a workflow.
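A quick way to test table output is to scan the converted HTML for the features that tend to break. The sketch below uses a hypothetical helper, tableRisks, and reports only what survived conversion. If you know the source has merged cells and the check comes back false, the converter flattened them.

```javascript
// Hypothetical helper: reports table features that converters often mishandle.
function tableRisks(html) {
  // Track nesting depth by walking opening and closing table tags in order.
  let depth = 0;
  let nested = false;
  for (const tag of html.matchAll(/<(\/?)table\b/gi)) {
    depth += tag[1] ? -1 : 1;
    if (depth > 1) nested = true;
  }
  return {
    nested,
    // A colspan or rowspan of 2 or more means merged cells are present.
    mergedCells: /\s(?:col|row)span\s*=\s*["']?(?:[2-9]|[1-9]\d)/i.test(html),
  };
}
```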

Images. Word embeds images directly inside the .docx ZIP. Conversion tools handle them in three ways: extract them to a separate folder and link to them, encode them as base64 inside the HTML, or skip them entirely. Pandoc lets you choose. Mammoth inlines them as base64 by default, and most browser tools use one of the first two.

Footnotes and endnotes. Pandoc and Mammoth handle these well. Word’s own export and most quick-and-dirty converters do not.

Tracked changes and comments. These exist in the .docx XML but should not appear in the published HTML. Most converters drop them. Confirm before you publish.
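One way to confirm is to look for revision markup in the source before converting. The sketch below assumes you have already extracted word/document.xml from the .docx archive (any unzip tool will do) and read it into a string. The helper name findRevisionMarkup is hypothetical.

```javascript
// Hypothetical helper: scans the text of word/document.xml for revision markup.
function findRevisionMarkup(documentXml) {
  // Match the exact element name, so <w:delText> or <w:instrText> do not count.
  const has = (tag) => new RegExp(`<w:${tag}[\\s>/]`).test(documentXml);
  return {
    insertions: has("ins"), // tracked insertions
    deletions: has("del"), // tracked deletions
    comments: has("commentReference") || has("commentRangeStart"),
  };
}
```

If any field comes back true, accept or reject the changes and delete the comments in Word, then convert.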

Special characters and smart quotes. Word’s autocorrect inserts curly quotes, em dashes, and other typographic characters. Most modern converters preserve them as Unicode. If your downstream system does not handle Unicode, you may need a final pass to convert them to HTML entities.
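That final pass can be a one-liner. The sketch below replaces every non-ASCII character with a numeric character reference. The function name toNumericEntities is hypothetical, and the approach assumes your downstream system accepts numeric references.

```javascript
// Hypothetical helper: replaces every non-ASCII character with a numeric entity.
function toNumericEntities(html) {
  // The u flag makes the regex match whole code points, including emoji.
  return html.replace(/[^\x00-\x7F]/gu, (ch) => `&#${ch.codePointAt(0)};`);
}
```

Tags and plain ASCII text pass through unchanged, so it is safe to run on the whole HTML string. A left curly quote, for example, comes out as &#8220;.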

Most “Word to HTML” problems are not Word-to-HTML problems

The conventional advice on Word-to-HTML conversion is “pick the right tool.” That is the easy half. Here is the part most guides skip.

The biggest source of dirty HTML is not the converter. It is what is already inside the .docx before you convert.

A Word document that started life as a copy-pasted email, then got pasted into a slide deck, then got pasted back into Word will carry along formatting from each step. Inline styles override the document’s stylesheet. Empty runs and zero-width spaces hide between visible characters. Fonts and colors that came from the original sources persist long after they should have been cleaned.
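The invisible characters are the easiest part of that mess to detect. Here is a sketch, assuming the usual culprits are the zero-width space, the word joiner, and the byte order mark. The helper names are hypothetical. The zero-width joiner and non-joiner are left alone on purpose because some scripts and emoji sequences need them.

```javascript
// Invisible characters that often ride along through copy-paste:
// zero-width space, word joiner, and byte order mark.
const INVISIBLES = /[\u200B\u2060\uFEFF]/g;

// Hypothetical helper: how many invisible characters the text contains.
function countInvisibles(text) {
  return (text.match(INVISIBLES) || []).length;
}

// Hypothetical helper: the same text with those characters removed.
function stripInvisibles(text) {
  return text.replace(INVISIBLES, "");
}
```

Counting first is useful as a diagnostic. A high count in a short document is a good hint that it has been through several rounds of pasting.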

Even the best converter cannot guess that the bright yellow highlight on page three was a note to self that should not survive. It can only convert what is there.

The fix is to clean the source before you convert. Inside Word, you can select all and use the “Clear All Formatting” button on the Home tab, then reapply the structure you actually want. On Windows, Ctrl+Spacebar removes manual character formatting from the selection and Ctrl+Q removes manual paragraph formatting. A faster fix is to use a converter whose first job is cleaning the source. Tools that strip junk on input save you the cleanup pass afterward.

That second approach is also why a “Word to HTML converter” and a “Word document cleaner” tend to be the same tool. A clean source produces clean output. A dirty source produces dirty output, no matter how good the conversion engine is downstream.

Pick a method and ship

Four methods, four use cases. The browser cleaner for one-off conversions. Word’s Save As Filtered HTML for environments where uploading is restricted. Pandoc for scripted batch jobs. Mammoth for software you ship.

If you have not picked a default tool yet and you write more than once a week, the browser cleaner is the right starting point. Scrub-a-Doc handles the workflow most writers actually need: paste, clean, copy. When your volume grows past one document a day, install Pandoc. When you build something that consumes documents at scale, embed Mammoth.

What is the longest Word document you have tried to convert in the last month, and which method did it break?
