Convert a Google Doc to Clean HTML (Without the Bloated Code)

Google says Workspace has more than three billion users, and a meaningful slice of them are writers. Drafts, blog posts, marketing copy, internal memos, and AI prompts all start in Google Docs because the collaboration story is unmatched. The shared cursor, the comments, the suggesting mode, and the version history are all reasons teams default to Docs over Word.

The reason teams stop loving Google Docs is the export. The native “Web Page (.html, zipped)” download produces some of the worst HTML on the modern internet. The output wraps nearly every run of text in a <span> tagged with an opaque class name. It defines a class for every text variation in the document, including ones you did not intend. It includes Google-specific markup that has no business in a webpage. Pasting that HTML into a CMS, an email, or a static site is a guaranteed cleanup job.

This guide walks through four ways to get clean HTML out of a Google Doc, with the tradeoffs each one carries. The fastest path is to paste the document content into Scrub-a-Doc and copy the cleaned HTML. The other methods earn their place when you have a specific reason to use them.

Why Google’s native HTML is unusable

When you click File → Download → Web Page (.html, zipped), Google produces a ZIP file containing your HTML and any embedded images. Opening the HTML reveals the problem. Every paragraph and text run carries a class attribute. Every formatting variation gets a class with a name like c12 or c47 that maps to a rule in a <style> block at the top of the document. The names are generated and carry no meaning; text that looks identical can land in different classes when any underlying property differs, so you cannot rely on them from one document to the next.

The reason is that Google Docs models documents internally as a stream of styled runs, not as semantic structure. A heading is a paragraph with a specific style applied; a bulleted list is a series of paragraphs with list properties; a table is a grid of cells with cell-specific styles. Google’s HTML export translates this internal model directly. The output preserves everything the document says about how it should look, but it carries that information as presentational classes and styles rather than as semantic HTML.
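To make the “styled runs” model concrete, here is a minimal JavaScript sketch of what a cleanup pass has to undo: read the class definitions from the <style> block, work out which classes mean bold or italic, and replace the spans with semantic tags. The sample markup is a hand-written imitation of the export’s shape, not real Google output, and the function names are hypothetical. Regular expressions keep the example short; they do not handle nested spans, so treat this as an illustration rather than a production parser.

```javascript
// Illustrative input only: a hand-written imitation of the export's shape.
const sample =
  '<style>.c1{font-weight:700}.c2{font-style:italic}.c3{color:#000}</style>' +
  '<p class="c3"><span class="c1">Bold</span><span class="c3"> and </span>' +
  '<span class="c2">italic</span></p>';

// Map each class name to the emphasis it implies, by reading the <style> block.
function readClassStyles(html) {
  const match = html.match(/<style[^>]*>([\s\S]*?)<\/style>/i);
  const css = match ? match[1] : "";
  const styles = {};
  for (const [, name, body] of css.matchAll(/\.([\w-]+)\s*\{([^}]*)\}/g)) {
    styles[name] = {
      bold: /font-weight\s*:\s*(bold|[6-9]00)/i.test(body),
      italic: /font-style\s*:\s*italic/i.test(body),
    };
  }
  return styles;
}

// Replace class-styled spans with <strong>/<em>, then drop every class attribute.
function spansToSemantic(html) {
  const styles = readClassStyles(html);
  return html
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "")
    .replace(/<span class="([^"]*)">([\s\S]*?)<\/span>/g, (_, classes, text) => {
      const names = classes.split(/\s+/);
      const bold = names.some((n) => styles[n] && styles[n].bold);
      const italic = names.some((n) => styles[n] && styles[n].italic);
      let out = text;
      if (italic) out = `<em>${out}</em>`;
      if (bold) out = `<strong>${out}</strong>`;
      return out;
    })
    .replace(/\sclass="[^"]*"/g, "");
}

console.log(spansToSemantic(sample));
// <p><strong>Bold</strong> and <em>italic</em></p>
```

The point of the two-step design is that the class names alone tell you nothing; the meaning only appears once you look each name up in the stylesheet.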

The result is HTML that renders correctly in a browser and falls apart inside any other system. CMSs strip the styling and leave you with unstyled text. Many email clients drop the <style> block, so the class names point at nothing. Static-site generators ignore the classes and apply their own theme. The export is technically valid HTML; it is also unusable for almost every workflow except recreating the document inside another browser.

Google’s documentation lists the export formats without addressing the cleanup problem. The HTML download is built to reproduce the document’s appearance in a browser, not to feed a CMS or an email tool, which is what most people want it for.

Method 1: Paste into a browser-based cleaner

For a one-off conversion, paste the Google Doc content directly into a cleaner. This skips Google’s HTML export entirely and takes the rich-text content from your clipboard, which is much closer to what you want.

Scrub-a-Doc handles this workflow specifically. You select the content in your Google Doc, copy it (Cmd+C or Ctrl+C), and paste it into Scrub-a-Doc’s editor. The tool reads the rich-text data from your clipboard, strips Google’s inline styles and class references, and produces clean HTML you can copy or download. Because the conversion runs client-side in your browser, the document content never leaves your machine.

The structure that survives includes headings (h1 through h6), bulleted and numbered lists, hyperlinks, blockquotes, basic emphasis (bold, italic, underline), and tables. The structure that gets stripped includes Google’s class attributes (the c12-style names), inline styles, and font and color choices.
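The keep-and-strip rule above amounts to an allowlist. The sketch below is one hypothetical way to express it in JavaScript; it is not Scrub-a-Doc’s implementation, and the KEEP set and function name are assumptions made for illustration. It keeps the listed tags, drops every attribute except href on links, and removes any other tag while keeping its text. It deliberately does not translate inline bold or italic styles into tags, so emphasis that exists only as a style attribute is lost here; a real cleaner would convert that first.

```javascript
// Tags allowed to survive; everything else is unwrapped (content kept, tag dropped).
const KEEP = new Set([
  "h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "li", "a", "blockquote",
  "strong", "b", "em", "i", "u", "table", "thead", "tbody", "tr", "th", "td", "br",
]);

function stripToSemantic(html) {
  return html
    // Remove <style> and <script> blocks together with their contents.
    .replace(/<(style|script)\b[^>]*>[\s\S]*?<\/\1>/gi, "")
    // Rewrite every remaining tag: keep allowed names, discard attributes.
    .replace(/<(\/?)([a-zA-Z][\w-]*)\b([^>]*)>/g, (_, slash, rawName, attrs) => {
      const name = rawName.toLowerCase();
      if (!KEEP.has(name)) return "";
      if (slash) return `</${name}>`;
      const href = name === "a" && attrs.match(/\shref="([^"]*)"/i);
      return href ? `<a href="${href[1]}">` : `<${name}>`;
    });
}

const pasted =
  '<h2 class="c7" id="h.x1">Title</h2>' +
  '<p class="c3" style="margin:0"><span class="c1">Hi</span> ' +
  '<a class="c9" href="https://example.com">link</a></p>';

console.log(stripToSemantic(pasted));
// <h2>Title</h2><p>Hi <a href="https://example.com">link</a></p>
```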

When this method falls short: very long documents (hundreds of pages), batch jobs (you have many documents to process), and workflows that need to integrate into an automated pipeline.

Method 2: Use Google’s HTML export, then clean

If you cannot paste (because the document is enormous, because you need to preserve embedded images, or because you need to script the export), Google’s native HTML export is the next option. The cost is a manual cleanup pass.

The steps:

1. Open the document in Google Docs.
2. Click File → Download → Web Page (.html, zipped).
3. Unzip the result. You will get an .html file plus an images/ folder.
4. Run the HTML through a cleanup tool to strip Google’s inline styles and classes.

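For the cleanup pass, you can paste the unzipped HTML into Scrub-a-Doc, which handles HTML input as well as Google Doc paste-in. You can also run it through Pandoc on the command line. Disabling the native_spans and native_divs reader extensions tells Pandoc to drop Google’s <span> and <div> wrappers instead of passing them through. Some class attributes (on headings, for example) can still survive, so check the output. Swap the output format and filename to go directly to Markdown instead:

pandoc input.html -f html-native_divs-native_spans -t html5 -o cleaned.html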
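One more cleanup detail to check in your own export: links in the exported file are often routed through a google.com/url redirect, with the real destination in the q parameter. That is an observation about typical exports rather than something documented here, so verify it against your file. Assuming that shape, a small JavaScript sketch (function names are hypothetical) can restore the original targets:

```javascript
// Return the real destination if href is a google.com/url redirect, else href unchanged.
function unwrapGoogleRedirect(href) {
  let url;
  try {
    // Attribute values in HTML encode "&" as "&amp;"; decode before parsing.
    url = new URL(href.replace(/&amp;/g, "&"));
  } catch {
    return href; // relative links and fragments such as "#h.abc" are left alone
  }
  const isRedirect =
    /(^|\.)google\.com$/.test(url.hostname) && url.pathname === "/url";
  return (isRedirect && url.searchParams.get("q")) || href;
}

// Apply the unwrapping to every href attribute in an HTML string.
function unwrapLinks(html) {
  return html.replace(/href="([^"]*)"/g, (whole, href) => {
    const target = unwrapGoogleRedirect(href);
    if (target === href) return whole;
    const escaped = target.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
    return `href="${escaped}"`;
  });
}
```

For example, unwrapLinks applied to a link whose href is https://www.google.com/url?q=https://example.com/a&amp;sa=D yields the same link pointing at https://example.com/a.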

The advantage of going through Google’s native export is that it preserves embedded images as separate files in the images/ folder. The paste-in workflow does not capture embedded images. If your document leans on images, the export-and-clean approach is worth the extra step.

When this method makes sense: large documents, documents with many embedded images, and one-off conversions where Method 1 cannot capture the full content.

Method 3: Google Apps Script for programmatic conversion

If you process Google Docs at scale (a content team migrating dozens of documents, an automation that produces HTML from a Google Doc template), Google Apps Script is the right tool. Apps Script runs JavaScript inside Google’s infrastructure with direct access to your Google Docs.

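A minimal skeleton of a script that reads a Google Doc and produces HTML:

/**
 * Opens a Google Doc by ID and returns it as an HTML string.
 * convertElement is not built in: it is the tree-walking helper
 * you write yourself, as described below.
 */
function docToHtml(docId) {
  const doc = DocumentApp.openById(docId);
  const body = doc.getBody();
  // Walk the document body and emit semantic HTML.
  return convertElement(body);
}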

The full implementation is heavier, because Apps Script’s Document Service exposes the document as a tree of elements (paragraphs, list items, tables, images) that you walk and translate yourself. Several open-source Apps Script projects on GitHub provide starting points; pick one that matches your output target (HTML, Markdown, or both).
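The tree walk itself can be sketched without Apps Script. The example below runs on a plain-object stand-in for the document tree, so the node shape ({ type, heading, text, bold, italic, children }) is an assumption made for illustration. The type and heading strings borrow the names of the Document Service enums, but in a real script you would read them through the element’s own methods, and text formatting is tracked per character range rather than as a single flag. Lists, tables, and images are left out to keep the sketch short.

```javascript
// Paragraph heading names mapped to HTML tags; anything else becomes <p>.
const HEADING_TAGS = {
  HEADING1: "h1", HEADING2: "h2", HEADING3: "h3",
  HEADING4: "h4", HEADING5: "h5", HEADING6: "h6",
};

function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Recursively translate a node of the (hypothetical) plain-object tree to HTML.
function convertElement(node) {
  const children = (node.children || []).map((child) => convertElement(child)).join("");
  switch (node.type) {
    case "BODY_SECTION":
      return children;
    case "PARAGRAPH": {
      const tag = HEADING_TAGS[node.heading] || "p";
      return `<${tag}>${children}</${tag}>`;
    }
    case "TEXT": {
      let out = escapeHtml(node.text);
      if (node.italic) out = `<em>${out}</em>`;
      if (node.bold) out = `<strong>${out}</strong>`;
      return out;
    }
    default:
      // Unknown element types: keep their content, drop the wrapper.
      return children;
  }
}

const tree = {
  type: "BODY_SECTION",
  children: [
    { type: "PARAGRAPH", heading: "HEADING1", children: [{ type: "TEXT", text: "Title" }] },
    { type: "PARAGRAPH", heading: "NORMAL", children: [{ type: "TEXT", text: "a < b", bold: true }] },
  ],
};

console.log(convertElement(tree));
// <h1>Title</h1><p><strong>a &lt; b</strong></p>
```

Because the output is built only from tags the function chooses to emit, nothing presentational can leak through, which is the main appeal of this route over cleaning an export after the fact.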

The advantage of Apps Script is that the conversion logic runs inside Google’s environment. You skip the export step entirely; you read the structured document directly. The output you produce is whatever HTML you write code to emit, which can be cleaner than any export.

When this method makes sense: you process Google Docs as part of an automated workflow, you have engineering time to invest in the script, and you need cleaner output than any one-shot tool produces.

Method 4: Export as .docx, then convert

The fourth option treats the Google Doc as a Word document. You export the doc as .docx (File → Download → Microsoft Word), then convert the .docx to HTML using any of the methods covered in Convert Word to HTML.

This works because Google’s .docx export is closer to standard Word output than its HTML export is to standard HTML. The structure survives the export better, and tools like Pandoc and Mammoth handle the resulting .docx correctly.

The command-line version with Pandoc:

pandoc input.docx -f docx -t html5 -o output.html
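For a scripted batch, the same command can be assembled per file. The helper below is a hypothetical sketch in JavaScript: it only builds the argument list, deriving the output name from the input and adding Pandoc’s --extract-media option so embedded images are written to a folder instead of being left inside the .docx. The function name and the default folder name are assumptions.

```javascript
// Build the pandoc argument list for one .docx file (does not run pandoc itself).
function pandocArgsFor(docxPath, mediaDir = "media") {
  const output = docxPath.replace(/\.docx$/i, ".html");
  return [
    docxPath,
    "-f", "docx",
    "-t", "html5",
    `--extract-media=${mediaDir}`,
    "-o", output,
  ];
}

console.log(pandocArgsFor("drafts/post.docx").join(" "));
// drafts/post.docx -f docx -t html5 --extract-media=media -o drafts/post.html
```

In Node, the returned array can be passed as the second argument to child_process.execFileSync("pandoc", ...), which avoids shell quoting problems with file names that contain spaces.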

When this method makes sense: large documents, documents with many embedded images, scripted conversions that already use Pandoc, and workflows that need both .docx and HTML versions of the same document.

The edge cases worth knowing

Embedded images. The paste-in workflow does not capture images. Google’s HTML export captures them as separate files. The .docx export carries them inside the .docx file itself, which is a ZIP archive under the hood. Pick your method based on whether you need images and how you want them delivered.

Tables. Google Docs tables convert reasonably across all four methods. Complex tables (merged cells, nested tables) survive best through the .docx export plus Pandoc.

Comments and suggestions. Drop them before publishing. Most conversion methods leave them out automatically. The .docx export is the exception; check the resulting Word file before converting.

Footnotes. Pandoc handles them. The paste-in workflow drops them. If your document has footnotes, use the .docx-plus-Pandoc path.

Styles you applied manually. Google Docs lets you apply custom font sizes and colors directly to text. None of these survive the conversion to clean semantic HTML, because semantic HTML does not encode visual styling. If your document depends on inline styling for meaning, the meaning is lost in the conversion. Encode that meaning in headings, lists, or other semantic structure before exporting.

The importance of clean Google Doc structure

The cleanest HTML conversion starts with structurally clean source documents. A Google Doc that uses heading styles consistently, applies emphasis through Bold and Italic instead of font weight changes, and uses the list buttons instead of typed bullets converts more cleanly than one that fakes structure with manual styling.

The fix is upstream of the converter. Train writers to use the heading dropdown instead of making text bigger and bold. Use the list button instead of typing dashes. Apply blockquote styles instead of indenting paragraphs manually. The conversion downstream becomes trivial because the structure was real to start with.

This advice sounds like nagging. It is also the difference between a one-paragraph cleanup and a thirty-minute cleanup, every time you convert a document.
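If you want to spot faked structure after the fact, a rough check is possible on the cleaned HTML. The JavaScript sketch below flags paragraphs whose entire content is a single short bold run, which usually means someone made a heading by hand. The function name and the 80-character cutoff are arbitrary assumptions, and the pattern will miss variations such as bold nested inside other tags.

```javascript
// Return the text of paragraphs that consist of nothing but one short bold run.
function findFakeHeadings(html) {
  const pattern = /<p\b[^>]*>\s*<(strong|b)>([^<]{1,80})<\/\1>\s*<\/p>/gi;
  const hits = [];
  for (const match of html.matchAll(pattern)) {
    hits.push(match[2].trim());
  }
  return hits;
}

const cleaned =
  "<h2>Real Heading</h2>" +
  "<p><strong>Fake Heading</strong></p>" +
  "<p>Body text with a <strong>bold</strong> word.</p>";

console.log(findFakeHeadings(cleaned));
// [ 'Fake Heading' ]
```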

Pick by use case and ship

For a one-off conversion, paste the document into Scrub-a-Doc and copy the cleaned HTML.

For a document with embedded images, use Google’s native HTML export, unzip, and run the result through a cleaner.

For programmatic conversion at scale, write a Google Apps Script that reads the document directly.

For a workflow that already uses Pandoc, export the doc as .docx and convert from there.

What is the longest Google Doc you have tried to convert to HTML in the last quarter, and which step ate the most time?
