DEV Community

Serge Artishev

Building a DOCX to Markdown Converter with Node.js

Welcome to a step-by-step guide to building a DOCX to Markdown converter with Node.js. The project is a good way to learn about file manipulation, command-line interfaces, and document format conversion. By the end of this guide, you'll have a command-line tool that converts DOCX files to Markdown, extracts embedded images, and formats tables. Let's dive in!

Table of Contents

  1. Introduction
  2. Setting Up the Project
  3. Basic DOCX to HTML Conversion
  4. Converting HTML to Markdown
  5. Extracting Images
  6. Formatting Tables
  7. Conclusion

Introduction

Markdown is a lightweight markup language with plain text formatting syntax. It's widely used for documentation due to its simplicity and readability. However, many documents are created in DOCX format, especially in corporate environments. Converting these documents to Markdown can be tedious if done manually. This is where our converter comes in handy.

Setting Up the Project

First, let's create a new directory for our project and initialize it with npm.

mkdir docx-to-md-converter
cd docx-to-md-converter
npm init -y

Next, we'll install the necessary dependencies. We'll use mammoth for converting DOCX to HTML, turndown for converting HTML to Markdown, commander for building the CLI, and uuid for unique image names.

npm install mammoth turndown commander uuid

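Create a new file named index.js in your project directory. This will be the main file for our converter. The scripts in this guide use ES module import syntax, so also set "type": "module" in package.json (for example with npm pkg set type=module); otherwise Node.js may treat index.js as CommonJS and reject the import statements.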

touch index.js

Basic DOCX to HTML Conversion

Let's start by writing a simple script to convert DOCX files to HTML. We'll use the mammoth library for this.

Open index.js and add the following code:

#!/usr/bin/env node
import * as fs from 'fs';
import * as path from 'path';
import * as mammoth from 'mammoth';
import { program } from 'commander';

// Define the CLI: one required input file and one optional output file.
program
  .version('1.0.0')
  .description('Convert DOCX to HTML')
  .argument('<input>', 'Input DOCX file')
  .argument('[output]', 'Output HTML file (default: same as input with .html extension)')
  .action(async (input, output) => {
    try {
      await convertDocxToHtml(input, output);
    } catch (error) {
      console.error('Error:', error);
      process.exit(1);
    }
  });

program.parse(process.argv);

async function convertDocxToHtml(inputFile, outputFile) {
  // Default output: same directory and base name as the input, with .html.
  if (!outputFile) {
    outputFile = path.join(path.dirname(inputFile), `${path.basename(inputFile, '.docx')}.html`);
  }

  const result = await mammoth.convertToHtml({ path: inputFile });
  await fs.promises.writeFile(outputFile, result.value);
  console.log(`Conversion complete. Output saved to ${outputFile}`);
}

This script uses commander to parse command-line arguments, mammoth to convert DOCX to HTML, and fs to write the output to a file. To run the script directly as a command, make sure index.js starts with the following shebang line, as the listing above already does:

#!/usr/bin/env node

Make sure the script has execute permissions:

chmod +x index.js

Now you can run the script to convert a DOCX file to HTML:

node index.js example.docx example.html
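
The object returned by mammoth.convertToHtml carries more than the HTML: besides value, it has a messages array with any warnings or errors raised during conversion, which the script above ignores. Here is a minimal sketch of surfacing them, assuming the message shape described in mammoth's README ({ type, message }). summarizeMessages is a hypothetical helper, and the sample data is made up for illustration.

```javascript
// Sketch: turn mammoth's conversion messages into printable lines.
// Assumption: each entry looks like { type: 'warning' | 'error', message: '...' }.
// summarizeMessages is a hypothetical helper, not part of mammoth.
function summarizeMessages(messages) {
  return messages.map((m) => `[${m.type}] ${m.message}`);
}

// Hand-written sample data (not real mammoth output).
const sampleMessages = [
  { type: 'warning', message: 'Unrecognised paragraph style: Custom Heading' },
];

for (const line of summarizeMessages(sampleMessages)) {
  console.warn(line);
}
```

Inside convertDocxToHtml, this could be called as summarizeMessages(result.messages) right after the conversion.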

Converting HTML to Markdown

Next, we'll add the functionality to convert HTML to Markdown using turndown.

turndown was already installed during setup. If you skipped it there, install it now:

npm install turndown

Update index.js to include the HTML to Markdown conversion:

#!/usr/bin/env node
import * as fs from 'fs';
import * as path from 'path';
import * as mammoth from 'mammoth';
import TurndownService from 'turndown';
import { program } from 'commander';

program
  .version('1.0.0')
  .description('Convert DOCX to Markdown')
  .argument('<input>', 'Input DOCX file')
  .argument('[output]', 'Output Markdown file (default: same as input with .md extension)')
  .action(async (input, output) => {
    try {
      await convertDocxToMarkdown(input, output);
    } catch (error) {
      console.error('Error:', error);
      process.exit(1);
    }
  });

program.parse(process.argv);

async function convertDocxToMarkdown(inputFile, outputFile) {
  // Default output: same directory and base name as the input, with .md.
  if (!outputFile) {
    outputFile = path.join(path.dirname(inputFile), `${path.basename(inputFile, '.docx')}.md`);
  }

  // DOCX -> HTML with mammoth, then HTML -> Markdown with turndown.
  const result = await mammoth.convertToHtml({ path: inputFile });
  const turndownService = new TurndownService();
  const markdown = turndownService.turndown(result.value);

  await fs.promises.writeFile(outputFile, markdown);
  console.log(`Conversion complete. Output saved to ${outputFile}`);
}
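
With no options, Turndown writes underlined (setext) headings, indented code blocks, and "*" list bullets. If you prefer a different Markdown style, the constructor accepts an options object. The sketch below uses option names from Turndown's documentation; the chosen values are only one possible preference, not the original author's configuration.

```javascript
// Sketch of an options object for `new TurndownService(options)`.
// The values are an illustrative preference, not settings from the original post.
const turndownOptions = {
  headingStyle: 'atx',      // "# Heading" instead of the default underlined style
  codeBlockStyle: 'fenced', // fenced code blocks instead of the default indented style
  bulletListMarker: '-',    // "-" list bullets instead of the default "*"
};

// Usage inside convertDocxToMarkdown (needs the turndown import from the script above):
//   const turndownService = new TurndownService(turndownOptions);
console.log(turndownOptions);
```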

Now you can convert DOCX files to Markdown:

node index.js example.docx example.md
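
One detail in the default-name logic: path.basename(inputFile, '.docx') only strips the suffix on an exact, case-sensitive match, so an input named Report.DOCX would produce Report.DOCX.md. Here is a minimal sketch of a case-insensitive alternative, using a plain regular expression instead of the path module. defaultOutputPath is a hypothetical helper, not part of the original script.

```javascript
// Sketch: derive the default output name with a case-insensitive extension match.
// defaultOutputPath is a hypothetical helper introduced for illustration.
function defaultOutputPath(inputFile, newExtension) {
  // Strip a trailing ".docx" in any letter case, then append the new extension.
  return inputFile.replace(/\.docx$/i, '') + newExtension;
}

console.log(defaultOutputPath('docs/Report.DOCX', '.md')); // docs/Report.md
console.log(defaultOutputPath('notes.docx', '.md'));       // notes.md
console.log(defaultOutputPath('draft', '.md'));            // draft.md
```

Because the directory part of the path is left untouched, the output still lands next to the input file, as in the original script.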

Extracting Images

DOCX files often contain images that we need to handle. We'll extract these images and save them to a folder, updating the image links in the Markdown file.

Update index.js to include image extraction:

#!/usr/bin/env node
import * as fs from 'fs';
import * as path from 'path';
import * as mammoth from 'mammoth';
import TurndownService from 'turndown';
import { program } from 'commander';
import { v4 as uuidv4 } from 'uuid';

program
  .version('1.0.0')
  .description('Convert DOCX to Markdown with image extraction')
  .argument('<input>', 'Input DOCX file')
  .argument('[output]', 'Output Markdown file (default: same as input with .md extension)')
  .action(async (input, output) => {
    try {
      await convertDocxToMarkdown(input, output);
    } catch (error) {
      console.error('Error:', error);
      process.exit(1);
    }
  });

program.parse(process.argv);

async function convertDocxToMarkdown(inputFile, outputFile) {
  if (!outputFile) {
    outputFile = path.join(path.dirname(inputFile), `${path.basename(inputFile, '.docx')}.md`);
  }

  // Images are written to an "images" folder next to the output file.
  const imageDir = path.join(path.dirname(outputFile), 'images');
  if (!fs.existsSync(imageDir)) {
    fs.mkdirSync(imageDir, { recursive: true });
  }

  const result = await mammoth.convertToHtml({ path: inputFile }, {
    // Called once per embedded image: save it to disk and return its relative src.
    convertImage: mammoth.images.imgElement(async (image) => {
      const buffer = await image.read();
      const extension = image.contentType.split('/')[1];
      const imageName = `image-${uuidv4()}.${extension}`;
      const imagePath = path.join(imageDir, imageName);
      await fs.promises.writeFile(imagePath, buffer);
      return { src: `images/${imageName}` };
    })
  });

  const turndownService = new TurndownService();
  const markdown = turndownService.turndown(result.value);

  await fs.promises.writeFile(outputFile, markdown);
  console.log(`Conversion complete. Output saved to ${outputFile}`);
}

Now images are extracted and saved in an images folder next to the output file, and the Markdown file links to them with relative paths.
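
Taking the file extension from image.contentType.split('/')[1] works for types such as image/png, but yields jpeg for image/jpeg and svg+xml for image/svg+xml. A small lookup table, sketched below, gives more conventional extensions. The function name and the table contents are assumptions for illustration, not part of mammoth.

```javascript
// Sketch: map an image content type to a conventional file extension.
// extensionForContentType and the table below are hypothetical additions.
const KNOWN_EXTENSIONS = new Map([
  ['image/jpeg', 'jpg'],
  ['image/png', 'png'],
  ['image/gif', 'gif'],
  ['image/svg+xml', 'svg'],
  ['image/bmp', 'bmp'],
  ['image/tiff', 'tiff'],
  ['image/x-emf', 'emf'],
  ['image/x-wmf', 'wmf'],
]);

function extensionForContentType(contentType) {
  // Normalize case and drop any parameters such as "; charset=...".
  const key = String(contentType).toLowerCase().split(';')[0].trim();
  if (KNOWN_EXTENSIONS.has(key)) return KNOWN_EXTENSIONS.get(key);
  // Fallback: use the subtype, keeping only filename-safe characters.
  const subtype = key.split('/')[1] || 'bin';
  return subtype.replace(/[^a-z0-9]/g, '') || 'bin';
}

console.log(extensionForContentType('image/jpeg'));    // jpg
console.log(extensionForContentType('image/svg+xml')); // svg
```

In the convertImage callback, the extension line could then read const extension = extensionForContentType(image.contentType).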

Formatting Tables

The final feature we'll add is table formatting. DOCX files often contain tables that need to be correctly formatted in Markdown.

Update index.js to include table formatting. The new Turndown rule treats the first row of each table as the header row:

#!/usr/bin/env node
import * as fs from 'fs';
import * as path from 'path';
import * as mammoth from 'mammoth';
import TurndownService from 'turndown';
import { program } from 'commander';
import { v4 as uuidv4 } from 'uuid';

program
  .version('1.0.0')
  .description('Convert DOCX to Markdown with image extraction and table formatting')
  .argument('<input>', 'Input DOCX file')
  .argument('[output]', 'Output Markdown file (default: same as input with .md extension)')
  .action(async (input, output) => {
    try {
      await convertDocxToMarkdown(input, output);
    } catch (error) {
      console.error('Error:', error);
      process.exit(1);
    }
  });

program.parse(process.argv);

// Build a Markdown table from a <table> DOM node; the first row becomes the header.
function createMarkdownTable(table) {
  const rows = Array.from(table.rows);
  if (rows.length === 0) return '';

  const headers = Array.from(rows[0].cells).map(cell => cell.textContent?.trim() || '');
  const markdownRows = rows.slice(1).map(row =>
    Array.from(row.cells).map(cell => cell.textContent?.trim() || '')
  );

  let markdown = '| ' + headers.join(' | ') + ' |\n';
  markdown += '| ' + headers.map(() => '---').join(' | ') + ' |\n';
  markdownRows.forEach(row => {
    markdown += '| ' + row.join(' | ') + ' |\n';
  });
  return markdown;
}

async function convertDocxToMarkdown(inputFile, outputFile) {
  if (!outputFile) {
    outputFile = path.join(path.dirname(inputFile), `${path.basename(inputFile, '.docx')}.md`);
  }

  const imageDir = path.join(path.dirname(outputFile), 'images');
  if (!fs.existsSync(imageDir)) {
    fs.mkdirSync(imageDir, { recursive: true });
  }

  const result = await mammoth.convertToHtml({ path: inputFile }, {
    convertImage: mammoth.images.imgElement(async (image) => {
      const buffer = await image.read();
      const extension = image.contentType.split('/')[1];
      const imageName = `image-${uuidv4()}.${extension}`;
      const imagePath = path.join(imageDir, imageName);
      await fs.promises.writeFile(imagePath, buffer);
      return { src: `images/${imageName}` };
    })
  });

  let html = result.value;
  const turndownService = new TurndownService();

  // Custom rule: replace every <table> element with a Markdown table.
  turndownService.addRule('table', {
    filter: 'table',
    replacement: function (content, node) {
      return '\n\n' + createMarkdownTable(node) + '\n\n';
    }
  });

  const markdown = turndownService.turndown(html);
  await fs.promises.writeFile(outputFile, markdown);
  console.log(`Conversion complete. Output saved to ${outputFile}`);
}
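
Cell text that contains a pipe character or a line break would break the row layout produced by createMarkdownTable. The sketch below shows one way to build the same kind of table from plain arrays of strings while escaping pipes, flattening line breaks, and padding short rows. rowsToMarkdownTable and escapeCell are hypothetical helpers; wiring them into the Turndown rule (for example by mapping each row's cells to their textContent) is left as an assumption.

```javascript
// Sketch: build a Markdown table from plain arrays of cell text.
// escapeCell and rowsToMarkdownTable are hypothetical helpers for illustration.
function escapeCell(text) {
  return String(text)
    .replace(/\|/g, '\\|')      // a literal pipe would otherwise start a new column
    .replace(/\s*\n\s*/g, ' ')  // a table row must stay on a single line
    .trim();
}

function rowsToMarkdownTable(rows) {
  if (rows.length === 0) return '';
  const width = Math.max(...rows.map((row) => row.length));
  // Pad short rows so every line has the same number of columns.
  const normalized = rows.map((row) =>
    Array.from({ length: width }, (_, i) => escapeCell(row[i] ?? ''))
  );
  const toLine = (cells) => `| ${cells.join(' | ')} |`;
  const [header, ...body] = normalized;
  const separator = toLine(header.map(() => '---'));
  return [toLine(header), separator, ...body.map(toLine)].join('\n') + '\n';
}

console.log(rowsToMarkdownTable([
  ['Name', 'Notes'],
  ['a|b', 'first line\nsecond line'],
  ['only one cell'],
]));
```

Keeping the table builder independent of the DOM also makes it easy to test without a DOCX file at hand.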

Conclusion

In this blog post, we built a DOCX to Markdown converter step by step, adding features like image extraction and table formatting. This tool demonstrates the power and flexibility of Node.js for handling file manipulations and conversions.

The source code for this project is available on GitHub, where you can find the latest updates, contribute to the project, and explore further enhancements.

Thank you for following along with this guide. Happy coding!
