I use pandoc to convert masses of Word documents to markdown. Still working on a generic script, but for now here's the "gist" of what I type into the terminal:
    $ myfilename="example"
    $ pandoc \
        -t markdown_strict \
        --extract-media="./attachments/$myfilename" \
        "$myfilename.docx" \
        -o "$myfilename.md"
Pandoc markdown is nice, but with Word documents it often adds odd things in translation. Stick to markdown_strict to avoid that.
I try to organize media (images, etc.) embedded in documents under an attachments subdirectory with folders named for each file. This helps avoid "collision" between media file names and makes conversion out of markdown into other formats (HTML, PDF) less messy.
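A generic version of the command above might look like the sketch below. It is only a sketch under a few assumptions: the function names `media_dir_for` and `docx2md` are made up for illustration, pandoc is on the PATH, and the markdown file should land next to the source document.

```shell
# Hypothetical helpers; the names media_dir_for and docx2md are not from the gist.

# Print the per-document media directory for a .docx path,
# e.g. "notes/example.docx" -> "./attachments/example".
media_dir_for() {
    local stem
    stem=$(basename "$1" .docx)
    printf './attachments/%s\n' "$stem"
}

# Convert one .docx file to strict markdown, writing the .md next to the source
# and extracting embedded media into that document's own attachments folder.
docx2md() {
    local src=$1
    pandoc -t markdown_strict \
        --extract-media="$(media_dir_for "$src")" \
        "$src" \
        -o "${src%.docx}.md"
}

# Usage (requires pandoc on PATH):
#   docx2md example.docx
```

With this, `docx2md example.docx` would write `example.md` and place any embedded media under `./attachments/example`.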
mattman-ps commented Mar 2, 2023
Nice. Thanks for the tips ;-)
Mjboothaus commented Jan 31, 2024
Thanks
iambumblehead commented Mar 7, 2024
works perfectly
mrtngrsbch commented May 19, 2024
cool, nice gist !
STrRedWolf commented Aug 10, 2024
This helps get me 90% of the way there. I use a mix of 'markdown+bracketed_spans+backtick_code_blocks+fenced_code_attributes+fenced_divs' but I have to manually re-add the []{custom-style="foobar"} code as well as the horizontal lines... well, close...
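For anyone wanting to try that mix, the extension list goes into the `-t` argument as one `+`-joined string. The sketch below is an assumption-laden illustration, not the commenter's actual command: `writer_format` and `docx2md_styled` are hypothetical names, and the `-f docx+styles` reader extension (documented by pandoc as carrying custom Word styles through as spans and divs) is a suggestion that has not been checked against these particular documents.

```shell
# Hypothetical helper: join a base format and extension names with "+",
# e.g. writer_format markdown fenced_divs -> "markdown+fenced_divs".
writer_format() {
    local fmt=$1
    shift
    local ext
    for ext in "$@"; do
        fmt="$fmt+$ext"
    done
    printf '%s\n' "$fmt"
}

# Hypothetical wrapper: convert one .docx using the extension mix from the
# comment above. "-f docx+styles" is an untested assumption that may keep
# custom-style spans; drop it to get pandoc's default docx reader.
docx2md_styled() {
    local src=$1
    pandoc -f docx+styles \
        -t "$(writer_format markdown bracketed_spans backtick_code_blocks \
              fenced_code_attributes fenced_divs)" \
        --extract-media="./attachments/$(basename "$src" .docx)" \
        "$src" \
        -o "${src%.docx}.md"
}

# Usage (requires pandoc on PATH):
#   docx2md_styled example.docx
```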
brucegl commented Jun 27, 2025
beautiful!
hochun836 commented Aug 19, 2025
thanks a lot !
amalytix commented Aug 25, 2025 • edited
In case someone needs to process multiple files in a given directory called input-files, this helped me:
    #!/usr/bin/env bash
    set -euo pipefail
    IFS=$'\n\t'

    BASE_DIR="input-files"
    ATTACH_ROOT="attachments"

    # Ensure base attachment dir exists
    mkdir -p "$ATTACH_ROOT"

    # Find all .docx files under BASE_DIR (recursively), handling spaces safely
    find "$BASE_DIR" -type f -name '*.docx' -print0 | while IFS= read -r -d '' docx; do
      # Relative path (without BASE_DIR/ prefix if present)
      rel="$docx"
      case "$rel" in
        "$BASE_DIR"/*) rel="${rel#"$BASE_DIR"/}" ;;
      esac

      # Strip extension
      rel_noext="${rel%.*}"

      # Build a filesystem-safe prefix from the relative path:
      # - lowercase
      # - replace any non [a-z0-9] with '-'
      # - collapse multiple '-' and trim leading/trailing '-'
      prefix="$(printf '%s' "$rel_noext" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -E 's/[^a-z0-9]+/-/g; s/-+/-/g; s/^-+//; s/-+$//')"

      media_dir="$ATTACH_ROOT/$prefix"
      mkdir -p "$media_dir"

      # Output Markdown path: same folder next to the .docx, same basename with .md.
      # Spaces are replaced only in the filename, so that a source directory
      # whose name contains spaces is still a valid output location.
      md_dir="$(dirname "$docx")"
      md_base="$(basename "${docx%.*}")"
      md_out="$md_dir/${md_base// /-}.md"

      echo "Converting: $docx"
      echo " -> Markdown: $md_out"
      echo " -> Media: $media_dir"

      pandoc -t markdown_strict \
        --extract-media="$media_dir" \
        "$docx" \
        -o "$md_out"
    done

    echo "Done."
- Save as convert-docx.sh
- Make executable: chmod +x convert-docx.sh
- Run: ./convert-docx.sh
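To see what media folder name a given document would end up with, the prefix step from the script can be pulled out on its own. This is a small sketch using the same `tr` and `sed` pipeline; the function name `slugify` and the sample paths are made up for illustration.

```shell
# Hypothetical helper wrapping the script's prefix pipeline:
# lowercase, turn every run of non [a-z0-9] characters into '-',
# collapse repeated '-', and trim '-' from both ends.
slugify() {
    printf '%s' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -E 's/[^a-z0-9]+/-/g; s/-+/-/g; s/^-+//; s/-+$//'
}

slugify "Reports/Q1 Summary (Final)"   # -> reports-q1-summary-final
echo
slugify "a b"                          # -> a-b
echo
slugify "a_b"                          # -> a-b
echo
```

Note that the mapping is lossy: `a b.docx` and `a_b.docx` in the same folder would share one media directory, so distinct names are not guaranteed for every pair of paths.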
plembo commented Aug 27, 2025
Thanks for this!