@plembo
Last active December 17, 2025 09:07

    Convert docx to markdown with pandoc

    I use pandoc to convert masses of Word documents to markdown. I'm still working on a generic script, but for now here's the "gist" of what I type into the terminal:

    $ myfilename="example"
    $ pandoc \
        -t markdown_strict \
        --extract-media="./attachments/$myfilename" \
        "$myfilename.docx" \
        -o "$myfilename.md"

    Pandoc markdown is nice, but with Word documents it often adds odd things in translation. Stick to markdown_strict to avoid that.

    I try to organize media (images, etc.) embedded in documents under an attachments subdirectory, with one folder named for each file. This helps avoid "collisions" between media file names and makes conversion out of markdown into other formats (HTML, PDF) less messy.
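The per-file media layout described above could be wrapped in a small helper. This is only a sketch: `media_dir_for`, `convert_docx`, and the `PANDOC_CMD` variable are hypothetical names introduced here for illustration, not part of the gist or of pandoc. `PANDOC_CMD` exists so the command can be swapped out (for example with `echo`) to preview the arguments without converting anything.

```shell
#!/usr/bin/env bash
# Sketch only: media_dir_for, convert_docx and PANDOC_CMD are hypothetical
# names for illustration; they are not part of the gist or of pandoc.
PANDOC_CMD="${PANDOC_CMD:-pandoc}"

# Map "notes/example.docx" to "./attachments/example" so that each
# document gets its own media folder.
media_dir_for() {
  local base
  base="$(basename "$1" .docx)"
  printf './attachments/%s\n' "$base"
}

# Convert one .docx to strict markdown, writing the .md next to the source.
convert_docx() {
  local src="$1"
  local media
  media="$(media_dir_for "$src")"
  "$PANDOC_CMD" -t markdown_strict \
    --extract-media="$media" \
    "$src" \
    -o "${src%.docx}.md"
}
```

With this loaded into the shell, `convert_docx example.docx` would run the same command as above, and `PANDOC_CMD=echo` turns the call into a dry run.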

    @mattman-ps

    Nice. Thanks for the tips ;-)

    @Mjboothaus

    Thanks

    @iambumblehead

    works perfectly

    @mrtngrsbch

    cool, nice gist !

    @STrRedWolf

    This gets me 90% of the way there. I use a mix of 'markdown+bracketed_spans+backtick_code_blocks+fenced_code_attributes+fenced_divs', but I have to manually re-add the []{custom-style="foobar"} code as well as the horizontal lines... well, close...
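A long `-t` value like the one in this comment can be assembled from a list of extension names, which makes it easier to add or drop one. This is a sketch under assumptions: `build_format` is a hypothetical helper, not a pandoc feature. The extension names are the ones quoted in the comment.

```shell
#!/usr/bin/env bash
# Sketch only: build_format is a hypothetical helper, not part of pandoc.
# It joins a base format and extension names into "base+ext1+ext2...".
build_format() {
  local fmt="$1"
  local ext
  shift
  for ext in "$@"; do
    fmt="$fmt+$ext"
  done
  printf '%s\n' "$fmt"
}

# Possible use (left commented out so that loading this file runs nothing):
# pandoc -t "$(build_format markdown bracketed_spans backtick_code_blocks \
#   fenced_code_attributes fenced_divs)" example.docx -o example.md
```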

    @brucegl

    beautiful!

    @hochun836

    thanks a lot !

    @amalytix

    amalytix commented Aug 25, 2025 (edited)

    In case someone needs to process multiple files in a given directory called input-files, this helped me:

    #!/usr/bin/env bash
    set -euo pipefail
    IFS=$'\n\t'

    BASE_DIR="input-files"
    ATTACH_ROOT="attachments"

    # Ensure base attachment dir exists
    mkdir -p "$ATTACH_ROOT"

    # Find all .docx files under BASE_DIR (recursively), handling spaces safely
    find "$BASE_DIR" -type f -name '*.docx' -print0 | while IFS= read -r -d '' docx; do
      # Relative path (without BASE_DIR/ prefix if present)
      rel="$docx"
      case "$rel" in
        "$BASE_DIR"/*) rel="${rel#"$BASE_DIR"/}" ;;
      esac

      # Strip extension
      rel_noext="${rel%.*}"

      # Build a filesystem-safe prefix from the relative path:
      # - lowercase
      # - replace any non [a-z0-9] with '-'
      # - collapse multiple '-' and trim leading/trailing '-'
      prefix="$(printf '%s' "$rel_noext" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -E 's/[^a-z0-9]+/-/g; s/-+/-/g; s/^-+//; s/-+$//')"

      media_dir="$ATTACH_ROOT/$prefix"
      mkdir -p "$media_dir"

      # Output Markdown path: same folder next to the .docx, same basename with .md.
      # Spaces are replaced only in the filename, so the existing directory path stays valid.
      md_dir="$(dirname "$docx")"
      md_base="$(basename "${docx%.*}")"
      md_out="$md_dir/${md_base// /-}.md"

      echo "Converting: $docx"
      echo "  -> Markdown: $md_out"
      echo "  -> Media: $media_dir"

      pandoc -t markdown_strict \
        --extract-media="$media_dir" \
        "$docx" \
        -o "$md_out"
    done

    echo "Done."
    1. Save as convert-docx.sh
    2. Make executable: chmod +x convert-docx.sh
    3. Run: ./convert-docx.sh
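To see what media folder name the script derives for a given document, the prefix step can be tried on its own. This sketch reuses the same `tr` and `sed` pipeline as the script above. The function name `slugify` and the sample path are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: slugify is a hypothetical name wrapping the script's
# prefix pipeline so that it can be tried in isolation.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/-+/-/g; s/^-+//; s/-+$//'
}

# Example with a made-up relative path (extension already stripped):
slugify 'Reports/Q1 Summary (final)'   # prints: reports-q1-summary-final
```

Distinct paths can map to the same prefix (for example `a b` and `a-b` both become `a-b`), so media from such files would share one folder.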

    @plembo
    Author

    Thanks for this!
