Pandoc  a universal document converter

Creating Custom Pandoc Readers in Lua

Introduction

If you need to parse a format not already handled by pandoc, you can create a custom reader using the Lua language. Pandoc has a built-in Lua interpreter, so you needn’t install any additional software to do this.

A custom reader is a Lua file that defines a function called Reader, which takes two arguments:

  • the raw input to be parsed, as a list of sources
  • optionally, a table of reader options, e.g. { columns = 62, standalone = true }.

The Reader function should return a Pandoc AST. This can be created using functions in the pandoc module, which is automatically in scope. (Indeed, all of the utility functions that are available for Lua filters are available in custom readers, too.)

Each source item corresponds to a file or stream passed to pandoc and contains its text and name. E.g., if a single file input.txt is passed to pandoc, then the list of sources will contain just a single element s, where s.name == 'input.txt' and s.text contains the file contents as a string.

The sources list, as well as each of its elements, can be converted to a string via the Lua standard library function tostring.

A minimal example would be

    function Reader(input)
      return pandoc.Pandoc({ pandoc.CodeBlock(tostring(input)) })
    end

This just returns a document containing a big code block with all of the input. Or, to create a separate code block for each input file, one might write

    function Reader(input)
      return pandoc.Pandoc(input:map(
        function(s) return pandoc.CodeBlock(s.text) end))
    end

In a nontrivial reader, you’ll want to parse the input. You can do this using standard Lua library functions (for example, the patterns library), or with the powerful and fast lpeg parsing library, which is automatically in scope. You can also use external Lua libraries (for example, an XML parser).
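As a rough illustration of the standard-library approach, the following sketch splits the input into paragraphs at blank lines using only Lua patterns. The helpers split_paragraphs and to_inlines are hypothetical names introduced here, not part of pandoc, and the sketch assumes it runs inside pandoc, where the pandoc module is in scope.

```lua
-- Hypothetical helper: returns a list of paragraph strings, where
-- paragraphs are separated by one or more blank lines and the lines
-- of one paragraph are joined with single spaces.
local function split_paragraphs(text)
  local paragraphs, current = {}, {}
  -- Append a newline so the final line is seen by the pattern below.
  for line in (text .. "\n"):gmatch("(.-)\r?\n") do
    if line:match("^%s*$") then
      -- Blank line: close the paragraph collected so far, if any.
      if #current > 0 then
        paragraphs[#paragraphs + 1] = table.concat(current, " ")
        current = {}
      end
    else
      current[#current + 1] = line
    end
  end
  if #current > 0 then
    paragraphs[#paragraphs + 1] = table.concat(current, " ")
  end
  return paragraphs
end

-- Hypothetical helper: turns a paragraph string into Str elements
-- separated by Space elements.
local function to_inlines(paragraph)
  local inlines = {}
  for word in paragraph:gmatch("%S+") do
    if #inlines > 0 then
      inlines[#inlines + 1] = pandoc.Space()
    end
    inlines[#inlines + 1] = pandoc.Str(word)
  end
  return inlines
end

function Reader(input)
  local blocks = {}
  for _, paragraph in ipairs(split_paragraphs(tostring(input))) do
    blocks[#blocks + 1] = pandoc.Para(to_inlines(paragraph))
  end
  return pandoc.Pandoc(blocks)
end
```

Unlike the lpeg-based plain text reader shown later, this sketch joins the lines of a paragraph with spaces instead of producing SoftBreak elements.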

A previous pandoc version passed a raw string instead of a list of sources to the Reader function. Reader functions that rely on this are obsolete, but still supported: Pandoc analyzes any script error, detecting when code assumed the old behavior. The code is rerun with raw string input in this case, thereby ensuring backwards compatibility.

Bytestring readers

In order to read binary formats, including docx, odt, and epub, pandoc supports the ByteStringReader function. A ByteStringReader function is similar to the Reader function that processes text input. Instead of a list of sources, the ByteStringReader function is passed a bytestring, i.e., a string that contains the binary input.

    -- read input as epub
    function ByteStringReader(input)
      return pandoc.read(input, 'epub')
    end
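Because the input arrives as a plain Lua string of bytes, it can be inspected with the usual string functions before it is handed to a built-in reader. The following is a minimal sketch under that assumption; is_zip is a hypothetical helper, and the check relies on the fact that docx, odt, and epub files are ZIP archives, which normally start with the four bytes "PK", 0x03, 0x04.

```lua
-- Hypothetical helper: true if the bytestring starts with the
-- local file header signature of a ZIP archive ("PK" 0x03 0x04).
local function is_zip(bytes)
  return bytes:sub(1, 4) == "PK\3\4"
end

function ByteStringReader(input)
  if not is_zip(input) then
    error("input does not look like a ZIP-based document")
  end
  -- Delegate the actual parsing to pandoc's epub reader.
  return pandoc.read(input, "epub")
end
```

Failing early gives the user a clearer message than whatever error the built-in reader would report for non-ZIP input.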

Format extensions

Custom readers can be built such that their behavior is controllable through format extensions, such as smart, citations, or hard-line-breaks. Supported extensions are those that are present as a key in the global Extensions table. Fields of extensions that are enabled by default have the value true or 'enable', while those that are supported but disabled have the value false or 'disable'.

Example: A reader with the following global table supports the extensions smart, citations, and foobar, with smart enabled and the other two disabled by default:

    Extensions = {
      smart = 'enable',
      citations = 'disable',
      foobar = false
    }

Users control extensions as usual, e.g., pandoc -f my-reader.lua+citations. The extensions are accessible through the reader options’ extensions field, e.g.:

    function Reader(input, opts)
      print(
        'The citations extension is',
        opts.extensions:includes 'citations' and 'enabled' or 'disabled'
      )
      -- ...
    end

Extensions that are neither enabled nor disabled in the Extensions field are treated as unsupported by the reader. Trying to modify such an extension via the command line will lead to an error.
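To show how an extension might actually change a reader's behavior, here is a small sketch built around the foobar extension used as an example above. The reader itself and the transform helper are hypothetical: each non-empty input line becomes a paragraph, and the line is upper-cased only when foobar is enabled.

```lua
-- foobar is supported but disabled by default.
Extensions = {
  foobar = false,
}

-- Hypothetical helper: applies the optional transformation to a line.
local function transform(line, foobar_enabled)
  if foobar_enabled then
    return line:upper()
  end
  return line
end

function Reader(input, opts)
  local foobar_enabled = opts.extensions:includes 'foobar'
  local blocks = {}
  for line in tostring(input):gmatch("[^\r\n]+") do
    -- For brevity the whole line goes into a single Str; a real reader
    -- would split it into Str and Space elements.
    blocks[#blocks + 1] =
      pandoc.Para{ pandoc.Str(transform(line, foobar_enabled)) }
  end
  return pandoc.Pandoc(blocks)
end
```

Assuming the file is saved as my-reader.lua, running pandoc -f my-reader.lua+foobar would enable the upper-casing, while pandoc -f my-reader.lua would leave the text unchanged.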

Example: plain text reader

This is a simple example using lpeg to parse the input into space-separated strings and blankline-separated paragraphs.

    -- A sample custom reader that just parses text into blankline-separated
    -- paragraphs with space-separated words.

    -- For better performance we put these functions in local variables:
    local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
      lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
      lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

    -- Both character sets start with a literal space.
    local whitespacechar = S(" \t\r\n")
    local wordchar = (1 - whitespacechar)
    local spacechar = S(" \t")
    local newline = P"\r"^-1 * P"\n"
    local blanklines = newline * (spacechar^0 * newline)^1
    local endline = newline - blanklines

    -- Grammar
    G = P{ "Pandoc",
      Pandoc = Ct(V"Block"^0) / pandoc.Pandoc;
      Block = blanklines^0 * V"Para" ;
      Para = Ct(V"Inline"^1) / pandoc.Para;
      Inline = V"Str" + V"Space" + V"SoftBreak" ;
      Str = wordchar^1 / pandoc.Str;
      Space = spacechar^1 / pandoc.Space;
      SoftBreak = endline / pandoc.SoftBreak;
    }

    function Reader(input)
      return lpeg.match(G, tostring(input))
    end

Example of use:

    % pandoc -f plain.lua -t native
    *Hello there*, this is plain text with no formatting
    except paragraph breaks.

    - Like this one.
    ^D
    [ Para
        [ Str "*Hello"
        , Space
        , Str "there*,"
        , Space
        , Str "this"
        , Space
        , Str "is"
        , Space
        , Str "plain"
        , Space
        , Str "text"
        , Space
        , Str "with"
        , Space
        , Str "no"
        , Space
        , Str "formatting"
        , SoftBreak
        , Str "except"
        , Space
        , Str "paragraph"
        , Space
        , Str "breaks."
        ]
    , Para
        [ Str "-"
        , Space
        , Str "Like"
        , Space
        , Str "this"
        , Space
        , Str "one."
        ]
    ]

Example: a wiki Creole reader

This is a parser for Creole common wiki markup. It uses an lpeg grammar. Fun fact: this custom reader is faster than pandoc’s built-in creole reader! This shows that high-performance readers can be designed in this way.

    -- A sample custom reader for Creole 1.0 (common wiki markup)
    -- http://www.wikicreole.org/wiki/CheatSheet

    -- For better performance we put these functions in local variables:
    local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
      lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
      lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

    -- whitespacechar and spacechar both start with a literal space.
    local whitespacechar = S(" \t\r\n")
    local specialchar = S("/*~[]\\{}|")
    local wordchar = (1 - (whitespacechar + specialchar))
    local spacechar = S(" \t")
    local newline = P"\r"^-1 * P"\n"
    local blankline = spacechar^0 * newline
    local endline = newline * #-blankline
    local endequals = spacechar^0 * P"="^0 * spacechar^0 * newline
    local cellsep = spacechar^0 * P"|"

    local function trim(s)
      return (s:gsub("^%s*(.-)%s*$", "%1"))
    end

    local function ListItem(lev, ch)
      local start
      if ch == nil then
        start = S"*#"
      else
        start = P(ch)
      end
      local subitem = function(c)
        if lev < 6 then
          return ListItem(lev + 1, c)
        else
          -- Always fails. (The plain expression 1 - 1 would be the
          -- number 0, which lpeg treats as a pattern that succeeds.)
          return P(false)
        end
      end
      local parser = spacechar^0
                   * start^lev
                   * #(-start)
                   * spacechar^0
                   * Ct((V"Inline" - (newline * spacechar^0 * S"*#"))^0)
                   * newline
                   * (Ct(subitem("*")^1) / pandoc.BulletList
                      +
                      Ct(subitem("#")^1) / pandoc.OrderedList
                      +
                      Cc(nil))
                   / function (ils, sublist)
                       return { pandoc.Plain(ils), sublist }
                     end
      return parser
    end

    -- Grammar
    G = P{ "Doc",
      Doc = Ct(V"Block"^0)
          / pandoc.Pandoc ;
      Block = blankline^0
            * ( V"Header"
              + V"HorizontalRule"
              + V"CodeBlock"
              + V"List"
              + V"Table"
              + V"Para") ;
      Para = Ct(V"Inline"^1)
           * newline
           / pandoc.Para ;
      HorizontalRule = spacechar^0
                     * P"----"
                     * spacechar^0
                     * newline
                     / pandoc.HorizontalRule;
      Header = (P("=")^1 / string.len)
             * spacechar^1
             * Ct((V"Inline" - endequals)^1)
             * endequals
             / pandoc.Header;
      CodeBlock = P"{{{"
                * blankline
                * C((1 - (newline * P"}}}"))^0)
                * newline
                * P"}}}"
                / pandoc.CodeBlock;
      Placeholder = P"<<<"
                  * C(P(1) - P">>>")^0
                  * P">>>"
                  / function() return pandoc.Div({}) end;
      List = V"BulletList"
           + V"OrderedList" ;
      BulletList = Ct(ListItem(1,'*')^1)
                 / pandoc.BulletList ;
      OrderedList = Ct(ListItem(1,'#')^1)
                  / pandoc.OrderedList ;
      Table = (V"TableHeader" + Cc{})
            * Ct(V"TableRow"^1)
            / function(headrow, bodyrows)
                local numcolumns = #(bodyrows[1])
                local aligns = {}
                local widths = {}
                for i = 1,numcolumns do
                  aligns[i] = pandoc.AlignDefault
                  widths[i] = 0
                end
                return pandoc.utils.from_simple_table(
                  pandoc.SimpleTable({}, aligns, widths, headrow, bodyrows))
              end ;
      TableHeader = Ct(V"HeaderCell"^1)
                  * cellsep^-1
                  * spacechar^0
                  * newline ;
      TableRow = Ct(V"BodyCell"^1)
               * cellsep^-1
               * spacechar^0
               * newline ;
      HeaderCell = cellsep
                 * P"="
                 * spacechar^0
                 * Ct((V"Inline" - (newline + cellsep))^0)
                 / function(ils) return { pandoc.Plain(ils) } end ;
      BodyCell = cellsep
               * spacechar^0
               * Ct((V"Inline" - (newline + cellsep))^0)
               / function(ils) return { pandoc.Plain(ils) } end ;
      Inline = V"Emph"
             + V"Strong"
             + V"LineBreak"
             + V"Link"
             + V"URL"
             + V"Image"
             + V"Str"
             + V"Space"
             + V"SoftBreak"
             + V"Escaped"
             + V"Placeholder"
             + V"Code"
             + V"Special" ;
      Str = wordchar^1
          / pandoc.Str;
      Escaped = P"~"
              * C(P(1))
              / pandoc.Str ;
      Special = specialchar
              / pandoc.Str;
      Space = spacechar^1
            / pandoc.Space ;
      SoftBreak = endline
                * # -(V"HorizontalRule" + V"CodeBlock")
                / pandoc.SoftBreak ;
      LineBreak = P"\\\\"
                / pandoc.LineBreak ;
      Code = P"{{{"
           * C((1 - P"}}}")^0)
           * P"}}}"
           / trim / pandoc.Code ;
      Link = P"[["
           * C((1 - (P"]]" + P"|"))^0)
           * (P"|" * Ct((V"Inline" - P"]]")^1))^-1
           * P"]]"
           / function(url, desc)
               local txt = desc or {pandoc.Str(url)}
               return pandoc.Link(txt, url)
             end ;
      Image = P"{{"
            * #-P"{"
            * C((1 - (S"}"))^0)
            * (P"|" * Ct((V"Inline" - P"}}")^1))^-1
            * P"}}"
            / function(url, desc)
                local txt = desc or ""
                return pandoc.Image(txt, url)
              end ;
      URL = P"http"
          * P"s"^-1
          * P":"
          * (1 - (whitespacechar + (S",.?!:;\"'" * #whitespacechar)))^1
          / function(url)
              return pandoc.Link(pandoc.Str(url), url)
            end ;
      Emph = P"//"
           * Ct((V"Inline" - P"//")^1)
           * P"//"
           / pandoc.Emph ;
      Strong = P"**"
             * Ct((V"Inline" - P"**")^1)
             * P"**"
             / pandoc.Strong ;
    }

    function Reader(input, reader_options)
      return lpeg.match(G, tostring(input))
    end

Example of use:

    % pandoc -f creole.lua -t markdown
    == Wiki Creole

    You can make things **bold** or //italic// or **//both//** or //**both**//.

    Character formatting extends across line breaks: **bold,
    this is still bold. This line deliberately does not end in star-star.

    Not bold. Character formatting does not cross paragraph boundaries.

    You can use [[internal links]] or [[http://www.wikicreole.org|external links]],
    give the link a [[internal links|different]] name.
    ^D
    ## Wiki Creole

    You can make things **bold** or *italic* or ***both*** or ***both***.

    Character formatting extends across line breaks: \*\*bold, this is still
    bold. This line deliberately does not end in star-star.

    Not bold. Character formatting does not cross paragraph boundaries.

    You can use [internal links](internal links) or [external
    links](http://www.wikicreole.org), give the link a
    [different](internal links) name.

Example: parsing JSON from an API

This custom reader consumes the JSON output of https://www.reddit.com/r/haskell.json and produces a document containing the current top articles on the Haskell subreddit.

It assumes that the pandoc.json library is available, which ships with pandoc versions after (not including) 3.1. It is still possible to use this reader with older pandoc versions by using a different JSON library. E.g., luajson can be installed using luarocks install luajson; just be sure you are installing it for Lua 5.4, which is the version packaged with pandoc.

    -- consumes the output of https://www.reddit.com/r/haskell.json

    local json = require 'pandoc.json'

    local function read_inlines(raw)
      local doc = pandoc.read(raw, "commonmark")
      return pandoc.utils.blocks_to_inlines(doc.blocks)
    end

    local function read_blocks(raw)
      local doc = pandoc.read(raw, "commonmark")
      return doc.blocks
    end

    function Reader(input)
      local parsed = json.decode(tostring(input))
      local blocks = {}
      for _, entry in ipairs(parsed.data.children) do
        local d = entry.data
        table.insert(blocks, pandoc.Header(2,
                      pandoc.Link(read_inlines(d.title), d.url)))
        for _, block in ipairs(read_blocks(d.selftext)) do
          table.insert(blocks, block)
        end
      end
      return pandoc.Pandoc(blocks)
    end

Similar code can be used to consume JSON output from other APIs.

Note that the content of the text fields is markdown, so we convert it using pandoc.read().
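When adapting this reader to another API, the shape of the decoded JSON may be less predictable, and indexing a missing field raises an error. One possible way to guard against that is sketched below; dig is a hypothetical helper introduced for illustration, not a pandoc function.

```lua
-- Hypothetical helper: follows a sequence of keys through nested
-- tables and returns nil as soon as one step is missing.
local function dig(tbl, ...)
  local current = tbl
  for _, key in ipairs({...}) do
    if type(current) ~= "table" then
      return nil
    end
    current = current[key]
  end
  return current
end

-- Possible use inside a Reader function:
--   local children = dig(parsed, "data", "children") or {}
--   for _, entry in ipairs(children) do ... end
```

With this helper, a response that lacks the expected fields produces an empty document instead of a Lua error.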

Example: syntax-highlighted code files

This is a reader that puts the content of each input file into a code block, sets the file’s extension as the block’s class to enable code highlighting, and places the filename as a header above each code block.

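    function to_code_block(source)
      local _, ext = pandoc.path.split_extension(source.name)
      -- The extension returned by split_extension includes the leading
      -- dot (e.g. ".lua"); strip it so the class is a plain language name.
      local lang = ext:gsub('^%.', '')
      return pandoc.Div{
        pandoc.Header(1, source.name == '' and '<stdin>' or source.name),
        pandoc.CodeBlock(source.text, {class=lang}),
      }
    end

    function Reader(input, opts)
      return pandoc.Pandoc(input:map(to_code_block))
    end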

Example: extracting the content from web pages

This reader uses the command-line program readable (install via npm install -g readability-cli) to clean out parts of HTML input that have to do with navigation, leaving only the content.

    -- Custom reader that extracts the content from HTML documents,
    -- ignoring navigation and layout elements. This preprocesses input
    -- through the 'readable' program (which can be installed using
    -- 'npm install -g readability-cli') and then calls the HTML reader.
    -- In addition, Divs that seem to have only a layout function are removed
    -- to avoid clutter.

    function make_readable(source)
      local result
      if not pcall(function ()
          local name = source.name
          if not name:match("http") then
            name = "file:///" .. name
          end
          result = pandoc.pipe("readable",
                     {"--keep-classes", "--base", name},
                     source.text)
          end) then
        io.stderr:write("Error running 'readable': do you have it installed?\n")
        io.stderr:write("npm install -g readability-cli\n")
        os.exit(1)
      end
      return result
    end

    local boring_classes =
      { row = true,
        page = true,
        container = true
      }

    local boring_attributes = { "role" }

    local function is_boring_class(cl)
      return boring_classes[cl] or cl:match("col%-") or cl:match("pull%-")
    end

    local function handle_div(el)
      for i, class in ipairs(el.classes) do
        if is_boring_class(class) then
          el.classes[i] = nil
        end
      end
      for i, k in ipairs(boring_attributes) do
        el.attributes[k] = nil
      end
      if el.identifier:match("readability%-") then
        el.identifier = ""
      end
      if #el.classes == 0 and #el.attributes == 0 and #el.identifier == 0 then
        return el.content
      else
        return el
      end
    end

    function Reader(sources)
      local readable = ''
      for _, source in ipairs(sources) do
        readable = readable .. make_readable(source)
      end
      local doc = pandoc.read(readable, "html", PANDOC_READER_OPTIONS)
      -- Now remove Divs used only for layout
      return doc:walk{ Div = handle_div }
    end

Example of use:

pandoc -f readable.lua -t markdown https://pandoc.org

and compare the output to

pandoc -f html -t markdown https://pandoc.org