Load and Export Layout Data

Dataframe and CSV

layoutparser.io.load_dataframe(df: pandas.core.frame.DataFrame, block_type: str = None) → layoutparser.elements.layout.Layout

Load the Layout object from the given dataframe.

Parameters
  • df (pd.DataFrame) – The dataframe to load the layout from.

  • block_type (str) – If there’s no block_type column in the dataframe, you must pass in a block_type variable so that Layout Parser can appropriately detect the type of the layout elements.

Returns

The parsed Layout object from the dataframe.

Return type

Layout

layoutparser.io.load_csv(filename: str, block_type: str = None) → layoutparser.elements.layout.Layout

Load the Layout object from the given CSV file.

Parameters
  • filename (str) – The name of the CSV file. A row of the table represents an individual layout element.

  • block_type (str) – If there’s no block_type column in the CSV file, you must pass in a block_type variable so that Layout Parser can appropriately detect the type of the layout elements.

Returns

The parsed Layout object from the CSV file.

Return type

Layout
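The row-per-element convention and the block_type fallback described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the column names (x_1, y_1, x_2, y_2, text), the "rectangle" type name, the helper rows_with_block_type, and the choice to raise ValueError when no block type is available are all assumptions made for the example.

```python
import csv
import io

# Assumed column layout for rectangular blocks; this page does not confirm it.
CSV_TEXT = """x_1,y_1,x_2,y_2,text
10,20,110,40,Title
10,50,110,200,Body
"""


def rows_with_block_type(csv_text, block_type=None):
    """Return one dict per CSV row, making sure each row has a block_type.

    Hypothetical helper: in this sketch an explicit block_type argument is
    applied to every row, and a missing block type raises ValueError.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if block_type is not None:
            row["block_type"] = block_type
        elif "block_type" not in row:
            raise ValueError("no block_type column; pass block_type explicitly")
    return rows


# Each row stands for one layout element.
rows = rows_with_block_type(CSV_TEXT, block_type="rectangle")
```

With the library itself, the equivalent step would be a call such as layoutparser.io.load_csv("layout.csv", block_type="rectangle"), where the file name and the type name are again placeholders.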

Dict and JSON

layoutparser.io.load_dict(data: Union[Dict, List[Dict]]) → Union[layoutparser.elements.base.BaseLayoutElement, layoutparser.elements.layout.Layout]

Load a dict or a list of dict representations of some layout data, automatically parse its type, and save it as either a BaseLayoutElement or a Layout datatype.

Parameters

data (Union[Dict, List]) – A dict or a list of dict representations of the layout data.

Raises
  • ValueError – If the data format is incompatible with the layout-data-JSON format.

  • ValueError – If any block_type name is not in the list of available layout element names defined in BASECOORD_ELEMENT_NAMEMAP.

Returns

Based on the dict format, it automatically parses the type of the data and loads it accordingly.

Return type

Union[BaseLayoutElement,Layout]

layoutparser.io.load_json(filename: str) → Union[layoutparser.elements.base.BaseLayoutElement, layoutparser.elements.layout.Layout]

Load a JSON file and save it as a layout object with appropriate data types.

Parameters

filename (str) – The name of the JSON file.

Returns

Based on the JSON file format, it automatically parses the type of the data and loads it accordingly.

Return type

Union[BaseLayoutElement,Layout]
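The type dispatch described for load_dict and load_json (a single dict becomes one element, a list of dicts becomes a layout, anything else is rejected) can be sketched as follows. This is a rough illustration under stated assumptions, not the library's code: classify and KNOWN_BLOCK_TYPES are hypothetical names, the set of type names is a stand-in for the keys of BASECOORD_ELEMENT_NAMEMAP, and the coordinate field names are assumed.

```python
import json

# Hypothetical stand-in for the names defined in BASECOORD_ELEMENT_NAMEMAP.
KNOWN_BLOCK_TYPES = {"interval", "rectangle", "quadrilateral"}


def _check_element(item):
    """Validate a single element dict and reject unknown block types."""
    if not isinstance(item, dict):
        raise ValueError("expected a dict describing one layout element")
    if item.get("block_type") not in KNOWN_BLOCK_TYPES:
        raise ValueError(f"unknown block_type: {item.get('block_type')!r}")


def classify(data):
    """Return "element" for a single dict and "layout" for a list of dicts."""
    if isinstance(data, dict):
        _check_element(data)
        return "element"
    if isinstance(data, list):
        for item in data:
            _check_element(item)
        return "layout"
    raise ValueError("expected a dict or a list of dicts")


# Assumed field names for a rectangular block.
block = {"x_1": 10, "y_1": 20, "x_2": 110, "y_2": 40, "block_type": "rectangle"}

# Round-trip through JSON text, as a JSON file on disk would be read.
kind_single = classify(json.loads(json.dumps(block)))
kind_many = classify(json.loads(json.dumps([block, block])))
```

The two ValueError branches mirror the two documented failure cases: an incompatible data format and an unrecognized block_type name.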

PDF

layoutparser.io.load_pdf(filename: str, load_images: bool = False, x_tolerance: int = 1.5, y_tolerance: int = 2, keep_blank_chars: bool = False, use_text_flow: bool = True, horizontal_ltr: bool = True, vertical_ttb: bool = True, extra_attrs: Optional[List[str]] = None, dpi: int = 72) → Union[List[layoutparser.elements.layout.Layout], Tuple[List[layoutparser.elements.layout.Layout], List[Image.Image]]]

Load all tokens for each page from a PDF file, and save them in a list of Layout objects with the original page order.

Parameters
  • filename (str) – The path to the PDF file.

  • load_images (bool, optional) – Whether to load a screenshot of each page of the PDF file. When set to True, the function returns both the layout and the screenshot image for each page. Defaults to False.

  • x_tolerance (int, optional) – The threshold used for extracting “word tokens” from the PDF file. It will merge the PDF characters into a word token if the difference between the x_2 of one character and the x_1 of the next is less than or equal to x_tolerance. See details in pdfplumber’s documentation. Defaults to 1.5.

  • y_tolerance (int, optional) –

    The threshold used for extracting “word tokens” from the PDF file. It will merge the PDF characters into a word token if the difference between the y_2 of one character and the y_1 of the next is less than or equal to y_tolerance. See details in pdfplumber’s documentation. Defaults to 2.

  • keep_blank_chars (bool, optional) –

    When keep_blank_chars is set to True, blank characters are treated as part of a word, not as a space between words. See details in pdfplumber’s documentation. Defaults to False.

  • use_text_flow (bool, optional) –

    When use_text_flow is set to True, it will use the PDF’s underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) See details in pdfplumber’s documentation. Defaults to True.

  • horizontal_ltr (bool, optional) – When horizontal_ltr is set to True, the document is read from left to right; when False, from right to left. Defaults to True.

  • vertical_ttb (bool, optional) – When vertical_ttb is set to True, the document is read from top to bottom; when False, from bottom to top. Defaults to True.

  • extra_attrs (Optional[List[str]], optional) –

    Passing a list of extra_attrs (e.g., ["fontname", "size"]) will restrict each word to characters that share exactly the same value for each of those attributes extracted by pdfplumber, and the resulting word dicts will indicate those attributes. See details in pdfplumber’s documentation. Defaults to ["fontname", "size"].

  • dpi (int, optional) – When loading images of the PDF, you can also specify the resolution (DPI, dots per inch) for rendering the images. Higher DPI values mean clearer images (and larger file sizes). Setting dpi also automatically resizes the extracted pdf_layout to match the sizes of the images, so that the pdf_layout renders correctly when visualized. Defaults to DEFAULT_PDF_DPI=72, which is also the default rendering DPI of the pdfplumber PDF parser.

Returns

List[Layout]:

When load_images=False, it only loads the pdf_tokens from the PDF file. Each element of the list holds all the tokens that appear on a single page, and the list follows the original PDF page order.

Tuple[List[Layout], List["Image.Image"]]:

When load_images=True, besides the all_page_layout, it also returns a list of page images.

Return type

List[Layout]

Examples

    >>> import layoutparser as lp
    >>> pdf_layout = lp.load_pdf("path/to/pdf")
    >>> pdf_layout[0]  # the layout for page 0
    >>> pdf_layout, pdf_images = lp.load_pdf("path/to/pdf", load_images=True)
    >>> lp.draw_box(pdf_images[0], pdf_layout[0])
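To make the x_tolerance and keep_blank_chars descriptions concrete, here is a simplified sketch of the merging rule for characters on a single text line. It is not pdfplumber's or Layout Parser's actual implementation: merge_chars is a hypothetical helper, the character dicts use assumed keys (text, x_1, x_2), and y_tolerance, text flow, and reading direction are ignored.

```python
from typing import Dict, List, Optional


def merge_chars(
    chars: List[Dict],
    x_tolerance: float = 1.5,
    keep_blank_chars: bool = False,
) -> List[str]:
    """Group characters (already in reading order) into word strings.

    A character joins the current word when the gap between the previous
    character's x_2 and its own x_1 is at most x_tolerance.
    """
    words: List[str] = []
    current = ""
    prev_x2: Optional[float] = None
    for ch in chars:
        if ch["text"].isspace() and not keep_blank_chars:
            # A blank ends the current word and is itself dropped.
            if current:
                words.append(current)
            current = ""
            prev_x2 = None
            continue
        if current and prev_x2 is not None and ch["x_1"] - prev_x2 > x_tolerance:
            # Gap too wide: close the current word and start a new one.
            words.append(current)
            current = ""
        current += ch["text"]
        prev_x2 = ch["x_2"]
    if current:
        words.append(current)
    return words


# Made-up character boxes: "ab", a space, "c", then a wide gap before "d".
chars = [
    {"text": "a", "x_1": 0.0, "x_2": 5.0},
    {"text": "b", "x_1": 5.5, "x_2": 10.0},
    {"text": " ", "x_1": 10.0, "x_2": 13.0},
    {"text": "c", "x_1": 13.0, "x_2": 18.0},
    {"text": "d", "x_1": 25.0, "x_2": 30.0},
]

default_words = merge_chars(chars)
with_blanks = merge_chars(chars, keep_blank_chars=True)
loose_words = merge_chars(chars, x_tolerance=10)
```

Raising x_tolerance lets wider gaps stay inside one word, while keep_blank_chars=True keeps the space inside the token instead of treating it as a separator.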

Other Formats

Stay tuned! We are working on support for more formats.