Load and Export Layout Data
Dataframe and CSV
layoutparser.io.load_dataframe(df: pandas.core.frame.DataFrame, block_type: str = None) → layoutparser.elements.layout.Layout
Load the Layout object from the given dataframe.
Dict and JSON
layoutparser.io.load_dict(data: Union[Dict, List[Dict]]) → Union[layoutparser.elements.base.BaseLayoutElement, layoutparser.elements.layout.Layout]
Load a dict or a list of dict representations of some layout data, automatically parse its type, and return it as either a BaseLayoutElement or a Layout.
- Parameters
data (Union[Dict, List]) – A dict or a list of dict representations of the layout data.
- Raises
ValueError – If the data format is incompatible with the layout-data-JSON format.
ValueError – If any block_type name is not in the available list of layout element names defined in BASECOORD_ELEMENT_NAMEMAP.
- Returns
The loaded layout data. The type is parsed automatically from the dict format, and the data is loaded accordingly.
- Return type
Union[BaseLayoutElement,Layout]
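The type parsing and the two ValueError cases described above can be illustrated with a simplified stand-in. This is not the library's implementation: describe_layout_data and ASSUMED_ELEMENT_NAMES are hypothetical names, the element names are assumed rather than read from BASECOORD_ELEMENT_NAMEMAP, and the sketch assumes that a single dict describes one element while a list of dicts describes a layout.

```python
from typing import Dict, List, Union

# Assumed element names for illustration only; the real list is defined
# in BASECOORD_ELEMENT_NAMEMAP inside layoutparser.
ASSUMED_ELEMENT_NAMES = {"interval", "rectangle", "quadrilateral"}


def describe_layout_data(data: Union[Dict, List[Dict]]) -> str:
    """Return "element" for a single dict and "layout" for a list of dicts.

    Raises ValueError for an unknown block_type or an unsupported structure,
    mirroring the two error cases listed in the documentation.
    """
    if isinstance(data, dict):
        block_type = data.get("block_type")
        if block_type not in ASSUMED_ELEMENT_NAMES:
            raise ValueError(f"Invalid block_type {block_type!r}")
        return "element"
    if isinstance(data, list):
        # Validate every entry before treating the list as a layout.
        for item in data:
            describe_layout_data(item)
        return "layout"
    raise ValueError("Invalid input structure")


rectangle = {"block_type": "rectangle", "x_1": 0, "y_1": 0, "x_2": 10, "y_2": 5}
single_kind = describe_layout_data(rectangle)
list_kind = describe_layout_data([rectangle, rectangle])
```

With the real library, the corresponding call would be load_dict(rectangle) or load_dict([rectangle, rectangle]); the exact keys each element type expects should be checked against the output of the library's own export functions.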
PDF
layoutparser.io.load_pdf(filename: str, load_images: bool = False, x_tolerance: int = 1.5, y_tolerance: int = 2, keep_blank_chars: bool = False, use_text_flow: bool = True, horizontal_ltr: bool = True, vertical_ttb: bool = True, extra_attrs: Optional[List[str]] = None, dpi: int = 72) → Union[List[layoutparser.elements.layout.Layout], Tuple[List[layoutparser.elements.layout.Layout], List[Image.Image]]]
Load all tokens for each page from a PDF file, and save them in a list of Layout objects with the original page order.
- Parameters
filename (str) – The path to the PDF file.
load_images (bool, optional) – Whether to load a screenshot for each page of the PDF file. When set to True, the function returns both the layout and the screenshot image for each page. Defaults to False.
x_tolerance (int, optional) – The threshold used for extracting “word tokens” from the PDF file. PDF characters are merged into a word token if the difference between the x_2 of one character and the x_1 of the next is less than or equal to x_tolerance. See details in pdfplumber’s documentation. Defaults to 1.5.
y_tolerance (int, optional) – The threshold used for extracting “word tokens” from the PDF file. PDF characters are merged into a word token if the difference between the y_2 of one character and the y_1 of the next is less than or equal to y_tolerance. See details in pdfplumber’s documentation. Defaults to 2.
keep_blank_chars (bool, optional) – When keep_blank_chars is set to True, blank characters are treated as part of a word, not as a space between words. See details in pdfplumber’s documentation. Defaults to False.
use_text_flow (bool, optional) – When use_text_flow is set to True, the PDF’s underlying flow of characters is used as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) See details in pdfplumber’s documentation. Defaults to True.
horizontal_ltr (bool, optional) – When horizontal_ltr is set to True, the document is read from left to right; when set to False, from right to left. Defaults to True.
vertical_ttb (bool, optional) – When vertical_ttb is set to True, the document is read from top to bottom; when set to False, from bottom to top. Defaults to True.
extra_attrs (Optional[List[str]], optional) – Passing a list of extra_attrs (e.g., ["fontname", "size"]) restricts each word to characters that share exactly the same value for each of those attributes extracted by pdfplumber, and the resulting word dicts will indicate those attributes. See details in pdfplumber’s documentation. Defaults to ["fontname", "size"].
dpi (int, optional) – When loading images of the PDF, you can also specify the resolution (or DPI, dots per inch) for rendering the images. Higher DPI values mean clearer images (and larger file sizes). Setting dpi also automatically resizes the extracted pdf_layout to match the sizes of the images, so the layouts are rendered at the right positions when visualized on the page images. Defaults to DEFAULT_PDF_DPI=72, which is also the default rendering dpi of the pdfplumber PDF parser.
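The resizing tied to dpi can be pictured as a uniform scaling of token coordinates. The sketch below is a minimal illustration, assuming the scale factor is dpi / DEFAULT_PDF_DPI; scale_box is a hypothetical helper and not part of layoutparser, and the library's actual resizing code may differ.

```python
from typing import Tuple

DEFAULT_PDF_DPI = 72  # default rendering dpi named in the dpi description

Box = Tuple[float, float, float, float]  # (x_1, y_1, x_2, y_2)


def scale_box(box: Box, dpi: int, base_dpi: int = DEFAULT_PDF_DPI) -> Box:
    """Scale a box from base_dpi coordinates to dpi coordinates.

    Assumption: coordinates grow linearly with the rendering resolution.
    """
    factor = dpi / base_dpi
    x_1, y_1, x_2, y_2 = box
    return (x_1 * factor, y_1 * factor, x_2 * factor, y_2 * factor)


token_box = (72.0, 36.0, 144.0, 72.0)       # a token box at the default 72 dpi
scaled_box = scale_box(token_box, dpi=144)  # the same box on a 144 dpi image
print(scaled_box)  # (144.0, 72.0, 288.0, 144.0)
```

Under this assumption, doubling the dpi doubles every coordinate, which keeps each token box aligned with the larger page image.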
- Returns
- List[Layout]: When load_images=False, only the pdf_tokens are loaded from the PDF file. Each element of the list holds all the tokens that appear on a single page, and the list follows the original PDF page order.
- Tuple[List[Layout], List[“Image.Image”]]: When load_images=True, a list of page images is returned in addition to the all_page_layout.
- Return type
Union[List[Layout], Tuple[List[Layout], List[Image.Image]]]
- Examples
>>> import layoutparser as lp
>>> pdf_layout = lp.load_pdf("path/to/pdf")
>>> pdf_layout[0]  # the layout for page 0
>>> pdf_layout, pdf_images = lp.load_pdf("path/to/pdf", load_images=True)
>>> lp.draw_box(pdf_images[0], pdf_layout[0])
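The x_tolerance rule from the parameter list can be sketched in plain Python. This is an illustration of the merging rule as described there, not pdfplumber's or layoutparser's actual code: merge_chars_into_words is a hypothetical helper, the characters are assumed to sit on one line and to be sorted from left to right, and blank characters are left out entirely.

```python
from typing import Dict, List


def merge_chars_into_words(chars: List[Dict], x_tolerance: float = 1.5) -> List[str]:
    """Group characters on one line into word strings.

    A character joins the current word when the gap between the previous
    character's x_2 and its own x_1 is less than or equal to x_tolerance;
    a larger gap starts a new word.
    """
    words: List[str] = []
    current = ""
    prev_x2 = None
    for ch in chars:
        if prev_x2 is not None and ch["x_1"] - prev_x2 > x_tolerance:
            words.append(current)
            current = ""
        current += ch["text"]
        prev_x2 = ch["x_2"]
    if current:
        words.append(current)
    return words


chars = [
    {"text": "H", "x_1": 0.0, "x_2": 5.0},
    {"text": "i", "x_1": 5.5, "x_2": 7.0},    # gap of 0.5: same word
    {"text": "y", "x_1": 12.0, "x_2": 17.0},  # gap of 5.0: new word
    {"text": "o", "x_1": 17.5, "x_2": 22.0},  # gap of 0.5: same word
]
words = merge_chars_into_words(chars)
print(words)  # ['Hi', 'yo']
```

Raising x_tolerance above the largest gap (for example to 6) merges all four characters into a single token, which is why a value that is too large can glue neighbouring words together. The same rule applies vertically with y_tolerance.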
Other Formats
Stay tuned! We are working on supporting more formats.