Type: Package
Title: Extract Tables and Sentences from PDFs with User Interface
Version: 1.4.10
Author: Erik Stricker [aut, cre]
Maintainer: Erik Stricker <erik.stricker@gmx.com>
Description: The PDE (Pdf Data Extractor) allows the extraction of information and tables, optionally based on search words, from PDF (Portable Document Format) files and enables the visualization of the results, both by providing a convenient user interface.
License: GPL-3 | file LICENSE
Encoding: UTF-8
Imports: tcltk
Depends: tcltk2 (≥ 1.2.11), R (≥ 3.5)
SystemRequirements: XPDF (4.02) (https://github.com/erikstricker/PDE/tree/master/inst/examples/bin)
RoxygenNote: 7.3.1
Suggests: knitr, rmarkdown
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2024-06-11 17:25:29 UTC; Erik
Repository: CRAN
Date/Publication: 2024-06-11 18:10:06 UTC

PDE: Extract Tables and Sentences from PDF Files.

Description

The package includes two main components: 1) the PDE analyzer performs the sentence and table extraction, while 2) the PDE reader allows the user-friendly visualization and quick processing of the obtained results.

PDE functions

PDE_analyzer, PDE_analyzer_i, PDE_extr_data_from_pdfs, PDE_pdfs2table, PDE_pdfs2table_searchandfilter, PDE_pdfs2txt_searchandfilter, PDE_reader_i, PDE_install_Xpdftools4.02, PDE_check_Xpdf_install



Extracting data from a PDF (Portable Document Format) file

Description

PDE_extr_data_from_pdf extracts sentences or tables from a single PDF file and writes the output to the corresponding folder.

Usage

.PDE_extr_data_from_pdf(
  pdf,
  whattoextr,
  out = ".",
  filter.words = "",
  regex.fw = TRUE,
  ignore.case.fw = FALSE,
  filter.word.times = "0.2%",
  table.heading.words = "",
  ignore.case.th = FALSE,
  search.words,
  search.word.categories = NULL,
  save.tab.by.category = FALSE,
  regex.sw = TRUE,
  ignore.case.sw = FALSE,
  eval.abbrevs = TRUE,
  out.table.format = ".csv (WINDOWS-1252)",
  dev_x = 20,
  dev_y = 9999,
  context = 0,
  write.table.locations = FALSE,
  exp.nondetc.tabs = TRUE,
  write.tab.doc.file = TRUE,
  write.txt.doc.file = TRUE,
  delete = TRUE,
  cpy_mv = "nocpymv",
  verbose = TRUE
)

Arguments

pdf

String. Path to the PDF file to be analyzed.

whattoextr

String. Either txt, tab, or tabandtxt for PDFS2TXT (extract sentences from a PDF file) or PDFS2TABLE (table of a PDF file to a Microsoft Excel file) extraction. tab allows the extraction of tables with and without search words, while txt and tabandtxt require search words.

out

String. Directory chosen to save analysis results in. Default: ".".

filter.words

List of strings. The list of filter words. If not NA or "", a hit will be counted every time a word from the list is detected in the article. Default: "".

regex.fw

Logical. If TRUE, filter words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: TRUE.

ignore.case.fw

Logical. If TRUE, the capitalization of the filter words is ignored (matching is not case-sensitive). Default: FALSE.

filter.word.times

Numeric or string. The minimum number of filter.words hits required for a paper to be analyzed further. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding the "%" sign). Default: 0.2%.

table.heading.words

List of strings. Table headings to be detected in addition to the standard ones (TABLE, TAB or table plus number). Regex rules apply (see also https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: "".

ignore.case.th

Logical. If TRUE, the capitalization of the additional table headings (see table.heading.words) is ignored (matching is not case-sensitive). Default: FALSE.

search.words

List of strings. List of search words. To extract all tables from the PDF file leave search.words = "".

search.word.categories

List of strings. List of categories with the same length as the list of search words. Accordingly, each search word can be assigned to a category, of which the word counts will be summarized in the PDE_analyzer_word_stats.csv file. If search.word.categories has a different length than search.words, the parameter will be ignored. Default: NULL.

save.tab.by.category

Logical. Can only be used with search.word.categories. If set to TRUE, tables that carry search words will be saved in sub-folders according to the search word category of the detected search word. Default: FALSE.

regex.sw

Logical. If TRUE, search words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: TRUE.

ignore.case.sw

Logical. If TRUE, the capitalization of the search words is ignored (matching is not case-sensitive). Default: FALSE.

eval.abbrevs

Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: TRUE.

out.table.format

String. Output file format. Either a comma-separated file (.csv) or a tab-separated file (.tsv). The encoding indicated in parentheses should be selected according to the operating system in which the exported tables are opened, i.e., Windows: "(WINDOWS-1252)"; Mac: "(macintosh)"; Linux: "(UTF-8)". Default: ".csv" with an encoding depending on the operating system.

dev_x

Numeric. For a table, the size of the indentation within which text is considered part of the same column. Default: 20.

dev_y

Numeric. For a table, the vertical distance within which text is considered part of the same row. Can be either a number or set to dynamic detection [9999], in which case the font size is used to detect which words are in the same row. Default: 9999.

context

Numeric. Number of sentences extracted before and after the sentence with the detected search word. If 0, only the sentence with the search word is extracted. Default: 0.

write.table.locations

Logical. If TRUE, a separate file will be generated with the headings of all tables, their relative location in the generated html and txt files, and information on whether search words were found. Default: FALSE.

exp.nondetc.tabs

Logical. If TRUE and a table was detected in a PDF file but is an image or cannot be read, the page with the table will be exported as a png. Default: TRUE.

write.tab.doc.file

Logical. If TRUE and search words are used for table detection but no search words were found in the tables of a PDF file, a file will be created with the PDF filename followed by no.table.w.search.words. Default: TRUE.

write.txt.doc.file

Logical. If TRUE and no search words were found in the sentences of a PDF file, a file will be created with the PDF filename followed by no.txt.w.search.words. If the PDF file is empty, a file will be created with the PDF filename followed by no.content.detected. If the filter word threshold is not met, a file will be created with the PDF filename followed by no.txt.w.filter.words. Default: TRUE.

delete

Logical. If TRUE, the intermediate txt, keeplayouttxt and html copies of the PDF file will be deleted. Default: TRUE.

cpy_mv

String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: "nocpymv".

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.
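
The two forms accepted by filter.word.times (an absolute count, or a percentage of the total word count) can be illustrated with a small sketch. The helper min_filter_hits below is hypothetical and not part of PDE; how the package itself rounds or compares the threshold is not documented here, so treat this only as one plausible reading.

```r
# Hypothetical helper (not part of PDE): turn a filter.word.times value into
# the minimum number of filter-word hits required for a document.
min_filter_hits <- function(filter.word.times, total.words) {
  if (is.character(filter.word.times) && grepl("%$", filter.word.times)) {
    # Percentage of the total word count, e.g. "0.2%"
    pct <- as.numeric(sub("%$", "", filter.word.times))
    return(total.words * pct / 100)
  }
  # Otherwise treat the value as an absolute number of hits
  as.numeric(filter.word.times)
}

hits_pct <- min_filter_hits("0.2%", total.words = 5000)  # 0.2% of 5000 words
hits_abs <- min_filter_hits(3, total.words = 5000)       # absolute threshold
```

Under this reading, a 5000-word article would need about 10 filter-word hits at the default "0.2%".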

Value

If tables were extracted from the PDF file, the function returns a list of the following tables/items: 1) htmltablelines, 2) txttablelines, 3) keeplayouttxttablelines, 4) id, 5) out_msg. The tablelines are tables that provide the heading and position of the detected tables. The id provides the name of the PDF file. The out_msg includes all messages printed to the console, or the suppressed messages if verbose=FALSE.

See Also

PDE_pdfs2table, PDE_pdfs2table_searchandfilter, PDE_pdfs2txt_searchandfilter

Examples

## Running a simple analysis with filter and search words to extract sentences and tables
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- .PDE_extr_data_from_pdf(
    pdf = paste0(system.file(package = "PDE"),
                 "/examples/Methotrexate/29973177_!.pdf"),
    whattoextr = "tabandtxt",
    out = paste0(system.file(package = "PDE"), "/examples/MTX_output+-0_test/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE, regex.fw = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    ignore.case.sw = FALSE, regex.sw = TRUE)
}

## Running an advanced analysis with filter and search words to
## extract sentences and tables and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- .PDE_extr_data_from_pdf(
    pdf = paste0(system.file(package = "PDE"),
                 "/examples/Methotrexate/29973177_!.pdf"),
    whattoextr = "tabandtxt",
    out = paste0(system.file(package = "PDE"), "/examples/MTX_output+-1_test/"),
    context = 1, dev_x = 20, dev_y = 9999,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE, regex.fw = FALSE,
    filter.word.times = "0.2%",
    table.heading.words = "", ignore.case.th = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    ignore.case.sw = FALSE, regex.sw = TRUE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = TRUE,
    write.tab.doc.file = TRUE, write.txt.doc.file = TRUE,
    exp.nondetc.tabs = TRUE,
    cpy_mv = "nocpymv", delete = TRUE)
}
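
The interplay of regex.sw and ignore.case.sw in the examples above can be approximated with base R's grepl. This is only a sketch: the sample sentences are made up, and PDE's internal matching may differ from plain grepl.

```r
# Sketch with base R only; PDE's internal matching may differ.
sentences <- c("Methotrexate was given weekly.",
               "Patients received methotrexate.",
               "No drug was given.")

# Regex pattern, case-sensitive: the pattern itself covers both capitalizations
hit_regex <- grepl("(M|m)ethotrexate", sentences, ignore.case = FALSE)

# Plain pattern, case ignored: same hits without spelling out both cases
hit_nocase <- grepl("methotrexate", sentences, ignore.case = TRUE)

# Literal (non-regex) matching: the parentheses are ordinary characters,
# so the pattern no longer matches either sentence
hit_literal <- grepl("(M|m)ethotrexate", sentences, fixed = TRUE)
```

This is why a search word written as a regular expression, such as "(M|m)ethotrexate", only makes sense together with regex.sw = TRUE.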

Deprecated functions in package ‘PDE’

Description

These functions are provided for compatibility with older versions of ‘PDE’ only, and will be defunct at the next release.

Details

The following functions are deprecated and will be made defunct; use the replacement indicated below:


Extracting data from PDF (Portable Document Format) files

Description

The PDE_analyzer allows the sentence and table extraction from multiple PDF files.

Usage

PDE_analyzer(PDE_parameters_file_path = NA, verbose = TRUE)

Arguments

PDE_parameters_file_path

String. This file includes all parameters to run PDE_extr_data_from_pdfs on multiple PDF files. If PDE_parameters_file_path does not exist or is NA, a dialog box is opened prompting the user to select the parameter file.

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.

Value

If tables were extracted from the PDF file, the function returns a list of the following tables/items: 1) htmltablelines, 2) txttablelines, 3) keeplayouttxttablelines, 4) id, 5) out_msg. The tablelines are tables that provide the heading and position of the detected tables. The id provides the name of the PDF file. The out_msg includes all messages printed to the console, or the suppressed messages if verbose=FALSE.

Details

The parameter file (also referred to as the .tsv file) can be filled in either manually or with the help of the PDE_analyzer_i interface.

Note

A detailed description of the parameters in the TSV file can be found in the markdown file (README_PDE.md) and in the description of PDE_extr_data_from_pdfs.

See Also

PDE_extr_data_from_pdfs

Examples

if(PDE_check_Xpdf_install() == TRUE){
  PDE_analyzer(paste0(system.file(package = "PDE"),
                      "/examples/tsvs/PDE_parameters_v1.4_all_files+-0.tsv"))
}

## Not run:
## requires user file choice:
PDE_analyzer()
## End(Not run)

Extracting data from PDF (Portable Document Format) files using a user interface

Description

The PDE_analyzer_i provides a user interface for the sentence and table extraction from multiple PDF files.

Usage

PDE_analyzer_i(verbose = TRUE)

Arguments

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.

Note

A detailed description of the elements in the user interface can be found in the markdown file (README_PDE.md).

Examples

PDE_analyzer_i()

Check if the Xpdftools are installed and in the system path

Description

PDE_check_Xpdf_install runs a version test for pdftotext, pdftohtml and pdftopng.

Usage

PDE_check_Xpdf_install(sysname = NULL, verbose = TRUE)

Arguments

sysname

String. In case the function returns "Unknown OS", the sysname can be set manually. Allowed options are "Windows", "Linux", "SunOS" for Solaris, and "Darwin" for Mac. Default: NULL.

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.

Value

The function returns a Boolean for the installation status and a message in case the commands are not detected.

Examples

PDE_check_Xpdf_install()
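
As a rough, package-independent approximation of what "in the system path" means, base R's Sys.which can report whether the three command line tools are found. This sketch only looks the executables up on the PATH; it does not run the version test that PDE_check_Xpdf_install performs, and it is not the package's implementation.

```r
# Look up the three Xpdf command line tools on the PATH (base R only).
tools <- c("pdftotext", "pdftohtml", "pdftopng")
found <- Sys.which(tools)      # "" for every tool that is not found
on_path <- found != ""         # TRUE where an executable was located
names(on_path) <- tools

if (all(on_path)) {
  message("All three tools were found on the PATH.")
} else {
  message("Not found on the PATH: ", paste(tools[!on_path], collapse = ", "))
}
```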

Extracting data from PDF (Portable Document Format) files

Description

PDE_extr_data_from_pdfs extracts sentences or tables from one or more PDF files and writes the output to the corresponding folder.

Usage

PDE_extr_data_from_pdfs(
  pdfs,
  whattoextr,
  out = ".",
  filter.words = "",
  regex.fw = TRUE,
  ignore.case.fw = FALSE,
  filter.word.times = "0.2%",
  table.heading.words = "",
  ignore.case.th = FALSE,
  search.words,
  search.word.categories = NULL,
  regex.sw = TRUE,
  save.tab.by.category = FALSE,
  ignore.case.sw = FALSE,
  eval.abbrevs = TRUE,
  out.table.format = ".csv (WINDOWS-1252)",
  dev_x = 20,
  dev_y = 9999,
  context = 0,
  write.table.locations = FALSE,
  exp.nondetc.tabs = TRUE,
  write.tab.doc.file = TRUE,
  write.txt.doc.file = TRUE,
  delete = TRUE,
  cpy_mv = "nocpymv",
  verbose = TRUE
)

Arguments

pdfs

String. A list of paths to the PDF files to be analyzed.

whattoextr

String. Either txt, tab, or tabandtxt for PDFS2TXT (extract sentences from a PDF file) or PDFS2TABLE (table of a PDF file to a Microsoft Excel file) extraction. tab allows the extraction of tables with and without search words, while txt and tabandtxt require search words.

out

String. Directory chosen to save analysis results in. Default: ".".

filter.words

List of strings. The list of filter words. If not NA or "", a hit will be counted every time a word from the list is detected in the article. Default: "".

regex.fw

Logical. If TRUE, filter words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: TRUE.

ignore.case.fw

Logical. If TRUE, the capitalization of the filter words is ignored (matching is not case-sensitive). Default: FALSE.

filter.word.times

Numeric or string. The minimum number of filter.words hits required for a paper to be analyzed further. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding the "%" sign). Default: 0.2%.

table.heading.words

List of strings. Table headings to be detected in addition to the standard ones (TABLE, TAB or table plus number). Regex rules apply (see also https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: "".

ignore.case.th

Logical. If TRUE, the capitalization of the additional table headings (see table.heading.words) is ignored (matching is not case-sensitive). Default: FALSE.

search.words

List of strings. List of search words. To extract all tables from the PDF files leave search.words = "".

search.word.categories

List of strings. List of categories with the same length as the list of search words. Accordingly, each search word can be assigned to a category, of which the word counts will be summarized in the PDE_analyzer_word_stats.csv file. If search.word.categories has a different length than search.words, the parameter will be ignored. Default: NULL.

regex.sw

Logical. If TRUE, search words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: TRUE.

save.tab.by.category

Logical. Can only be used with search.word.categories. If set to TRUE, tables that carry search words will be saved in sub-folders according to the search word category of the detected search word. Default: FALSE.

ignore.case.sw

Logical. If TRUE, the capitalization of the search words is ignored (matching is not case-sensitive). Default: FALSE.

eval.abbrevs

Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: TRUE.

out.table.format

String. Output file format. Either a comma-separated file (.csv) or a tab-separated file (.tsv). The encoding indicated in parentheses should be selected according to the operating system in which the exported tables are opened, i.e., Windows: "(WINDOWS-1252)"; Mac: "(macintosh)"; Linux: "(UTF-8)". Default: ".csv" with an encoding depending on the operating system.

dev_x

Numeric. For a table, the size of the indentation within which text is considered part of the same column. Default: 20.

dev_y

Numeric. For a table, the vertical distance within which text is considered part of the same row. Can be either a number or set to dynamic detection [9999], in which case the font size is used to detect which words are in the same row. Default: 9999.

context

Numeric. Number of sentences extracted before and after the sentence with the detected search word. If 0, only the sentence with the search word is extracted. Default: 0.

write.table.locations

Logical. If TRUE, a separate file will be generated with the headings of all tables, their relative location in the generated html and txt files, and information on whether search words were found. Default: FALSE.

exp.nondetc.tabs

Logical. If TRUE and a table was detected in a PDF file but is an image or cannot be read, the page with the table will be exported as a png. Default: TRUE.

write.tab.doc.file

Logical. If TRUE and search words are used for table detection but no search words were found in the tables of a PDF file, a file will be created with the PDF filename followed by no.table.w.search.words. Default: TRUE.

write.txt.doc.file

Logical. If TRUE and no search words were found in the sentences of a PDF file, a file will be created with the PDF filename followed by no.txt.w.search.words. If the PDF file is empty, a file will be created with the PDF filename followed by no.content.detected. If the filter word threshold is not met, a file will be created with the PDF filename followed by no.txt.w.filter.words. Default: TRUE.

delete

Logical. If TRUE, the intermediate txt, keeplayouttxt and html copies of the PDF files will be deleted. Default: TRUE.

cpy_mv

String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: "nocpymv".

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.
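
The effect of the context argument (how many sentences before and after a hit are kept) can be sketched in a few lines of base R. The helper extract_with_context and the sample sentences are hypothetical; PDE's own sentence splitting and output format are not reproduced here.

```r
# Hypothetical helper: return, for every sentence that matches `pattern`,
# that sentence plus up to `context` sentences before and after it.
extract_with_context <- function(sentences, pattern, context = 0) {
  hits <- grep(pattern, sentences)
  lapply(hits, function(i) {
    from <- max(1, i - context)
    to <- min(length(sentences), i + context)
    sentences[from:to]
  })
}

sentences <- c("Background is given.", "Methotrexate was used.",
               "Results follow.", "End.")
res0 <- extract_with_context(sentences, "Methotrexate", context = 0)
res1 <- extract_with_context(sentences, "Methotrexate", context = 1)
```

With context = 0 only the matching sentence is returned; with context = 1 its two neighbours are included as well.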

Value

If tables were extracted from the PDF file, the function returns a list of the following tables/items: 1) htmltablelines, 2) txttablelines, 3) keeplayouttxttablelines, 4) id, 5) out_msg. The tablelines are tables that provide the heading and position of the detected tables. The id provides the name of the PDF file. The out_msg includes all messages printed to the console, or the suppressed messages if verbose=FALSE.

See Also

PDE_pdfs2table, PDE_pdfs2table_searchandfilter, PDE_pdfs2txt_searchandfilter

Examples

## Running a simple analysis with filter and search words to extract sentences and tables
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_extr_data_from_pdfs(
    pdfs = c(paste0(system.file(package = "PDE"),
                    "/examples/Methotrexate/29973177_!.pdf"),
             paste0(system.file(package = "PDE"),
                    "/examples/Methotrexate/31083238_!.pdf")),
    whattoextr = "tabandtxt",
    out = paste0(system.file(package = "PDE"), "/examples/MTX_output+-0_test/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE, regex.fw = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    ignore.case.sw = FALSE, regex.sw = TRUE)
}

## Running an advanced analysis with filter and search words to
## extract sentences and tables and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_extr_data_from_pdfs(
    pdfs = c(paste0(system.file(package = "PDE"),
                    "/examples/Methotrexate/29973177_!.pdf"),
             paste0(system.file(package = "PDE"),
                    "/examples/Methotrexate/31083238_!.pdf")),
    whattoextr = "tabandtxt",
    out = paste0(system.file(package = "PDE"), "/examples/MTX_output+-1_test/"),
    context = 1, dev_x = 20, dev_y = 9999,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE, regex.fw = FALSE,
    filter.word.times = "0.2%",
    table.heading.words = "", ignore.case.th = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE, ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = TRUE,
    write.tab.doc.file = TRUE, write.txt.doc.file = TRUE,
    exp.nondetc.tabs = TRUE,
    cpy_mv = "nocpymv", delete = TRUE)
}

Install the Xpdf command line tools 4.02

Description

PDE_install_Xpdftools4.02 downloads and installs the XPDF command line tools 4.02.

Usage

PDE_install_Xpdftools4.02(sysname = NULL, bin = NULL, verbose = TRUE, permission = 0)

Arguments

sysname

String. In case the function returns "Unknown OS", the sysname can be set manually. Allowed options are "Windows", "Linux", "SunOS" for Solaris, and "Darwin" for Mac. Default: NULL.

bin

String. In case the function returns "Unknown OS", the bin of the operating system can be set manually. Allowed options are "64" and "32". Default: NULL.

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.

permission

Numerical. If set to 0, the user is asked for permission to download Xpdftools. If set to 1, no user input is required. Default: 0.

Value

The function returns a Boolean for the installation status and a message in case the commands are not installed.

Examples

## Not run:
PDE_install_Xpdftools4.02()
## End(Not run)

Export the installation path of the PDE (PDF Data Extractor) package

Description

PDE_path is deprecated. Please run system.file(package = "PDE") instead.

Usage

PDE_path()

Value

The function returns a potential path for the PDE package. If the PDE tool was not correctly installed, it returns "".
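
The recommended replacement, system.file(package = "PDE"), behaves the same way with respect to the empty string: base R returns "" when the package cannot be found. A minimal sketch (the example file path is the one used in the examples elsewhere in this manual):

```r
# system.file() returns "" if the package is not installed.
pde_dir <- system.file(package = "PDE")

if (nzchar(pde_dir)) {
  example_pdf <- file.path(pde_dir, "examples", "Methotrexate", "29973177_!.pdf")
  message("Example PDF present: ", file.exists(example_pdf))
} else {
  message("PDE does not appear to be installed.")
}
```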


Extracting all tables from a PDF (Portable Document Format) file

Description

PDE_pdfs2table extracts all tables from a single PDF file and writes the output to the corresponding folder.

Usage

PDE_pdfs2table(
  pdfs,
  out = ".",
  table.heading.words = "",
  ignore.case.th = FALSE,
  out.table.format = ".csv (WINDOWS-1252)",
  dev_x = 20,
  dev_y = 9999,
  write.table.locations = FALSE,
  exp.nondetc.tabs = TRUE,
  delete = TRUE,
  verbose = TRUE
)

Arguments

pdfs

String. A list of paths to the PDF files to be analyzed.

out

String. Directory chosen to save tables in. Default: ".".

table.heading.words

List of strings. Table headings to be detected in addition to the standard ones (TABLE, TAB or table plus number). Regex rules apply (see also https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: "".

ignore.case.th

Logical. If TRUE, the capitalization of the additional table headings (see table.heading.words) is ignored (matching is not case-sensitive). Default: FALSE.

out.table.format

String. Output file format. Either a comma-separated file (.csv) or a tab-separated file (.tsv). The encoding indicated in parentheses should be selected according to the operating system in which the exported tables are opened, i.e., Windows: "(WINDOWS-1252)"; Mac: "(macintosh)"; Linux: "(UTF-8)". Default: ".csv" with an encoding depending on the operating system.

dev_x

Numeric. For a table, the size of the indentation within which text is considered part of the same column. Default: 20.

dev_y

Numeric. For a table, the vertical distance within which text is considered part of the same row. Can be either a number or set to dynamic detection [9999], in which case the font size is used to detect which words are in the same row. Default: 9999.

write.table.locations

Logical. If TRUE, a separate file will be generated with the headings of all tables, their relative location in the generated html and txt files, and information on whether search words were found. Default: FALSE.

exp.nondetc.tabs

Logical. If TRUE and a table was detected in a PDF file but is an image or cannot be read, the page with the table will be exported as a png. Default: TRUE.

delete

Logical. If TRUE, the intermediate txt, keeplayouttxt and html copies of the PDF file will be deleted. Default: TRUE.

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.
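
The operating-system/encoding pairs listed for out.table.format can be turned into a small lookup. The helper table_format_for_os is hypothetical (it is not a PDE function), and treating every system other than Windows and Mac as UTF-8 is an assumption made for this sketch.

```r
# Hypothetical helper: build an out.table.format string such as
# ".csv (WINDOWS-1252)" from the system name reported by Sys.info().
table_format_for_os <- function(sysname = Sys.info()[["sysname"]], ext = ".csv") {
  encoding <- switch(sysname,
    Windows = "WINDOWS-1252",
    Darwin  = "macintosh",
    "UTF-8")  # assumption: Linux and all other systems
  paste0(ext, " (", encoding, ")")
}

fmt_win   <- table_format_for_os("Windows")
fmt_linux <- table_format_for_os("Linux", ext = ".tsv")
```

The resulting string could then be passed as out.table.format; the system names match the sysname values listed for PDE_check_Xpdf_install.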

See Also

PDE_extr_data_from_pdfs,PDE_pdfs2table_searchandfilter

Examples

## Running a simple table extraction
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2table(
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"))
}

## Running the same table extraction as above with all parameters shown
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2table(
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    dev_x = 20, dev_y = 9999,
    table.heading.words = "", ignore.case.th = FALSE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = FALSE,
    exp.nondetc.tabs = FALSE,
    delete = TRUE)
}
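
The idea behind the dev_x tolerance (horizontal positions that lie close together are treated as one column) can be illustrated with a toy grouping. This is not PDE's table-detection algorithm; assign_columns and the sample positions are hypothetical and only show how a tolerance of 20 would merge nearby positions.

```r
# Toy illustration: sort the horizontal positions and start a new column
# whenever the gap to the previous position exceeds dev_x.
assign_columns <- function(x, dev_x = 20) {
  ord <- order(x)
  sorted <- x[ord]
  col_sorted <- cumsum(c(TRUE, diff(sorted) > dev_x))
  cols <- integer(length(x))
  cols[ord] <- col_sorted   # map column ids back to the original order
  cols
}

x_positions <- c(50, 52, 200, 61, 205, 400)
column_ids <- assign_columns(x_positions, dev_x = 20)
```

Here the positions 50, 52 and 61 fall into the first column, 200 and 205 into the second, and 400 into the third. A smaller dev_x would split slightly indented cells into separate columns.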

Extracting tables from a PDF (Portable Document Format) file

Description

PDE_pdfs2table_searchandfilter extracts tables from a single PDF file according to filter and search words and writes the output to the corresponding folder.

Usage

PDE_pdfs2table_searchandfilter(
  pdfs,
  out = ".",
  filter.words = "",
  regex.fw = TRUE,
  ignore.case.fw = FALSE,
  filter.word.times = "0.2%",
  table.heading.words = "",
  ignore.case.th = FALSE,
  search.words,
  search.word.categories = NULL,
  save.tab.by.category = FALSE,
  regex.sw = TRUE,
  ignore.case.sw = FALSE,
  eval.abbrevs = TRUE,
  out.table.format = ".csv (WINDOWS-1252)",
  dev_x = 20,
  dev_y = 9999,
  write.table.locations = FALSE,
  exp.nondetc.tabs = TRUE,
  write.tab.doc.file = TRUE,
  delete = TRUE,
  cpy_mv = "nocpymv",
  verbose = TRUE
)

Arguments

pdfs

String. A list of paths to the PDF files to be analyzed.

out

String. Directory chosen to save analysis results in. Default: ".".

filter.words

List of strings. The list of filter words. If not NA or "", a hit will be counted every time a word from the list is detected in the article. Default: "".

regex.fw

Logical. If TRUE, filter words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: TRUE.

ignore.case.fw

Logical. If TRUE, the capitalization of the filter words is ignored (matching is not case-sensitive). Default: FALSE.

filter.word.times

Numeric or string. The minimum number of filter.words hits required for a paper to be analyzed further. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding the "%" sign). Default: 0.2%.

table.heading.words

List of strings. Table headings to be detected in addition to the standard ones (TABLE, TAB or table plus number). Regex rules apply (see also https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: "".

ignore.case.th

Logical. If TRUE, the capitalization of the additional table headings (see table.heading.words) is ignored (matching is not case-sensitive). Default: FALSE.

search.words

List of strings. List of search words. To extract all tables from the PDF file leave search.words = "".

search.word.categories

List of strings. List of categories with the same length as the list of search words. Accordingly, each search word can be assigned to a category, of which the word counts will be summarized in the PDE_analyzer_word_stats.csv file. If search.word.categories has a different length than search.words, the parameter will be ignored. Default: NULL.

save.tab.by.category

Logical. Can only be used with search.word.categories. If set to TRUE, tables that carry search words will be saved in sub-folders according to the search word category of the detected search word. Default: FALSE.

regex.sw

Logical. If TRUE, search words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: TRUE.

ignore.case.sw

Logical. If TRUE, the capitalization of the search words is ignored (matching is not case-sensitive). Default: FALSE.

eval.abbrevs

Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: TRUE.

out.table.format

String. Output file format. Either a comma-separated file (.csv) or a tab-separated file (.tsv). The encoding indicated in parentheses should be selected according to the operating system in which the exported tables are opened, i.e., Windows: "(WINDOWS-1252)"; Mac: "(macintosh)"; Linux: "(UTF-8)". Default: ".csv" with an encoding depending on the operating system.

dev_x

Numeric. For a table, the size of the indentation within which text is considered part of the same column. Default: 20.

dev_y

Numeric. For a table, the vertical distance within which text is considered part of the same row. Can be either a number or set to dynamic detection [9999], in which case the font size is used to detect which words are in the same row. Default: 9999.

write.table.locations

Logical. If TRUE, a separate file will be generated with the headings of all tables, their relative location in the generated html and txt files, and information on whether search words were found. Default: FALSE.

exp.nondetc.tabs

Logical. If TRUE and a table was detected in a PDF file but is an image or cannot be read, the page with the table will be exported as a png. Default: TRUE.

write.tab.doc.file

Logical. If TRUE and search words are used for table detection but no search words were found in the tables of a PDF file, a file will be created with the PDF filename followed by no.table.w.search.words. Default: TRUE.

delete

Logical. If TRUE, the intermediate txt, keeplayouttxt and html copies of the PDF file will be deleted. Default: TRUE.

cpy_mv

String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: "nocpymv".

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.
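
Because search.word.categories must have the same length as search.words, it can help to build both vectors side by side and check their lengths before calling the function. The category names "drug" and "design" below are hypothetical and chosen only for illustration.

```r
# Parallel vectors: the i-th category belongs to the i-th search word.
search.words <- c("(M|m)ethotrexate", "(T|t)rexal", "cohort")
search.word.categories <- c("drug", "drug", "design")  # hypothetical names

# The documentation states that categories of a different length are ignored,
# so a mismatch is worth catching early.
stopifnot(length(search.words) == length(search.word.categories))

# Overview of which search words fall into which category
by_category <- split(search.words, search.word.categories)
```

With save.tab.by.category = TRUE, the documented behaviour is that tables are saved in sub-folders according to these categories.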

Value

If tables were extracted from the PDF file, the function returns a list of the following tables/items: 1) htmltablelines, 2) txttablelines, 3) keeplayouttxttablelines, 4) id, 5) out_msg. The tablelines are tables that provide the heading and position of the detected tables. The id provides the name of the PDF file. The out_msg includes all messages printed to the console, or the suppressed messages if verbose=FALSE.

See Also

PDE_extr_data_from_pdfs, PDE_pdfs2table

Examples

## Running a simple analysis with filter and search words to extract tables
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2table_searchandfilter(
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE, ignore.case.fw = TRUE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE, ignore.case.sw = FALSE)
}

## Running an advanced analysis with filter and search words to
## extract tables and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2table_searchandfilter(
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    dev_x = 20, dev_y = 9999,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE, ignore.case.fw = TRUE,
    filter.word.times = "0.2%",
    table.heading.words = "", ignore.case.th = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE, ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = TRUE,
    write.tab.doc.file = TRUE,
    exp.nondetc.tabs = TRUE,
    cpy_mv = "nocpymv", delete = TRUE)
}

Extracting sentences from a PDF (Portable Document Format) file

Description

PDE_pdfs2txt_searchandfilter extracts sentences from a single PDF file according to search and filter words and writes the output to the corresponding folder.

Usage

PDE_pdfs2txt_searchandfilter(
  pdfs,
  out = ".",
  filter.words = "",
  regex.fw = TRUE,
  ignore.case.fw = FALSE,
  filter.word.times = "0.2%",
  search.words,
  search.word.categories = NULL,
  regex.sw = TRUE,
  ignore.case.sw = FALSE,
  eval.abbrevs = TRUE,
  out.table.format = ".csv (WINDOWS-1252)",
  context = 0,
  write.txt.doc.file = TRUE,
  delete = TRUE,
  cpy_mv = "nocpymv",
  verbose = TRUE
)

Arguments

pdfs

String. A list of paths to the PDF files to be analyzed.

out

String. Directory in which the analysis results are saved. Default: ".".

filter.words

List of strings. The list of filter words. If not NA or "", a hit will be counted every time a word from the list is detected in the article. Default: "".

regex.fw

Logical. If TRUE, filter words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: TRUE.

ignore.case.fw

Logical. Should case be ignored when matching the filter words? If FALSE, matching is case-sensitive (capitalization matters). Default: FALSE.

filter.word.times

Numeric or string. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding the "%" sign). It gives the minimum number of hits for filter.words required for a paper to be further analyzed. Default: "0.2%".

search.words

List of strings. List of search words.

search.word.categories

List of strings. List of categories with the same length as the list of search words. Accordingly, each search word can be assigned to a category, of which the word counts will be summarized in the PDE_analyzer_word_stats.csv file. If search.word.categories has a different length than search.words, the parameter will be ignored. Default: NULL.

regex.sw

Logical. If TRUE, search words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: TRUE.

ignore.case.sw

Logical. Should case be ignored when matching the search words? If FALSE, matching is case-sensitive (capitalization matters). Default: FALSE.

eval.abbrevs

Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: TRUE.

out.table.format

String. Output file format. Either comma-separated file .csv or tab-separated file .tsv. The encoding indicated in parentheses should be selected according to the operating system in which the exported tables are opened, i.e., Windows: "(WINDOWS-1252)"; Mac: "(macintosh)"; Linux: "(UTF-8)". Default: ".csv" and encoding depending on the operating system.

context

Numeric. Number of sentences extracted before and after the sentence with the detected search word. If 0, only the sentence with the search word is extracted. Default: 0.

write.txt.doc.file

Logical. If TRUE and no search words were found in the sentences of a PDF file, a file will be created with the PDF filename followed by no.txt.w.search.words. If the PDF file is empty, a file will be created with the PDF filename followed by no.content.detected. If the filter word threshold is not met, a file will be created with the PDF filename followed by no.txt.w.filter.words. Default: TRUE.

delete

Logical. If TRUE, the intermediate txt, keeplayouttxt and html copies of the PDF file will be deleted. Default: TRUE.

cpy_mv

String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: "nocpymv".

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.
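
How filter.words, regex.fw, ignore.case.fw and filter.word.times work together can be illustrated with a small base-R sketch. This is not the package's internal code: the helper names count_filter_hits and passes_filter are hypothetical, the total word count is assumed to come from splitting the text on whitespace, and the threshold comparison is assumed to be inclusive.

```r
# Hypothetical helpers for illustration; not part of the PDE package.

# Count how often any filter word occurs in a character vector of text lines.
count_filter_hits <- function(text, filter.words, regex.fw = TRUE,
                              ignore.case.fw = FALSE) {
  hits <- 0
  for (word in filter.words) {
    if (regex.fw) {
      m <- gregexpr(word, text, ignore.case = ignore.case.fw)
    } else if (ignore.case.fw) {
      # Literal matching cannot be combined with ignore.case,
      # so both sides are lower-cased instead.
      m <- gregexpr(tolower(word), tolower(text), fixed = TRUE)
    } else {
      m <- gregexpr(word, text, fixed = TRUE)
    }
    # gregexpr() reports -1 for "no match", so only positive positions count.
    hits <- hits + sum(vapply(m, function(x) sum(x > 0), integer(1)))
  }
  hits
}

# Compare the hit count with an absolute or percentage threshold.
passes_filter <- function(hits, total.words, filter.word.times = "0.2%") {
  if (is.character(filter.word.times) && grepl("%$", filter.word.times)) {
    pct <- as.numeric(sub("%$", "", filter.word.times))
    threshold <- total.words * pct / 100
  } else {
    threshold <- as.numeric(filter.word.times)
  }
  hits >= threshold  # assumption: reaching the threshold is sufficient
}

text <- c("The study population included one cohort.",
          "A second Cohort served as control group.")
filter.words <- c("cohort", "group", "study population")

hits <- count_filter_hits(text, filter.words,
                          regex.fw = FALSE, ignore.case.fw = TRUE)
total.words <- length(unlist(strsplit(text, "\\s+")))

passes_filter(hits, total.words, "0.2%")  # percentage of all words
passes_filter(hits, total.words, 5)       # absolute number of hits
```

With ignore.case.fw = FALSE the capitalized "Cohort" in the second line would no longer be counted, which is why the examples in this manual pair literal filter words with ignore.case.fw = TRUE.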

See Also

PDE_extr_data_from_pdfs

Examples

## Running a simple analysis with filter and search words to extract sentences
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2txt_searchandfilter(
    pdf = paste0(system.file(package = "PDE"),
                 "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/MTX_txt+-0/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE)
}

## Running an advanced analysis with filter and search words to
## extract sentences and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2txt_searchandfilter(
    pdf = paste0(system.file(package = "PDE"),
                 "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/MTX_txt+-1/"),
    context = 1,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    filter.word.times = "0.2%",
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.txt.doc.file = TRUE,
    cpy_mv = "nocpymv",
    delete = TRUE)
}
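
The effect of the context argument used in the advanced example can be pictured with a short base-R sketch. It assumes the text has already been split into sentences and uses a hypothetical helper, extract_with_context, that is not part of the PDE package; the package's own sentence detection and output format may differ.

```r
# Hypothetical helper for illustration; not part of the PDE package.
# Returns every sentence that matches a search word, together with `context`
# neighbouring sentences before and after it.
extract_with_context <- function(sentences, search.words, context = 0,
                                 regex.sw = TRUE, ignore.case.sw = FALSE) {
  hit <- rep(FALSE, length(sentences))
  for (word in search.words) {
    if (regex.sw) {
      hit <- hit | grepl(word, sentences, ignore.case = ignore.case.sw)
    } else if (ignore.case.sw) {
      hit <- hit | grepl(tolower(word), tolower(sentences), fixed = TRUE)
    } else {
      hit <- hit | grepl(word, sentences, fixed = TRUE)
    }
  }
  idx <- which(hit)
  if (length(idx) == 0) return(character(0))
  # Expand each hit into a window and clip it to the document boundaries.
  keep <- unique(unlist(lapply(idx, function(i) {
    seq(max(1, i - context), min(length(sentences), i + context))
  })))
  sentences[sort(keep)]
}

sentences <- c("Patients were recruited in 2015.",
               "All received methotrexate weekly.",
               "Folic acid was co-prescribed.",
               "Follow-up lasted two years.",
               "Adverse events were recorded.",
               "Methotrexate was well tolerated.",
               "No patient withdrew.")

# context = 0: only the two sentences that mention the search word
extract_with_context(sentences, "(M|m)ethotrexate", context = 0)

# context = 1: each hit plus one sentence before and after it
extract_with_context(sentences, "(M|m)ethotrexate", context = 1)
```

Overlapping windows are merged in this sketch, so a sentence is returned only once even if it lies within the context of two hits.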

Browsing the PDE (PDF Data Extractor) analyzer results.

Description

PDE_reader_i allows the user-friendly visualization and quick processing of the obtained results.

Usage

PDE_reader_i(verbose = TRUE)

Arguments

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.

Note

A detailed description of the elements in the user interface can be found in the markdown file (README_PDE.md).

Examples

 PDE_reader_i()
