| Type: | Package |
| Title: | Extract Tables and Sentences from PDFs with User Interface |
| Version: | 1.4.10 |
| Author: | Erik Stricker [aut, cre] |
| Maintainer: | Erik Stricker <erik.stricker@gmx.com> |
| Description: | The PDE (Pdf Data Extractor) allows the extraction of information and tables optionally based on search words from PDF (Portable Document Format) files and enables the visualization of the results, both by providing a convenient user-interface. |
| License: | GPL-3 | file LICENSE |
| Encoding: | UTF-8 |
| Imports: | tcltk |
| Depends: | tcltk2 (≥ 1.2.11), R (≥ 3.5) |
| SystemRequirements: | XPDF(4.02)(https://github.com/erikstricker/PDE/tree/master/inst/examples/bin) |
| RoxygenNote: | 7.3.1 |
| Suggests: | knitr, rmarkdown |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2024-06-11 17:25:29 UTC; Erik |
| Repository: | CRAN |
| Date/Publication: | 2024-06-11 18:10:06 UTC |
PDE: Extract Tables and Sentences from PDF Files.
Description
The package includes two main components: 1) The PDE analyzer performs the sentence and table extraction, while 2) the PDE reader allows the user-friendly visualization and quick processing of the obtained results.
PDE functions
PDE_analyzer, PDE_analyzer_i, PDE_extr_data_from_pdfs, PDE_pdfs2table, PDE_pdfs2table_searchandfilter, PDE_pdfs2txt_searchandfilter, PDE_reader_i, PDE_install_Xpdftools4.02, PDE_check_Xpdf_install
Extracting data from a PDF (Portable Document Format) file
Description
.PDE_extr_data_from_pdf extracts sentences or tables from a single PDF file and writes the output to the corresponding folder.
Usage
.PDE_extr_data_from_pdf(pdf, whattoextr, out = ".", filter.words = "", regex.fw = TRUE, ignore.case.fw = FALSE, filter.word.times = "0.2%", table.heading.words = "", ignore.case.th = FALSE, search.words, search.word.categories = NULL, save.tab.by.category = FALSE, regex.sw = TRUE, ignore.case.sw = FALSE, eval.abbrevs = TRUE, out.table.format = ".csv (WINDOWS-1252)", dev_x = 20, dev_y = 9999, context = 0, write.table.locations = FALSE, exp.nondetc.tabs = TRUE, write.tab.doc.file = TRUE, write.txt.doc.file = TRUE, delete = TRUE, cpy_mv = "nocpymv", verbose = TRUE)
Arguments
pdf | String. Path to the PDF file to be analyzed. |
whattoextr | String. Either txt, tab, or tabandtxt for PDFS2TXT (extract sentences from a PDF file) or PDFS2TABLE (table of a PDF file to a Microsoft Excel file) extraction. tab allows the extraction of tables with and without search words, while txt and tabandtxt require search words. |
out | String. Directory chosen to save analysis results in. Default: |
filter.words | List of strings. The list of filter words. If not |
regex.fw | Logical. If TRUE, filter words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.fw | Logical. Are the filter words case-sensitive (does capitalization matter)? Default: |
filter.word.times | Numeric or string. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding the " |
table.heading.words | List of strings. Headings to be detected that differ from the standard ones (TABLE, TAB or table plus number). Regex rules apply (see also https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.th | Logical. Are the additional table headings (see |
search.words | List of strings. List of search words. To extract all tables from the PDF file leave |
search.word.categories | List of strings. List of categories with the same length as the list of search words. Accordingly, each search word can be assigned to a category, of which the word counts will be summarized in the |
save.tab.by.category | Logical. Can only be used with search.word.categories. If set to TRUE, tables that carry search words will be saved in sub-folders according to the search word category of the detected search word. Default: |
regex.sw | Logical. If TRUE, search words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.sw | Logical. Are the search words case-sensitive (does capitalization matter)? Default: |
eval.abbrevs | Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: |
out.table.format | String. Output file format. Either comma separated file |
dev_x | Numeric. For a table, the size of the indentation which would be considered the same column. Default: |
dev_y | Numeric. For a table, the vertical distance which would be considered the same row. Can be either a number or set to dynamic detection [9999], in which case the font size is used to detect which words are in the same row. Default: |
context | Numeric. Number of sentences extracted before and after the sentence with the detected search word. If |
write.table.locations | Logical. If |
exp.nondetc.tabs | Logical. If |
write.tab.doc.file | Logical. If |
write.txt.doc.file | Logical. If |
delete | Logical. If |
cpy_mv | String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: |
verbose | Logical. Indicates whether messages will be printed in the console. Default: |
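The difference between regex and literal search words (regex.sw) and the effect of ignoring capitalization (ignore.case.sw) can be illustrated with base R alone. The following is a minimal sketch that uses grepl() on made-up sentences; it is not the package's internal matching code, and the variable names are illustrative assumptions.

```r
# Illustrative only: base R pattern matching, not PDE's internal implementation.
sentences <- c(
  "Methotrexate was given weekly.",
  "Patients received methotrexate or placebo.",
  "The literal pattern (M|m)ethotrexate appears here."
)
search_word <- "(M|m)ethotrexate"

# Analogue of regex.sw = TRUE: the search word is a regular expression
regex_hits <- grepl(search_word, sentences)

# Analogue of regex.sw = FALSE: the search word is matched character by character
fixed_hits <- grepl(search_word, sentences, fixed = TRUE)

# Analogue of ignore.case.sw = TRUE: capitalization is ignored
nocase_hits <- grepl("methotrexate", sentences, ignore.case = TRUE)

print(regex_hits)
print(fixed_hits)
print(nocase_hits)
```

With regex matching, the pattern (M|m)ethotrexate finds both capitalizations, whereas literal matching only finds the pattern text itself.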
Value
If tables were extracted from the PDF file, the function returns a list of the following tables/items: 1) htmltablelines, 2) txttablelines, 3) keeplayouttxttablelines, 4) id, 5) out_msg. The tablelines are tables that provide the heading and position of the detected tables. The id provides the name of the PDF file. The out_msg includes all messages printed to the console, or the suppressed messages if verbose=FALSE.
See Also
PDE_pdfs2table, PDE_pdfs2table_searchandfilter, PDE_pdfs2txt_searchandfilter
Examples
## Running a simple analysis with filter and search words to extract sentences and tables
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- .PDE_extr_data_from_pdf(
    pdf = paste0(system.file(package = "PDE"), "/examples/Methotrexate/29973177_!.pdf"),
    whattoextr = "tabandtxt",
    out = paste0(system.file(package = "PDE"), "/examples/MTX_output+-0_test/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE,
    regex.fw = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    ignore.case.sw = FALSE,
    regex.sw = TRUE)
}
## Running an advanced analysis with filter and search words to
## extract sentences and tables and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- .PDE_extr_data_from_pdf(
    pdf = paste0(system.file(package = "PDE"), "/examples/Methotrexate/29973177_!.pdf"),
    whattoextr = "tabandtxt",
    out = paste0(system.file(package = "PDE"), "/examples/MTX_output+-1_test/"),
    context = 1,
    dev_x = 20,
    dev_y = 9999,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE,
    regex.fw = FALSE,
    filter.word.times = "0.2%",
    table.heading.words = "",
    ignore.case.th = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    ignore.case.sw = FALSE,
    regex.sw = TRUE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = TRUE,
    write.tab.doc.file = TRUE,
    write.txt.doc.file = TRUE,
    exp.nondetc.tabs = TRUE,
    cpy_mv = "nocpymv",
    delete = TRUE)
}
Deprecated functions in package ‘PDE’
Description
These functions are provided for compatibility with older versions of ‘PDE’ only, and will be defunct at the next release.
Details
The following functions are deprecated and will be made defunct; use the replacement indicated below:
PDE_path:
system.file(package = "PDE")
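As a minimal sketch of the replacement, the package directory can be resolved with system.file() and combined with a relative path. The example PDF path is taken from the examples in this manual; the variable names are illustrative assumptions.

```r
# system.file() returns "" when the package cannot be found in the library
pde_root <- system.file(package = "PDE")

if (nzchar(pde_root)) {
  # Build the path to one of the example PDFs referenced in this manual
  example_pdf <- file.path(pde_root, "examples", "Methotrexate", "29973177_!.pdf")
  message("Example PDF exists: ", file.exists(example_pdf))
} else {
  message("PDE does not appear to be installed in this library.")
}
```

Checking nzchar() first avoids building paths relative to an empty string when the package is missing.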
Extracting data from PDF (Portable Document Format) files
Description
The PDE_analyzer allows the sentence and table extraction from multiple PDF files.
Usage
PDE_analyzer(PDE_parameters_file_path = NA, verbose = TRUE)
Arguments
PDE_parameters_file_path | String. This file includes all parameters to run |
verbose | Logical. Indicates whether messages will be printed in the console. Default: |
Value
If tables were extracted from the PDF file, the function returns a list of the following tables/items: 1) htmltablelines, 2) txttablelines, 3) keeplayouttxttablelines, 4) id, 5) out_msg. The tablelines are tables that provide the heading and position of the detected tables. The id provides the name of the PDF file. The out_msg includes all messages printed to the console, or the suppressed messages if verbose=FALSE.
Details
The parameter file (also referred to as the .tsv file) can be filled in either manually or with the help of the PDE_analyzer_i interface.
Note
A detailed description of the parameters in the TSV file can be found in the markdown file (README_PDE.md) and in the description of PDE_extr_data_from_pdfs.
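To find a parameter file to start from, the example TSV files shipped with the package can be listed. This is a small sketch that assumes the examples/tsvs folder used in the Examples of this manual; it returns an empty vector if the package or the folder is not present.

```r
# Locate the folder with example parameter files; "" if PDE is not installed
tsv_dir <- system.file("examples", "tsvs", package = "PDE")

tsv_files <- if (nzchar(tsv_dir)) {
  list.files(tsv_dir, pattern = "\\.tsv$", full.names = TRUE)
} else {
  character(0)
}

# Show the file names only; any of the full paths could be used as
# PDE_parameters_file_path
print(basename(tsv_files))
```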
Examples
if(PDE_check_Xpdf_install() == TRUE){
  PDE_analyzer(paste0(system.file(package = "PDE"),
                      "/examples/tsvs/PDE_parameters_v1.4_all_files+-0.tsv"))
}
## Not run:
## requires user file choice:
PDE_analyzer()
## End(Not run)
Extracting data from PDF (Portable Document Format) files using a user interface
Description
The PDE_analyzer_i provides a user interface for the sentence and table extraction from multiple PDF files.
Usage
PDE_analyzer_i(verbose = TRUE)
Arguments
verbose | Logical. Indicates whether messages will be printed in the console. Default: |
Note
A detailed description of the elements in the user interface can be found in the markdown file (README_PDE.md).
Examples
PDE_analyzer_i()
Check if the Xpdftools are installed and in the system path
Description
PDE_check_Xpdf_install runs a version test for pdftotext, pdftohtml and pdftopng.
Usage
PDE_check_Xpdf_install(sysname = NULL, verbose = TRUE)
Arguments
sysname | String. In case the function returns "Unknown OS", the sysname can be set manually. Allowed options are "Windows", "Linux", "SunOS" for Solaris, and "Darwin" for Mac. Default: |
verbose | Logical. Indicates whether messages will be printed in the console. Default: |
Value
The function returns a Boolean for the installation status and a message in case the commands are not detected.
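A rough, independent way to see whether the three command line tools are reachable is to look them up on the system path with base R. This sketch only checks the path; it does not run the version test that PDE_check_Xpdf_install performs, and it is not the package's implementation.

```r
# The three Xpdf command line tools named in the Description above
tools <- c("pdftotext", "pdftohtml", "pdftopng")

# Sys.which() returns a named character vector; "" means "not found on the path"
tool_paths <- Sys.which(tools)
found <- nzchar(tool_paths)

print(data.frame(tool = tools, found = unname(found), path = unname(tool_paths)))

if (!all(found)) {
  message("Not on the system path: ", paste(tools[!found], collapse = ", "))
}
```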
Examples
PDE_check_Xpdf_install()
Extracting data from PDF (Portable Document Format) files
Description
PDE_extr_data_from_pdfs extracts sentences or tables from one or more PDF files and writes the output to the corresponding folder.
Usage
PDE_extr_data_from_pdfs(pdfs, whattoextr, out = ".", filter.words = "", regex.fw = TRUE, ignore.case.fw = FALSE, filter.word.times = "0.2%", table.heading.words = "", ignore.case.th = FALSE, search.words, search.word.categories = NULL, regex.sw = TRUE, save.tab.by.category = FALSE, ignore.case.sw = FALSE, eval.abbrevs = TRUE, out.table.format = ".csv (WINDOWS-1252)", dev_x = 20, dev_y = 9999, context = 0, write.table.locations = FALSE, exp.nondetc.tabs = TRUE, write.tab.doc.file = TRUE, write.txt.doc.file = TRUE, delete = TRUE, cpy_mv = "nocpymv", verbose = TRUE)
Arguments
pdfs | String. A list of paths to the PDF files to be analyzed. |
whattoextr | String. Either txt, tab, or tabandtxt for PDFS2TXT (extract sentences from a PDF file) or PDFS2TABLE (table of a PDF file to a Microsoft Excel file) extraction. tab allows the extraction of tables with and without search words, while txt and tabandtxt require search words. |
out | String. Directory chosen to save analysis results in. Default: |
filter.words | List of strings. The list of filter words. If not |
regex.fw | Logical. If TRUE, filter words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.fw | Logical. Are the filter words case-sensitive (does capitalization matter)? Default: |
filter.word.times | Numeric or string. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding the " |
table.heading.words | List of strings. Headings to be detected that differ from the standard ones (TABLE, TAB or table plus number). Regex rules apply (see also https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.th | Logical. Are the additional table headings (see |
search.words | List of strings. List of search words. To extract all tables from the PDF files leave |
search.word.categories | List of strings. List of categories with the same length as the list of search words. Accordingly, each search word can be assigned to a category, of which the word counts will be summarized in the |
regex.sw | Logical. If TRUE, search words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
save.tab.by.category | Logical. Can only be used with search.word.categories. If set to TRUE, tables that carry search words will be saved in sub-folders according to the search word category of the detected search word. Default: |
ignore.case.sw | Logical. Are the search words case-sensitive (does capitalization matter)? Default: |
eval.abbrevs | Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: |
out.table.format | String. Output file format. Either comma separated file |
dev_x | Numeric. For a table, the size of the indentation which would be considered the same column. Default: |
dev_y | Numeric. For a table, the vertical distance which would be considered the same row. Can be either a number or set to dynamic detection [9999], in which case the font size is used to detect which words are in the same row. Default: |
context | Numeric. Number of sentences extracted before and after the sentence with the detected search word. If |
write.table.locations | Logical. If |
exp.nondetc.tabs | Logical. If |
write.tab.doc.file | Logical. If |
write.txt.doc.file | Logical. If |
delete | Logical. If |
cpy_mv | String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: |
verbose | Logical. Indicates whether messages will be printed in the console. Default: |
Value
If tables were extracted from the PDF file, the function returns a list of the following tables/items: 1) htmltablelines, 2) txttablelines, 3) keeplayouttxttablelines, 4) id, 5) out_msg. The tablelines are tables that provide the heading and position of the detected tables. The id provides the name of the PDF file. The out_msg includes all messages printed to the console, or the suppressed messages if verbose=FALSE.
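The filter.word.times argument accepts either an absolute number or a percentage of the total word count. How such a value could be turned into a word-count threshold is sketched below; filter_threshold is a hypothetical helper written for illustration, and the package's actual rule for applying the threshold may differ.

```r
# Hypothetical helper (not part of PDE): convert filter.word.times into a
# threshold expressed as a number of words.
filter_threshold <- function(filter.word.times, total_words) {
  if (is.character(filter.word.times) && grepl("%$", filter.word.times)) {
    # Percentage form such as "0.2%": scale by the total number of words
    pct <- as.numeric(sub("%$", "", filter.word.times))
    total_words * pct / 100
  } else {
    # Absolute form: the value is already a word count
    as.numeric(filter.word.times)
  }
}

# For a 5000-word document, "0.2%" corresponds to about 10 filter-word hits
pct_threshold <- filter_threshold("0.2%", total_words = 5000)
abs_threshold <- filter_threshold(3, total_words = 5000)
print(c(percent = pct_threshold, absolute = abs_threshold))
```

Expressing the threshold as a percentage keeps it comparable between short and long documents, whereas an absolute number favors longer files.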
See Also
PDE_pdfs2table, PDE_pdfs2table_searchandfilter, PDE_pdfs2txt_searchandfilter
Examples
## Running a simple analysis with filter and search words to extract sentences and tables
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_extr_data_from_pdfs(
    pdfs = c(paste0(system.file(package = "PDE"), "/examples/Methotrexate/29973177_!.pdf"),
             paste0(system.file(package = "PDE"), "/examples/Methotrexate/31083238_!.pdf")),
    whattoextr = "tabandtxt",
    out = paste0(system.file(package = "PDE"), "/examples/MTX_output+-0_test/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE,
    regex.fw = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    ignore.case.sw = FALSE,
    regex.sw = TRUE)
}
## Running an advanced analysis with filter and search words to
## extract sentences and tables and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_extr_data_from_pdfs(
    pdfs = c(paste0(system.file(package = "PDE"), "/examples/Methotrexate/29973177_!.pdf"),
             paste0(system.file(package = "PDE"), "/examples/Methotrexate/31083238_!.pdf")),
    whattoextr = "tabandtxt",
    out = paste0(system.file(package = "PDE"), "/examples/MTX_output+-1_test/"),
    context = 1,
    dev_x = 20,
    dev_y = 9999,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE,
    regex.fw = FALSE,
    filter.word.times = "0.2%",
    table.heading.words = "",
    ignore.case.th = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = TRUE,
    write.tab.doc.file = TRUE,
    write.txt.doc.file = TRUE,
    exp.nondetc.tabs = TRUE,
    cpy_mv = "nocpymv",
    delete = TRUE)
}
Install the Xpdf command line tools 4.02
Description
PDE_install_Xpdftools4.02 downloads and installs the XPDF command line tools 4.02.
Usage
PDE_install_Xpdftools4.02(sysname = NULL, bin = NULL, verbose = TRUE, permission = 0)
Arguments
sysname | String. In case the function returns "Unknown OS", the sysname can be set manually. Allowed options are "Windows", "Linux", "SunOS" for Solaris, and "Darwin" for Mac. Default: |
bin | String. In case the function returns "Unknown OS", the bin of the operating system can be set manually. Allowed options are "64" and "32". Default: |
verbose | Logical. Indicates whether messages will be printed in the console. Default: |
permission | Numerical. If set to 0, the user is asked for permission to download Xpdftools. If set to 1, no user input is required. Default: |
Value
The function returns a Boolean for the installation status and a message in case the commands are not installed.
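When the operating system is reported as "Unknown OS", suitable values for sysname and bin can be looked up with base R before setting them manually. The following sketch is one way to do that and is an assumption about a reasonable workflow, not a description of how the package detects the platform.

```r
# Sys.info() may be NULL on platforms where the information is unavailable
info <- Sys.info()
sysname <- if (is.null(info)) NA_character_ else info[["sysname"]]

# Pointer size of the running R build: 8 bytes means 64-bit, otherwise 32-bit
bin <- if (.Machine$sizeof.pointer == 8) "64" else "32"

# Values listed as allowed for the sysname argument
supported <- c("Windows", "Linux", "SunOS", "Darwin")
if (is.na(sysname) || !(sysname %in% supported)) {
  message("sysname '", sysname, "' is not one of: ", paste(supported, collapse = ", "))
}

print(c(sysname = sysname, bin = bin))
```

The two values could then be passed as the sysname and bin arguments of PDE_install_Xpdftools4.02.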
Examples
## Not run:
PDE_install_Xpdftools4.02()
## End(Not run)
Export the installation path of the PDE (PDF Data Extractor) package
Description
PDE_path is deprecated. Please run system.file(package = "PDE") instead.
Usage
PDE_path()
Value
The function returns a potential path for the PDE package. If the PDE tool was not correctly installed it returns "".
Extracting all tables from a PDF (Portable Document Format) file
Description
PDE_pdfs2table extracts all tables from one or more PDF files and writes the output to the corresponding folder.
Usage
PDE_pdfs2table(pdfs, out = ".", table.heading.words = "", ignore.case.th = FALSE, out.table.format = ".csv (WINDOWS-1252)", dev_x = 20, dev_y = 9999, write.table.locations = FALSE, exp.nondetc.tabs = TRUE, delete = TRUE, verbose = TRUE)
Arguments
pdfs | String. A list of paths to the PDF files to be analyzed. |
out | String. Directory chosen to save tables in. Default: |
table.heading.words | List of strings. Headings to be detected that differ from the standard ones (TABLE, TAB or table plus number). Regex rules apply (see also https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.th | Logical. Are the additional table headings (see |
out.table.format | String. Output file format. Either comma separated file |
dev_x | Numeric. For a table, the size of the indentation which would be considered the same column. Default: |
dev_y | Numeric. For a table, the vertical distance which would be considered the same row. Can be either a number or set to dynamic detection [9999], in which case the font size is used to detect which words are in the same row. Default: |
write.table.locations | Logical. If |
exp.nondetc.tabs | Logical. If |
delete | Logical. If |
verbose | Logical. Indicates whether messages will be printed in the console. Default: |
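The role of a horizontal tolerance such as dev_x can be pictured as grouping word positions whose left edges lie close together into one column. The sketch below uses a simple gap-based grouping on made-up coordinates; assign_columns is a hypothetical helper, and the package's own column detection may work differently.

```r
# Hypothetical helper (not part of PDE): words whose sorted x positions are
# separated by at most dev_x are assigned to the same column.
assign_columns <- function(x, dev_x = 20) {
  ord <- order(x)
  gaps <- diff(x[ord])
  # Start a new column whenever the gap to the previous position exceeds dev_x
  col_sorted <- cumsum(c(1, gaps > dev_x))
  cols <- numeric(length(x))
  cols[ord] <- col_sorted
  cols
}

# Made-up left x coordinates of seven words on a page
x_left <- c(72, 75, 210, 205, 400, 74, 395)
column_id <- assign_columns(x_left, dev_x = 20)
print(column_id)
```

A larger dev_x merges slightly misaligned cells into one column, while a smaller value splits them into separate columns.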
See Also
PDE_extr_data_from_pdfs, PDE_pdfs2table_searchandfilter
Examples
## Running a simple table extraction
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2table(
    pdfs = paste0(system.file(package = "PDE"), "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"))
}
## Running the same table extraction as above with all parameters shown
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2table(
    pdfs = paste0(system.file(package = "PDE"), "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    dev_x = 20,
    dev_y = 9999,
    table.heading.words = "",
    ignore.case.th = FALSE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = FALSE,
    exp.nondetc.tabs = FALSE,
    delete = TRUE)
}
Extracting tables from a PDF (Portable Document Format) file
Description
PDE_pdfs2table_searchandfilter extracts tables from one or more PDF files according to filter and search words and writes the output to the corresponding folder.
Usage
PDE_pdfs2table_searchandfilter(pdfs, out = ".", filter.words = "", regex.fw = TRUE, ignore.case.fw = FALSE, filter.word.times = "0.2%", table.heading.words = "", ignore.case.th = FALSE, search.words, search.word.categories = NULL, save.tab.by.category = FALSE, regex.sw = TRUE, ignore.case.sw = FALSE, eval.abbrevs = TRUE, out.table.format = ".csv (WINDOWS-1252)", dev_x = 20, dev_y = 9999, write.table.locations = FALSE, exp.nondetc.tabs = TRUE, write.tab.doc.file = TRUE, delete = TRUE, cpy_mv = "nocpymv", verbose = TRUE)
Arguments
pdfs | String. A list of paths to the PDF files to be analyzed. |
out | String. Directory chosen to save analysis results in. Default: |
filter.words | List of strings. The list of filter words. If not |
regex.fw | Logical. If TRUE, filter words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.fw | Logical. Are the filter words case-sensitive (does capitalization matter)? Default: |
filter.word.times | Numeric or string. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding the " |
table.heading.words | List of strings. Headings to be detected that differ from the standard ones (TABLE, TAB or table plus number). Regex rules apply (see also https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.th | Logical. Are the additional table headings (see |
search.words | List of strings. List of search words. To extract all tables from the PDF file leave |
search.word.categories | List of strings. List of categories with the same length as the list of search words. Accordingly, each search word can be assigned to a category, of which the word counts will be summarized in the |
save.tab.by.category | Logical. Can only be used with search.word.categories. If set to TRUE, tables that carry search words will be saved in sub-folders according to the search word category of the detected search word. Default: |
regex.sw | Logical. If TRUE, search words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.sw | Logical. Are the search words case-sensitive (does capitalization matter)? Default: |
eval.abbrevs | Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: |
out.table.format | String. Output file format. Either comma separated file |
dev_x | Numeric. For a table, the size of the indentation which would be considered the same column. Default: |
dev_y | Numeric. For a table, the vertical distance which would be considered the same row. Can be either a number or set to dynamic detection [9999], in which case the font size is used to detect which words are in the same row. Default: |
write.table.locations | Logical. If |
exp.nondetc.tabs | Logical. If |
write.tab.doc.file | Logical. If |
delete | Logical. If |
cpy_mv | String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: |
verbose | Logical. Indicates whether messages will be printed in the console. Default: |
Value
If tables were extracted from the PDF file, the function returns a list of the following tables/items: 1) htmltablelines, 2) txttablelines, 3) keeplayouttxttablelines, 4) id, 5) out_msg. The tablelines are tables that provide the heading and position of the detected tables. The id provides the name of the PDF file. The out_msg includes all messages printed to the console, or the suppressed messages if verbose=FALSE.
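The pairing of search.words with search.word.categories (two vectors of equal length) can be illustrated by counting matches per search word and summing them per category. This is a base R sketch on a made-up sentence; the category labels and the helper count_matches are illustrative assumptions and not the package's summary code.

```r
search.words <- c("(M|m)ethotrexate", "(T|t)rexal", "(R|r)heumatrex", "(O|o)trexup")
# One category per search word, in the same order (labels are made up)
search.word.categories <- c("generic", "brand", "brand", "brand")

text <- paste(
  "Methotrexate (Trexall, Otrexup) was compared with methotrexate plus placebo;",
  "methotrexate doses varied."
)

# Hypothetical helper: number of regex matches of one pattern in a string
count_matches <- function(pattern, text) {
  m <- gregexpr(pattern, text)[[1]]
  if (m[1] == -1) 0L else length(m)
}

counts <- vapply(search.words, count_matches, integer(1), text = text)
by_category <- tapply(counts, search.word.categories, sum)

print(counts)
print(by_category)
```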
See Also
PDE_extr_data_from_pdfs, PDE_pdfs2table
Examples
## Running a simple analysis with filter and search words to extract tables
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2table_searchandfilter(
    pdfs = paste0(system.file(package = "PDE"), "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE)
}
## Running an advanced analysis with filter and search words to
## extract tables and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2table_searchandfilter(
    pdfs = paste0(system.file(package = "PDE"), "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    dev_x = 20,
    dev_y = 9999,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    filter.word.times = "0.2%",
    table.heading.words = "",
    ignore.case.th = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = TRUE,
    write.tab.doc.file = TRUE,
    exp.nondetc.tabs = TRUE,
    cpy_mv = "nocpymv",
    delete = TRUE)
}
Extracting sentences from a PDF (Portable Document Format) file
Description
PDE_pdfs2txt_searchandfilter extracts sentences from one or more PDF files according to search and filter words and writes the output to the corresponding folder.
Usage
PDE_pdfs2txt_searchandfilter(pdfs, out = ".", filter.words = "", regex.fw = TRUE, ignore.case.fw = FALSE, filter.word.times = "0.2%", search.words, search.word.categories = NULL, regex.sw = TRUE, ignore.case.sw = FALSE, eval.abbrevs = TRUE, out.table.format = ".csv (WINDOWS-1252)", context = 0, write.txt.doc.file = TRUE, delete = TRUE, cpy_mv = "nocpymv", verbose = TRUE)
Arguments
pdfs | String. A list of paths to the PDF files to be analyzed. |
out | String. Directory chosen to save analysis results in. Default: |
filter.words | List of strings. The list of filter words. If not |
regex.fw | Logical. If TRUE, filter words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.fw | Logical. Are the filter words case-sensitive (does capitalization matter)? Default: |
filter.word.times | Numeric or string. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding the " |
search.words | List of strings. List of search words. |
search.word.categories | List of strings. List of categories with the same length as the list of search words. Accordingly, each search word can be assigned to a category, of which the word counts will be summarized in the |
regex.sw | Logical. If TRUE, search words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default = |
ignore.case.sw | Logical. Are the search words case-sensitive (does capitalization matter)? Default: |
eval.abbrevs | Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: |
out.table.format | String. Output file format. Either comma separated file |
context | Numeric. Number of sentences extracted before and after the sentence with the detected search word. If |
write.txt.doc.file | Logical. If |
delete | Logical. If |
cpy_mv | String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: |
verbose | Logical. Indicates whether messages will be printed in the console. Default: |
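The context argument is described as the number of sentences taken before and after each sentence that contains a search word. A base R sketch of that idea on made-up sentences follows; extract_with_context is a hypothetical helper, and the package's own sentence splitting and output format are not reproduced here.

```r
# Hypothetical helper (not part of PDE): return each matching sentence
# together with `context` sentences on either side.
extract_with_context <- function(sentences, pattern, context = 0) {
  hits <- grep(pattern, sentences)
  lapply(hits, function(i) {
    from <- max(1, i - context)
    to <- min(length(sentences), i + context)
    sentences[from:to]
  })
}

sentences <- c(
  "Background was reviewed.",
  "Patients received methotrexate.",
  "Outcomes were recorded.",
  "Follow-up lasted one year."
)

res0 <- extract_with_context(sentences, "(M|m)ethotrexate", context = 0)
res1 <- extract_with_context(sentences, "(M|m)ethotrexate", context = 1)
print(res0)
print(res1)
```

With context = 0 only the matching sentence is returned; with context = 1 the neighboring sentences are included, clipped at the start and end of the text.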
Examples
## Running a simple analysis with filter and search words to extract sentences
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2txt_searchandfilter(
    pdfs = paste0(system.file(package = "PDE"), "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/MTX_txt+-0/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE)
}
## Running an advanced analysis with filter and search words to
## extract sentences and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2txt_searchandfilter(
    pdfs = paste0(system.file(package = "PDE"), "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/MTX_txt+-1/"),
    context = 1,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    filter.word.times = "0.2%",
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.txt.doc.file = TRUE,
    cpy_mv = "nocpymv",
    delete = TRUE)
}
Browsing the PDE (PDF Data Extractor) analyzer results.
Description
The PDE_reader_i allows the user-friendly visualization and quick processing of the obtained results.
Usage
PDE_reader_i(verbose = TRUE)
Arguments
verbose | Logical. Indicates whether messages will be printed in the console. Default: |
Note
A detailed description of the elements in the user interface can be found in the markdown file (README_PDE.md).
Examples
PDE_reader_i()