pdfsearch

This package defines a few useful functions for keyword searching using the pdftools package developed by rOpenSci.

The package can be installed from CRAN directly:

install.packages("pdfsearch")

To install the development version, use devtools:

install.packages("devtools")
devtools::install_github('lebebr01/pdfsearch')

Basic Usage

There are currently two user-facing functions in this package. The first, `keyword_search`, takes a single PDF and searches it for keywords. The second, `keyword_directory`, does the same search over a directory of PDFs.

Example with `keyword_search`

The package comes with two PDF files from arXiv to use as test cases. Below is an example of using the `keyword_search` function.

library(pdfsearch)
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
result <- keyword_search(file, keyword = c('measurement', 'error'), path = TRUE)
head(result$line_text, n = 2)

## [[1]]
## [1] "Reiter, Maria DeYoreo∗ arXiv:1610.00147v1 [stat.ME] 1 Oct 2016 Abstract Often in surveys, key items are subject to measurement errors. "
## 
## [[2]]
## [1] "In some settings, however, analysts have access to a data source on different individuals with high quality measurements of the error-prone survey items. "

The location of the keyword match, including page number and line number, the actual line of text, and a tokenized version of the text (raw text split by individual words) are returned by default.
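As a small sketch of how the returned object might be inspected, the snippet below lists the columns rather than assuming their names. Only `line_text` is taken from the example above; anything else printed is whatever the installed version of the package returns.

```r
library(pdfsearch)

# Bundled example PDF shipped with the package
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

result <- keyword_search(file,
                         keyword = c('measurement', 'error'),
                         path = TRUE)

# List the columns that were actually returned
names(result)

# Look at the first few matches, one row per keyword hit
head(result, n = 3)
```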

In addition, by default, a word hyphenated at the end of a line is combined with its continuation at the start of the next line. If this behavior is not wanted, set the `remove_hyphen` argument to `FALSE`.
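A minimal sketch of turning this off, using the bundled PDF from the example above. The object names `joined` and `raw` are only illustrative, and whether the two searches return a different number of matches depends on the document.

```r
library(pdfsearch)

# Bundled example PDF shipped with the package
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

# Default behavior: line-end hyphens are joined with the next line
joined <- keyword_search(file, keyword = 'measurement', path = TRUE)

# Keep line-end hyphens exactly as they were extracted
raw <- keyword_search(file, keyword = 'measurement', path = TRUE,
                      remove_hyphen = FALSE)

# Compare how many matches each setting finds
c(joined = nrow(joined), raw = nrow(raw))
```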

Surrounding lines of text

It may be useful to extract not just the line of text in which the keyword is found, but also the surrounding text, to give additional context when reviewing the keyword results. This can be done with the `surround_lines` argument as follows:

file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
result <- keyword_search(file, keyword = c('measurement', 'error'), path = TRUE, surround_lines = 1)
head(result)
head(result$line_text, n = 2)

Example with `keyword_directory`

The `keyword_directory` function allows users to search for keywords in multiple PDF files in one function call. The same functionality from the `keyword_search` function can be invoked, specifically `remove_hyphen` and `surround_lines`. Below is an example of searching a single directory.

directory <- system.file('pdf', package = 'pdfsearch')

# do search over two files
directory_result <- keyword_directory(directory, keyword = c('repeated measures', 'measurement error'), surround_lines = 1)
head(directory_result, n = 2)

A few other useful arguments are available when searching for keywords within multiple PDF files in a directory. One is `recursive` (default `FALSE`): when set to `TRUE`, the function also searches within subdirectories, which it does not do by default. Finally, if the directory has many PDF files, it may be worth testing the function on a handful of them first. The number of PDF files can be limited with the `max_search` argument, which takes a positive integer giving the number of PDF files to search. For example, if `max_search = 2`, only the first two PDF files in the directory will be searched.
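The two arguments can be combined in a single call, sketched here against the bundled example directory. That directory holds only the two test PDFs and no subdirectories, so here `max_search = 2` covers every file and `recursive = TRUE` changes nothing. With a larger, nested directory of your own the same call would descend into subfolders and stop after the first two files. The object name `limited_result` is only illustrative.

```r
library(pdfsearch)

# Directory holding the two bundled example PDFs
directory <- system.file('pdf', package = 'pdfsearch')

# Search subdirectories too, but stop after the first two PDF files
limited_result <- keyword_directory(directory,
                                    keyword = c('repeated measures',
                                                'measurement error'),
                                    recursive = TRUE,
                                    max_search = 2)

head(limited_result, n = 2)
```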

Shiny App

The package also has a simple Shiny app that can be launched with the following command:

run_shiny()

Usage in Research

The pdfsearch package may be most useful to those conducting research syntheses or meta-analyses. Users can search for keywords related to a research question, so rather than working through the entire text of each document, they can identify the specific portions of text that need to be examined. This could increase reproducibility and reduce the time needed to collect data for the research synthesis or meta-analysis.

As an example, the package is currently being used to explore the evolution of statistical software and quantitative methods used in published social science research (https://ww2.amstat.org/meetings/jsm/2018/onlineprogram/AbstractDetails.cfm?abstractid=330777). This process involves getting PDF files from published research articles and using pdfsearch to search for specific software and quantitative methods keywords within the research articles. The results of the keyword matches will be explored using research synthesis methods. A pre-print of the paper and slides from the presentation will be posted to the GitHub repo as part of the package later this summer.

About

Search pdf files for keywords.

Releases

4 releases. Latest: JOSS Acceptance, Jun 20, 2018.
