The R package `PGRdup` was developed as a tool to aid genebank managers in the identification of probable duplicate accessions from plant genetic resources (PGR) passport databases.
This package primarily implements a workflow designed to fetch groups or sets of germplasm accessions with similar passport data, particularly in fields associated with accession names, within or across PGR passport databases.
The functions in this package are primarily built using the following R packages:
The package can be installed from CRAN as follows:
```r
# Install from CRAN
install.packages('PGRdup', dependencies=TRUE)
```

The development version can be installed from GitHub as follows:

```r
# Install development version from GitHub
devtools::install_github("aravind-j/PGRdup")
```

The series of steps involved in the workflow, along with the associated functions, are illustrated below:
Function(s) :
- `DataClean`
- `MergeKW`
- `MergePrefix`
- `MergeSuffix`

Use these functions for the appropriate data standardisation of the relevant fields in the passport databases to harmonize punctuation, leading zeros, prefixes, suffixes etc. associated with accession names.
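To make the idea of harmonizing accession names concrete, here is a minimal base R sketch. The helper `clean_name` and the sample strings are hypothetical and only approximate the kind of standardisation described above; this is not the implementation of `DataClean`.

```r
# Hypothetical helper: a rough base R approximation of name standardisation.
# It is NOT the implementation of DataClean().
clean_name <- function(x) {
  x <- toupper(x)                                   # ignore case differences
  x <- gsub("[[:punct:]]+", " ", x)                 # punctuation -> single space
  x <- gsub("\\b0+([0-9])", "\\1", x, perl = TRUE)  # strip leading zeros of numbers
  x <- gsub("[[:space:]]+", " ", x)                 # collapse repeated whitespace
  trimws(x)
}

# Made-up accession names written in inconsistent styles
raw <- c("EC-0021581", "ec 21581", "  IC.000304  ", "Shulamith/NRCG-14555")
cleaned <- clean_name(raw)
print(cleaned)
```

After cleaning, the first two strings become identical, which is what allows later matching steps to treat them as the same accession name.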
Function(s) :
- `KWIC`

Use this function to extract the information in the relevant fields as keywords or text strings in the form of a searchable Keyword in Context (KWIC) index.
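The following base R sketch shows what a keyword-in-context style index can look like for a toy table. The data, the column names and the helper `build_index` are assumptions made for illustration; the object returned by `KWIC` in the package may be structured differently.

```r
# Toy passport data; the column names are illustrative assumptions.
passport <- data.frame(
  ID      = c("A1", "A2", "A3"),
  CollNo  = c("SHULAMITH NRCG 14555", "NRCG 14555", "TMV 2"),
  OtherID = c("EC 21581", "SHULAMITH", "EC 21581"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per keyword, with its record key, source field
# and the full field text (the "context") it was taken from.
build_index <- function(db, key, fields) {
  rows <- lapply(fields, function(f) {
    words <- strsplit(db[[f]], "[[:space:]]+")
    data.frame(
      keyword = unlist(words),
      key     = rep(db[[key]], lengths(words)),
      field   = f,
      context = rep(db[[f]], lengths(words)),
      stringsAsFactors = FALSE
    )
  })
  index <- do.call(rbind, rows)
  index[order(index$keyword), ]
}

index <- build_index(passport, key = "ID", fields = c("CollNo", "OtherID"))
# Look up every record in which a given keyword occurs
print(index[index$keyword == "SHULAMITH", ])
```

Searching the index by keyword returns the records and fields in which it occurs, regardless of which column held it originally.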
Function(s) :
- `ProbDup`

Execute fuzzy, phonetic and semantic matching of keywords to identify probable duplicate sets either within a single KWIC index or between two indexes using this function. For fuzzy matching, the Levenshtein edit distance is used, while for phonetic matching, the Double Metaphone algorithm is used. For semantic matching, synonym sets or 'synsets' of accession names can be supplied as an input and the text strings in such sets will be treated as being identical for matching. Various options to tweak the matching strategies used are also available in this function.
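As an illustration of the fuzzy matching idea only, the sketch below groups keywords by Levenshtein edit distance using `utils::adist` from base R. The keywords and the threshold `max_dist` are made up for this example, and the code does not reproduce how `ProbDup` performs its matching.

```r
# Made-up keywords, three of which are spelling variants of one name
keywords <- c("SHULAMITH", "SHULAMIT", "SULAMITH", "NRCG14555", "TMV2")

# utils::adist() computes the generalized Levenshtein (edit) distance
# between all pairs of strings.
d <- utils::adist(keywords)
dimnames(d) <- list(keywords, keywords)

max_dist <- 2  # illustrative threshold, chosen for this toy example

# For each keyword, collect all keywords (itself included) within the threshold
fuzzy_sets <- lapply(seq_along(keywords), function(i) {
  keywords[d[i, ] <= max_dist]
})
names(fuzzy_sets) <- keywords

# Keep only keywords that matched at least one other keyword
fuzzy_sets <- Filter(function(s) length(s) > 1, fuzzy_sets)
print(fuzzy_sets)
```

A small threshold keeps spelling variants together while leaving unrelated identifiers apart; a larger one retrieves more candidates at the cost of more false matches.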
Function(s) :
- `DisProbDup`
- `ReviewProbDup`
- `ReconstructProbDup`

Inspect, revise and improve the retrieved sets using these functions. If considerable intersections exist between the initially identified sets, then `DisProbDup` may be used to get the disjoint sets. The identified sets may be subjected to clerical review after transforming them into an appropriate spreadsheet format which contains the raw data from the original database(s) using `ReviewProbDup` and subsequently converted back using `ReconstructProbDup`.
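One simple way to think about disjoint sets is to merge any sets that share a member until no overlaps remain. The base R sketch below does this for a toy list; the helper `merge_overlapping` and the accession IDs are hypothetical, and this is an assumption about the general idea, not the procedure implemented in `DisProbDup`.

```r
# Toy probable duplicate sets with overlapping members (made-up IDs)
sets <- list(
  c("A1", "A2"),
  c("A2", "A3"),
  c("B1", "B2"),
  c("C1", "C2", "B2")
)

# Hypothetical helper: repeatedly merge any two sets that share a member
# until every remaining pair of sets is disjoint.
merge_overlapping <- function(sets) {
  merged <- TRUE
  while (merged) {
    merged <- FALSE
    n <- length(sets)
    if (n < 2) break
    for (i in seq_len(n - 1)) {
      for (j in (i + 1):n) {
        if (length(intersect(sets[[i]], sets[[j]])) > 0) {
          sets[[i]] <- sort(union(sets[[i]], sets[[j]]))
          sets[[j]] <- NULL   # drop the set that was absorbed
          merged <- TRUE
          break
        }
      }
      if (merged) break       # restart the scan after every merge
    }
  }
  sets
}

disjoint <- merge_overlapping(sets)
print(disjoint)
```

The restart after each merge keeps the logic easy to follow; for large numbers of sets a graph-based approach would scale better.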
Function(s) :
- `ValidatePrimKey`
- `DoubleMetaphone`
- `ParseProbDup`
- `AddProbDup`
- `SplitProbDup`
- `MergeProbDup`
- `ViewProbDup`
- `KWCounts`
- `read.genesys`

Use these helper functions if needed. `ValidatePrimKey` can be used to check whether a column chosen in a `data.frame` as the primary key/ID conforms to the constraints of absence of duplicates and NULL values.
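The two primary key constraints can be checked in a few lines of base R. In this sketch the helper `check_prim_key` is hypothetical, and treating both `NA` and empty strings as missing values is an assumption; `ValidatePrimKey` may apply different or additional checks.

```r
# Hypothetical helper: report whether a candidate primary key column
# contains missing values or duplicates.
# Assumption: NA and empty strings both count as missing.
check_prim_key <- function(db, prim.key) {
  key <- db[[prim.key]]
  list(
    has_missing    = anyNA(key) ||
      any(trimws(as.character(key)) == "", na.rm = TRUE),
    has_duplicates = anyDuplicated(key) > 0
  )
}

# Made-up examples: one invalid and one valid key column
bad  <- data.frame(ID = c("A1", "A2", "A2", NA), stringsAsFactors = FALSE)
good <- data.frame(ID = c("A1", "A2", "A3"), stringsAsFactors = FALSE)

bad_result  <- check_prim_key(bad, "ID")
good_result <- check_prim_key(good, "ID")
print(bad_result)
print(good_result)
```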
`DoubleMetaphone` is an implementation of the Double Metaphone phonetic algorithm in `R` and is used for phonetic matching.
`ParseProbDup` and `AddProbDup` work with objects of class `ProbDup`. The former can be used to parse the probable duplicate sets in a `ProbDup` object to a `data.frame`, while the latter can be used to add these sets as data fields to the passport databases. `SplitProbDup` can be used to split an object of class `ProbDup` according to set counts. `MergeProbDup` can be used to merge together two objects of class `ProbDup`. `ViewProbDup` can be used to plot the summary visualizations of probable duplicate sets retrieved in an object of class `ProbDup`.
`KWCounts` can be used to compute keyword counts from PGR passport database fields (columns), which can give a rough indication of the completeness of the data.
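Keyword counts of this kind can be approximated with `table` in base R. The toy data and column names below are assumptions, and the sketch only illustrates the idea of counting keywords across fields; it is not the implementation of `KWCounts`.

```r
# Toy passport data with some empty fields (made-up values)
passport <- data.frame(
  CollNo  = c("SHULAMITH NRCG 14555", "NRCG 14555", ""),
  OtherID = c("EC 21581", "", ""),
  stringsAsFactors = FALSE
)
fields <- c("CollNo", "OtherID")

# Split every field into whitespace-separated keywords and drop empty strings
keywords <- unlist(strsplit(unlist(passport[fields]), "[[:space:]]+"))
keywords <- keywords[nzchar(keywords)]

# Frequency of each keyword across all chosen fields, as a named vector
tab <- table(keywords)
counts <- setNames(as.vector(tab), names(tab))
counts <- sort(counts, decreasing = TRUE)
print(counts)
```

Records with empty fields contribute no keywords, so low counts relative to the number of records hint at sparsely filled columns.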
`read.genesys` can be used to import PGR data in a Darwin Core - germplasm zip archive downloaded from the Genesys database into the `R` environment.
- Use `fread` to rapidly read large files instead of `read.csv` or `read.table` in base `R`.
- Use `R`-database interface packages to get the required table as a `data.frame` in `R`.
- The `ProbDup` function can be memory hungry with large passport databases. In such cases, ensure that the system has sufficient memory for smooth functioning (see `?ProbDup`).

For a detailed tutorial (vignette) on how to use this package, type:
```r
browseVignettes(package = 'PGRdup')
```

The vignette for the latest version is also available online.
To know what's new in this version, type:
```r
news(package='PGRdup')
```
To cite the methods in the package use:
```r
citation("PGRdup")
```

```
To cite the R package 'PGRdup' in publications use:

  Aravind, J., Radhamani, J., Kalyani Srinivasan, Ananda Subhash, B., and Tyagi, R. K. (2025). PGRdup: Discover Probable Duplicates in Plant Genetic Resources Collections. R package version 0.2.4.0, https://github.com/aravind-j/PGRdup, https://cran.r-project.org/package=PGRdup.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {PGRdup: Discover Probable Duplicates in Plant Genetic Resources Collections},
    author = {J. Aravind and J. Radhamani and {Kalyani Srinivasan} and B. {Ananda Subhash} and Rishi Kumar Tyagi},
    note = {R package version 0.2.4.0 https://github.com/aravind-j/PGRdup, https://cran.r-project.org/package=PGRdup},
    year = {2025},
  }

This free and open-source software implements academic research by the authors and co-workers. If you use it, please support the project by citing the package.
```