Version: 0.2.4.0; Copyright (C) 2014-2025: ICAR-NBPGR; License: GPL-2 | GPL-3

Aravind, J.¹, Radhamani, J.¹, Kalyani Srinivasan¹, Ananda Subhash, B.², and Tyagi, R. K.¹

¹ ICAR-National Bureau of Plant Genetic Resources, New Delhi, India
² Centre for Development of Advanced Computing, Thiruvananthapuram, Kerala, India

Introduction

The R package PGRdup was developed as a tool to aid genebank managers in the identification of probable duplicate accessions from plant genetic resources (PGR) passport databases.

This package primarily implements a workflow designed to fetch groups or sets of germplasm accessions with similar passport data, particularly in fields associated with accession names, within or across PGR passport databases.

The functions in this package are primarily built using the following R packages:

Installation

The package can be installed from CRAN as follows:

# Install from CRAN
install.packages('PGRdup', dependencies=TRUE)

The development version can be installed from GitHub as follows:

# Install development version from GitHub
devtools::install_github("aravind-j/PGRdup")

Workflow

The series of steps involved in the workflow, along with the associated functions, is illustrated below:

Step 1

Function(s) :

DataClean
MergeKW
MergePrefix
MergeSuffix

Use these functions for the appropriate data standardisation of the relevant fields in the passport databases to harmonize punctuation, leading zeros, prefixes, suffixes etc. associated with accession names.
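As a rough illustration of what such standardisation involves, the sketch below uses only base R. The helper standardise_name is hypothetical and is not how DataClean, MergePrefix or MergeSuffix are implemented; it only upper-cases the names, replaces punctuation with spaces and strips leading zeros, so that variant spellings of the same accession name collapse to a single form.

```r
# Hypothetical helper for illustration only; not the PGRdup implementation.
standardise_name <- function(x) {
  x <- toupper(x)                                        # ignore case differences
  x <- gsub("[[:punct:]]+", " ", x)                      # punctuation -> space
  x <- gsub("(?<![0-9])0+(?=[0-9])", "", x, perl = TRUE) # drop leading zeros of numbers
  x <- gsub("[[:space:]]+", " ", x)                      # collapse repeated whitespace
  trimws(x)                                              # remove outer whitespace
}

# Three spellings of what is assumed to be the same accession name
variants <- c("ec-00123", "EC 123", "Ec.0123 ")
cleaned <- standardise_name(variants)
print(cleaned)
```

After cleaning, all three variants are the same string and can be compared directly.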

Step 2

Function(s) :

KWIC

Use this function to extract the information in the relevant fields as keywords or text strings in the form of a searchable Keyword in Context (KWIC) index.
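The idea of a keyword in context index can be sketched in base R as follows. This is a simplified stand-in, not the structure returned by KWIC: the toy table, its column names and the helper build_index are all assumptions made for the example. Each keyword gets its own row, tagged with the record ID and the full field text it came from.

```r
# Hypothetical toy passport table; column names are illustrative only.
passport <- data.frame(
  ID = c("A1", "A2"),
  Name = c("SHAN TUNG KU", "TUNG SHAN"),
  stringsAsFactors = FALSE
)

# Simplified index: one row per keyword, with its record ID and context.
build_index <- function(df, key, field) {
  words <- strsplit(df[[field]], "[[:space:]]+")
  n <- lengths(words)                      # number of keywords per record
  data.frame(
    KEYWORD = unlist(words),
    PRIM_ID = rep(df[[key]], n),
    CONTEXT = rep(df[[field]], n),
    stringsAsFactors = FALSE
  )
}

idx <- build_index(passport, key = "ID", field = "Name")

# Look up every record in which the keyword "TUNG" occurs
print(idx[idx$KEYWORD == "TUNG", ])
```

Searching the index by keyword returns the matching records together with the surrounding text, which is what makes the index useful for comparing accession names.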

Step 3

Function(s) :

ProbDup

Execute fuzzy, phonetic and semantic matching of keywords to identify probable duplicate sets either within a single KWIC index or between two indexes using this function. For fuzzy matching, the Levenshtein edit distance is used, while for phonetic matching, the Double Metaphone algorithm is used. For semantic matching, synonym sets or ‘synsets’ of accession names can be supplied as an input and the text strings in such sets will be treated as being identical for matching. Various options to tweak the matching strategies used are also available in this function.
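To make the fuzzy matching step concrete, the sketch below computes Levenshtein distances between a handful of made-up keywords with adist from the base utils package and keeps the pairs within a threshold. It is not a call to ProbDup and does not cover phonetic or semantic matching; the keyword list and the threshold value are assumptions chosen for the example.

```r
# Made-up keywords for illustration
keywords <- c("KALYANI", "KALYANY", "PUSA", "PUSSA", "TYAGI")

# Pairwise generalized Levenshtein distances (utils::adist is part of base R)
d <- utils::adist(keywords)

threshold <- 1  # assumed maximum edit distance for a fuzzy match

# Keep each unordered pair once (upper triangle) if it is within the threshold
pairs <- which(d <= threshold & upper.tri(d), arr.ind = TRUE)

matches <- data.frame(
  a = keywords[pairs[, "row"]],
  b = keywords[pairs[, "col"]],
  dist = d[pairs],
  stringsAsFactors = FALSE
)
print(matches)
```

A larger threshold retrieves more candidate pairs at the cost of more false positives, which is why the retrieved sets still need review.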

Step 4

Function(s) :

DisProbDup
ReviewProbDup
ReconstructProbDup

Inspect, revise and improve the retrieved sets using these functions. If considerable intersections exist between the initially identified sets, then DisProbDup may be used to get the disjoint sets. The identified sets may be subjected to clerical review after transforming them into an appropriate spreadsheet format which contains the raw data from the original database(s) using ReviewProbDup and subsequently converted back using ReconstructProbDup.
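One way to turn intersecting sets into disjoint ones is to merge any sets that share a member. The base R sketch below does this for sets of record IDs. It is an assumption about the general technique, not a description of what DisProbDup does internally, and the helper merge_overlapping and the example IDs are hypothetical.

```r
# Hypothetical sets of record IDs; the first two share "A2"
sets <- list(c("A1", "A2"), c("A2", "A3"), c("B1", "B2"))

merge_overlapping <- function(sets) {
  merged <- list()
  for (s in sets) {
    # Which already-merged sets share at least one member with s?
    hit <- vapply(merged, function(m) any(s %in% m), logical(1))
    # Absorb those sets into s, then keep the untouched ones plus the union
    s <- sort(unique(c(s, unlist(merged[hit]))))
    merged <- c(merged[!hit], list(s))
  }
  merged
}

disjoint <- merge_overlapping(sets)
print(disjoint)
```

Because the merged sets are kept mutually disjoint at every step, a new set that bridges several of them pulls all of them into one.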

Adjuncts

Function(s) :

ValidatePrimKey
DoubleMetaphone
ParseProbDup
AddProbDup
SplitProbDup
MergeProbDup
ViewProbDup
KWCounts
read.genesys

Use these helper functions if needed. ValidatePrimKey can be used to check whether a column chosen in a data.frame as the primary key/ID conforms to the constraints of absence of duplicates and NULL values.
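The two constraints on a primary key can be checked by hand in base R, as in the sketch below. The helper check_prim_key and the toy table are hypothetical and only illustrate the checks; they do not reproduce the output of ValidatePrimKey. Missing values are represented as NA here.

```r
# Hypothetical helper: report missing and duplicated values in a key column
check_prim_key <- function(df, key) {
  ids <- df[[key]]
  list(
    has_missing = anyNA(ids),
    duplicated_ids = unique(ids[duplicated(ids) & !is.na(ids)])
  )
}

# Toy table with one duplicated ID and one missing ID
db <- data.frame(ID = c("A1", "A2", "A2", NA), stringsAsFactors = FALSE)

res <- check_prim_key(db, "ID")
print(res)
```

A column is usable as a primary key only when has_missing is FALSE and duplicated_ids is empty.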

DoubleMetaphone is an implementation of the Double Metaphone phonetic algorithm in R and is used for phonetic matching.

ParseProbDup and AddProbDup work with objects of class ProbDup. The former can be used to parse the probable duplicate sets in a ProbDup object to a data.frame, while the latter can be used to add the data fields of these sets to the passport databases. SplitProbDup can be used to split an object of class ProbDup according to set counts. MergeProbDup can be used to merge together two objects of class ProbDup. ViewProbDup can be used to plot the summary visualizations of probable duplicate sets retrieved in an object of class ProbDup.

KWCounts can be used to compute keyword counts from PGR passport database fields (columns), which can give a rough indication of the completeness of the data.
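Keyword counting of this kind can be approximated in base R with strsplit and table. The sketch below is not a call to KWCounts; the field values are made up, and missing or empty entries are simply skipped.

```r
# Made-up values of a name field, including a missing and an empty entry
names_field <- c("PUSA BASMATI", "PUSA 44", NA, "")

# Split the non-missing entries into keywords and drop empty strings
kw <- unlist(strsplit(names_field[!is.na(names_field)], "[[:space:]]+"))
kw <- kw[nzchar(kw)]

# Frequency of each keyword, most frequent first
counts <- sort(table(kw), decreasing = TRUE)
print(counts)

# Number of records that contribute no keyword at all
n_empty <- sum(is.na(names_field) | !nzchar(names_field))
print(n_empty)
```

A field in which many records contribute no keywords is a sign that the passport data for that column is incomplete.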

read.genesys can be used to import PGR data in a Darwin Core - germplasm zip archive downloaded from the Genesys database into the R environment.

Tips

Use fread (from the data.table package) to rapidly read large files instead of read.csv or read.table in base.
In case the PGR passport data is in any DBMS, use the appropriate R-database interface packages to get the required table as a data.frame in R.

Note

The ProbDup function can be memory hungry with large passport databases. In such cases, ensure that the system has sufficient memory for smooth functioning (see ?ProbDup).

Detailed tutorial

For a detailed tutorial (vignette) on how to use this package, type:

browseVignettes(package = 'PGRdup')

The vignette for the latest version is also available online.

What’s new

To see what's new in this version, type:

news(package='PGRdup')

Links

CRAN page

Github page

Documentation website

Zenodo DOI

CRAN checks


r-devel-linux-x86_64-debian-clang
r-devel-linux-x86_64-debian-gcc
r-devel-linux-x86_64-fedora-clang
r-devel-linux-x86_64-fedora-gcc
r-patched-linux-x86_64
r-release-linux-x86_64


r-devel-windows-x86_64
r-release-windows-x86_64
r-oldrel-windows-x86_64


r-release-macos-x86_64
r-oldrel-macos-x86_64

Citing `PGRdup`

To cite the methods in the package use:

citation("PGRdup")

To cite the R package 'PGRdup' in publications use:

  Aravind, J., Radhamani, J., Kalyani Srinivasan, Ananda Subhash, B., and Tyagi, R. K. (2025). PGRdup: Discover Probable Duplicates in Plant Genetic Resources Collections. R package version 0.2.4.0, https://github.com/aravind-j/PGRdup, https://cran.r-project.org/package=PGRdup.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {PGRdup: Discover Probable Duplicates in Plant Genetic Resources Collections},
    author = {J. Aravind and J. Radhamani and {Kalyani Srinivasan} and B. {Ananda Subhash} and Rishi Kumar Tyagi},
    note = {R package version 0.2.4.0 https://github.com/aravind-j/PGRdup, https://cran.r-project.org/package=PGRdup},
    year = {2025},
  }

This free and open-source software implements academic research by the authors and co-workers. If you use it, please support the project by citing the package.