Version: 0.2.4.0; Copyright (C) 2014-2025: ICAR-NBPGR; License: GPL-2 | GPL-3
Aravind, J. [1], Radhamani, J. [1], Kalyani Srinivasan [1], Ananda Subhash, B. [2], and Tyagi, R. K. [1]
  1. ICAR-National Bureau of Plant Genetic Resources, New Delhi, India
  2. Centre for Development of Advanced Computing, Thiruvananthapuram, Kerala, India


Introduction

The R package PGRdup was developed as a tool to aid genebank managers in the identification of probable duplicate accessions from plant genetic resources (PGR) passport databases.

This package primarily implements a workflow designed to fetch groups or sets of germplasm accessions with similar passport data, particularly in fields associated with accession names, within or across PGR passport databases.

The functions in this package are primarily built using the following R packages:

Installation

The package can be installed from CRAN as follows:

# Install from CRAN
install.packages('PGRdup', dependencies=TRUE)

The development version can be installed from GitHub as follows:

# Install development version from GitHub
devtools::install_github("aravind-j/PGRdup")

Workflow

The series of steps involved in the workflow, along with the associated functions, is illustrated below:

Step 1

Function(s) :

  • DataClean
  • MergeKW
  • MergePrefix
  • MergeSuffix

Use these functions for the appropriate data standardisation of the relevant fields in the passport databases to harmonize punctuation, leading zeros, prefixes, suffixes etc. associated with accession names.
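The kind of standardisation described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: standardise_names is a hypothetical helper and the sample accession names are made up. If PGRdup is installed, the last lines apply the package's own DataClean to the same vector for comparison.

```r
# Hypothetical helper for illustration; not a PGRdup function.
standardise_names <- function(x) {
  x <- toupper(x)                                          # ignore case differences
  x <- gsub("[[:punct:]]", " ", x)                         # punctuation -> space
  x <- gsub("(?<![0-9])0+(?=[0-9])", "", x, perl = TRUE)   # drop leading zeros
  gsub(" +", " ", trimws(x))                               # collapse repeated spaces
}

# Made-up accession names that differ only in punctuation, case and zeros
raw <- c("ICG-3505", "icg 03505", "EC0021003")
cleaned <- standardise_names(raw)
print(cleaned)

# With PGRdup installed, the package's own cleaner can be run on the same input.
if (requireNamespace("PGRdup", quietly = TRUE)) {
  print(PGRdup::DataClean(raw))
}
```

After this step the first two names become the same string, which is what makes the later matching steps meaningful.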

Step 2

Function(s) :

  • KWIC

Use this function to extract the information in the relevant fields as keywords or text strings in the form of a searchable Keyword in Context (KWIC) index.
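To make the idea of a KWIC index concrete, here is a minimal base R sketch. It assumes a toy passport table with made-up values; build_kwic and its output columns are hypothetical and do not reproduce the structure returned by the package's KWIC function.

```r
# Toy passport table (made-up values); ID is the primary key.
passport <- data.frame(
  ID = c("A1", "A2"),
  CollNo = c("SHULAMITH NRCG 14555", "NRCG 14555"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per keyword, with its left and right context.
build_kwic <- function(db, key, fields) {
  rows <- list()
  for (f in fields) {
    for (i in seq_len(nrow(db))) {
      words <- strsplit(db[[f]][i], " +")[[1]]
      words <- words[!is.na(words) & nzchar(words)]
      for (j in seq_along(words)) {
        rows[[length(rows) + 1]] <- data.frame(
          id = db[[key]][i],
          field = f,
          left = paste(words[seq_len(j - 1)], collapse = " "),
          keyword = words[j],
          right = paste(words[-seq_len(j)], collapse = " "),
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, rows)
}

kwic <- build_kwic(passport, key = "ID", fields = "CollNo")
print(kwic)
```

Each keyword keeps a pointer back to its record and field, so a match on a keyword can always be traced to the accession it came from.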

Step 3

Function(s) :

  • ProbDup

Execute fuzzy, phonetic and semantic matching of keywords to identify probable duplicate sets either within a single KWIC index or between two indexes using this function. For fuzzy matching, the Levenshtein edit distance is used, while for phonetic matching, the Double Metaphone algorithm is used. For semantic matching, synonym sets or ‘synsets’ of accession names can be supplied as an input and the text strings in such sets will be treated as being identical for matching. Various options to tweak the matching strategies used are also available in this function.
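The fuzzy part of this step can be illustrated with base R's adist, which computes a generalized Levenshtein distance. The sketch below is not how ProbDup is implemented; the keywords are made up, and max_dist is a hypothetical threshold chosen for the example. Phonetic and semantic matching are not covered here.

```r
# Made-up keywords, as they might appear in a KWIC index
keywords <- c("SHULAMITH", "SHULAMIT", "VALENCIA", "VALENCIE", "NRCG14555")

# Pairwise Levenshtein edit distances (utils::adist ships with base R)
d <- utils::adist(keywords)
dimnames(d) <- list(keywords, keywords)

# Hypothetical threshold: keywords at most this many edits apart are candidates
max_dist <- 1

# Keep each unordered pair once by looking at the upper triangle only
hits <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)
pairs <- data.frame(
  a = keywords[hits[, "row"]],
  b = keywords[hits[, "col"]],
  dist = d[hits],
  stringsAsFactors = FALSE
)
print(pairs)
```

Raising the threshold retrieves more candidate pairs at the cost of more false positives, which is why the retrieved sets are reviewed in the next step.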

Step 4

Function(s) :

  • DisProbDup
  • ReviewProbDup
  • ReconstructProbDup

Inspect, revise and improve the retrieved sets using these functions. If considerable intersections exist between the initially identified sets, then DisProbDup may be used to get the disjoint sets. The identified sets may be subjected to clerical review after transforming them into an appropriate spreadsheet format which contains the raw data from the original database(s) using ReviewProbDup and subsequently converted back using ReconstructProbDup.
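What "getting the disjoint sets" means can be shown with a small base R sketch that merges any sets sharing a member. This is an assumption about the general idea only, not the algorithm used inside DisProbDup; merge_overlapping and the accession IDs are hypothetical.

```r
# Hypothetical helper: merge sets that share at least one member,
# so that the returned sets are pairwise disjoint.
merge_overlapping <- function(sets) {
  merged <- list()
  for (s in sets) {
    # Which of the sets built so far intersect the current one?
    overlaps <- vapply(merged, function(m) length(intersect(m, s)) > 0,
                       logical(1))
    # Absorb every overlapping set into the current one
    s <- unique(c(s, unlist(merged[overlaps])))
    merged <- c(merged[!overlaps], list(sort(s)))
  }
  merged
}

# Made-up probable duplicate sets with intersections
sets <- list(c("A1", "A2"), c("A2", "A3"), c("B1", "B2"), c("A3", "A4"))
disjoint <- merge_overlapping(sets)
print(disjoint)
```

Here the three sets that chain through A2 and A3 collapse into one set of four accessions, while the B set is left untouched.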

Adjuncts

Function(s) :

  • ValidatePrimKey
  • DoubleMetaphone
  • ParseProbDup
  • AddProbDup
  • SplitProbDup
  • MergeProbDup
  • ViewProbDup
  • KWCounts
  • read.genesys

Use these helper functions if needed. ValidatePrimKey can be used to check whether a column chosen in a data.frame as the primary key/ID conforms to the constraints of absence of duplicates and NULL values.
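The two constraints mentioned above (no duplicates, no NULL values) can be checked in base R along the following lines. check_primary_key is a hypothetical helper written for illustration; it does not reproduce the return value of ValidatePrimKey, and the table is made up.

```r
# Hypothetical helper: report duplicated key values and rows with a
# missing or empty key.
check_primary_key <- function(db, key) {
  values <- db[[key]]
  list(
    duplicated = unique(values[duplicated(values) & !is.na(values)]),
    missing_rows = which(is.na(values) | trimws(as.character(values)) == "")
  )
}

# Made-up table with one repeated ID, one NA and one empty string
db <- data.frame(
  ID = c("EC100", "EC101", "EC100", NA, ""),
  Name = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

result <- check_primary_key(db, "ID")
print(result)
```

A column is usable as a primary key only when both elements of the result are empty.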

DoubleMetaphone is an implementation of the Double Metaphone phonetic algorithm in R and is used for phonetic matching.

ParseProbDup and AddProbDup work with objects of class ProbDup. The former can be used to parse the probable duplicate sets in a ProbDup object to a data.frame, while the latter can be used to add the data fields of these sets to the passport databases. SplitProbDup can be used to split an object of class ProbDup according to set counts. MergeProbDup can be used to merge together two objects of class ProbDup. ViewProbDup can be used to plot the summary visualizations of probable duplicate sets retrieved in an object of class ProbDup.

KWCounts can be used to compute keyword counts from PGR passport database fields (columns), which can give a rough indication of the completeness of the data.
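As a rough base R illustration of keyword counting across fields (not the KWCounts implementation; count_keywords and the sample table are hypothetical):

```r
# Made-up passport table with a missing and an empty value
db <- data.frame(
  ID = c("A1", "A2", "A3"),
  CollNo = c("NRCG 14555", "NRCG 100", NA),
  OtherID = c("LOCAL", "", "LOCAL BOLD"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split the chosen fields into keywords and tabulate them.
count_keywords <- function(db, fields) {
  text <- unlist(db[fields], use.names = FALSE)
  words <- unlist(strsplit(text[!is.na(text)], " +"))
  words <- words[nzchar(words)]
  sort(table(words), decreasing = TRUE)
}

counts <- count_keywords(db, c("CollNo", "OtherID"))
print(counts)
```

Frequent generic keywords (here NRCG and LOCAL) stand out at the top of such a table, and fields that contribute few keywords hint at sparsely filled columns.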

read.genesys can be used to import PGR data in a Darwin Core - Germplasm zip archive downloaded from the Genesys database into the R environment.

Tips

  • Use fread to rapidly read large files instead of read.csv or read.table in base.
  • In case the PGR passport data is in any DBMS, use the appropriate R-database interface packages to get the required table as a data.frame in R.
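A minimal sketch of the first tip, assuming the passport data is a CSV file: the file written here is a made-up two-row stand-in for a real export. The code uses data.table::fread when the data.table package is available and falls back to read.csv otherwise.

```r
# Write a tiny made-up passport file to a temporary location
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(ID = c("A1", "A2"), CollNo = c("NRCG 1", "NRCG 2")),
  path,
  row.names = FALSE
)

if (requireNamespace("data.table", quietly = TRUE)) {
  # data.table = FALSE asks fread to return a plain data.frame
  db <- data.table::fread(path, data.table = FALSE)
} else {
  # Fallback in base R; slower on large files
  db <- read.csv(path, stringsAsFactors = FALSE)
}

str(db)
```

The speed difference only becomes noticeable on large files, such as full passport database exports.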

Note

  • The ProbDup function can be memory hungry with large passport databases. In such cases, ensure that the system has sufficient memory for smooth functioning (see ?ProbDup).

Detailed tutorial

For a detailed tutorial (vignette) on how to use this package, type:

browseVignettes(package = 'PGRdup')

The vignette for the latest version is also available online.

What’s new

To know what's new in this version, type:

news(package='PGRdup')

Links

CRAN page: https://cran.r-project.org/package=PGRdup

GitHub page: https://github.com/aravind-j/PGRdup

Documentation website

Zenodo DOI

Citing PGRdup

To cite the methods in the package use:

citation("PGRdup")

To cite the R package 'PGRdup' in publications use:

  Aravind, J., Radhamani, J., Kalyani Srinivasan, Ananda Subhash, B., and Tyagi, R. K. (2025). PGRdup: Discover Probable Duplicates in Plant Genetic Resources Collections. R package version 0.2.4.0, https://github.com/aravind-j/PGRdup, https://cran.r-project.org/package=PGRdup.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {PGRdup: Discover Probable Duplicates in Plant Genetic Resources Collections},
    author = {J. Aravind and J. Radhamani and {Kalyani Srinivasan} and B. {Ananda Subhash} and Rishi Kumar Tyagi},
    note = {R package version 0.2.4.0 https://github.com/aravind-j/PGRdup, https://cran.r-project.org/package=PGRdup},
    year = {2025},
  }

This free and open-source software implements academic research by the authors and co-workers. If you use it, please support the project by citing the package.

Monthly Downloads: 233
Version: 0.2.4.0
License: GPL-2 | GPL-3

Maintainer: J. Aravind
Last Published: December 14th, 2025

Functions in PGRdup (0.2.4.0)

  • read.genesys: Convert 'Darwin Core - Germplasm' zip archive to a flat file
  • SplitProbDup: Split an object of class ProbDup
  • print.KWIC: Prints summary of KWIC object.
  • ValidatePrimKey: Validate if a data frame column conforms to primary key/ID constraints
  • ViewProbDup: Visualize the probable duplicate sets retrieved in a ProbDup object
  • print.ProbDup: Prints summary of ProbDup object.
  • ReconstructProbDup: Reconstruct an object of class ProbDup
  • ParseProbDup: Parse an object of class ProbDup to a data frame.
  • ProbDup: Identify probable duplicates of accessions
  • ReviewProbDup: Retrieve probable duplicate set information from PGR passport database for review
  • PGRdup-package: The PGRdup Package
  • MergeKW: Merge keyword strings
  • KWIC: Create a KWIC index
  • AddProbDup: Add probable duplicate sets fields to the PGR passport database
  • DoubleMetaphone: 'Double Metaphone' phonetic algorithm
  • DisProbDup: Get disjoint probable duplicate sets
  • KWCounts: Generate keyword counts
  • MergeProbDup: Merge two objects of class ProbDup
  • DataClean: Clean PGR passport data
  • GN1000: Sample groundnut PGR passport data
