| Title: | Biodiversity Data Cleaning |
| Version: | 1.1.5 |
| Description: | It brings together several aspects of biodiversity data-cleaning in one place. 'bdc' is organized in thematic modules related to different biodiversity dimensions, including 1) Merge datasets: standardization and integration of different datasets; 2) Pre-filter: flagging and removal of invalid or non-interpretable information, followed by data amendments; 3) Taxonomy: cleaning, parsing, and harmonization of scientific names from several taxonomic groups against taxonomic databases locally stored through the application of exact and partial matching algorithms; 4) Space: flagging of erroneous, suspect, and low-precision geographic coordinates; and 5) Time: flagging and, whenever possible, correction of inconsistent collection date. In addition, it contains features to visualize, document, and report data quality – which is essential for making data quality assessment transparent and reproducible. The reference for the methodology is Bruno et al. (2022) <doi:10.1111/2041-210X.13868>. |
| License: | GPL (≥ 3) |
| URL: | https://brunobrr.github.io/bdc/ (website), https://github.com/brunobrr/bdc |
| BugReports: | https://github.com/brunobrr/bdc/issues |
| Imports: | CoordinateCleaner, doParallel, dplyr, DT, foreach, fs, ggplot2, here, magrittr, purrr, qs, readr, rgnparser, rnaturalearth, sf (≥ 1.0.5), stringdist, stringi, stringr, taxadb (≥ 0.1.3), tibble, tidyselect |
| Suggests: | contentid (≥ 0.0.15), covr, cowplot, DBI, duckdb (≥ 0.3.2), knitr (≥ 1.31), maps, markdown, rappdirs, raster, remotes, rlang (≥ 1.0.1), rmarkdown, rnaturalearthdata, sp, rvest, xml2, testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| Encoding: | UTF-8 |
| Language: | en-gb |
| RoxygenNote: | 7.3.2 |
| NeedsCompilation: | no |
| Packaged: | 2024-12-17 17:20:06 UTC; brunoribeiro |
| Author: | Bruno Ribeiro |
| Maintainer: | Bruno Ribeiro <ribeiro.brr@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2024-12-17 17:40:02 UTC |
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs | A value or the magrittr placeholder. |
rhs | A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
Identify records from doubtful source (e.g., 'fossil', 'MachineObservation')
Description
This function flags records whose basis of record (i.e., the record type, for example, a specimen, a human observation, or a fossil specimen) is not interpretable, does not comply with the Darwin Core vocabulary, or is unreliable or unsuitable for specific analyses.
Usage
bdc_basisOfRecords_notStandard(
  data,
  basisOfRecord = "basisOfRecord",
  names_to_keep = "all"
)
Arguments
data | data.frame. Containing information about the basis of records. |
basisOfRecord | character string. The column name with information about basis of records. Default = "basisOfRecord". |
names_to_keep | character string. Elements of the column basisOfRecord to keep. Default is "all", which considers a selected list of recommended standard Darwin Core classes (and their spelling variations, see details). By default, records missing (i.e., NA) or with "unknown" information about basis of records are kept. |
Details
Users are encouraged to select the set of basis of records classes to keep. Default = c("Event", "HUMAN_OBSERVATION", "HumanObservation", "LIVING_SPECIMEN", "LivingSpecimen", "MACHINE_OBSERVATION", "MachineObservation", "MATERIAL_SAMPLE", "O", "Occurrence", "MaterialSample", "OBSERVATION", "Preserved Specimen", "PRESERVED_SPECIMEN", "preservedspecimen Specimen", "Preservedspecimen", "PreservedSpecimen", "preservedspecimen", "S", "Specimen", "Taxon", "UNKNOWN", "", NA)
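The check described above amounts to a vocabulary lookup. The following is a minimal base-R sketch of that idea, not the package's implementation: the helper name `flag_basis_of_record` and the shortened `keep` vector are assumptions made for illustration.

```r
# Hypothetical helper (not part of bdc): TRUE when a value is in the accepted
# vocabulary or missing, FALSE otherwise.
flag_basis_of_record <- function(values, keep) {
  # NA is kept explicitly, mirroring the documented default behaviour.
  is.na(values) | values %in% keep
}

# Shortened, illustrative subset of the default vocabulary.
keep <- c(
  "HUMAN_OBSERVATION", "HumanObservation", "PRESERVED_SPECIMEN",
  "PreservedSpecimen", "Specimen", "UNKNOWN", ""
)
basis <- c("FOSSIL_SPECIMEN", "UNKNOWN", "RON", NA, "Specimen", "PRESERVED_SPECIMEN")
flags <- flag_basis_of_record(basis, keep)
print(data.frame(basisOfRecord = basis, flag = flags))
```

Passing a custom `keep` vector corresponds to supplying 'names_to_keep' instead of relying on "all".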
Value
A data.frame containing the column ".basisOfRecords_notStandard". Compliant (TRUE) if 'basisOfRecord' is standard; otherwise "FALSE".
See Also
Other prefilter: bdc_coordinates_country_inconsistent(), bdc_coordinates_empty(), bdc_coordinates_from_locality(), bdc_coordinates_outOfRange(), bdc_coordinates_transposed(), bdc_country_standardized(), bdc_scientificName_empty()
Examples
x <- data.frame(basisOfRecord = c(
  "FOSSIL_SPECIMEN", "UNKNOWN", "RON", NA, "Specimen", "PRESERVED_SPECIMEN"
))
bdc_basisOfRecords_notStandard(
  data = x,
  basisOfRecord = "basisOfRecord",
  names_to_keep = "all"
)
Clean and parse scientific names
Description
This function is composed of a series of name-checking routines for cleaning and parsing scientific names, i.e., unifying writing style. It removes 1) family names of animals or plants prepended to species names, 2) qualifiers denoting the uncertain or provisional status of taxonomic identification (e.g., confer, species, affinis), and 3) infraspecific terms, for example, variety (var.), subspecies (subsp.), forma (f.), and their spelling variations. It also includes routines to 4) standardize names (i.e., capitalize only the first letter of the genus name and remove extra whitespace), and 5) parse names (i.e., separate author, date, and annotations from the taxon name).
Usage
bdc_clean_names(sci_names, save_outputs = FALSE)
Arguments
sci_names | character string. Containing scientific names. |
save_outputs | logical. Should the outputs be saved? Default = FALSE. |
Details
Terms denoting uncertainty or provisional status of taxonomic identification as well as infraspecific terms were obtained from Sigovini et al. (2016; doi: 10.1111/2041-210X.12594).
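To make the idea of removing uncertainty qualifiers concrete, here is a minimal base-R sketch. It is not the routine used by bdc_clean_names: the helper `strip_uncertainty`, the column `has_uncertainty_term`, and the four-term pattern are assumptions chosen for illustration, and the real term list is much longer.

```r
# Hypothetical helper: detect and remove a small, illustrative set of
# uncertainty qualifiers (cf., aff., sp., spp.) from scientific names.
strip_uncertainty <- function(sci_names) {
  # A qualifier must follow whitespace and be followed by whitespace or the
  # end of the string, so epithets such as "spectabilis" are left untouched.
  pattern <- "\\s+(cf|aff|sp|spp)\\.?(?=\\s|$)"
  has_term <- grepl(pattern, sci_names, perl = TRUE, ignore.case = TRUE)
  cleaned <- gsub(pattern, "", sci_names, perl = TRUE, ignore.case = TRUE)
  # Collapse repeated whitespace left behind by the removal.
  cleaned <- gsub("\\s+", " ", trimws(cleaned))
  data.frame(
    scientificName = sci_names,
    has_uncertainty_term = has_term,
    name_clean = cleaned,
    stringsAsFactors = FALSE
  )
}

res <- strip_uncertainty(c("Senna aff. organensis", "Scinax ruber", "Miconia sp."))
print(res)
```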
Value
A five-column data.frame including
scientificName: original names supplied
.uncer_terms: indicates the presence of taxonomic uncertainty terms
.infraesp_names: indicates the presence of infraspecific terms
name_clean: scientific names resulting from the cleaning and parsing processes
quality: an index indicating the quality of the parsing process. It ranges from 0 to 4, where 1 indicates that no problem was detected and 4 indicates that serious problems were detected; a value of 0 indicates a non-interpretable name that was not parsed.
If save_outputs == TRUE, a data.frame containing all tests of the name-cleaning process and the results of the name-parsing process is saved in "Output/Check/02_parse_names.csv".
See Also
Other taxonomy: bdc_filter_out_names(), bdc_query_names_taxadb()
Examples
## Not run: 
scientificName <- c(
  "Fridericia bahiensis (Schauer ex. DC.) L.G.Lohmann",
  "Peltophorum dubium (Spreng.) Taub. (Griseb.) Barneby",
  "Gymnanthes edwalliana (Pax & K.Hoffm.) Laurenio-Melo & M.F.Sales",
  "LEGUMINOSAE Senna aff. organensis (Glaz. ex Harms) H.S.Irwin & Barneby"
)
bdc_clean_names(scientificName, save_outputs = FALSE)
## End(Not run)
Identify records within a reference country
Description
This function flags geographic coordinates within a reference country. A spatial buffer can be added to the reference country to ensure that records in mangroves, marshes, estuaries, and records with low coordinate precision are not flagged as invalid.
Usage
bdc_coordinates_country_inconsistent(
  data,
  country_name,
  country = "country_suggested",
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  dist = 0.1
)
Arguments
data | data.frame. Containing longitude and latitude. Coordinates must be expressed in decimal degrees and WGS84. |
country_name | character string. Name of the country or countries to be considered. |
country | character string. The column name with the country assignment of each record. It is recommended to use a column with corrected and homogenized country names. Default = "country_suggested". |
lat | character string. The column name with the latitude coordinates. Default = "decimalLatitude". |
lon | character string. The column name with the longitude coordinates. Default = "decimalLongitude". |
dist | numeric. The distance in decimal degrees used to create a buffer around the country. Default = 0.1 (~11 km at the equator). |
Details
Multiple countries can be supplied, but they are tested separately. The distance given in the argument 'dist' is used to create a buffer around the reference country. Records within the reference country or at a specified distance from the coastline of the reference country (i.e., records within the buffer) are flagged as valid (TRUE). Note that records within the buffer but in other countries are flagged as invalid (FALSE). Records with invalid (e.g., NA or empty) and out-of-range coordinates are not tested and are returned as TRUE.
Value
A data.frame containing the column '.coordinates_country_inconsistent'. Compliant (TRUE) if coordinates fall within the boundaries plus a specified distance (if 'dist' is supplied) of 'country_name'; otherwise "FALSE".
See Also
Other prefilter: bdc_basisOfRecords_notStandard(), bdc_coordinates_empty(), bdc_coordinates_from_locality(), bdc_coordinates_outOfRange(), bdc_coordinates_transposed(), bdc_country_standardized(), bdc_scientificName_empty()
Examples
## Not run: 
x <- data.frame(
  country = c("Brazil", "Brazil", "Bolivia", "Argentina", "Peru"),
  decimalLongitude = c(-40.6003, -39.6, -77.689288, NA, -76.352930),
  decimalLatitude = c(-19.9358, -13.016667, -20.5243, -35.345940, -11.851872)
)
bdc_coordinates_country_inconsistent(
  data = x,
  country_name = c("Brazil", "Peru", "Argentina"),
  country = "country",
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  dist = 0.1
)
## End(Not run)
Identify records with empty geographic coordinates
Description
This function flags records missing latitude or longitude coordinates.
Usage
bdc_coordinates_empty(data, lat = "decimalLatitude", lon = "decimalLongitude")
Arguments
data | data.frame. Containing geographical coordinates. |
lat | character string. The column name with latitude in decimal degrees and WGS84. Default = "decimalLatitude". |
lon | character string. The column name with longitude in decimal degrees and WGS84. Default = "decimalLongitude". |
Details
This test identifies records missing geographic coordinates (i.e., empty or not available (NA) longitude or latitude).
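The logic of this test can be sketched in a few lines of base R. This is an illustration under the assumption that "empty" means either NA or a blank string; the helper `flag_coordinates_empty` is hypothetical and is not the package function.

```r
# Hypothetical helper: TRUE when both coordinates are present,
# FALSE when either one is NA or blank.
flag_coordinates_empty <- function(lat, lon) {
  is_empty <- function(v) is.na(v) | trimws(as.character(v)) == ""
  !(is_empty(lat) | is_empty(lon))
}

lat <- c("19.9358", "-13.016667", NA, "")
lon <- c("-40.6003", "-39.6", "-20.5243", NA)
coordinates_present <- flag_coordinates_empty(lat, lon)
print(coordinates_present)
```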
Value
A data.frame containing the column ".coordinates_empty". Compliant (TRUE) if 'lat' and 'lon' are not empty; otherwise "FALSE".
See Also
Other prefilter: bdc_basisOfRecords_notStandard(), bdc_coordinates_country_inconsistent(), bdc_coordinates_from_locality(), bdc_coordinates_outOfRange(), bdc_coordinates_transposed(), bdc_country_standardized(), bdc_scientificName_empty()
Examples
x <- data.frame(
  decimalLatitude = c(19.9358, -13.016667, NA, ""),
  decimalLongitude = c(-40.6003, -39.6, -20.5243, NA)
)
bdc_coordinates_empty(
  data = x,
  lat = "decimalLatitude",
  lon = "decimalLongitude"
)
Identify records lacking or with invalid coordinates but containing locality information
Description
This function identifies records whose coordinates can potentially be extracted from locality information.
Usage
bdc_coordinates_from_locality(
  data,
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  locality = "locality",
  save_outputs = FALSE
)
Arguments
data | data.frame. Containing geographical coordinates and the column "locality". |
lat | character string. The column name with latitude in decimal degrees and WGS84. Default = "decimalLatitude". |
lon | character string. The column name with longitude in decimal degrees and WGS84. Default = "decimalLongitude". |
locality | character string. The column name with locality information. Default = "locality". |
save_outputs | logical. Should a table containing the flagged records be saved for further inspection? Default = FALSE. |
Details
According to Darwin Core terminology, locality refers to "the specific description of the place" where an organism was recorded.
Value
A data.frame containing records missing or with invalid coordinates but with potentially useful locality information. When save_outputs = TRUE, the data.frame is saved in Output/Check/01_coordinates_from_locality.csv.
See Also
Other prefilter: bdc_basisOfRecords_notStandard(), bdc_coordinates_country_inconsistent(), bdc_coordinates_empty(), bdc_coordinates_outOfRange(), bdc_coordinates_transposed(), bdc_country_standardized(), bdc_scientificName_empty()
Examples
x <- data.frame(
  lat = c(NA, NA, ""),
  lon = c("", NA, NA),
  locality = c(
    "PARAGUAY: ALTO PARAGUAY: CO.; 64KM W PUERTO SASTRE",
    "Parque Estadual da Serra de Caldas Novas, Goias, Brazil",
    "Parque Nacional Iguazu"
  )
)
bdc_coordinates_from_locality(
  data = x,
  lat = "lat",
  lon = "lon",
  locality = "locality",
  save_outputs = FALSE
)
Identify records with out-of-range geographic coordinates
Description
This function identifies records with out-of-range coordinates (i.e., latitude not between -90 and 90, or longitude not between -180 and 180).
Usage
bdc_coordinates_outOfRange(
  data,
  lat = "decimalLatitude",
  lon = "decimalLongitude"
)
Arguments
data | data.frame. Containing geographical coordinates. Coordinates must be expressed in decimal degrees and WGS84. |
lat | character string. The column name with latitude in decimal degrees and in WGS84. Default = "decimalLatitude". |
lon | character string. The column name with longitude in decimal degrees and in WGS84. Default = "decimalLongitude". |
Value
A data.frame containing the column ".coordinates_outOfRange". Compliant (TRUE) if 'lat' and 'lon' are not out-of-range; otherwise "FALSE".
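The range test can be illustrated with a short base-R sketch. The helper `flag_out_of_range` is hypothetical, and treating coordinates that cannot be parsed as "not out-of-range" (TRUE) is an assumption of this sketch rather than documented behaviour.

```r
# Hypothetical helper: TRUE when latitude is within [-90, 90] and longitude
# is within [-180, 180]; FALSE otherwise.
flag_out_of_range <- function(lat, lon) {
  lat <- suppressWarnings(as.numeric(lat))
  lon <- suppressWarnings(as.numeric(lon))
  in_range <- abs(lat) <= 90 & abs(lon) <= 180
  # Assumption: coordinates that cannot be evaluated are not flagged here.
  in_range[is.na(in_range)] <- TRUE
  in_range
}

lat <- c("-185.111", "-43.34", "", "-21.8069444")
lon <- c(-45.4, -39.6, -20.5243, -440.9055555)
within_range <- flag_out_of_range(lat, lon)
print(within_range)
```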
See Also
Other prefilter: bdc_basisOfRecords_notStandard(), bdc_coordinates_country_inconsistent(), bdc_coordinates_empty(), bdc_coordinates_from_locality(), bdc_coordinates_transposed(), bdc_country_standardized(), bdc_scientificName_empty()
Examples
x <- data.frame(
  decimalLatitude = c(-185.111, -43.34, "", -21.8069444),
  decimalLongitude = c(-45.4, -39.6, -20.5243, -440.9055555)
)
bdc_coordinates_outOfRange(
  data = x,
  lat = "decimalLatitude",
  lon = "decimalLongitude"
)
Flag low-precision geographic coordinates
Description
This function flags records with a coordinate precision below a specified number of decimal places. Coordinates with one, two, or three decimal places present a precision of ~11.1 km, ~1.1 km, and ~111 m at the equator, respectively.
Usage
bdc_coordinates_precision(
  data,
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  ndec = c(0, 1, 2)
)
Arguments
data | data.frame. A data.frame containing geographic coordinates. |
lat | character string. The column name with latitude in decimal degrees and WGS84. Default = "decimalLatitude". |
lon | character string. The column name with longitude in decimal degrees and WGS84. Default = "decimalLongitude". |
ndec | numeric. The minimum number of decimal places that the coordinates should have to be considered valid. Default = 2. |
Value
A data.frame with logical values indicating whether the number of decimal places is equal to or higher than the specified minimum (ndec). Coordinates flagged as FALSE in the .rou column are considered imprecise.
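Counting decimal places is the core of this test. The base-R sketch below shows one way to do it; the helpers `count_decimals` and `flag_precision` are hypothetical, and requiring both latitude and longitude to meet the minimum is an assumption of this sketch. Values that R prints in scientific notation (e.g., 1e-05) are not handled.

```r
# Hypothetical helper: number of digits after the decimal point.
count_decimals <- function(x) {
  txt <- as.character(x)
  has_point <- grepl(".", txt, fixed = TRUE)
  frac <- ifelse(has_point, sub("^[^.]*\\.", "", txt), "")
  nchar(frac)
}

# Hypothetical helper: TRUE when both coordinates have at least `ndec` decimals.
flag_precision <- function(lat, lon, ndec = 2) {
  count_decimals(lat) >= ndec & count_decimals(lon) >= ndec
}

lat <- c(-21.34, 23.567, 16.798, -10.468)
lon <- c(-55.38, -13.897, 30.8, 90.675)
precise <- flag_precision(lat, lon, ndec = 3)
print(precise)
```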
Examples
x <- data.frame(
  lat = c(-21.34, 23.567, 16.798, -10.468),
  lon = c(-55.38, -13.897, 30.8, 90.675)
)
bdc_coordinates_precision(
  data = x,
  lat = "lat",
  lon = "lon",
  ndec = 3
)
Identify transposed geographic coordinates
Description
This function flags and corrects records when latitude and longitude appear to be transposed.
Usage
bdc_coordinates_transposed(
  data,
  id = "database_id",
  sci_names = "scientificName",
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  country = "country",
  countryCode = "countryCode",
  border_buffer = 0.2,
  save_outputs = FALSE
)
Arguments
data | data.frame. Containing a unique identifier for each record, geographical coordinates, and country names. Coordinates must be expressed in decimal degrees and WGS84. |
id | character string. The column name with a unique record identifier. Default = "database_id". |
sci_names | character string. The column name with species scientific name. Default = "scientificName". |
lat | character string. The column name with latitude. Coordinates must be expressed in decimal degrees and WGS84. Default = "decimalLatitude". |
lon | character string. The column name with longitude. Coordinates must be expressed in decimal degrees and WGS84. Default = "decimalLongitude". |
country | character string. The column name with the country assignment of each record. Default = "country". |
countryCode | character string. The column name with an ISO-2 country code. |
border_buffer | numeric >= 0. A distance in decimal degrees used to create a buffer around the country. Records within a given country and at a specified distance from the border will not be corrected. Default = 0.2 (~22 km at the equator). |
save_outputs | logical. Should a table containing transposed coordinates be saved for further inspection? Default = FALSE. |
Details
This test identifies transposed coordinates resulting from mismatches between the country reported for a record and its coordinates. Transposed coordinates often fall outside of the indicated country (i.e., in other countries or in the sea). Different coordinate transformations are performed to correct country/coordinates mismatches. Importantly, verbatim coordinates are replaced by the corrected ones in the returned database. A database containing verbatim and corrected coordinates is created in "Output/Check/01_coordinates_transposed.csv" if save_outputs == TRUE. The columns "country" and "countryCode" can be retrieved by using the function bdc_country_standardized.
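The text does not list which coordinate transformations are tried. As a minimal sketch, assuming the candidates are sign flips and latitude/longitude swaps, the base-R helper below (hypothetical name `candidate_transformations`) enumerates them for one record. The step that tests each candidate against the polygon of the reported country is deliberately omitted, since it requires spatial data.

```r
# Hypothetical helper: enumerate candidate corrections for a single record.
# Assumption: candidates are all combinations of sign flips and a lat/lon swap.
candidate_transformations <- function(lat, lon) {
  grid <- expand.grid(
    sign_lat = c(1, -1),
    sign_lon = c(1, -1),
    swapped = c(FALSE, TRUE)
  )
  base_lat <- ifelse(grid$swapped, lon, lat)
  base_lon <- ifelse(grid$swapped, lat, lon)
  out <- data.frame(
    swapped = grid$swapped,
    lat = grid$sign_lat * base_lat,
    lon = grid$sign_lon * base_lon
  )
  # Discard candidates that are not valid geographic coordinates.
  out[abs(out$lat) <= 90 & abs(out$lon) <= 180, ]
}

# First record of the documented example: reported country is Bolivia,
# but the verbatim coordinates (63.43333, -17.9) fall far outside it.
cand <- candidate_transformations(lat = 63.43333, lon = -17.9)
print(cand)
```

Among the candidates is latitude -17.9, longitude -63.43333; whether that point lies in Bolivia is what the omitted spatial test would decide.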
Value
A data.frame containing the column "coordinates_transposed" indicating if verbatim coordinates were not transposed (TRUE). Otherwise, records are flagged as FALSE and, in this case, verbatim coordinates are replaced by corrected coordinates.
See Also
Other prefilter: bdc_basisOfRecords_notStandard(), bdc_coordinates_country_inconsistent(), bdc_coordinates_empty(), bdc_coordinates_from_locality(), bdc_coordinates_outOfRange(), bdc_country_standardized(), bdc_scientificName_empty()
Examples
## Not run: 
id <- c(1, 2, 3, 4)
scientificName <- c(
  "Rhinella major", "Scinax ruber",
  "Siparuna guianensis", "Psychotria vellosiana"
)
decimalLatitude <- c(63.43333, -14.43333, -41.90000, -46.69778)
decimalLongitude <- c(-17.90000, -67.91667, -13.25000, -13.82444)
country <- c("BOLIVIA", "bolivia", "Brasil", "Brazil")
x <- data.frame(
  id, scientificName, decimalLatitude,
  decimalLongitude, country
)
# Get country code
x <- bdc_country_standardized(data = x, country = "country")
bdc_coordinates_transposed(
  data = x,
  id = "id",
  sci_names = "scientificName",
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  country = "country_suggested",
  countryCode = "countryCode",
  border_buffer = 0.2,
  save_outputs = FALSE
)
## End(Not run)
Get country names from coordinates
Description
Country names derived from valid geographic coordinates are added to records missing country names.
Usage
bdc_country_from_coordinates(
  data,
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  country = "country"
)
Arguments
data | data.frame. Containing geographical coordinates and country names. |
lat | character string. The column name with latitude in decimal degrees and WGS84. Default = "decimalLatitude". |
lon | character string. The column name with longitude in decimal degrees and WGS84. Default = "decimalLongitude". |
country | character string. The column name with the country assignment of each record. Default = "country". If no column name is provided, a new column "country" is created. |
Details
This function assigns a country name to records missing such information. Country names are extracted from valid geographic coordinates using a high-quality map of the world (rnaturalearth package). No country name is added to records whose coordinates are in the sea.
Value
A tibble containing country names for records missing such information.
Examples
## Not run: 
x <- data.frame(
  decimalLatitude = c(-22.9834, -39.857030, -17.06811, -46.69778),
  decimalLongitude = c(-69.095, -68.443588, 37.438108, -13.82444),
  country = c("", NA, NA, "Brazil")
)
bdc_country_from_coordinates(
  data = x,
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  country = "country"
)
## End(Not run)
Standardizes country names and gets country code
Description
This function standardizes country names and adds a new column to the database containing two-letter country codes (ISO 3166-1 alpha-2).
Usage
bdc_country_standardized(data, country = "country")
Arguments
data | data.frame. Containing country names. |
country | character string. The column name with the country assignment of each record. Default = "country". |
Details
Country names are standardized using exact matching against a list of country names in several languages from the International Organization for Standardization. If any unmatched names remain, a fuzzy matching algorithm is used to find potential candidates for each misspelled country name.
Value
A data.frame containing two columns: country_suggested (standardized country names) and country_code (two-letter country codes; more details in World Countries, International Organization for Standardization).
See Also
Other prefilter: bdc_basisOfRecords_notStandard(), bdc_coordinates_country_inconsistent(), bdc_coordinates_empty(), bdc_coordinates_from_locality(), bdc_coordinates_outOfRange(), bdc_coordinates_transposed(), bdc_scientificName_empty()
Examples
## Not run: 
country <- c("BOLIVIA", "bolivia", "Brasil", "Brazil", "BREZIL")
x <- data.frame(country)
bdc_country_standardized(
  data = x,
  country = "country"
)
## End(Not run)
Create figures reporting the results of the bdc package
Description
Creates figures (i.e., bar plots, maps, and histograms) reporting the results of data quality tests implemented in the bdc package.
Usage
bdc_create_figures(
  data,
  database_id = "database_id",
  workflow_step = NULL,
  bins_maps = 15,
  save_figures = FALSE
)
Arguments
data | data.frame. Containing the results of data quality tests; that is, columns starting with ".". |
database_id | character string. The column name with a unique record identifier. Default = "database_id". |
workflow_step | character string. Name of the workflow step. Options available are "prefilter", "space", and "time". |
bins_maps | numeric. Number of bins used to create the map. Default = 15. |
save_figures | logical. Should the figures be saved for further inspection? Default = FALSE. |
Details
This function creates figures based on the results of the data quality tests implemented. A pre-defined list of test names is used for creating figures depending on the name of the workflow step supplied. Figures are saved in "Output/Figures" if save_figures == TRUE.
Value
List containing figures showing the results of data quality tests implemented in one module of bdc. When save_figures = TRUE, figures are also saved locally in png format.
Examples
## Not run: 
database_id <- c("GBIF_01", "GBIF_02", "GBIF_03", "FISH_04", "FISH_05")
lat <- c(-19.93580, -13.01667, -22.34161, -6.75000, -15.15806)
lon <- c(-40.60030, -39.60000, -49.61017, -35.63330, -39.52861)
.scientificName_emptys <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
.coordinates_empty <- c(TRUE, TRUE, TRUE, TRUE, TRUE)
.invalid_basis_of_records <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
.summary <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- data.frame(
  database_id,
  lat,
  lon,
  .scientificName_emptys,
  .coordinates_empty,
  .invalid_basis_of_records,
  .summary
)
figures <- bdc_create_figures(
  data = x,
  database_id = "database_id",
  workflow_step = "prefilter",
  save_figures = FALSE
)
## End(Not run)
Create a report summarizing the results of data quality tests
Description
Create a report summarizing the results of data quality tests
Usage
bdc_create_report(
  data,
  database_id = "database_id",
  workflow_step,
  save_report = FALSE
)
Arguments
data | data.frame. Containing a unique identifier for each record and the results of data quality tests. |
database_id | character string. The column name with a unique record identifier. Default = "database_id". |
workflow_step | character string. One of the following options: "prefilter", "taxonomy", "space", or "time". |
save_report | logical. Should the report be saved for further inspection? Default = FALSE. |
Value
A data.frame containing a report summarizing the results of data quality assessment.
Examples
## Not run: 
database_id <- c("test_1", "test_2", "test_3", "test_4", "test_5")
.missing_names <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
.missing_coordinates <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
.basisOfRecords_notStandard <- c(TRUE, TRUE, FALSE, TRUE, TRUE)
.summary <- c(TRUE, FALSE, FALSE, FALSE, FALSE)
x <- data.frame(
  database_id,
  .missing_names,
  .missing_coordinates,
  .basisOfRecords_notStandard,
  .summary
)
report <- bdc_create_report(
  data = x,
  database_id = "database_id",
  workflow_step = "prefilter",
  save_report = FALSE
)
## End(Not run)
Identify records with empty event date
Description
This function identifies records missing information on an event date (i.e., when a record was collected or observed).
Usage
bdc_eventDate_empty(data, eventDate = "eventDate")
Arguments
data | A data frame containing a column with event date information. |
eventDate | Numeric or date. The column with event date information. |
Details
This test identifies records missing event date information (i.e., empty or not available (NA)).
Value
A data.frame containing the column ".eventDate_empty". Compliant (TRUE) if 'eventDate' is not empty; otherwise "FALSE".
See Also
Other time: bdc_year_from_eventDate(), bdc_year_outOfRange()
Examples
collection_date <- c(
  NA, "31/12/2015", "2013-06-13T00:00:00Z", "2013-06-20", "", "2013",
  "0001-01-00"
)
x <- data.frame(collection_date)
bdc_eventDate_empty(data = x, eventDate = "collection_date")
Remove columns with the results of data quality tests
Description
This function filters out columns containing the results of data quality tests (i.e., columns starting with '.') or other columns specified.
Usage
bdc_filter_out_flags(data, col_to_remove = "all")
Arguments
data | data.frame. Containing columns to be removed. |
col_to_remove | character string. Which columns should be removed? Default = "all", which means that all columns containing the results of data quality tests are removed. |
Value
A data.frame without columns specified in 'col_to_remove'.
Examples
x <- data.frame(
  database_id = c("test_1", "test_2", "test_3", "test_4", "test_5"),
  kingdom = c("Plantae", "Plantae", "Animalia", "Animalia", "Plantae"),
  .bdc_scientificName_empty = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  .bdc_coordinates_empty = c(TRUE, FALSE, FALSE, FALSE, FALSE),
  .bdc_coordinates_outOfRange = c(TRUE, FALSE, FALSE, FALSE, FALSE),
  .summary = c(TRUE, FALSE, FALSE, FALSE, FALSE)
)
bdc_filter_out_flags(
  data = x,
  col_to_remove = "all"
)
Filter out records according to their taxonomic status
Description
This function is useful for selecting records according to their taxonomic status. By default, only records with accepted scientific names are returned.
Usage
bdc_filter_out_names(
  data,
  col_name = "notes",
  taxonomic_status = "accepted",
  opposite = FALSE
)
Arguments
data | data.frame. Containing the column "notes" with information on the taxonomic status of scientific names. |
col_name | character string. The column name containing notes about the taxonomic status of a name. Default = "notes". |
taxonomic_status | character string. Taxonomic status of a name. Default = "accepted". |
opposite | logical. Should taxonomic statuses different from those listed in 'taxonomic_status' be returned? Default = FALSE. |
Details
By default, only records with accepted scientific names are kept in the database. Such records are listed in the column 'taxonomic_status' as "accepted", "accepted | replaceSynonym", "accepted | wasMisspelled" or "accepted | wasMisspelled | replaceSynonym". It is also possible to customize the list of taxonomic notes to be kept in the argument 'taxonomic_status'. See 'notes' in the data.frame resulting from the function bdc_create_report. If 'opposite' is TRUE, records with notes different from the names listed in 'taxonomic_status' are returned.
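The filtering rule described above can be sketched in base R as follows. This is an illustration, not the package code: the helper `filter_by_notes` is hypothetical, and expanding "accepted" into the four accepted notes is an assumption based on the list given in the text.

```r
# Hypothetical helper mirroring the described behaviour of the
# 'taxonomic_status' and 'opposite' arguments.
filter_by_notes <- function(data, col_name = "notes",
                            taxonomic_status = "accepted", opposite = FALSE) {
  accepted_notes <- c(
    "accepted",
    "accepted | replaceSynonym",
    "accepted | wasMisspelled",
    "accepted | wasMisspelled | replaceSynonym"
  )
  # Assumption: the default "accepted" stands for all four accepted notes.
  wanted <- if (identical(taxonomic_status, "accepted")) {
    accepted_notes
  } else {
    taxonomic_status
  }
  keep <- data[[col_name]] %in% wanted
  if (opposite) keep <- !keep
  data[keep, , drop = FALSE]
}

df_notes <- data.frame(
  notes = c(
    "notFound", "accepted", "accepted | replaceSynonym",
    "accepted | wasMisspelled", "accepted | wasMisspelled | replaceSynonym",
    "multipleAccepted", "heterotypic synonym"
  ),
  stringsAsFactors = FALSE
)
accepted_only <- filter_by_notes(df_notes)
needs_review <- filter_by_notes(df_notes, opposite = TRUE)
print(needs_review$notes)
```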
Value
A data.frame filtered according to the names listed in 'taxonomic_status'.
See Also
Other taxonomy: bdc_clean_names(), bdc_query_names_taxadb()
Examples
df_notes <- data.frame(
  notes = c(
    "notFound", "accepted", "accepted | replaceSynonym",
    "accepted | wasMisspelled",
    "accepted | wasMisspelled | replaceSynonym",
    "multipleAccepted", "heterotypic synonym"
  )
)
bdc_filter_out_names(
  data = df_notes,
  taxonomic_status = "accepted",
  col_name = "notes",
  opposite = FALSE
)
Harmonizing taxon names against locally stored taxonomic databases
Description
Harmonizing taxon names against locally stored taxonomic databases
Usage
bdc_query_names_taxadb(
  sci_name,
  replace_synonyms = TRUE,
  suggest_names = TRUE,
  suggestion_distance = 0.9,
  db = "gbif",
  rank_name = NULL,
  rank = NULL,
  parallel = FALSE,
  ncores = 2,
  export_accepted = FALSE
)
Arguments
sci_name | character string. Containing scientific names to be queried. |
replace_synonyms | logical. Should synonyms be replaced by accepted names? Default = TRUE. |
suggest_names | logical. Tries to find potential candidate names for misspelled names not resolved by an exact match. Default = TRUE. |
suggestion_distance | numeric. A threshold value determining the acceptable orthographical distance between searched and candidate names. Names with a matching distance value lower than the threshold are returned as NA. Default = 0.9. |
db | character string. The name of the taxonomic database to be used in harmonizing taxon names. Default = "gbif". Use "all" to install all available taxonomic databases automatically. |
rank_name | character string. Taxonomic rank name (e.g., "Plantae", "Animalia", "Aves", "Carnivora"). Default = NULL. |
rank | character string. A taxonomic rank used to filter the taxonomic database. Options available are: "kingdom", "phylum", "class", "order", "family", and "genus". |
parallel | logical. Should a parallelization process be used? Default = FALSE. |
ncores | numeric. The number of cores to run in parallel. |
export_accepted | logical. Should a table containing records with names linked to multiple accepted names be saved for further inspection? Default = FALSE. |
Details
The taxonomic harmonization is based upon one taxonomic authority database. The latest version of each database is used to perform queries, but note that only older versions are available for some taxonomic databases. The database version is shown in parentheses. Note that some databases are momentarily unavailable in taxadb.
itis: Integrated Taxonomic Information System (v. 2022)
ncbi: National Center for Biotechnology Information (v. 2022)
col: Catalogue of Life (v. 2022)
tpl: The Plant List (v. 2019)
gbif: Global Biodiversity Information Facility (v. 2022)
fb: FishBase (v. 2019)
slb: SeaLifeBase (unavailable)
wd: Wikidata (unavailable)
ott: OpenTree Taxonomy (v. 2021)
iucn: International Union for Conservation of Nature (v. 2019)
bdc_query_names_taxadb proceeds as follows:
Creation of a local taxonomic database
This is a one-time setup used to download, extract, and import the taxonomic databases specified in the argument "db". The downloading process may take a few minutes depending on your connection and database size. By default, the "gbif" database following a Darwin Core schema is installed (see ?taxadb::td_create for details).
Taxonomic harmonization
The taxonomic harmonization is divided into two distinct phases according to the matching type to be undertaken.
Exact matching
Firstly, the algorithm attempts to find an exact match for each original scientific name supplied using the function "filter_name" from the taxadb package. If an exact match cannot be found, names are returned as Not Available (NA). Also, it is possible that a scientific name matches multiple accepted names. In such cases, the "bdc_clean_duplicates" function is used to flag and remove names with multiple accepted names.
Information on higher taxa (e.g., kingdom or phylum) can be used to disambiguate names linked to multiple accepted names. For example, the genus "Casearia" is present in both Animalia and Plantae kingdoms. When handling names of Plantae, it would be helpful to get rid of names belonging to Animalia to avoid flagging "Casearia" as having multiple accepted names. Following Norman et al. (2020), such cases are left to be fixed by the user. If "export_accepted" = TRUE, a database containing a list of all records with names linked to multiple accepted names is saved in the "Output" folder.
Fuzzy matching
Fuzzy matching is applied when "suggest_names" is TRUE and only for names not resolved by an exact match. In such cases, a fuzzy matching algorithm processes name-matching queries to find a potential matching candidate from the specified taxonomic database. Fuzzy matching identifies probable names (here identified as suggested names) for original names via a measure of orthographic similarity (i.e., distance). Orthographic distance is calculated by optimal string alignment (restricted Damerau-Levenshtein distance), which counts the number of deletions, insertions, substitutions, and transpositions of adjacent characters. The resulting score ranges from 0 to 1, where 1 indicates a perfect match. A threshold distance, i.e., the lowest acceptable match value, can be supplied by the user (in the "suggestion_distance" argument). If the distance of a candidate name is equal to or higher than the distance supplied by the user, the candidate name is returned as a suggested name. Otherwise, names are returned as NA.
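To make the scoring concrete, the sketch below implements optimal string alignment in base R and turns the edit count into a 0 to 1 score. It is an illustration only: the helpers `osa_distance`, `name_similarity`, and `suggest_name` are hypothetical, and normalizing by the length of the longer name is an assumption of this sketch, not a statement about how the package scales the distance.

```r
# Optimal string alignment (restricted Damerau-Levenshtein) distance.
osa_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] holds the distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(x[i] != y[j])
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1, # deletion
        d[i + 1, j] + 1, # insertion
        d[i, j] + cost   # substitution
      )
      # Transposition of two adjacent characters.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1)
      }
    }
  }
  d[n + 1, m + 1]
}

# Assumption: similarity = 1 - distance / length of the longer name.
name_similarity <- function(a, b) {
  1 - osa_distance(a, b) / max(nchar(a), nchar(b))
}

# Return the best candidate if it reaches the threshold, otherwise NA.
suggest_name <- function(query, candidates, threshold = 0.9) {
  scores <- vapply(candidates, function(cand) name_similarity(query, cand), numeric(1))
  best <- which.max(scores)
  if (scores[best] >= threshold) candidates[best] else NA_character_
}

print(suggest_name("Siparuna guianenssi", c("Siparuna guianensis", "Scinax ruber")))
print(suggest_name("Caseria", "Casearia"))
```

Under this normalization a single edit in a short name can fall below the 0.9 threshold (one edit in eight characters scores 0.875), while the same edit in a longer binomial passes.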
To increase the probability of finding a potential match candidate and to save time, two steps are taken before conducting fuzzy matching. First, if supplied, information on a higher taxon (e.g., kingdom, family) is used to filter the taxonomic database. This step removes matching ambiguity by avoiding matches against names from unrelated taxonomic groups (e.g., matching a plant species against a taxonomic database containing animal names) and decreases the number of names in the taxonomic database used to calculate the matching distance. Then, the taxonomic database is filtered according to the set of first letters of all input names. This process reduces the number of names in the taxonomic database to which each original name must be compared. When the same suggested name is returned for different input names, a warning is returned asking users to check whether the suggested name is valid.
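The two pre-filtering steps can be illustrated on a toy table. This is a sketch under stated assumptions: the data frame taxo_db and its rows are made up for illustration, and only the first letter of each name is used, since the number of leading letters the package compares is not stated here.

```r
# Toy stand-in for a taxonomic database (illustrative rows only).
taxo_db <- data.frame(
  scientificName = c(
    "Casearia sylvestris", "Casearia", "Cedrela fissilis",
    "Ocotea odorifera", "Panthera onca"
  ),
  kingdom = c("Plantae", "Animalia", "Plantae", "Plantae", "Animalia"),
  stringsAsFactors = FALSE
)

# Misspelled input names still awaiting a fuzzy match.
input_names <- c("Casearia silvestris", "Cedrela fisilis")

# Step 1: keep only the higher taxon of interest.
step1 <- taxo_db[taxo_db$kingdom == "Plantae", ]

# Step 2: keep only candidates sharing a first letter with any input name.
first_letters <- unique(toupper(substr(input_names, 1, 1)))
keep <- toupper(substr(step1$scientificName, 1, 1)) %in% first_letters
step2 <- step1[keep, ]

step2$scientificName
```

Only the two plant names starting with "C" remain as candidates, so each input name is compared against two rows instead of five.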
Report
The quality of the name harmonization process can be assessed in the column "notes" of the table resulting from the name harmonization process. The column "notes" contains assertions on the name harmonization process based on Carvalho (2017). The notes can be grouped into two categories: accepted names and names with a taxonomic issue or warning that need further inspection. Accepted names can be returned as "accepted" (valid accepted name), "replaceSynonym" (a synonym replaced by an accepted name), "wasMisspelled" (original name was misspelled), "wasMisspelled | replaceSynonym" (misspelled synonym replaced by an accepted name), and "synonym" (original name is a synonym without accepted names in the database). Similarly, the following notes are used to flag taxonomic issues: "notFound" (no matching name found), "multipleAccepted" (name with multiple accepted names), "noAcceptedName" (no accepted name found), and ambiguous synonyms such as "heterotypic synonym", "homotypic synonym", and "pro-parte synonym". Ambiguous synonyms, i.e., names that have been published more than once describing different species, have more than one accepted name and cannot be resolved. Such cases are flagged and left to be determined by the user.
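As a rough illustration of how the two categories of notes might be separated after harmonization, the sketch below uses a made-up result table. The column "input_name", the column "needs_review", and the row values are hypothetical, and the exact spelling of the combined note (spacing around "|") is assumed from the text above.

```r
# Notes listed above as belonging to the "accepted names" category.
accepted_notes <- c(
  "accepted", "replaceSynonym", "wasMisspelled",
  "wasMisspelled | replaceSynonym", "synonym"
)

# Hypothetical harmonization output; only the "notes" values come from the text.
res <- data.frame(
  input_name = c("name A", "name B", "name C", "name D"),
  notes = c(
    "accepted", "notFound",
    "wasMisspelled | replaceSynonym", "multipleAccepted"
  ),
  stringsAsFactors = FALSE
)

# Anything outside the accepted category needs further inspection.
res$needs_review <- !(res$notes %in% accepted_notes)

res[res$needs_review, ]
```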
Value
This function returns a data.frame containing the results of the taxonomic harmonization process. The database is returned in the same order as sci_name.
See Also
Other taxonomy: bdc_clean_names(), bdc_filter_out_names()
Examples
if (interactive()) {
  sci_name <- c(
    "Polystachya estrellensis",
    "Tachigali rubiginosa",
    "Oxalis rhombeo ovata",
    "Axonopus canescens",
    "Prosopis",
    "Haematococcus salinus",
    "Monas pulvisculus",
    "Cryptomonas lenticulari",
    "Poincianella pyramidalis",
    "Hymenophyllum polyanthos"
  )

  names_harmonization <- bdc_query_names_taxadb(
    sci_name,
    replace_synonyms = TRUE,
    suggest_names = TRUE,
    suggestion_distance = 0.9,
    db = "gbif",
    parallel = TRUE,
    ncores = 2,
    export_accepted = FALSE
  )
}

Create a map of points using ggplot2
Description
Creates a map of points using ggplot2, useful for inspecting the results of tests implemented in the bdc package.
Usage
bdc_quickmap(
  data,
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  col_to_map = "red",
  size = 1
)

Arguments
data | data.frame. Containing geographical coordinates. Coordinates must be expressed in decimal degrees and in WGS84. |
lat | character string. The column name with latitude. Coordinates must be expressed in decimal degrees and in WGS84. Default = "decimalLatitude". |
lon | character string. The column name with longitude. Coordinates must be expressed in decimal degrees and in WGS84. Default = "decimalLongitude". |
col_to_map | character string. Defines the column or color used to map. It can be a color name (e.g., "red") or the name of a column of data. Default = "red". |
size | numeric. The size of the points. Default = 1. |
Details
Only records with valid coordinates can be plotted. Records with missing or invalid coordinates are removed prior to creating the map.
Value
A map of points created using ggplot2.
Examples
decimalLatitude <- c(19.9358, -13.016667, -19.935800)
decimalLongitude <- c(-40.6003, -39.6, -40.60030)
.coordinates_out_country <- c(FALSE, TRUE, TRUE)
x <- data.frame(decimalLatitude, decimalLongitude, .coordinates_out_country)

bdc_quickmap(
  data = x,
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  col_to_map = ".coordinates_out_country",
  size = 1
)

Identify records with empty scientific names
Description
Flags records with empty or not interpretable scientific names.
Usage
bdc_scientificName_empty(data, sci_names = "scientificName")

Arguments
data | data.frame. Containing the species scientific names. |
sci_names | character string. The column name with the species scientific name. Default = "scientificName". |
Details
This test identifies records missing scientific names (i.e., empty or not applicable (NA) names).
Value
A data.frame containing the column ".scientificName_empty". Compliant (TRUE) if 'sci_names' is not empty; otherwise "FALSE".
See Also
Other prefilter: bdc_basisOfRecords_notStandard(), bdc_coordinates_country_inconsistent(), bdc_coordinates_empty(), bdc_coordinates_from_locality(), bdc_coordinates_outOfRange(), bdc_coordinates_transposed(), bdc_country_standardized()
Examples
x <- data.frame(scientificName = c("Ocotea odorifera", NA, "Panthera onca", ""))
bdc_scientificName_empty(data = x, sci_names = "scientificName")

Standardize datasets columns based on metadata
Description
This function's main goal is to merge and standardize different datasets into a new dataset with column names following the Darwin Core terminology. The whole process is based on a metadata file provided by the user.
Usage
bdc_standardize_datasets(
  metadata,
  format = "csv",
  overwrite = FALSE,
  save_database = FALSE
)

Arguments
metadata | A data frame with metadata containing information about the name, path, and columns of the original data set which need to be renamed. See Details. |
format | a character setting the output file type. Options available are "csv" and "qs" (recommended to save large datasets). Default = "csv". |
overwrite | A logical vector indicating whether the final merged dataset should be overwritten. The default is FALSE. |
save_database | logical. Should the standardized database be locally saved? Default = FALSE. |
Details
bdc_standardize_datasets() facilitates the standardization of datasets with different column names by converting them into a new dataset following the Darwin Core terminology. The standardization process relies on a metadata file containing the name, path, and columns that need to be renamed. The metadata file can be constructed using built-in functions (e.g., data.frame()) or by storing the information in a CSV file and importing it into R. Regardless of the method chosen, the data frame with metadata needs to contain the following column names (this is a list of required column names; for a comprehensive list of column names following Darwin Core terminology
datasetName: A short name identifying the dataset (e.g., GBIF)
fileName: The relative path containing the name of the input dataset (e.g., Input_files/GBIF.csv)
scientificName: Name of the column in the original database presenting the taxon scientific names with or without authorship information, depending on the format of the source dataset (e.g., Myrcia acuminata)
decimalLatitude: Name of the column in the original database presenting the geographic latitude in decimal degrees (e.g., -6.370833)
decimalLongitude: Name of the column in the original database presenting the geographic longitude in decimal degrees (e.g., -3.25500)
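A metadata data frame with the required columns listed above could be built as in the following sketch. The second dataset ("local_survey") and the original column names ("species", "lat", "lon", and so on) are hypothetical placeholders; each row maps one input file's own column names onto the Darwin Core terms.

```r
# Each row describes one input dataset. Values in the Darwin Core columns are
# the names of the corresponding columns in that dataset's original file.
metadata <- data.frame(
  datasetName = c("GBIF", "local_survey"),
  fileName = c("Input_files/GBIF.csv", "Input_files/local_survey.csv"),
  scientificName = c("species", "taxon_name"),   # hypothetical column names
  decimalLatitude = c("lat", "latitude"),        # hypothetical column names
  decimalLongitude = c("lon", "longitude"),      # hypothetical column names
  stringsAsFactors = FALSE
)

# Check that all required columns are present before standardizing.
required <- c(
  "datasetName", "fileName", "scientificName",
  "decimalLatitude", "decimalLongitude"
)
missing_cols <- setdiff(required, names(metadata))
stopifnot(length(missing_cols) == 0)
```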
Value
A merged data.frame with column names following Darwin Core terminology.
Examples
## Not run:
metadata <- readr::read_csv(
  system.file("extdata/Config/DatabaseInfo.csv", package = "bdc")
)

db_standardized <- bdc_standardize_datasets(
  metadata = metadata,
  format = "csv",
  overwrite = TRUE,
  save_database = FALSE
)
## End(Not run)

Create or update the column summarizing the results of data quality tests
Description
This function creates or updates the column ".summary" summarizing the results of data quality tests (i.e., columns starting with "."). Records that have failed at least one test are flagged for further inspection (i.e., flagged as "FALSE") in the ".summary" column.
Usage
bdc_summary_col(data)

Arguments
data | data.frame. Containing the results of data quality tests (i.e., columns starting with "."). |
Details
If it already exists, the column ".summary" is removed and then recreated considering all test columns available in the supplied database.
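The behaviour described above can be approximated in base R as follows. This is a sketch, not the package's code: update_summary() is a hypothetical helper, and the treatment of NA test results (here inherited from R's `&` operator) is an assumption.

```r
# Hypothetical helper: a record passes only if every "." test column is TRUE.
update_summary <- function(data) {
  # Drop a pre-existing summary so it is not counted as a test.
  data <- data[setdiff(names(data), ".summary")]
  test_cols <- grep("^\\.", names(data), value = TRUE)
  # Combine the test columns element-wise with logical AND.
  data[[".summary"]] <- Reduce(`&`, data[test_cols])
  data
}

x <- data.frame(
  .missing_names = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  .missing_coordinates = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

x <- update_summary(x)
x$.summary
```

Calling update_summary() a second time leaves the result unchanged, because the old ".summary" column is dropped before the tests are combined again.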
Value
A data.frame containing a new or an updated column ".summary".
Examples
.missing_names <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
.missing_coordinates <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
x <- data.frame(.missing_names, .missing_coordinates)

bdc_summary_col(data = x)

Extract year from eventDate
Description
This function extracts a four-digit year from unambiguously interpretable collecting dates.
Usage
bdc_year_from_eventDate(data, eventDate = "eventDate")

Arguments
data | A data frame containing a column with event date information. |
eventDate | Numeric or date. The column with event date information. |
Value
A data.frame containing the column "year". Year information is returned only if it can be unambiguously interpreted from "eventDate". Years in the future (e.g., 2050) are returned as NA, as are years before 1600, which is the lower limit for collecting dates of biological specimens.
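The rule described above can be approximated with a regular expression. This is a simplified sketch and not the package's date parser: extract_year() is a hypothetical helper that takes the first stand-alone run of four digits as the year and then applies the 1600-to-current-year window.

```r
# Hypothetical helper: first isolated four-digit run, kept only if plausible.
extract_year <- function(x, min_year = 1600L,
                         max_year = as.integer(format(Sys.Date(), "%Y"))) {
  x <- as.character(x)
  x[is.na(x)] <- ""
  # Four digits not preceded or followed by another digit.
  m <- regexpr("(?<![0-9])[0-9]{4}(?![0-9])", x, perl = TRUE)
  year <- rep(NA_integer_, length(x))
  year[m > 0] <- as.integer(regmatches(x, m))
  # Years outside the accepted window become NA.
  year[which(year < min_year | year > max_year)] <- NA_integer_
  year
}

collection_date <- c(
  NA, "31/12/2015", "2013-06-13T00:00:00Z", "2019-05-20",
  "", "2013", "0001-01-00", "20", "1200"
)

extract_year(collection_date)
```

Under this sketch, "0001-01-00" and "1200" yield NA because they fall before 1600, and "20" yields NA because it contains no four-digit run.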
See Also
Other time: bdc_eventDate_empty(), bdc_year_outOfRange()
Examples
collection_date <- c(
  NA, "31/12/2015", "2013-06-13T00:00:00Z", "2019-05-20",
  "", "2013", "0001-01-00", "20", "1200"
)
x <- data.frame(collection_date)

bdc_year_from_eventDate(data = x, eventDate = "collection_date")

Identify records with year out-of-range
Description
This function identifies records with an out-of-range collecting year (e.g., in the future) or old records collected before the year set in 'year_threshold'.
Usage
bdc_year_outOfRange(data, eventDate, year_threshold = 1900)

Arguments
data | A data frame containing a column with event date information. |
eventDate | numeric or date. The column containing event date information. |
year_threshold | numeric. A four-digit year threshold used to flag old (potentially invalid) records. Default = 1900. |
Details
Following "VALIDATION:YEAR_OUTOFRANGE" of the Biodiversity data quality group, the results of this test are time-dependent. While the user may provide a lower limit to the year, the upper limit is defined based on the year when the test is run. Lower limits can be used to flag old, often imprecise, records, for example, records collected before the advent of GPS (1980). If 'year_threshold' is not provided, the lower limit to the year is by default 1600, a lower limit for collecting dates of biological specimens. Records with empty or NA 'eventDate' are not tested and are returned as NA.
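The time-dependent window described above can be sketched as a simple comparison on already-extracted years. The helper year_in_range() and its argument current_year are hypothetical; a year equal to the threshold is assumed to be compliant.

```r
# Hypothetical helper: TRUE when the year lies within [year_threshold, current year].
# The upper bound is taken from the system clock, so results change over time.
year_in_range <- function(year, year_threshold = 1900,
                          current_year = as.integer(format(Sys.Date(), "%Y"))) {
  # NA years propagate as NA, i.e., they are not tested.
  year >= year_threshold & year <= current_year
}

years <- c(NA, 2013, 1650, 3000)
year_in_range(years, year_threshold = 1900)
```

Passing current_year explicitly makes the check reproducible, which may be useful when results from different run dates need to be compared.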
Value
A data.frame containing the column ".year_outOfRange". Compliant (TRUE) if 'eventDate' is not out-of-range; otherwise "FALSE".
See Also
Other time: bdc_eventDate_empty(), bdc_year_from_eventDate()
Examples
collection_date <- c(
  NA, "31/12/2029", "2013-06-13T00:00:00Z", "2013-06-20",
  "", "2013", 1650, "0001-01-00"
)
x <- data.frame(collection_date)

bdc_year_outOfRange(
  data = x,
  eventDate = "collection_date",
  year_threshold = 1900
)