
Handling biodiversity data from several different sources is not an easy task. Here, we present Biodiversity Data Cleaning (bdc), an R package to address quality issues and improve the fitness-for-use of biodiversity datasets. bdc contains functions to harmonize and integrate data from different sources following common standards and protocols, and implements various tests and tools to flag, document, clean, and correct taxonomic, spatial, and temporal data.
Compared to other available R packages, the main strength of the bdc package is that it brings together available tools, along with a series of new ones, to assess the quality of different dimensions of biodiversity data in a single, flexible toolkit. The functions can be applied to many taxonomic groups, datasets (including regional or local repositories), and countries, or worldwide.
The bdc toolkit is organized in thematic modules related to different biodiversity dimensions.
:warning: The modules illustrated, and the functions within them, were linked to form a proposed reproducible workflow (see vignettes). However, all functions can also be executed independently.
Standardization and integration of different datasets into a standarddatabase.
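
As a rough illustration of this kind of standardization, the base-R sketch below renames the columns of two small, made-up datasets to Darwin Core terms and stacks them into one table. The input column names, the mappings, and the helper `rename_to_dwc()` are hypothetical and exist only for this example; this is not the package's own implementation.

```r
# Two made-up source tables with different column names (illustrative only).
source_a <- data.frame(
  species = c("Panthera onca", "Puma concolor"),
  lat = c(-15.6, -22.9),
  lon = c(-56.1, -43.2),
  stringsAsFactors = FALSE
)
source_b <- data.frame(
  taxon_name = "Tapirus terrestris",
  latitude = -3.1,
  longitude = -60.0,
  stringsAsFactors = FALSE
)

# Hypothetical mappings: Darwin Core term = column name in the source table.
map_a <- c(scientificName = "species",
           decimalLatitude = "lat",
           decimalLongitude = "lon")
map_b <- c(scientificName = "taxon_name",
           decimalLatitude = "latitude",
           decimalLongitude = "longitude")

# Hypothetical helper: keep the mapped columns and rename them.
rename_to_dwc <- function(data, mapping, dataset_name) {
  out <- data[, unname(mapping), drop = FALSE]
  names(out) <- names(mapping)
  out$datasetName <- dataset_name  # remember where each record came from
  out
}

# Stack the renamed tables into a single dataset.
merged <- rbind(
  rename_to_dwc(source_a, map_a, "source_a"),
  rename_to_dwc(source_b, map_b, "source_b")
)
print(merged)
```

Keeping the mapping as data, separate from the renaming logic, means a new source only needs a new mapping vector.
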
- `bdc_standardize_datasets()`: Standardization and integration of different datasets into a new dataset with column names following Darwin Core terminology

Flagging and removal of invalid or non-interpretable information, followed by data amendments (e.g., correcting transposed coordinates and standardizing country names).
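
To make the idea of flagging invalid information concrete, here is a minimal base-R sketch that marks records with missing or out-of-range coordinates. The example data, the flag column names, and the convention that `TRUE` means "passed" are assumptions of this sketch, not a description of the package internals.

```r
# Made-up records using Darwin Core column names (illustrative only).
records <- data.frame(
  scientificName = c("Panthera onca", "Puma concolor",
                     "Tapirus terrestris", "Mazama americana"),
  decimalLatitude = c(-15.6, NA, 95.2, -3.1),
  decimalLongitude = c(-56.1, -43.2, -60.0, -190.5)
)

lat <- records$decimalLatitude
lon <- records$decimalLongitude

# In this sketch, TRUE means the record passed the check and FALSE means flagged.
records$.coordinates_empty <- !(is.na(lat) | is.na(lon))

# Valid ranges: latitude in [-90, 90], longitude in [-180, 180].
# Records with missing coordinates are left to the check above and pass here.
in_range <- abs(lat) <= 90 & abs(lon) <= 180
records$.coordinates_outOfRange <- is.na(in_range) | in_range

print(records)
```

Storing each result in its own column, rather than dropping rows immediately, keeps the reason for every flag available for later inspection.
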
- `bdc_scientificName_empty()`: Identification of records lacking names or with names that are not interpretable
- `bdc_coordinates_empty()`: Identification of records lacking information on latitude or longitude
- `bdc_coordinates_outOfRange()`: Identification of records with out-of-range coordinates (latitude > 90 or < -90; longitude > 180 or < -180)
- `bdc_basisOfRecords_notStandard()`: Identification of records from doubtful sources (e.g., fossil or machine observation) that are impossible to interpret and not compatible with the Darwin Core recommended vocabulary
- `bdc_country_from_coordinates()`: Derivation of country names from valid geographic coordinates
- `bdc_country_standardized()`: Standardization of country names and retrieval of country codes
- `bdc_coordinates_transposed()`: Identification of records with potentially transposed latitude and longitude
- `bdc_coordinates_country_inconsistent()`: Identification of coordinates in other countries or beyond a specified distance from the coast of a reference country (i.e., in the ocean)
- `bdc_coordinates_from_locality()`: Identification of records lacking coordinates but with a detailed locality description from which coordinates can be derived

Cleaning, parsing, and harmonization of scientific names against multiple taxonomic references.
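
As a deliberately naive illustration of name parsing, the base-R sketch below treats the first two words of a name as the binomial and everything after them as the authority. The helper `split_name()` is hypothetical, and real names (subspecies, hybrids, qualifiers such as "cf." or "aff.") need far more careful handling than this.

```r
# Hypothetical helper: first two words = binomial, remaining words = authority.
split_name <- function(sci_name) {
  # Trim the ends and split on any run of whitespace.
  words <- strsplit(trimws(sci_name), "\\s+")[[1]]
  binomial <- paste(words[seq_len(min(2, length(words)))], collapse = " ")
  authority <- if (length(words) > 2) {
    paste(words[-(1:2)], collapse = " ")
  } else {
    NA_character_  # no authority given
  }
  data.frame(input = sci_name, binomial = binomial, authority = authority,
             stringsAsFactors = FALSE)
}

names_in <- c("Panthera onca (Linnaeus, 1758)",
              "  Puma   concolor ",
              "Tapirus terrestris L.")

# Parse each name and bind the one-row results into a single table.
parsed <- do.call(rbind, lapply(names_in, split_name))
print(parsed)
```
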
- `bdc_clean_names()`: Name-checking routines to clean and split a taxonomic name into its binomial and authority components
- `bdc_query_names_taxadb()`: Harmonization of scientific names by correcting spelling errors and converting nomenclatural synonyms to currently accepted names
- `bdc_filter_out_names()`: Filtering of records according to their taxonomic status, as recorded in the column “notes”. For example, to keep only records whose names are categorized as “accepted”

Flagging of erroneous, suspicious, and low-precision geographic coordinates.
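
One way to think about low-precision coordinates is to count decimal places. The base-R sketch below does this on the text form of each number and passes a record only when both latitude and longitude reach a chosen threshold. The helper `count_decimals()`, the column `precise_enough`, the threshold, and the "both must pass" rule are assumptions made for this example.

```r
# Hypothetical helper: count digits after the decimal point in the text form.
# Assumes plain decimal notation (no scientific notation such as 1e-05).
count_decimals <- function(x) {
  txt <- as.character(x)
  has_decimal <- grepl(".", txt, fixed = TRUE)
  ifelse(has_decimal, nchar(sub("^[^.]*\\.", "", txt)), 0L)
}

coords <- data.frame(
  decimalLatitude = c(-15.601, -23, -3.1),
  decimalLongitude = c(-56.097, -43.2, -60)
)

min_decimals <- 2  # illustrative threshold

# TRUE = both coordinates have at least `min_decimals` decimal places.
lat_ok <- count_decimals(coords$decimalLatitude) >= min_decimals
lon_ok <- count_decimals(coords$decimalLongitude) >= min_decimals
coords$precise_enough <- lat_ok & lon_ok

print(coords)
```
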
- `bdc_coordinates_precision()`: Identification of records with a coordinate precision below a specified number of decimal places
- `clean_coordinates()` (from the CoordinateCleaner package and part of the data-cleaning workflow): Identification of potentially problematic geographic coordinates based on geographic gazetteers and metadata. Includes tests for flagging records around country capitals or country or province centroids, duplicated records, records with equal coordinates, records around biodiversity institutions or within urban areas, records with plain zeros in the coordinates, and suspect geographic outliers

Flagging and, whenever possible, correction of inconsistent collection dates.
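
A minimal base-R sketch of this kind of date check is shown below: it pulls the first four-digit run out of each date string and flags years that lie in the future or before a chosen threshold. The example dates, the threshold of 1900, and the variable names are assumptions of this sketch, and it makes no attempt to decide whether a date is ambiguous.

```r
# Made-up event dates in mixed formats (illustrative only).
event_date <- c("1998-05-12", "12/05/1850", "", "2999-01-01", "unknown")

# Find the first run of four consecutive digits in each string.
m <- regexpr("[0-9]{4}", event_date)

# Strings without a match keep a missing year.
year <- rep(NA_integer_, length(event_date))
year[m > 0] <- as.integer(regmatches(event_date, m))

min_year <- 1900  # illustrative lower bound
current_year <- as.integer(format(Sys.Date(), "%Y"))

# In this sketch, TRUE = plausible or missing year, FALSE = flagged.
year_ok <- is.na(year) | (year >= min_year & year <= current_year)

print(data.frame(event_date, year, year_ok))
```
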
- `bdc_eventDate_empty()`: Identification of records lacking information on event date (i.e., when a record was collected or observed)
- `bdc_year_outOfRange()`: Identification of records with an illegitimate or potentially imprecise collecting year. The year can be out of range (e.g., in the future) or earlier than a year supplied by the user (e.g., 1900)
- `bdc_year_from_eventDate()`: Extraction of a four-digit year from unambiguously interpretable collecting dates

To facilitate the documentation, visualization, and interpretation of the results of data-quality tests, the package contains functions for documenting the results of the data-cleaning tests, including functions for saving i) records needing further inspection, ii) figures, and iii) data-quality reports.
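
The base-R sketch below shows one way such documentation could be assembled from test results: a per-test count of flagged records, a combined summary column, and the subset of records needing inspection. It assumes that test results are stored as logical columns whose names start with "." and that `FALSE` marks a flagged record; the data and object names are made up for this example.

```r
# Made-up records with two logical flag columns (TRUE = passed, FALSE = flagged).
flagged <- data.frame(
  scientificName = c("Panthera onca", "Puma concolor", "Tapirus terrestris"),
  .coordinates_empty = c(TRUE, FALSE, TRUE),
  .year_outOfRange = c(TRUE, TRUE, FALSE),
  check.names = FALSE
)

# Flag columns are identified by their leading dot.
flag_cols <- grep("^\\.", names(flagged), value = TRUE)

# Small report: number of records flagged by each test.
report <- data.frame(
  test = flag_cols,
  n_flagged = vapply(flagged[flag_cols], function(x) sum(!x), integer(1)),
  row.names = NULL
)

# A record passes overall only if it passes every test.
flagged$.summary <- Reduce(`&`, flagged[flag_cols])

# Records that failed at least one test and need a closer look.
needs_inspection <- flagged[!flagged$.summary, ]

print(report)
print(needs_inspection)
```
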
- `bdc_create_report()`: Creation of data-quality reports documenting the results of data-quality tests and the taxonomic harmonization process
- `bdc_create_figures()`: Creation of figures (i.e., bar plots and maps) reporting the results of data-quality tests
- `bdc_filter_out_flags()`: Removal of columns containing the results of data-quality tests (i.e., columns starting with “.”) or other specified columns
- `bdc_quickmap()`: Creation of a map of points using ggplot2. Helpful for inspecting the results of data-cleaning tests
- `bdc_summary_col()`: Creation or update of the column summarizing the results of data-quality tests (i.e., the column “.summary”)

You can install the released version of bdc using:

    install.packages("bdc")
    library(bdc)

or the development version from GitHub using:

    install.packages("remotes")
    remotes::install_github("brunobrr/bdc")

Load the package with:

    library(bdc)

See the bdc package website (https://brunobrr.github.io/bdc/) for a detailed explanation of each module.
If you encounter a clear bug, please file an issue in the GitHub repository (brunobrr/bdc). For questions or suggestions, please send us an email (ribeiro.brr@gmail.com).
Ribeiro, BR; Velazco, SJE; Guidoni-Martins, K; Tessarolo, G; Jardim, L; Bachman, SP; Loyola, R (2022). bdc: A toolkit for standardizing, integrating, and cleaning biodiversity data. Methods in Ecology and Evolution. doi.org/10.1111/2041-210X.13868