| Version: | 1.0.0 |
| Title: | High Performance Interface to 'GBIF' |
| Description: | A high performance interface to the Global Biodiversity Information Facility, 'GBIF'. In contrast to 'rgbif', which can access small subsets of 'GBIF' data through web-based queries to a central server, 'gbifdb' provides enhanced performance for R users performing large-scale analyses on servers and cloud computing providers, providing full support for arbitrary 'SQL' or 'dplyr' operations on the complete 'GBIF' data tables (now over 1 billion records, and over a terabyte in size). 'gbifdb' accesses a copy of the 'GBIF' data in 'parquet' format, which is already readily available in commercial computing clouds such as the Amazon Open Data portal and the Microsoft Planetary Computer, or can be accessed directly without downloading, or downloaded to any server with suitable bandwidth and storage space. The high-performance techniques for local and remote access are described in https://duckdb.org/why_duckdb and https://arrow.apache.org/docs/r/articles/fs.html, respectively. |
| License: | Apache License (≥ 2) |
| Encoding: | UTF-8 |
| ByteCompile: | true |
| Depends: | R (≥ 4.0) |
| Imports: | arrow (≥ 8.0.0), dplyr, duckdbfs |
| Suggests: | spelling, dbplyr, testthat (≥ 3.0.0), covr, knitr, rmarkdown, minioclient |
| URL: | https://docs.ropensci.org/gbifdb/, https://github.com/ropensci/gbifdb |
| BugReports: | https://github.com/ropensci/gbifdb |
| Language: | en-US |
| RoxygenNote: | 7.2.3 |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2023-10-19 19:47:13 UTC; cboettig |
| Author: | Carl Boettiger |
| Maintainer: | Carl Boettiger <cboettig@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2023-10-19 20:30:03 UTC |
gbifdb: High Performance Interface to 'GBIF'
Description
A high performance interface to the Global Biodiversity Information Facility, 'GBIF'. In contrast to 'rgbif', which can access small subsets of 'GBIF' data through web-based queries to a central server, 'gbifdb' provides enhanced performance for R users performing large-scale analyses on servers and cloud computing providers, providing full support for arbitrary 'SQL' or 'dplyr' operations on the complete 'GBIF' data tables (now over 1 billion records, and over a terabyte in size). 'gbifdb' accesses a copy of the 'GBIF' data in 'parquet' format, which is already readily available in commercial computing clouds such as the Amazon Open Data portal and the Microsoft Planetary Computer, or can be accessed directly without downloading, or downloaded to any server with suitable bandwidth and storage space. The high-performance techniques for local and remote access are described in https://duckdb.org/why_duckdb and https://arrow.apache.org/docs/r/articles/fs.html, respectively.
Author(s)
Maintainer: Carl Boettiger <cboettig@gmail.com> (ORCID)
See Also
Useful links:
Report bugs at https://github.com/ropensci/gbifdb
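A minimal sketch of the two access patterns described above; the remote call assumes a live internet connection, and the column names in the query are standard GBIF occurrence fields:
library(gbifdb)
library(dplyr)
## Stream directly from the public parquet snapshots on AWS:
gbif <- gbif_remote()
## Or download once, then query a local copy:
# gbif_download()
# gbif <- gbif_local()
gbif %>%
  select(species, year) %>%
  head() %>%
  collect()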
Default storage location
Description
The default location can be set with the environment variable GBIF_HOME; otherwise the default provided by tools::R_user_dir() is used.
Usage
gbif_dir()
Value
path to the GBIF home directory.
Examples
gbif_dir()
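A small sketch of overriding the default location through the environment variable:
Sys.setenv(GBIF_HOME = tempdir())
gbif_dir()            # returns the value of GBIF_HOME
Sys.unsetenv("GBIF_HOME")
gbif_dir()            # falls back to the tools::R_user_dir() default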
Download GBIF data using minioclient
Description
Sync a local directory with selected release of the AWS copy of GBIF
Usage
gbif_download(
  version = gbif_version(),
  dir = gbif_dir(),
  bucket = gbif_default_bucket(),
  region = ""
)
Arguments
version | Release date (YYYY-MM-DD) which should be synced. Will detect the latest version by default. |
dir | path to the local directory where parquet files should be stored. Fine to leave at the default; see gbif_dir(). |
bucket | Name of the regional S3 bucket desired. Default is "gbif-open-data-us-east-1". Select a bucket closer to your compute location for improved performance, e.g. European researchers may prefer "gbif-open-data-eu-central-1" etc. |
region | bucket region (usually unnecessary to set; choosing the appropriate bucket name is sufficient) |
Details
Sync parquet files from the GBIF public data catalog, https://registry.opendata.aws/gbif/.
Note that the data can also be found on the Microsoft Cloud, https://planetarycomputer.microsoft.com/dataset/gbif
Also, some users may prefer to download this data using an alternative interface or work on a cloud-hosted machine where the data is already available. Note, these data include all CC0 and CC-BY licensed data in GBIF that have coordinates which passed automated quality checks; see https://github.com/gbif/occurrence/blob/master/aws-public-data.md.
Value
logical indicating success or failure.
Examples
gbif_download()
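A sketch of a more explicit call; the snapshot date and local directory below are illustrative values, not defaults:
## European users may prefer a bucket in their own region:
gbif_download(
  version = "2023-10-01",
  dir = "~/gbif-data",
  bucket = "gbif-open-data-eu-central-1"
)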
Return a path to the directory containing GBIF example parquet data
Description
Return a path to the directory containing GBIF example parquet data
Usage
gbif_example_data()
Details
Example data are taken from the first 1000 rows of the 2021-11-01 release of the parquet data.
Value
path to the example occurrence data installed with the package.
Examples
gbif_example_data()
Local connection to a downloaded GBIF Parquet database
Description
Local connection to a downloaded GBIF Parquet database
Usage
gbif_local(
  dir = gbif_parquet_dir(version = gbif_version(local = TRUE)),
  tblname = "gbif",
  backend = c("arrow", "duckdb"),
  safe = TRUE
)
Arguments
dir | location of downloaded GBIF parquet files |
tblname | name for the database table |
backend | choose duckdb or arrow. |
safe | logical, default TRUE. Should we exclude the columns mediatype and issue? These columns can slow down queries. |
Details
A summary of this GBIF data, along with column meanings, can be found at https://github.com/gbif/occurrence/blob/master/aws-public-data.md
Value
a remote tibble (tbl_sql class object)
Examples
gbif <- gbif_local(gbif_example_data())
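A brief sketch of a dplyr query against the local connection; countrycode and species are standard columns in the GBIF occurrence table:
library(dplyr)
gbif <- gbif_local(gbif_example_data())
gbif %>%
  filter(countrycode == "US") %>%
  count(species, sort = TRUE) %>%
  collect()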
gbif remote
Description
Connect to GBIF remote directly. Can be much faster than downloading for one-off use or when using the package from a server in the same region as the data. See Details.
Usage
gbif_remote(
  version = gbif_version(),
  bucket = gbif_default_bucket(),
  safe = TRUE,
  unset_aws = getOption("gbif_unset_aws", TRUE),
  endpoint_override = Sys.getenv("AWS_S3_ENDPOINT", "s3.amazonaws.com"),
  backend = c("arrow", "duckdb"),
  ...
)
Arguments
version | GBIF snapshot date |
bucket | GBIF bucket name (including region). A default can also be set using the option gbif_default_bucket. |
safe | logical, default TRUE. Should we exclude the columns mediatype and issue? These columns can slow down queries. |
unset_aws | Unset AWS credentials? GBIF is provided in a public bucket, so credentials are not needed, but having an AWS_ACCESS_KEY_ID or other AWS environment variables set can cause the connection to fail. By default, this will unset any such environment variables for the duration of the R session. This behavior can also be turned off globally by setting the option gbif_unset_aws to FALSE. |
endpoint_override | optional parameter to arrow::s3_bucket() |
backend | duckdb or arrow |
... | additional parameters passed to arrow::s3_bucket() |
Details
Query performance is dramatically improved in queries that return only a subset of columns. Consider using explicit select() commands to return only the columns you need.
A summary of this GBIF data, along with column meanings, can be found at https://github.com/gbif/occurrence/blob/master/aws-public-data.md
Value
a remote tibble (tbl_sql class object).
Examples
gbif <- gbif_remote()
gbif
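Echoing the advice in Details, a sketch that selects only the needed columns before filtering; it assumes a live connection, with species and year as standard occurrence columns:
library(dplyr)
gbif <- gbif_remote()
gbif %>%
  select(species, year) %>%
  filter(year >= 2020) %>%
  count(species, sort = TRUE) %>%
  collect()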
Get the latest gbif version string
Description
Can also return latest locally downloaded version, or list all versions
Usage
gbif_version(
  local = FALSE,
  dir = gbif_dir(),
  bucket = gbif_default_bucket(),
  all = FALSE,
  ...
)
Arguments
local | Search only local versions? logical, default FALSE. |
dir | local directory (see gbif_dir()) |
bucket | Which remote bucket (region) should be checked |
all | show all versions? (logical, default FALSE) |
... | additional arguments to arrow::s3_bucket |
Details
A default version can be set using the option gbif_default_version.
Value
the latest available GBIF version, as a string.
Examples
## Latest local version available:
gbif_version(local = TRUE)
## default version
options(gbif_default_version = "2021-01-01")
gbif_version()
## Latest online version available:
gbif_version()
## All online versions:
gbif_version(all = TRUE)