| Title: | Automatic Database Normalisation for Data Frames |
| Version: | 3.2.4 |
| Description: | Automatic normalisation of a data frame to third normal form, with the intention of easing the process of data cleaning. (Usage to design your actual database for you is not advised.) Originally inspired by the 'AutoNormalize' library for 'Python' by 'Alteryx' (https://github.com/alteryx/autonormalize), with various changes and improvements. Automatic discovery of functional or approximate dependencies, normalisation based on those, and plotting of the resulting "database" via 'Graphviz', with options to exclude some attributes at discovery time, or remove discovered dependencies at normalisation time. |
| License: | BSD_3_clause + file LICENSE |
| Encoding: | UTF-8 |
| Language: | en-GB |
| RoxygenNote: | 7.3.3 |
| Depends: | R (≥ 4.1.0) |
| Suggests: | rlang, spelling, DiagrammeR (≥ 1.0.7), testthat (≥ 3.3.0),R.utils (≥ 2.11.0), hedgehog (≥ 0.1), tibble (≥ 3.2.1),knitr, rmarkdown |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| LazyData: | true |
| URL: | https://charnelmouse.github.io/autodb/,https://github.com/CharnelMouse/autodb |
| BugReports: | https://github.com/CharnelMouse/autodb/issues |
| NeedsCompilation: | no |
| Packaged: | 2025-11-17 21:06:50 UTC; Mark |
| Author: | Mark Webster |
| Maintainer: | Mark Webster <markwebster204@yahoo.co.uk> |
| Repository: | CRAN |
| Date/Publication: | 2025-11-17 21:40:02 UTC |
Database-style normalisation for data.frames
Description
Automatic normalisation of a data.frame to third normal form, with theintention of easing the process of data cleaning. (Usage to design youractual database for you is not advised.) Originally inspired by Alteryx'sAutoNormalize Python library (https://github.com/alteryx/autonormalize), withvarious changes and improvements. Automatic discovery of functional orapproximate dependencies, normalisation based on those, and plotting of theresulting "database" via Graphviz, with options to exclude some attributes atdiscovery time, or remove discovered dependencies at normalisation time.
Author(s)
Maintainer: Mark Webstermarkwebster204@yahoo.co.uk (ORCID)
See Also
Useful links:
Report bugs athttps://github.com/CharnelMouse/autodb/issues
Relational data attributes
Description
Generic function, for fetching attribute sets for elements of a relationalobject.
Usage
attrs(x, ...)attrs(x, ...) <- valueArguments
x | a relational schema object, such as a |
... | further arguments passed on to methods. |
value | A character vector of the same length as |
Value
A list, containing a character vector for each element ofx.
Relational data attribute order
Description
Generic function, fetching attribute order for relational objects.
Usage
attrs_order(x, ...)attrs_order(x, ...) <- valueArguments
x | an R object, such as a |
... | further arguments passed on to methods. |
value | A character vector of the same length as |
Details
All classes inautodb contain anattrs_order attribute. Itgives an easy way to find a list of all attributes/variables involved in anobject, but its main purpose is to also assign those attributes a consistentorder when printing or plotting the object.
Value
A character vector, giving attributes in the order in which they'reprioritised for sorting withinx.
Create a normalised database from a data frame
Description
This is a wrapper function for applyingnormalise,autoref, anddecompose. This takes a data frameand converts it straight into a database, which is the main intended use casefor the package.
Usage
autodb( df, keep_rownames = FALSE, digits = getOption("digits"), single_ref = FALSE, ensure_lossless = TRUE, remove_avoidable = FALSE, constants_name = "constants", progress = FALSE, progress_file = "", ...)Arguments
df | a data.frame, containing the data to be normalised. |
keep_rownames | a logical or a string, indicating whether to include therow names as a column. If a string is given, it is used as the name for thecolumn, otherwise the column is named "row". Like with the other columnnames, the function returns an error if this results in duplicate columnnames. Set to FALSE by default. |
digits | a positive integer, indicating how many significant digits areto be used for numeric and complex variables. This is used for bothpre-formatting in |
single_ref | a logical, FALSE by default. If TRUE, then only onereference between each relation pair is kept when generating foreign keyreferences. If a pair has multiple references, the kept reference refers tothe earliest key for the child relation, as sorted by priority order. |
ensure_lossless | a logical, indicating whether to check whether thenormalisation is lossless. If it is not, then an additional relation isadded to the final "database", containing a key for |
remove_avoidable | a logical, indicating whether to remove avoidableattributes in relations. If so, then an attribute are removed fromrelations if the keys can be changed such that it is not needed to preservethe given functional dependencies. |
constants_name | a scalar character, giving the name for any relationcreated to store constant attributes. If this is the same as a generatedrelation name, it will be changed, with a warning, to ensure that allrelations have a unique name. |
progress | a logical, for whether to display progress to the user duringdependency search in |
progress_file | a scalar character or a connection. If |
... | further arguments passed on to |
Details
Sincedecompose only works with functional dependencies, not approximatedependencies, the accuracy indiscover is fixed as 1.
Value
Adatabase, containing the data indf within theinferred database schema.
Examples
# simple exampleautodb(ChickWeight)Add foreign key references to a normalised database
Description
Adds foreign key references to arelation_schema objectautomatically, replacing any existing references.
Usage
autoref(schema, single_ref = FALSE)Arguments
schema | a |
single_ref | a logical, FALSE by default. If TRUE, then only onereference between each relation pair is kept when generating foreign keyreferences. If a pair has multiple references, the kept reference refers tothe earliest key for the child relation, as sorted by priority order. |
Details
The method for generating references is simple. First, it finds every linkbetween two relation schemas, where the parent contains all the attributes inone of the child's keys. This can be done separately for all of the child'skeys, so there can be multiple links with the same parent and child ifsingle_ref isTRUE.
Second, any transitive references are removed: if there are link relationpairs a -> b, b -> c, and a -> c, then the latter is transitive, and so isremoved. If there is a cyclic reference, e.g. where c -> a, then the choiceof which link to remove is arbitrary. Cycles cannot occur in sets of relationschemas resulting from decomposing a single table.
Value
Adatabase_schema object, containing the given relationschemas and the created foreign key references.
Examples
rs <- relation_schema( list( a_b_c = list(c("a", "b", "c", "d"), list(c("a", "b", "c"))), a_b = list(c("a", "b", "d"), list(c("a", "b"), c("b", "d"))) ), letters[1:4])autoref(rs, single_ref = FALSE)autoref(rs, single_ref = TRUE)Create instance of a schema
Description
Create a relation data object, using the given relational schema object, withthe resulting relations empty and ready for data insertion usinginsert.
Usage
create(x, ...)Arguments
x | a relational schema object, representing the schema to create aninstance of, such as a |
... | further arguments passed on to methods. |
Value
An instance of the schema. For example, callingcreate on adatabase_schema creates adatabase, where allthe relations contain zero records.
Generate D2 input text to plot objects
Description
Produces text input for D2 to make a diagram of a given object, usuallyrendered with SVG.
Usage
d2(x, ...)Arguments
x | an object to be plotted. |
... | further arguments passed to or from other methods. |
Details
The D2 language is in an early stage of development (pre-v1.0), so it may besubject to changes that make it unable to use output from the current versionofd2.
Details of what is plotted are given in individual methods. There areexpected commonalities, which are described below.
The object is expected to be one of the following:
an object whose elements have the same length. Examples would bedata frames, matrices, and other objects that can represent relations, withnames for the elements, and an optional name for the object itself.
a graph of sub-objects, each of which represent a relation asdescribed above, possibly with connections between the objects, and anoptional name for the graph as a whole.
Each relation is presented as a record-like shape, with the following elements:
A optional header with the relation's name, and the number of (unique)records.
A set of rows, one for each attribute in the relation. These rowshave the following contents:
the attribute names.
a depiction of the relation's (candidate) keys. Eachcolumn represents a key, and a filled cell indicates that the attributein that row is in that key. The keys are given in lexical order, withprecedence given to keys with fewer attributes, and keys with attributesthat appear earlier in the original data frame's attribute order. Defaultoutput from other package functions will thus have the primary key givenfirst. In the future, this will be changed to always give the primary keyfirst.
optionally, the attribute types: specifically, the first elementwhen passing the attribute's values into
class.
Any foreign key references between relations are represented by one-way arrows,one per attribute in the foreign key.
If the object has a name, then currently the name is not used, except as asingle data frame's name. In the future, this will be used to give a name tothe generated board, to make use of D2's composition features.
Value
A scalar character, containing text input for D2.
See Also
d2.data.frame,d2.relation_schema,d2.database_schema,d2.relation, andd2.database for individual methods.
D2 language site:https://d2lang.com
Playground for running online without installation:https://play.d2lang.com/
Quarto extension:https://github.com/data-intuitive/quarto-d2
Examples
# simple data.frame examplecat(d2(ChickWeight, "chick"))Generate D2 input text to plot a data frame
Description
Produces text input for D2 to make a diagram of a given data frame, usuallyrendered with SVG.
Usage
## S3 method for class 'data.frame'd2(x, name = NA_character_, ...)Arguments
x | a data.frame. |
name | a character scalar, giving the name of the record, if any. Thename must be non-empty, since it is also used to name the single table inthe plot. Defaults to |
... | further arguments passed to or from other methods. |
Details
The rows in the plotted data frame include information about the attributeclasses.
Value
A scalar character, containing text input for Graphviz.
See Also
The genericd2.
Generate D2 input text to plot databases
Description
Produces text input for D2 to make a diagram of a given database, usuallyrendered with SVG.
Usage
## S3 method for class 'database'd2(x, name = NA_character_, reference_level = c("attr", "relation"), ...)Arguments
x | a |
name | a character scalar, giving the name of the schema, if any. |
reference_level | a character scalar, indicating the format to use forforeign key references. "relation" only specifies the relations involved;"attr" also specifies the attributes involved, one pair at a time. |
... | further arguments passed to or from other methods. |
Details
Each relation in the database is presented as a set of rows, one for eachattribute in the relation. These rows include information about the attributeclasses.
Any foreign key references are represented by arrows between either theattribute pairs or the relation pairs, depending on the value ofreference_level. This allows the output to be geared towards aspecific layout engine. Of the engines currently available for D2, Dagre cannot plot references between relation attributes, just the attributesthemselves, so usingreference_level = "relation" prevents compoundforeign keys resulting in duplicate reference arrows. ELK and Talacan plot between relation attributes, so the defaultreference_level ="attr" works as intended.
Value
A scalar character, containing text input for D2.
See Also
The genericd2.
Generate D2 input text to plot database schemas
Description
Produces text input for D2 to make a diagram of a given database schema,usually rendered with SVG.
Usage
## S3 method for class 'database_schema'd2(x, name = NA_character_, reference_level = c("attr", "relation"), ...)Arguments
x | a database schema, as given by |
name | a character scalar, giving the name of the schema, if any. |
reference_level | a character scalar, indicating the format to use forforeign key references. "relation" only specifies the relations involved;"attr" also specifies the attributes involved, one pair at a time. |
... | further arguments passed to or from other methods. |
Details
Each relation in the schema is presented as a set of rows, one for eachattribute in the relation. These rows do not include information about theattribute classes.
Any foreign key references are represented by arrows between either theattribute pairs or the relation pairs, depending on the value ofreference_level. This allows the output to be geared towards aspecific layout engine. Of the engines currently available for D2, Dagre cannot plot references between relation attributes, just the attributesthemselves, so usingreference_level = "relation" prevents compoundforeign keys resulting in duplicate reference arrows. ELK and Talacan plot between relation attributes, so the defaultreference_level ="attr" works as intended.
Value
A scalar character, containing text input for D2.
See Also
The genericd2.
Generate D2 input text to plot relations
Description
Produces text input for D2 to make a diagram of a given relation, usuallyrendered with SVG.
Usage
## S3 method for class 'relation'd2(x, name = NA_character_, ...)Arguments
x | a |
name | a character scalar, giving the name of the schema, if any. |
... | further arguments passed to or from other methods. |
Details
Each relation is presented as a set of rows, one for eachattribute in the relation. These rows include information about theattribute classes.
Value
A scalar character, containing text input for D2.
See Also
The genericd2.
Generate D2 input text to plot relation schemas
Description
Produces text input for D2 to make a diagram of a given relation schema,usually rendered with SVG.
Usage
## S3 method for class 'relation_schema'd2(x, name = NA_character_, ...)Arguments
x | a relation schema, as given by |
name | a character scalar, giving the name of the schema, if any. |
... | further arguments passed to or from other methods. |
Details
Each relation in the schema is presented as a set of rows, one for eachattribute in the relation. These rows do not include information about theattribute classes.
Value
A scalar character, containing text input for D2.
See Also
The genericd2.
Databases
Description
Enhances arelation object with foreign key referenceinformation.
Usage
database(relations, references, check = TRUE)Arguments
relations | a |
references | a list of references, eachrepresented by a list containing four character elements. In order, theelements are a scalar giving the name of the child (referrer) schema, avector giving the child attribute names, a scalar giving the name of theparent (referee) schema, and a vector giving the parent attribute names. Thevectors must be of the same length and contain names for attributes presentin their respective schemas, and the parent attributes must form a key. |
check | a logical, indicating whether to check that |
Details
Unlikerelation_schema andrelation, and likedatabase_schema,database is not designed to bevector-like: it only holds a single database. This adheres to the usualpackage use case, where a single data frame is being analysed at a time.However, it inherits fromrelation, so is vectorised withrespect to its relations.
As withrelation, duplicate relations, after ordering byattribute, are allowed, and can be removed withunique.
References, i.e. foreign key references, are allowed to have differentattribute names in the child and parent relations; this can't occur in theoutput forautoref andnormalise.
Subsetting removes any references that involve removed relations.Removing duplicates withunique changes references involvingduplicates to involve the kept equivalent relations instead. Renamingrelations withnames<- also changes their namesin the references.
Value
Adatabase object, containingrelations withreferences stored in an attribute of the same name. Referencesare stored with their attributes in the order they appear in theirrespective relations.
Examples
rels <- relation( list( a = list( df = data.frame(a = logical(), b = logical()), keys = list("a") ), b = list( df = data.frame(b = logical(), c = logical()), keys = list("b", "c") ) ), attrs_order = c("a", "b", "c", "d"))db <- database( rels, list(list("a", "b", "b", "b")))print(db)attrs(db)stopifnot(identical( attrs(db), lapply(records(db), names)))keys(db)attrs_order(db)names(db)references(db)# relations can't reference themselves## Not run: database( relation( list(a = list(df = data.frame(a = 1:5), keys = list("a"))), c("a", "b") ), list(list("a", "a", "a", "a")) ) database( relation( list(a = list(df = data.frame(a = 1:5, b = 6:10), keys = list("a"))), c("a", "b") ), list(list("a", "b", "a", "a")) )## End(Not run)# an example with references between differently-named attributesprint(database( relation( list( citation = list( df = data.frame(citer = 1:5, citee = 6:10), keys = list(c("citer", "citee")) ), article = list(df = data.frame(article = 1:10), keys = list("article")) ), c("citer", "citee", "article") ), list( list("citation", "citer", "article", "article"), list("citation", "citee", "article", "article") )))# inserting datainsert(db, data.frame(a = 1L, b = 2L, c = 3L, d = 4L))# data is only inserted into relations where all columns are given...insert(db, data.frame(a = 1L, b = 2L, c = 3L))# and that are listed in relations argumentinsert( db, data.frame(a = 1L, b = 2L, c = 3L, d = 4L), relations = "b")# inserted data can't violate keys## Not run: insert( db, data.frame(a = 1L, b = 1:2) )## End(Not run)# inserted data can't violate foreign key references## Not run: insert( db, data.frame(a = 1L, b = 2L, c = 3L, d = 4L), relations = "a" )## End(Not run)# vector operationsdb2 <- database( relation( list( e = list(df = data.frame(a = 1:5, e = 6:10), keys = list("e")) ), attrs_order = c("a", "e") ), list())c(db, db2) # attrs_order attributes are mergedunique(c(db, db))# subsettingdb[1]stopifnot(identical(db[[1]], db[1]))db[c(1, 2, 1, 2)] # replicates the foreign key referencesc(db[c(1, 2)], db[c(1, 2)]) # doesn't reference between separate copies of dbunique(db[c(1, 2, 1, 2)]) # unique() also merges references# another example of unique() merging referencesdb_merge <- database( relation( list( a = list( df = data.frame(a = logical(), b = logical()), keys = list("a") ), b = list( df = data.frame(b = logical(), c = logical(), d = logical()), keys = list("b") ), c_d = list( df = data.frame(c = logical(), d = logical(), e = logical()), keys = list(c("c", "d")) ), a.1 = list( df = data.frame(a = logical(), b = logical()), keys = list("a") ), b.1 = list( df = data.frame(b = logical(), c = logical(), d = logical()), keys = list("b") ) ), c("a", "b", "c", "d", "e") ), list( list("a", "b", "b", "b"), list("b.1", c("c", "d"), "c_d", c("c", "d")) ))print(db_merge)unique(db_merge)# reassignment# can't change keys included in references## Not run: keys(db)[[2]] <- list("c")# can't remove attributes included in keys## Not run: attrs(db)[[2]] <- list("c", "d")# can't remove attributes included in references## Not run: attrs(db)[[1]] <- c("a", "d")db3 <- db# can change subset of schema, but loses references between altered and# non-altered subsetsdb3[2] <- database( relation( list(d = list(df = data.frame(d = logical(), c = logical()), keys = list("d"))), attrs_order(db3) ), list())print(db3) # note the schema's name doesn't change# names(db3)[2] <- "d" # this would change the namekeys(db3)[[2]] <- list(character()) # removing keys first...# for a database_schema, we could then change the attrs for# the second database. For a created relation, this is not# allowed.## Not run: attrs(db3)[[2]] <- c("b", "c") names(records(db3)[[2]]) <- c("b", "c")## End(Not run)# changing appearance priority for attributesattrs_order(db3) <- c("d", "c", "b", "a")print(db3)# changing relation schema names changes them in referencesnames(db3) <- paste0(names(db3), "_long")print(db3)# reconstructing from componentsdb_recon <- database( relation( Map(list, df = records(db), keys = keys(db)), attrs_order(db) ), references(db))stopifnot(identical(db_recon, db))db_recon2 <- database( subrelations(db), references(db))stopifnot(identical(db_recon2, db))# can be a data frame columndata.frame(id = 1:2, relation = db)# setting check = FALSE can give invalid databaseschickfds <- discover(ChickWeight)chickschema <- synthesise(chickfds)chickrels <- insert(create(chickschema), ChickWeight)badrefs <- list(list("Time_Chick", "weight", "Chick", "Chick"))# check = TRUE stops on non-satisfied references## Not run: database(chickrels, badrefs)# check = FALSE returns an invalid databasebaddb <- database(chickrels, badrefs, check = FALSE)# only returns an error when further manipulated by certain methods## Not run: insert(baddb, ChickWeight)Database schemas
Description
Enhances arelation_schema object with foreign key referenceinformation.
Usage
database_schema(relation_schemas, references)Arguments
relation_schemas | a |
references | a list of references, eachrepresented by a list containing four character elements. In order, theelements are a scalar giving the name of the child (referrer) schema, avector giving the child attribute names, a scalar giving the name of theparent (referee) schema, and a vector giving the parent attribute names. Thevectors must be of the same length and contain names for attributes presentin their respective schemas, and the parent attributes must form a key. |
Details
Unlikefunctional_dependency andrelation_schema,database_schema is not designed to be vector-like: it only holds asingle database schema. This adheres to the usual package use case, where asingle data frame is being analysed at a time. However, it inherits fromrelation_schema, so is vectorised with respect to its relationschemas.
As withrelation_schema, duplicate relation schemas, afterordering by attribute, are allowed, and can be removed withunique.
References, i.e. foreign key references, are allowed to have differentattribute names in the child and parent relations; this can't occur in theoutput forautoref andnormalise.
Subsetting removes any references that involve removed relation schemas.Removing duplicates withunique changes references involvingduplicates to involve the kept equivalent schemas instead. Renaming relationschemas withnames<- also changes their names inthe references.
Value
Adatabase_schema object, containingrelation_schemaswithreferences stored in an attribute of the same name.References are stored with their attributes in the order they appear intheir respective relation schemas.
See Also
attrs,keys,attrs_order,andreferences for extracting parts of the information in adatabase_schema;create for creating adatabase object that uses the given schema;gvfor converting the schema into Graphviz code;rename_attrsfor renaming the attributes inattrs_order;reduce forfiltering a schema's relations to those connected to a given relation byforeign key references;subschemas to return therelation_schema that the given schema contains;merge_empty_keys for combining relations with an empty key;merge_schemas for combining relations with matching sets ofkeys.
Examples
rs <- relation_schema( list( a = list(c("a", "b"), list("a")), b = list(c("b", "c"), list("b", "c")) ), attrs_order = c("a", "b", "c", "d"))ds <- database_schema( rs, list(list("a", "b", "b", "b")))print(ds)attrs(ds)keys(ds)attrs_order(ds)names(ds)references(ds)# relations can't reference themselves## Not run: database_schema( relation_schema( list(a = list("a", list("a"))), c("a", "b") ), list(list("a", "a", "a", "a")) ) database_schema( relation_schema( list(a = list(c("a", "b"), list("a"))), c("a", "b") ), list(list("a", "b", "a", "a")) )## End(Not run)# an example with references between differently-named attributesprint(database_schema( relation_schema( list( citation = list(c("citer", "citee"), list(c("citer", "citee"))), article = list("article", list("article")) ), c("citer", "citee", "article") ), list( list("citation", "citer", "article", "article"), list("citation", "citee", "article", "article") )))# vector operationsds2 <- database_schema( relation_schema( list( e = list(c("a", "e"), list("e")) ), attrs_order = c("a", "e") ), list())c(ds, ds2) # attrs_order attributes are mergedunique(c(ds, ds))# subsettingds[1]stopifnot(identical(ds[[1]], ds[1]))ds[c(1, 2, 1, 2)] # replicates the foreign key referencesc(ds[c(1, 2)], ds[c(1, 2)]) # doesn't reference between separate copies of dsunique(ds[c(1, 2, 1, 2)]) # unique() also merges references# another example of unique() merging referencesds_merge <- database_schema( relation_schema( list( a = list(c("a", "b"), list("a")), b = list(c("b", "c", "d"), list("b")), c_d = list(c("c", "d", "e"), list(c("c", "d"))), a.1 = list(c("a", "b"), list("a")), b.1 = list(c("b", "c", "d"), list("b")) ), c("a", "b", "c", "d", "e") ), list( list("a", "b", "b", "b"), list("b.1", c("c", "d"), "c_d", c("c", "d")) ))print(ds_merge)unique(ds_merge)# reassignment# can't change keys included in references## Not run: keys(ds)[[2]] <- list("c")# can't remove attributes included in keys## Not run: attrs(ds)[[2]] <- list("c", "d")# can't remove attributes included in references## Not run: attrs(ds)[[1]] <- c("a", "d")ds3 <- ds# can change subset of schema, but loses references between altered and# non-altered subsetsds3[2] <- database_schema( relation_schema( list(d = list(c("d", "c"), list("d"))), attrs_order(ds3) ), list())print(ds3) # note the schema's name doesn't change# names(ds3)[2] <- "d" # this would change the namekeys(ds3)[[2]] <- list(character()) # removing keys first...attrs(ds3)[[2]] <- c("b", "c") # so we can change the attrs legallykeys(ds3)[[2]] <- list("b", "c") # add the new keys# add the reference lost during subset replacementreferences(ds3) <- c(references(ds3), list(list("a", "b", "b", "b")))stopifnot(identical(ds3, ds))# changing appearance priority for attributesattrs_order(ds3) <- c("d", "c", "b", "a")print(ds3)# changing relation schema names changes them in referencesnames(ds3) <- paste0(names(ds3), "_long")print(ds3)# reconstructing from componentsds_recon <- database_schema( relation_schema( Map(list, attrs(ds), keys(ds)), attrs_order(ds) ), references(ds))stopifnot(identical(ds_recon, ds))ds_recon2 <- database_schema( subschemas(ds), references(ds))stopifnot(identical(ds_recon2, ds))# can be a data frame columndata.frame(id = 1:2, schema = ds)Decompose a data frame based on given normalised dependencies
Description
Decomposes a data frame into several relations, based on the given databaseschema. It's intended that the data frame satisfies all the functionaldependencies implied by the schema, such as if the schema was constructedfrom the same data frame. If this is not the case, the function will returnsan error.
Usage
decompose( df, schema, keep_rownames = FALSE, digits = getOption("digits"), check = TRUE)Arguments
df | a data.frame, containing the data to be normalised. |
schema | a database schema with foreign key references, such as given by |
keep_rownames | a logical or a string, indicating whether to include therow names as a column. If a string is given, it is used as the name for thecolumn, otherwise the column is named "row". Set to FALSE by default. |
digits | a positive integer, indicating how many significant digits areto be used for numeric and complex variables. A value of |
check | a logical, indicating whether to check that |
Details
If the schema was constructed using approximate dependencies for the samedata frame,decompose returns an error, to prevent either duplicate recordsor lossy decompositions. This is temporary: for the next update, we plan toadd an option to allow this, or to add "approximate" equivalents of databasesand database schemas.
Value
Adatabase object, containing the data indfwithin the database schema given inschema.
Dependants
Description
Generic function, with the only given method fetching dependants forfunctional dependencies.
Usage
dependant(x, ...)dependant(x, ...) <- valueArguments
x | an R object. For the given method, a |
... | further arguments passed on to methods. |
value | A character vector of the same length as |
Value
A character vector containing dependants.
Determinant sets
Description
Generic function, with the only given method fetching determinant sets forfunctional dependencies.
Usage
detset(x, ...)detset(x, ...) <- valueArguments
x | an R object. For the given method, a |
... | further arguments passed on to methods. |
value | A character vector of the same length as |
Value
A list containing determinant sets, each consisting of a charactervector with unique elements.
Determine Duplicate Elements
Description
duplicated "determines which elements of a vector or data frameare duplicates of elements with smaller subscripts, and returns a logicalvector indicating which elements (rows) are duplicates". However, as of R4.1, calling this on a data frame with zero columns always returns an emptylogical vector. This has repercussions on other functions that useduplicated, such asunique andanyDuplicated.These functions add zero-column data frames as a special case.
Usage
df_duplicated(x, incomparables = FALSE, fromLast = FALSE, ...)df_unique(x, incomparables = FALSE, fromLast = FALSE, ...)df_anyDuplicated(x, incomparables = FALSE, fromLast = FALSE, ...)df_records(x, use_rownames = FALSE, use_colnames = FALSE)Arguments
x | a data frame. |
incomparables | a vector of values that cannot be compared. |
fromLast | logical indicating if duplication should be consideredfrom the reverse side, i.e., the last (or rightmost) of identicalelements would correspond to |
... | arguments for particular methods. |
use_rownames | a logical, FALSE by default, indicating whether rowvalues should keep the row names from |
use_colnames | a logical, FALSE by default, indicating whether rowvalues should keep the column names from |
Value
Fordf_duplicated, a logical vector with one element for eachrow.
Fordf_unique, a data frame is returned with the samecolumns, but possible fewer rows (and with row names from the firstoccurrences of the unique rows).
Fordf_anyDuplicated, an integer or real vector of lengthone with value the 1-based index of the first duplicate if any, otherwise0.
Fordf_records, a list of the row values inx. This isbased on a step induplicated.data.frame. However, for dataframes with zero columns, special handling returns a list of empty rowvalues, one for each row inx. Without special handling, this stepreturns an empty list. This was the cause forduplicatedreturning incorrect results for zero-column data frames in older versionsof R.
See Also
Examples
# row values for a 5x0 data framex <- data.frame(a = 1:5)[, FALSE, drop = FALSE]do.call(Map, unname(c(list, x))) # original step returns empty listdf_records(x) # corrected version preserves row countTest data frames for equivalence under row reordering
Description
A convenience function, mostly used to testing thatrejoinworks as intended. It checks that data frames have the same dimensions andcolumn names, with duplicates allowed, then checks they contain the samedata. For the latter step, column names are made unique first, so columnswith duplicate names must be presented in the same order in both data frames.
Usage
df_equiv(df1, df2, digits = getOption("digits"))Arguments
df1,df2 | Data frames. |
digits | a positive integer, indicating how many significant digits areto be used for numeric and complex variables. A value of NA results in norounding. By default, this uses |
Value
A logical.
Combine R Objects by Rows or Columns
Description
rbind takes "a sequence of vector, matrix or data-framearguments", and combines by rows for the latter. However, as of R 4.1,calling this on data frame with zero columns always returns zero rows, due tothe issue mentioned fordf_duplicated. This function addszero-column data frames as a special case.
Usage
df_rbind(...)Arguments
... | data frames. |
Value
A data frame containing the... arguments row-wise.
See Also
Dependency discovery with DFD
Description
Finds all the minimal functional dependencies represented in a data frame.
Usage
discover( df, method = c("FDHitsSep", "FDHitsJoint", "DFD"), keep_rownames = FALSE, digits = getOption("digits"), exclude = character(), exclude_class = character(), dependants = names(df), detset_limit = ncol(df) - 1L, progress = FALSE, progress_file = "", skip_bijections = FALSE, accuracy = 1, full_cache = TRUE, store_cache = TRUE)Arguments
df | a data.frame, the relation to evaluate. |
method | a string, indicating which search algorithm to use. Currently,this defaults to DFD. Alternative options are FDHitsSep and FDHitsJoint. |
keep_rownames | a logical or a string, indicating whether to include therow names as a column. If a string is given, it is used as the name for thecolumn, otherwise the column is named "row". Like with the other columnnames, the function returns an error if this results in duplicate columnnames. Set to FALSE by default. |
digits | a positive integer, indicating how many significant digits areto be used for numeric and complex variables. A value of |
exclude | a character vector, containing names of attributes to notconsider as members of determinant sets. If names are given that aren'tpresent in |
exclude_class | a character vector, indicating classes of attributes tonot consider as members of determinant_sets. Attributes are excluded ifthey inherit from any given class. |
dependants | a character vector, containing names of all attributes forwhich to find minimal functional dependencies for which they are thedependant. By default, this is all of the attribute names. A smaller set ofattribute names reduces the amount of searching required, so can reduce thecomputation time if only some potential dependencies are of interest. |
detset_limit | an integer, indicating the largest determinant set sizethat should be searched for. By default, this is large enough to allow allpossible determinant sets. See Details for comments about the effect on theresult, and on the computation time. |
progress | a logical, for whether to display progress to the user duringdependency search in |
progress_file | a scalar character or a connection. If |
skip_bijections | a logical, indicating whether to skip some dependencysearches that are made redundant by discovered bijections betweenattributes. This can significantly speed up the search if |
accuracy | a numeric in (0, 1]: the accuracy threshold required in orderto conclude a dependency. Accuracy thresholds less than one are onlysupported in DFD. |
full_cache | a logical, indicating whether to store information abouthow sets of attributes group the relation records (stripped partitions).Otherwise, only the number of groups is stored. Storing the strippedpartition is expected to let the algorithm run more quickly, but might beinefficient for small data frames or small amounts of memory. Only relevantfor DFD. |
store_cache | a logical, indicating whether to keep cached informationto use when finding dependencies for other dependants. This allows thealgorithm to run more quickly by not having to re-calculate information,but takes up more memory. Only relevant for DFD. |
Details
Column names fordf must be unique.
There are two search algorithms available for finding dependencies: DFD, andFDHits. These are described below.
DFD
The DFD algorithm searches for determinant sets for each dependant attributeseparately, by traversing the powerset of the other (non-excluded)attributes. It can roughly be considered as a depth-first search overcandidate determinant sets.
The implementation for DFD differs a little from the algorithm presented inthe original paper:
Some attributes, or attribute types, can be designated, ahead oftime, as not being candidate members for determinant sets. This reduces thenumber of candidate determinant sets to be searched, saving time by notsearching for determinant sets that the user would remove later anyway. Thedependants can also be trimmed in similar fashion, although this reducesthe search space linearly rather than exponentially.
Attributes that have a single unique value, i.e. areconstant, get attributed a single empty determinant set. In the standardDFD algorithm, they would be assigned all the other non-excluded attributesas length-one determinant sets. Assigning them the empty set distinguishesthem as constant, allowing for special treatment at normalisation and latersteps.
As was done in the original Python library, there is an extra case inseed generation for when there are no discovered maximal non-dependencies.In this case, we take all of the single-attribute nodes, then filter out byminimal dependencies as usual. This is equivalent to taking the empty setas the single maximal non-dependency.
There are three results when checking whether a candidate node isminimal/maximal. TRUE indicates the node is minimal/maximal, as usual.FALSE has been split into FALSE and NA. NA indicates that we can not yetdetermine whether the node is minimal/maximal. FALSE indicates that we havedetermined that it is not minimal/maximal, and we can set its category assuch. This is done by checking whether any of its adjacentsubsets/supersets are dependencies/non-dependencies, instead of waiting toexhaust the adjacent subsets/supersets to visit when picking the next nodeto visit.
We do not yet keep hashmaps to manage subset/superset relationships,as described in Section 3.5 of the original paper.
skip_bijectionsallows for additional optimisation for findingfunctional dependencies when there are pairwise-equivalent attributes.Missing values (NA) are treated as a normal value, with NA = NA beingtrue, and x = NA being false for any non-NA value of x.
FDHits
FDHits begins by sampling pairs of records for attributes in which theirvalues differ (difference sets). These difference sets render somedeterminant-dependant pairings invalid, pruning the search space. If acandidate pairing is not rendered invalid by the known difference sets, thenit's validated, resulting in either confirmation as a minimal functionaldependency, or discovery of new difference sets that make it invalid.
There are two variants. FDHitsSep performs a search for each dependantattribute separately, in the same way as DFD. FDHitsJoint handles alldependant attributes at the same time.
The implementation for FDHits differs a little from the algorithm presentedin the original paper:
Some attributes, or attribute types, can be designated, ahead oftime, as not being candidate members for determinant sets. This reduces thenumber of candidate determinant sets to be searched, saving time by notsearching for determinant sets that the user would remove later anyway. Thedependants can also be trimmed in similar fashion.
The search can be limited to determinant sets up to a given size.This is also an option for DFD, but it's more useful for FDHits, due to thesearch order.
As described in the paper, FDHitsSep and FDHitsJoint branch a searchinto several child nodes:
\mu_0for FDHitsJoint, and\mu_1,\mu_2etc. for both. The paper implies that the latter nodes arevisited in order. However, this causes the algorithm to not always workcorrectly: for guaranteed correctness, they must be visited in reverseorder, with\mu_0able to be visited at any point. This correction isexpected to appear in a future paper.The final algorithm in the paper automatically chooses betweenFDHitsSep and FDHits Joint, depending on the number of initially-sampleddifference sets. This is not yet implemented.
Floating-point variables
Numerical/complex values, i.e. floating-point values, represent difficultiesfor stating functional dependencies. A fundamental condition for statingfunctional dependencies is that we can compare two values for the samevariable, and they are equivalent or not equivalent.
Usually, this is done by checking they're equal – this is the approach usedindiscover – but we can use any comparison that is an equivalencerelation.
However, checking floating-point values for equality is not simple.==is not appropriate, even when comparing non-calculated values we've read froma file, because how a given number is converted into a float can vary bycomputer architecture, meaning that two values can be considered equal on onecomputer, and not equal on another. This can happen even if they're bothusing 64-bit R, and even though all R platforms work with values conformingto the same standard (seedouble). For example,8.54917750000000076227 and8.54917749999999898591 are convertedinto different floating-point representations on x86, but the samerepresentation on ARM, resulting in inequality and equality respectively.
For this and other reasons, checking numerical/complex values for(near-)equality in R is usually done withall.equal. Thisdetermines valuesx andy to be equal if their absolute/relativeabsolute difference is within some tolerance value. However, we can not usethis. Equivalence relations must be transitive: if we have valuesx,y, andz, andx is equivalent to bothy andz,theny andz must also be equivalent. This tolerance-basedequivalence is not transitive: it is reasonably straightforward to set upthree values so that the outer values are far enough apart to be considerednon-equivalent, but the middle value is close enough to be consideredequivalent to both of them. Using this to determine functional dependencies,therefore, could easily result in a large number of inconsistencies.
This means we have no good option for comparing numerical/complex valuesas-is for equivalence, with consistent results across different machines, sowe must treat them differently. We have three options:
Round/truncate the values, before comparison, to some low degree of precision;
Coerce the values to another class before passing them into
discover;Read values as characters if reading data from a file.
discover takes the first option, with a default number of significantdigits low enough to ensure consistency across different machines. However,the user can also use any of these options when processing the data beforepassing it todiscover. The third option, in particular, isrecommended if reading data from a file.
Skipping bijections
Skipping bijections allows skipping redundant searches. For example, if thesearch discovers thatA -> B andB -> A, then only one of thoseattributes is considered for the remainder of the search. Since the searchtime increases exponentially with the number of attributes considered, thiscan significantly speed up search times. At the moment, this is only be donefor bijections between single attributes, such asA <-> B; ifA<-> {B, C}, nothing is skipped. Whether bijections are skipped doesn'taffect which functional dependencies are present in the output, but it mightaffect their order.
Skipping bijections for approximate dependencies, i.e. whenaccuracy < 1,should be avoided: it can result in incorrect output, since an approximatebijection doesn't imply equivalent approximate dependencies.
Limiting the determinant set size
Settingdetset_limit smaller than the largest-possible value hasdifferent behaviour for different search algorithms, but the result is alwaysthatdiscover(x, 1, detset_limit = n) is equivalent to doing a fullsearch,fds <- discover(x, 1), thenfiltering by determinant set size post-hoc,fds[lengths(detset(fds)) <=n].
For DFD, the naive way to implement it is by removing determinant sets largerthan the limit from the search tree for possible functional dependencies foreach dependant. However, this usually results in the search taking much moretime than without a limit.
For example, suppose we search for determinant sets for a dependant that hasnone (the dependant is the only key fordf, for example). Using DFD,we begin with a single attribute, then add other attributes one-by-one, sinceevery set gives a non-dependency. When we reach a maximum-size set, we canmark all subsets as also being non-dependencies.
With the default limit, there is only one maximum-size set, containing all ofthe available attributes. If there aren candidate attributes fordeterminants, the search finishes after visitingn sets.
With a smaller limitk, there are\binom{n}{k} maximum-size setsto explore. Since a DFD search adds or removes one attribute at each step,this means the search must take at leastk - 2 + 2\binom{n}{k} steps,which is larger thann for all non-trivial cases0 < k \leq n.
We therefore use a different approach, where any determinant sets above thesize limit are not allowed to be candidate seeds for new search paths, andany discovered dependencies with a size above the limit are discard at theend of the entire DFD search. This means that nodes for determinant setsabove the size limit are only visited in order to determine maximality ofnon-dependencies within the size limit. It turns out to be rare that thisresults in a significant speed-up, but it never results in the search havingto visit more nodes than it would without a size limit, so the average searchtime is never made worse.
FDHits implementsdetset_limit more naturally, since it exploresdeterminant sets in increasing set size. Limiting the size is simply a matterof only visiting new nodes if their determinant set is within the given sizelimit.
Value
Afunctional_dependency object, containing thediscovered dependencies. The column names ofdf are stored in theattrs attribute, in order, to serve as a default priority order forthe attributes during normalisation.
References
DFD: Abedjan Z., Schulze P., Naumann F. (2014) DFD: efficient functionaldependency discovery.Proceedings of the 23rd ACM International Conferenceon Conference on Information and Knowledge Management (CIKM '14). New York,U.S.A., 949–958.
FDHits: Bleifuss T., Papenbrock T., Bläsius T., Schirneck M, Naumann F.(2024) Discovering Functional Dependencies through Hitting Set Enumeration.Proc. ACM Manag. Data,2, 1, 43:1–24.
Examples
# simple examplediscover(ChickWeight)# example with spurious dependenciesdiscover(CO2)# exclude attributes that can't be determinants.# in this case, the numeric attributes are now# not determined by anything, because of repeat measurements# with no variable to mark them as such.discover(CO2, exclude_class = "numeric")# include only dependencies with dependants of interest.discover(CO2, dependants = c("Treatment", "uptake"))# approximate dependencies, that hold if we remove up to 20% of the rows# (these need not agree on which rows are removed)discover(CO2, method = "DFD", accuracy = 0.8)Functional dependency vectors
Description
Creates a set of functional dependencies with length-one dependants.
Usage
functional_dependency(FDs, attrs_order, unique = TRUE)Arguments
FDs | a list of functional dependencies, in the form of two-elementslists: the first element contains a character vector of all attributes inthe determinant set, and the second element contains the single dependentattribute (dependant). |
attrs_order | a character vector, giving the names of all attributes.These need not be present in |
unique | a logical, TRUE by default, for whether to remove duplicatedependencies. |
Details
When several sets of functional dependencies are concatenated, theirattrs_order attributes are merged, so as to preserve all of theoriginal attribute orders, if possible. If this is not possible, because theorderings disagree, then the returned value of theattrs_orderattribute is their union instead.
Value
Afunctional_dependency object, containing the list given inFDs, withattrs_order an attribute of the same name.Functional dependencies are returned with their determinant sets sortedaccording to the attribute order inattrs. Any duplicates foundafter sorting are removed.
See Also
detset,dependant, andattrs_order for extracting parts of the information in afunctional_dependency;rename_attrsfor renaming the attributes inattrs_order.
Examples
fds <- functional_dependency( list(list(c("a", "b"), "c"), list(character(), "d")), attrs_order = c("a", "b", "c", "d"))print(fds)detset(fds)dependant(fds)attrs_order(fds)# vector operationsfds2 <- functional_dependency(list(list("e", "a")), c("a", "e"))c(fds, fds2) # attrs_order attributes are mergedunique(c(fds, fds))# subsettingfds[1]fds[c(1, 2, 1)]stopifnot(identical(fds[[2]], fds[2]))# reassignmentfds3 <- fdsfds3[2] <- functional_dependency(list(list("a", "c")), attrs_order(fds3))print(fds3)detset(fds3)[[2]] <- character()dependant(fds3)[[2]] <- "d"stopifnot(identical(fds3, fds))# changing appearance priority for attributesattrs_order(fds3) <- rev(attrs_order(fds3))fds3# reconstructing from componentsfds_recon <- functional_dependency( Map(list, detset(fds), dependant(fds)), attrs_order(fds))stopifnot(identical(fds_recon, fds))# can be a data frame columndata.frame(id = 1:2, fd = fds)# (in)equality ignores headerstopifnot(all(fds3 == fds))stopifnot(!any(fds != fds))Generate Graphviz input text to plot objects
Description
Produces text input for Graphviz to make an HTML diagram of a given object.
Usage
gv(x, name = NA_character_, ...)Arguments
x | an object to be plotted. |
name | a scalar character, giving the name of the object, if any. Thisname is used for the resulting graph, to allow for easier combining ofgraphs into a single diagram if required. |
... | further arguments passed to or from other methods. |
Details
Details of what is plotted are given in individual methods. There areexpected commonalities, which are described below.
The object is expected to be one of the following:
an object whose elements have the same length. Examples would bedata frames, matrices, and other objects that can represent relations, withnames for the elements, and an optional name for the object itself.
a graph of sub-objects, each of which represent a relation asdescribed above, possibly with connections between the objects, and anoptional name for the graph as a whole.
Each relation is presented as a record-like shape, with the following elements:
A optional header with the relation's name, and the number of (unique)records.
A set of rows, one for each attribute in the relation. These rowshave the following contents:
the attribute names.
a depiction of the relation's (candidate) keys. Eachcolumn represents a key, and a filled cell indicates that the attributein that row is in that key. The keys are given in lexical order, withprecedence given to keys with fewer attributes, and keys with attributesthat appear earlier in the original data frame's attribute order. Defaultoutput from other package functions will thus have the primary key givenfirst. In the future, this will be changed to always give the primary keyfirst.
optionally, the attribute types: specifically, the first elementwhen passing the attribute's values into
class.
Any foreign key references between relations are represented by one-way arrows,one per attribute in the foreign key.
If the object has a name, this name is attached to the resulting graph inGraphviz. This is to allow easier combination of several such graphs into asingle image, if a user wishes to do so.
Value
A scalar character, containing text input for Graphviz.
See Also
gv.data.frame,gv.relation_schema,gv.database_schema,gv.relation, andgv.database for individual methods.
Examples
# simple data.frame exampletxt_df <- gv(ChickWeight, "chick")cat(txt_df)if (requireNamespace("DiagrammeR", quietly = TRUE)) { DiagrammeR::grViz(txt_df)}# simple database exampledb <- autodb(ChickWeight)txt_db <- gv(db)cat(txt_db)if (requireNamespace("DiagrammeR", quietly = TRUE)) { DiagrammeR::grViz(txt_db)}# simple relation schemasrschema <- synthesise(discover(ChickWeight))txt_rschema <- gv(rschema)cat(txt_rschema)if (requireNamespace("DiagrammeR", quietly = TRUE)) { DiagrammeR::grViz(txt_rschema)}# simple database schemadschema <- normalise(discover(ChickWeight))txt_dschema <- gv(dschema)cat(txt_dschema)DiagrammeR::grViz(txt_dschema)# simple relationsrel <- create(synthesise(discover(ChickWeight)))txt_rel <- gv(rel)cat(txt_rel)if (requireNamespace("DiagrammeR", quietly = TRUE)) { DiagrammeR::grViz(txt_rel)}Generate Graphviz input text to plot a data frame
Description
Produces text input for Graphviz to make an HTML diagram of a given dataframe.
Usage
## S3 method for class 'data.frame'gv(x, name = NA_character_, ...)Arguments
x | a data.frame. |
name | a character scalar, giving the name of the record, if any. Thename must be non-empty, since it is also used to name the single table inthe plot. Defaults to |
... | further arguments passed to or from other methods. |
Details
The rows in the plotted data frame include information about the attributeclasses.
Value
A scalar character, containing text input for Graphviz.
See Also
The genericgv.
Generate Graphviz input text to plot databases
Description
Produces text input for Graphviz to make an HTML diagram of a given database.
Usage
## S3 method for class 'database'gv(x, name = NA_character_, ...)Arguments
x | |
name | a scalar character, giving the name of the database, if any. Thisname is used for the resulting graph, to allow for easier combining ofgraphs into a single diagram if required. |
... | further arguments passed to or from other methods. |
Details
Each relation in the database is presented as a set of rows, one for eachattribute in the relation. These rows include information about the attributeclasses.
Value
A scalar character, containing text input for Graphviz.
See Also
The genericgv.
Generate Graphviz input text to plot database schemas
Description
Produces text input for Graphviz to make an HTML diagram of a given databaseschema.
Usage
## S3 method for class 'database_schema'gv(x, name = NA_character_, ...)Arguments
x | a database schema, as given by |
name | a character scalar, giving the name of the schema, if any. |
... | further arguments passed to or from other methods. |
Details
Each relation in the schema is presented as a set of rows, one for eachattribute in the relation. These rows do not include information about theattribute classes.
Any foreign key references are represented by arrowsbetween the attribute pairs.
Value
A scalar character, containing text input for Graphviz.
See Also
The genericgv.
Generate Graphviz input text to plot relations
Description
Produces text input for Graphviz to make an HTML diagram of a given relation.
Usage
## S3 method for class 'relation'gv(x, name = NA_character_, ...)Arguments
x | a |
name | a character scalar, giving the name of the schema, if any. |
... | further arguments passed to or from other methods. |
Details
Each relation is presented as a set of rows, one for eachattribute in the relation. These rows include information about theattribute classes.
Value
A scalar character, containing text input for Graphviz.
See Also
The genericgv.
Generate Graphviz input text to plot relation schemas
Description
Produces text input for Graphviz to make an HTML diagram of a given relationschema.
Usage
## S3 method for class 'relation_schema'gv(x, name = NA_character_, ...)Arguments
x | a relation schema, as given by |
name | a character scalar, giving the name of the schema, if any. |
... | further arguments passed to or from other methods. |
Details
Each relation in the schema is presented as a set of rows, one for eachattribute in the relation. These rows do not include information about theattribute classes.
Value
A scalar character, containing text input for Graphviz.
See Also
The genericgv.
Insert data
Description
Generic function for inserting a data frame of data into an object.
Usage
insert( x, vals, relations = names(x), all = FALSE, keep_rownames = FALSE, digits = getOption("digits"), ...)Arguments
x | a relational data object, into which to insert data, such as a |
vals | a data frame, containing data to insert. Column names must beunique. |
relations | a character vector, containing names of elements of |
all | a logical, indicating whether |
keep_rownames | a logical or a string, indicating whether to include therow names as a column. If a string is given, it is used as the name for thecolumn, otherwise the column is named "row". Set to FALSE by default. |
digits | a positive integer, indicating how many significant digits areto be used for numeric and complex variables. A value of |
... | further arguments pass on to methods. |
Details
This function is intended for inserting into an object that is itselfcomprised of data frames, such as arelation or adatabase. The given methods have the following behaviour:
If an empty set of data is inserted, into a non-empty object element,nothing happens.
If an empty set of data is inserted into an empty object element, theresulting element is also empty, but takes on the attribute/column classesof the inserted data. This is done to prevent having to know attributeclasses during object creation.
Insertion can fail if inserting would violate object constraints. Forexample, databases cannot have data inserted that would violatecandidate/foreign key constraints.
For other cases, the data is inserted in an object element in thesame way as using
rbind, followed byunique.
While key violations prevent insertion, re-insertion of existing records inan object element does not. This makes insertion equivalent to anINSERT ORIGNORE expression in SQL. In particular, it is somewhat like using thisexpression in SQLite, since that implementation uses dynamic typing.
Ifvals contains attributes not included inattrs_order(x),insert throws an error, since thoseattributes can't be inserted.
If a partial set of attributes is inserted, andall isFALSE,then data is only inserted into components ofx[relations] whoserequired attributes are all present invals. Ifall isTRUE,insert returns an error instead. This is useful whenspecifyingrelations: in that case, you often intend to insertinto all of the specified elements, so not including all the requiredattributes is a mistake, andall = TRUE prevents it.
Ifall isTRUE,insertthrows an error in this case: This ensures you insert into all members of aspecified value ofrelations.
Value
An R object of the same class asx, containing the additionalnew data.
Relational data keys
Description
Generic function, with the only given method fetching candidate key lists forrelation schemas.
Usage
keys(x, ...)keys(x, ...) <- valueArguments
x | a relational schema object, such as a |
... | further arguments passed on to methods. |
value | A list of lists of character vectors, of the same length as |
Value
A list containing lists of unique character vectors, representingcandidate keys for each element ofx.
Merge relation schemas with empty keys
Description
Merges an object's schemas with empty keys. The remaining such schemacontains all attributes contained in such schemas.
Usage
merge_empty_keys(x)Arguments
x | a relational schema object, such as a |
Details
This function is not itself generic, but makes use of the generic functionskeys andmerge_schemas. Any input class withvalid methods for these generic functions can be passed into this function.
Fordatabase_schema objects, references involving theschemas with empty keys are updated to refer to the merged schema.
Value
An R object of the same class asx, where relations with anempty key have been merged into a single relation.
See Also
merge_schemas, on which this function is based.
Merge relation schemas in given pairs
Description
Generic function that merges pairs of an object's schemas with matching setsof keys. The remaining schemas contain all the attributes from the schemasmerged into them.
Usage
merge_schemas(x, to_remove, merge_into, ...)Arguments
x | a relational schema object, such as a |
to_remove | an integer vector, giving the indices for schemas to bemerged into other schemas, then removed. |
merge_into | an integer vector of the same length as |
... | further arguments passed on to methods. |
Value
An R object of the same class asx, where the relations havebeen merged as indicated.
See Also
merge_empty_keys, which is based on this function.
Examples
rs <- relation_schema( list( a = list(c("a", "b"), list("a")), b = list(c("b", "c"), list("b")), b.1 = list(c("b", "d"), list("b")), d = list(c("d", "e"), list("d", "e")) ), letters[1:5])ds <- database_schema( rs, list( list("a", "b", "b", "b"), list("b.1", "d", "d", "d") ))merge_schemas(rs, 3, 2) # merging b and b.1merge_schemas(ds, 3, 2) # also merging their references# merging a schema into itself just removes itmerge_schemas(rs, 3, 3)merge_schemas(ds, 3, 3)Create normalised database schemas from functional dependencies
Description
Creates a database schema from given functional dependencies, satisfying atleast third normal form, using Bernstein's synthesis.
Usage
normalise( dependencies, single_ref = FALSE, ensure_lossless = TRUE, reduce_attributes = TRUE, remove_avoidable = FALSE, constants_name = "constants", progress = FALSE, progress_file = "")Arguments
dependencies | a |
single_ref | a logical, FALSE by default. If TRUE, then only onereference between each relation pair is kept when generating foreign keyreferences. If a pair has multiple references, the kept reference refers tothe earliest key for the child relation, as sorted by priority order. |
ensure_lossless | a logical, TRUE by default. If TRUE, and thedecomposition isn't lossless, an extra relation is added to make thedecomposition lossless. |
reduce_attributes | a logical, TRUE by default. If TRUE, |
remove_avoidable | a logical, indicating whether to remove avoidableattributes in relations. If so, then an attribute are removed fromrelations if the keys can be changed such that it is not needed to preservethe given functional dependencies. |
constants_name | a scalar character, giving the name for any relationcreated to store constant attributes. If this is the same as a generatedrelation name, it will be changed, with a warning, to ensure that allrelations have a unique name. |
progress | a logical, for whether to display progress to the user duringdependency search in |
progress_file | a scalar character or a connection. If |
Details
This is a wrapper function for applyingsynthesise andautoref, in order. For creating relation schemas and foreignkey references separately, use these functions directly. See both functionsfor examples.
For details on the synthesis algorithm used, seesynthesise.
Value
Adatabase_schema object, containing the synthesisrelation schemas and the created foreign key references.
Nudge meta-analysis data
Description
Data used for a meta-analysis on the effectiveness of nudges, i.e. choicearchitecture interventions.
Usage
nudgeFormat
A data frame with 447 effect size measurements and 25 columns:
publication_id, integer ID number for the publication. Note thattwo publications were erroneously assigned the same ID number, so thisis not a unique publication identifier.
study_id, integer ID number for the study.
es_id, integer ID number for the effect size measured.
reference, publication citation in "Author(s) (year)" format. Dueto two publications being assigned the same reference, this is also not aunique publication identifier.
title, title of the publication. Due to the error in assigningpublication ID numbers, this is the unique publication identifier withinthe data set.
year, year of the publication.
location, geographical location of the intervention. This is givenas a factor, rather than an integer, using the informationprovided in the codebook.
domain, factor giving the intervention's behavioural domain.
intervention_category, factor giving the intervention's category,based on the taxonomy in Münscher et al. (2016).
intervention_technique, factor giving the intervention's technique,based on the taxonomy in Münscher et al. (2016).
type_experiment, factor giving the type of experiment, as definedby Harrison and List (2004).
population, factor giving the intervention's target population.This is given as a factor, rather than an integer, using the informationprovided in the codebook.
n_study, sample size of the overall study.
n_comparison, combined sample size of the control and theintervention for the measured effect size.
n_control, sample size of the control condition for themeasured effect size.
n_intervention, sample size of the intervention conditionfor the measured effect size.
binary_outcome, logical for whether the outcome scale is binary orcontinuous.
mean_control, mean of outcome for the control condition.
sd_control, SD of outcome for the control condition.
mean_intervention, mean of outcome for the intervention condition.
sd_intervention, SD of outcome for the intervention condition.
cohens_d, extracted effect size of intervention.
variance_d, variance of extracted effect size.
approximation, logical for whether effect size extraction involvedapproximation.
wansink, logical for whether the study was (co-)authored by BrianWansink. This was added on revision, because, a few years beforepublication, Wansink had many papers retracted or corrected, due tovarious questionable practices, resulting in Wansink being determined tohave committed scientific misconduct. This column was added to checkwhether the findings were robust to the exclusion of non-retractedstudies by the Cornell Food and Brand Laboratory, of which Wansink wasthe director.
Source
References
Mertens S., Herberz M., Hahnel U. J. J., Brosch T. (2022) Theeffectiveness of nudging: A meta-analysis of choice architectureinterventions across behavioral domains.Proc. Natl. Acad. Sci. U.S.A.,4,119(1).
Relational data records
Description
Generic function, for retrieving data contained in a database-like structure.In particular, this is intended for such structures where the individualrelations can't be accessed with subsetting.
Usage
records(x, ...)records(x, ...) <- valueArguments
x | a relational data object, such as a |
... | further arguments passed on to methods. |
value | A list of data frames of the same length as |
Details
Since the relational data objects inautodb,relationanddatabase, have subsetting methods that return relationaldata objects, the data contained within them can't be accessed by subsetting.This function is intended for accessing it instead.
It's recommended to callrecords before doing any subsetting, sincesubsetting on a relation data object does more work that will be thrown away,such as subsetting on adatabase checking whether foreign keyreferences should be removed.
Value
A list containing data frames, with elements named for theirrespective relations.
Examples
db <- autodb(ChickWeight)records(db) # data for Chick and Time_Chick relations# ways to get data for subsetsrecords(db)[c(1, 2)]records(db)[[1]]records(db)$Chick# subsetting first isn't recommended: removes foreign key# reference as mentions, and you need to subset again anywayrecords(db[[1]])[[1]]Remove relations not linked to the main relations
Description
Filters an object's relations, keeping only the main relations, and thoseconsidered ancestors via foreign key references. Foreign keyreferences involving removed relations are also removed.
Usage
reduce(x, ...)Arguments
x | An object whose relations are to be filtered. |
... | further arguments passed to or from other methods. |
Details
Details on how the main tables are chosen are given in individual methods.
This function is mostly intended for simplifying a database, or a databaseschema, for the purposes of exploration, particularly by examining plots.While the filtering might remove important auxiliary relations, it's alsolikely to remove any based on spurious dependencies, of which some databasescan contain many.
Value
An object of the same class asx, with the auxiliary relationsand foreign key references removed.
See Also
reduce.database_schema,reduce.database.
Remove database relations not linked to the main relations
Description
Filters a database's relations, keeping only the main relations, and thoseconsidered ancestors via foreign key references. Foreignkey references involving removed relations are also removed.
Usage
## S3 method for class 'database'reduce(x, main, ...)Arguments
x | A database, whose relations are to be filtered. |
main | A character vector, containing names of relations to beconsidered as the "main" relations. If missing, taken to be the names ofall relations with the largest record count. |
... | further arguments passed to or from other methods. |
Details
The main relations are considered to be the relations with the largest numberof records.
Usingrejoin on the database resulting fromreduce islikely to fail or return incomplete results.
Value
A database, with the auxiliary relations and foreign keyreferences removed.
Remove database schema relations not linked to the given relations
Description
Filters a database schema's relations, keeping only the given relations, andthose considered ancestors via foreign key references. Foreign keyreferences involving removed relations are also removed.
Usage
## S3 method for class 'database_schema'reduce(x, main, ...)Arguments
x | A database schema, whose relations are to be filtered. |
main | A character vector, containing names of relations to beconsidered as the "main" relations. |
... | further arguments passed to or from other methods. |
Details
This method takes a given set of main relations, rather than inferring them.
Usingrejoin on the database resulting from decomposing a dataframe with the reduced schema is likely to fail or return incomplete results.
Value
A database schema, with the auxiliary relations and foreign keyreferences removed.
Schema references
Description
Generic function, returning present (foreign key) references.
Usage
references(x, ...)references(x) <- valueArguments
x | an R object with references, such as a |
... | further arguments passed on to methods. |
value | A list, of the same length as |
Value
A list, giving references.
Join a database into a data frame
Description
Rejoins the relations in a database into a single data frame, if possible.This is the inverse of callingautodb, except that the rowsmight be returned in a different order.
Usage
rejoin(database)Arguments
database | A database containing the data to be rejoined, as returned by |
Details
The rejoining algorithm might not use all of the given relations: it beginswith the relation with the largest number of records, then joins it with enoughrelations to contain all of the present attributes. This is not limited torelations that the starting relation is linked to by foreign keys, and is notlimited to them either, since in some cases this constraint would make itimpossible to rejoin with all of the present attributes.
Since the algorithm may not use all of the given relations, the algorithm mayignore some types of database inconsistency, where different relations holddata inconsistent with each other. In this case, the rejoining will be lossy.Rejoining the results ofreduce can also be lossy.
Due to the above issues, the algorithm will be changed to use all of therelations in the future.
Not all databases can be represented as a single data frame. A simple exampleis any database where the same attribute name is used for several differencesources of data, since rejoining results in inappropriate merges.
Value
A data frame, containing all information containeddatabase ifit is lossless and self-consistent.
Examples
# simple exampledb <- autodb(ChickWeight)rj <- rejoin(db)rj <- rj[order(as.integer(rownames(rj))), ]all(rj == ChickWeight) # TRUE# showing rejoin() doesn't check for inconsistency:# add another Chick table with the diets swappeddb2 <- db[c(1, 2, 1)]records(db2)[[3]]$Diet <- rev(records(db2)[[3]]$Diet)rj2 <- rejoin(db2)rj2 <- rj2[order(as.integer(rownames(rj2))), ]all(rj2 == ChickWeight) # TRUERelation vectors
Description
Creates a set of relation schemas, including the relation's attributes andcandidate keys.
Usage
relation(relations, attrs_order)Arguments
relations | a named list of relations, in the form of two-element lists:the first element contains a data frame, where the column names are theattributes in the associated schema, and the second element contains a listof character vectors, each representing a candidate key. |
attrs_order | a character vector, giving the names of all attributes.These need not be present in |
Details
Relation vectors are unlikely to be needed by the user directly, since theyare essentiallydatabase objects that can't have foreign keyreferences. They are mostly used to mirror the use of the vector-likerelation_schema class for thedatabase_schemaclass to be a wrapper around. This makes creating adatabasefrom arelation_schema a two-step process, where the two stepscan be done in either order: creation withcreate andinsert, and adding references withdatabase_schema ordatabase.
Duplicate schemas, after ordering by attribute, are allowed, and can beremoved withunique.
When several sets of relation schemas are concatenated, theirattrs_order attributes are merged, so as to preserve all of the originalattribute orders, if possible. If this is not possible, because the orderingsdisagree, then the returned value of theattrs_order attribute is theirunion instead.
Value
Arelation object, containing the list given inrelations, withattrs_order stored in an attribute of thesame name. Relation schemas are returned with their keys' attributes sortedaccording to the attribute order inattrs_order, and the keys thensorted by priority order. Attributes in the data frame are also sorted,first by order of appearance in the sorted keys, then by order inattrs_order for non-prime attributes.
See Also
records,attrs,keys, andattrs_order for extracting parts of the information in arelation_schema;gv for converting the schema intoGraphviz code;rename_attrs for renaming the attributes inattrs_order.
Examples
rels <- relation( list( a = list( df = data.frame(a = logical(), b = logical()), keys = list("a") ), b = list( df = data.frame(b = logical(), c = logical()), keys = list("b", "c") ) ), attrs_order = c("a", "b", "c", "d"))print(rels)records(rels)attrs(rels)stopifnot(identical( attrs(rels), lapply(records(rels), names)))keys(rels)attrs_order(rels)names(rels)# inserting datainsert(rels, data.frame(a = 1L, b = 2L, c = 3L, d = 4L))# data is only inserted into relations where all columns are given...insert(rels, data.frame(a = 1L, b = 2L, c = 3L))# and that are listed in relations argumentinsert( rels, data.frame(a = 1L, b = 2L, c = 3L, d = 4L), relations = "a")# vector operationsrels2 <- relation( list( e = list( df = data.frame(a = logical(), e = logical()), keys = list("e") ) ), attrs_order = c("a", "e"))c(rels, rels2) # attrs_order attributes are mergedunique(c(rels, rels))# subsettingrels[1]rels[c(1, 2, 1)]stopifnot(identical(rels[[1]], rels[1]))# reassignmentrels3 <- relsrels3[2] <- relation( list( d = list( df = data.frame(d = logical(), c = logical()), keys = list("d") ) ), attrs_order(rels3))print(rels3) # note the relation's name doesn't change# names(rels3)[2] <- "d" # this would change the namekeys(rels3)[[2]] <- list(character()) # removing keys first...# for a relation_schema, we could then change the attrs for# the second relation. For a created relation, this is not# allowed.## Not run: attrs(rels3)[[2]] <- c("b", "c") names(records(rels3)[[2]]) <- c("b", "c")## End(Not run)# changing appearance priority for attributesrels4 <- relsattrs_order(rels4) <- c("d", "c", "b", "a")print(rels4)# reconstructing from componentsrels_recon <- relation( Map(list, df = records(rels), keys = keys(rels)), attrs_order(rels))stopifnot(identical(rels_recon, rels))# can be a data frame columndata.frame(id = 1:2, relation = rels)Relation schema vectors
Description
Creates a set of relation schemas, including the relation's attributes andcandidate keys.
Usage
relation_schema(schemas, attrs_order)Arguments
schemas | a named list of schemas, in the form of two-element lists: thefirst element contains a character vector of all attributes in the relationschema, and the second element contains a list of character vectors,each representing a candidate key. |
attrs_order | a character vector, giving the names of all attributes.These need not be present in |
Details
Duplicate schemas, after ordering by attribute, are allowed, and can beremoved with\code{\link{unique}}.
When several sets of relation schemas are concatenated, theirattrs_order attributes are merged, so as to preserve all of the originalattribute orders, if possible. If this is not possible, because the orderingsdisagree, then the returned value of theattrs_order attribute is theirunion instead.
Value
Arelation_schema object, containing the list given inschemas, withattrs_order stored in an attribute of the samename. Relation schemas are returned with their keys' attributes sortedaccording to the attribute order inattrs_order, and the keys thensorted by priority order. Attributes in the schema are also sorted, firstby order of appearance in the sorted keys, then by order inattrs_order for non-prime attributes.
See Also
attrs,keys, andattrs_order for extracting parts of the information in arelation_schema;create for creating arelation object that uses the given schema;gvfor converting the schema into Graphviz code;rename_attrsfor renaming the attributes inattrs_order;merge_empty_keys for combining relations with an empty key;merge_schemas for combining relations with matching sets ofkeys.
Examples
schemas <- relation_schema( list( a = list(c("a", "b"), list("a")), b = list(c("b", "c"), list("b", "c")) ), attrs_order = c("a", "b", "c", "d"))print(schemas)attrs(schemas)keys(schemas)attrs_order(schemas)names(schemas)# vector operationsschemas2 <- relation_schema( list( e = list(c("a", "e"), list("e")) ), attrs_order = c("a", "e"))c(schemas, schemas2) # attrs_order attributes are mergedunique(c(schemas, schemas))# subsettingschemas[1]schemas[c(1, 2, 1)]stopifnot(identical(schemas[[1]], schemas[1]))# reassignmentschemas3 <- schemasschemas3[2] <- relation_schema( list(d = list(c("d", "c"), list("d"))), attrs_order(schemas3))print(schemas3) # note the schema's name doesn't change# names(schemas3)[2] <- "d" # this would change the namekeys(schemas3)[[2]] <- list(character()) # removing keys first...attrs(schemas3)[[2]] <- c("b", "c") # so we can change the attrs legallykeys(schemas3)[[2]] <- list("b", "c") # add the new keysstopifnot(identical(schemas3, schemas))# changing appearance priority for attributesattrs_order(schemas3) <- c("d", "c", "b", "a")print(schemas3)# reconstructing from componentsschemas_recon <- relation_schema( Map(list, attrs(schemas), keys(schemas)), attrs_order(schemas))stopifnot(identical(schemas_recon, schemas))# can be a data frame columndata.frame(id = 1:2, schema = schemas)Rename relational data attributes
Description
Generic function, for renaming attributes present in a database-likestructure.
Usage
rename_attrs(x, names, ...)Arguments
x | an object with an |
names | a character vector of the same length as |
... | further arguments passed on to methods. |
Details
This function has a different intended use to re-assigningattrs_order: that is intended only for rearranging the order ofthe attributes, without renaming them. This is intended for renaming theattributes without re-ordering them.
Value
A relational object of the same type asx, with attributesrenamed consistently across the whole object.
Database subrelations
Description
Generic function, returning subrelations forx.
Usage
subrelations(x, ...)Arguments
x | an R object, intended to be some sort of database-like object thatcontains relations, such as a |
... | further arguments passed on to methods. |
Value
A relation-type object, or a list of relation-type objects if thesubrelation isn't vectorised. For example, ifx is adatabase, the result is the containedrelation.
Schema subschemas
Description
Generic function, returning subschemas forx.
Usage
subschemas(x, ...)Arguments
x | an R object, intended to be some sort of schema that contains otherschemas, such as a |
... | further arguments passed on to methods. |
Value
A schema-type object, or a list of schema-type objects if thesubschema isn't vectorised. For example, ifx is adatabase_schema, the result is the containedrelation_schema.
Synthesise relation schemas from functional dependencies
Description
Synthesises the dependency relationships in dependencies into a databaseschema satisfying at least third normal form, using Bernstein's synthesis.
Usage
synthesise( dependencies, ensure_lossless = TRUE, reduce_attributes = TRUE, remove_avoidable = FALSE, constants_name = "constants", progress = FALSE, progress_file = "")Arguments
dependencies | a |
ensure_lossless | a logical, TRUE by default. If TRUE, and thedecomposition isn't lossless, an extra relation is added to make thedecomposition lossless. |
reduce_attributes | a logical, TRUE by default. If TRUE, |
remove_avoidable | a logical, indicating whether to remove avoidableattributes in relations. If so, then an attribute are removed fromrelations if the keys can be changed such that it is not needed to preservethe given functional dependencies. |
constants_name | a scalar character, giving the name for any relationcreated to store constant attributes. If this is the same as a generatedrelation name, it will be changed, with a warning, to ensure that allrelations have a unique name. |
progress | a logical, for whether to display progress to the user duringdependency search in |
progress_file | a scalar character or a connection. If |
Details
Bernstein's synthesis is a synthesis algorithm for normalisation of a set ofdependencies into a set of relations that are in third normal form. Thisimplementation is based on the version given in the referenced paper.
The implementation also includes a common additional step, to ensure that theresulting decomposition is lossless, i.e. a relation satisfying the givendependencies can be perfectly reconstructed from the relations given by thedecomposition. This is done by adding an additional relation, containing akey for all the original attributes, if one is not already present.
As an additional optional step, schemas are checked for "avoidable"attributes, that can be removed without loss of information.
Constant attributes, i.e. those whose only determinant set is empty, getassigned to a relation with no keys.
Output is independent of the order of the input dependencies: schemas aresorted according to their simplest keys.
Schemas are sorted before ensuring for losslessness, or removing avoidableattributes. As a result, neither optional step changes the order of theschemas, and ensuring losslessness can only add an extra schema to the end ofthe output vector.
Value
Arelation_schema object, containing the synthesisedrelation schemas.
References
3NF synthesis algorithm: Bernstein P. A. (1976) Synthesizing third normalform relations from functional dependencies.ACM Trans. Database Syst.,1, 4, 277–298.
Removal of avoidable attributes: Ling T., Tompa F. W., Kameda T. (1981) Animproved third normal form for relational databases.ACM Trans. DatabaseSyst.,6, 2, 329–346.
Examples
# example 6.24 from The Theory of Relational Databases by David Maier# A <-> B, AC -> D, AC -> E, BD -> Cdeps <- functional_dependency( list( list("A", "B"), list("B", "A"), list(c("A", "C"), "D"), list(c("A", "C"), "E"), list(c("B", "D"), "C") ), attrs_order = c("A", "B", "C", "D", "E"))synthesise(deps, remove_avoidable = FALSE)synthesise(deps, remove_avoidable = TRUE)