Movatterモバイル変換

Title:

Automatic Database Normalisation for Data Frames

Version:

3.2.4

Description:

Automatic normalisation of a data frame to third normal form, with the intention of easing the process of data cleaning. (Usage to design your actual database for you is not advised.) Originally inspired by the 'AutoNormalize' library for 'Python' by 'Alteryx' (https://github.com/alteryx/autonormalize), with various changes and improvements. Automatic discovery of functional or approximate dependencies, normalisation based on those, and plotting of the resulting "database" via 'Graphviz', with options to exclude some attributes at discovery time, or remove discovered dependencies at normalisation time.

License:

BSD_3_clause + file LICENSE

Encoding:

UTF-8

Language:

en-GB

RoxygenNote:

7.3.3

Depends:

R (≥ 4.1.0)

Suggests:

rlang, spelling, DiagrammeR (≥ 1.0.7), testthat (≥ 3.3.0),R.utils (≥ 2.11.0), hedgehog (≥ 0.1), tibble (≥ 3.2.1),knitr, rmarkdown

Config/testthat/edition:

VignetteBuilder:

knitr

LazyData:

true

URL:

https://charnelmouse.github.io/autodb/,https://github.com/CharnelMouse/autodb

BugReports:

https://github.com/CharnelMouse/autodb/issues

NeedsCompilation:

Packaged:

2025-11-17 21:06:50 UTC; Mark

Author:

Mark Webster

[aut, cre]

Maintainer:

Mark Webster <markwebster204@yahoo.co.uk>

Repository:

CRAN

Date/Publication:

2025-11-17 21:40:02 UTC

Database-style normalisation for data.frames

Description

Automatic normalisation of a data.frame to third normal form, with theintention of easing the process of data cleaning. (Usage to design youractual database for you is not advised.) Originally inspired by Alteryx'sAutoNormalize Python library (https://github.com/alteryx/autonormalize), withvarious changes and improvements. Automatic discovery of functional orapproximate dependencies, normalisation based on those, and plotting of theresulting "database" via Graphviz, with options to exclude some attributes atdiscovery time, or remove discovered dependencies at normalisation time.

Author(s)

Maintainer: Mark Webstermarkwebster204@yahoo.co.uk (ORCID)

Relational data attributes

Description

Generic function, for fetching attribute sets for elements of a relationalobject.

Usage

attrs(x, ...)attrs(x, ...) <- value

Arguments

x

a relational schema object, such as arelation_schemaordatabase_schema object, or a relational data object, suchas arelation ordatabase object.

...

further arguments passed on to methods.

value

A character vector of the same length asattrs(x, ...).

Value

A list, containing a character vector for each element ofx.

Relational data attribute order

Description

Generic function, fetching attribute order for relational objects.

Usage

attrs_order(x, ...)attrs_order(x, ...) <- value

Arguments

x

an R object, such as afunctional_dependency,relation_schema,relation,database_schema, ordatabase object.

...

further arguments passed on to methods.

value

A character vector of the same length asattrs_order(x, ...).

Details

All classes inautodb contain anattrs_order attribute. Itgives an easy way to find a list of all attributes/variables involved in anobject, but its main purpose is to also assign those attributes a consistentorder when printing or plotting the object.

Value

A character vector, giving attributes in the order in which they'reprioritised for sorting withinx.

Create a normalised database from a data frame

Description

This is a wrapper function for applyingnormalise,autoref, anddecompose. This takes a data frameand converts it straight into a database, which is the main intended use casefor the package.

Usage

autodb(  df,  keep_rownames = FALSE,  digits = getOption("digits"),  single_ref = FALSE,  ensure_lossless = TRUE,  remove_avoidable = FALSE,  constants_name = "constants",  progress = FALSE,  progress_file = "",  ...)

Arguments

df

a data.frame, containing the data to be normalised.

keep_rownames

a logical or a string, indicating whether to include therow names as a column. If a string is given, it is used as the name for thecolumn, otherwise the column is named "row". Like with the other columnnames, the function returns an error if this results in duplicate columnnames. Set to FALSE by default.

digits

a positive integer, indicating how many significant digits areto be used for numeric and complex variables. This is used for bothpre-formatting indiscover, and for rounding the data beforeuse indecompose, so that the data satisfies the resultingschema. A value ofNA results in no rounding. By default, this usesgetOption("digits"), similarly toformat. See the"Floating-point variables" section fordiscover for why thisrounding is necessary for consistent results across different machines. Seethe note inprint.default aboutdigits >= 16.

single_ref

a logical, FALSE by default. If TRUE, then only onereference between each relation pair is kept when generating foreign keyreferences. If a pair has multiple references, the kept reference refers tothe earliest key for the child relation, as sorted by priority order.

ensure_lossless

a logical, indicating whether to check whether thenormalisation is lossless. If it is not, then an additional relation isadded to the final "database", containing a key fordf. This isenough to make the normalisation lossless.

remove_avoidable

a logical, indicating whether to remove avoidableattributes in relations. If so, then an attribute are removed fromrelations if the keys can be changed such that it is not needed to preservethe given functional dependencies.

constants_name

a scalar character, giving the name for any relationcreated to store constant attributes. If this is the same as a generatedrelation name, it will be changed, with a warning, to ensure that allrelations have a unique name.

progress

a logical, for whether to display progress to the user duringdependency search indiscover.

progress_file

a scalar character or a connection. Ifprogressis non-zero, determines where the progress is written to, in the same wayas thefile argument forcat.

...

further arguments passed on todiscover.

Details

Sincedecompose only works with functional dependencies, not approximatedependencies, the accuracy indiscover is fixed as 1.

Value

Adatabase, containing the data indf within theinferred database schema.

Examples

# simple exampleautodb(ChickWeight)

Add foreign key references to a normalised database

Description

Adds foreign key references to arelation_schema objectautomatically, replacing any existing references.

Usage

autoref(schema, single_ref = FALSE)

Arguments

schema

arelation_schema object, as given bysynthesise.

single_ref

Details

The method for generating references is simple. First, it finds every linkbetween two relation schemas, where the parent contains all the attributes inone of the child's keys. This can be done separately for all of the child'skeys, so there can be multiple links with the same parent and child ifsingle_ref isTRUE.

Second, any transitive references are removed: if there are link relationpairs a -> b, b -> c, and a -> c, then the latter is transitive, and so isremoved. If there is a cyclic reference, e.g. where c -> a, then the choiceof which link to remove is arbitrary. Cycles cannot occur in sets of relationschemas resulting from decomposing a single table.

Value

Adatabase_schema object, containing the given relationschemas and the created foreign key references.

Examples

rs <- relation_schema(  list(    a_b_c = list(c("a", "b", "c", "d"), list(c("a", "b", "c"))),    a_b = list(c("a", "b", "d"), list(c("a", "b"), c("b", "d")))  ),  letters[1:4])autoref(rs, single_ref = FALSE)autoref(rs, single_ref = TRUE)

Create instance of a schema

Description

Create a relation data object, using the given relational schema object, withthe resulting relations empty and ready for data insertion usinginsert.

Usage

create(x, ...)

Arguments

x

a relational schema object, representing the schema to create aninstance of, such as arelation_schema ordatabase_schema object.

...

further arguments passed on to methods.

Value

An instance of the schema. For example, callingcreate on adatabase_schema creates adatabase, where allthe relations contain zero records.

Generate D2 input text to plot objects

Description

Produces text input for D2 to make a diagram of a given object, usuallyrendered with SVG.

Usage

d2(x, ...)

Arguments

x

an object to be plotted.

...

further arguments passed to or from other methods.

Details

The D2 language is in an early stage of development (pre-v1.0), so it may besubject to changes that make it unable to use output from the current versionofd2.

Details of what is plotted are given in individual methods. There areexpected commonalities, which are described below.

The object is expected to be one of the following:

an object whose elements have the same length. Examples would bedata frames, matrices, and other objects that can represent relations, withnames for the elements, and an optional name for the object itself.
a graph of sub-objects, each of which represent a relation asdescribed above, possibly with connections between the objects, and anoptional name for the graph as a whole.

Each relation is presented as a record-like shape, with the following elements:

A optional header with the relation's name, and the number of (unique)records.
A set of rows, one for each attribute in the relation. These rowshave the following contents:
- the attribute names.
- a depiction of the relation's (candidate) keys. Eachcolumn represents a key, and a filled cell indicates that the attributein that row is in that key. The keys are given in lexical order, withprecedence given to keys with fewer attributes, and keys with attributesthat appear earlier in the original data frame's attribute order. Defaultoutput from other package functions will thus have the primary key givenfirst. In the future, this will be changed to always give the primary keyfirst.
- optionally, the attribute types: specifically, the first elementwhen passing the attribute's values intoclass.

Any foreign key references between relations are represented by one-way arrows,one per attribute in the foreign key.

If the object has a name, then currently the name is not used, except as asingle data frame's name. In the future, this will be used to give a name tothe generated board, to make use of D2's composition features.

Value

A scalar character, containing text input for D2.

Examples

# simple data.frame examplecat(d2(ChickWeight, "chick"))

Generate D2 input text to plot a data frame

Description

Produces text input for D2 to make a diagram of a given data frame, usuallyrendered with SVG.

Usage

## S3 method for class 'data.frame'd2(x, name = NA_character_, ...)

Arguments

x

a data.frame.

name

a character scalar, giving the name of the record, if any. Thename must be non-empty, since it is also used to name the single table inthe plot. Defaults toNA: if left missing, it is set to "data".

...

further arguments passed to or from other methods.

Details

The rows in the plotted data frame include information about the attributeclasses.

Value

A scalar character, containing text input for Graphviz.

Generate D2 input text to plot databases

Description

Produces text input for D2 to make a diagram of a given database, usuallyrendered with SVG.

Usage

## S3 method for class 'database'd2(x, name = NA_character_, reference_level = c("attr", "relation"), ...)

Arguments

x

adatabase.

name

a character scalar, giving the name of the schema, if any.

reference_level

a character scalar, indicating the format to use forforeign key references. "relation" only specifies the relations involved;"attr" also specifies the attributes involved, one pair at a time.

...

further arguments passed to or from other methods.

Details

Each relation in the database is presented as a set of rows, one for eachattribute in the relation. These rows include information about the attributeclasses.

Any foreign key references are represented by arrows between either theattribute pairs or the relation pairs, depending on the value ofreference_level. This allows the output to be geared towards aspecific layout engine. Of the engines currently available for D2, Dagre cannot plot references between relation attributes, just the attributesthemselves, so usingreference_level = "relation" prevents compoundforeign keys resulting in duplicate reference arrows. ELK and Talacan plot between relation attributes, so the defaultreference_level ="attr" works as intended.

Value

A scalar character, containing text input for D2.

Generate D2 input text to plot database schemas

Description

Produces text input for D2 to make a diagram of a given database schema,usually rendered with SVG.

Usage

## S3 method for class 'database_schema'd2(x, name = NA_character_, reference_level = c("attr", "relation"), ...)

Arguments

x

a database schema, as given bydatabase_schema ornormalise.

name

a character scalar, giving the name of the schema, if any.

reference_level

a character scalar, indicating the format to use forforeign key references. "relation" only specifies the relations involved;"attr" also specifies the attributes involved, one pair at a time.

...

further arguments passed to or from other methods.

Details

Each relation in the schema is presented as a set of rows, one for eachattribute in the relation. These rows do not include information about theattribute classes.

Value

A scalar character, containing text input for D2.

Generate D2 input text to plot relations

Description

Produces text input for D2 to make a diagram of a given relation, usuallyrendered with SVG.

Usage

## S3 method for class 'relation'd2(x, name = NA_character_, ...)

Arguments

x

arelation.

name

a character scalar, giving the name of the schema, if any.

...

further arguments passed to or from other methods.

Details

Each relation is presented as a set of rows, one for eachattribute in the relation. These rows include information about theattribute classes.

Value

A scalar character, containing text input for D2.

Generate D2 input text to plot relation schemas

Description

Produces text input for D2 to make a diagram of a given relation schema,usually rendered with SVG.

Usage

## S3 method for class 'relation_schema'd2(x, name = NA_character_, ...)

Arguments

x

a relation schema, as given byrelation_schema orsynthesise.

name

a character scalar, giving the name of the schema, if any.

...

further arguments passed to or from other methods.

Details

Each relation in the schema is presented as a set of rows, one for eachattribute in the relation. These rows do not include information about theattribute classes.

Value

A scalar character, containing text input for D2.

Databases

Description

Enhances arelation object with foreign key referenceinformation.

Usage

database(relations, references, check = TRUE)

Arguments

relations

arelation object.

references

a list of references, eachrepresented by a list containing four character elements. In order, theelements are a scalar giving the name of the child (referrer) schema, avector giving the child attribute names, a scalar giving the name of theparent (referee) schema, and a vector giving the parent attribute names. Thevectors must be of the same length and contain names for attributes presentin their respective schemas, and the parent attributes must form a key.

check

a logical, indicating whether to check thatrelationssatisfies the foreign key references given inreferences beforecreating the result. This is redundant ifrelations andreferences were constructed based on the same functionaldependencies, such as when usingautodb. Only set to FALSE ifthe references are definitely satisfied: not checking references thatviolate the data will result in an invalid database, which will not bedetected until further operated on.

Details

Unlikerelation_schema andrelation, and likedatabase_schema,database is not designed to bevector-like: it only holds a single database. This adheres to the usualpackage use case, where a single data frame is being analysed at a time.However, it inherits fromrelation, so is vectorised withrespect to its relations.

As withrelation, duplicate relations, after ordering byattribute, are allowed, and can be removed withunique.

References, i.e. foreign key references, are allowed to have differentattribute names in the child and parent relations; this can't occur in theoutput forautoref andnormalise.

Subsetting removes any references that involve removed relations.Removing duplicates withunique changes references involvingduplicates to involve the kept equivalent relations instead. Renamingrelations withnames<- also changes their namesin the references.

Value

Adatabase object, containingrelations withreferences stored in an attribute of the same name. Referencesare stored with their attributes in the order they appear in theirrespective relations.

Examples

rels <- relation(  list(    a = list(      df = data.frame(a = logical(), b = logical()),      keys = list("a")    ),    b = list(      df = data.frame(b = logical(), c = logical()),      keys = list("b", "c")    )  ),  attrs_order = c("a", "b", "c", "d"))db <- database(  rels,  list(list("a", "b", "b", "b")))print(db)attrs(db)stopifnot(identical(  attrs(db),  lapply(records(db), names)))keys(db)attrs_order(db)names(db)references(db)# relations can't reference themselves## Not run:   database(    relation(      list(a = list(df = data.frame(a = 1:5), keys = list("a"))),      c("a", "b")    ),    list(list("a", "a", "a", "a"))  )  database(    relation(      list(a = list(df = data.frame(a = 1:5, b = 6:10), keys = list("a"))),      c("a", "b")    ),    list(list("a", "b", "a", "a"))  )## End(Not run)# an example with references between differently-named attributesprint(database(  relation(    list(      citation = list(        df = data.frame(citer = 1:5, citee = 6:10),        keys = list(c("citer", "citee"))      ),      article = list(df = data.frame(article = 1:10), keys = list("article"))    ),    c("citer", "citee", "article")  ),  list(    list("citation", "citer", "article", "article"),    list("citation", "citee", "article", "article")  )))# inserting datainsert(db, data.frame(a = 1L, b = 2L, c = 3L, d = 4L))# data is only inserted into relations where all columns are given...insert(db, data.frame(a = 1L, b = 2L, c = 3L))# and that are listed in relations argumentinsert(  db,  data.frame(a = 1L, b = 2L, c = 3L, d = 4L),  relations = "b")# inserted data can't violate keys## Not run:   insert(    db,    data.frame(a = 1L, b = 1:2)  )## End(Not run)# inserted data can't violate foreign key references## Not run:   insert(    db,    data.frame(a = 1L, b = 2L, c = 3L, d = 4L),    relations = "a"  )## End(Not run)# vector operationsdb2 <- database(  relation(    list(      e = list(df = data.frame(a = 1:5, e = 6:10), keys = list("e"))    ),    attrs_order = c("a", "e")  ),  list())c(db, db2) # attrs_order attributes are mergedunique(c(db, db))# subsettingdb[1]stopifnot(identical(db[[1]], db[1]))db[c(1, 2, 1, 2)] # replicates the foreign key referencesc(db[c(1, 2)], db[c(1, 2)]) # doesn't reference between separate copies of dbunique(db[c(1, 2, 1, 2)]) # unique() also merges references# another example of unique() merging referencesdb_merge <- database(  relation(    list(      a = list(        df = data.frame(a = logical(), b = logical()),        keys = list("a")      ),      b = list(        df = data.frame(b = logical(), c = logical(), d = logical()),        keys = list("b")      ),      c_d = list(        df = data.frame(c = logical(), d = logical(), e = logical()),        keys = list(c("c", "d"))      ),      a.1 = list(        df = data.frame(a = logical(), b = logical()),        keys = list("a")      ),      b.1 = list(        df = data.frame(b = logical(), c = logical(), d = logical()),        keys = list("b")      )    ),    c("a", "b", "c", "d", "e")  ),  list(    list("a", "b", "b", "b"),    list("b.1", c("c", "d"), "c_d", c("c", "d"))  ))print(db_merge)unique(db_merge)# reassignment# can't change keys included in references## Not run: keys(db)[[2]] <- list("c")# can't remove attributes included in keys## Not run: attrs(db)[[2]] <- list("c", "d")# can't remove attributes included in references## Not run: attrs(db)[[1]] <- c("a", "d")db3 <- db# can change subset of schema, but loses references between altered and# non-altered subsetsdb3[2] <- database(  relation(    list(d = list(df = data.frame(d = logical(), c = logical()), keys = list("d"))),    attrs_order(db3)  ),  list())print(db3) # note the schema's name doesn't change# names(db3)[2] <- "d" # this would change the namekeys(db3)[[2]] <- list(character()) # removing keys first...# for a database_schema, we could then change the attrs for# the second database. For a created relation, this is not# allowed.## Not run:   attrs(db3)[[2]] <- c("b", "c")  names(records(db3)[[2]]) <- c("b", "c")## End(Not run)# changing appearance priority for attributesattrs_order(db3) <- c("d", "c", "b", "a")print(db3)# changing relation schema names changes them in referencesnames(db3) <- paste0(names(db3), "_long")print(db3)# reconstructing from componentsdb_recon <- database(  relation(    Map(list, df = records(db), keys = keys(db)),    attrs_order(db)  ),  references(db))stopifnot(identical(db_recon, db))db_recon2 <- database(  subrelations(db),  references(db))stopifnot(identical(db_recon2, db))# can be a data frame columndata.frame(id = 1:2, relation = db)# setting check = FALSE can give invalid databaseschickfds <- discover(ChickWeight)chickschema <- synthesise(chickfds)chickrels <- insert(create(chickschema), ChickWeight)badrefs <- list(list("Time_Chick", "weight", "Chick", "Chick"))# check = TRUE stops on non-satisfied references## Not run: database(chickrels, badrefs)# check = FALSE returns an invalid databasebaddb <- database(chickrels, badrefs, check = FALSE)# only returns an error when further manipulated by certain methods## Not run: insert(baddb, ChickWeight)

Database schemas

Description

Enhances arelation_schema object with foreign key referenceinformation.

Usage

database_schema(relation_schemas, references)

Arguments

relation_schemas

arelation_schema object, as returnedbysynthesise orrelation_schema.

references

Details

Unlikefunctional_dependency andrelation_schema,database_schema is not designed to be vector-like: it only holds asingle database schema. This adheres to the usual package use case, where asingle data frame is being analysed at a time. However, it inherits fromrelation_schema, so is vectorised with respect to its relationschemas.

As withrelation_schema, duplicate relation schemas, afterordering by attribute, are allowed, and can be removed withunique.

References, i.e. foreign key references, are allowed to have differentattribute names in the child and parent relations; this can't occur in theoutput forautoref andnormalise.

Subsetting removes any references that involve removed relation schemas.Removing duplicates withunique changes references involvingduplicates to involve the kept equivalent schemas instead. Renaming relationschemas withnames<- also changes their names inthe references.

Value

Adatabase_schema object, containingrelation_schemaswithreferences stored in an attribute of the same name.References are stored with their attributes in the order they appear intheir respective relation schemas.

Examples

rs <- relation_schema(  list(    a = list(c("a", "b"), list("a")),    b = list(c("b", "c"), list("b", "c"))  ),  attrs_order = c("a", "b", "c", "d"))ds <- database_schema(  rs,  list(list("a", "b", "b", "b")))print(ds)attrs(ds)keys(ds)attrs_order(ds)names(ds)references(ds)# relations can't reference themselves## Not run:   database_schema(    relation_schema(      list(a = list("a", list("a"))),      c("a", "b")    ),    list(list("a", "a", "a", "a"))  )  database_schema(    relation_schema(      list(a = list(c("a", "b"), list("a"))),      c("a", "b")    ),    list(list("a", "b", "a", "a"))  )## End(Not run)# an example with references between differently-named attributesprint(database_schema(  relation_schema(    list(      citation = list(c("citer", "citee"), list(c("citer", "citee"))),      article = list("article", list("article"))    ),    c("citer", "citee", "article")  ),  list(    list("citation", "citer", "article", "article"),    list("citation", "citee", "article", "article")  )))# vector operationsds2 <- database_schema(  relation_schema(    list(      e = list(c("a", "e"), list("e"))    ),    attrs_order = c("a", "e")  ),  list())c(ds, ds2) # attrs_order attributes are mergedunique(c(ds, ds))# subsettingds[1]stopifnot(identical(ds[[1]], ds[1]))ds[c(1, 2, 1, 2)] # replicates the foreign key referencesc(ds[c(1, 2)], ds[c(1, 2)]) # doesn't reference between separate copies of dsunique(ds[c(1, 2, 1, 2)]) # unique() also merges references# another example of unique() merging referencesds_merge <- database_schema(  relation_schema(    list(      a = list(c("a", "b"), list("a")),      b = list(c("b", "c", "d"), list("b")),      c_d = list(c("c", "d", "e"), list(c("c", "d"))),      a.1 = list(c("a", "b"), list("a")),      b.1 = list(c("b", "c", "d"), list("b"))    ),    c("a", "b", "c", "d", "e")  ),  list(    list("a", "b", "b", "b"),    list("b.1", c("c", "d"), "c_d", c("c", "d"))  ))print(ds_merge)unique(ds_merge)# reassignment# can't change keys included in references## Not run: keys(ds)[[2]] <- list("c")# can't remove attributes included in keys## Not run: attrs(ds)[[2]] <- list("c", "d")# can't remove attributes included in references## Not run: attrs(ds)[[1]] <- c("a", "d")ds3 <- ds# can change subset of schema, but loses references between altered and# non-altered subsetsds3[2] <- database_schema(  relation_schema(    list(d = list(c("d", "c"), list("d"))),    attrs_order(ds3)  ),  list())print(ds3) # note the schema's name doesn't change# names(ds3)[2] <- "d" # this would change the namekeys(ds3)[[2]] <- list(character()) # removing keys first...attrs(ds3)[[2]] <- c("b", "c") # so we can change the attrs legallykeys(ds3)[[2]] <- list("b", "c") # add the new keys# add the reference lost during subset replacementreferences(ds3) <- c(references(ds3), list(list("a", "b", "b", "b")))stopifnot(identical(ds3, ds))# changing appearance priority for attributesattrs_order(ds3) <- c("d", "c", "b", "a")print(ds3)# changing relation schema names changes them in referencesnames(ds3) <- paste0(names(ds3), "_long")print(ds3)# reconstructing from componentsds_recon <- database_schema(  relation_schema(    Map(list, attrs(ds), keys(ds)),    attrs_order(ds)  ),  references(ds))stopifnot(identical(ds_recon, ds))ds_recon2 <- database_schema(  subschemas(ds),  references(ds))stopifnot(identical(ds_recon2, ds))# can be a data frame columndata.frame(id = 1:2, schema = ds)

Decompose a data frame based on given normalised dependencies

Description

Decomposes a data frame into several relations, based on the given databaseschema. It's intended that the data frame satisfies all the functionaldependencies implied by the schema, such as if the schema was constructedfrom the same data frame. If this is not the case, the function will returnsan error.

Usage

decompose(  df,  schema,  keep_rownames = FALSE,  digits = getOption("digits"),  check = TRUE)

Arguments

df

a data.frame, containing the data to be normalised.

schema

a database schema with foreign key references, such as given byautoref.

keep_rownames

a logical or a string, indicating whether to include therow names as a column. If a string is given, it is used as the name for thecolumn, otherwise the column is named "row". Set to FALSE by default.

digits

a positive integer, indicating how many significant digits areto be used for numeric and complex variables. A value ofNA resultsin no rounding. By default, this usesgetOption("digits"), similarlytoformat. See the "Floating-point variables" section fordiscover for why this rounding is necessary for consistentresults across different machines. See the note inprint.default aboutdigits >= 16.

check

a logical, indicating whether to check thatdf satisfiesthe functional dependencies enforced byschema before creating theresult. This can find key violations without spending time creating theresult first, but is redundant ifdf was used to createschema in the first place.

Details

If the schema was constructed using approximate dependencies for the samedata frame,decompose returns an error, to prevent either duplicate recordsor lossy decompositions. This is temporary: for the next update, we plan toadd an option to allow this, or to add "approximate" equivalents of databasesand database schemas.

Value

Adatabase object, containing the data indfwithin the database schema given inschema.

Dependants

Description

Generic function, with the only given method fetching dependants forfunctional dependencies.

Usage

dependant(x, ...)dependant(x, ...) <- value

Arguments

x

an R object. For the given method, afunctional_dependency.

...

further arguments passed on to methods.

value

A character vector of the same length asdependant(x, ...).

Value

A character vector containing dependants.

Determinant sets

Description

Generic function, with the only given method fetching determinant sets forfunctional dependencies.

Usage

detset(x, ...)detset(x, ...) <- value

Arguments

x

an R object. For the given method, afunctional_dependency.

...

further arguments passed on to methods.

value

A character vector of the same length asdetset(x, ...).

Value

A list containing determinant sets, each consisting of a charactervector with unique elements.

Determine Duplicate Elements

Description

duplicated "determines which elements of a vector or data frameare duplicates of elements with smaller subscripts, and returns a logicalvector indicating which elements (rows) are duplicates". However, as of R4.1, calling this on a data frame with zero columns always returns an emptylogical vector. This has repercussions on other functions that useduplicated, such asunique andanyDuplicated.These functions add zero-column data frames as a special case.

Usage

df_duplicated(x, incomparables = FALSE, fromLast = FALSE, ...)df_unique(x, incomparables = FALSE, fromLast = FALSE, ...)df_anyDuplicated(x, incomparables = FALSE, fromLast = FALSE, ...)df_records(x, use_rownames = FALSE, use_colnames = FALSE)

Arguments

x

a data frame.

incomparables

a vector of values that cannot be compared.FALSE is a special value, meaning that all values can becompared, and may be the only value accepted for methods other thanthe default. It will be coerced internally to the same type asx.

fromLast

logical indicating if duplication should be consideredfrom the reverse side, i.e., the last (or rightmost) of identicalelements would correspond toduplicated = FALSE.

...

arguments for particular methods.

use_rownames

a logical, FALSE by default, indicating whether rowvalues should keep the row names fromx. Defaults to FALSE.

use_colnames

a logical, FALSE by default, indicating whether rowvalues should keep the column names fromx for their elements.Defaults to FALSE.

Value

Fordf_duplicated, a logical vector with one element for eachrow.

Fordf_unique, a data frame is returned with the samecolumns, but possible fewer rows (and with row names from the firstoccurrences of the unique rows).

Fordf_anyDuplicated, an integer or real vector of lengthone with value the 1-based index of the first duplicate if any, otherwise0.

Fordf_records, a list of the row values inx. This isbased on a step induplicated.data.frame. However, for dataframes with zero columns, special handling returns a list of empty rowvalues, one for each row inx. Without special handling, this stepreturns an empty list. This was the cause forduplicatedreturning incorrect results for zero-column data frames in older versionsof R.

Examples

# row values for a 5x0 data framex <- data.frame(a = 1:5)[, FALSE, drop = FALSE]do.call(Map, unname(c(list, x))) # original step returns empty listdf_records(x) # corrected version preserves row count

Test data frames for equivalence under row reordering

Description

A convenience function, mostly used to testing thatrejoinworks as intended. It checks that data frames have the same dimensions andcolumn names, with duplicates allowed, then checks they contain the samedata. For the latter step, column names are made unique first, so columnswith duplicate names must be presented in the same order in both data frames.

Usage

df_equiv(df1, df2, digits = getOption("digits"))

Arguments

df1,df2

Data frames.

digits

a positive integer, indicating how many significant digits areto be used for numeric and complex variables. A value of NA results in norounding. By default, this usesgetOption("digits"), similarly toformat. See the note inprint.default aboutdigits >= 16.

Value

A logical.

Combine R Objects by Rows or Columns

Description

rbind takes "a sequence of vector, matrix or data-framearguments", and combines by rows for the latter. However, as of R 4.1,calling this on data frame with zero columns always returns zero rows, due tothe issue mentioned fordf_duplicated. This function addszero-column data frames as a special case.

Usage

df_rbind(...)

Arguments

...

data frames.

Value

A data frame containing the... arguments row-wise.

Dependency discovery with DFD

Description

Finds all the minimal functional dependencies represented in a data frame.

Usage

discover(  df,  method = c("FDHitsSep", "FDHitsJoint", "DFD"),  keep_rownames = FALSE,  digits = getOption("digits"),  exclude = character(),  exclude_class = character(),  dependants = names(df),  detset_limit = ncol(df) - 1L,  progress = FALSE,  progress_file = "",  skip_bijections = FALSE,  accuracy = 1,  full_cache = TRUE,  store_cache = TRUE)

Arguments

df

a data.frame, the relation to evaluate.

method

a string, indicating which search algorithm to use. Currently,this defaults to DFD. Alternative options are FDHitsSep and FDHitsJoint.

keep_rownames

digits

a positive integer, indicating how many significant digits areto be used for numeric and complex variables. A value ofNA resultsin no rounding. By default, this usesgetOption("digits"), similarlytoformat. See the "Floating-point variables" section belowfor why this rounding is necessary for consistent results across differentmachines. See the note inprint.default aboutdigits >= 16.

exclude

a character vector, containing names of attributes to notconsider as members of determinant sets. If names are given that aren'tpresent indf, the user is given a warning.

exclude_class

a character vector, indicating classes of attributes tonot consider as members of determinant_sets. Attributes are excluded ifthey inherit from any given class.

dependants

a character vector, containing names of all attributes forwhich to find minimal functional dependencies for which they are thedependant. By default, this is all of the attribute names. A smaller set ofattribute names reduces the amount of searching required, so can reduce thecomputation time if only some potential dependencies are of interest.

detset_limit

an integer, indicating the largest determinant set sizethat should be searched for. By default, this is large enough to allow allpossible determinant sets. See Details for comments about the effect on theresult, and on the computation time.

progress

a logical, for whether to display progress to the user duringdependency search indiscover.

progress_file

a scalar character or a connection. Ifprogressis non-zero, determines where the progress is written to, in the same wayas thefile argument forcat.

skip_bijections

a logical, indicating whether to skip some dependencysearches that are made redundant by discovered bijections betweenattributes. This can significantly speed up the search ifdfcontains equivalent attributes early in column order, but results inundefined behaviour ifaccuracy < 1. See Details for moreinformation. Currently only implemented for DFD.

accuracy

a numeric in (0, 1]: the accuracy threshold required in orderto conclude a dependency. Accuracy thresholds less than one are onlysupported in DFD.

full_cache

a logical, indicating whether to store information abouthow sets of attributes group the relation records (stripped partitions).Otherwise, only the number of groups is stored. Storing the strippedpartition is expected to let the algorithm run more quickly, but might beinefficient for small data frames or small amounts of memory. Only relevantfor DFD.

store_cache

a logical, indicating whether to keep cached informationto use when finding dependencies for other dependants. This allows thealgorithm to run more quickly by not having to re-calculate information,but takes up more memory. Only relevant for DFD.

Details

Column names fordf must be unique.

There are two search algorithms available for finding dependencies: DFD, andFDHits. These are described below.

DFD

The DFD algorithm searches for determinant sets for each dependant attributeseparately, by traversing the powerset of the other (non-excluded)attributes. It can roughly be considered as a depth-first search overcandidate determinant sets.

The implementation for DFD differs a little from the algorithm presented inthe original paper:

Some attributes, or attribute types, can be designated, ahead oftime, as not being candidate members for determinant sets. This reduces thenumber of candidate determinant sets to be searched, saving time by notsearching for determinant sets that the user would remove later anyway. Thedependants can also be trimmed in similar fashion, although this reducesthe search space linearly rather than exponentially.
Attributes that have a single unique value, i.e. areconstant, get attributed a single empty determinant set. In the standardDFD algorithm, they would be assigned all the other non-excluded attributesas length-one determinant sets. Assigning them the empty set distinguishesthem as constant, allowing for special treatment at normalisation and latersteps.
As was done in the original Python library, there is an extra case inseed generation for when there are no discovered maximal non-dependencies.In this case, we take all of the single-attribute nodes, then filter out byminimal dependencies as usual. This is equivalent to taking the empty setas the single maximal non-dependency.
There are three results when checking whether a candidate node isminimal/maximal. TRUE indicates the node is minimal/maximal, as usual.FALSE has been split into FALSE and NA. NA indicates that we can not yetdetermine whether the node is minimal/maximal. FALSE indicates that we havedetermined that it is not minimal/maximal, and we can set its category assuch. This is done by checking whether any of its adjacentsubsets/supersets are dependencies/non-dependencies, instead of waiting toexhaust the adjacent subsets/supersets to visit when picking the next nodeto visit.
We do not yet keep hashmaps to manage subset/superset relationships,as described in Section 3.5 of the original paper.
skip_bijections allows for additional optimisation for findingfunctional dependencies when there are pairwise-equivalent attributes.
Missing values (NA) are treated as a normal value, with NA = NA beingtrue, and x = NA being false for any non-NA value of x.

FDHits

FDHits begins by sampling pairs of records for attributes in which theirvalues differ (difference sets). These difference sets render somedeterminant-dependant pairings invalid, pruning the search space. If acandidate pairing is not rendered invalid by the known difference sets, thenit's validated, resulting in either confirmation as a minimal functionaldependency, or discovery of new difference sets that make it invalid.

There are two variants. FDHitsSep performs a search for each dependantattribute separately, in the same way as DFD. FDHitsJoint handles alldependant attributes at the same time.

The implementation for FDHits differs a little from the algorithm presentedin the original paper:

Some attributes, or attribute types, can be designated, ahead oftime, as not being candidate members for determinant sets. This reduces thenumber of candidate determinant sets to be searched, saving time by notsearching for determinant sets that the user would remove later anyway. Thedependants can also be trimmed in similar fashion.
The search can be limited to determinant sets up to a given size.This is also an option for DFD, but it's more useful for FDHits, due to thesearch order.
As described in the paper, FDHitsSep and FDHitsJoint branch a searchinto several child nodes:\mu_0 for FDHitsJoint, and\mu_1,\mu_2 etc. for both. The paper implies that the latter nodes arevisited in order. However, this causes the algorithm to not always workcorrectly: for guaranteed correctness, they must be visited in reverseorder, with\mu_0 able to be visited at any point. This correction isexpected to appear in a future paper.
The final algorithm in the paper automatically chooses betweenFDHitsSep and FDHits Joint, depending on the number of initially-sampleddifference sets. This is not yet implemented.

Floating-point variables

Numerical/complex values, i.e. floating-point values, represent difficultiesfor stating functional dependencies. A fundamental condition for statingfunctional dependencies is that we can compare two values for the samevariable, and they are equivalent or not equivalent.

Usually, this is done by checking they're equal – this is the approach usedindiscover – but we can use any comparison that is an equivalencerelation.

However, checking floating-point values for equality is not simple.==is not appropriate, even when comparing non-calculated values we've read froma file, because how a given number is converted into a float can vary bycomputer architecture, meaning that two values can be considered equal on onecomputer, and not equal on another. This can happen even if they're bothusing 64-bit R, and even though all R platforms work with values conformingto the same standard (seedouble). For example,8.54917750000000076227 and8.54917749999999898591 are convertedinto different floating-point representations on x86, but the samerepresentation on ARM, resulting in inequality and equality respectively.

For this and other reasons, checking numerical/complex values for(near-)equality in R is usually done withall.equal. Thisdetermines valuesx andy to be equal if their absolute/relativeabsolute difference is within some tolerance value. However, we can not usethis. Equivalence relations must be transitive: if we have valuesx,y, andz, andx is equivalent to bothy andz,theny andz must also be equivalent. This tolerance-basedequivalence is not transitive: it is reasonably straightforward to set upthree values so that the outer values are far enough apart to be considerednon-equivalent, but the middle value is close enough to be consideredequivalent to both of them. Using this to determine functional dependencies,therefore, could easily result in a large number of inconsistencies.

This means we have no good option for comparing numerical/complex valuesas-is for equivalence, with consistent results across different machines, sowe must treat them differently. We have three options:

Round/truncate the values, before comparison, to some low degree of precision;
Coerce the values to another class before passing them intodiscover;
Read values as characters if reading data from a file.

discover takes the first option, with a default number of significantdigits low enough to ensure consistency across different machines. However,the user can also use any of these options when processing the data beforepassing it todiscover. The third option, in particular, isrecommended if reading data from a file.

Skipping bijections

Skipping bijections allows skipping redundant searches. For example, if thesearch discovers thatA -> B andB -> A, then only one of thoseattributes is considered for the remainder of the search. Since the searchtime increases exponentially with the number of attributes considered, thiscan significantly speed up search times. At the moment, this is only be donefor bijections between single attributes, such asA <-> B; ifA<-> {B, C}, nothing is skipped. Whether bijections are skipped doesn'taffect which functional dependencies are present in the output, but it mightaffect their order.

Skipping bijections for approximate dependencies, i.e. whenaccuracy < 1,should be avoided: it can result in incorrect output, since an approximatebijection doesn't imply equivalent approximate dependencies.

Limiting the determinant set size

Settingdetset_limit smaller than the largest-possible value hasdifferent behaviour for different search algorithms, but the result is alwaysthatdiscover(x, 1, detset_limit = n) is equivalent to doing a fullsearch,fds <- discover(x, 1), thenfiltering by determinant set size post-hoc,fds[lengths(detset(fds)) <=n].

For DFD, the naive way to implement it is by removing determinant sets largerthan the limit from the search tree for possible functional dependencies foreach dependant. However, this usually results in the search taking much moretime than without a limit.

For example, suppose we search for determinant sets for a dependant that hasnone (the dependant is the only key fordf, for example). Using DFD,we begin with a single attribute, then add other attributes one-by-one, sinceevery set gives a non-dependency. When we reach a maximum-size set, we canmark all subsets as also being non-dependencies.

With the default limit, there is only one maximum-size set, containing all ofthe available attributes. If there aren candidate attributes fordeterminants, the search finishes after visitingn sets.

With a smaller limitk, there are\binom{n}{k} maximum-size setsto explore. Since a DFD search adds or removes one attribute at each step,this means the search must take at leastk - 2 + 2\binom{n}{k} steps,which is larger thann for all non-trivial cases0 < k \leq n.

We therefore use a different approach, where any determinant sets above thesize limit are not allowed to be candidate seeds for new search paths, andany discovered dependencies with a size above the limit are discard at theend of the entire DFD search. This means that nodes for determinant setsabove the size limit are only visited in order to determine maximality ofnon-dependencies within the size limit. It turns out to be rare that thisresults in a significant speed-up, but it never results in the search havingto visit more nodes than it would without a size limit, so the average searchtime is never made worse.

FDHits implementsdetset_limit more naturally, since it exploresdeterminant sets in increasing set size. Limiting the size is simply a matterof only visiting new nodes if their determinant set is within the given sizelimit.

Value

Afunctional_dependency object, containing thediscovered dependencies. The column names ofdf are stored in theattrs attribute, in order, to serve as a default priority order forthe attributes during normalisation.

References

DFD: Abedjan Z., Schulze P., Naumann F. (2014) DFD: efficient functionaldependency discovery.Proceedings of the 23rd ACM International Conferenceon Conference on Information and Knowledge Management (CIKM '14). New York,U.S.A., 949–958.

FDHits: Bleifuss T., Papenbrock T., Bläsius T., Schirneck M, Naumann F.(2024) Discovering Functional Dependencies through Hitting Set Enumeration.Proc. ACM Manag. Data,2, 1, 43:1–24.

Examples

# simple examplediscover(ChickWeight)# example with spurious dependenciesdiscover(CO2)# exclude attributes that can't be determinants.# in this case, the numeric attributes are now# not determined by anything, because of repeat measurements# with no variable to mark them as such.discover(CO2, exclude_class = "numeric")# include only dependencies with dependants of interest.discover(CO2, dependants = c("Treatment", "uptake"))# approximate dependencies, that hold if we remove up to 20% of the rows# (these need not agree on which rows are removed)discover(CO2, method = "DFD", accuracy = 0.8)

Functional dependency vectors

Description

Creates a set of functional dependencies with length-one dependants.

Usage

functional_dependency(FDs, attrs_order, unique = TRUE)

Arguments

FDs

a list of functional dependencies, in the form of two-elementslists: the first element contains a character vector of all attributes inthe determinant set, and the second element contains the single dependentattribute (dependant).

attrs_order

a character vector, giving the names of all attributes.These need not be present inFDs, but all attributes inFDsmust be present inattrs.

unique

a logical, TRUE by default, for whether to remove duplicatedependencies.

Details

When several sets of functional dependencies are concatenated, theirattrs_order attributes are merged, so as to preserve all of theoriginal attribute orders, if possible. If this is not possible, because theorderings disagree, then the returned value of theattrs_orderattribute is their union instead.

Value

Afunctional_dependency object, containing the list given inFDs, withattrs_order an attribute of the same name.Functional dependencies are returned with their determinant sets sortedaccording to the attribute order inattrs. Any duplicates foundafter sorting are removed.

Examples

fds <- functional_dependency(  list(list(c("a", "b"), "c"), list(character(), "d")),  attrs_order = c("a", "b", "c", "d"))print(fds)detset(fds)dependant(fds)attrs_order(fds)# vector operationsfds2 <- functional_dependency(list(list("e", "a")), c("a", "e"))c(fds, fds2) # attrs_order attributes are mergedunique(c(fds, fds))# subsettingfds[1]fds[c(1, 2, 1)]stopifnot(identical(fds[[2]], fds[2]))# reassignmentfds3 <- fdsfds3[2] <- functional_dependency(list(list("a", "c")), attrs_order(fds3))print(fds3)detset(fds3)[[2]] <- character()dependant(fds3)[[2]] <- "d"stopifnot(identical(fds3, fds))# changing appearance priority for attributesattrs_order(fds3) <- rev(attrs_order(fds3))fds3# reconstructing from componentsfds_recon <- functional_dependency( Map(list, detset(fds), dependant(fds)), attrs_order(fds))stopifnot(identical(fds_recon, fds))# can be a data frame columndata.frame(id = 1:2, fd = fds)# (in)equality ignores headerstopifnot(all(fds3 == fds))stopifnot(!any(fds != fds))

Generate Graphviz input text to plot objects

Description

Produces text input for Graphviz to make an HTML diagram of a given object.

Usage

gv(x, name = NA_character_, ...)

Arguments

x

an object to be plotted.

name

a scalar character, giving the name of the object, if any. Thisname is used for the resulting graph, to allow for easier combining ofgraphs into a single diagram if required.

...

further arguments passed to or from other methods.

Details

Details of what is plotted are given in individual methods. There areexpected commonalities, which are described below.

The object is expected to be one of the following:

an object whose elements have the same length. Examples would bedata frames, matrices, and other objects that can represent relations, withnames for the elements, and an optional name for the object itself.
a graph of sub-objects, each of which represent a relation asdescribed above, possibly with connections between the objects, and anoptional name for the graph as a whole.

Each relation is presented as a record-like shape, with the following elements:

A optional header with the relation's name, and the number of (unique)records.
A set of rows, one for each attribute in the relation. These rowshave the following contents:
- the attribute names.
- a depiction of the relation's (candidate) keys. Eachcolumn represents a key, and a filled cell indicates that the attributein that row is in that key. The keys are given in lexical order, withprecedence given to keys with fewer attributes, and keys with attributesthat appear earlier in the original data frame's attribute order. Defaultoutput from other package functions will thus have the primary key givenfirst. In the future, this will be changed to always give the primary keyfirst.
- optionally, the attribute types: specifically, the first elementwhen passing the attribute's values intoclass.

Any foreign key references between relations are represented by one-way arrows,one per attribute in the foreign key.

If the object has a name, this name is attached to the resulting graph inGraphviz. This is to allow easier combination of several such graphs into asingle image, if a user wishes to do so.

Value

A scalar character, containing text input for Graphviz.

Examples

# simple data.frame exampletxt_df <- gv(ChickWeight, "chick")cat(txt_df)if (requireNamespace("DiagrammeR", quietly = TRUE)) {  DiagrammeR::grViz(txt_df)}# simple database exampledb <- autodb(ChickWeight)txt_db <- gv(db)cat(txt_db)if (requireNamespace("DiagrammeR", quietly = TRUE)) {  DiagrammeR::grViz(txt_db)}# simple relation schemasrschema <- synthesise(discover(ChickWeight))txt_rschema <- gv(rschema)cat(txt_rschema)if (requireNamespace("DiagrammeR", quietly = TRUE)) {  DiagrammeR::grViz(txt_rschema)}# simple database schemadschema <- normalise(discover(ChickWeight))txt_dschema <- gv(dschema)cat(txt_dschema)DiagrammeR::grViz(txt_dschema)# simple relationsrel <- create(synthesise(discover(ChickWeight)))txt_rel <- gv(rel)cat(txt_rel)if (requireNamespace("DiagrammeR", quietly = TRUE)) {  DiagrammeR::grViz(txt_rel)}

Generate Graphviz input text to plot a data frame

Description

Produces text input for Graphviz to make an HTML diagram of a given dataframe.

Usage

## S3 method for class 'data.frame'gv(x, name = NA_character_, ...)

Arguments

x

a data.frame.

name

a character scalar, giving the name of the record, if any. Thename must be non-empty, since it is also used to name the single table inthe plot. Defaults toNA: if left missing, it is set to "data".

...

further arguments passed to or from other methods.

Details

The rows in the plotted data frame include information about the attributeclasses.

Value

A scalar character, containing text input for Graphviz.

Generate Graphviz input text to plot databases

Description

Produces text input for Graphviz to make an HTML diagram of a given database.

Usage

## S3 method for class 'database'gv(x, name = NA_character_, ...)

Arguments

x

a database, as returned byautoref orautodb.

name

a scalar character, giving the name of the database, if any. Thisname is used for the resulting graph, to allow for easier combining ofgraphs into a single diagram if required.

...

further arguments passed to or from other methods.

Details

Each relation in the database is presented as a set of rows, one for eachattribute in the relation. These rows include information about the attributeclasses.

Value

A scalar character, containing text input for Graphviz.

Generate Graphviz input text to plot database schemas

Description

Produces text input for Graphviz to make an HTML diagram of a given databaseschema.

Usage

## S3 method for class 'database_schema'gv(x, name = NA_character_, ...)

Arguments

x

a database schema, as given bynormalise,synthesise, orautoref.

name

a character scalar, giving the name of the schema, if any.

...

further arguments passed to or from other methods.

Details

Each relation in the schema is presented as a set of rows, one for eachattribute in the relation. These rows do not include information about theattribute classes.

Any foreign key references are represented by arrowsbetween the attribute pairs.

Value

A scalar character, containing text input for Graphviz.

Generate Graphviz input text to plot relations

Description

Produces text input for Graphviz to make an HTML diagram of a given relation.

Usage

## S3 method for class 'relation'gv(x, name = NA_character_, ...)

Arguments

x

arelation.

name

a character scalar, giving the name of the schema, if any.

...

further arguments passed to or from other methods.

Details

Each relation is presented as a set of rows, one for eachattribute in the relation. These rows include information about theattribute classes.

Value

A scalar character, containing text input for Graphviz.

Generate Graphviz input text to plot relation schemas

Description

Produces text input for Graphviz to make an HTML diagram of a given relationschema.

Usage

## S3 method for class 'relation_schema'gv(x, name = NA_character_, ...)

Arguments

x

a relation schema, as given byrelation_schema orsynthesise.

name

a character scalar, giving the name of the schema, if any.

...

further arguments passed to or from other methods.

Details

Each relation in the schema is presented as a set of rows, one for eachattribute in the relation. These rows do not include information about theattribute classes.

Value

A scalar character, containing text input for Graphviz.

Insert data

Description

Generic function for inserting a data frame of data into an object.

Usage

insert(  x,  vals,  relations = names(x),  all = FALSE,  keep_rownames = FALSE,  digits = getOption("digits"),  ...)

Arguments

x

a relational data object, into which to insert data, such as arelation ordatabase object.

vals

a data frame, containing data to insert. Column names must beunique.

relations

a character vector, containing names of elements ofxinto which to insert data. By default,insert attempts to insertdata into every element.

all

a logical, indicating whethervals is required to containall attributes of all elements ofx[relations]. By default, it isnot, and data is only inserted into elements ofx[relations] whoseattributes are all present invals.

keep_rownames

digits

...

further arguments pass on to methods.

Details

This function is intended for inserting into an object that is itselfcomprised of data frames, such as arelation or adatabase. The given methods have the following behaviour:

If an empty set of data is inserted, into a non-empty object element,nothing happens.
If an empty set of data is inserted into an empty object element, theresulting element is also empty, but takes on the attribute/column classesof the inserted data. This is done to prevent having to know attributeclasses during object creation.
Insertion can fail if inserting would violate object constraints. Forexample, databases cannot have data inserted that would violatecandidate/foreign key constraints.
For other cases, the data is inserted in an object element in thesame way as usingrbind, followed byunique.

While key violations prevent insertion, re-insertion of existing records inan object element does not. This makes insertion equivalent to anINSERT ORIGNORE expression in SQL. In particular, it is somewhat like using thisexpression in SQLite, since that implementation uses dynamic typing.

Ifvals contains attributes not included inattrs_order(x),insert throws an error, since thoseattributes can't be inserted.

If a partial set of attributes is inserted, andall isFALSE,then data is only inserted into components ofx[relations] whoserequired attributes are all present invals. Ifall isTRUE,insert returns an error instead. This is useful whenspecifyingrelations: in that case, you often intend to insertinto all of the specified elements, so not including all the requiredattributes is a mistake, andall = TRUE prevents it.

Ifall isTRUE,insertthrows an error in this case: This ensures you insert into all members of aspecified value ofrelations.

Value

An R object of the same class asx, containing the additionalnew data.

Relational data keys

Description

Generic function, with the only given method fetching candidate key lists forrelation schemas.

Usage

keys(x, ...)keys(x, ...) <- value

Arguments

x

a relational schema object, such as arelation_schemaordatabase_schema object, or a relational data object, suchas arelation ordatabase object.

...

further arguments passed on to methods.

value

A list of lists of character vectors, of the same length askeys(x, ...). The number of keys for an element ofx can bechanged.

Value

A list containing lists of unique character vectors, representingcandidate keys for each element ofx.

Merge relation schemas with empty keys

Description

Merges an object's schemas with empty keys. The remaining such schemacontains all attributes contained in such schemas.

Usage

merge_empty_keys(x)

Arguments

x

a relational schema object, such as arelation_schemaordatabase_schema object.

Details

This function is not itself generic, but makes use of the generic functionskeys andmerge_schemas. Any input class withvalid methods for these generic functions can be passed into this function.

Fordatabase_schema objects, references involving theschemas with empty keys are updated to refer to the merged schema.

Value

An R object of the same class asx, where relations with anempty key have been merged into a single relation.

Merge relation schemas in given pairs

Description

Generic function that merges pairs of an object's schemas with matching setsof keys. The remaining schemas contain all the attributes from the schemasmerged into them.

Usage

merge_schemas(x, to_remove, merge_into, ...)

Arguments

x

a relational schema object, such as arelation_schemaordatabase_schema object.

to_remove

an integer vector, giving the indices for schemas to bemerged into other schemas, then removed.

merge_into

an integer vector of the same length asto_remove,giving the indices for the schemas into which to merge.

...

further arguments passed on to methods.

Value

An R object of the same class asx, where the relations havebeen merged as indicated.

Examples

rs <- relation_schema(  list(    a = list(c("a", "b"), list("a")),    b = list(c("b", "c"), list("b")),    b.1 = list(c("b", "d"), list("b")),    d = list(c("d", "e"), list("d", "e"))  ),  letters[1:5])ds <- database_schema(  rs,  list(    list("a", "b", "b", "b"),    list("b.1", "d", "d", "d")   ))merge_schemas(rs, 3, 2) # merging b and b.1merge_schemas(ds, 3, 2) # also merging their references# merging a schema into itself just removes itmerge_schemas(rs, 3, 3)merge_schemas(ds, 3, 3)

Create normalised database schemas from functional dependencies

Description

Creates a database schema from given functional dependencies, satisfying atleast third normal form, using Bernstein's synthesis.

Usage

normalise(  dependencies,  single_ref = FALSE,  ensure_lossless = TRUE,  reduce_attributes = TRUE,  remove_avoidable = FALSE,  constants_name = "constants",  progress = FALSE,  progress_file = "")

Arguments

dependencies

afunctional_dependency object, as given bydiscover.

single_ref

ensure_lossless

a logical, TRUE by default. If TRUE, and thedecomposition isn't lossless, an extra relation is added to make thedecomposition lossless.

reduce_attributes

a logical, TRUE by default. If TRUE,dependencies are checked for determinant attributes that are maderedundant by the other dependencies. This is redundant ifdependencies is output fromdiscover, since there will be nosuch redundant attributes.

remove_avoidable

constants_name

progress

a logical, for whether to display progress to the user duringdependency search indiscover.

progress_file

a scalar character or a connection. Ifprogressis non-zero, determines where the progress is written to, in the same wayas thefile argument forcat.

Details

This is a wrapper function for applyingsynthesise andautoref, in order. For creating relation schemas and foreignkey references separately, use these functions directly. See both functionsfor examples.

For details on the synthesis algorithm used, seesynthesise.

Value

Adatabase_schema object, containing the synthesisrelation schemas and the created foreign key references.

Nudge meta-analysis data

Description

Data used for a meta-analysis on the effectiveness of nudges, i.e. choicearchitecture interventions.

Usage

nudge

Format

A data frame with 447 effect size measurements and 25 columns:

publication_id, integer ID number for the publication. Note thattwo publications were erroneously assigned the same ID number, so thisis not a unique publication identifier.
study_id, integer ID number for the study.
es_id, integer ID number for the effect size measured.
reference, publication citation in "Author(s) (year)" format. Dueto two publications being assigned the same reference, this is also not aunique publication identifier.
title, title of the publication. Due to the error in assigningpublication ID numbers, this is the unique publication identifier withinthe data set.
year, year of the publication.
location, geographical location of the intervention. This is givenas a factor, rather than an integer, using the informationprovided in the codebook.
domain, factor giving the intervention's behavioural domain.
intervention_category, factor giving the intervention's category,based on the taxonomy in Münscher et al. (2016).
intervention_technique, factor giving the intervention's technique,based on the taxonomy in Münscher et al. (2016).
type_experiment, factor giving the type of experiment, as definedby Harrison and List (2004).
population, factor giving the intervention's target population.This is given as a factor, rather than an integer, using the informationprovided in the codebook.
n_study, sample size of the overall study.
n_comparison, combined sample size of the control and theintervention for the measured effect size.
n_control, sample size of the control condition for themeasured effect size.
n_intervention, sample size of the intervention conditionfor the measured effect size.
binary_outcome, logical for whether the outcome scale is binary orcontinuous.
mean_control, mean of outcome for the control condition.
sd_control, SD of outcome for the control condition.
mean_intervention, mean of outcome for the intervention condition.
sd_intervention, SD of outcome for the intervention condition.
cohens_d, extracted effect size of intervention.
variance_d, variance of extracted effect size.
approximation, logical for whether effect size extraction involvedapproximation.
wansink, logical for whether the study was (co-)authored by BrianWansink. This was added on revision, because, a few years beforepublication, Wansink had many papers retracted or corrected, due tovarious questionable practices, resulting in Wansink being determined tohave committed scientific misconduct. This column was added to checkwhether the findings were robust to the exclusion of non-retractedstudies by the Cornell Food and Brand Laboratory, of which Wansink wasthe director.

Source

https://osf.io/fywae/

References

Mertens S., Herberz M., Hahnel U. J. J., Brosch T. (2022) Theeffectiveness of nudging: A meta-analysis of choice architectureinterventions across behavioral domains.Proc. Natl. Acad. Sci. U.S.A.,4,119(1).

Relational data records

Description

Generic function, for retrieving data contained in a database-like structure.In particular, this is intended for such structures where the individualrelations can't be accessed with subsetting.

Usage

records(x, ...)records(x, ...) <- value

Arguments

x

a relational data object, such as arelation ordatabase object.

...

further arguments passed on to methods.

value

A list of data frames of the same length asrecords(x, ...), where each data frame has the same column names as that which itwill replace, in the same order.

Details

Since the relational data objects inautodb,relationanddatabase, have subsetting methods that return relationaldata objects, the data contained within them can't be accessed by subsetting.This function is intended for accessing it instead.

It's recommended to callrecords before doing any subsetting, sincesubsetting on a relation data object does more work that will be thrown away,such as subsetting on adatabase checking whether foreign keyreferences should be removed.

Value

A list containing data frames, with elements named for theirrespective relations.

Examples

db <- autodb(ChickWeight)records(db) # data for Chick and Time_Chick relations# ways to get data for subsetsrecords(db)[c(1, 2)]records(db)[[1]]records(db)$Chick# subsetting first isn't recommended: removes foreign key# reference as mentions, and you need to subset again anywayrecords(db[[1]])[[1]]

Remove relations not linked to the main relations

Description

Filters an object's relations, keeping only the main relations, and thoseconsidered ancestors via foreign key references. Foreign keyreferences involving removed relations are also removed.

Usage

reduce(x, ...)

Arguments

x

An object whose relations are to be filtered.

...

further arguments passed to or from other methods.

Details

Details on how the main tables are chosen are given in individual methods.

This function is mostly intended for simplifying a database, or a databaseschema, for the purposes of exploration, particularly by examining plots.While the filtering might remove important auxiliary relations, it's alsolikely to remove any based on spurious dependencies, of which some databasescan contain many.

Value

An object of the same class asx, with the auxiliary relationsand foreign key references removed.

Remove database relations not linked to the main relations

Description

Filters a database's relations, keeping only the main relations, and thoseconsidered ancestors via foreign key references. Foreignkey references involving removed relations are also removed.

Usage

## S3 method for class 'database'reduce(x, main, ...)

Arguments

x

A database, whose relations are to be filtered.

main

A character vector, containing names of relations to beconsidered as the "main" relations. If missing, taken to be the names ofall relations with the largest record count.

...

further arguments passed to or from other methods.

Details

The main relations are considered to be the relations with the largest numberof records.

Usingrejoin on the database resulting fromreduce islikely to fail or return incomplete results.

Value

A database, with the auxiliary relations and foreign keyreferences removed.

Remove database schema relations not linked to the given relations

Description

Filters a database schema's relations, keeping only the given relations, andthose considered ancestors via foreign key references. Foreign keyreferences involving removed relations are also removed.

Usage

## S3 method for class 'database_schema'reduce(x, main, ...)

Arguments

x

A database schema, whose relations are to be filtered.

main

A character vector, containing names of relations to beconsidered as the "main" relations.

...

further arguments passed to or from other methods.

Details

This method takes a given set of main relations, rather than inferring them.

Usingrejoin on the database resulting from decomposing a dataframe with the reduced schema is likely to fail or return incomplete results.

Value

A database schema, with the auxiliary relations and foreign keyreferences removed.

Schema references

Description

Generic function, returning present (foreign key) references.

Usage

references(x, ...)references(x) <- value

Arguments

x

an R object with references, such as adatabase_schemaordatabase object.

...

further arguments passed on to methods.

value

A list, of the same length asreferences(x, ...).

Value

A list, giving references.

Join a database into a data frame

Description

Rejoins the relations in a database into a single data frame, if possible.This is the inverse of callingautodb, except that the rowsmight be returned in a different order.

Usage

rejoin(database)

Arguments

database

A database containing the data to be rejoined, as returned bydecompose.

Details

The rejoining algorithm might not use all of the given relations: it beginswith the relation with the largest number of records, then joins it with enoughrelations to contain all of the present attributes. This is not limited torelations that the starting relation is linked to by foreign keys, and is notlimited to them either, since in some cases this constraint would make itimpossible to rejoin with all of the present attributes.

Since the algorithm may not use all of the given relations, the algorithm mayignore some types of database inconsistency, where different relations holddata inconsistent with each other. In this case, the rejoining will be lossy.Rejoining the results ofreduce can also be lossy.

Due to the above issues, the algorithm will be changed to use all of therelations in the future.

Not all databases can be represented as a single data frame. A simple exampleis any database where the same attribute name is used for several differencesources of data, since rejoining results in inappropriate merges.

Value

A data frame, containing all information containeddatabase ifit is lossless and self-consistent.

Examples

# simple exampledb <- autodb(ChickWeight)rj <- rejoin(db)rj <- rj[order(as.integer(rownames(rj))), ]all(rj == ChickWeight) # TRUE# showing rejoin() doesn't check for inconsistency:# add another Chick table with the diets swappeddb2 <- db[c(1, 2, 1)]records(db2)[[3]]$Diet <- rev(records(db2)[[3]]$Diet)rj2 <- rejoin(db2)rj2 <- rj2[order(as.integer(rownames(rj2))), ]all(rj2 == ChickWeight) # TRUE

Relation vectors

Description

Creates a set of relation schemas, including the relation's attributes andcandidate keys.

Usage

relation(relations, attrs_order)

Arguments

relations

a named list of relations, in the form of two-element lists:the first element contains a data frame, where the column names are theattributes in the associated schema, and the second element contains a listof character vectors, each representing a candidate key.

attrs_order

a character vector, giving the names of all attributes.These need not be present inschemas, but all attributes inschemas must be present inattrs_order.

Details

Relation vectors are unlikely to be needed by the user directly, since theyare essentiallydatabase objects that can't have foreign keyreferences. They are mostly used to mirror the use of the vector-likerelation_schema class for thedatabase_schemaclass to be a wrapper around. This makes creating adatabasefrom arelation_schema a two-step process, where the two stepscan be done in either order: creation withcreate andinsert, and adding references withdatabase_schema ordatabase.

Duplicate schemas, after ordering by attribute, are allowed, and can beremoved withunique.

When several sets of relation schemas are concatenated, theirattrs_order attributes are merged, so as to preserve all of the originalattribute orders, if possible. If this is not possible, because the orderingsdisagree, then the returned value of theattrs_order attribute is theirunion instead.

Value

Arelation object, containing the list given inrelations, withattrs_order stored in an attribute of thesame name. Relation schemas are returned with their keys' attributes sortedaccording to the attribute order inattrs_order, and the keys thensorted by priority order. Attributes in the data frame are also sorted,first by order of appearance in the sorted keys, then by order inattrs_order for non-prime attributes.

Examples

rels <- relation(  list(    a = list(      df = data.frame(a = logical(), b = logical()),      keys = list("a")    ),    b = list(      df = data.frame(b = logical(), c = logical()),      keys = list("b", "c")    )  ),  attrs_order = c("a", "b", "c", "d"))print(rels)records(rels)attrs(rels)stopifnot(identical(  attrs(rels),  lapply(records(rels), names)))keys(rels)attrs_order(rels)names(rels)# inserting datainsert(rels, data.frame(a = 1L, b = 2L, c = 3L, d = 4L))# data is only inserted into relations where all columns are given...insert(rels, data.frame(a = 1L, b = 2L, c = 3L))# and that are listed in relations argumentinsert(  rels,  data.frame(a = 1L, b = 2L, c = 3L, d = 4L),  relations = "a")# vector operationsrels2 <- relation(  list(    e = list(      df = data.frame(a = logical(), e = logical()),      keys = list("e")    )  ),  attrs_order = c("a", "e"))c(rels, rels2) # attrs_order attributes are mergedunique(c(rels, rels))# subsettingrels[1]rels[c(1, 2, 1)]stopifnot(identical(rels[[1]], rels[1]))# reassignmentrels3 <- relsrels3[2] <- relation(  list(    d = list(      df = data.frame(d = logical(), c = logical()),      keys = list("d")    )  ),  attrs_order(rels3))print(rels3) # note the relation's name doesn't change# names(rels3)[2] <- "d" # this would change the namekeys(rels3)[[2]] <- list(character()) # removing keys first...# for a relation_schema, we could then change the attrs for# the second relation. For a created relation, this is not# allowed.## Not run:   attrs(rels3)[[2]] <- c("b", "c")  names(records(rels3)[[2]]) <- c("b", "c")## End(Not run)# changing appearance priority for attributesrels4 <- relsattrs_order(rels4) <- c("d", "c", "b", "a")print(rels4)# reconstructing from componentsrels_recon <- relation(  Map(list, df = records(rels), keys = keys(rels)),  attrs_order(rels))stopifnot(identical(rels_recon, rels))# can be a data frame columndata.frame(id = 1:2, relation = rels)

Relation schema vectors

Description

Creates a set of relation schemas, including the relation's attributes andcandidate keys.

Usage

relation_schema(schemas, attrs_order)

Arguments

schemas

a named list of schemas, in the form of two-element lists: thefirst element contains a character vector of all attributes in the relationschema, and the second element contains a list of character vectors,each representing a candidate key.

attrs_order

a character vector, giving the names of all attributes.These need not be present inschemas, but all attributes inschemas must be present inattrs_order.

Details

Duplicate schemas, after ordering by attribute, are allowed, and can beremoved with⁠\code{\link{unique}}⁠.

Value

Arelation_schema object, containing the list given inschemas, withattrs_order stored in an attribute of the samename. Relation schemas are returned with their keys' attributes sortedaccording to the attribute order inattrs_order, and the keys thensorted by priority order. Attributes in the schema are also sorted, firstby order of appearance in the sorted keys, then by order inattrs_order for non-prime attributes.

Examples

schemas <- relation_schema(  list(    a = list(c("a", "b"), list("a")),    b = list(c("b", "c"), list("b", "c"))  ),  attrs_order = c("a", "b", "c", "d"))print(schemas)attrs(schemas)keys(schemas)attrs_order(schemas)names(schemas)# vector operationsschemas2 <- relation_schema(  list(    e = list(c("a", "e"), list("e"))  ),  attrs_order = c("a", "e"))c(schemas, schemas2) # attrs_order attributes are mergedunique(c(schemas, schemas))# subsettingschemas[1]schemas[c(1, 2, 1)]stopifnot(identical(schemas[[1]], schemas[1]))# reassignmentschemas3 <- schemasschemas3[2] <- relation_schema(  list(d = list(c("d", "c"), list("d"))),  attrs_order(schemas3))print(schemas3) # note the schema's name doesn't change# names(schemas3)[2] <- "d" # this would change the namekeys(schemas3)[[2]] <- list(character()) # removing keys first...attrs(schemas3)[[2]] <- c("b", "c") # so we can change the attrs legallykeys(schemas3)[[2]] <- list("b", "c") # add the new keysstopifnot(identical(schemas3, schemas))# changing appearance priority for attributesattrs_order(schemas3) <- c("d", "c", "b", "a")print(schemas3)# reconstructing from componentsschemas_recon <- relation_schema(  Map(list, attrs(schemas), keys(schemas)),  attrs_order(schemas))stopifnot(identical(schemas_recon, schemas))# can be a data frame columndata.frame(id = 1:2, schema = schemas)

Rename relational data attributes

Description

Generic function, for renaming attributes present in a database-likestructure.

Usage

rename_attrs(x, names, ...)

Arguments

x

an object with anattrs_order attribute. This includesrelational schema objects, such as arelation_schema ordatabase_schema object, relational data objects, such as arelation ordatabase object, andfunctional_dependency objects.

names

a character vector of the same length asattrs_order(x),with no duplicated elements, to be used as the new attribute names.

...

further arguments passed on to methods.

Details

This function has a different intended use to re-assigningattrs_order: that is intended only for rearranging the order ofthe attributes, without renaming them. This is intended for renaming theattributes without re-ordering them.

Value

A relational object of the same type asx, with attributesrenamed consistently across the whole object.

Database subrelations

Description

Generic function, returning subrelations forx.

Usage

subrelations(x, ...)

Arguments

x

an R object, intended to be some sort of database-like object thatcontains relations, such as adatabase object.

...

further arguments passed on to methods.

Value

A relation-type object, or a list of relation-type objects if thesubrelation isn't vectorised. For example, ifx is adatabase, the result is the containedrelation.

Schema subschemas

Description

Generic function, returning subschemas forx.

Usage

subschemas(x, ...)

Arguments

x

an R object, intended to be some sort of schema that contains otherschemas, such as adatabase_schema object.

...

further arguments passed on to methods.

Value

A schema-type object, or a list of schema-type objects if thesubschema isn't vectorised. For example, ifx is adatabase_schema, the result is the containedrelation_schema.

Synthesise relation schemas from functional dependencies

Description

Synthesises the dependency relationships in dependencies into a databaseschema satisfying at least third normal form, using Bernstein's synthesis.

Usage

synthesise(  dependencies,  ensure_lossless = TRUE,  reduce_attributes = TRUE,  remove_avoidable = FALSE,  constants_name = "constants",  progress = FALSE,  progress_file = "")

Arguments

dependencies

afunctional_dependency object, as given bydiscover.

ensure_lossless

a logical, TRUE by default. If TRUE, and thedecomposition isn't lossless, an extra relation is added to make thedecomposition lossless.

reduce_attributes

remove_avoidable

constants_name

progress

a logical, for whether to display progress to the user duringdependency search indiscover.

progress_file

a scalar character or a connection. Ifprogressis non-zero, determines where the progress is written to, in the same wayas thefile argument forcat.

Details

Bernstein's synthesis is a synthesis algorithm for normalisation of a set ofdependencies into a set of relations that are in third normal form. Thisimplementation is based on the version given in the referenced paper.

The implementation also includes a common additional step, to ensure that theresulting decomposition is lossless, i.e. a relation satisfying the givendependencies can be perfectly reconstructed from the relations given by thedecomposition. This is done by adding an additional relation, containing akey for all the original attributes, if one is not already present.

As an additional optional step, schemas are checked for "avoidable"attributes, that can be removed without loss of information.

Constant attributes, i.e. those whose only determinant set is empty, getassigned to a relation with no keys.

Output is independent of the order of the input dependencies: schemas aresorted according to their simplest keys.

Schemas are sorted before ensuring for losslessness, or removing avoidableattributes. As a result, neither optional step changes the order of theschemas, and ensuring losslessness can only add an extra schema to the end ofthe output vector.

Value

Arelation_schema object, containing the synthesisedrelation schemas.

References

3NF synthesis algorithm: Bernstein P. A. (1976) Synthesizing third normalform relations from functional dependencies.ACM Trans. Database Syst.,1, 4, 277–298.

Removal of avoidable attributes: Ling T., Tompa F. W., Kameda T. (1981) Animproved third normal form for relational databases.ACM Trans. DatabaseSyst.,6, 2, 329–346.

Examples

# example 6.24 from The Theory of Relational Databases by David Maier# A <-> B, AC -> D, AC -> E, BD -> Cdeps <- functional_dependency(  list(    list("A", "B"),    list("B", "A"),    list(c("A", "C"), "D"),    list(c("A", "C"), "E"),    list(c("B", "D"), "C")  ),  attrs_order = c("A", "B", "C", "D", "E"))synthesise(deps, remove_avoidable = FALSE)synthesise(deps, remove_avoidable = TRUE)

Movatterモバイル変換

Database-style normalisation for data.frames

Description

Author(s)

See Also

Relational data attributes

Description

Usage

Arguments

Value

Relational data attribute order

Description

Usage

Arguments

Details

Value

Create a normalised database from a data frame

Description

Usage

Arguments

Details

Value

Examples

Add foreign key references to a normalised database

Description

Usage

Arguments

Details

Value

Examples

Create instance of a schema

Description

Usage

Arguments

Value

Generate D2 input text to plot objects

Description

Usage

Arguments

Details

Value

See Also

Examples

Generate D2 input text to plot a data frame

Description

Usage

Arguments

Details

Value

See Also

Generate D2 input text to plot databases

Description

Usage

Arguments

Details

Value

See Also

Generate D2 input text to plot database schemas

Description

Usage

Arguments

Details

Value

See Also

Generate D2 input text to plot relations

Description

Usage

Arguments

Details

Value

See Also

Generate D2 input text to plot relation schemas

Description

Usage

Arguments

Details

Value

See Also

Databases

Description