Using autodb

library(autodb)
## 
## Attaching package: 'autodb'
## 
## The following object is masked from 'package:stats':
## 
##     decompose

Terminology

To avoid confusion, this vignette is consistent with terminology: a data frame (table) has records (rows) with the same attributes (columns/variables), and we split it up into a database, which consists of relations (data frames with key and foreign key constraints between them). This is in line with terms used in the relational model.

When we talk about the type of values an attribute can take, we talk about a data class, as in the relational model and R, rather than a value type, as in many other programming languages.
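
For example, a factor attribute in R has class "factor", even though its values are stored as integers:

class(factor(c("a", "b")))
## [1] "factor"
typeof(factor(c("a", "b")))
## [1] "integer"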

Motivation

Let’s expand on the package DESCRIPTION:

Automatic normalisation of a data frame to third normal form, with the intention of easing the process of data cleaning. (Usage to design your actual database for you is not advised.)

Database normalisation

Database normalisation is, roughly, the idea of taking a dataset and splitting it up into several linked tables. This is useful, because it makes the structure of the data more explicit, and enforceable.

As an example, let’s look at theChickWeight dataset,included with base R.

summary(ChickWeight)
##      weight           Time           Chick     Diet
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120
##  Median :103.0   Median :10.00   20     : 12   3:120
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12
##  Max.   :373.0   Max.   :21.00   19     : 12
##                                  (Other):506

This is a simple dataset, but the flat table format obscures some information about the structure of the data. Specifically, chicks have their diet assigned at the beginning of the study, and it’s never changed. To make this more explicit, we could split the data into two separate tables:

measurement <- unique(subset(ChickWeight, , -Diet))
chick <- unique(subset(ChickWeight, , c(Chick, Diet)))
summary(measurement)
##      weight           Time           Chick
##  Min.   : 35.0   Min.   : 0.00   13     : 12
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12
##  Median :103.0   Median :10.00   20     : 12
##  Mean   :121.8   Mean   :10.72   10     : 12
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12
##  Max.   :373.0   Max.   :21.00   19     : 12
##                                  (Other):506
summary(chick)
##      Chick    Diet
##  18     : 1   1:20
##  16     : 1   2:10
##  15     : 1   3:10
##  13     : 1   4:10
##  9      : 1
##  20     : 1
##  (Other):44
stopifnot(
  identical(
    merge(measurement, chick, sort = FALSE)[names(ChickWeight)],
    data.frame(ChickWeight)
  ),
  setequal(measurement$Chick, chick$Chick)
)

We can also add the restriction that each chick can only appear in the chick table once, and that each chick-time pair can only appear in the measurement table once.
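
These uniqueness restrictions can be checked directly in base R; a minimal sketch:

# each chick appears once in chick; each Time-Chick pair once in measurement
stopifnot(
  !anyDuplicated(chick$Chick),
  !anyDuplicated(measurement[, c("Time", "Chick")])
)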

For data management, this is good because it prevents some common types of data error:

  • Consistency errors. In the flat table, we can accidentally add a row where a chick is on a different diet to before. In the split tables, the above restrictions mean that this is impossible.
  • Insertion/deletion errors. Suppose we have a chick that’s been assigned a diet, but hasn’t been measured yet. In the flat table, this information can’t be added, unless we use a temporary row where the time and weight are missing, and hope we remember to remove it later. In the split tables, we can just add the chick to the chick table without issue. Conversely, if we removed all of a chick’s measurements in the flat table, we’d also lose the information about which diet it’s been assigned.
  • Update errors. Suppose we notice that a chick is recorded under the wrong diet, and change values to correct it. In the flat table, a chick’s diet is recorded for every measurement, so we could update only some of them by mistake, resulting in inconsistency. In the split tables, the diet is only given once, so this can’t happen; see the sketch after this list.

These all tie into the same idea: reducing data redundancy so that each fact is only stated in one place.
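
To illustrate the update case: in the split tables, correcting a chick’s diet is a single change in a single place. The chick number and corrected diet below are made up for illustration:

chick_fixed <- chick
chick_fixed$Diet[chick_fixed$Chick == "20"] <- "4"  # hypothetical correction, made once
stopifnot(!anyDuplicated(chick_fixed$Chick))        # the uniqueness restriction still holds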

For data cleaning

All of the above points are also useful for data exploration and validation, and that’s the emphasis of the package. Being able to make the data’s structure more explicit is valuable for making sure that we understand the problem domain before we try to model it. This is true even if we still use the original flat table for the analysis itself, since tools like R generally expect that format.

I’ve often started data exploration under the following conditions:

  • Little prior knowledge of the data, since it’s provided by a third party.
  • One-off analyses, where the data is discarded at the end, so it’s not worth the time to get the data into a proper database.
  • Limited, or unreliable, documentation.
  • Data that fits in local memory, presenting fewer problems for automation.

It’s in this context that I would usually have to do the normalisation by hand, and I would like to have a tool that does it semi-automatically. This package is intended to be that tool, using a minimal number of additional libraries.

For example, we can pass ChickWeight into the main function, autodb, and get the expected normalisation:

chick_db <- autodb(ChickWeight)
chick_db
## database with 2 relations
## 4 attributes: weight, Time, Chick, Diet
## relation Chick: Chick, Diet; 50 records
##   key 1: Chick
## relation Time_Chick: Time, Chick, weight; 578 records
##   key 1: Time, Chick
## references:
## Time_Chick.{Chick} -> Chick.{Chick}

We can also plot it, by using gv to turn the result into code for the Graphviz language, and then calling Graphviz however we prefer.

cat(gv(chick_db))
## digraph {
##   rankdir = "LR"
##   node [shape=plaintext];
## 
##   "Chick" [label = <
##     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
##     <TR><TD COLSPAN="3">Chick (50 records)</TD></TR>
##     <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
##     <TR><TD PORT="TO_diet">Diet</TD><TD></TD><TD PORT="FROM_diet">factor</TD></TR>
##     </TABLE>>];
##   "Time_Chick" [label = <
##     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
##     <TR><TD COLSPAN="3">Time_Chick (578 records)</TD></TR>
##     <TR><TD PORT="TO_time">Time</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_time">numeric</TD></TR>
##     <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
##     <TR><TD PORT="TO_weight">weight</TD><TD></TD><TD PORT="FROM_weight">numeric</TD></TR>
##     </TABLE>>];
## 
##   "Time_Chick":FROM_chick -> "Chick":TO_chick;
## }
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  show <- function(x) DiagrammeR::grViz(gv(x), width = "100%")
  maybe_plot <- function(x) DiagrammeR::grViz(gv(x), width = "100%")
} else {
  show <- print
  maybe_plot <- function(x) invisible(NULL)
}
show(chick_db)

In an interactive session, the simplest way to do this is to call DiagrammeR::grViz on the output from gv. In a Quarto file, we can write the output to a file, and call that file from a dot code block. The advantage of the latter approach is that DiagrammeR creates an HTML widget, and the Quarto block does not, so we avoid needing to include about 2MB of Javascript.
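
For example, the DOT source can be written out for a Quarto dot block with base R alone; the file name here is arbitrary:

writeLines(gv(chick_db), "chick_db.dot")  # then reference chick_db.dot from a dot chunk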

The plot represents each table/relation as a box: the top row gives the relation name and the number of records, and the other rows list the attributes and their classes.

In the middle is a grid containing black cells: each column of black cells represents a key, or uniqueness constraint, for the relation. Keys are irreducible subsets of a relation’s attributes that take a unique set of values in each record. They can be treated as a unique row identifier. For example, each chick has a unique Chick value, and each measurement has a unique Time-Chick pair.

Not for database design

Good database design is a hard and involved process, which requires a strong grasp of the domain that the database will hold information on. Please don’t try to use this package to do it on autopilot. If you do, you’ll quickly find some shortcomings in the results:

  • Since the package can only work with the data it’s got, the resulting database schema is specific to the given dataset, rather than data of that type in general. This is useful for data exploration, since it can give useful information about sampling issues, but not for designing a database.
  • The resulting “databases” might not be something SQL etc. will accept as-is. For example, autodb needs to handle constant attributes, which have the same value in every record. This is done by making a relation with an empty key, which is allowed in the relational model, but SQL won’t let you do it. In a real database, we would either not store constants in the database at all, or place them with the expectation that they won’t stay constant when more data is added.
  • It doesn’t introduce artificial keys, if you like using those, or single-attribute lookups, which are a relational database’s equivalent to factor levels.
  • It doesn’t create “virtual relations”, also known as “views”, which are data-less relations defined in terms of manipulating real relations (usually to create aggregates).
  • If a relation has multiple keys, it doesn’t choose one as a “primary key”. Primary keys are implementation details for databases, not part of the relational model, and they’re not especially relevant for data exploration. However, SQL etc. will ask for one.
  • It can’t account for semantic knowledge, which a person would spot easily.

See the Limitations vignette (vignette("limits", package = "autodb")) for more details.

To third normal form

autodb gets a given data frame to third normal form (3NF): every attribute depends on the whole key(s), and non-key attributes depend on nothing but the key(s). This was mostly chosen because there is an existing algorithm, Bernstein’s synthesis, for normalising to third normal form.

autodb works by searching for functional dependencies in the data, and the highest normal form attainable with functional dependencies is a higher form called Boyce-Codd normal form (BCNF). However, normalising to BCNF requires more complicated infrastructure, and autodb isn’t set up for that yet. This may change in the future.

An additional enhancement to 3NF, called LTK form, is available as an option: see the section on avoidable attributes for more details.

Individual steps

Finding functional dependencies

Having looked at the final result for ChickWeight first, we now look at the individual steps. The first step is to find the (non-trivial and minimal) dependencies present in the given data frame. There are various ways to do this; by default, the package uses FDHitsSep, a depth-first search algorithm. We run this by using the discover function, setting progress to TRUE to see the steps taken:

deps <- discover(ChickWeight, progress = TRUE)
## formatting numerical/complex variables with 7 significant digits
## simplifying data types
## calculating single-attribute PLIs
## sampling difference sets
## 7 initial diffsets
## 
## dependant weight
## dependant Time
## dependant Chick
## dependant Diet
## 
## FDHitsSep complete
## 9 final diffsets
## 10 nodes visited
## 6 partitions cached
deps
## 2 functional dependencies
## 4 attributes: weight, Time, Chick, Diet
## Time, Chick -> weight
##       Chick -> Diet

The result is a list of functional dependencies, in the format determinant -> dependant, with an attribute named attrs_order that gives all the attribute names in their original order. Each of these three parts can be extracted:

detset(deps)
## [[1]]
## [1] "Time"  "Chick"
## 
## [[2]]
## [1] "Chick"
dependant(deps)
## [1] "weight" "Diet"
attrs_order(deps)
## [1] "weight" "Time"   "Chick"  "Diet"

The former two are useful for filtering.
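
For example, we can use them to subset the dependency list, with the same subsetting methods used later in this vignette:

deps[dependant(deps) == "Diet"]  # dependencies with Diet as the dependant
deps[vapply(detset(deps), function(d) "Chick" %in% d, logical(1))]  # Chick in the determinant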

Normalisation

Now that we have a list of discovered dependencies, we can construct a database schema, where the relation schemas are normalised to third normal form. This is done using Bernstein’s synthesis.

schema <- synthesise(deps)
schema
## 2 relation schemas
## 4 attributes: weight, Time, Chick, Diet
## schema Chick: Chick, Diet
##   key 1: Chick
## schema Time_Chick: Time, Chick, weight
##   key 1: Time, Chick

We can also plot this schema:

show(schema)

This is similar to the database plot given before, but it lacks the information that requires the data frame itself: record counts, and data classes. We do have automatically-generated names for the individual relations, which are created using the keys.

At this point, we have no references between the relation schemas, since Bernstein’s synthesis doesn’t supply information about foreign key references. We could use this database schema to build a database, but we’d rather add these foreign key references first.

Adding foreign key references

Getting back to our ChickWeight example, we now have a database schema, consisting of a list of relation schemas. However, we have no information about how these relation schemas are linked to each other. In particular, we have no information about foreign keys. We can add this information using autoref:

linked_schema <- autoref(schema)
linked_schema
## database schema with 2 relation schemas
## 4 attributes: weight, Time, Chick, Diet
## schema Chick: Chick, Diet
##   key 1: Chick
## schema Time_Chick: Time, Chick, weight
##   key 1: Time, Chick
## references:
## Time_Chick.{Chick} -> Chick.{Chick}

We could also have used normalise, instead of synthesise and autoref separately:

normalise(deps)
## database schema with 2 relation schemas
## 4 attributes: weight, Time, Chick, Diet
## schema Chick: Chick, Diet
##   key 1: Chick
## schema Time_Chick: Time, Chick, weight
##   key 1: Time, Chick
## references:
## Time_Chick.{Chick} -> Chick.{Chick}

Plotting this updated database schema shows the same relation schemas as before, linked together by foreign key references:

show(linked_schema)

Decomposing the original relation

Finally, once we have our normalised database schema, we can apply it to our original data frame, or a new one with the same structure. This results in a normalised database, as we got from using autodb:

db <- decompose(ChickWeight, linked_schema)
show(db)
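
The individual data frames can then be extracted for further work. A minimal sketch, assuming the package’s records accessor returns the relations’ data frames as a named list:

str(records(db)$Chick)  # assumed accessor; one data frame per relation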

Rejoining a database back into a data frame

We can reverse the process of turning a data frame into a database with the rejoin function. The result may not be identical to ChickWeight, since the rows may have been rearranged. However, we can use the convenience function df_equiv to check for equivalence under row reordering:

rejoined <- rejoin(db)
summary(rejoined)
##      weight           Time           Chick     Diet
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120
##  Median :103.0   Median :10.00   20     : 12   3:120
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12
##  Max.   :373.0   Max.   :21.00   19     : 12
##                                  (Other):506
identical(rejoined, ChickWeight)
## [1] FALSE
df_equiv(rejoined, ChickWeight)
## [1] TRUE

Tuning detection and normalisation

Let’s look at a different dataset for a moment, to see some cases where we don’t want to use the dependencies as given. We’ll use the Titanic dataset, also provided with base R. This data is in array form, so we first convert it to data frame form:

knitr::kable(as.data.frame(Titanic))
|Class |Sex    |Age   |Survived | Freq|
|:-----|:------|:-----|:--------|----:|
|1st   |Male   |Child |No       |    0|
|2nd   |Male   |Child |No       |    0|
|3rd   |Male   |Child |No       |   35|
|Crew  |Male   |Child |No       |    0|
|1st   |Female |Child |No       |    0|
|2nd   |Female |Child |No       |    0|
|3rd   |Female |Child |No       |   17|
|Crew  |Female |Child |No       |    0|
|1st   |Male   |Adult |No       |  118|
|2nd   |Male   |Adult |No       |  154|
|3rd   |Male   |Adult |No       |  387|
|Crew  |Male   |Adult |No       |  670|
|1st   |Female |Adult |No       |    4|
|2nd   |Female |Adult |No       |   13|
|3rd   |Female |Adult |No       |   89|
|Crew  |Female |Adult |No       |    3|
|1st   |Male   |Child |Yes      |    5|
|2nd   |Male   |Child |Yes      |   11|
|3rd   |Male   |Child |Yes      |   13|
|Crew  |Male   |Child |Yes      |    0|
|1st   |Female |Child |Yes      |    1|
|2nd   |Female |Child |Yes      |   13|
|3rd   |Female |Child |Yes      |   14|
|Crew  |Female |Child |Yes      |    0|
|1st   |Male   |Adult |Yes      |   57|
|2nd   |Male   |Adult |Yes      |   14|
|3rd   |Male   |Adult |Yes      |   75|
|Crew  |Male   |Adult |Yes      |  192|
|1st   |Female |Adult |Yes      |  140|
|2nd   |Female |Adult |Yes      |   80|
|3rd   |Female |Adult |Yes      |   76|
|Crew  |Female |Adult |Yes      |   20|

This is a simple set of data, with a single count observation, Freq, for each combination of the four determining attributes. In other words, the relation is already normalised, so we only expect one relation in the normalised database.
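
We can confirm that the four determining attributes form a key, i.e. that no combination of them appears twice; titanic_df is just a shorthand introduced here:

titanic_df <- as.data.frame(Titanic)
stopifnot(!anyDuplicated(titanic_df[, c("Class", "Sex", "Age", "Survived")]))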

If we use autodb again, we get the following database:

show(autodb(as.data.frame(Titanic)))

Oops! The search found some functional dependencies where the count could be used to determine another attribute. These are clearly spurious: frequency count can’t causally determine age, for example.

There are two ways we can remove these spurious dependencies: limiting the search to not check them in the first place, and removing them from the dependencies before using synthesise/normalise.

To limit the search, we can set certain attributes to not be considered as members of determinants, or we can exclude attributes that inherit from certain classes. In this example, we could exclude Freq from being considered:

titanic_deps_freqonly <- discover(as.data.frame(Titanic), exclude = "Freq")
titanic_deps_freqonly
## 1 functional dependency
## 5 attributes: Class, Sex, Age, Survived, Freq
## Class, Sex, Age, Survived -> Freq

Alternatively, we could exclude all attributes that inherit from“numeric”:

stopifnot(setequal(
  titanic_deps_freqonly,
  discover(as.data.frame(Titanic), exclude_class = "numeric")
))

We can also limit the search when using autodb:

show(autodb(as.data.frame(Titanic), exclude = "Freq"))

Excluding numeric attributes as determinants is often useful, because we expect non-integer numbers to be a measurement that doesn’t co-determine anything else. The main exception is when some attributes are summaries of others.
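
This heuristic can be too blunt, though. Applied to ChickWeight, for example, excluding the “numeric” class would rule out weight as a determinant, as intended, but also Time, so the Time, Chick -> weight dependency would presumably be lost:

discover(ChickWeight, exclude_class = "numeric")  # Time and weight are both numeric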

Alternatively, we can remove the unwanted dependencies. Here are all the found dependencies, if we don’t exclude anything:

titanic_deps <- discover(as.data.frame(Titanic))
titanic_deps
## 3 functional dependencies
## 5 attributes: Class, Sex, Age, Survived, Freq
##       Sex, Survived, Freq -> Age
##     Class, Survived, Freq -> Age
## Class, Sex, Age, Survived -> Freq

We can remove the unwanted dependencies, where Age is the dependant, using subsetting, Filter, etc. For example, here are the dependencies we’d want to drop:

titanic_deps[dependant(titanic_deps) == "Age"]
## 2 functional dependencies
## 5 attributes: Class, Sex, Age, Survived, Freq
##   Sex, Survived, Freq -> Age
## Class, Survived, Freq -> Age

Avoidable attributes

The next normal form after third normal form (3NF) is Boyce-Codd normal form (BCNF). Ensuring BCNF is enforced by the database is trickier: in some cases, it can’t be enforced with just relations and foreign key constraints.

However, the package includes an option to convert to enhanced third normal form, also known as LTK form, which can be so enforced. This enhancement is tangential to BCNF, and could also be used to enhance schemas in BCNF.

In brief, the standard normal forms only put constraints on the attributes present in the relations, considered one relation at a time. The enhancement is a constraint on the attributes present in a relation, while considering their presence in other relations. If an attribute in a relation can be removed, and still be determined from that relation by joining it to others, then the attribute is “avoidable”, and can be removed. If the attribute is in any of the relation’s keys, those keys will be replaced by keys that use the attributes not being removed. This removes attributes from relations without removing any information from the database as a whole.

For example, we can take this simple example from Chapter 6 of The Theory of Relational Databases, by David Maier:

avoid_deps <- functional_dependency(
  list(
    list("A", "B"),
    list("B", "A"),
    list(c("A", "C"), "D"),
    list(c("A", "C"), "E"),
    list(c("B", "D"), "C")
  ),
  attrs_order = c("A", "B", "C", "D", "E")
)
avoid_deps
## 5 functional dependencies
## 5 attributes: A, B, C, D, E
##    A -> B
##    B -> A
## A, C -> D
## A, C -> E
## B, D -> C
normalise(avoid_deps)
## database schema with 2 relation schemas
## 5 attributes: A, B, C, D, E
## schema A: A, B
##   key 1: A
##   key 2: B
## schema A_C: A, C, B, D, E
##   key 1: A, C
##   key 2: B, D
## references:
## A_C.{A} -> A.{A}
## A_C.{B} -> A.{B}
show(normalise(avoid_deps))

Attributes A and B are equivalent, since relation A has them both as a key. In other words, relation A is a simple lookup relation. Because of this, we could remove B from relation A_C, and replace the key B, D with A, D, which is equivalent when accounting for relation A.

We can have this removal of avoidable attributes done automatically, using the remove_avoidable flag for normalise:

normalise(
  avoid_deps,
  remove_avoidable = TRUE
) |>
  show()

This schema is now in LTK form, with no remaining avoidable attributes. We could have also removed A from relation A_C instead of B, so this process may not have a unique result. The package’s implementation prefers to remove attributes that appear later in the original relation.

