```r
library(autodb)
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  show <- function(x) DiagrammeR::grViz(gv(x), width = "100%")
  maybe_plot <- function(x) DiagrammeR::grViz(gv(x), width = "100%")
} else {
  show <- print
  maybe_plot <- function(x) invisible(NULL)
}
```

Included in the package is a 447-by-25 data frame called `nudge`:
```r
knitr::kable(data.frame(
  attribute = names(nudge),
  class = vapply(nudge, \(x) class(x)[[1]], character(1)),
  NAs = vapply(nudge, \(x) sum(is.na(x)), integer(1)),
  row.names = NULL
))
```

| attribute | class | NAs |
|---|---|---|
| publication_id | integer | 0 |
| study_id | integer | 0 |
| es_id | integer | 0 |
| reference | character | 0 |
| title | character | 0 |
| year | integer | 0 |
| location | factor | 0 |
| domain | factor | 0 |
| intervention_category | factor | 0 |
| intervention_technique | factor | 0 |
| type_experiment | factor | 0 |
| population | factor | 0 |
| n_study | integer | 0 |
| n_comparison | integer | 0 |
| n_control | integer | 0 |
| n_intervention | integer | 0 |
| binary_outcome | logical | 0 |
| mean_control | numeric | 13 |
| sd_control | numeric | 286 |
| mean_intervention | numeric | 13 |
| sd_intervention | numeric | 286 |
| cohens_d | numeric | 0 |
| variance_d | numeric | 0 |
| approximation | logical | 0 |
| wansink | logical | 0 |
This is the data set for a meta-analysis, looking at the effectiveness of “nudge” interventions. Like any meta-analysis, the data is strictly hierarchical: publications contain studies, which contain effect size measurements. We expect to see this hierarchy in the normalisation.
The full dependency information for this data is rather large. This results in a relatively large search time for a data set of this size, but it still only takes a few seconds.
```r
nudge_deps_big <- discover(nudge)
nudge_schema_big <- normalise(nudge_deps_big, remove_avoidable = TRUE)
nudge_db_big <- decompose(nudge, nudge_schema_big)
```

However, the resulting schema is hard to make anything out of, with more relations than we’d want to go through manually:
```r
length(nudge_schema_big)
## [1] 473
```

There are a few reasons for this, but the main one is the sheer number of functional dependencies discovered:
```r
length(nudge_deps_big)
## [1] 3732
```

Even if we remove the transitive dependencies, which are implied by the others, we still have 597 dependencies.
This is a rather large set for the simple approach to simplifying a set of FDs: manually looking for spurious cases to remove before normalisation.
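For illustration, manual triage might look like the following sketch. The choice of target here is hypothetical, not taken from the analysis below: we drop every FD whose determinant contains the measurement attribute `sd_control`, on the grounds that a standard deviation shouldn’t co-determine anything else.

```r
# A minimal sketch of manual FD triage; the choice of sd_control as the
# attribute to exclude is illustrative only.
suspicious <- vapply(
  detset(nudge_deps_big),
  \(det) "sd_control" %in% det,
  logical(1)
)
nudge_deps_manual <- nudge_deps_big[!suspicious]
```

Doing this attribute by attribute, across hundreds of dependencies, is clearly impractical, which motivates the bulk approaches below.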
As a hack, we can use `reduce` on the database. This finds the relations with the most records – i.e. the relations with one record for each unique row in the original data – and returns a reduced database, containing only those relations, and relations that they reference, either directly or indirectly.
```r
nudge_reduced_big <- reduce(nudge_db_big)
```

The relations removed by `reduce` are often spurious, but not always: see the reference chains section of the Planned improvements vignette (`vignette("plans", package = "autodb")`) for an example of non-spurious tables not being connected. However, `reduce` at least gives us something smaller to look at.
In this case, the reduced database is still too large to easily review:
```r
length(nudge_reduced_big)
## [1] 124
```

Instead, we get a grip on what’s going on by collecting statistics on the attributes involved in the determinants.
For example, we can see how large the determinants are:
```r
table(lengths(detset(nudge_deps_big)))
## 
##   1   2   3   4   5   6   7 
##  41 545 780 993 926 435  12
```

While there are some large true determinants in real data, a common rule of thumb is that FDs with larger determinants are more likely to be spurious. This doesn’t make for an obvious way to filter the FDs with some hard size limit, though.
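For comparison, such a hard limit would just be a subset on determinant lengths. This is only a sketch of the idea, and we don’t use it at this stage; the `detset_limit` argument used later applies the same idea during the search instead.

```r
# Sketch only: keep FDs whose determinants have at most two attributes.
nudge_deps_small <- nudge_deps_big[lengths(detset(nudge_deps_big)) <= 2]
```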
We can also see how often each attribute appears in a determinant:
```r
sort(table(unlist(detset(nudge_deps_big))), decreasing = TRUE)
## 
##           n_comparison        type_experiment  intervention_category 
##                   1095                   1085                   1061 
##              n_control                 domain         n_intervention 
##                   1029                   1016                    993 
##                   year          approximation intervention_technique 
##                    889                    766                    745 
##               location             variance_d        sd_intervention 
##                    739                    727                    687 
##             sd_control                n_study               cohens_d 
##                    684                    628                    549 
##         binary_outcome           mean_control      mean_intervention 
##                    503                    370                    313 
##             population                wansink         publication_id 
##                    269                    248                    119 
##              reference                  title               study_id 
##                    108                     83                     37 
##                  es_id 
##                     24
```

We can be more specific, and see how often each attribute appears in a determinant of a given size:
```r
sort_by_rowSums <- function(x, ...) x[order(rowSums(x), ...), , drop = FALSE]
level_table <- function(x, levels) table(factor(x, levels))
attrs_table <- function(x) level_table(unlist(x), names(nudge))
by_lengths <- function(x, f) do.call(cbind, tapply(x, lengths(x), f))
by_lengths(detset(nudge_deps_big), attrs_table) |>
  sort_by_rowSums(decreasing = TRUE)
##                         1   2   3   4   5   6  7
## n_comparison            0  73  64 342 395 217  4
## type_experiment         0   6 114 286 376 295  8
## intervention_category   0   2 140 185 389 341  4
## n_control               0  53 148 305 361 162  0
## domain                  0  23 109 335 453  84 12
## n_intervention          0  64 142 283 290 206  8
## year                    0  11 266 408 192  12  0
## approximation           0   0  35 125 317 277 12
## intervention_technique  1  25 124 309 249  37  0
## location                0   4  91 191 259 182 12
## variance_d              0  76 201 241 164  45  0
## sd_intervention         0  15  65 228 227 148  4
## sd_control              0  12  65 228 227 148  4
## n_study                 0  60  81 224 182  77  4
## cohens_d                0 211 258  23  46  11  0
## binary_outcome          0   3  20  74 225 177  4
## mean_control            1  89 196  59  25   0  0
## mean_intervention       1 103 141  46  22   0  0
## population              0   0   7  38 112 104  8
## wansink                 0   0   0  42 119  87  0
## publication_id          0  97  22   0   0   0  0
## reference               1  85  22   0   0   0  0
## title                   3  58  22   0   0   0  0
## study_id               10  20   7   0   0   0  0
## es_id                  24   0   0   0   0   0  0
```

Either way, we can see that there are attributes that we wouldn’t expect to be in a determinant, but often are, especially in larger ones. We don’t expect measurement attributes like `variance_d` and `sd_control` to co-determine anything else, but they often do.
More generally, there are certain data classes we don’t expect to see appear in determinants. For example, numerics (floats) are probably measurements, and shouldn’t co-determine anything else. However, if we count determinant appearances by the attribute’s class, we find that they often appear in determinants:
```r
attr_classes <- vapply(nudge, \(x) class(x)[[1]], character(1))
class_table <- function(x) {
  level_table(attr_classes[unlist(x)], sort(unique(attr_classes)))
}
by_lengths(detset(nudge_deps_big), class_table) |>
  sort_by_rowSums(decreasing = TRUE)
##            1   2   3    4    5    6  7
## factor     1  60 585 1344 1838 1043 44
## integer   34 378 730 1562 1420  674 16
## numeric    2 506 926  825  711  352  8
## logical    0   3  55  241  661  541 16
## character  4 143  44    0    0    0  0
```

This suggests that a simple first step is to remove any FD with a float in the determinant. We can write this as a filter vector:
```r
det_nofloat <- vapply(
  detset(nudge_deps_big),
  \(x) all(attr_classes[x] != "numeric"),
  logical(1)
)
summary(det_nofloat)
##    Mode   FALSE    TRUE 
## logical    2784     948
```

This removes a lot!
If we use the filtered set of FDs, we still get a large schema:
```r
length(normalise(nudge_deps_big[det_nofloat]))
## [1] 176
```

However, if we reduce the resulting database, we get something much more manageable:
```r
nudge_schema_filtered <- normalise(nudge_deps_big[det_nofloat])
nudge_db_filtered <- reduce(decompose(nudge, nudge_schema_filtered))
show(nudge_db_filtered)
```

Let’s look at the resulting relations, in two sets. Here’s the left-hand set:
```r
subsample <- c(
  "es_id",
  "study_id",
  "intervention_technique_n_comparison_n_control",
  "intervention_technique",
  "n_comparison_n_control"
)
show(nudge_db_filtered[subsample])
```

We see two relations, `es_id` and `study_id`, that clearly represent the effect and study levels of the data hierarchy, which is encouraging. They both contain the ID for the next level, too, as they should.
The relation that the effect relation references, with the long name, looks rather arbitrary, but the relations it references look useful. One shows that the intervention technique determines the intervention category, i.e. techniques are subcategories.
Less simply, the other relation contains the three effect sample sizes: comparison size, intervention arm size, and control arm size. Each pair of these sizes determines the third, which is what we’d see if the comparison size is the sum of the other two: knowing any two of the sizes would then fix the remaining one. However, we can’t assume that this is the nature of the relationship. Indeed, we can find a case where it’s false:
```r
knitr::kable(
  subset(
    nudge,
    n_comparison != n_control + n_intervention,
    c(reference, study_id, es_id, n_study, n_comparison, n_control, n_intervention)
  ),
  row.names = FALSE
)
```

| reference | study_id | es_id | n_study | n_comparison | n_control | n_intervention |
|---|---|---|---|---|---|---|
| Hedlin & Sunstein (2016) | 137 | 180 | 1037 | 1037 | 345 | 346 |
| Hedlin & Sunstein (2016) | 137 | 181 | 1037 | 1037 | 345 | 346 |
It turns out to be difficult to identify which results in the relevant paper these are referring to, since the presentation of the results is rather obfuscated. However, it is clear that these are referring to a part of the study with three arms instead of two (actually nine arms aggregated to groups of three). If we account for having two treatment arms, then the arm sizes do sum to the comparison size, and there is still a sum relationship of sorts, just not the simpler one we might expect.
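We can check this reading against the numbers in the table above: with one control arm and two intervention arms of the stated sizes, the arm sizes account for the comparison size exactly.

```r
# One control arm plus two intervention arms, as given in the table above.
345 + 2 * 346
## [1] 1037
```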
In any case, most of the relations in this set seem reasonable. The same can’t be said of the other set, which should represent the publication-level data:
```r
show(nudge_db_filtered[setdiff(names(nudge_db_filtered), subsample)])
```

There is a lot going on here, so let’s look at the last two:
```r
show(nudge_db_filtered[c("title", "reference")])
```

This is our publication-level data, but something is wrong. The publication ID is not a simple key, as we’d expect. Neither is the reference, which we’d expect to be unique, otherwise the bibliography will have duplicate entries. Instead, the only simple key is the publication’s title. To use the ID or the reference as an identifier for publications, we need to use both together as a compound key.
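We can also pull these keys out programmatically. As a quick sketch, using the `keys` accessor that we apply to this schema again later, and assuming its result can be subset by relation name:

```r
# The candidate keys found for the title relation: title alone, and
# publication_id and reference together.
keys(nudge_schema_filtered)[["title"]]
```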
Without having looked at the data itself, these keys tell us that there are duplicate publication IDs and duplicate references. Since the keys for the `title` relation showed us the problem, we can search that relation to find the duplicates, with all the attributes needed for context:
```r
duplicates <- function(x) unique(x[duplicated(x)])
subset_duplicates <- function(x, attr) {
  x[x[[attr]] %in% duplicates(x[[attr]]), , drop = FALSE]
}
nudge_title_relation <- records(nudge_db_filtered)$title
knitr::kable(subset_duplicates(nudge_title_relation, "publication_id"))
```

|  | title | publication_id | reference | year |
|---|---|---|---|---|
| 44 | Enhanced active choice: A new method to motivate behavior change | 95 | Keller et al. (2011) | 2011 |
| 130 | Nudging product choices: The effect of position change on snack bar choice | 95 | Keller et al. (2015) | 2015 |
```r
knitr::kable(subset_duplicates(nudge_title_relation, "reference"))
```

|  | title | publication_id | reference | year |
|---|---|---|---|---|
| 214 | Nudge vs superbugs: A behavioural economics trial to reduce the overprescribing of antibiotics | 18 | BETA (2018) | 2018 |
| 399 | Energy labels that make cents | 19 | BETA (2018) | 2018 |
The publications with the same ID have the same first author in their references; this looks like a simple data entry error.
BETA is the Behavioural Economics Team of the Australian Government, so it’s not surprising that they’d have multiple publications/reports per year. However the references for publications were generated, the possibility of a group publishing multiple relevant papers in a year wasn’t accounted for.
Looking at the keys for the other relations, we see that they often include `publication_id` or `reference`, and the relationships they describe seem rather questionable. These are not worth going into in detail, since we can guess that they stem from the duplication problems.
In a real project, this would be a good point to go back to the data provider with questions, since we now have examples from the data of surprising behaviour.
There are two ideal results when we find surprising behaviour: either the data turns out to be correct, and we’ve learned something about how it was collected, or it turns out to be an error that can be fixed.
In this example, `publication_id` and `reference` are synthetic variables, so we can easily fix the data ourselves:
```r
nudge_fixed <- within(nudge, {
  publication_id[publication_id == 95 & year == 2015] <- max(publication_id) + 1L
  reference[publication_id == 19] <- "BETA (2018a)"
})
```

We can then re-run `autodb` to find the new schema. Note that we use the `exclude_class` argument to exclude FDs with floats in their determinants, instead of removing them ourselves. This prunes them from the search space, speeding up the search.
```r
db_fixed <- autodb(
  nudge_fixed,
  exclude_class = "numeric"
)
length(db_fixed)
## [1] 170
show(reduce(db_fixed))
```

This is an improvement: the publications now have an ID and a reference as simple keys, as they should, and some of the questionable relations have disappeared.
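As a quick sanity check on the fixed data frame itself, outside the autodb pipeline, we can confirm that each publication ID and each reference now pairs with a single title; both of the calls below should return 0. The `pubs` helper is just for this check, not part of the analysis.

```r
# One row per publication: after the fix, neither identifier should repeat.
pubs <- unique(nudge_fixed[, c("publication_id", "reference", "title")])
anyDuplicated(pubs$publication_id)
anyDuplicated(pubs$reference)
```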
Some of the relations are still questionable. If we examine them, we see that they all have large keys, containing three attributes or more. To simplify things, we can decide to remove them. Rather than manually filtering the FDs again, we can do this using another filtering argument for `discover`/`autodb`, `detset_limit`:
```r
db_final <- autodb(
  nudge_fixed,
  exclude_class = "numeric",
  detset_limit = 2
)
length(db_final)
## [1] 11
```

The database is now small enough that we don’t need to reduce it:
```r
show(db_final)
```

However, the relations not referred to by the effect relation also look spurious, so we can reduce the database to remove them:
```r
show(reduce(db_final))
```

This is a great improvement. We are left with only two relations that we wouldn’t expect in general, but that hold for this data set in particular:
- `publication_id_type_experiment` states that studies from the same publication, that use the same type of experiment, also take place in the same location. This is probably not interesting, since the “location” is just a logical value for whether the study was conducted in the USA.
- `publication_id_n_study` states that studies from the same publication, and with the same sample size, have the same location, domain, and target population. This is likely to be a coincidence, and not of much interest either.

Dataset-particular relations like this are useful to check, since they often indicate sampling limitations, which can make our planned statistical analysis infeasible.
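We can check the first of these directly in base R. This is just a sketch of such a check; since `discover` found the dependency, it should return `TRUE`:

```r
# Within each publication and experiment type, there should be only one
# location, if the discovered FD holds exactly.
all(tapply(
  nudge_fixed$location,
  paste(nudge_fixed$publication_id, nudge_fixed$type_experiment),
  \(x) length(unique(x)) == 1
))
```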
We can remove either of these relations easily using `exclude`, since their keys contain attributes that don’t appear in keys elsewhere. If we do, we get a schema that looks reasonable for data of this sort in general, apart from the mentioned issue with the effect sample sizes:
```r
show(reduce(autodb(
  nudge_fixed,
  exclude = c("type_experiment", "n_study"),
  exclude_class = "numeric",
  detset_limit = 2
)))
```

Instead of adding search filters, or removing functional dependencies, we could try to simplify the original schema by removing questionable relations. This section is about why this is a bad idea.
This is the database we had:
```r
show(nudge_db_filtered)
```

What happens if we remove the offending relations from the schema?
```r
nudge_schema_relfiltered <- nudge_schema_filtered[
  vapply(
    keys(nudge_schema_filtered),
    \(ks) all(lengths(ks) <= 2) &&
      sum(c("publication_id", "reference") %in% ks[[1]]) != 1,
    logical(1)
  )
]
show(nudge_schema_relfiltered)
```

The offending relations are removed, but so are any foreign key references involving them. The resulting schema isn’t invalid, but it’s less connected than before.
Calling `autoref` to re-generate the foreign key references doesn’t entirely fix the problem, since, for example, no other relation contains a key for the `title` relation:
```r
show(autoref(nudge_schema_relfiltered))
```

This partly happens because removing relations removes more information than it needs to. If we remove the relevant functional dependencies instead, then we can have new intermediate relations appear in the new schema, whose dependencies were originally redundant by transitivity. Removing relations directly can’t account for this.
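To sketch the FD-level alternative: the rough analogue of the relation filter above is to drop the FDs whose determinants contain exactly one of `publication_id` and `reference`, then normalise again. The filter and the `nudge_schema_fdfiltered` name below are hypothetical, mirroring the relation filter only approximately.

```r
# Keep FDs whose determinants contain neither or both of the two
# publication identifiers, then re-normalise; transitively-implied
# dependencies can now resurface as new intermediate relations.
fd_keep <- vapply(
  detset(nudge_deps_big[det_nofloat]),
  \(det) sum(c("publication_id", "reference") %in% det) != 1,
  logical(1)
)
nudge_schema_fdfiltered <- normalise(nudge_deps_big[det_nofloat][fd_keep])
```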
An alternative approach is to make use of our knowledge that the data is hierarchical, and remove dependencies where an attribute is co-determining an attribute on a lower level. For example, publication-level attributes shouldn’t co-determine study- or measurement-level attributes. This means we won’t see any cases where they do, but we could decide that any such cases are spurious, and not worth spending time on.
```r
hlev <- c(
  publication_id = 3,
  study_id = 2,
  es_id = 1,
  reference = 3,
  title = 3,
  year = 3,
  location = 2,
  domain = 2,
  intervention_category = 2,
  intervention_technique = 2,
  type_experiment = 2,
  population = 2,
  n_study = 2,
  n_comparison = 1,
  n_control = 1,
  n_intervention = 1,
  binary_outcome = 1,
  mean_control = 1,
  sd_control = 1,
  mean_intervention = 1,
  sd_intervention = 1,
  cohens_d = 1,
  variance_d = 1,
  approximation = 1,
  wansink = 2
)
hfilter <- function(fds, hlev) {
  fds[mapply(
    \(det, dep) all(hlev[det] <= hlev[[dep]]),
    detset(fds),
    dependant(fds)
  )]
}
```

Filtered result for the original data:
```r
nudge |>
  discover(exclude_class = "numeric", detset_limit = 2) |>
  hfilter(hlev) |>
  normalise(remove_avoidable = TRUE) |>
  decompose(df = nudge) |>
  reduce() |>
  show()
```

Filtered result for the fixed data:
```r
nudge_fixed |>
  discover(exclude_class = "numeric", detset_limit = 2) |>
  hfilter(hlev) |>
  normalise(remove_avoidable = TRUE) |>
  decompose(df = nudge_fixed) |>
  show()
```