```r
library(autodb)
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  show <- function(x) DiagrammeR::grViz(gv(x), width = "100%")
  maybe_plot <- function(x) DiagrammeR::grViz(gv(x), width = "100%")
} else {
  show <- print
  maybe_plot <- function(x) invisible(NULL)
}
```

Included in the package is a 447-by-25 data frame called `nudge`:
```r
knitr::kable(data.frame(
  attribute = names(nudge),
  class = vapply(nudge, \(x) class(x)[[1]], character(1)),
  NAs = vapply(nudge, \(x) sum(is.na(x)), integer(1)),
  row.names = NULL
))
```

| attribute | class | NAs |
|---|---|---|
| publication_id | integer | 0 |
| study_id | integer | 0 |
| es_id | integer | 0 |
| reference | character | 0 |
| title | character | 0 |
| year | integer | 0 |
| location | factor | 0 |
| domain | factor | 0 |
| intervention_category | factor | 0 |
| intervention_technique | factor | 0 |
| type_experiment | factor | 0 |
| population | factor | 0 |
| n_study | integer | 0 |
| n_comparison | integer | 0 |
| n_control | integer | 0 |
| n_intervention | integer | 0 |
| binary_outcome | logical | 0 |
| mean_control | numeric | 13 |
| sd_control | numeric | 286 |
| mean_intervention | numeric | 13 |
| sd_intervention | numeric | 286 |
| cohens_d | numeric | 0 |
| variance_d | numeric | 0 |
| approximation | logical | 0 |
| wansink | logical | 0 |
This is the data set for a meta-analysis, looking at the effectiveness of “nudge” interventions. Like any meta-analysis, the data is strictly hierarchical: publications contain studies, which contain effect size measurements. We expect to see this hierarchy in the normalisation.
The full dependency information for this data is rather large. This results in a relatively large search time for a data set of this size, but it still only takes a few seconds.
```r
nudge_deps_big <- discover(nudge)
nudge_schema_big <- normalise(nudge_deps_big, remove_avoidable = TRUE)
nudge_db_big <- decompose(nudge, nudge_schema_big)
```

However, the resulting schema is hard to make anything out of, with more relations than we’d want to go through manually:
```r
length(nudge_schema_big)
## [1] 473
```

There are a few reasons for this, but the main one is the sheer number of functional dependencies discovered:
```r
length(nudge_deps_big)
## [1] 3732
```

Even if we remove the transitive dependencies, which are implied by the others, we still have 597 dependencies.
This is a rather large set for the simple approach to simplifying a set of FDs: manually looking for spurious cases to remove before normalisation.
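For illustration, manual triage might look like the following sketch. The choice of target here is hypothetical, not taken from the analysis below: we drop every FD whose determinant contains the measurement attribute `sd_control`, on the grounds that a standard deviation shouldn’t co-determine anything else.

```r
# A minimal sketch of manual FD triage; the choice of sd_control as the
# attribute to exclude is illustrative only.
suspicious <- vapply(
  detset(nudge_deps_big),
  \(det) "sd_control" %in% det,
  logical(1)
)
nudge_deps_manual <- nudge_deps_big[!suspicious]
```

Doing this attribute by attribute, across hundreds of dependencies, is clearly impractical, which motivates the bulk approaches below.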
As a hack, we can use `reduce` on the database. This finds the relations with the most records – i.e. the relations with one record for each unique row in the original data – and returns a reduced database, containing only those relations, and relations that they reference, either directly or indirectly.
```r
nudge_reduced_big <- reduce(nudge_db_big)
```

The relations removed by `reduce` are often spurious, but not always: see the reference chains section of the Planned improvements vignette (`vignette("plans", package = "autodb")`) for an example of non-spurious tables not being connected. However, `reduce` at least gives us something smaller to look at.
In this case, the reduced database is still too large to easily review:
```r
length(nudge_reduced_big)
## [1] 124
```

Instead, we get a grip on what’s going on by collecting statistics on the attributes involved in the determinants.
For example, we can see how large the determinants are:
```r
table(lengths(detset(nudge_deps_big)))
## 
##   1   2   3   4   5   6   7 
##  41 545 780 993 926 435  12
```

While there are some large true determinants in real data, a common rule of thumb is that FDs with larger determinants are more likely to be spurious. This doesn’t make for an obvious way to filter the FDs with some hard size limit, though.
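For comparison, such a hard limit would just be a subset on determinant lengths. This is only a sketch of the idea, and we don’t use it at this stage; the `detset_limit` argument used later applies the same idea during the search instead.

```r
# Sketch only: keep FDs whose determinants have at most two attributes.
nudge_deps_small <- nudge_deps_big[lengths(detset(nudge_deps_big)) <= 2]
```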
We can also see how often each attribute appears in a determinant:
```r
sort(table(unlist(detset(nudge_deps_big))), decreasing = TRUE)
## 
##           n_comparison        type_experiment  intervention_category 
##                   1095                   1085                   1061 
##              n_control                 domain         n_intervention 
##                   1029                   1016                    993 
##                   year          approximation intervention_technique 
##                    889                    766                    745 
##               location             variance_d        sd_intervention 
##                    739                    727                    687 
##             sd_control                n_study               cohens_d 
##                    684                    628                    549 
##         binary_outcome           mean_control      mean_intervention 
##                    503                    370                    313 
##             population                wansink         publication_id 
##                    269                    248                    119 
##              reference                  title               study_id 
##                    108                     83                     37 
##                  es_id 
##                     24
```

We can be more specific, and see how often each attribute appears in a determinant of a given size:
```r
sort_by_rowSums <- function(x, ...) x[order(rowSums(x), ...), , drop = FALSE]
level_table <- function(x, levels) table(factor(x, levels))
attrs_table <- function(x) level_table(unlist(x), names(nudge))
by_lengths <- function(x, f) do.call(cbind, tapply(x, lengths(x), f))
by_lengths(detset(nudge_deps_big), attrs_table) |>
  sort_by_rowSums(decreasing = TRUE)
##                         1   2   3   4   5   6  7
## n_comparison            0  73  64 342 395 217  4
## type_experiment         0   6 114 286 376 295  8
## intervention_category   0   2 140 185 389 341  4
## n_control               0  53 148 305 361 162  0
## domain                  0  23 109 335 453  84 12
## n_intervention          0  64 142 283 290 206  8
## year                    0  11 266 408 192  12  0
## approximation           0   0  35 125 317 277 12
## intervention_technique  1  25 124 309 249  37  0
## location                0   4  91 191 259 182 12
## variance_d              0  76 201 241 164  45  0
## sd_intervention         0  15  65 228 227 148  4
## sd_control              0  12  65 228 227 148  4
## n_study                 0  60  81 224 182  77  4
## cohens_d                0 211 258  23  46  11  0
## binary_outcome          0   3  20  74 225 177  4
## mean_control            1  89 196  59  25   0  0
## mean_intervention       1 103 141  46  22   0  0
## population              0   0   7  38 112 104  8
## wansink                 0   0   0  42 119  87  0
## publication_id          0  97  22   0   0   0  0
## reference               1  85  22   0   0   0  0
## title                   3  58  22   0   0   0  0
## study_id               10  20   7   0   0   0  0
## es_id                  24   0   0   0   0   0  0
```

Either way, we can see that there are attributes that we wouldn’t expect to be in a determinant, but often are, especially in larger ones. We don’t expect measurement attributes like `variance_d` and `sd_control` to co-determine anything else, but they often do.
More generally, there are certain data classes we don’t expect to see appear in determinants. For example, numerics (floats) are probably measurements, and shouldn’t co-determine anything else. However, if we count determinant appearances by the attribute’s class, we find that they often appear in determinants:
```r
attr_classes <- vapply(nudge, \(x) class(x)[[1]], character(1))
class_table <- function(x) {
  level_table(attr_classes[unlist(x)], sort(unique(attr_classes)))
}
by_lengths(detset(nudge_deps_big), class_table) |>
  sort_by_rowSums(decreasing = TRUE)
##            1   2   3    4    5    6  7
## factor     1  60 585 1344 1838 1043 44
## integer   34 378 730 1562 1420  674 16
## numeric    2 506 926  825  711  352  8
## logical    0   3  55  241  661  541 16
## character  4 143  44    0    0    0  0
```

This suggests that a simple first step is to remove any FD with a float in the determinant. We can write this as a filter vector:
```r
det_nofloat <- vapply(
  detset(nudge_deps_big),
  \(x) all(attr_classes[x] != "numeric"),
  logical(1)
)
summary(det_nofloat)
##    Mode   FALSE    TRUE 
## logical    2784     948
```

This removes a lot!
If we use the filtered set of FDs, we still get a large schema:
```r
length(normalise(nudge_deps_big[det_nofloat]))
## [1] 176
```

However, if we reduce the resulting database, we get something much more manageable:
```r
nudge_schema_filtered <- normalise(nudge_deps_big[det_nofloat])
nudge_db_filtered <- reduce(decompose(nudge, nudge_schema_filtered))
show(nudge_db_filtered)
```

Let’s look at the resulting relations, in two sets. Here’s the left-hand set:
```r
subsample <- c(
  "es_id",
  "study_id",
  "intervention_technique_n_comparison_n_control",
  "intervention_technique",
  "n_comparison_n_control"
)
show(nudge_db_filtered[subsample])
```

We see two relations, `es_id` and `study_id`, that clearly represent the effect and study levels of the data hierarchy, which is encouraging. They both contain the ID for the next level, too, as they should.
The relation that the effect relation references, with the long name, looks rather arbitrary, but the relations it references look useful. One shows that the intervention technique determines the intervention category, i.e. techniques are subcategories.
Less simply, the other relation contains the three effect sample sizes: comparison size, intervention arm size, and control arm size. Each pair of these sizes determines the third, which is what we’d see if the comparison size is the sum of the other two: knowing any two of the sizes would then fix the remaining one. However, we can’t assume that this is the nature of the relationship. Indeed, we can find a case where it’s false:
```r
knitr::kable(
  subset(
    nudge,
    n_comparison != n_control + n_intervention,
    c(reference, study_id, es_id, n_study, n_comparison, n_control, n_intervention)
  ),
  row.names = FALSE
)
```

| reference | study_id | es_id | n_study | n_comparison | n_control | n_intervention |
|---|---|---|---|---|---|---|
| Hedlin & Sunstein (2016) | 137 | 180 | 1037 | 1037 | 345 | 346 |
| Hedlin & Sunstein (2016) | 137 | 181 | 1037 | 1037 | 345 | 346 |
It turns out to be difficult to identify which results in the relevant paper these are referring to, since the presentation of the results is rather obfuscated. However, it is clear that these are referring to a part of the study with three arms instead of two (actually nine arms aggregated to groups of three). If we account for having two treatment arms, then the arm sizes do sum to the comparison size, and there is still a sum relationship of sorts, just not the simpler one we might expect.
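We can check this reading against the numbers in the table above: with one control arm and two intervention arms of the stated sizes, the arm sizes account for the comparison size exactly.

```r
# One control arm plus two intervention arms, as given in the table above.
345 + 2 * 346
## [1] 1037
```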
In any case, most of the relations in this set seem reasonable. The same can’t be said of the other set, which should represent the publication-level data:
```r
show(nudge_db_filtered[setdiff(names(nudge_db_filtered), subsample)])
```

There is a lot going on here, so let’s look at the last two:
```r
show(nudge_db_filtered[c("title", "reference")])
```

This is our publication-level data, but something is wrong. The publication ID is not a simple key, as we’d expect. Neither is the reference, which we’d expect to be unique, otherwise the bibliography will have duplicate entries. Instead, the only simple key is the publication’s title. To use the ID or the reference as an identifier for publications, we need to use both together as a compound key.
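We can also pull these keys out programmatically. As a quick sketch, using the `keys` accessor that we apply to this schema again later, and assuming its result can be subset by relation name:

```r
# The candidate keys found for the title relation: title alone, and
# publication_id and reference together.
keys(nudge_schema_filtered)[["title"]]
```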
Without having looked at the data itself, these keys tell us that there are duplicate publication IDs and duplicate references. Since the keys for the `title` relation showed us the problem, we can search that relation to find the duplicates, with all the attributes needed for context:
```r
duplicates <- function(x) unique(x[duplicated(x)])
subset_duplicates <- function(x, attr) {
  x[x[[attr]] %in% duplicates(x[[attr]]), , drop = FALSE]
}
nudge_title_relation <- records(nudge_db_filtered)$title
knitr::kable(subset_duplicates(nudge_title_relation, "publication_id"))
```

|  | title | publication_id | reference | year |
|---|---|---|---|---|
| 44 | Enhanced active choice: A new method to motivate behavior change | 95 | Keller et al. (2011) | 2011 |
| 130 | Nudging product choices: The effect of position change on snack bar choice | 95 | Keller et al. (2015) | 2015 |
```r
knitr::kable(subset_duplicates(nudge_title_relation, "reference"))
```

|  | title | publication_id | reference | year |
|---|---|---|---|---|
| 214 | Nudge vs superbugs: A behavioural economics trial to reduce the overprescribing of antibiotics | 18 | BETA (2018) | 2018 |
| 399 | Energy labels that make cents | 19 | BETA (2018) | 2018 |
The publications with the same ID have the same first author in their references; this looks like a simple data entry error.
BETA is the Behavioural Economics Team of the Australian Government, so it’s not surprising that they’d have multiple publications/reports per year. However the references for publications were generated, the possibility of a group publishing multiple relevant papers in a year wasn’t accounted for.
Looking at the keys for the other relations, we see that they often include `publication_id` or `reference`, and the relationships they describe seem rather questionable. These are not worth going into in detail, since we can guess that they stem from the duplication problems.
In a real project, this would be a good point to go back to the data provider with questions, since we now have examples from the data of surprising behaviour.
There are two ideal results when we find surprising behaviour: either the data turns out to be correct, and we’ve learned something about how it was collected, or it turns out to be an error that can be fixed.
In this example, `publication_id` and `reference` are synthetic variables, so we can easily fix the data ourselves:
```r
nudge_fixed <- within(nudge, {
  publication_id[publication_id == 95 & year == 2015] <- max(publication_id) + 1L
  reference[publication_id == 19] <- "BETA (2018a)"
})
```

We can then re-run `autodb` to find the new schema. Note that we use the `exclude_class` argument to exclude FDs with floats in their determinants, instead of removing them ourselves. This prunes them from the search space, speeding up the search.
```r
db_fixed <- autodb(
  nudge_fixed,
  exclude_class = "numeric"
)
length(db_fixed)
## [1] 170
show(reduce(db_fixed))
```

This is an improvement: the publications now have an ID and a reference as simple keys, as they should, and some of the questionable relations have disappeared.
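As a quick sanity check on the fixed data frame itself, outside the autodb pipeline, we can confirm that each publication ID and each reference now pairs with a single title; both of the calls below should return 0. The `pubs` helper is just for this check, not part of the analysis.

```r
# One row per publication: after the fix, neither identifier should repeat.
pubs <- unique(nudge_fixed[, c("publication_id", "reference", "title")])
anyDuplicated(pubs$publication_id)
anyDuplicated(pubs$reference)
```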
Some of the relations are still questionable. If we examine them, we see that they all have large keys, containing three attributes or more. To simplify things, we can decide to remove them. Rather than manually filtering the FDs again, we can do this using another filtering argument for `discover`/`autodb`, `detset_limit`:
```r
db_final <- autodb(
  nudge_fixed,
  exclude_class = "numeric",
  detset_limit = 2
)
length(db_final)
## [1] 11
```

The database is now small enough that we don’t need to reduce it:
```r
show(db_final)
```

However, the relations not referred to by the effect relation also look spurious, so we can reduce the database to remove them:
```r
show(reduce(db_final))
```

This is a great improvement. We are left with only two relations that we wouldn’t expect in general, but that hold for this data set in particular:
- `publication_id_type_experiment` states that studies from the same publication, that use the same type of experiment, also take place in the same location. This is probably not interesting, since the “location” is just a logical value for whether the study was conducted in the USA.
- `publication_id_n_study` states that studies from the same publication, and with the same sample size, have the same location, domain, and target population. This is likely to be a coincidence, and not of much interest either.

Dataset-particular relations like this are useful to check, since they often indicate sampling limitations, which can make our planned statistical analysis infeasible.
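We can check the first of these directly in base R. This is just a sketch of such a check; since `discover` found the dependency, it should return `TRUE`:

```r
# Within each publication and experiment type, there should be only one
# location, if the discovered FD holds exactly.
all(tapply(
  nudge_fixed$location,
  paste(nudge_fixed$publication_id, nudge_fixed$type_experiment),
  \(x) length(unique(x)) == 1
))
```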
We can remove either of these relations easily using `exclude`, since their keys contain attributes that don’t appear in keys elsewhere. If we do, we get a schema that looks reasonable for data of this sort in general, apart from the mentioned issue with the effect sample sizes:
```r
show(reduce(autodb(
  nudge_fixed,
  exclude = c("type_experiment", "n_study"),
  exclude_class = "numeric",
  detset_limit = 2
)))
```

Instead of adding search filters, or removing functional dependencies, we could try to simplify the original schema by removing questionable relations. This section is about why this is a bad idea.
This is the database we had:
```r
show(nudge_db_filtered)
```

What happens if we remove the offending relations from the schema?
```r
nudge_schema_relfiltered <- nudge_schema_filtered[
  vapply(
    keys(nudge_schema_filtered),
    \(ks) all(lengths(ks) <= 2) &&
      sum(c("publication_id", "reference") %in% ks[[1]]) != 1,
    logical(1)
  )
]
show(nudge_schema_relfiltered)
```

The offending relations are removed, but so are any foreign key references involving them. The resulting schema isn’t invalid, but it’s less connected than before.
Calling `autoref` to re-generate the foreign key references doesn’t entirely fix the problem, since, for example, no other relation contains a key for the `title` relation:
```r
show(autoref(nudge_schema_relfiltered))
```

This partly happens because removing relations removes more information than it needs to. If we remove the relevant functional dependencies instead, then we can have new intermediate relations appear in the new schema, whose dependencies were originally redundant by transitivity. Removing relations directly can’t account for this.
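To sketch the FD-level alternative: the rough analogue of the relation filter above is to drop the FDs whose determinants contain exactly one of `publication_id` and `reference`, then normalise again. The filter and the `nudge_schema_fdfiltered` name below are hypothetical, mirroring the relation filter only approximately.

```r
# Keep FDs whose determinants contain neither or both of the two
# publication identifiers, then re-normalise; transitively-implied
# dependencies can now resurface as new intermediate relations.
fd_keep <- vapply(
  detset(nudge_deps_big[det_nofloat]),
  \(det) sum(c("publication_id", "reference") %in% det) != 1,
  logical(1)
)
nudge_schema_fdfiltered <- normalise(nudge_deps_big[det_nofloat][fd_keep])
```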
An alternative approach is to make use of our knowledge that the data is hierarchical, and remove dependencies where an attribute is co-determining an attribute on a lower level. For example, publication-level attributes shouldn’t co-determine study- or measurement-level attributes. This means we won’t see any cases where they do, but we could decide that any such cases are spurious, and not worth spending time on.
```r
hlev <- c(
  publication_id = 3,
  study_id = 2,
  es_id = 1,
  reference = 3,
  title = 3,
  year = 3,
  location = 2,
  domain = 2,
  intervention_category = 2,
  intervention_technique = 2,
  type_experiment = 2,
  population = 2,
  n_study = 2,
  n_comparison = 1,
  n_control = 1,
  n_intervention = 1,
  binary_outcome = 1,
  mean_control = 1,
  sd_control = 1,
  mean_intervention = 1,
  sd_intervention = 1,
  cohens_d = 1,
  variance_d = 1,
  approximation = 1,
  wansink = 2
)
hfilter <- function(fds, hlev) {
  fds[mapply(
    \(det, dep) all(hlev[det] <= hlev[[dep]]),
    detset(fds),
    dependant(fds)
  )]
}
```

Filtered result for the original data:
```r
nudge |>
  discover(exclude_class = "numeric", detset_limit = 2) |>
  hfilter(hlev) |>
  normalise(remove_avoidable = TRUE) |>
  decompose(df = nudge) |>
  reduce() |>
  show()
```

Filtered result for the fixed data:
```r
nudge_fixed |>
  discover(exclude_class = "numeric", detset_limit = 2) |>
  hfilter(hlev) |>
  normalise(remove_avoidable = TRUE) |>
  decompose(df = nudge_fixed) |>
  show()
```