Movatterモバイル変換


[0]ホーム

URL:


A larger example: the nudge dataset

library(autodb)
if (requireNamespace("DiagrammeR", quietly = TRUE)) {  show <- function(x) DiagrammeR::grViz(gv(x), width = "100%")  maybe_plot <- function(x) DiagrammeR::grViz(gv(x), width = "100%")}else{  show <- print  maybe_plot <- function(x) invisible(NULL)}

Included in the package is a 447-by-25 data frame callednudge:

knitr::kable(data.frame(  attribute = names(nudge),  class = vapply(nudge, \(x) class(x)[[1]], character(1)),  NAs = vapply(nudge, \(x) sum(is.na(x)), integer(1)),  row.names = NULL))
attributeclassNAs
publication_idinteger0
study_idinteger0
es_idinteger0
referencecharacter0
titlecharacter0
yearinteger0
locationfactor0
domainfactor0
intervention_categoryfactor0
intervention_techniquefactor0
type_experimentfactor0
populationfactor0
n_studyinteger0
n_comparisoninteger0
n_controlinteger0
n_interventioninteger0
binary_outcomelogical0
mean_controlnumeric13
sd_controlnumeric286
mean_interventionnumeric13
sd_interventionnumeric286
cohens_dnumeric0
variance_dnumeric0
approximationlogical0
wansinklogical0

This is the data set for a meta-analysis, looking at theeffectiveness of “nudge” interventions. Like any meta-analysis, the datais strictly hierarchical: publications contain studies, which containeffect size measurements. We expect to see this hierarchy in thenormalisation.

Initial decomposition

The full dependency information for this data is rather large. Thisresults in a relatively large search time for a data set of this size,but it still only takes a few seconds.

nudge_deps_big <- discover(nudge)nudge_schema_big <- normalise(nudge_deps_big, remove_avoidable = TRUE)nudge_db_big <- decompose(nudge, nudge_schema_big)

However, the resulting schema is hard to make anything out of, withmore relations than we’d want to go through manually:

length(nudge_schema_big)
## [1] 473

There are a few reasons for this, but the main one is the sheernumber of functional dependencies discovered:

length(nudge_deps_big)
## [1] 3732

Even if we remove the transitive dependencies, which are implied bythe others, we still have 597 dependencies.

Simplifying the search result

This is a rather large set for the simple approach to simplifying aset of FDs: manually looking for spurious cases to remove beforenormalisation.

As a hack, we can usereduce on the database. This findsthe relations with the most records – i.e. the relations with one recordfor each unique row in the original data – and returns a reduceddatabase, containing only those relations, and relations that theyreference, either directly or indirectly.

nudge_reduced_big <- reduce(nudge_db_big)

The relations removed byreduce are often spurious, butnot always: see the reference chains section of thePlanned improvements vignette(vignette("plans", package = "autodb")) for an example ofnon-spurious tables not being connected. However,reduce atleast gives us something smaller to look at.

In this case, the reduced database is still too large to easilyreview:

length(nudge_reduced_big)
## [1] 124

Instead, we get a grip on what’s going on by collecting statistics onattributes involved in the determinants.

For example, we can also see how large the determinants are:

table(lengths(detset(nudge_deps_big)))
## ##   1   2   3   4   5   6   7 ##  41 545 780 993 926 435  12

While there are some large true determinants in real data, a commonrule of thumb is that FDs with larger determinants are more likely to bespurious. This doesn’t make for an obvious way to filter the FDs withsome hard size limit, though.

We can also see how often each attribute appears in adeterminant:

sort(table(unlist(detset(nudge_deps_big))), decreasing = TRUE)
## ##           n_comparison        type_experiment  intervention_category ##                   1095                   1085                   1061 ##              n_control                 domain         n_intervention ##                   1029                   1016                    993 ##                   year          approximation intervention_technique ##                    889                    766                    745 ##               location             variance_d        sd_intervention ##                    739                    727                    687 ##             sd_control                n_study               cohens_d ##                    684                    628                    549 ##         binary_outcome           mean_control      mean_intervention ##                    503                    370                    313 ##             population                wansink         publication_id ##                    269                    248                    119 ##              reference                  title               study_id ##                    108                     83                     37 ##                  es_id ##                     24

We can be more specific, and see how often each attributes appears ina determinant of a given size:

sort_by_rowSums <- function(x, ...) x[order(rowSums(x), ...), , drop = FALSE]level_table <- function(x, levels) table(factor(x, levels))attrs_table <- function(x) level_table(unlist(x), names(nudge))by_lengths <- function(x, f) do.call(cbind, tapply(x, lengths(x), f))by_lengths(detset(nudge_deps_big), attrs_table) |>  sort_by_rowSums(decreasing = TRUE)
##                         1   2   3   4   5   6  7## n_comparison            0  73  64 342 395 217  4## type_experiment         0   6 114 286 376 295  8## intervention_category   0   2 140 185 389 341  4## n_control               0  53 148 305 361 162  0## domain                  0  23 109 335 453  84 12## n_intervention          0  64 142 283 290 206  8## year                    0  11 266 408 192  12  0## approximation           0   0  35 125 317 277 12## intervention_technique  1  25 124 309 249  37  0## location                0   4  91 191 259 182 12## variance_d              0  76 201 241 164  45  0## sd_intervention         0  15  65 228 227 148  4## sd_control              0  12  65 228 227 148  4## n_study                 0  60  81 224 182  77  4## cohens_d                0 211 258  23  46  11  0## binary_outcome          0   3  20  74 225 177  4## mean_control            1  89 196  59  25   0  0## mean_intervention       1 103 141  46  22   0  0## population              0   0   7  38 112 104  8## wansink                 0   0   0  42 119  87  0## publication_id          0  97  22   0   0   0  0## reference               1  85  22   0   0   0  0## title                   3  58  22   0   0   0  0## study_id               10  20   7   0   0   0  0## es_id                  24   0   0   0   0   0  0

Either way, we can see that there are attributes that we wouldn’texpect to be in a determinant, but often are, especially in larger ones.We don’t expect measurement attributes likevariance_d andsd_control to co-determine anything else, but they oftendo.

More generally, there are certain data classes we don’t expect to seeappear as determinants. For example, numerics (floats) are probablymeasurements, and shouldn’t co-determine anything else. However, if wecount determinant appearances by the attribute’s class, we find thatthey often appear in determinants:

attr_classes <- vapply(nudge, \(x) class(x)[[1]], character(1))class_table <- function(x) {  level_table(attr_classes[unlist(x)], sort(unique(attr_classes)))}by_lengths(detset(nudge_deps_big), class_table) |>  sort_by_rowSums(decreasing = TRUE)
##            1   2   3    4    5    6  7## factor     1  60 585 1344 1838 1043 44## integer   34 378 730 1562 1420  674 16## numeric    2 506 926  825  711  352  8## logical    0   3  55  241  661  541 16## character  4 143  44    0    0    0  0

This suggests that a simple first step is to remove any FD with afloat in the determinant. We can write this as a filter vector:

det_nofloat <- vapply(  detset(nudge_deps_big),  \(x) all(attr_classes[x] != "numeric"),  logical(1))summary(det_nofloat)
##    Mode   FALSE    TRUE ## logical    2784     948

This removes a lot!

If we use the filtered set of FDs, we still get a large schema:

length(normalise(nudge_deps_big[det_nofloat]))
## [1] 176

However, if we reduce the resulting database, we get something muchmore manageable:

nudge_schema_filtered <- normalise(nudge_deps_big[det_nofloat])nudge_db_filtered <- reduce(decompose(nudge, nudge_schema_filtered))
show(nudge_db_filtered)

Let’s look at the resulting relations, in two sets. Here’s theleft-hand set:

subsample <- c(  "es_id",  "study_id",  "intervention_technique_n_comparison_n_control",  "intervention_technique",  "n_comparison_n_control")
show(nudge_db_filtered[subsample])

We see two relations,es_id andstudy_id,that clearly represent the effect and study levels of the datahierarchy, which is encouraging. They both contain the ID for the nextlevel, too, as they should.

The relation that the effect relation references, with the long name,looks rather arbitrary, but the relations it references look useful. Oneshows that the intervention technique determines the interventioncategory, i.e. techniques are subcategories.

Less simply, the other relation contains the three effect samplesizes: comparison size, intervention arm size, and control arm size.Each pair of these sizes determines the other, which is what we’d see ifthe comparison size is the sum of the other two. However, we can’tassume that this is the nature of the relationship. Indeed, we can finda case where it’s false:

knitr::kable(  subset(    nudge,    n_comparison != n_control + n_intervention,    c(reference, study_id, es_id, n_study, n_comparison, n_control, n_intervention)  ),  row.names = FALSE)
referencestudy_ides_idn_studyn_comparisonn_controln_intervention
Hedlin & Sunstein (2016)13718010371037345346
Hedlin & Sunstein (2016)13718110371037345346

It turns out to be difficult to identify which results in therelevant paper these are referring to, since the presentation of theresults is rather obfuscated. However, it is clear that these arereferring to a part of the study with three arms instead of two(actually nine arms aggregated to groups of three). If we account forhaving two treatment arms, then the arm sizes do sum to the comparisonsize, and there is still a sum relationship of sorts, just not thesimpler one we might expect.

In any case, most of these relations in this set seem reasonable. Thesame can’t be said of the other set, which should represent thepublication-level data:

show(nudge_db_filtered[setdiff(names(nudge_db_filtered), subsample)])

There is a lot going on here, so let’s look at the last two:

show(nudge_db_filtered[c("title", "reference")])

This is our publication-level data, but something is wrong. Thepublication ID is not a simple key, as we’d expect. Neither is thereference, which we’d expect to be unique, otherwise the bibliographywill have duplicate entries. Instead, the only simple key is thepublication’s title. To use the ID or the reference as an identifier forpublications, we need to use both together as a compound key.

Without having looked at the data itself, these keys tell us thatthere are duplicate publication IDs and duplicate references. Since thekeys for thetitle relation showed us the problem, we cansearch that relation to find the duplicates, with all the attributesneeded for context:

duplicates <- function(x) unique(x[duplicated(x)])subset_duplicates <- function(x, attr) {  x[x[[attr]] %in% duplicates(x[[attr]]), , drop = FALSE]}
nudge_title_relation <- records(nudge_db_filtered)$titleknitr::kable(subset_duplicates(nudge_title_relation, "publication_id"))
titlepublication_idreferenceyear
44Enhanced active choice: A new method to motivatebehavior change95Keller et al. (2011)2011
130Nudging product choices: The effect of position changeon snack bar choice95Keller et al. (2015)2015
knitr::kable(subset_duplicates(nudge_title_relation, "reference"))
titlepublication_idreferenceyear
214Nudge vs superbugs: A behavioural economics trial toreduce the overprescribing of antibiotics18BETA (2018)2018
399Energy labels that make cents19BETA (2018)2018

The publications with the same ID have the same first author in theirreferences; this looks like a simple data entry error.

BETA is the Behavioural Economics Team of the Australian Government,so it’s not surprising that they’d have multiple publications/reportsper year. However the references for publications were generated, thepossibility of a group publishing multiple relevant papers in a yearwasn’t accounted for.

Looking at the keys for the other relations, we see that they oftenincludepublication_id orreference, and therelationships they describe seem rather questionable. These are notworth going into in detail, since we can guess that they stem from theduplication problems.

In a real project, this would be a good point to go back to the dataprovider with questions, since we now have examples from the data ofsurprising behaviour.

Fixing the data

There are two ideal results when we find surprising behaviour:

In this example,publication_id andreference are synthetic variables, so we can easily fix thedata ourselves:

nudge_fixed <- within(nudge, {  publication_id[publication_id == 95 & year == 2015] <- max(publication_id) + 1L  reference[publication_id == 19] <- "BETA (2018a)"})

We can then re-runautodb to find the new schema. Notethat we use theexclude_class argument to exclude FDs withfloats in their determinants, instead of removing them ourselves. Thisprunes them from the search space, speeding up the search.

db_fixed <- autodb(  nudge_fixed,  exclude_class = "numeric")length(db_fixed)
## [1] 170
show(reduce(db_fixed))

This is an improvement: the publications now have an ID and areference as simple keys, as they should, and some of the questionablerelations have disappeared.

Simplifying further with search filters

Some of the relations are still questionable. If we examine them, wesee that they all have large keys, containing three attributes or more.To simplify things, we can decide to remove them. Rather than manuallyfilter the FDs again, we can do this using another filtering argumentfordiscover/autodb,detset_limit:

db_final <- autodb(  nudge_fixed,  exclude_class = "numeric",  detset_limit = 2)length(db_final)
## [1] 11

The database is now small enough to not need to reduce it:

show(db_final)

However, the relations not referred to by the effect relation lookspurious too, so we can reduce the database to remove them too:

show(reduce(db_final))

This is a great improvement. We are left with only two relations thatwe don’t expect in general, but hold for this dataset in particular:

Dataset-particular relations like this are useful to check, sincethey often indicate sampling limitations, which can make our plannedstatistical analysis infeasible.

We can remove either of these relations easily usingexclude, since their keys contain attributes that don’tappear in keys elsewhere. If we do, we get a schema that looksreasonable for data of this sort in general, apart from the mentionedissue with the effect sample sizes:

show(reduce(autodb(  nudge_fixed,  exclude = c("type_experiment", "n_study"),  exclude_class = "numeric",  detset_limit = 2)))

How not to remove spurious structure

Instead of adding search filters, or removing functionaldependencies, we could try to simplify the original schema by removingquestionable relations. This section is about why this is a badidea.

This is the database we had:

show(nudge_db_filtered)

What happens if we remove the offending relations from theschema?

nudge_schema_relfiltered <- nudge_schema_filtered[  vapply(    keys(nudge_schema_filtered),    \(ks) all(lengths(ks) <= 2) &&      sum(c("publication_id", "reference") %in% ks[[1]]) != 1,    logical(1)  )]
show(nudge_schema_relfiltered)

The offending relations are removed, but so are any foreign keyreferences involving them. The resulting schema isn’t invalid, but it’sless connected than before.

Callingautoref to re-generate the foreign keyreferences doesn’t entirely fix the problem, since, for example, noother relation contains a key for thetitle relation:

show(autoref(nudge_schema_relfiltered))

This partly happens because removing relations removes moreinformation than it needs to. If we remove the relevant functionaldependencies instead, then we can have new intermediate relations appearin the new schema, whose dependencies were originally redundant bytransitivity. Removing relations directly can’t account for this.

Alternative approach: hierarchical limits

An alternative approach is to make use of our knowledge that the datais hierarchical, and remove dependencies where an attribute isco-determining an attribute on a lower level. For example,publication-level attributes shouldn’t co-determine study- ormeasurement-level attributes. This means we don’t find cases where theydo, but we could decide that any such cases are spurious, and we don’twant to spend time on them.

hlev <- c(  publication_id = 3,  study_id = 2,  es_id = 1,  reference = 3,  title = 3,  year = 3,  location = 2,  domain = 2,  intervention_category = 2,  intervention_technique = 2,  type_experiment = 2,  population = 2,  n_study = 2,  n_comparison = 1,  n_control = 1,  n_intervention = 1,  binary_outcome = 1,  mean_control = 1,  sd_control = 1,  mean_intervention = 1,  sd_intervention = 1,  cohens_d = 1,  variance_d = 1,  approximation = 1,  wansink = 2)hfilter <- function(fds, hlev) {  fds[mapply(    \(det, dep) all(hlev[det] <= hlev[[dep]]),    detset(fds),    dependant(fds)  )]}

Filtered result for the original data:

nudge |>  discover(exclude_class = "numeric", detset_limit = 2) |>  hfilter(hlev) |>  normalise(remove_avoidable = TRUE) |>  decompose(df = nudge) |>  reduce() |>  show()

Filtered result for the fixed data:

nudge_fixed |>  discover(exclude_class = "numeric", detset_limit = 2) |>  hfilter(hlev) |>  normalise(remove_avoidable = TRUE) |>  decompose(df = nudge_fixed) |>  show()

[8]ページ先頭

©2009-2025 Movatter.jp