long_panel() andwiden_panel()One of the initial challenges a data analyst is likely to face withpanel data is getting it into a format suitable for analysis. Mostregression analyses for panel data require the data to be inlong format. That means there is a row for each entity(e.g., person) at each time point. If I conducted a 3-wave panel surveyof 300 people, each of whom responded to all 3 waves, the long format ofthese data would have 900 rows (300 respondents x 3 waves).
For example, the following is how long data look, whereid is the identifier for each entity,wave isthe indicator of the time point, andQ1/Q2 aremeasures repeated at each time point.
| id | wave | Q1 | Q2 |
|---|---|---|---|
| 1 | 1 | 1.0 | 5.0 |
| 1 | 2 | 1.5 | 7.5 |
| 1 | 3 | 2.0 | 10.0 |
| 2 | 1 | 5.0 | 14.0 |
| 2 | 2 | 4.0 | 10.5 |
| 2 | 3 | 3.0 | 7.0 |
| 3 | 1 | 15.0 | 8.0 |
| 3 | 2 | 12.0 | 12.0 |
| 3 | 3 | 9.0 | 16.0 |
Wide data, on the other hand, have only one row perentity and a separate column for each measure and time point. The samedata above in wide format look like this:
| id | Q1_W1 | Q1_W2 | Q1_W3 | Q2_W1 | Q2_W2 | Q2_W3 |
|---|---|---|---|---|---|---|
| 1 | 1 | 1.5 | 2 | 5 | 7.5 | 10 |
| 2 | 5 | 4.0 | 3 | 14 | 10.5 | 7 |
| 3 | 15 | 12.0 | 9 | 8 | 12.0 | 16 |
Here you differentiate between waves by looking at the column name,which in this case end in “_W” and then the wave indicator. Someanalyses prefer the data in this format, like structural equationmodels.
panelr considers the native format of panel data to belong and provides thepanel_data class to keepyour data tidy in the long format. Of course, sometimes your raw dataaren’t in long format and need to be “reshaped” from wide to long. Inother cases, you have long format data but need to get it into wideformat for some reason or another.panelr provides tools tohelp with these situations.
There are some other tools, including ones thatpanelruses internally, that can manage these situations. However, they tend tobe some combination of confusing, inflexible, or too general to beeasily used for these purposes by non-experts.
In my experience, survey contractors (i.e., the people you pay tocarry out panel surveys) like to provide the data in wide format. As ageneral rule, the conversion of data from wide to long is much moredifficult than the inverse. When preparing to reshape data from wide tolong format, you’ll need to answer some questions relating to how thecolumn/variable names distinguish the variable name from the timeindicator:
W1_variable has both prefix (W)and suffix (_)).One key assumption is that variables labeled with a pattern such asQ1_W1,Q1_W2, and so on refer to thesamemeasure at different times. I’ve encountered datasets in whichQ1 might refer to a different measure at each time pointand this is not a problem that can be handled in an automated way.
With that warning out of the way, let’s look at a coupleexamples.
Let’s return to the wide data we looked at earlier.
| id | Q1_W1 | Q1_W2 | Q1_W3 | Q2_W1 | Q2_W2 | Q2_W3 |
|---|---|---|---|---|---|---|
| 1 | 1 | 1.5 | 2 | 5 | 7.5 | 10 |
| 2 | 5 | 4.0 | 3 | 14 | 10.5 | 7 |
| 3 | 15 | 12.0 | 9 | 8 | 12.0 | 16 |
Here we can see that the time indicators are at theend ofthe variable names (_W1), time indicators of 1, 2, and 3,and aprefix of_W. With that in mind, we can uselong_panel() to convert the data to long format.
long_panel(wide,prefix ="_W",begin =1,end =3,label_location ="end")| id | wave | Q1 | Q2 |
|---|---|---|---|
| 1 | 1 | 1.0 | 5.0 |
| 1 | 2 | 1.5 | 7.5 |
| 1 | 3 | 2.0 | 10.0 |
| 2 | 1 | 5.0 | 14.0 |
| 2 | 2 | 4.0 | 10.5 |
| 2 | 3 | 3.0 | 7.0 |
| 3 | 1 | 15.0 | 8.0 |
| 3 | 2 | 12.0 | 12.0 |
| 3 | 3 | 9.0 | 16.0 |
Perfect! The first argument,w, was the wide data.prefix is self-explanatory.begin andend refer to the range of the time indicators, since theyare consecutive. You can instead useperiods = c(1, 2, 3)if you prefer. That’s especially true if you have non-consecutive timeindicators like a biannual survey that uses the year as an indicator,likeperiods = c(1990, 1992, 1994).
I should note that base R has a function,reshape() thatcan work in this situation without making you pull your hair out toomuch:
reshape(as.data.frame(wide),sep ="_W",times =c(1,2,3),direction ="long",varying =c("Q1_W1","Q1_W2","Q1_W3","Q2_W1","Q2_W2","Q2_W3"))| id | time | Q1 | Q2 | |
|---|---|---|---|---|
| 1.1 | 1 | 1 | 1.0 | 5.0 |
| 2.1 | 2 | 1 | 5.0 | 14.0 |
| 3.1 | 3 | 1 | 15.0 | 8.0 |
| 1.2 | 1 | 2 | 1.5 | 7.5 |
| 2.2 | 2 | 2 | 4.0 | 10.5 |
| 3.2 | 3 | 2 | 12.0 | 12.0 |
| 1.3 | 1 | 3 | 2.0 | 10.0 |
| 2.3 | 2 | 3 | 3.0 | 7.0 |
| 3.3 | 3 | 3 | 9.0 | 16.0 |
You can see how frustrating that could be if you had many morevariables — it wouldn’t be unusual to have hundreds of columns in thewide format, not all of which would be variables that vary over time(e.g., race). Truth be told,long_panel() usesreshape() internally, but only after a lot of processing.Other options include thereshape2 andtidyrpackages, but they are not purpose-built for the panel setting andtherefore can have a learning curve to avoid having data that end up abittoo long.
Here’s a wide dataset with what is usually a trickier format tohandle due to limitations ofreshape().
| WA_Q1 | WB_Q1 | WC_Q1 | WA_Q2 | WC_Q2 |
|---|---|---|---|---|
| 1 | 1.5 | 2 | 5 | 10 |
| 5 | 4.0 | 3 | 14 | 7 |
| 15 | 12.0 | 9 | 8 | 16 |
Key characteristics:
W) and suffix (_).While you don’t have to recognize this to use the function properly,notice that in this caseQ2 was only measured at times Aand C. This can add considerable difficulty to when trying to reshapedata “by hand.”
long_panel(wide,prefix ="W",suffix ="_",label_location ="beginning",begin ="A",end ="C")| id | wave | Q1 | Q2 |
|---|---|---|---|
| 1 | A | 1.0 | 5 |
| 1 | B | 1.5 | NA |
| 1 | C | 2.0 | 10 |
| 2 | A | 5.0 | 14 |
| 2 | B | 4.0 | NA |
| 2 | C | 3.0 | 7 |
| 3 | A | 15.0 | 8 |
| 3 | B | 12.0 | NA |
| 3 | C | 9.0 | 16 |
Just what we were looking for. Note thatpanel_dataobjects must have an ordered wave variable, butlong_data()understands how to order letters and handles that for you. Themissingness inQ2 is by design, since it wasn’t measured inwave B.
Another issue that can come up is the treatment of constants — thatis, variables that do not change over time. The best wide data shouldcome labeled in a way that makes it clear the constants are constants.For instance, a variable signifying race wouldn’t be calledrace_W1, but instead justrace.long_panel() automatically checks your data for variablesthat are labeled as if they vary over time but actually do not.
For instance, data that start by looking like this:
| id | Q1_W1 | Q1_W2 | Q1_W3 | race_W1 |
|---|---|---|---|---|
| 1 | 1 | 1.5 | 2 | white |
| 2 | 4 | 3.0 | 2 | black |
| 3 | 15 | 12.0 | 9 | white |
Can easily end up shaped like this:
| id | wave | race | Q1 |
|---|---|---|---|
| 1 | 1 | white | 1.0 |
| 1 | 2 | NA | 1.5 |
| 1 | 3 | NA | 2.0 |
| 2 | 1 | black | 4.0 |
| 2 | 2 | NA | 3.0 |
| 2 | 3 | NA | 2.0 |
| 3 | 1 | white | 15.0 |
| 3 | 2 | NA | 12.0 |
| 3 | 3 | NA | 9.0 |
But obviously just because the wide data markedracewith a wave label, that doesn’t mean it was unknown in the other waves.You’ll get the right result withlong_panel():
long_panel(wide,prefix ="_W",label_location ="end",begin =1,end =3)| id | wave | Q1 | race |
|---|---|---|---|
| 1 | 1 | 1.0 | white |
| 1 | 2 | 1.5 | white |
| 1 | 3 | 2.0 | white |
| 2 | 1 | 4.0 | black |
| 2 | 2 | 3.0 | black |
| 2 | 3 | 2.0 | black |
| 3 | 1 | 15.0 | white |
| 3 | 2 | 12.0 | white |
| 3 | 3 | 9.0 | white |
If you have an ID variable in the wide data, you can pass the name ofthat variable tolong_panel() with theidargument, which is"id" by default. If there is no variablewith the name you give toid, one will be created. You canalso choose the name of the wave variable viawave, whichis"wave" by default.
You can also choose not to have the output oflong_panel() be apanel_data object by settingas_panel_data toFALSE.
There are some other options available to you for tougher cases. Forinstance, whenuse.regex isTRUE, thearguments forprefix andsuffix are treated asregular expressions for more complicated patterns.
Internally, time-varying variables are detected by the presence ofprefix, one of the time periods, andsuffix inthe variable name. The “root” variable without the indicator is whateveris left. Sometimes, though, this can cause false matches. Here’s anexample I have encountered. My wide data looked like this:
| CaseID | Consent | A1 | B1 | C1 |
|---|---|---|---|---|
| 1 | TRUE | 5 | 4 | 3 |
| 2 | TRUE | 6 | 7 | 8 |
| 3 | TRUE | 10 | 8 | 6 |
My ID variable was calledCaseID and the periods — whichwere A, B, and C — were labeled at thebeginning of the columnnames. Following the earlier examples, this will confuselong_panel():
long_panel(wide,begin ="A",end ="C",label_location ="beginning",id ="CaseID")| CaseID | wave | 1 | onsent |
|---|---|---|---|
| 1 | A | 5 | TRUE |
| 1 | B | 4 | TRUE |
| 1 | C | 3 | TRUE |
| 2 | A | 6 | TRUE |
| 2 | B | 7 | TRUE |
| 2 | C | 8 | TRUE |
| 3 | A | 10 | TRUE |
| 3 | B | 8 | TRUE |
| 3 | C | 6 | TRUE |
See what happened? TheConsent variable in the wide datalooked just like a constant variable that was measured at time point C.This isn’t the end of the world, but errors like this can be moreconfusing and damaging in other scenarios. Fortunately, I knew moreabout the labeling of the time-varying variables than what I toldlong_panel(). Yes, there is A/B/C at the beginning with noprefix/suffix, but also each time-varying item has anumberthat comes after A/B/C.
long_panel() offers the argumentmatch forsituations like these. This is the regular expression used to match andthen capture the variable name sans time indicator. By default,match is".*", meaning any character anynumber of times. To reflect what I know about these data, I change it to"\\d+.*", meaning at least one digit following by anynumber of other characters.
long_panel(wide,begin ="A",end ="C",label_location ="beginning",id ="CaseID",match ="\\d+.*")| CaseID | wave | Consent | 1 |
|---|---|---|---|
| 1 | A | TRUE | 5 |
| 1 | B | TRUE | 4 |
| 1 | C | TRUE | 3 |
| 2 | A | TRUE | 6 |
| 2 | B | TRUE | 7 |
| 2 | C | TRUE | 8 |
| 3 | A | TRUE | 10 |
| 3 | B | TRUE | 8 |
| 3 | C | TRUE | 6 |
Now it rightly ignoresConsent as a variable that lacksa time indicator. In general,long_panel() tries to protectyou from having to use or even know how to use regular expressions, butsometimes there’s no way around it.
widen_panel(), as you might expect, does the opposite oflong_panel(). This is generally an easier operation,thankfully.
widen_panel() expects apanel_data object.If your long data aren’t in that format, it’s easy enough to just passit topanel_data().
To go through an example, let’s take a look at some long data.
| person | time | Q1 | Q2 | race |
|---|---|---|---|---|
| 1 | 1 | 1.0 | 5.0 | white |
| 1 | 2 | 1.5 | 7.5 | white |
| 1 | 3 | 2.0 | 10.0 | white |
| 2 | 1 | 5.0 | 14.0 | black |
| 2 | 2 | 4.0 | 10.5 | black |
| 2 | 3 | 3.0 | 7.0 | black |
| 3 | 1 | 15.0 | 8.0 | white |
| 3 | 2 | 12.0 | 12.0 | white |
| 3 | 3 | 9.0 | 16.0 | white |
Okay, so we have an ID variable (person), wave variable(time), two time-varying variables (Q1 andQ2), and a time-invariant variable (race). Theonly difficulty here conceptually is how to automatically know, withoutthe domain knowledge about the substantive meaning of these variables,which ones vary over time and which don’t. This is simply a matter ofwiden_panel() checking the variance of each (using thepanelr functionare_varying()). Note that invery wide datasets, or those with many individuals, this can take alittle while to happen.
widen_panel(long_data,separator ="_")| person | race | Q1_1 | Q2_1 | Q1_2 | Q2_2 | Q1_3 | Q2_3 |
|---|---|---|---|---|---|---|---|
| 1 | white | 1 | 5 | 1.5 | 7.5 | 2 | 10 |
| 2 | black | 5 | 14 | 4.0 | 10.5 | 3 | 7 |
| 3 | white | 15 | 8 | 12.0 | 12.0 | 9 | 16 |
Pretty much all you need to worry about is how you want to label thewide data. By default theseparator argument is"_".
There are only two other arguments.varying lets youspecify which variables in the long data vary over time. This can saveyou time compared to havingwiden_panel() check them all,but of course requires you to pass those variable names along which canbe more work than it’s worth at times.
ignore.attributes deals with the scenario in which youstarted with wide data, usedlong_panel() to convert tolong format, and now want to convert back to wide format.long_panel() stores information in the data frame aboutwhich variables vary over time so that they don’t have to be checked allover again. If you’ve made changes or think something went wrong, youcan setignore.attributes toTRUE to forcethose checks all over again.