351

I'd like to take data of the form

    before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
      attr          type
    1    1   foo_and_bar
    2   30 foo_and_bar_2
    3    4   foo_and_bar
    4    6 foo_and_bar_2

and use split() on the column "type" from above to get something like this:

      attr type_1 type_2
    1    1    foo    bar
    2   30    foo  bar_2
    3    4    foo    bar
    4    6    foo  bar_2

I came up with something unbelievably complex involving some form of apply that worked, but I've since misplaced that. It seemed far too complicated to be the best way. I can use strsplit as below, but then it's unclear how to get that back into 2 columns in the data frame.

    > strsplit(as.character(before$type),'_and_')
    [[1]]
    [1] "foo" "bar"

    [[2]]
    [1] "foo"   "bar_2"

    [[3]]
    [1] "foo" "bar"

    [[4]]
    [1] "foo"   "bar_2"

Thanks for any pointers. I've not quite grokked R lists just yet.

David Arenburg's user avatar
David Arenburg
asked Dec 3, 2010 at 22:29
jkebinger's user avatar

18 Answers

363

Use stringr::str_split_fixed

    library(stringr)
    str_split_fixed(before$type, "_and_", 2)
answered Dec 4, 2010 at 4:21
hadley's user avatar

8 Comments

This worked fine for my problem today as well, but it was adding a 'c' at the beginning of each row. Any idea why that is? left_right <- str_split_fixed(as.character(split_df),'\">',2)
I would like to split on a pattern that contains "...", but when I apply that function it returns nothing. What could be the problem? My type is something like "test...score".
@user3841581 - old query of yours I know, but this is covered in the documentation - str_split_fixed("aaa...bbb", fixed("..."), 2) works fine with fixed() to "Match a fixed string" in the pattern= argument. A . means 'any character' in regex.
@Martin That c() around the regex is unnecessary and confusing. It suggests there could be more than one pattern and I doubt that is the case.
I think the comment from @Martin above explains my issue. The char I needed to split my field on was a "|". The accepted answer didn't work for me until I added fixed("|")
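The comments above all come down to the split pattern being read as a regular expression, where . matches any character and | means alternation. Here is a minimal base R sketch of the same idea, assuming base strsplit() with fixed = TRUE as a stand-in for stringr's fixed(); the input strings are made up for illustration.

```r
# Made-up inputs whose separators are regex metacharacters
dotted <- "test...score"
piped  <- "left|right"

# fixed = TRUE treats the separator as a literal string, not a regex
dots_split <- strsplit(dotted, "...", fixed = TRUE)[[1]]
pipe_split <- strsplit(piped, "|", fixed = TRUE)[[1]]

print(dots_split)
print(pipe_split)
```

Escaping the metacharacters (for example "\\.\\.\\." or "\\|") should be an equivalent alternative when a regex is still wanted for other parts of the pattern.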
287
Answer recommended by R Language Collective

You can use the tidyr package.

    before <- data.frame(
      attr = c(1, 30, 4, 6),
      type = c('foo_and_bar', 'foo_and_bar_2')
    )

    library(tidyr)

    before |>
      separate_wider_delim(type, delim = "_and_", names = c("foo", "bar"))
    # # A tibble: 4 × 3
    #    attr foo   bar
    #   <dbl> <chr> <chr>
    # 1     1 foo   bar
    # 2    30 foo   bar_2
    # 3     4 foo   bar
    # 4     6 foo   bar_2

(Or using older versions of tidyr)

    before %>%
      separate(type, c("foo", "bar"), "_and_")
    ##   attr foo   bar
    ## 1    1 foo   bar
    ## 2   30 foo bar_2
    ## 3    4 foo   bar
    ## 4    6 foo bar_2
Gregor Thomas's user avatar
Gregor Thomas
answered Jun 11, 2014 at 16:50
hadley's user avatar

3 Comments

Is there a way to limit the number of splits with separate? Let's say I want to split on '_' only once (or do it with str_split_fixed and add the columns to the existing data frame)?
@hadley How about if I want to split based on the second _? I want the values as foo_and, bar/bar_2?
tidyr::separate has been superseded by tidyr::separate_wider_delim.
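For the question above about splitting only at the second underscore, one possible approach is a regex with two capture groups. This is a hedged base R sketch using strcapture() from utils (R 3.4.0 or later), not something the tidyr answer itself proposes; the column names left and right are hypothetical.

```r
# Example values taken from the question
type <- c("foo_and_bar", "foo_and_bar_2")

# Group 1: everything before the second "_"
# Group 2: everything after it
parts <- strcapture(
  "^([^_]*_[^_]*)_(.*)$",
  type,
  data.frame(left = character(), right = character())
)

print(parts)
```

Under the same assumptions, splitting only once at the first underscore would use the pattern "^([^_]*)_(.*)$" instead.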
95

5 years later, adding the obligatory data.table solution

    library(data.table) ## v 1.9.6+
    setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_")]
    before
    #    attr          type type1 type2
    # 1:    1   foo_and_bar   foo   bar
    # 2:   30 foo_and_bar_2   foo bar_2
    # 3:    4   foo_and_bar   foo   bar
    # 4:    6 foo_and_bar_2   foo bar_2

We could also both make sure that the resulting columns will have correct types and improve performance by adding the type.convert and fixed arguments (since "_and_" isn't really a regex)

    setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_", type.convert = TRUE, fixed = TRUE)]
answered Oct 14, 2015 at 14:14
David Arenburg's user avatar

3 Comments

If the number of '_and_' patterns varies, you can find the maximum number of pieces (i.e. future columns) with max(lengths(strsplit(before$type, '_and_')))
This is my favorite answer, works very well! Could you please explain how it works? Why transpose(strsplit(…)), and isn't paste0 for concatenating strings, not splitting them?
@Gecko I'm not sure what the question is. If you just use strsplit it creates a single list with 2 values in each slot, so tstrsplit transposes it into 2 vectors, each holding one value per row. paste0 is just used to create the column names; it is not used on the values. On the LHS of the equation are the column names, on the RHS is the split + transpose operation on the column. := stands for "assign in place", hence you don't see the <- assignment operator there.
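The "split, then transpose" idea described in the comment above can be illustrated in base R. This is only a conceptual sketch of what tstrsplit() achieves, not data.table's actual implementation, and it assumes every row splits into exactly two pieces.

```r
type <- c("foo_and_bar", "foo_and_bar_2", "foo_and_bar", "foo_and_bar_2")

# strsplit() returns one list element per row, each holding both pieces
pieces <- strsplit(type, "_and_", fixed = TRUE)

# "Transpose": pull the i-th piece out of every row to build column i
columns <- lapply(1:2, function(i) vapply(pieces, `[`, character(1), i))
names(columns) <- paste0("type", 1:2)

print(columns)
```

The result is a list of two equal-length vectors, which is the shape that can be assigned to two new columns at once.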
80

Yet another approach: use rbind on out:

    before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
    out <- strsplit(as.character(before$type),'_and_')
    do.call(rbind, out)

         [,1]  [,2]
    [1,] "foo" "bar"
    [2,] "foo" "bar_2"
    [3,] "foo" "bar"
    [4,] "foo" "bar_2"

And to combine:

data.frame(before$attr, do.call(rbind, out))
answered Dec 4, 2010 at 0:51
Aniko's user avatar

2 Comments

Another alternative on newer R versions is strcapture("(.*)_and_(.*)", as.character(before$type), data.frame(type_1 = "", type_2 = ""))
This is the correct solution. Simple and doesn't require third-party packages.
46

Notice that sapply with "[" can be used to extract either the first or second items in those lists so:

    before$type_1 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 1)
    before$type_2 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 2)
    before$type <- NULL

And here's a gsub method:

    before$type_1 <- gsub("_and_.+$", "", before$type)
    before$type_2 <- gsub("^.+_and_", "", before$type)
    before$type <- NULL
Stedy's user avatar
Stedy
answered Dec 3, 2010 at 23:35
IRTFM's user avatar

Comments

38

Here is a one-liner along the same lines as Aniko's solution, but using Hadley's stringr package:

do.call(rbind, stringr::str_split(before$type, '_and_'))
Friede's user avatar
Friede
answered Dec 4, 2010 at 2:09
Ramnath's user avatar

2 Comments

Good catch, best solution for me. Though a bit slower than with the stringr package.
Did this function get renamed to strsplit()?
31

To add to the options, you could also use my splitstackshape::cSplit function like this:

    library(splitstackshape)
    cSplit(before, "type", "_and_")
    #    attr type_1 type_2
    # 1:    1    foo    bar
    # 2:   30    foo  bar_2
    # 3:    4    foo    bar
    # 4:    6    foo  bar_2
David Arenburg's user avatar
David Arenburg
answered Sep 27, 2014 at 15:46
A5C1D2H2I1M1N2O1R2T1's user avatar

3 Comments

3 years later - this option is working best for a similar problem I have - however the dataframe I am working with has 54 columns and I need to split all of them into two. Is there a way to do this using this method - short of typing out the above command 54 times? Many thanks, Nicki.
@Nicki, Have you tried providing a vector of the column names or the column positions? That should do it....
It wasn't just renaming the columns - I needed to literally split the columns as above, effectively doubling the number of columns in my df. The below was what I used in the end: df2 <- cSplit(df1, splitCols = 1:54, "/")
31

The subject is almost exhausted, but I'd like to offer a solution to a slightly more general version where you don't know the number of output columns a priori. So for example you have

    before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2', 'foo_and_bar_2_and_bar_3', 'foo_and_bar'))
      attr                    type
    1    1             foo_and_bar
    2   30           foo_and_bar_2
    3    4 foo_and_bar_2_and_bar_3
    4    6             foo_and_bar

We can't use tidyr's separate() because we don't know the number of result columns before the split, so I have created a function that uses stringr to split a column, given the pattern and a name prefix for the generated columns. I hope the coding patterns used are correct.

    library(stringr)
    library(tibble)

    split_into_multiple <- function(column, pattern = ", ", into_prefix) {
      cols <- str_split_fixed(column, pattern, n = Inf)
      # Replace the ""s that str_split_fixed() uses to pad the matrix on the
      # right with NAs, which are more useful
      cols[cols == ""] <- NA
      # Name the columns 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m',
      # where m = number of columns of 'cols'
      colnames(cols) <- paste(into_prefix, seq_len(ncol(cols)), sep = "_")
      # as_tibble() replaces the deprecated as.tibble()
      as_tibble(cols)
    }

We can then use split_into_multiple in a dplyr pipe as follows:

    after <- before %>%
      bind_cols(split_into_multiple(.$type, "_and_", "type")) %>%
      # selecting those that start with 'type_' will remove the original 'type' column
      select(attr, starts_with("type_"))

    > after
      attr type_1 type_2 type_3
    1    1    foo    bar   <NA>
    2   30    foo  bar_2   <NA>
    3    4    foo  bar_2  bar_3
    4    6    foo    bar   <NA>

And then we can use gather to tidy up...

    after %>%
      gather(key, val, -attr, na.rm = T)

       attr    key   val
    1     1 type_1   foo
    2    30 type_1   foo
    3     4 type_1   foo
    4     6 type_1   foo
    5     1 type_2   bar
    6    30 type_2 bar_2
    7     4 type_2 bar_2
    8     6 type_2   bar
    11    4 type_3 bar_3
answered Nov 1, 2017 at 17:26
Yannis P.'s user avatar

2 Comments

This is very useful. Years later, I am wondering if it is possible to introduce colnames that can be inserted in a for loop. For example, I want to apply split_into_multiple to 10 columns (or more) and I don't want to split them one column at a time. I want the resulting split columns to bind together. I can't find a way to programmatically select the column names (it gives me an error); the option !!as.name() says there's no such attribute... How would you do it?
I am glad you still find it useful @MEC. My R knowledge has abandoned me and I don't know how to hint you with this part. I think Python pandas might offer something, but it needs some homework...
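The unknown-number-of-columns case from this answer can also be sketched in base R alone. This is an illustrative alternative rather than the answer's stringr/dplyr method; it relies on the fact that indexing past the end of a vector yields NA, and the names pieces, padded and split_cols are hypothetical.

```r
type <- c("foo_and_bar", "foo_and_bar_2",
          "foo_and_bar_2_and_bar_3", "foo_and_bar")

pieces <- strsplit(type, "_and_", fixed = TRUE)
n_cols <- max(lengths(pieces))  # the widest row decides the column count

# Indexing past the end of a vector yields NA, which pads the short rows
padded <- lapply(pieces, function(p) p[seq_len(n_cols)])
split_cols <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(split_cols) <- paste("type", seq_len(n_cols), sep = "_")

print(split_cols)
```

The resulting data frame can then be attached to the remaining columns with cbind().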
19

An easy way is to use sapply() and the [ function:

    before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
    out <- strsplit(as.character(before$type),'_and_')

For example:

    > data.frame(t(sapply(out, `[`)))
       X1    X2
    1 foo   bar
    2 foo bar_2
    3 foo   bar
    4 foo bar_2

sapply()'s result is a matrix and needs transposing and casting back to a data frame. It is then some simple manipulations that yield the result you wanted:

    after <- with(before, data.frame(attr = attr))
    after <- cbind(after, data.frame(t(sapply(out, `[`))))
    names(after)[2:3] <- paste("type", 1:2, sep = "_")

At this point, after is what you wanted

    > after
      attr type_1 type_2
    1    1    foo    bar
    2   30    foo  bar_2
    3    4    foo    bar
    4    6    foo  bar_2
answered Dec 3, 2010 at 23:36
Gavin Simpson's user avatar

Comments

12

Since R version 3.4.0 you can use strcapture() from the utils package (included with base R installs), binding the output onto the other column(s).

    out <- strcapture(
        "(.*)_and_(.*)",
        as.character(before$type),
        data.frame(type_1 = character(), type_2 = character())
    )

    cbind(before["attr"], out)
    #   attr type_1 type_2
    # 1    1    foo    bar
    # 2   30    foo  bar_2
    # 3    4    foo    bar
    # 4    6    foo  bar_2
answered Aug 28, 2017 at 19:15
Rich Scriven's user avatar

1 Comment

This is the most coherent 'built for purpose'.
10

Here is a base R one-liner that overlaps a number of previous solutions, but returns a data.frame with the proper names.

    out <- setNames(data.frame(before$attr,
                      do.call(rbind, strsplit(as.character(before$type),
                                              split="_and_"))),
                      c("attr", paste0("type_", 1:2)))
    out
      attr type_1 type_2
    1    1    foo    bar
    2   30    foo  bar_2
    3    4    foo    bar
    4    6    foo  bar_2

It uses strsplit to break up the variable, and data.frame with do.call/rbind to put the data back into a data.frame. The additional incremental improvement is the use of setNames to add variable names to the data.frame.

answered Jul 22, 2016 at 20:34
lmo's user avatar

Comments

8

This question is pretty old, but I'll add the solution I found to be the simplest at present.

    library(reshape2)
    before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
    newColNames <- c("type1", "type2")
    newCols <- colsplit(before$type, "_and_", newColNames)
    after <- cbind(before, newCols)
    after$type <- NULL
    after
answered Sep 28, 2017 at 20:14
Swifty McSwifterton's user avatar

1 Comment

This is by far the easiest when it comes to managing df vectors
7

Base R, but probably slow:

    n <- 1
    for(i in strsplit(as.character(before$type),'_and_')){
         before[n, 'type_1'] <- i[[1]]
         before[n, 'type_2'] <- i[[2]]
         n <- n + 1
    }
    ##   attr          type type_1 type_2
    ## 1    1   foo_and_bar    foo    bar
    ## 2   30 foo_and_bar_2    foo  bar_2
    ## 3    4   foo_and_bar    foo    bar
    ## 4    6 foo_and_bar_2    foo  bar_2
answered Feb 17, 2018 at 3:44
jpmorris's user avatar

Comments

6

Another approach if you want to stick with strsplit() is to use the unlist() command. Here's a solution along those lines.

    tmp <- matrix(unlist(strsplit(as.character(before$type), '_and_')), ncol=2,
       byrow=TRUE)
    after <- cbind(before$attr, as.data.frame(tmp))
    names(after) <- c("attr", "type_1", "type_2")
answered Dec 3, 2010 at 23:52
ashaw's user avatar

Comments

3

Since this question was asked, separate has been superseded by the separate_longer_* and separate_wider_* functions.

The way to do it now is:

    library(tidyr)
    separate_wider_delim(before, type, delim = "_and_", names_sep = "_")

You could also use separate_wider_regex, but I'll leave that as an exercise to the reader :-)

answered Jul 21, 2023 at 11:09
Mark's user avatar

Comments

1

Here is another base R solution. We can use read.table, but since it accepts only a one-byte sep argument and here we have a multi-byte separator, we can use gsub to replace the multi-byte separator with any one-byte separator and pass that as the sep argument to read.table

    cbind(before[1], read.table(text = gsub('_and_', '\t', before$type),
                     sep = "\t", col.names = paste0("type_", 1:2)))
    #  attr type_1 type_2
    #1    1    foo    bar
    #2   30    foo  bar_2
    #3    4    foo    bar
    #4    6    foo  bar_2

In this case, we can also make it shorter by replacing the separator with a space, which the default sep argument (whitespace) already handles, so we don't have to mention it explicitly

    cbind(before[1], read.table(text = gsub('_and_', ' ', before$type),
                     col.names = paste0("type_", 1:2)))
answered Jan 4, 2020 at 2:22
Ronak Shah's user avatar

Comments

1

Surprisingly, another tidyverse solution is still missing - you can also use tidyr::extract, with a regex.

    library(tidyr)
    before <- data.frame(attr = c(1, 30, 4, 6), type = c("foo_and_bar", "foo_and_bar_2"))
    ## regex - getting all characters except an underscore till the first underscore,
    ## inspired by Akrun https://stackoverflow.com/a/49752920/7941188
    extract(before, col = type, into = paste0("type", 1:2), regex = "(^[^_]*)_(.*)")
    #>   attr type1     type2
    #> 1    1   foo   and_bar
    #> 2   30   foo and_bar_2
    #> 3    4   foo   and_bar
    #> 4    6   foo and_bar_2
answered Feb 9, 2022 at 18:51
tjebo's user avatar

Comments

1

Another base R solution, which is also a general way to split a column into several columns:

Data

    before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))

Procedure

    attach(before)
    before$type2 <- gsub("(\\w*)_and_(\\w*)", "c('\\1', '\\2')", type)
    # this recodes the column type to c("blah", "blah") form
    cbind(before,t(sapply(1:nrow(before), function(x) eval(parse(text=before$type2[x])))))
    # this splits the desired column into several ones named 1 2 3 and so on

OUTPUT

      attr          type             type2   1     2
    1    1   foo_and_bar   c('foo', 'bar') foo   bar
    2   30 foo_and_bar_2 c('foo', 'bar_2') foo bar_2
    3    4   foo_and_bar   c('foo', 'bar') foo   bar
    4    6 foo_and_bar_2 c('foo', 'bar_2') foo bar_2
answered Jul 14, 2023 at 12:59
Alan Gómez's user avatar

Comments
