I'd like to take data of the form

    before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
      attr          type
    1    1   foo_and_bar
    2   30 foo_and_bar_2
    3    4   foo_and_bar
    4    6 foo_and_bar_2

and use split() on the column "type" from above to get something like this:

      attr type_1 type_2
    1    1    foo    bar
    2   30    foo  bar_2
    3    4    foo    bar
    4    6    foo  bar_2

I came up with something unbelievably complex involving some form of apply that worked, but I've since misplaced that. It seemed far too complicated to be the best way. I can use strsplit as below, but then it's unclear how to get that back into 2 columns in the data frame.

    > strsplit(as.character(before$type),'_and_')
    [[1]]
    [1] "foo" "bar"

    [[2]]
    [1] "foo"   "bar_2"

    [[3]]
    [1] "foo" "bar"

    [[4]]
    [1] "foo"   "bar_2"

Thanks for any pointers. I've not quite grokked R lists just yet.

18 Answers

Use stringr::str_split_fixed

    library(stringr)
    str_split_fixed(before$type, "_and_", 2)

8 Comments

left_right <- str_split_fixed(as.character(split_df),'\">',2)

str_split_fixed("aaa...bbb", fixed("..."), 2) works fine with fixed() to "Match a fixed string" in the pattern= argument.

. means 'any character' in regex.

fixed("|")

You can use the tidyr package.

    before <- data.frame(
      attr = c(1, 30, 4, 6),
      type = c('foo_and_bar', 'foo_and_bar_2')
    )

    library(tidyr)
    before |>
      separate_wider_delim(type, delim = "_and_", names = c("foo", "bar"))
    # # A tibble: 4 × 3
    #    attr foo   bar
    #   <dbl> <chr> <chr>
    # 1     1 foo   bar
    # 2    30 foo   bar_2
    # 3     4 foo   bar
    # 4     6 foo   bar_2

(Or using older versions of tidyr)

    before %>%
      separate(type, c("foo", "bar"), "_and_")
    ##   attr foo   bar
    ## 1    1 foo   bar
    ## 2   30 foo bar_2
    ## 3    4 foo   bar
    ## 4    6 foo bar_2

3 Comments

str_split_fixed and adding columns to existing dataframe)?

_? I want the values as foo_and, bar/bar_2?

tidyr::separate has been superseded by tidyr::separate_wider_delim.

5 years later, adding the obligatory data.table solution

    library(data.table) ## v 1.9.6+
    setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_")]
    before
    #    attr          type type1 type2
    # 1:    1   foo_and_bar   foo   bar
    # 2:   30 foo_and_bar_2   foo bar_2
    # 3:    4   foo_and_bar   foo   bar
    # 4:    6 foo_and_bar_2   foo bar_2

We could also both make sure that the resulting columns will have correct types and improve performance by adding the type.convert and fixed arguments (since "_and_" isn't really a regex)

    setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_", type.convert = TRUE, fixed = TRUE)]

3 Comments

'_and_' patterns vary, you can find out the maximum number of matches (i.e. future columns) with max(lengths(strsplit(before$type, '_and_')))

strsplit it creates a single vector with 2 values in each slot, so tstrsplit transposes it into 2 vectors with a single value in each. paste0 is just used in order to create the column names, it is not used on the values. On the LHS of the equation are the column names, on the RHS is the split + transpose operation on the column. := stands for "assign in place", hence you don't see the <- assignment operator there.

Yet another approach: use rbind on out:

    before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
    out <- strsplit(as.character(before$type),'_and_')
    do.call(rbind, out)

         [,1]  [,2]
    [1,] "foo" "bar"
    [2,] "foo" "bar_2"
    [3,] "foo" "bar"
    [4,] "foo" "bar_2"

And to combine:

    data.frame(before$attr, do.call(rbind, out))

2 Comments

strcapture("(.*)_and_(.*)", as.character(before$type), data.frame(type_1 = "", type_2 = ""))

Notice that sapply with "[" can be used to extract either the first or second items in those lists, so:

    before$type_1 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 1)
    before$type_2 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 2)
    before$type <- NULL

And here's a gsub method:

    before$type_1 <- gsub("_and_.+$", "", before$type)
    before$type_2 <- gsub("^.+_and_", "", before$type)
    before$type <- NULL

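One caveat with the gsub patterns above: `.+` is greedy, so if a value contains more than one `_and_`, the pattern `^.+_and_` strips everything up to the last separator rather than the first. Here is a minimal sketch of the difference. The input vector `x` is made up for illustration and is not from the question.

```r
# Made-up input with more than one "_and_" in the second element
x <- c("foo_and_bar", "foo_and_bar_2_and_bar_3")

# Greedy: "^.+_and_" consumes everything up to the LAST "_and_"
greedy <- gsub("^.+_and_", "", x)

# Lazy: "^.+?_and_" stops at the FIRST "_and_"; sub() replaces one match only
lazy <- sub("^.+?_and_", "", x, perl = TRUE)

# Text before the first "_and_" (a regex match starts at the leftmost position)
first <- sub("_and_.*$", "", x)
```

Which variant you want depends on whether the remainder after the first separator should stay in one column.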
Here is a one-liner along the same lines as aniko's solution, but using hadley's stringr package:

    do.call(rbind, stringr::str_split(before$type, '_and_'))

2 Comments

stringr package.

strsplit() ?

To add to the options, you could also use my splitstackshape::cSplit function like this:

    library(splitstackshape)
    cSplit(before, "type", "_and_")
    #    attr type_1 type_2
    # 1:    1    foo    bar
    # 2:   30    foo  bar_2
    # 3:    4    foo    bar
    # 4:    6    foo  bar_2

The subject is almost exhausted, but I'd like to offer a solution to a slightly more general version where you don't know the number of output columns a priori. So for example you have

    before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2', 'foo_and_bar_2_and_bar_3', 'foo_and_bar'))
      attr                    type
    1    1             foo_and_bar
    2   30           foo_and_bar_2
    3    4 foo_and_bar_2_and_bar_3
    4    6             foo_and_bar

We can't use tidyr's separate() because we don't know the number of result columns before the split, so I have created a function that uses stringr to split a column, given the pattern and a name prefix for the generated columns. I hope the coding patterns used are correct.

    library(stringr)
    library(dplyr)
    library(tidyr)

    split_into_multiple <- function(column, pattern = ", ", into_prefix){
      cols <- str_split_fixed(column, pattern, n = Inf)
      # Sub out the ""'s returned by filling the matrix to the right, with NAs which are useful
      cols[which(cols == "")] <- NA
      # name the columns as 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m'
      # where m = # columns of 'cols'
      m <- dim(cols)[2]
      colnames(cols) <- paste(into_prefix, 1:m, sep = "_")
      # as.tibble() is deprecated; as_tibble() is the current name
      cols <- tibble::as_tibble(cols)
      return(cols)
    }

We can then use split_into_multiple in a dplyr pipe as follows:

    after <- before %>%
      bind_cols(split_into_multiple(.$type, "_and_", "type")) %>%
      # selecting those that start with 'type_' will remove the original 'type' column
      select(attr, starts_with("type_"))

    > after
      attr type_1 type_2 type_3
    1    1    foo    bar   <NA>
    2   30    foo  bar_2   <NA>
    3    4    foo  bar_2  bar_3
    4    6    foo    bar   <NA>

And then we can use gather to tidy up...

    after %>%
      gather(key, val, -attr, na.rm = TRUE)

       attr    key   val
    1     1 type_1   foo
    2    30 type_1   foo
    3     4 type_1   foo
    4     6 type_1   foo
    5     1 type_2   bar
    6    30 type_2 bar_2
    7     4 type_2 bar_2
    8     6 type_2   bar
    11    4 type_3 bar_3

2 Comments

pandas might offer something but needs some homework...

An easy way is to use sapply() and the [ function:

    before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
    out <- strsplit(as.character(before$type),'_and_')

For example:

    > data.frame(t(sapply(out, `[`)))
       X1    X2
    1 foo   bar
    2 foo bar_2
    3 foo   bar
    4 foo bar_2

sapply()'s result is a matrix and needs transposing and casting back to a data frame. It is then some simple manipulations that yield the result you wanted:

    after <- with(before, data.frame(attr = attr))
    after <- cbind(after, data.frame(t(sapply(out, `[`))))
    names(after)[2:3] <- paste("type", 1:2, sep = "_")

At this point, after is what you wanted

    > after
      attr type_1 type_2
    1    1    foo    bar
    2   30    foo  bar_2
    3    4    foo    bar
    4    6    foo  bar_2

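The sapply() call above only returns a matrix when every element splits into the same number of pieces; with ragged splits it returns a list instead. Below is a hedged base R sketch for the ragged case that pads short rows with NA before binding. The input `x` and the names `parts`, `padded` and `res` are made up for illustration.

```r
# Made-up ragged input: the second element has three pieces
x <- c("foo_and_bar", "foo_and_bar_2_and_bar_3")

parts <- strsplit(x, "_and_", fixed = TRUE)
n <- max(lengths(parts))               # widest split decides the column count

# `length<-` pads each character vector with NA up to length n
padded <- do.call(rbind, lapply(parts, `length<-`, n))

res <- as.data.frame(padded, stringsAsFactors = FALSE)
names(res) <- paste("type", seq_len(n), sep = "_")
```

Using do.call(rbind, ...) rather than t(sapply(...)) keeps the result a matrix with one row per input element even when every element yields a single piece.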
Since R version 3.4.0 you can use strcapture() from the utils package (included with base R installs), binding the output onto the other column(s).

    out <- strcapture(
      "(.*)_and_(.*)",
      as.character(before$type),
      data.frame(type_1 = character(), type_2 = character())
    )

    cbind(before["attr"], out)
    #   attr type_1 type_2
    # 1    1    foo    bar
    # 2   30    foo  bar_2
    # 3    4    foo    bar
    # 4    6    foo  bar_2

Here is a base R one-liner that overlaps a number of previous solutions, but returns a data.frame with the proper names.

    out <- setNames(data.frame(before$attr,
                      do.call(rbind, strsplit(as.character(before$type),
                                              split="_and_"))),
                      c("attr", paste0("type_", 1:2)))
    out
      attr type_1 type_2
    1    1    foo    bar
    2   30    foo  bar_2
    3    4    foo    bar
    4    6    foo  bar_2

It uses strsplit to break up the variable, and data.frame with do.call/rbind to put the data back into a data.frame. The additional incremental improvement is the use of setNames to add variable names to the data.frame.

This question is pretty old, but I'll add the solution I found to be the simplest at present.

    library(reshape2)
    before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
    newColNames <- c("type1", "type2")
    newCols <- colsplit(before$type, "_and_", newColNames)
    after <- cbind(before, newCols)
    after$type <- NULL
    after

Base R, but probably slow:

    n <- 1
    for(i in strsplit(as.character(before$type),'_and_')){
      before[n, 'type_1'] <- i[[1]]
      before[n, 'type_2'] <- i[[2]]
      n <- n + 1
    }

    ##   attr          type type_1 type_2
    ## 1    1   foo_and_bar    foo    bar
    ## 2   30 foo_and_bar_2    foo  bar_2
    ## 3    4   foo_and_bar    foo    bar
    ## 4    6 foo_and_bar_2    foo  bar_2

Another approach if you want to stick with strsplit() is to use the unlist() command. Here's a solution along those lines.

    tmp <- matrix(unlist(strsplit(as.character(before$type), '_and_')), ncol=2,
        byrow=TRUE)
    after <- cbind(before$attr, as.data.frame(tmp))
    names(after) <- c("attr", "type_1", "type_2")

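The matrix(unlist(...)) trick relies on every row splitting into exactly two pieces. If one row yields more or fewer, the flattened vector no longer lines up with the two columns. A small sketch of a guard, assuming a made-up input `x` (the names `parts` and `tmp` are illustrative):

```r
# Made-up input; every element is expected to split into exactly two pieces
x <- c("foo_and_bar", "foo_and_bar_2")
parts <- strsplit(x, "_and_", fixed = TRUE)

# Fail early instead of letting matrix() fill misaligned rows
stopifnot(all(lengths(parts) == 2))

tmp <- matrix(unlist(parts), ncol = 2, byrow = TRUE)
```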
Since this question was asked, separate has been superseded by the separate_longer_* and separate_wider_* functions.

The way to do it now is:

    library(tidyr)
    separate_wider_delim(before, type, delim = "_and_", names_sep = "_")

You could also use separate_wider_regex, but I'll leave that as an exercise to the reader :-)

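For readers who would rather not do the exercise, here is one possible sketch with separate_wider_regex. It assumes tidyr 1.3.0 or later, and the column names `type_1` and `type_2` are my choice rather than anything the answer prescribes. Named elements of `patterns` become output columns, while unnamed elements are matched and dropped.

```r
library(tidyr)  # assumes tidyr >= 1.3.0

before <- data.frame(
  attr = c(1, 30, 4, 6),
  type = c("foo_and_bar", "foo_and_bar_2")
)

after <- separate_wider_regex(
  before,
  type,
  # lazy ".*?" stops at the first "_and_"; the separator itself is unnamed, so it is discarded
  patterns = c(type_1 = ".*?", "_and_", type_2 = ".*")
)
```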
Here is another base R solution. We can use read.table, but since it accepts only a one-byte sep argument and here we have a multi-byte separator, we can use gsub to replace the multi-byte separator with any one-byte separator and pass that as the sep argument to read.table

    cbind(before[1], read.table(text = gsub('_and_', '\t', before$type),
                     sep = "\t", col.names = paste0("type_", 1:2)))

    #  attr type_1 type_2
    #1    1    foo    bar
    #2   30    foo  bar_2
    #3    4    foo    bar
    #4    6    foo  bar_2

In this case, we can also make it shorter by replacing it with the default sep (whitespace), so we don't have to mention it explicitly

    cbind(before[1], read.table(text = gsub('_and_', ' ', before$type),
                     col.names = paste0("type_", 1:2)))

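One thing to watch with read.table: by default it guesses column types and treats quote characters and `#` specially, which can alter the split values. Below is a hedged sketch that switches those behaviours off. The input `x` is made up to show the effect and is not from the question.

```r
# Made-up input containing a "#" and a zero-padded number
x <- c("foo_and_bar", "a#1_and_007")

res <- read.table(
  text = gsub("_and_", "\t", x, fixed = TRUE),
  sep = "\t",
  col.names = paste0("type_", 1:2),
  colClasses = "character",  # keep "007" as text instead of converting to 7
  quote = "",                # treat quote characters as ordinary text
  comment.char = ""          # do not drop everything after "#"
)
```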
Surprisingly, another tidyverse solution is still missing: you can also use tidyr::extract, with a regex.

    library(tidyr)
    before <- data.frame(attr = c(1, 30, 4, 6), type = c("foo_and_bar", "foo_and_bar_2"))

    ## regex - getting all characters except an underscore till the first underscore,
    ## inspired by Akrun https://stackoverflow.com/a/49752920/7941188

    extract(before, col = type, into = paste0("type", 1:2), regex = "(^[^_]*)_(.*)")
    #>   attr type1     type2
    #> 1    1   foo   and_bar
    #> 2   30   foo and_bar_2
    #> 3    4   foo   and_bar
    #> 4    6   foo and_bar_2

Another base R solution, which is also a general way to split a column into several columns:

Data

    before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))

Procedure

    attach(before)
    before$type2 <- gsub("(\\w*)_and_(\\w*)", "c('\\1', '\\2')", type)
    # this recodes the column type to c("blah", "blah") form
    cbind(before,t(sapply(1:nrow(before), function(x) eval(parse(text=before$type2[x])))))
    # this splits the desired column into several ones named 1 2 3 and so on

OUTPUT

      attr          type             type2   1     2
    1    1   foo_and_bar   c('foo', 'bar') foo   bar
    2   30 foo_and_bar_2 c('foo', 'bar_2') foo bar_2
    3    4   foo_and_bar   c('foo', 'bar') foo   bar
    4    6 foo_and_bar_2 c('foo', 'bar_2') foo bar_2