Split data frame string column into multiple columns

Question 1

I'd like to take data of the form

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
  attr          type
1    1   foo_and_bar
2   30 foo_and_bar_2
3    4   foo_and_bar
4    6 foo_and_bar_2

and use split() on the column "type" from above to get something like this:

  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

I came up with something unbelievably complex involving some form of apply that worked, but I've since misplaced it. It seemed far too complicated to be the best way. I can use strsplit as below, but then it's unclear how to get the result back into 2 columns in the data frame.

> strsplit(as.character(before$type),'_and_')
[[1]]
[1] "foo" "bar"

[[2]]
[1] "foo"   "bar_2"

[[3]]
[1] "foo" "bar"

[[4]]
[1] "foo"   "bar_2"

Thanks for any pointers. I've not quite grokked R lists just yet.

Question 2

Use stringr::str_split_fixed

library(stringr)
str_split_fixed(before$type, "_and_", 2)

Question 3

This worked fine for my problem today as well, but it was adding a 'c' at the beginning of each row. Any idea why that is?
left_right <- str_split_fixed(as.character(split_df),'\">',2)
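On the stray 'c': one likely cause, assuming split_df is a data frame (or list) rather than a character vector, is that as.character() deparses each whole column into a single string such as c("a", "b"). A minimal base R sketch with made-up data (the column name txt and its values are hypothetical):

```r
# Hypothetical stand-in for split_df: a data frame with one character column
split_df <- data.frame(txt = c('a">b', 'c">d'), stringsAsFactors = FALSE)

# as.character() on a data frame yields one deparsed string per column,
# which is where the leading "c(" comes from
whole <- as.character(split_df)
print(whole)

# Passing the column itself keeps one string per row
per_row <- as.character(split_df$txt)
parts <- do.call(rbind, strsplit(per_row, '">', fixed = TRUE))
print(parts)
```

If that is the cause, passing the column (split_df$txt here) instead of the whole data frame should remove the 'c'.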

Question 4

I would like to split on a pattern that contains "...", but when I apply that function it returns nothing. What could be the problem? My type is something like "test...score".

Question 5

@user3841581 - old query of yours I know, but this is covered in the documentation - str_split_fixed("aaa...bbb", fixed("..."), 2) works fine with fixed() to "Match a fixed string" in the pattern= argument. A . means 'any character' in regex.

Question 6

@Martin That c() around the regex is unnecessary and confusing. It suggests there could be more than one pattern, and I doubt that is the case.

Question 7

I think the comment from @Martin above explains my issue. The char I needed to split my field on was a "|". The accepted answer didn't work for me until I added fixed("|")
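The same pitfall exists in base R, where strsplit() treats its split argument as a regular expression unless fixed = TRUE is given. A small sketch with made-up strings, showing the literal and the escaped-regex forms side by side:

```r
x <- c("test...score", "a|b")

# "." and "|" are regex metacharacters, so match them literally with fixed = TRUE
dots <- strsplit(x[1], "...", fixed = TRUE)[[1]]
pipe <- strsplit(x[2], "|", fixed = TRUE)[[1]]

# Equivalent regex forms escape the metacharacters instead
dots_rx <- strsplit(x[1], "\\.\\.\\.")[[1]]
pipe_rx <- strsplit(x[2], "[|]")[[1]]

print(dots)
print(pipe)
```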

Question 8

You can use the tidyr package.

before <- data.frame(
  attr = c(1, 30, 4, 6),
  type = c('foo_and_bar', 'foo_and_bar_2')
)

library(tidyr)
before |>
  separate_wider_delim(type, delim = "_and_", names = c("foo", "bar"))
# # A tibble: 4 × 3
#    attr foo   bar
#   <dbl> <chr> <chr>
# 1     1 foo   bar
# 2    30 foo   bar_2
# 3     4 foo   bar
# 4     6 foo   bar_2

(Or using older versions of tidyr)

before %>%
  separate(type, c("foo", "bar"), "_and_")
##   attr foo   bar
## 1    1 foo   bar
## 2   30 foo bar_2
## 3    4 foo   bar
## 4    6 foo bar_2

Question 9

Is there a way to limit the number of splits with separate? Let's say I want to split on '_' only once (or do it with str_split_fixed and add the columns to the existing data frame)?
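With tidyr, separate() has an extra = "merge" argument intended for this case. As a dependency-free alternative, here is a base R sketch that splits only at the first "_" using regexpr() and regmatches(); the vector x is made up to mirror the question's data:

```r
x <- c("foo_and_bar", "foo_and_bar_2")

# regexpr() finds only the first "_"; invert = TRUE returns the text around it
parts <- regmatches(x, regexpr("_", x, fixed = TRUE), invert = TRUE)

# One row per input string, one column per piece
once <- do.call(rbind, parts)
print(once)
```

The resulting matrix can be attached to the existing data frame with cbind().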

Question 10

@hadley What if I want to split on the second _? I want the values as foo_and, bar/bar_2.
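One way to split at the second underscore, sketched in base R under the assumption that every string contains at least two underscores (strings with fewer are returned unchanged in both columns). The column names left and right are placeholders:

```r
x <- c("foo_and_bar", "foo_and_bar_2")

# Group 1: everything up to (not including) the second "_"; group 2: the rest
pattern <- "^([^_]*_[^_]*)_(.*)$"
second <- data.frame(
  left  = sub(pattern, "\\1", x),
  right = sub(pattern, "\\2", x),
  stringsAsFactors = FALSE
)
print(second)
```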

Question 11

tidyr::separate has been superseded by tidyr::separate_wider_delim.

Question 12

5 years later, adding the obligatory data.table solution

library(data.table) ## v 1.9.6+
setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_")]
before
#    attr          type type1 type2
# 1:    1   foo_and_bar   foo   bar
# 2:   30 foo_and_bar_2   foo bar_2
# 3:    4   foo_and_bar   foo   bar
# 4:    6 foo_and_bar_2   foo bar_2

We could also both make sure that the resulting columns have the correct types and improve performance by adding the type.convert and fixed arguments (since "_and_" isn't really a regex)

setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_", type.convert = TRUE, fixed = TRUE)]

Question 13

If the number of '_and_' patterns varies, you can find the maximum number of pieces (i.e. future columns) with max(lengths(strsplit(before$type, '_and_')))
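A base R sketch of how that maximum can be used to pad ragged splits out to a rectangular result. The type vector is made up, and the sketch assumes at least one string splits into two or more pieces (with a single piece everywhere, sapply() would return a plain vector instead of a matrix):

```r
type <- c("foo_and_bar", "foo_and_bar_2_and_bar_3")
parts <- strsplit(type, "_and_", fixed = TRUE)

# Widest row determines the number of output columns
n <- max(lengths(parts))

# `length<-` pads shorter vectors with NA; t() puts one input string per row
padded <- t(sapply(parts, `length<-`, n))
colnames(padded) <- paste0("type_", seq_len(n))
print(padded)
```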

Question 14

This is my favorite answer, works very well! Could you please explain how it works. Why transpose(strsplit(…)) and isn't paste0 for concatenating strings - not splitting them...

Question 15

@Gecko I'm not sure what the question is. If you just use strsplit, it creates a single list with 2 values in each slot, so tstrsplit transposes it into 2 vectors with a single value per row in each. paste0 is only used to create the column names; it is not used on the values. On the LHS of the equation are the column names, on the RHS is the split + transpose operation on the column. := stands for "assign in place", hence you don't see the <- assignment operator there.
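To make the transposition concrete, here is a base R sketch of roughly what happens conceptually. It is an illustration only, not data.table's actual implementation, and the names by_row and by_col are made up:

```r
type <- c("foo_and_bar", "foo_and_bar_2")

# strsplit(): one list element per row, each holding that row's pieces
by_row <- strsplit(type, "_and_", fixed = TRUE)

# Transposed: one vector per piece position, i.e. one vector per future column
by_col <- lapply(1:2, function(i) vapply(by_row, `[`, character(1), i))

# paste0() only builds the new column names
names(by_col) <- paste0("type", 1:2)
str(by_col)
```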

Question 16

Yet another approach: use rbind on out:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
out <- strsplit(as.character(before$type),'_and_')
do.call(rbind, out)
     [,1]  [,2]
[1,] "foo" "bar"
[2,] "foo" "bar_2"
[3,] "foo" "bar"
[4,] "foo" "bar_2"

And to combine:

data.frame(before$attr, do.call(rbind, out))

Question 17

Another alternative on newer R versions is strcapture("(.*)_and_(.*)", as.character(before$type), data.frame(type_1 = "", type_2 = ""))

Question 18

This is the correct solution. Simple and doesn't require third-party packages.

Question 19

Notice that sapply with "[" can be used to extract either the first or second items in those lists so:

before$type_1 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 1)
before$type_2 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 2)
before$type <- NULL

And here's a gsub method:

before$type_1 <- gsub("_and_.+$", "", before$type)
before$type_2 <- gsub("^.+_and_", "", before$type)
before$type <- NULL

Question 20

Here is a one-liner along the same lines as aniko's solution, but using hadley's stringr package:

do.call(rbind, stringr::str_split(before$type, '_and_'))

Question 21

Good catch, best solution for me. Though a bit slower than with the stringr package.

Question 22

Did this function get renamed to strsplit()?

Question 23

To add to the options, you could also use my splitstackshape::cSplit function like this:

library(splitstackshape)
cSplit(before, "type", "_and_")
#    attr type_1 type_2
# 1:    1    foo    bar
# 2:   30    foo  bar_2
# 3:    4    foo    bar
# 4:    6    foo  bar_2

Question 24

3 years later - this option is working best for a similar problem I have - however the dataframe I am working with has 54 columns and I need to split all of them into two. Is there a way to do this using this method - short of typing out the above command 54 times? Many thanks, Nicki.

Question 25

@Nicki, Have you tried providing a vector of the column names or the column positions? That should do it....

Question 26

It wasn't just renaming the columns - I needed to literally split the columns as above, effectively doubling the number of columns in my df. The below was what I used in the end:
df2 <- cSplit(df1, splitCols = 1:54, "/")

Question 27

The subject is almost exhausted, but I'd like to offer a solution to a slightly more general version where you don't know the number of output columns a priori. So for example you have

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2', 'foo_and_bar_2_and_bar_3', 'foo_and_bar'))
  attr                    type
1    1             foo_and_bar
2   30           foo_and_bar_2
3    4 foo_and_bar_2_and_bar_3
4    6             foo_and_bar

We can't use tidyr's separate() because we don't know the number of result columns before the split, so I have created a function that uses stringr to split a column, given the pattern and a name prefix for the generated columns. I hope the coding patterns used are correct.

library(stringr)
library(tibble)

split_into_multiple <- function(column, pattern = ", ", into_prefix){
  cols <- str_split_fixed(column, pattern, n = Inf)
  # Replace the ""s that pad the matrix on the right with NAs, which are more useful
  cols[which(cols == "")] <- NA
  # name the columns 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m'
  # where m = # columns of 'cols'
  m <- dim(cols)[2]
  colnames(cols) <- paste(into_prefix, 1:m, sep = "_")
  # as_tibble() replaces the deprecated as.tibble()
  cols <- as_tibble(cols)
  return(cols)
}

We can then use split_into_multiple in a dplyr pipe as follows:

library(dplyr)
after <- before %>%
  bind_cols(split_into_multiple(.$type, "_and_", "type")) %>%
  # selecting those that start with 'type_' will remove the original 'type' column
  select(attr, starts_with("type_"))
> after
  attr type_1 type_2 type_3
1    1    foo    bar   <NA>
2   30    foo  bar_2   <NA>
3    4    foo  bar_2  bar_3
4    6    foo    bar   <NA>

And then we can use gather to tidy up...

after %>%
  gather(key, val, -attr, na.rm = T)
   attr    key   val
1     1 type_1   foo
2    30 type_1   foo
3     4 type_1   foo
4     6 type_1   foo
5     1 type_2   bar
6    30 type_2 bar_2
7     4 type_2 bar_2
8     6 type_2   bar
11    4 type_3 bar_3

Question 28

This is very useful. Years later, I am wondering whether it is possible to pass in column names from a for loop. For example, I want to apply split_into_multiple to 10 columns (or more) without splitting each column one at a time, and I want the resulting split columns bound together. I can't find a way to programmatically select the column names (it gives me an error); with !!as.name() it says there's no such attribute... How would you do it?

Question 29

I am glad you still find it useful @MEC. My R knowledge has abandoned me, so I don't know how to help you with this part. I think Python pandas might offer something, but that needs some homework...
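For the multi-column case asked about above, one possible base R sketch loops over a character vector of column names with lapply() and binds the pieces back together. It does not reuse split_into_multiple; the helper split_col, the data frame df and its columns are all hypothetical:

```r
# Hypothetical data: two delimited columns to split
df <- data.frame(
  id = 1:2,
  a  = c("x_and_y", "x_and_y_and_z"),
  b  = c("p_and_q", "r_and_s"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one column, pad with NA, name pieces <col>_1, <col>_2, ...
split_col <- function(data, col, pattern) {
  parts <- strsplit(data[[col]], pattern, fixed = TRUE)
  n <- max(lengths(parts))
  out <- do.call(rbind, lapply(parts, `length<-`, n))
  out <- as.data.frame(out, stringsAsFactors = FALSE)
  names(out) <- paste(col, seq_len(n), sep = "_")
  out
}

# Column names are plain strings, so no tidy evaluation is needed
to_split <- c("a", "b")
pieces <- lapply(to_split, split_col, data = df, pattern = "_and_")
result <- cbind(df[setdiff(names(df), to_split)], do.call(cbind, pieces))
print(result)
```

Because the columns are selected by name with [[, the same call works for 10 or 54 columns by extending to_split.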

Question 30

An easy way is to use sapply() and the [ function:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
out <- strsplit(as.character(before$type),'_and_')

For example:

> data.frame(t(sapply(out, `[`)))
   X1    X2
1 foo   bar
2 foo bar_2
3 foo   bar
4 foo bar_2

sapply()'s result is a matrix and needs transposing and casting back to a data frame. It is then some simple manipulations that yield the result you wanted:

after <- with(before, data.frame(attr = attr))
after <- cbind(after, data.frame(t(sapply(out, `[`))))
names(after)[2:3] <- paste("type", 1:2, sep = "_")

At this point, after is what you wanted

> after
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

Question 31

Since R version 3.4.0 you can use strcapture() from the utils package (included with base R installs), binding the output onto the other column(s).

out <- strcapture(
    "(.*)_and_(.*)",
    as.character(before$type),
    data.frame(type_1 = character(), type_2 = character())
)

cbind(before["attr"], out)
#   attr type_1 type_2
# 1    1    foo    bar
# 2   30    foo  bar_2
# 3    4    foo    bar
# 4    6    foo  bar_2

Question 32

This is the most coherent 'built for purpose'.

Question 33

Here is a base R one liner that overlaps a number of previous solutions, but returns a data.frame with the proper names.

out <- setNames(data.frame(before$attr,
                  do.call(rbind, strsplit(as.character(before$type),
                                          split="_and_"))),
                  c("attr", paste0("type_", 1:2)))
out
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

It uses strsplit to break up the variable, and data.frame with do.call/rbind to put the data back into a data.frame. The additional incremental improvement is the use of setNames to add variable names to the data.frame.

Question 34

This question is pretty old, but I'll add the solution I found to be the simplest at present.

library(reshape2)
before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
newColNames <- c("type1", "type2")
newCols <- colsplit(before$type, "_and_", newColNames)
after <- cbind(before, newCols)
after$type <- NULL
after

Question 35

This is by far the easiest when it comes to managing df vectors

Question 36

base but probably slow:

n <- 1
for(i in strsplit(as.character(before$type),'_and_')){
     before[n, 'type_1'] <- i[[1]]
     before[n, 'type_2'] <- i[[2]]
     n <- n + 1
}
##   attr          type type_1 type_2
## 1    1   foo_and_bar    foo    bar
## 2   30 foo_and_bar_2    foo  bar_2
## 3    4   foo_and_bar    foo    bar
## 4    6 foo_and_bar_2    foo  bar_2

Question 37

Another approach if you want to stick with strsplit() is to use the unlist() command. Here's a solution along those lines.

tmp <- matrix(unlist(strsplit(as.character(before$type), '_and_')), ncol=2,
   byrow=TRUE)
after <- cbind(before$attr, as.data.frame(tmp))
names(after) <- c("attr", "type_1", "type_2")

Question 38

Since this question was asked, separate has been superseded by the separate_longer_* and separate_wider_* functions.

The way to do it now is:

library(tidyr)
separate_wider_delim(before, type, delim = "_and_", names_sep = "_")

You could also use separate_wider_regex, but I'll leave that as an exercise to the reader :-)
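For readers who would rather skip the exercise, one possible sketch follows, assuming tidyr 1.3.0 or later is installed (the block does nothing otherwise). The column names type_1 and type_2 are chosen here to match the question's desired output:

```r
before <- data.frame(attr = c(1, 30, 4, 6),
                     type = c("foo_and_bar", "foo_and_bar_2"))

# Assumes tidyr >= 1.3.0 is installed; skipped otherwise
if (requireNamespace("tidyr", quietly = TRUE)) {
  # Named patterns become columns; the unnamed "_and_" is matched and dropped
  after <- tidyr::separate_wider_regex(
    before, type,
    patterns = c(type_1 = ".*?", "_and_", type_2 = ".*")
  )
  print(after)
}
```

The lazy ".*?" stops at the first "_and_", so any later occurrence stays in type_2.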

Question 39

Here is another base R solution. We can use read.table, but since it accepts only a one-byte sep argument and here we have a multi-byte separator, we can use gsub to replace the multi-byte separator with any one-byte separator and use that as the sep argument in read.table

cbind(before[1], read.table(text = gsub('_and_', '\t', before$type),
                 sep = "\t", col.names = paste0("type_", 1:2)))
#  attr type_1 type_2
#1    1    foo    bar
#2   30    foo  bar_2
#3    4    foo    bar
#4    6    foo  bar_2

In this case, we can also make it shorter by replacing the separator with the default sep (whitespace), so we don't have to mention it explicitly

cbind(before[1], read.table(text = gsub('_and_', ' ', before$type),
                 col.names = paste0("type_", 1:2)))

Question 40

Surprisingly, another tidyverse solution is still missing - you can also use tidyr::extract, with a regex.

library(tidyr)
before <- data.frame(attr = c(1, 30, 4, 6), type = c("foo_and_bar", "foo_and_bar_2"))

## regex - getting all characters except an underscore till the first underscore,
## inspired by Akrun https://stackoverflow.com/a/49752920/7941188
extract(before, col = type, into = paste0("type", 1:2), regex = "(^[^_]*)_(.*)")
#>   attr type1     type2
#> 1    1   foo   and_bar
#> 2   30   foo and_bar_2
#> 3    4   foo   and_bar
#> 4    6   foo and_bar_2

Question 41

Another base R solution, which is also a general way to split a column into several columns:

Data

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))

Procedure

attach(before)
before$type2 <- gsub("(\\w*)_and_(\\w*)", "c('\\1', '\\2')", type)
#this recodes the column type to c("blah", "blah") form
cbind(before,t(sapply(1:nrow(before), function(x) eval(parse(text=before$type2[x])))))
#this splits the desired column into several ones named 1 2 3 and so on

OUTPUT

  attr          type             type2   1     2
1    1   foo_and_bar   c('foo', 'bar') foo   bar
2   30 foo_and_bar_2 c('foo', 'bar_2') foo bar_2
3    4   foo_and_bar   c('foo', 'bar') foo   bar
4    6 foo_and_bar_2 c('foo', 'bar_2') foo bar_2

Accepted answer (stringr::str_split_fixed, above): hadley, 2010-12-04 04:21:27Z


