Error with Custom Function and dplyr across()

Question 1

I have a dataframe like this:

dat <- data.frame(ID=c("12345", "54321", "12345", "54321", "99999", "12345"),                  cat1 = c(1,0,0,0,0,0),                  cat2= c(1,1,1,1,1,1),                  cat3= c(1,0,1,0,1,0),                  cat4=c(0,0,0,0,0,0))

And I would like to spit out a table like this:

cat1 1cat2 3cat3 2cat4 0

That counts how many unique ID's there are when the value of each column is 1. This is my code for doing this for a single column:

dat |>  filter(cat1==1) |>  summarise(n=n_distinct(ID))

And this works fine. But I want to be able to do this across all needed columns at once. I'm attempting to write a function and then use summarize(across()) like this:

my_function <- function(data, column) {  data |>    filter({{column}}==1) |>    summarise(n=n_distinct(ID))}

And then apply this function to all needed columns like this:

dat |>  summarise(across(c(cat1, cat2, cat3, cat4), ~my_function(.x)))

I keep getting this error though:

Error insummarise():ℹ In argument:across(c(cat1, cat2, cat3, cat4), ~my_function(.x)).Caused by error inacross():! Can't compute columncat1.Caused by error inUseMethod():! no applicable method for 'filter' applied to an object of class "c('double', 'numeric')"

What am I doing wrong?

Thank you!

Question 2

You already have several solutions thatdo what you want. The answer to your actual question (What am I doing wrong?) is that the function passed toacross should take a single argument, which represents the column being manipulated. Your function has two parameters, the first of which is assumed to be the dataframe in which the column exists. Hence the error. More fundamentally, consider making your data frame tidy. It isn't at the moment because information you need is contained in the columns' names, not in their values.

Question 3

Thank you - this was the explanation I was looking for even if some of the other solutions work.

Question 4

Expanding on my comment and providing a tidy solution...

You already have several solutions that do what you want. The answer to your actual question (What am I doing wrong?) is that the function passed to across should take a single argument, which represents the column being manipulated. Your function has two parameters, the first of which is assumed to be the dataframe in which the column exists. Hence the error.

This means that your current approach won't work in this case. Your summary function needs access to both the column name and its values. That's why your data frame isn't tidy. There are other ways to get what you want usingsummarise andacross, but they won't be pretty because the code would be fighting against the fundamental assumption of the tidyverse - that it is working with tidy data.

You can learn more about tidy datahere.

Here is a tidy approach to your problem.

First, tidy your data:

long_dat <- dat %>%   pivot_longer(    starts_with("cat"),    values_to = "value",    names_to = "column"  ) long_dat# A tibble: 24 × 3   ID    column value   <chr> <chr>  <dbl> 1 12345 cat1       1 2 12345 cat2       1 3 12345 cat3       1 4 12345 cat4       0 5 54321 cat1       0 6 54321 cat2       1 7 54321 cat3       0 8 54321 cat4       0 9 12345 cat1       010 12345 cat2       1

Now, summarise the observations that meet your criterion:

long_dat %>%   filter(value == 1) %>%   group_by(column) %>%   summarise(n = length(unique(ID)))# A tibble: 3 × 2  column     n  <chr>  <int>1 cat1       12 cat2       33 cat3       2

To obtain zero counts for categories that have no rows that meet your criterion, a little more work is required.

long_dat %>%   filter(value == 1) %>%   group_by(column) %>%   summarise(n = length(unique(ID))) %>%   right_join(long_dat %>% distinct(column), by = "column") %>%   replace_na(list(n = 0))# A tibble: 4 × 2  column     n  <chr>  <int>1 cat1       12 cat2       33 cat3       24 cat4       0

Question 5

You could do this quickly in base R usingsapply:

sapply(dat[,-1], \(x) length(unique(dat$ID[x == 1])))# cat1 cat2 cat3 cat4 #    1    3    2    0

This outputs a names vector of integers. If you wanted in in a table structure, wrap the whole thing inas.table()

Question 6

In base R you could do:

a <- col(dat[-1])[!duplicated(dat$ID) & unlist(dat[-1]) == 1]table(factor(names(dat[-1]))[a])cat1 cat2 cat3 cat4    1    3    2    0

Question 7

Do you mean this?

> dat %>%+     reframe(across(everything(), sum), .by = ID)     ID cat1 cat2 cat3 cat41 12345    1    3    2    02 54321    0    2    0    03 99999    0    1    1    0

or probably more likely to be the solution as @Onyambu suggested

> dat %>%+     reframe(across(everything(), sum), .by = ID) %>%+     reframe(across(-ID, ~ sum(.x > 0)))  cat1 cat2 cat3 cat41    1    3    2    0

Question 8

OP is interested in number of unique ID's. Perhaps adding%>% summarise(across(-ID, ~sum(.x>0))) is sufficient

Question 9

@Onyambu aha, I think you are right. Thanks!

Question 10

This is whereaggregate is made for

> aggregate(.~ID, dat, sum)     ID cat1 cat2 cat3 cat41 12345    1    3    2    02 54321    0    2    0    03 99999    0    1    1    0

or probably more likely to be the solution as @Onyambu suggested

> colSums(aggregate(.~ID, dat, sum)[-1] > 0)cat1 cat2 cat3 cat4    1    3    2    0

No need for any dependencies.

Question 11

The code is incoplete. Note that OP is interested in number of unique IDs. Thus you should havecolSums( aggregate(.~ID, dat, sum)[-1] > 0)

Question 12

@Onyambu, thank you.

Limey 12.9k2 gold badges17 silver badges42 bronze badges · Accepted Answer · 2025-02-26 15:24:04Z

Expanding on my comment and providing a tidy solution...

You already have several solutions that do what you want. The answer to your actual question (What am I doing wrong?) is that the function passed to across should take a single argument, which represents the column being manipulated. Your function has two parameters, the first of which is assumed to be the dataframe in which the column exists. Hence the error.

This means that your current approach won't work in this case. Your summary function needs access to both the column name and its values. That's why your data frame isn't tidy. There are other ways to get what you want usingsummarise andacross, but they won't be pretty because the code would be fighting against the fundamental assumption of the tidyverse - that it is working with tidy data.

You can learn more about tidy datahere.

Here is a tidy approach to your problem.

First, tidy your data:

long_dat <- dat %>%   pivot_longer(    starts_with("cat"),    values_to = "value",    names_to = "column"  ) long_dat# A tibble: 24 × 3   ID    column value   <chr> <chr>  <dbl> 1 12345 cat1       1 2 12345 cat2       1 3 12345 cat3       1 4 12345 cat4       0 5 54321 cat1       0 6 54321 cat2       1 7 54321 cat3       0 8 54321 cat4       0 9 12345 cat1       010 12345 cat2       1

Now, summarise the observations that meet your criterion:

long_dat %>%   filter(value == 1) %>%   group_by(column) %>%   summarise(n = length(unique(ID)))# A tibble: 3 × 2  column     n  <chr>  <int>1 cat1       12 cat2       33 cat3       2

To obtain zero counts for categories that have no rows that meet your criterion, a little more work is required.

long_dat %>%   filter(value == 1) %>%   group_by(column) %>%   summarise(n = length(unique(ID))) %>%   right_join(long_dat %>% distinct(column), by = "column") %>%   replace_na(list(n = 0))# A tibble: 4 × 2  column     n  <chr>  <int>1 cat1       12 cat2       33 cat3       24 cat4       0

Movatterモバイル変換

Collectives™ on Stack Overflow

Error with Custom Function and dplyr across()

5 Answers5

Comments

Comments

Comments

2 Comments

2 Comments

Your Answer

Sign up orlog in

Post as a guest

Related

Hot Network Questions

Subscribe to RSS