R tutorials#

R tutorial on adding a lubridate binding#

In this tutorial, we will document the contribution of a binding to the Arrow R package, following the steps specified by the Quick Reference section of the guide and the more detailed Steps in making your first PR section. Navigate there whenever you find that some information is missing here.

The binding will be added to the expression.R file in the R package. But you can also follow these steps in case you are adding a binding that will live somewhere else.

This tutorial is different from the Steps in making your first PR as we will be working on a specific case. This tutorial is not meant as a step-by-step guide.

Let’s start!

Set up#

Let’s set up the Arrow repository. We presume here that Git is already installed. Otherwise please see the Set up section.

Once the Apache Arrow repository is forked (see Fork the repository), we will clone it and add the main repository as a remote named upstream.

$ git clone https://github.com/<yourusername>/arrow.git
$ cd arrow
$ git remote add upstream https://github.com/apache/arrow
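To confirm that both remotes are registered as intended, `git remote -v` can be run inside the clone. The following is a minimal sketch that reproduces the remote configuration in a throwaway directory without touching the network. The temporary directory and the `demo_dir` and `upstream_url` variable names are assumptions made for illustration only.

```shell
set -eu  # stop at the first failing command

# Throwaway repository standing in for the local clone (no network access needed).
demo_dir=$(mktemp -d)
git init -q "$demo_dir/arrow"
cd "$demo_dir/arrow"

# 'git clone' normally records the fork as 'origin'; here it is added by hand.
git remote add origin "https://github.com/<yourusername>/arrow.git"
git remote add upstream https://github.com/apache/arrow

# List every remote together with its fetch and push URLs.
git remote -v

# Print only the URL recorded for the upstream remote.
upstream_url=$(git remote get-url upstream)
echo "$upstream_url"
```

In a real clone only the last three commands are needed. If `upstream` is missing from the list, the `git remote add upstream` step has not been run in that directory.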

Building R package#

The steps to follow for building the R package differ depending on the operating system you are using. For this reason we will only refer to the instructions for the building process in this tutorial.

See also

For the introduction to the building process refer to the Building the Arrow libraries 🏋🏿‍♀️ section.

For the instructions on how to build the R package refer to the R developer docs.

The issue#

In this tutorial we will be tackling an issue for implementing a simple binding for the mday() function that will match the behavior of the existing R function from lubridate.

Note

If you do not have an issue and you need help finding one please refer to the Finding good first issues 🔎 part of the guide.

Once you have an issue picked out and assigned to yourself, you can proceed to the next step.

Start the work on a new branch#

Before we start working on adding the binding we should create a new branch from the updated main.

$ git checkout main
$ git fetch upstream
$ git pull --ff-only upstream main
$ git checkout -b ARROW-14816
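The effect of these four commands can be tried safely outside the real repository. The sketch below is an illustration under stated assumptions: a local repository in a temporary directory stands in for apache/arrow, the commits are empty placeholders, and the `work` variable and the demo identity are hypothetical. It shows that the new branch starts exactly at the refreshed upstream main.

```shell
set -eu  # stop at the first failing command

work=$(mktemp -d)

# Local stand-in for apache/arrow with a single commit on main.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/main
git -C "$work/upstream" -c user.name="Demo User" -c user.email="demo@example.com" \
  commit -q --allow-empty -m "Initial commit"

# Stand-in for the clone of our fork, with the extra upstream remote.
git clone -q "$work/upstream" "$work/arrow"
cd "$work/arrow"
git remote add upstream "$work/upstream"

# Upstream moves ahead after we cloned.
git -C "$work/upstream" -c user.name="Demo User" -c user.email="demo@example.com" \
  commit -q --allow-empty -m "Newer upstream commit"

# The same sequence as in the tutorial.
git checkout -q main
git fetch -q upstream
git pull -q --ff-only upstream main
git checkout -q -b ARROW-14816

# Show the current branch and the commit it points to.
git rev-parse --abbrev-ref HEAD
git log --oneline -n 1
```

The `--ff-only` flag makes the pull fail instead of creating a merge commit if the local main has diverged, which keeps main an exact copy of upstream.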

Now we can start with researching the R function and the C++ Arrow compute function we want to expose or connect to.

Examine the lubridate mday() function

Going through the lubridate documentation we can see that mday() takes a date object and returns the day of the month as a numeric object.

We can run some examples in the R console to help us understand the function better:

> library(lubridate)
> mday(as.Date("2000-12-31"))
[1] 31
> mday(ymd(080306))
[1] 6

Examine the Arrow C++ day() function

From the compute function documentation we can see that day is a unary function, which means that it takes a single data input. The data input must be a Temporal class and the returned value is an Integer/numeric type.

The Temporal class is specified as: Date types (Date32, Date64), Time types (Time32, Time64), Timestamp, Duration, Interval.

We can call an Arrow C++ function from an R console using call_function to see how it works:

> call_function("day", Scalar$create(lubridate::ymd("2000-12-31")))
Scalar
31

We can see that the lubridate and Arrow functions operate on and return equivalent data types. lubridate’s mday() function has no additional arguments, and there are also no option classes associated with the Arrow C++ function day().

Looking at the code in expression.R, we can see that the day function is already specified/mapped on the R package side: apache/arrow

We only need to add mday() to the list of expressions, connecting it to the C++ day function.

  # second is defined in dplyr-functions.R
  # wday is defined in dplyr-functions.R
  "mday" = "day",
  "yday" = "day_of_year",
  "year" = "year",

Adding a test#

Now we need to add a test that checks that everything works well. If there were additional options or edge cases, we would have to add more tests. Looking at the tests for similar functions (for example yday() or day()), we can see that a good place to add our two tests is in test-dplyr-funcs-datetime.R:

test_that("extract mday from timestamp", {
  compare_dplyr_binding(
    .input %>%
      mutate(x = mday(datetime)) %>%
      collect(),
    test_df
  )
})

And

test_that("extract mday from date", {
  compare_dplyr_binding(
    .input %>%
      mutate(x = mday(date)) %>%
      collect(),
    test_df
  )
})

Now we need to see whether the tests pass or whether we need to do some more research and code corrections.

devtools::test(filter="datetime")

> devtools::test(filter="datetime")
Loading arrow
See arrow_info() for available features
Testing arrow
See arrow_info() for available features
 | F W S  OK | Context
 | 1     230 | dplyr-funcs-datetime [1.4s]
────────────────────────────────────────────────────────────────────────
Failure (test-dplyr-funcs-datetime.R:187:3): strftime
``%>%`(...)` did not throw the expected error.
Backtrace:
 1. testthat::expect_error(...) test-dplyr-funcs-datetime.R:187:2
 2. testthat:::expect_condition_matching(...)
────────────────────────────────────────────────────────────────────────

══ Results ═════════════════════════════════════════════════════════════
Duration: 1.4 s

[ FAIL 1 | WARN 0 | SKIP 0 | PASS 230 ]

There is a failure for the strftime function, but looking at the code we can see that it is not connected to our work. We can move on and maybe ask others whether they get a similar failure when running the tests. It could be that we only need to rebuild the library.

Check styling#

We should also run linters to check that the styling of the code follows the tidyverse style. To do that we run the following command in the shell:

$ make style
R -s -e 'setwd(".."); if (requireNamespace("styler")) styler::style_file(setdiff(system("git diff --name-only | grep r/.*R$", intern = TRUE), file.path("r", source("r/.styler_excludes.R")$value)))'
Loading required namespace: styler
Styling  2  files:
 r/R/expression.R                             ✔
 r/tests/testthat/test-dplyr-funcs-datetime.R ℹ
────────────────────────────────────────────
Status   Count Legend
✔  1  File unchanged.
ℹ  1  File changed.
✖  0  Styling threw an error.
────────────────────────────────────────────
Please review the changes carefully!

Creating a Pull Request#

First let’s review our changes in the shell using git status to see which files have been changed, so that we commit only the ones we are working on.

$ git status
On branch ARROW-14816
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
   modified:   R/expression.R
   modified:   tests/testthat/test-dplyr-funcs-datetime.R

We also use git diff to see the changes in the files and spot any errors we might have made.

$ git diff
diff --git a/r/R/expression.R b/r/R/expression.R
index 37fc21c25..0e71803ec 100644
--- a/r/R/expression.R
+++ b/r/R/expression.R
@@ -70,6 +70,7 @@
   "quarter" = "quarter",
   # second is defined in dplyr-functions.R
   # wday is defined in dplyr-functions.R
+  "mday" = "day",
   "yday" = "day_of_year",
   "year" = "year",

diff --git a/r/tests/testthat/test-dplyr-funcs-datetime.R b/r/tests/testthat/test-dplyr-funcs-datetime.R
index 359a5403a..228eca56a 100644
--- a/r/tests/testthat/test-dplyr-funcs-datetime.R
+++ b/r/tests/testthat/test-dplyr-funcs-datetime.R
@@ -444,6 +444,15 @@ test_that("extract wday from timestamp", {
   )
 })

+test_that("extract mday from timestamp", {
+  compare_dplyr_binding(
+    .input %>%
+      mutate(x = mday(datetime)) %>%
+      collect(),
+    test_df
+  )
+})
+
 test_that("extract yday from timestamp", {
   compare_dplyr_binding(
     .input %>%
@@ -626,6 +635,15 @@ test_that("extract wday from date", {
   )
 })

+test_that("extract mday from date", {
+  compare_dplyr_binding(
+    .input %>%
+      mutate(x = mday(date)) %>%
+      collect(),
+    test_df
+  )
+})
+
 test_that("extract yday from date", {
   compare_dplyr_binding(
     .input %>%

Everything looks OK. Now we can make the commit (save our changes to the branch history):

$ git commit -am "Adding a binding and a test for mday() lubridate"
[ARROW-14816 ed37d3a3b] Adding a binding and a test for mday() lubridate
 2 files changed, 19 insertions(+)
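The `-a` flag works here because both files we edited were already tracked by Git. It stages modified tracked files only, so a brand-new file would be left out of the commit unless it is added with `git add` first. The sketch below illustrates this in a throwaway repository. The file names, the demo identity, and the `repo` and `committed` variables are hypothetical.

```shell
set -eu  # stop at the first failing command

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity for this demo only
git config user.email "demo@example.com"

# One tracked file with an initial commit.
echo 'a <- 1' > tracked.R
git add tracked.R
git commit -q -m "Add tracked.R"

# Modify the tracked file and create a new, untracked one.
echo 'a <- 2' > tracked.R
echo 'b <- 1' > untracked.R

# -a stages modified tracked files only; untracked.R is not included.
git commit -q -am "Update tracked.R"

# Files contained in the last commit.
committed=$(git diff-tree --no-commit-id --name-only -r HEAD)
echo "$committed"

# Anything still outside the commit.
git status --short
```

Running `git status` before committing, as done above in the tutorial, is what reveals whether any new files still need an explicit `git add`.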

We can use git log to check the history of commits:

$ git log
commit ed37d3a3b3eef76b696532f10562fea85f809fab (HEAD -> ARROW-14816)
Author: Alenka Frim <frim.alenka@gmail.com>
Date:   Fri Jan 21 09:15:31 2022 +0100

    Adding a binding and a test for mday() lubridate

commit c5358787ee8f7b80f067292f49e5f032854041b9 (upstream/main, upstream/HEAD, main, ARROW-15346, ARROW-10643)
Author: Krisztián Szűcs <szucs.krisztian@gmail.com>
Date:   Thu Jan 20 09:45:59 2022 +0900

    ARROW-15372: [C++][Gandiva] Gandiva now depends on boost/crc.hpp which is missing from the trimmed boost archive

    See build error https://github.com/ursacomputing/crossbow/runs/4871392838?check_suite_focus=true#step:5:11762

    Closes #12190 from kszucs/ARROW-15372

    Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
    Signed-off-by: Sutou Kouhei <kou@clear-code.com>

If we started the branch some time ago, we may need to rebase onto upstream main to make sure there are no merge conflicts:

$ git pull upstream main --rebase
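What the rebase does to the branch history can be seen in a small simulation. This is a sketch under stated assumptions: a local repository in a temporary directory plays the role of apache/arrow, and the file names, the demo identity, and the `work` variable are hypothetical. After the pull, our commit sits directly on top of the newest upstream commit.

```shell
set -eu  # stop at the first failing command

work=$(mktemp -d)

# Local stand-in for apache/arrow with one commit on main.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity for this demo only
git config user.email "demo@example.com"
echo "base" > README
git add README
git commit -q -m "Initial commit"

# Our clone, with a feature branch holding one commit.
git clone -q "$work/upstream" "$work/arrow"
cd "$work/arrow"
git config user.name "Demo User"
git config user.email "demo@example.com"
git remote add upstream "$work/upstream"
git checkout -q -b ARROW-14816
echo '"mday" = "day",' > binding.txt
git add binding.txt
git commit -q -m "Adding a binding and a test for mday() lubridate"

# Meanwhile, upstream main receives an unrelated change.
cd "$work/upstream"
echo "newer" > NEWS
git add NEWS
git commit -q -m "Unrelated upstream change"

# Replay our commit on top of the refreshed upstream main.
cd "$work/arrow"
git pull -q upstream main --rebase

# Linear history: our commit first, then the two upstream commits.
git log --oneline
```

Because the two changes touch different files, the rebase completes without conflicts. If the same lines had been edited on both sides, Git would stop and ask for the conflict to be resolved before continuing.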

And now we can push our work to our fork of the Arrow repository on GitHub, which is the remote called origin.

$ git push origin ARROW-14816
Enumerating objects: 233, done.
Counting objects: 100% (233/233), done.
Delta compression using up to 8 threads
Compressing objects: 100% (130/130), done.
Writing objects: 100% (151/151), 35.78 KiB | 8.95 MiB/s, done.
Total 151 (delta 129), reused 33 (delta 20), pack-reused 0
remote: Resolving deltas: 100% (129/129), completed with 80 local objects.
remote:
remote: Create a pull request for 'ARROW-14816' on GitHub by visiting:
remote:      https://github.com/AlenkaF/arrow/pull/new/ARROW-14816
remote:
To https://github.com/AlenkaF/arrow.git
 * [new branch]          ARROW-14816 -> ARROW-14816

Now we have to go to the Arrow repository on GitHub to create a Pull Request. On the GitHub Arrow page (main or forked) we will see a yellow notice bar with a note that we made recent pushes to the branch ARROW-14816. That’s great, now we can make the Pull Request by clicking on Compare & pull request.

GitHub page of the Apache Arrow repository showing a notice bar indicating change has been made in our branch and a Pull Request can be created.

Notice bar on the Apache Arrow repository.#

First we need to change the title to ARROW-14816: [R] Implement bindings for lubridate::mday() in order to match it with the issue. Note that a punctuation mark was added: the colon after the issue ID!

Extra note: when this tutorial was created, we were using the Jira issue tracker. As we now use GitHub issues, the title would instead be GH-14816: [R] Implement bindings for lubridate::mday().
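Since the colon is easy to forget, the title can be checked locally before opening the Pull Request. The helper below is a hypothetical sketch: the function name `is_valid_title` and the regular expression are assumptions derived from the two example titles above, not the project's official validation.

```shell
set -eu  # stop at the first failing command

# Hypothetical helper: succeeds when a title looks like
# "<ARROW|GH>-<number>: [Component] Summary".
# The pattern is an assumption for illustration, not an official Arrow check.
is_valid_title() {
  printf '%s\n' "$1" | grep -Eq '^(ARROW|GH)-[0-9]+: \[[^]]+\] .+'
}

for title in \
  "ARROW-14816: [R] Implement bindings for lubridate::mday()" \
  "GH-14816: [R] Implement bindings for lubridate::mday()" \
  "ARROW-14816 [R] Implement bindings for lubridate::mday()"; do
  if is_valid_title "$title"; then
    echo "looks fine:   $title"
  else
    echo "check format: $title"
  fi
done
```

The third title omits the colon after the issue ID, so the helper flags it while accepting the first two.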

We will also add a description to make it clear to others what we are trying to do.

GitHub page of the Pull Request showing the editor for the title and a description.

Editing the title and the description of our Pull Request.#

Once we click Create pull request, our code can be reviewed as a Pull Request in the Apache Arrow repository.

GitHub page of the Pull Request showing the title and a description.

Here it is, our Pull Request!#

The Pull Request gets connected to the issue and the CI starts running. After some time passes and we get a review, we can correct the code, comment, resolve conversations and so on.

See also

For more information about the Pull Request workflow see Lifecycle of a pull request.

The Pull Request we made can be viewed here.