Movatterモバイル変換


[0]ホーム

URL:


Skip to contents

Scaling and Modeling with tidyquant

MattDancho

2025-11-01

Source:vignettes/TQ03-scaling-and-modeling-with-tidyquant.Rmd
TQ03-scaling-and-modeling-with-tidyquant.Rmd

Designed for the data science workflow of thetidyverse

Overview

The greatest benefit totidyquant is the ability toapply the data science workflow to easily model and scale your financialanalysis as described inR forData Science. Scaling is the process of creating an analysisfor one asset and then extending it to multiple groups. This idea ofscaling is incredibly useful to financial analysts because typically onewants to compare many assets to make informed decisions. Fortunately,thetidyquant package integrates with thetidyverse making scaling super simple!

Alltidyquant functions return data in thetibble (tidy data frame) format, which allows forinteraction within thetidyverse. This means we can:

  • Seamlessly scale data retrieval and mutations
  • Use the pipe (%>%) for chaining operations
  • Usedplyr andtidyr:select,filter,group_by,nest/unnest,spread/gather, etc
  • Usepurrr: mapping functions withmap()
  • Model financial analysis using the data science workflow inR for Data Science

We’ll go through some useful techniques for getting and manipulatinggroups of data.

Prerequisites

Load thetidyquant package to get started.

# Loads tidyquant, xts, quantmod, TTR, and PerformanceAnalyticslibrary(tidyverse)library(tidyquant)

1.0 Scaling the Getting of Financial Data

A very basic example is retrieving the stock prices for multiplestocks. There are three primary ways to do this:

Method 1: Map a character vector with multiple stock symbols

c("AAPL","GOOG","META")%>%tq_get(get="stock.prices", from="2016-01-01", to="2017-01-01")
### A tibble: 756 × 8##    symbol date        open  high   low close    volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 AAPL   2016-01-04  25.7  26.3  25.5  26.3 270597600     23.8## 2 AAPL   2016-01-05  26.4  26.5  25.6  25.7 223164000     23.2## 3 AAPL   2016-01-06  25.1  25.6  25.0  25.2 273829600     22.7## 4 AAPL   2016-01-07  24.7  25.0  24.1  24.1 324377600     21.8## 5 AAPL   2016-01-08  24.6  24.8  24.2  24.2 283192000     21.9## 6 AAPL   2016-01-11  24.7  24.8  24.3  24.6 198957600     22.2## 7 AAPL   2016-01-12  25.1  25.2  24.7  25.0 196616800     22.6## 8 AAPL   2016-01-13  25.1  25.3  24.3  24.3 249758400     22.0## 9 AAPL   2016-01-14  24.5  25.1  23.9  24.9 252680400     22.5##10 AAPL   2016-01-15  24.0  24.4  23.8  24.3 319335600     21.9### ℹ 746 more rows

The output is a single level tibble with all or the stock prices inone tibble. The auto-generated column name is “symbol”, which can bepreemptively renamed by giving the vector a name(e.g. stocks <- c("AAPL", "GOOG", "META")) and thenpiping totq_get.

Method 2: Map a tibble with stocks in first column

First, get a stock list in data frame format either by making thetibble or retrieving fromtq_index /tq_exchange. The stock symbols must be in the firstcolumn.

Method 2A: Make a tibble

stock_list<-tibble(stocks=c("AAPL","JPM","CVX"),                     industry=c("Technology","Financial","Energy"))stock_list
### A tibble: 3 × 2##   stocks industry##<chr><chr>##1 AAPL   Technology##2 JPM    Financial##3 CVX    Energy

Second, send the stock list totq_get. Notice how thesymbol and industry columns are automatically expanded the length of thestock prices.

stock_list%>%tq_get(get="stock.prices", from="2016-01-01", to="2017-01-01")
### A tibble: 756 × 9##    stocks industry   date        open  high   low close    volume adjusted##<chr><chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 AAPL   Technology 2016-01-04  25.7  26.3  25.5  26.3 270597600     23.8## 2 AAPL   Technology 2016-01-05  26.4  26.5  25.6  25.7 223164000     23.2## 3 AAPL   Technology 2016-01-06  25.1  25.6  25.0  25.2 273829600     22.7## 4 AAPL   Technology 2016-01-07  24.7  25.0  24.1  24.1 324377600     21.8## 5 AAPL   Technology 2016-01-08  24.6  24.8  24.2  24.2 283192000     21.9## 6 AAPL   Technology 2016-01-11  24.7  24.8  24.3  24.6 198957600     22.2## 7 AAPL   Technology 2016-01-12  25.1  25.2  24.7  25.0 196616800     22.6## 8 AAPL   Technology 2016-01-13  25.1  25.3  24.3  24.3 249758400     22.0## 9 AAPL   Technology 2016-01-14  24.5  25.1  23.9  24.9 252680400     22.5##10 AAPL   Technology 2016-01-15  24.0  24.4  23.8  24.3 319335600     21.9### ℹ 746 more rows

Method 2B: Use index or exchange

Get an index…

tq_index("DOW")
### A tibble: 31 × 8##    symbol company      identifier sedol weight sector shares_held local_currency##<chr><chr><chr><chr><dbl><chr><dbl><chr>## 1 GS     GOLDMAN SAC… 38141G104  2407… 0.102  -          5449197 USD## 2 CAT    CATERPILLAR… 149123101  2180… 0.0755 -          5449197 USD## 3 MSFT   MICROSOFT C… 594918104  2588… 0.0681 -          5449197 USD## 4 HD     HOME DEPOT … 437076102  2434… 0.0491 -          5449197 USD## 5 AXP    AMERICAN EX… 025816109  2026… 0.0465 -          5449197 USD## 6 SHW    SHERWIN WIL… 824348106  2804… 0.0450 -          5449197 USD## 7 V      VISA INC CL… 92826C839  B2PZ… 0.0447 -          5449197 USD## 8 UNH    UNITEDHEALT… 91324P102  2917… 0.0446 -          5449197 USD## 9 IBM    INTL BUSINE… 459200101  2005… 0.0401 -          5449197 USD##10 JPM    JPMORGAN CH… 46625H100  2190… 0.0401 -          5449197 USD### ℹ 21 more rows

…or, get an exchange.

tq_exchange("NYSE")

Send the index or exchange totq_get.ImportantNote: This can take several minutes depending on the size of the indexor exchange, which is why only the first three stocks are evaluated inthe vignette.

tq_index("DOW")%>%slice(1:3)%>%tq_get(get="stock.prices")
### A tibble: 8,175 × 15##    symbol company      identifier sedol weight sector shares_held local_currency##<chr><chr><chr><chr><dbl><chr><dbl><chr>## 1 GS     GOLDMAN SAC… 38141G104  2407…  0.102 -          5449197 USD## 2 GS     GOLDMAN SAC… 38141G104  2407…  0.102 -          5449197 USD## 3 GS     GOLDMAN SAC… 38141G104  2407…  0.102 -          5449197 USD## 4 GS     GOLDMAN SAC… 38141G104  2407…  0.102 -          5449197 USD## 5 GS     GOLDMAN SAC… 38141G104  2407…  0.102 -          5449197 USD## 6 GS     GOLDMAN SAC… 38141G104  2407…  0.102 -          5449197 USD## 7 GS     GOLDMAN SAC… 38141G104  2407…  0.102 -          5449197 USD## 8 GS     GOLDMAN SAC… 38141G104  2407…  0.102 -          5449197 USD## 9 GS     GOLDMAN SAC… 38141G104  2407…  0.102 -          5449197 USD##10 GS     GOLDMAN SAC… 38141G104  2407…  0.102 -          5449197 USD### ℹ 8,165 more rows### ℹ 7 more variables: date <date>, open <dbl>, high <dbl>, low <dbl>,###   close <dbl>, volume <dbl>, adjusted <dbl>

You can use any applicable “getter” to get data foreverystock in an index or an exchange! This includes:“stock.prices”, “key.ratios”, “key.stats”, and more.

2.0 Scaling the Mutation of Financial Data

Once you get the data, you typically want to do something with it.You can easily do this at scale. Let’s get the yearly returns formultiple stocks usingtq_transmute. First, get the prices.We’ll use theFANG data set, but you typically will usetq_get to retrieve data in “tibble” format.

FANG
### A tibble: 4,032 × 8##    symbol date        open  high   low close    volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 META   2013-01-02  27.4  28.2  27.4  28    69846400     28## 2 META   2013-01-03  27.9  28.5  27.6  27.8  63140600     27.8## 3 META   2013-01-04  28.0  28.9  27.8  28.8  72715400     28.8## 4 META   2013-01-07  28.7  29.8  28.6  29.4  83781800     29.4## 5 META   2013-01-08  29.5  29.6  28.9  29.1  45871300     29.1## 6 META   2013-01-09  29.7  30.6  29.5  30.6 104787700     30.6## 7 META   2013-01-10  30.6  31.5  30.3  31.3  95316400     31.3## 8 META   2013-01-11  31.3  32.0  31.1  31.7  89598000     31.7## 9 META   2013-01-14  32.1  32.2  30.6  31.0  98892800     31.0##10 META   2013-01-15  30.6  31.7  29.9  30.1 173242600     30.1### ℹ 4,022 more rows

Second, usegroup_by to group by stock symbol. Third,apply the mutation. We can do this in one easy workflow. TheperiodReturn function is applied to each group of stockprices, and a new data frame was returned with the annual returns in thecorrect periodicity.

FANG_returns_yearly<-FANG%>%group_by(symbol)%>%tq_transmute(select=adjusted,                 mutate_fun=periodReturn,                 period="yearly",                 col_rename="yearly.returns")

Last, we can visualize the returns.

FANG_returns_yearly%>%ggplot(aes(x=year(date), y=yearly.returns, fill=symbol))+geom_bar(position="dodge", stat="identity")+labs(title="FANG: Annual Returns",         subtitle="Mutating at scale is quick and easy!",         y="Returns", x="", color="")+scale_y_continuous(labels=scales::percent)+coord_flip()+theme_tq()+scale_fill_tq()

3.0 Modeling Financial Data using purrr

Eventually you will want to begin modeling (or more generallyapplying functions) at scale! One of thebest featuresof thetidyverse is the ability to map functions to nestedtibbles usingpurrr. From the Many Models chapter of “R for Data Science”, we can apply thesame modeling workflow to financial analysis. Using a two stepworkflow:

  1. Model a single stock
  2. Scale to many stocks

Let’s go through an example to illustrate.

Example: Applying a Regression Model to Detect a Positive Trend

In this example, we’ll use a simple linear model to identify thetrend in annual returns to determine if the stock returns are decreasingor increasing over time.

Analyze a Single Stock

First, let’s collect stock data withtq_get()

AAPL<-tq_get("AAPL", from="2007-01-01", to="2016-12-31")AAPL
### A tibble: 2,518 × 8##    symbol date        open  high   low close     volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 AAPL   2007-01-03  3.08  3.09  2.92  2.991238319600     2.52## 2 AAPL   2007-01-04  3.00  3.07  2.99  3.06  847260400     2.57## 3 AAPL   2007-01-05  3.06  3.08  3.01  3.04  834741600     2.55## 4 AAPL   2007-01-08  3.07  3.09  3.05  3.05  797106800     2.57## 5 AAPL   2007-01-09  3.09  3.32  3.04  3.313349298400     2.78## 6 AAPL   2007-01-10  3.38  3.49  3.34  3.462952880000     2.91## 7 AAPL   2007-01-11  3.43  3.46  3.40  3.421440252800     2.88## 8 AAPL   2007-01-12  3.38  3.39  3.33  3.381312690400     2.84## 9 AAPL   2007-01-16  3.42  3.47  3.41  3.471244076400     2.91##10 AAPL   2007-01-17  3.48  3.49  3.39  3.391646260000     2.85### ℹ 2,508 more rows

Next, come up with a function to help us collect annual log returns.The function below mutates the stock prices to period returns usingtq_transmute(). We add thetype = "log" andperiod = "monthly" arguments to ensure we retrieve a tibbleof monthly log returns. Last, we take the mean of the monthly returns toget MMLR.

get_annual_returns<-function(stock.returns){stock.returns%>%tq_transmute(select=adjusted,                     mutate_fun=periodReturn,                     type="log",                     period="yearly")}

Let’s testget_annual_returns out. We now have theannual log returns over the past ten years.

AAPL_annual_log_returns<-get_annual_returns(AAPL)AAPL_annual_log_returns
### A tibble: 10 × 2##    date       yearly.returns##<date><dbl>## 1 2007-12-31         0.860## 2 2008-12-31        -0.842## 3 2009-12-31         0.904## 4 2010-12-31         0.426## 5 2011-12-30         0.228## 6 2012-12-31         0.282## 7 2013-12-31         0.0776## 8 2014-12-31         0.341## 9 2015-12-31        -0.0306##10 2016-12-30         0.118

Let’s visualize to identify trends. We can see from the linear trendline that AAPL’s stock returns are declining.

AAPL_annual_log_returns%>%ggplot(aes(x=year(date), y=yearly.returns))+geom_hline(yintercept=0, color=palette_light()[[1]])+geom_point(size=2, color=palette_light()[[3]])+geom_line(linewidth=1, color=palette_light()[[3]])+geom_smooth(method="lm", se=FALSE)+labs(title="AAPL: Visualizing Trends in Annual Returns",         x="", y="Annual Returns", color="")+theme_tq()

Now, we can get the linear model using thelm()function. However, there is one problem: the output is not “tidy”.

mod<-lm(yearly.returns~year(date), data=AAPL_annual_log_returns)mod
#### Call:## lm(formula = yearly.returns ~ year(date), data = AAPL_annual_log_returns)#### Coefficients:## (Intercept)   year(date)##    58.86282     -0.02915

We can utilize thebroom package to get “tidy” data fromthe model. There’s three primary functions:

  1. augment: adds columns to the original data such aspredictions, residuals and cluster assignments
  2. glance: provides a one-row summary of model-levelstatistics
  3. tidy: summarizes a model’s statistical findings such ascoefficients of a regression

We’ll usetidy to retrieve the model coefficients.

### A tibble: 2 × 5##   term        estimate std.error statistic p.value##<chr><dbl><dbl><dbl><dbl>##1 (Intercept)  58.9     113.         0.520   0.617##2 year(date)   -0.0291    0.0562    -0.518   0.618

Adding to our workflow, we have the following:

get_model<-function(stock_data){annual_returns<-get_annual_returns(stock_data)mod<-lm(yearly.returns~year(date), data=annual_returns)tidy(mod)}

Testing it out on a single stock. We can see that the “term” thatcontains the direction of the trend (the slope) is “year(date)”. Theinterpretation is that as year increases one unit, the annual returnsdecrease by 3%.

get_model(AAPL)
### A tibble: 2 × 5##   term        estimate std.error statistic p.value##<chr><dbl><dbl><dbl><dbl>##1 (Intercept)  58.9     113.         0.520   0.617##2 year(date)   -0.0291    0.0562    -0.518   0.618

Now that we have identified the trend direction, it looks like we areready to scale.

Scale to Many Stocks

Once the analysis for one stock is done scale to many stocks issimple. For brevity, we’ll randomly sample ten stocks from theS&P500 with a call todplyr::sample_n().

set.seed(10)stocks_tbl<-tq_index("SP500")%>%sample_n(5)stocks_tbl
### A tibble: 5 × 8##   symbol company      identifier sedol  weight sector shares_held local_currency##<chr><chr><chr><chr><dbl><chr><dbl><chr>##1 MTCH   MATCH GROUP… 57667L107  BK80… 1.33e-4 -          2839119 USD##2 MCO    MOODY S CORP 615369105  2252… 1.27e-3 -          1830057 USD##3 JBL    JABIL INC    466313103  2471… 4.12e-4 -          1284541 USD##4 DVN    DEVON ENERG… 25179M103  2480… 3.48e-4 -          7516996 USD##5 BXP    BXP INC      101121101  2019… 1.77e-4 -          1740920 USD

We can now apply our analysis function to the stocks usingdplyr::mutate() andpurrr::map(). Themutate() function adds a column to our tibble, and themap() function maps our customget_modelfunction to our tibble of stocks using thesymbol column.Thetidyr::unnest() function unrolls the nested data frameso all of the model statistics are accessible in the top data framelevel. Thefilter,arrange andselect steps just manipulate the data frame to isolate andarrange the data for our viewing.

stocks_model_stats<-stocks_tbl%>%select(symbol,company)%>%tq_get(from="2007-01-01", to="2016-12-31")%>%# Nestgroup_by(symbol,company)%>%nest()%>%# Apply the get_model() function to the new "nested" data columnmutate(model=map(data,get_model))%>%# Unnest and collect slopeunnest(model)%>%filter(term=="year(date)")%>%arrange(desc(estimate))%>%select(-term)stocks_model_stats
### A tibble: 5 × 7### Groups:   symbol, company [5]##   symbol company           data     estimate std.error statistic p.value##<chr><chr><list><dbl><dbl><dbl><dbl>##1 MCO    MOODY S CORP<tibble>   0.0669    0.0384     1.74    0.120##2 JBL    JABIL INC<tibble>   0.0380    0.0521     0.729   0.487##3 MTCH   MATCH GROUP INC<tibble>   0.0246    0.0349     0.703   0.502##4 BXP    BXP INC<tibble>   0.0240    0.0255     0.940   0.375##5 DVN    DEVON ENERGY CORP<tibble>  -0.0127    0.0356    -0.357   0.730

We’re done! We now have the coefficient of the linear regression thattracks the direction of the trend line. We can easily extend this typeof analysis to larger lists or stock indexes. For example, the entireS&P500 could be analyzed removing thesample_n()following the call totq_index("SP500").

4.0 Error Handling when Scaling

Eventually you will run into a stock index, stock symbol, FRED datacode, etc that cannot be retrieved. Possible reasons are:

  • An index becomes out of date
  • A company goes private
  • A stock ticker symbol changes
  • Yahoo / FRED just doesn’t like your stock symbol / FRED code

This becomes painful when scaling if the functions return errors. So,thetq_get() function is designed to handle errorsgracefully. What this means is anNA value isreturned when an error is generated along with agentle errorwarning.

tq_get("XYZ","stock.prices")
### A tibble: 2,502 × 8##    symbol date        open  high   low close   volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 XYZ    2015-11-19  11.2  14.8   9    13.1 47466100     13.1## 2 XYZ    2015-11-20  13.9  14.1  12.5  12.9 16550300     12.9## 3 XYZ    2015-11-23  13    13.1  12.1  12.1  5172200     12.1## 4 XYZ    2015-11-24  12    12.2  11.5  12.0  4714700     12.0## 5 XYZ    2015-11-25  12.1  12.4  11.9  11.9  3583400     11.9## 6 XYZ    2015-11-27  12.1  12.3  11.9  12.1942300     12.1## 7 XYZ    2015-11-30  12.3  12.4  11.9  12.0  1997100     12.0## 8 XYZ    2015-12-01  12.1  12.2  11.9  11.9  1256100     11.9## 9 XYZ    2015-12-02  12.0  12.2  11.9  11.9  1708100     11.9##10 XYZ    2015-12-03  12.0  12.2  11.9  11.9  1431400     11.9### ℹ 2,492 more rows

Pros and Cons to Built-In Error-Handling

There are pros and cons to this approach that you may not agree with,but I believe helps in the long run. Just be aware of what happens:

  • Pros: Long running scripts are not interruptedbecause of one error

  • Cons: Errors can be inadvertently handled orflow downstream if the user does not read the warnings

Bad Apples Fail Gracefully, tq_get

Let’s see an example when usingtq_get() to get thestock prices for a long list of stocks with oneBAD APPLE.The argumentcomplete_cases comes in handy. The default isTRUE, which removes “bad apples” so future analysis havecomplete cases to compute on. Note that a gentle warning stating that anerror occurred and was dealt with by removing the rows from theresults.

c("AAPL","GOOG","BAD APPLE")%>%tq_get(get="stock.prices", complete_cases=TRUE)
## Warning: There was 1 warning in `dplyr::mutate()`.## In argument: `data.. = purrr::map(...)`.## Caused by warning:##! x = 'BAD APPLE', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "BAD APPLE", env = <environment>, : Unable to import "BAD APPLE".## cannot open the connection##  Removing BAD APPLE.
### A tibble: 5,450 × 8##    symbol date        open  high   low close    volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 AAPL   2015-01-02  27.8  27.9  26.8  27.3 212818400     24.3## 2 AAPL   2015-01-05  27.1  27.2  26.4  26.6 257142000     23.6## 3 AAPL   2015-01-06  26.6  26.9  26.2  26.6 263188400     23.6## 4 AAPL   2015-01-07  26.8  27.0  26.7  26.9 160423600     23.9## 5 AAPL   2015-01-08  27.3  28.0  27.2  28.0 237458000     24.8## 6 AAPL   2015-01-09  28.2  28.3  27.6  28.0 214798000     24.9## 7 AAPL   2015-01-12  28.1  28.2  27.2  27.3 198603200     24.2## 8 AAPL   2015-01-13  27.9  28.2  27.2  27.6 268367600     24.5## 9 AAPL   2015-01-14  27.3  27.6  27.1  27.5 195826400     24.4##10 AAPL   2015-01-15  27.5  27.5  26.7  26.7 240056000     23.7### ℹ 5,440 more rows

Now switchingcomplete_cases = FALSE will retain anyerrors asNA values in a nested data frame. Notice that theerror message and output change. The error message now states that theNA values exist in the output and the return is a “nested”data structure.

c("AAPL","GOOG","BAD APPLE")%>%tq_get(get="stock.prices", complete_cases=FALSE)
## Warning: There was 1 warning in `dplyr::mutate()`.## In argument: `data.. = purrr::map(...)`.## Caused by warning:##! x = 'BAD APPLE', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "BAD APPLE", env = <environment>, : Unable to import "BAD APPLE".## cannot open the connection
### A tibble: 5,451 × 8##    symbol date        open  high   low close    volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 AAPL   2015-01-02  27.8  27.9  26.8  27.3 212818400     24.3## 2 AAPL   2015-01-05  27.1  27.2  26.4  26.6 257142000     23.6## 3 AAPL   2015-01-06  26.6  26.9  26.2  26.6 263188400     23.6## 4 AAPL   2015-01-07  26.8  27.0  26.7  26.9 160423600     23.9## 5 AAPL   2015-01-08  27.3  28.0  27.2  28.0 237458000     24.8## 6 AAPL   2015-01-09  28.2  28.3  27.6  28.0 214798000     24.9## 7 AAPL   2015-01-12  28.1  28.2  27.2  27.3 198603200     24.2## 8 AAPL   2015-01-13  27.9  28.2  27.2  27.6 268367600     24.5## 9 AAPL   2015-01-14  27.3  27.6  27.1  27.5 195826400     24.4##10 AAPL   2015-01-15  27.5  27.5  26.7  26.7 240056000     23.7### ℹ 5,441 more rows

In both cases, the prudent user will review the warnings to determinewhat happened and whether or not this is acceptable. In thecomplete_cases = FALSE example, if the user attempts toperform downstream computations at scale, the computations will likelyfail grinding the analysis to a halt. But, the advantage is that theuser will more easily be able to filter to the problem root to determinewhat happened and decide whether this is acceptable or not.


[8]ページ先頭

©2009-2025 Movatter.jp