
Scaling and Modeling with tidyquant
MattDancho
2025-11-01
Source:vignettes/TQ03-scaling-and-modeling-with-tidyquant.RmdTQ03-scaling-and-modeling-with-tidyquant.RmdDesigned for the data science workflow of the
tidyverse
Overview
The greatest benefit totidyquant is the ability toapply the data science workflow to easily model and scale your financialanalysis as described inR forData Science. Scaling is the process of creating an analysisfor one asset and then extending it to multiple groups. This idea ofscaling is incredibly useful to financial analysts because typically onewants to compare many assets to make informed decisions. Fortunately,thetidyquant package integrates with thetidyverse making scaling super simple!
Alltidyquant functions return data in thetibble (tidy data frame) format, which allows forinteraction within thetidyverse. This means we can:
- Seamlessly scale data retrieval and mutations
- Use the pipe (
%>%) for chaining operations - Use
dplyrandtidyr:select,filter,group_by,nest/unnest,spread/gather, etc - Use
purrr: mapping functions withmap() - Model financial analysis using the data science workflow inR for Data Science
We’ll go through some useful techniques for getting and manipulatinggroups of data.
1.0 Scaling the Getting of Financial Data
A very basic example is retrieving the stock prices for multiplestocks. There are three primary ways to do this:
Method 1: Map a character vector with multiple stock symbols
### A tibble: 756 × 8## symbol date open high low close volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 AAPL 2016-01-04 25.7 26.3 25.5 26.3 270597600 23.8## 2 AAPL 2016-01-05 26.4 26.5 25.6 25.7 223164000 23.2## 3 AAPL 2016-01-06 25.1 25.6 25.0 25.2 273829600 22.7## 4 AAPL 2016-01-07 24.7 25.0 24.1 24.1 324377600 21.8## 5 AAPL 2016-01-08 24.6 24.8 24.2 24.2 283192000 21.9## 6 AAPL 2016-01-11 24.7 24.8 24.3 24.6 198957600 22.2## 7 AAPL 2016-01-12 25.1 25.2 24.7 25.0 196616800 22.6## 8 AAPL 2016-01-13 25.1 25.3 24.3 24.3 249758400 22.0## 9 AAPL 2016-01-14 24.5 25.1 23.9 24.9 252680400 22.5##10 AAPL 2016-01-15 24.0 24.4 23.8 24.3 319335600 21.9### ℹ 746 more rowsThe output is a single level tibble with all or the stock prices inone tibble. The auto-generated column name is “symbol”, which can bepreemptively renamed by giving the vector a name(e.g. stocks <- c("AAPL", "GOOG", "META")) and thenpiping totq_get.
Method 2: Map a tibble with stocks in first column
First, get a stock list in data frame format either by making thetibble or retrieving fromtq_index /tq_exchange. The stock symbols must be in the firstcolumn.
Method 2A: Make a tibble
stock_list<-tibble(stocks=c("AAPL","JPM","CVX"), industry=c("Technology","Financial","Energy"))stock_list### A tibble: 3 × 2## stocks industry##<chr><chr>##1 AAPL Technology##2 JPM Financial##3 CVX EnergySecond, send the stock list totq_get. Notice how thesymbol and industry columns are automatically expanded the length of thestock prices.
### A tibble: 756 × 9## stocks industry date open high low close volume adjusted##<chr><chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 AAPL Technology 2016-01-04 25.7 26.3 25.5 26.3 270597600 23.8## 2 AAPL Technology 2016-01-05 26.4 26.5 25.6 25.7 223164000 23.2## 3 AAPL Technology 2016-01-06 25.1 25.6 25.0 25.2 273829600 22.7## 4 AAPL Technology 2016-01-07 24.7 25.0 24.1 24.1 324377600 21.8## 5 AAPL Technology 2016-01-08 24.6 24.8 24.2 24.2 283192000 21.9## 6 AAPL Technology 2016-01-11 24.7 24.8 24.3 24.6 198957600 22.2## 7 AAPL Technology 2016-01-12 25.1 25.2 24.7 25.0 196616800 22.6## 8 AAPL Technology 2016-01-13 25.1 25.3 24.3 24.3 249758400 22.0## 9 AAPL Technology 2016-01-14 24.5 25.1 23.9 24.9 252680400 22.5##10 AAPL Technology 2016-01-15 24.0 24.4 23.8 24.3 319335600 21.9### ℹ 746 more rowsMethod 2B: Use index or exchange
Get an index…
tq_index("DOW")### A tibble: 31 × 8## symbol company identifier sedol weight sector shares_held local_currency##<chr><chr><chr><chr><dbl><chr><dbl><chr>## 1 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD## 2 CAT CATERPILLAR… 149123101 2180… 0.0755 - 5449197 USD## 3 MSFT MICROSOFT C… 594918104 2588… 0.0681 - 5449197 USD## 4 HD HOME DEPOT … 437076102 2434… 0.0491 - 5449197 USD## 5 AXP AMERICAN EX… 025816109 2026… 0.0465 - 5449197 USD## 6 SHW SHERWIN WIL… 824348106 2804… 0.0450 - 5449197 USD## 7 V VISA INC CL… 92826C839 B2PZ… 0.0447 - 5449197 USD## 8 UNH UNITEDHEALT… 91324P102 2917… 0.0446 - 5449197 USD## 9 IBM INTL BUSINE… 459200101 2005… 0.0401 - 5449197 USD##10 JPM JPMORGAN CH… 46625H100 2190… 0.0401 - 5449197 USD### ℹ 21 more rows…or, get an exchange.
tq_exchange("NYSE")Send the index or exchange totq_get.ImportantNote: This can take several minutes depending on the size of the indexor exchange, which is why only the first three stocks are evaluated inthe vignette.
### A tibble: 8,175 × 15## symbol company identifier sedol weight sector shares_held local_currency##<chr><chr><chr><chr><dbl><chr><dbl><chr>## 1 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD## 2 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD## 3 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD## 4 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD## 5 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD## 6 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD## 7 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD## 8 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD## 9 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD##10 GS GOLDMAN SAC… 38141G104 2407… 0.102 - 5449197 USD### ℹ 8,165 more rows### ℹ 7 more variables: date <date>, open <dbl>, high <dbl>, low <dbl>,### close <dbl>, volume <dbl>, adjusted <dbl>You can use any applicable “getter” to get data foreverystock in an index or an exchange! This includes:“stock.prices”, “key.ratios”, “key.stats”, and more.
2.0 Scaling the Mutation of Financial Data
Once you get the data, you typically want to do something with it.You can easily do this at scale. Let’s get the yearly returns formultiple stocks usingtq_transmute. First, get the prices.We’ll use theFANG data set, but you typically will usetq_get to retrieve data in “tibble” format.
FANG### A tibble: 4,032 × 8## symbol date open high low close volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 META 2013-01-02 27.4 28.2 27.4 28 69846400 28## 2 META 2013-01-03 27.9 28.5 27.6 27.8 63140600 27.8## 3 META 2013-01-04 28.0 28.9 27.8 28.8 72715400 28.8## 4 META 2013-01-07 28.7 29.8 28.6 29.4 83781800 29.4## 5 META 2013-01-08 29.5 29.6 28.9 29.1 45871300 29.1## 6 META 2013-01-09 29.7 30.6 29.5 30.6 104787700 30.6## 7 META 2013-01-10 30.6 31.5 30.3 31.3 95316400 31.3## 8 META 2013-01-11 31.3 32.0 31.1 31.7 89598000 31.7## 9 META 2013-01-14 32.1 32.2 30.6 31.0 98892800 31.0##10 META 2013-01-15 30.6 31.7 29.9 30.1 173242600 30.1### ℹ 4,022 more rowsSecond, usegroup_by to group by stock symbol. Third,apply the mutation. We can do this in one easy workflow. TheperiodReturn function is applied to each group of stockprices, and a new data frame was returned with the annual returns in thecorrect periodicity.
FANG_returns_yearly<-FANG%>%group_by(symbol)%>%tq_transmute(select=adjusted, mutate_fun=periodReturn, period="yearly", col_rename="yearly.returns")Last, we can visualize the returns.
FANG_returns_yearly%>%ggplot(aes(x=year(date), y=yearly.returns, fill=symbol))+geom_bar(position="dodge", stat="identity")+labs(title="FANG: Annual Returns", subtitle="Mutating at scale is quick and easy!", y="Returns", x="", color="")+scale_y_continuous(labels=scales::percent)+coord_flip()+theme_tq()+scale_fill_tq()
3.0 Modeling Financial Data using purrr
Eventually you will want to begin modeling (or more generallyapplying functions) at scale! One of thebest featuresof thetidyverse is the ability to map functions to nestedtibbles usingpurrr. From the Many Models chapter of “R for Data Science”, we can apply thesame modeling workflow to financial analysis. Using a two stepworkflow:
- Model a single stock
- Scale to many stocks
Let’s go through an example to illustrate.
Example: Applying a Regression Model to Detect a Positive Trend
In this example, we’ll use a simple linear model to identify thetrend in annual returns to determine if the stock returns are decreasingor increasing over time.
Analyze a Single Stock
First, let’s collect stock data withtq_get()
AAPL<-tq_get("AAPL", from="2007-01-01", to="2016-12-31")AAPL### A tibble: 2,518 × 8## symbol date open high low close volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 AAPL 2007-01-03 3.08 3.09 2.92 2.991238319600 2.52## 2 AAPL 2007-01-04 3.00 3.07 2.99 3.06 847260400 2.57## 3 AAPL 2007-01-05 3.06 3.08 3.01 3.04 834741600 2.55## 4 AAPL 2007-01-08 3.07 3.09 3.05 3.05 797106800 2.57## 5 AAPL 2007-01-09 3.09 3.32 3.04 3.313349298400 2.78## 6 AAPL 2007-01-10 3.38 3.49 3.34 3.462952880000 2.91## 7 AAPL 2007-01-11 3.43 3.46 3.40 3.421440252800 2.88## 8 AAPL 2007-01-12 3.38 3.39 3.33 3.381312690400 2.84## 9 AAPL 2007-01-16 3.42 3.47 3.41 3.471244076400 2.91##10 AAPL 2007-01-17 3.48 3.49 3.39 3.391646260000 2.85### ℹ 2,508 more rowsNext, come up with a function to help us collect annual log returns.The function below mutates the stock prices to period returns usingtq_transmute(). We add thetype = "log" andperiod = "monthly" arguments to ensure we retrieve a tibbleof monthly log returns. Last, we take the mean of the monthly returns toget MMLR.
get_annual_returns<-function(stock.returns){stock.returns%>%tq_transmute(select=adjusted, mutate_fun=periodReturn, type="log", period="yearly")}Let’s testget_annual_returns out. We now have theannual log returns over the past ten years.
AAPL_annual_log_returns<-get_annual_returns(AAPL)AAPL_annual_log_returns### A tibble: 10 × 2## date yearly.returns##<date><dbl>## 1 2007-12-31 0.860## 2 2008-12-31 -0.842## 3 2009-12-31 0.904## 4 2010-12-31 0.426## 5 2011-12-30 0.228## 6 2012-12-31 0.282## 7 2013-12-31 0.0776## 8 2014-12-31 0.341## 9 2015-12-31 -0.0306##10 2016-12-30 0.118Let’s visualize to identify trends. We can see from the linear trendline that AAPL’s stock returns are declining.
AAPL_annual_log_returns%>%ggplot(aes(x=year(date), y=yearly.returns))+geom_hline(yintercept=0, color=palette_light()[[1]])+geom_point(size=2, color=palette_light()[[3]])+geom_line(linewidth=1, color=palette_light()[[3]])+geom_smooth(method="lm", se=FALSE)+labs(title="AAPL: Visualizing Trends in Annual Returns", x="", y="Annual Returns", color="")+theme_tq()
Now, we can get the linear model using thelm()function. However, there is one problem: the output is not “tidy”.
#### Call:## lm(formula = yearly.returns ~ year(date), data = AAPL_annual_log_returns)#### Coefficients:## (Intercept) year(date)## 58.86282 -0.02915We can utilize thebroom package to get “tidy” data fromthe model. There’s three primary functions:
augment: adds columns to the original data such aspredictions, residuals and cluster assignmentsglance: provides a one-row summary of model-levelstatisticstidy: summarizes a model’s statistical findings such ascoefficients of a regression
We’ll usetidy to retrieve the model coefficients.
### A tibble: 2 × 5## term estimate std.error statistic p.value##<chr><dbl><dbl><dbl><dbl>##1 (Intercept) 58.9 113. 0.520 0.617##2 year(date) -0.0291 0.0562 -0.518 0.618Adding to our workflow, we have the following:
get_model<-function(stock_data){annual_returns<-get_annual_returns(stock_data)mod<-lm(yearly.returns~year(date), data=annual_returns)tidy(mod)}Testing it out on a single stock. We can see that the “term” thatcontains the direction of the trend (the slope) is “year(date)”. Theinterpretation is that as year increases one unit, the annual returnsdecrease by 3%.
get_model(AAPL)### A tibble: 2 × 5## term estimate std.error statistic p.value##<chr><dbl><dbl><dbl><dbl>##1 (Intercept) 58.9 113. 0.520 0.617##2 year(date) -0.0291 0.0562 -0.518 0.618Now that we have identified the trend direction, it looks like we areready to scale.
Scale to Many Stocks
Once the analysis for one stock is done scale to many stocks issimple. For brevity, we’ll randomly sample ten stocks from theS&P500 with a call todplyr::sample_n().
### A tibble: 5 × 8## symbol company identifier sedol weight sector shares_held local_currency##<chr><chr><chr><chr><dbl><chr><dbl><chr>##1 MTCH MATCH GROUP… 57667L107 BK80… 1.33e-4 - 2839119 USD##2 MCO MOODY S CORP 615369105 2252… 1.27e-3 - 1830057 USD##3 JBL JABIL INC 466313103 2471… 4.12e-4 - 1284541 USD##4 DVN DEVON ENERG… 25179M103 2480… 3.48e-4 - 7516996 USD##5 BXP BXP INC 101121101 2019… 1.77e-4 - 1740920 USDWe can now apply our analysis function to the stocks usingdplyr::mutate() andpurrr::map(). Themutate() function adds a column to our tibble, and themap() function maps our customget_modelfunction to our tibble of stocks using thesymbol column.Thetidyr::unnest() function unrolls the nested data frameso all of the model statistics are accessible in the top data framelevel. Thefilter,arrange andselect steps just manipulate the data frame to isolate andarrange the data for our viewing.
stocks_model_stats<-stocks_tbl%>%select(symbol,company)%>%tq_get(from="2007-01-01", to="2016-12-31")%>%# Nestgroup_by(symbol,company)%>%nest()%>%# Apply the get_model() function to the new "nested" data columnmutate(model=map(data,get_model))%>%# Unnest and collect slopeunnest(model)%>%filter(term=="year(date)")%>%arrange(desc(estimate))%>%select(-term)stocks_model_stats### A tibble: 5 × 7### Groups: symbol, company [5]## symbol company data estimate std.error statistic p.value##<chr><chr><list><dbl><dbl><dbl><dbl>##1 MCO MOODY S CORP<tibble> 0.0669 0.0384 1.74 0.120##2 JBL JABIL INC<tibble> 0.0380 0.0521 0.729 0.487##3 MTCH MATCH GROUP INC<tibble> 0.0246 0.0349 0.703 0.502##4 BXP BXP INC<tibble> 0.0240 0.0255 0.940 0.375##5 DVN DEVON ENERGY CORP<tibble> -0.0127 0.0356 -0.357 0.730We’re done! We now have the coefficient of the linear regression thattracks the direction of the trend line. We can easily extend this typeof analysis to larger lists or stock indexes. For example, the entireS&P500 could be analyzed removing thesample_n()following the call totq_index("SP500").
4.0 Error Handling when Scaling
Eventually you will run into a stock index, stock symbol, FRED datacode, etc that cannot be retrieved. Possible reasons are:
- An index becomes out of date
- A company goes private
- A stock ticker symbol changes
- Yahoo / FRED just doesn’t like your stock symbol / FRED code
This becomes painful when scaling if the functions return errors. So,thetq_get() function is designed to handle errorsgracefully. What this means is anNA value isreturned when an error is generated along with agentle errorwarning.
tq_get("XYZ","stock.prices")### A tibble: 2,502 × 8## symbol date open high low close volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 XYZ 2015-11-19 11.2 14.8 9 13.1 47466100 13.1## 2 XYZ 2015-11-20 13.9 14.1 12.5 12.9 16550300 12.9## 3 XYZ 2015-11-23 13 13.1 12.1 12.1 5172200 12.1## 4 XYZ 2015-11-24 12 12.2 11.5 12.0 4714700 12.0## 5 XYZ 2015-11-25 12.1 12.4 11.9 11.9 3583400 11.9## 6 XYZ 2015-11-27 12.1 12.3 11.9 12.1942300 12.1## 7 XYZ 2015-11-30 12.3 12.4 11.9 12.0 1997100 12.0## 8 XYZ 2015-12-01 12.1 12.2 11.9 11.9 1256100 11.9## 9 XYZ 2015-12-02 12.0 12.2 11.9 11.9 1708100 11.9##10 XYZ 2015-12-03 12.0 12.2 11.9 11.9 1431400 11.9### ℹ 2,492 more rowsPros and Cons to Built-In Error-Handling
There are pros and cons to this approach that you may not agree with,but I believe helps in the long run. Just be aware of what happens:
Pros: Long running scripts are not interruptedbecause of one error
Cons: Errors can be inadvertently handled orflow downstream if the user does not read the warnings
Bad Apples Fail Gracefully, tq_get
Let’s see an example when usingtq_get() to get thestock prices for a long list of stocks with oneBAD APPLE.The argumentcomplete_cases comes in handy. The default isTRUE, which removes “bad apples” so future analysis havecomplete cases to compute on. Note that a gentle warning stating that anerror occurred and was dealt with by removing the rows from theresults.
## Warning: There was 1 warning in `dplyr::mutate()`.##ℹ In argument: `data.. = purrr::map(...)`.## Caused by warning:##! x = 'BAD APPLE', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "BAD APPLE", env = <environment>, : Unable to import "BAD APPLE".## cannot open the connection## Removing BAD APPLE.### A tibble: 5,450 × 8## symbol date open high low close volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 AAPL 2015-01-02 27.8 27.9 26.8 27.3 212818400 24.3## 2 AAPL 2015-01-05 27.1 27.2 26.4 26.6 257142000 23.6## 3 AAPL 2015-01-06 26.6 26.9 26.2 26.6 263188400 23.6## 4 AAPL 2015-01-07 26.8 27.0 26.7 26.9 160423600 23.9## 5 AAPL 2015-01-08 27.3 28.0 27.2 28.0 237458000 24.8## 6 AAPL 2015-01-09 28.2 28.3 27.6 28.0 214798000 24.9## 7 AAPL 2015-01-12 28.1 28.2 27.2 27.3 198603200 24.2## 8 AAPL 2015-01-13 27.9 28.2 27.2 27.6 268367600 24.5## 9 AAPL 2015-01-14 27.3 27.6 27.1 27.5 195826400 24.4##10 AAPL 2015-01-15 27.5 27.5 26.7 26.7 240056000 23.7### ℹ 5,440 more rowsNow switchingcomplete_cases = FALSE will retain anyerrors asNA values in a nested data frame. Notice that theerror message and output change. The error message now states that theNA values exist in the output and the return is a “nested”data structure.
## Warning: There was 1 warning in `dplyr::mutate()`.##ℹ In argument: `data.. = purrr::map(...)`.## Caused by warning:##! x = 'BAD APPLE', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "BAD APPLE", env = <environment>, : Unable to import "BAD APPLE".## cannot open the connection### A tibble: 5,451 × 8## symbol date open high low close volume adjusted##<chr><date><dbl><dbl><dbl><dbl><dbl><dbl>## 1 AAPL 2015-01-02 27.8 27.9 26.8 27.3 212818400 24.3## 2 AAPL 2015-01-05 27.1 27.2 26.4 26.6 257142000 23.6## 3 AAPL 2015-01-06 26.6 26.9 26.2 26.6 263188400 23.6## 4 AAPL 2015-01-07 26.8 27.0 26.7 26.9 160423600 23.9## 5 AAPL 2015-01-08 27.3 28.0 27.2 28.0 237458000 24.8## 6 AAPL 2015-01-09 28.2 28.3 27.6 28.0 214798000 24.9## 7 AAPL 2015-01-12 28.1 28.2 27.2 27.3 198603200 24.2## 8 AAPL 2015-01-13 27.9 28.2 27.2 27.6 268367600 24.5## 9 AAPL 2015-01-14 27.3 27.6 27.1 27.5 195826400 24.4##10 AAPL 2015-01-15 27.5 27.5 26.7 26.7 240056000 23.7### ℹ 5,441 more rowsIn both cases, the prudent user will review the warnings to determinewhat happened and whether or not this is acceptable. In thecomplete_cases = FALSE example, if the user attempts toperform downstream computations at scale, the computations will likelyfail grinding the analysis to a halt. But, the advantage is that theuser will more easily be able to filter to the problem root to determinewhat happened and decide whether this is acceptable or not.