This package provides access to data frames of values from the COVIDcast endpoint of the Epidata API. Using the covidcast_signal() function, you can fetch any data you may be interested in analyzing, then use plot.covidcast_signal() to make plots and maps. Since the data is provided as a simple data frame, you can also wrangle it into whatever form you need to conduct your desired analyses using other packages and functions.
This package is available on CRAN, so the easiest way to install it is simply:

install.packages("covidcast")

To obtain smoothed estimates of COVID-like illness from our COVID-19 Trends and Impact Survey for every county in the United States between 2020-05-01 and 2020-05-07, we can use covidcast_signal():
library(covidcast)
library(dplyr)
cli <- covidcast_signal(data_source = "fb-survey", signal = "smoothed_wcli", start_day = "2020-05-01", end_day = "2020-05-07", geo_type = "county")
knitr::kable(head(cli))

| data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fb-survey | smoothed_wcli | 01000 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.8260625 | 0.1341381 | 1676.2773 |
| fb-survey | smoothed_wcli | 01001 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 1.0707790 | 0.8213119 | 109.0866 |
| fb-survey | smoothed_wcli | 01003 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.5081644 | 0.2800777 | 572.3194 |
| fb-survey | smoothed_wcli | 01015 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.5277609 | 0.5192431 | 118.8275 |
| fb-survey | smoothed_wcli | 01031 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.3733811 | 0.3367309 | 112.2687 |
| fb-survey | smoothed_wcli | 01045 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 1.2369542 | 0.6464530 | 108.5803 |
covidcast_signal() returns a data frame. (Here we’re using knitr::kable() to make it more readable.) Each row represents one observation in one county on one day. The county FIPS code is given in the geo_value column, the date in the time_value column. Here value is the requested signal (in this case, the smoothed estimate of the percentage of people with COVID-like illness, based on the symptom surveys), and stderr is its standard error. See the covidcast_signal() documentation for details on the returned data frame.
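As a quick illustration of how value and stderr might be used together, the sketch below rebuilds three rows of the table above as a plain data frame and adds an approximate 95% interval. The normal approximation and the column names lower and upper are assumptions made for illustration; the package does not compute them.

```r
# Stand-in for a few rows of `cli`, copied from the table above.
cli_rows <- data.frame(
  geo_value = c("01000", "01001", "01003"),
  value     = c(0.8260625, 1.0707790, 0.5081644),
  stderr    = c(0.1341381, 0.8213119, 0.2800777),
  stringsAsFactors = FALSE
)

# Approximate 95% interval under a normal approximation (an assumption).
z <- qnorm(0.975)
# The signal is a percentage, so the lower bound is clipped at zero.
cli_rows$lower <- pmax(cli_rows$value - z * cli_rows$stderr, 0)
cli_rows$upper <- cli_rows$value + z * cli_rows$stderr

print(cli_rows)
```

The same two assignments should work on the full cli data frame, since it carries the same value and stderr columns.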
To get a basic summary of the returned data frame:
summary(cli)

A `covidcast_signal` dataframe with 7030 rows and 15 columns.
data_source : fb-survey
signal : smoothed_wcli
geo_type : county
first date : 2020-05-01
last date : 2020-05-07
median number of geo_values per day : 1015

The COVIDcast API makes estimates available at several different geographic levels, and covidcast_signal() defaults to requesting county-level data. To request estimates for states instead of counties, we use the geo_type argument:
cli <- covidcast_signal(data_source = "fb-survey", signal = "smoothed_wcli", start_day = "2020-05-01", end_day = "2020-05-07", geo_type = "state")
knitr::kable(head(cli))

| data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fb-survey | smoothed_wcli | ak | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.3661909 | 0.1469918 | 1560.000 |
| fb-survey | smoothed_wcli | al | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.7764020 | 0.1010989 | 7360.237 |
| fb-survey | smoothed_wcli | ar | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.7065584 | 0.1051584 | 4781.483 |
| fb-survey | smoothed_wcli | az | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.6025853 | 0.0724214 | 10973.073 |
| fb-survey | smoothed_wcli | ca | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.4139045 | 0.0306336 | 50482.138 |
| fb-survey | smoothed_wcli | co | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.5984794 | 0.0717395 | 9888.894 |
One can also select a specific geographic region by its ID, using the geo_values argument. For example, 42003 is the FIPS code for Allegheny County, Pennsylvania:

cli <- covidcast_signal(data_source = "fb-survey", signal = "smoothed_wcli", start_day = "2020-05-01", end_day = "2020-05-07", geo_type = "county", geo_values = "42003")
knitr::kable(head(cli))

| data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fb-survey | smoothed_wcli | 42003 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.6270520 | 0.2511377 | 2554.564 |
| fb-survey | smoothed_wcli | 42003 | 2020-05-02 | fb-survey | county | day | 2020-09-03 | 124 | 0 | 0 | 0 | 0.6453498 | 0.2599037 | 2509.176 |
| fb-survey | smoothed_wcli | 42003 | 2020-05-03 | fb-survey | county | day | 2020-09-03 | 123 | 0 | 0 | 0 | 0.5523067 | 0.2497662 | 2473.456 |
| fb-survey | smoothed_wcli | 42003 | 2020-05-04 | fb-survey | county | day | 2020-09-03 | 122 | 0 | 0 | 0 | 0.1430772 | 0.0804642 | 2493.730 |
| fb-survey | smoothed_wcli | 42003 | 2020-05-05 | fb-survey | county | day | 2020-09-03 | 121 | 0 | 0 | 0 | 0.1861889 | 0.0960907 | 2415.204 |
| fb-survey | smoothed_wcli | 42003 | 2020-05-06 | fb-survey | county | day | 2020-09-03 | 120 | 0 | 0 | 0 | 0.3124150 | 0.1218194 | 2465.422 |
By default, this package submits queries to the API anonymously. All the examples in the package documentation are compatible with anonymous use of the API, but there are some limits on anonymous queries, including rate limits on the number of queries that can be submitted per hour. To lift these limits, see the “API keys” section of the covidcast_signal() documentation for information on how to register for and use an API key.
This package provides convenient functions for plotting and mapping these signals. For example, simple line charts are easy to construct:

plot(cli, plot_type = "line", title = "Survey results in Allegheny County, PA")
For more details and examples, including choropleth and bubble maps, see vignette("plotting-signals").
Above we used data from Delphi’s symptom surveys, but the COVIDcast API includes numerous data streams: medical claims data, cases and deaths, mobility, and many others; new signals are added regularly. This can make it a challenge to find the data stream that you are most interested in.
The COVIDcast Data Sources and Signals documentation lists all data sources and signals available through COVIDcast. When you find a signal of interest, get the data source name (such as jhu-csse or fb-survey) and the signal name (such as confirmed_incidence_num or smoothed_wcli). These are provided as arguments to covidcast_signal() to request the data you want.
The COVIDcast API identifies counties by their 5-digit FIPS code and metropolitan areas by their CBSA ID number. (See the geographic coding documentation for details.) This means that to query a specific county or metropolitan area, we must have some way to quickly find its identifier.
This package includes several utilities intended to make the process easier. For example, if we look at ?county_census, we find that the package provides census data (such as population) on every county in the United States, including its FIPS code. Similarly, by looking at ?msa_census we can find data about metropolitan statistical areas, their corresponding CBSA IDs, and recent census data.
(Note: the msa_census data includes types of area beyond metropolitan statistical areas, including micropolitan statistical areas. The LSAD column identifies the type of each area. The COVIDcast API only provides estimates for metropolitan statistical areas, not for their divisions or for micropolitan areas.)
Building on these datasets, the convenience functions name_to_fips() and name_to_cbsa() conduct grep()-based searching of county or metropolitan area names to find FIPS or CBSA codes, respectively:
name_to_fips("Allegheny")

Allegheny County
         "42003"

name_to_cbsa("Pittsburgh")

Pittsburgh, PA
       "38300"

Since these functions return vectors of IDs, we can use them to construct the geo_values argument to covidcast_signal() to select specific regions to query.
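A minimal sketch of that pattern follows. The wrapper fetch_cli_by_county_name() is a hypothetical helper introduced only for illustration and is not part of the package. It reuses the data source, signal, and dates from the earlier examples, and calling it requires network access to the API.

```r
library(covidcast)

# Hypothetical helper (not part of the package): fetch one week of the
# survey signal for the county or counties matching `county_name`.
fetch_cli_by_county_name <- function(county_name) {
  # name_to_fips() returns a named character vector of FIPS codes;
  # the names are dropped before passing the codes on.
  fips <- unname(name_to_fips(county_name))
  covidcast_signal(data_source = "fb-survey", signal = "smoothed_wcli",
                   start_day = "2020-05-01", end_day = "2020-05-07",
                   geo_type = "county", geo_values = fips)
}
```

Calling fetch_cli_by_county_name("Allegheny") should request the same county as the explicit query for "42003" shown earlier.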
You may also want to convert FIPS codes or CBSA IDs back to well-known names, for instance to report in tables or graphics. The package provides inverse mappings county_fips_to_name() and cbsa_to_name() that work in the analogous way:

county_fips_to_name("42003")

             42003
"Allegheny County"

cbsa_to_name("38300")

           38300
"Pittsburgh, PA"

See their documentation for more details (for example, the options for handling matches when counties have the same name).
If we are interested in exploring the available signals and their metadata, we can use covidcast_meta() to fetch a data frame of the available signals:

meta <- covidcast_meta()
knitr::kable(head(meta))

| data_source | signal | time_type | geo_type | min_time | max_time | num_locations | min_value | max_value | mean_value | stdev_value | last_update | max_issue | min_lag | max_lag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chng | smoothed_adj_outpatient_cli | day | county | 2020-02-01 | 2023-02-14 | 3118 | 0.0009331 | 99.92012 | 2.227852 | 3.843579 | 1683566979 | 2023-02-19 | 3 | 674 |
| chng | smoothed_adj_outpatient_cli | day | hhs | 2020-02-01 | 2023-06-14 | 10 | 0.0061953 | 20.77577 | 2.530157 | 2.531306 | 1687231582 | 2023-06-19 | 5 | 674 |
| chng | smoothed_adj_outpatient_cli | day | hrr | 2020-02-01 | 2023-06-14 | 306 | 0.0010292 | 50.81590 | 2.350355 | 2.763442 | 1687231582 | 2023-06-19 | 5 | 674 |
| chng | smoothed_adj_outpatient_cli | day | msa | 2020-02-01 | 2023-06-14 | 392 | 0.0007662 | 99.99898 | 2.153204 | 3.000248 | 1687231583 | 2023-06-19 | 5 | 674 |
| chng | smoothed_adj_outpatient_cli | day | nation | 2020-02-01 | 2023-06-14 | 1 | 0.0154639 | 12.08697 | 2.778260 | 2.344107 | 1687231583 | 2023-06-19 | 5 | 674 |
| chng | smoothed_adj_outpatient_cli | day | state | 2020-02-01 | 2023-06-14 | 55 | 0.0013343 | 33.23859 | 2.264207 | 2.563880 | 1687231583 | 2023-06-19 | 5 | 674 |
The covidcast_meta() documentation describes the columns and their meanings. The metadata data frame can be filtered and sliced as desired to obtain information about signals of interest. To get a basic summary of the metadata:
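As an example of the filtering mentioned above, the sketch below works on a small stand-in data frame built from the rows of the metadata table shown earlier, so it runs without network access. Only a few of the columns are reproduced, and the name meta_rows is introduced here for illustration.

```r
# Stand-in for part of `meta`, copied from the rows shown above.
meta_rows <- data.frame(
  data_source   = "chng",
  signal        = "smoothed_adj_outpatient_cli",
  geo_type      = c("county", "hhs", "hrr", "msa", "nation", "state"),
  max_time      = as.Date(c("2023-02-14", rep("2023-06-14", 5))),
  num_locations = c(3118, 10, 306, 392, 1, 55),
  stringsAsFactors = FALSE
)

# Keep only the state-level entry for this signal.
state_meta <- subset(
  meta_rows,
  signal == "smoothed_adj_outpatient_cli" & geo_type == "state"
)

print(state_meta)
```

Assuming the real metadata behaves like an ordinary data frame with these column names, the same subset() call should apply to meta directly.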
summary(meta)

(We silenced the evaluation because the output of summary() here is still quite long.)
The COVIDcast API records not just each signal’s estimate for a given location on a given day, but also when that estimate was made, and all updates to that estimate.
For example, consider using our doctor visits signal, which estimates the percentage of outpatient doctor visits that are COVID-related, and consider a result row with time_value 2020-05-01 for geo_values = "pa". This is an estimate for the percentage in Pennsylvania on May 1, 2020. That estimate was issued on May 5, 2020, the delay being due to the aggregation of data by our source and the time taken by the COVIDcast API to ingest the data provided. Later, the estimate for May 1st could be updated, perhaps because additional visit data from May 1st arrived at our source and was reported to us. This constitutes a new issue of the data.
By default, covidcast_signal() fetches the most recent issue available. This is the best option for users who simply want to graph the latest data or construct dashboards. But if we are interested in knowing when data was reported, we can request specific data versions using the as_of, issues, or lag arguments. (Note these are mutually exclusive; only one can be specified at a time.)
First, we can request the data that was available as of a specific date, using the as_of argument:
covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli", start_day = "2020-05-01", end_day = "2020-05-01", geo_type = "state", geo_values = "pa", as_of = "2020-05-07")

A `covidcast_signal` dataframe with 1 rows and 15 columns.
data_source : doctor-visits
signal : smoothed_adj_cli
geo_type : state

    data_source           signal geo_value time_value        source geo_type
1 doctor-visits smoothed_adj_cli        pa 2020-05-01 doctor-visits    state
  time_type      issue lag missing_value missing_stderr missing_sample_size
1       day 2020-05-07   6             0              5                   5
     value stderr sample_size
1 2.581509     NA          NA

This shows that an estimate of about 2.6% was issued on May 7. If we don’t specify as_of, we get the most recent estimate available:
covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli", start_day = "2020-05-01", end_day = "2020-05-01", geo_type = "state", geo_values = "pa")

A `covidcast_signal` dataframe with 1 rows and 15 columns.
data_source : doctor-visits
signal : smoothed_adj_cli
geo_type : state

    data_source           signal geo_value time_value        source geo_type
1 doctor-visits smoothed_adj_cli        pa 2020-05-01 doctor-visits    state
  time_type      issue lag missing_value missing_stderr missing_sample_size
1       day 2020-07-04  64             0              5                   5
     value stderr sample_size
1 5.973572     NA          NA

Note the substantial change in the estimate, to over 5%, reflecting new data that became available after May 7 about visits occurring on May 1. This illustrates the importance of issue date tracking, particularly for forecasting tasks. To backtest a forecasting model on past data, it is important to use the data that would have been available at the time, not data that arrived much later.
By using the issues argument, we can request all issues in a certain time period:

covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli", start_day = "2020-05-01", end_day = "2020-05-01", geo_type = "state", geo_values = "pa", issues = c("2020-05-01", "2020-05-15")) %>% knitr::kable()

| data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-07 | 6 | 0 | 5 | 5 | 2.581509 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-08 | 7 | 0 | 5 | 5 | 3.278896 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-09 | 8 | 0 | 5 | 5 | 3.321781 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-12 | 11 | 0 | 5 | 5 | 3.588683 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-13 | 12 | 0 | 5 | 5 | 3.631978 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-14 | 13 | 0 | 5 | 5 | 3.658009 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-15 | 14 | 0 | 5 | 5 | 3.662286 | NA | NA |
This estimate was clearly updated many times as new data for May 1st arrived. Note that these results include only data issued or updated between 2020-05-01 and 2020-05-15. If a value was first reported on 2020-04-15, and never updated, a query for issues between 2020-05-01 and 2020-05-15 will not include that value among its results.
After fetching multiple issues of data, we can use the latest_issue() or earliest_issue() functions to subset the data frame to view only the latest or earliest issue of each observation.
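To make the relationship between a multi-issue result and an "as of" view concrete, here is a minimal base R sketch. It rebuilds the issue history of the May 1st estimate as a small data frame and picks, for each time_value, the latest issue made on or before a chosen date. The helper snapshot_as_of() is hypothetical and introduced only for illustration. It does not use latest_issue() or the as_of argument, and it can only see issues present in the data frame it is given.

```r
# Stand-in for the multi-issue result above (only the columns needed here).
issues_df <- data.frame(
  time_value = as.Date("2020-05-01"),
  issue = as.Date(c("2020-05-07", "2020-05-08", "2020-05-09", "2020-05-12",
                    "2020-05-13", "2020-05-14", "2020-05-15")),
  value = c(2.581509, 3.278896, 3.321781, 3.588683,
            3.631978, 3.658009, 3.662286)
)

# Hypothetical helper: for each time_value, keep the most recent issue
# made on or before `as_of`.
snapshot_as_of <- function(df, as_of) {
  as_of <- as.Date(as_of)
  known <- df[df$issue <= as_of, , drop = FALSE]
  if (nrow(known) == 0) {
    return(known)  # nothing had been issued yet
  }
  known <- known[order(known$time_value, known$issue), , drop = FALSE]
  # fromLast = TRUE marks every row except the latest issue per time_value
  known[!duplicated(known$time_value, fromLast = TRUE), , drop = FALSE]
}

snapshot_as_of(issues_df, "2020-05-10")  # the issue from 2020-05-09
snapshot_as_of(issues_df, "2020-05-06")  # zero rows: no issue existed yet
```

For backtesting, a forecaster would apply this kind of selection with as_of set to the date on which the forecast would have been made.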
Finally, we can use the lag argument to request only data reported with a certain lag. For example, requesting a lag of 7 days means requesting only issues made exactly 7 days after the corresponding time_value:

covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli", start_day = "2020-05-01", end_day = "2020-05-07", geo_type = "state", geo_values = "pa", lag = 7) %>% knitr::kable()

Warning: Data not fetched for the following days: 2020-05-01

| data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-08 | 7 | 0 | 5 | 5 | 3.278896 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-02 | doctor-visits | state | day | 2020-05-09 | 7 | 0 | 5 | 5 | 3.225292 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-05 | doctor-visits | state | day | 2020-05-12 | 7 | 0 | 5 | 5 | 2.779908 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-06 | doctor-visits | state | day | 2020-05-13 | 7 | 0 | 5 | 5 | 2.557698 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-07 | doctor-visits | state | day | 2020-05-14 | 7 | 0 | 5 | 5 | 2.191677 | NA | NA |
Note that though this query requested all values between 2020-05-01 and 2020-05-07, May 3rd and May 4th were not included in the result set. This is because the query will only include a result for May 3rd if a value was issued on May 10th (a 7-day lag), but in fact the value was not updated on that day:

covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli", start_day = "2020-05-03", end_day = "2020-05-03", geo_type = "state", geo_values = "pa", issues = c("2020-05-09", "2020-05-15")) %>% knitr::kable()

| data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| doctor-visits | smoothed_adj_cli | pa | 2020-05-03 | doctor-visits | state | day | 2020-05-09 | 6 | 0 | 5 | 5 | 2.788618 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-03 | doctor-visits | state | day | 2020-05-12 | 9 | 0 | 5 | 5 | 3.015368 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-03 | doctor-visits | state | day | 2020-05-13 | 10 | 0 | 5 | 5 | 3.039310 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-03 | doctor-visits | state | day | 2020-05-14 | 11 | 0 | 5 | 5 | 3.021245 | NA | NA |
| doctor-visits | smoothed_adj_cli | pa | 2020-05-03 | doctor-visits | state | day | 2020-05-15 | 12 | 0 | 5 | 5 | 3.048725 | NA | NA |
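The arithmetic behind the lag column can be checked by hand. The sketch below rebuilds the May 3rd issue history from the table above as a stand-in data frame (may3 is a name introduced here), recomputes the lag as the number of days between time_value and issue, and confirms that no issue has a lag of exactly 7.

```r
# Stand-in for the May 3rd issue history shown above.
may3 <- data.frame(
  time_value = as.Date("2020-05-03"),
  issue = as.Date(c("2020-05-09", "2020-05-12", "2020-05-13",
                    "2020-05-14", "2020-05-15")),
  value = c(2.788618, 3.015368, 3.039310, 3.021245, 3.048725)
)

# Lag is the number of days between the observation date and the issue date.
may3$lag <- as.integer(may3$issue - may3$time_value)
print(may3$lag)

# A lag = 7 query keeps only rows whose lag is exactly 7; here there are none,
# which is why May 3rd is absent from the lag = 7 results.
print(nrow(may3[may3$lag == 7, ]))
```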