| Title: | Processing Agro-Environmental Data |
| Version: | 0.2.0 |
| Description: | A set of tools for processing and analyzing data developed in the context of the "Who Has Eaten the Planet" (WHEP) project, funded by the European Research Council (ERC). For more details on the multi-regional input–output model "Food and Agriculture Biomass Input–Output" (FABIO) see Bruckner et al. (2019) <doi:10.1021/acs.est.9b03554>. |
| License: | MIT + file LICENSE |
| Imports: | cli, dplyr, fs, FAOSTAT, httr, mipfp, nanoparquet, pins, purrr, readr, rlang, stringr, tidyr, withr, yaml, zoo |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| Suggests: | ggplot2, googlesheets4, here, knitr, pointblank, rmarkdown, testthat (≥ 3.0.0), tibble |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| URL: | https://eduaguilera.github.io/whep/, https://github.com/eduaguilera/whep |
| BugReports: | https://github.com/eduaguilera/whep/issues |
| Depends: | R (≥ 4.2.0) |
| LazyData: | true |
| NeedsCompilation: | no |
| Packaged: | 2025-10-15 13:57:15 UTC; catalin |
| Author: | Catalin Covaci |
| Maintainer: | Catalin Covaci <catalin.covaci@csic.es> |
| Repository: | CRAN |
| Date/Publication: | 2025-10-15 15:20:02 UTC |
whep: Processing Agro-Environmental Data
Description

A set of tools for processing and analyzing data developed in the context of the "Who Has Eaten the Planet" (WHEP) project, funded by the European Research Council (ERC). For more details on the multi-regional input–output model "Food and Agriculture Biomass Input–Output" (FABIO) see Bruckner et al. (2019) <doi:10.1021/acs.est.9b03554>.
Author(s)
Maintainer: Catalin Covaci <catalin.covaci@csic.es> (ORCID)
Authors:
Eduardo Aguilera <eduardo.aguilera@csic.es> (ORCID) [copyright holder]
Other contributors:
João Serra <jserra@agro.au.dk> (ORCID) [contributor]
European Research Council [funder]
See Also
Useful links:
Report bugs at https://github.com/eduaguilera/whep/issues
Get area codes from area names
Description
Add a new column to an existing tibble with the corresponding code for each name. The codes are assumed to be from those defined by the FABIO model.
Usage
add_area_code(table, name_column = "area_name", code_column = "area_code")
Arguments
table | The table that will be modified with a new column. |
name_column | The name of the column in table containing the area names. |
code_column | The name of the output column containing the codes. |
Value
A tibble with all the contents of table and an extra column named code_column, which contains the codes. If there is no code match, an NA is included.
Examples
table <- tibble::tibble(
  area_name = c("Armenia", "Afghanistan", "Dummy Country", "Albania")
)
add_area_code(table)
table |>
  dplyr::rename(my_area_name = area_name) |>
  add_area_code(name_column = "my_area_name")
add_area_code(table, code_column = "my_custom_code")
Get area names from area codes
Description
Add a new column to an existing tibble with the corresponding name for each code. The codes are assumed to be from those defined by the FABIO model, which themselves come from FAOSTAT internal codes. Equivalences with ISO 3166-1 numeric can be found in the Area Codes CSV from the zip file that can be downloaded from FAOSTAT. TODO: Think about this; it would be nice to use ISO3 codes but they won't be enough for our periods.
Usage
add_area_name(table, code_column = "area_code", name_column = "area_name")
Arguments
table | The table that will be modified with a new column. |
code_column | The name of the column in table containing the area codes. |
name_column | The name of the output column containing the names. |
Value
A tibble with all the contents of table and an extra column named name_column, which contains the names. If there is no name match, an NA is included.
Examples
table <- tibble::tibble(area_code = c(1, 2, 4444, 3))
add_area_name(table)
table |>
  dplyr::rename(my_area_code = area_code) |>
  add_area_name(code_column = "my_area_code")
add_area_name(table, name_column = "my_custom_name")
Get commodity balance sheet item codes from item names
Description
Add a new column to an existing tibble with the corresponding code for each commodity balance sheet item name. The codes are assumed to be from those defined by FAOSTAT.
Usage
add_item_cbs_code(
  table,
  name_column = "item_cbs_name",
  code_column = "item_cbs_code"
)
Arguments
table | The table that will be modified with a new column. |
name_column | The name of the column in table containing the item names. |
code_column | The name of the output column containing the codes. |
Value
A tibble with all the contents of table and an extra column named code_column, which contains the codes. If there is no code match, an NA is included.
Examples
table <- tibble::tibble(
  item_cbs_name = c("Cottonseed", "Eggs", "Dummy Item")
)
add_item_cbs_code(table)
table |>
  dplyr::rename(my_item_cbs_name = item_cbs_name) |>
  add_item_cbs_code(name_column = "my_item_cbs_name")
add_item_cbs_code(table, code_column = "my_custom_code")
Get commodity balance sheet item names from item codes
Description
Add a new column to an existing tibble with the corresponding name for each commodity balance sheet item code. The codes are assumed to be from those defined by FAOSTAT.
Usage
add_item_cbs_name(
  table,
  code_column = "item_cbs_code",
  name_column = "item_cbs_name"
)
Arguments
table | The table that will be modified with a new column. |
code_column | The name of the column in table containing the item codes. |
name_column | The name of the output column containing the names. |
Value
A tibble with all the contents of table and an extra column named name_column, which contains the names. If there is no name match, an NA is included.
Examples
table <- tibble::tibble(item_cbs_code = c(2559, 2744, 9876))
add_item_cbs_name(table)
table |>
  dplyr::rename(my_item_cbs_code = item_cbs_code) |>
  add_item_cbs_name(code_column = "my_item_cbs_code")
add_item_cbs_name(table, name_column = "my_custom_name")
Get production item codes from item names
Description
Add a new column to an existing tibble with the corresponding code for each production item name. The codes are assumed to be from those defined by FAOSTAT.
Usage
add_item_prod_code(
  table,
  name_column = "item_prod_name",
  code_column = "item_prod_code"
)
Arguments
table | The table that will be modified with a new column. |
name_column | The name of the column in table containing the item names. |
code_column | The name of the output column containing the codes. |
Value
A tibble with all the contents of table and an extra column named code_column, which contains the codes. If there is no code match, an NA is included.
Examples
table <- tibble::tibble(
  item_prod_name = c("Rice", "Cabbages", "Dummy Item")
)
add_item_prod_code(table)
table |>
  dplyr::rename(my_item_prod_name = item_prod_name) |>
  add_item_prod_code(name_column = "my_item_prod_name")
add_item_prod_code(table, code_column = "my_custom_code")
Get production item names from item codes
Description
Add a new column to an existing tibble with the corresponding name for each production item code. The codes are assumed to be from those defined by FAOSTAT.
Usage
add_item_prod_name(
  table,
  code_column = "item_prod_code",
  name_column = "item_prod_name"
)
Arguments
table | The table that will be modified with a new column. |
code_column | The name of the column in table containing the item codes. |
name_column | The name of the output column containing the names. |
Value
A tibble with all the contents of table and an extra column named name_column, which contains the names. If there is no name match, an NA is included.
Examples
table <- tibble::tibble(item_prod_code = c(27, 358, 12345))
add_item_prod_name(table)
table |>
  dplyr::rename(my_item_prod_code = item_prod_code) |>
  add_item_prod_name(code_column = "my_item_prod_code")
add_item_prod_name(table, name_column = "my_custom_name")
Supply and use tables
Description
Create a table with processes, their inputs (use) and their outputs (supply).
Usage
build_supply_use(
  cbs_version = NULL,
  feed_intake_version = NULL,
  primary_prod_version = NULL,
  primary_residues_version = NULL,
  processing_coefs_version = NULL
)
Arguments
cbs_version | File version passed to get_wide_cbs(). |
feed_intake_version | File version passed to get_feed_intake(). |
primary_prod_version | File version passed to get_primary_production(). |
primary_residues_version | File version passed to get_primary_residues(). |
processing_coefs_version | File version passed to get_processing_coefs(). |
Value
A tibble with the supply and use data for processes. It contains the following columns:
year: The year in which the recorded event occurred.
area_code: The code of the country where the data is from. For code details see e.g. add_area_name().
proc_group: The type of process taking place. It can be one of:
  crop_production: Production of crops and their residues, e.g. rice production, coconut production, etc.
  husbandry: Animal husbandry, e.g. dairy cattle husbandry, non-dairy cattle husbandry, layers chickens farming, etc.
  processing: Derived subproducts obtained from processing other items. The items used as inputs are those that have a non-zero processing use in the commodity balance sheet. See get_wide_cbs() for more details. In each process there is a single input. In some processes like olive oil extraction or soyabean oil extraction this might make sense. Others like alcohol production need multiple inputs (e.g. multiple crops work), so in this data there would not be a process like alcohol production but rather a virtual process like 'Wheat and products processing', giving all its possible outputs. This is a constraint because of how the data was obtained and might be improved in the future. See get_processing_coefs() for more details.
proc_cbs_code: The code of the main item in the process taking place. Together with proc_group, these two columns uniquely represent a process. The main item is predictable depending on the value of proc_group:
  crop_production: The code is from the item for which seed usage (if any) is reported in the commodity balance sheet (see get_wide_cbs() for more). For example, the rice code for a rice production process or the cottonseed code for the cotton production one.
  husbandry: The code of the farmed animal, e.g. bees for beekeeping, non-dairy cattle for non-dairy cattle husbandry, etc.
  processing: The code of the item that is used as input, i.e., the one that is processed to get other derived products. This uniquely defines a process within the group because of the nature of the data that was used, which you can see in get_processing_coefs().
For code details see e.g. add_item_cbs_name().
item_cbs_code: The code of the item produced or used in the process. Note that this might be the same value as proc_cbs_code, e.g., in a rice production process for the row defining the amount of rice produced or the amount of rice seed as input, but it might also have a different value, e.g. for the row defining the amount of straw residue from rice production. For code details see e.g. add_item_cbs_name().
type: Can have two values:
  use: The given item is an input of the process.
  supply: The given item is an output of the process.
value: Quantity in tonnes.
Examples
# Note: These are smaller samples to show outputs, not the real data.
# For all data, call the function with default versions (i.e. no arguments).
build_supply_use(
  cbs_version = "example",
  feed_intake_version = "example",
  primary_prod_version = "example",
  primary_residues_version = "example",
  processing_coefs_version = "example"
)
Trade data sources
Description
Convert a dataframe where each row has a year range into one where each row is a single year, effectively 'expanding' the whole year range.
Usage
expand_trade_sources(trade_sources)
Arguments
trade_sources | A tibble dataframe where each row contains the year range. |
Value
A tibble dataframe where each row corresponds to a single year fora given source.
Examples
trade_sources <- tibble::tibble(
  Name = c("a", "b", "c"),
  Trade = c("t1", "t2", "t3"),
  Info_Format = c("year", "partial_series", "year"),
  Timeline_Start = c(1, 1, 2),
  Timeline_End = c(3, 4, 5),
  Timeline_Freq = c(1, 1, 2),
  `Imp/Exp` = "Imp",
  SACO_link = NA
)
expand_trade_sources(trade_sources)
Bilateral trade data
Description
Reports trade between pairs of countries in given years.
Usage
get_bilateral_trade(trade_version = NULL, cbs_version = NULL)Arguments
trade_version | File version used for bilateral trade input. See whep_inputs for version details. |
cbs_version | File version passed to get_wide_cbs(). |
Value
A tibble with the reported trade between countries. For efficient memory usage, the tibble is not exactly in tidy format. It contains the following columns:
year: The year in which the recorded event occurred.
item_cbs_code: FAOSTAT internal code for the item that is being traded. For code details see e.g. add_item_cbs_name().
bilateral_trade: Square matrix of NxN dimensions, where N is the total number of countries being considered. The matrix row and column names are exactly equal and they represent country codes:
  Row name: The code of the country where the data is from (the exporter). For code details see e.g. add_area_name().
  Column name: FAOSTAT internal code for the country that is importing the item. See row name explanation above.
If m is the matrix, the value at m["A", "B"] is the trade in tonnes from country "A" to country "B", for the corresponding year and item. The matrix can be considered balanced. This means:
  The sum of all values from row "A", where "A" is any country, should match the total exports from country "A" reported in the commodity balance sheet (which is considered more accurate for totals).
  The sum of all values from column "A", where "A" is any country, should match the total imports into country "A" reported in the commodity balance sheet (which is considered more accurate for totals).
The sums may not be exactly the expected values because of precision issues and/or the iterative proportional fitting algorithm not converging fast enough, but they should be very close to the desired totals.
The step by step approach to obtain this data tries to follow the FABIO model and is explained below. All the steps are performed separately for each group of year and item.
From the FAOSTAT reported bilateral trade, there are sometimes two values for one trade flow: the exported amount claimed by the reporter country and the import amount claimed by the partner country. Here, the export data was preferred, i.e., if country "A" says it exported X tonnes to country "B" but country "B" claims they got Y tonnes from country "A", we trust the export data X. This choice is only needed if there exists a reported amount from both sides. Otherwise, the single existing report is chosen.
Complete the country data, that is, add any missing combinations of country trade with NAs, which will be estimated later. In matrix form, this doesn't increase the memory usage since we had to build a matrix anyway (for the balancing algorithm), and the empty parts also take up memory. This is also done for total imports/exports from the commodity balance sheet, but these are directly filled with 0s instead.
The total imports and exports from the commodity balance sheet are balanced by downscaling the larger of the two to match the smaller. This is done in the following way:
If total_imports > total_exports: set import as total_exports * import / total_imports.
If total_exports > total_imports: set export as total_imports * export / total_exports.
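The downscaling step above can be sketched in a few lines of base R (a minimal illustration with hypothetical numbers, not the package's actual implementation):

```r
# Hypothetical per-country totals (tonnes) for one year and item.
imports <- c(A = 30, B = 50, C = 20)  # total 100
exports <- c(A = 40, B = 20, C = 20)  # total 80

total_imports <- sum(imports)
total_exports <- sum(exports)

if (total_imports > total_exports) {
  # Downscale each import proportionally so both sides share the same total.
  imports <- total_exports * imports / total_imports
} else if (total_exports > total_imports) {
  exports <- total_imports * exports / total_exports
}

all.equal(sum(imports), sum(exports))  # TRUE: both sides now total 80
```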
The missing data in the matrix must be estimated. It's done like this:
For each pair of exporter i and importer j, we estimate a bilateral trade m[i, j] using the export shares of i and import shares of j from the commodity balance sheet:
  est_1 <- exports[i] * imports[j] / sum(imports), i.e., total exports of country i spread among other countries' import shares.
  est_2 <- imports[j] * exports[i] / sum(exports), i.e., total imports of country j spread among other countries' export shares.
  est <- (est_1 + est_2) / 2, i.e., the mean of both estimates.
In the above computations, exports and imports are the original values before they were balanced.
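The two share-based estimates and their mean can be illustrated with hypothetical numbers in base R (the real computation runs over full matrices):

```r
# Original (unbalanced) totals from the commodity balance sheet, in tonnes.
exports <- c(A = 40, B = 20, C = 20)
imports <- c(A = 30, B = 50, C = 20)

# Estimate the flow from exporter i to importer j.
estimate_flow <- function(i, j) {
  # i's total exports spread among other countries' import shares.
  est_1 <- exports[[i]] * imports[[j]] / sum(imports)
  # j's total imports spread among other countries' export shares.
  est_2 <- imports[[j]] * exports[[i]] / sum(exports)
  (est_1 + est_2) / 2
}

estimate_flow("A", "B")  # (20 + 25) / 2 = 22.5
```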
The estimates for data that already existed (i.e. non-NA) are discarded. For the ones left, for each row (i.e. exporter country), we get the difference between its balanced total export and the sum of original non-estimated data. The result is the gap we can actually fill with estimates, so as to not get past the reported total export. If the sum of non-discarded estimates is larger, it must be downscaled and spread by computing gap * non_discarded_estimate / sum(non_discarded_estimates). The estimates are scaled by a trust factor, in the sense that we don't rely on the whole value, thinking that a non-present value might actually be because that specific trade was 0, so we don't overestimate too much. The chosen factor is 10%, so only 10% of the estimate's value is actually used to fill the NA from the original bilateral trade matrix.
The matrix is balanced, as mentioned before, using the iterative proportional fitting algorithm. The target sums for rows and columns are respectively the balanced exports and imports computed from the commodity balance sheet. The algorithm is performed directly using the mipfp R package.
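A minimal base R sketch of iterative proportional fitting on a small hypothetical matrix (the package itself delegates this step to mipfp):

```r
# Starting trade matrix (rows = exporters, columns = importers).
m <- matrix(c(0, 10, 5,
              8,  0, 4,
              3,  6, 0), nrow = 3, byrow = TRUE)
row_targets <- c(18, 10, 12)  # balanced total exports per country
col_targets <- c(11, 17, 12)  # balanced total imports per country

for (iter in 1:100) {
  m <- m * row_targets / rowSums(m)        # scale rows to export targets
  m <- t(t(m) * col_targets / colSums(m))  # scale columns to import targets
}

# Row and column sums now (approximately) match the targets.
round(rowSums(m), 6)
round(colSums(m), 6)
```

Note how the zero diagonal (no self-trade) is preserved, since scaling never turns a zero cell into a positive flow.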
Examples
# Note: These are smaller samples to show outputs, not the real data.
# For all data, call the function with default versions (i.e. no arguments).
get_bilateral_trade(
  trade_version = "example",
  cbs_version = "example"
)
Scrapes activity_data from FAOSTAT and slightly post-processes it
Description
Important: Dynamically allows for the introduction of subsets as "...".
Note: there is some overhead from individually scraping FAOSTAT code QCL for crop data; it's fine.
Usage
get_faostat_data(activity_data, ...)
Arguments
activity_data | Activity data required from FAOSTAT; needs to be one of the supported dataset names (e.g. "livestock"). |
... | Can be whichever column name from the FAOSTAT dataset is used for subsetting, e.g. year or area. |
Value
A data.frame of FAOSTAT data for activity_data; the default is for all years and countries.
Examples
get_faostat_data("livestock", year = 2010, area = "Portugal")
Livestock feed intake
Description
Get amount of items used for feeding livestock.
Usage
get_feed_intake(version = NULL)Arguments
version | File version to use as input. See whep_inputs for details. |
Value
A tibble with the feed intake data. It contains the following columns:
year: The year in which the recorded event occurred.
area_code: The code of the country where the data is from. For code details see e.g. add_area_name().
live_anim_code: Commodity balance sheet code for the type of livestock that is fed. For code details see e.g. add_item_cbs_name().
item_cbs_code: The code of the item that is used for feeding the animal. For code details see e.g. add_item_cbs_name().
feed_type: The type of item that is being fed. It can be one of:
  animals: Livestock product, e.g. Bovine Meat, Butter, Ghee, etc.
  crops: Crop product, e.g. Vegetables, Other, Oats, etc.
  residues: Crop residue, e.g. Straw, Fodder legumes, etc.
  grass: Grass, e.g. Grassland, Temporary grassland, etc.
  scavenging: Other residues. Single Scavenging item.
supply: The computed amount in tonnes of this item that should be fed to this animal, when sharing the total item feed use from the Commodity Balance Sheet among all livestock.
intake: The actual amount in tonnes that the animal needs, which can be less than the theoretical used amount from supply.
intake_dry_matter: The amount specified by intake but only considering dry matter, so it should be less than intake.
loss: The amount that is not used for feed. This is supply - intake.
loss_share: The percent that is lost. This is loss / supply.
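The relationship between the last four columns can be sketched with hypothetical numbers (base R, not actual package output):

```r
# Hypothetical feed rows, all amounts in tonnes.
feed <- data.frame(
  supply = c(100, 80),  # amount allocated from the CBS feed use
  intake = c(90, 80)    # amount the animal actually needs
)
feed$loss <- feed$supply - feed$intake        # 10 and 0
feed$loss_share <- feed$loss / feed$supply    # 0.1 and 0
feed
```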
Examples
# Note: These are smaller samples to show outputs, not the real data.
# For all data, call the function with default version (i.e. no arguments).
get_feed_intake(version = "example")
Primary items production
Description
Get amount of crops, livestock and livestock products.
Usage
get_primary_production(version = NULL)Arguments
version | File version to use as input. See whep_inputs for details. |
Value
A tibble with the item production data. It contains the following columns:
year: The year in which the recorded event occurred.
area_code: The code of the country where the data is from. For code details see e.g. add_area_name().
item_prod_code: FAOSTAT internal code for each produced item.
item_cbs_code: FAOSTAT internal code for each commodity balance sheet item. The commodity balance sheet contains an aggregated version of production items. This field is the code for the corresponding aggregated item.
live_anim_code: Commodity balance sheet code for the type of livestock that produces the livestock product. It can be:
  NA: The entry is not a livestock product.
  Non-NA: The code for the livestock type. The name can also be retrieved by using add_item_cbs_name().
unit: Measurement unit for the data. Here, keep in mind three groups of items: crops (e.g. Apples and products, Beans...), livestock (e.g. Cattle, dairy, Goats...) and livestock products (e.g. Poultry Meat, Offals, Edible...). Then the unit can be one of:
  tonnes: Available for crops and livestock products.
  ha: Hectares, available for crops.
  t_ha: Tonnes per hectare, available for crops.
  heads: Number of animals, available for livestock.
  LU: Standard Livestock Unit measure, available for livestock.
  t_head: Tonnes per head, available for livestock products.
  t_LU: Tonnes per Livestock Unit, available for livestock products.
value: The amount of item produced, measured in unit.
Examples
# Note: These are smaller samples to show outputs, not the real data.
# For all data, call the function with default version (i.e. no arguments).
get_primary_production(version = "example")
Crop residue items
Description
Get type and amount of residue produced for each crop production item.
Usage
get_primary_residues(version = NULL)Arguments
version | File version to use as input. See whep_inputs for details. |
Value
A tibble with the crop residue data. It contains the following columns:
year: The year in which the recorded event occurred.
area_code: The code of the country where the data is from. For code details see e.g. add_area_name().
item_cbs_code_crop: FAOSTAT internal code for each commodity balance sheet item. This is the crop that is generating the residue.
item_cbs_code_residue: FAOSTAT internal code for each commodity balance sheet item. This is the obtained residue. In the commodity balance sheet, this can be three different items right now:
  2105: Straw
  2106: Other crop residues
  2107: Firewood
These are actually not FAOSTAT defined items, but custom defined by us. When necessary, FAOSTAT codes are extended for our needs.
value: The amount of residue produced, measured in tonnes.
Examples
# Note: These are smaller samples to show outputs, not the real data.
# For all data, call the function with default version (i.e. no arguments).
get_primary_residues(version = "example")
Processed products share factors
Description
Reports quantities of commodity balance sheet items used for processing and quantities of their corresponding processed output items.
Usage
get_processing_coefs(version = NULL)Arguments
version | File version to use as input. See whep_inputs for details. |
Value
A tibble with the quantities for each processed product. It contains the following columns:
year: The year in which the recorded event occurred.
area_code: The code of the country where the data is from. For code details see e.g. add_area_name().
item_cbs_code_to_process: FAOSTAT internal code for each one of the items that are being processed and will give other subproduct items. For code details see e.g. add_item_cbs_name().
value_to_process: Tonnes of this item that are being processed. It matches the amount found in the processing column from the data obtained by get_wide_cbs().
item_cbs_code_processed: FAOSTAT internal code for each one of the subproduct items that are obtained when processing. For code details see e.g. add_item_cbs_name().
initial_conversion_factor: Estimate for the number of tonnes of item_cbs_code_processed obtained for each tonne of item_cbs_code_to_process. It will be used to compute the final_conversion_factor, which leaves everything balanced. TODO: explain how it's computed.
initial_value_processed: First estimate for the number of tonnes of item_cbs_code_processed obtained from item_cbs_code_to_process. It is computed as value_to_process * initial_conversion_factor.
conversion_factor_scaling: Computed scaling needed to adapt initial_conversion_factor so as to get a final balanced total of subproduct quantities. TODO: explain how it's computed.
final_conversion_factor: Final used estimate for the number of tonnes of item_cbs_code_processed obtained for each tonne of item_cbs_code_to_process. It is computed as initial_conversion_factor * conversion_factor_scaling.
final_value_processed: Final estimate for the number of tonnes of item_cbs_code_processed obtained from item_cbs_code_to_process. It is computed as value_to_process * final_conversion_factor (equivalently, initial_value_processed * conversion_factor_scaling).
For the final data obtained, the quantities final_value_processed are balanced in the following sense: the total sum of final_value_processed for each unique tuple of (year, area_code, item_cbs_code_processed) should be exactly the quantity reported for that year, country and item_cbs_code_processed item in the production column obtained from get_wide_cbs(). This is because they are not primary products, so the amount from 'production' is actually the amount of subproduct obtained. TODO: Fix few data where this doesn't hold.
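A worked illustration of how the conversion-factor columns fit together, using hypothetical numbers for a single year, country, input item and subproduct. The scaling here is derived naively from one assumed reported production value; the package's actual computation of the scaling is more involved (see the TODOs above):

```r
value_to_process <- 1000           # tonnes of the input item sent to processing
initial_conversion_factor <- 0.4   # first estimate: t of subproduct per t processed
initial_value_processed <- value_to_process * initial_conversion_factor  # 400

# Suppose the CBS reports 360 t of the subproduct actually produced;
# scale the factor so the totals balance.
reported_production <- 360
conversion_factor_scaling <- reported_production / initial_value_processed  # 0.9
final_conversion_factor <- initial_conversion_factor * conversion_factor_scaling  # 0.36
final_value_processed <- value_to_process * final_conversion_factor

final_value_processed  # 360, matching the reported production
```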
Examples
# Note: These are smaller samples to show outputs, not the real data.
# For all data, call the function with default version (i.e. no arguments).
get_processing_coefs(version = "example")
Commodity balance sheet data
Description
States supply and use parts for each commodity balance sheet (CBS) item.
Usage
get_wide_cbs(version = NULL)Arguments
version | File version to use as input. See whep_inputs for details. |
Value
A tibble with the commodity balance sheet data in wide format. It contains the following columns:
year: The year in which the recorded event occurred.
area_code: The code of the country where the data is from. For code details see e.g. add_area_name().
item_cbs_code: FAOSTAT internal code for each item. For code details see e.g. add_item_cbs_name().
The other columns are quantities (measured in tonnes), where total supplyand total use should be balanced.
For supply:
production: Produced locally.
import: Obtained from importing from other countries.
stock_retrieval: Available as net stock from previous years. For ease, only one stock column is included here as supply. If the value is positive, there is a stock quantity available as supply. Otherwise, it means a larger quantity was stored for later years and cannot be used as supply, having to deduct it from total supply. Since in this case it is negative, the total supply is still computed as the sum of all of these.
For use:
food: Food for humans.
feed: Food for animals.
export: Released as export for other countries.
seed: Intended for new production.
processing: The product will be used to obtain other subproducts.
other_uses: Any other use not included in the above ones.
There is an additional column domestic_supply which is computed as the total use excluding export.
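The supply/use balance and the derived domestic_supply column can be illustrated with one hypothetical row (base R, not actual package output):

```r
# One hypothetical CBS row, all quantities in tonnes.
supply <- c(production = 70, import = 40, stock_retrieval = -10)
use <- c(food = 50, feed = 20, export = 15, seed = 5,
         processing = 5, other_uses = 5)

total_supply <- sum(supply)  # stock_retrieval enters the sum even when negative
total_use <- sum(use)
domestic_supply <- total_use - use[["export"]]

# total_supply and total_use are both 100; domestic_supply is 85.
c(total_supply = total_supply, total_use = total_use,
  domestic_supply = domestic_supply)
```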
Examples
# Note: These are smaller samples to show outputs, not the real data.
# For all data, call the function with default version (i.e. no arguments).
get_wide_cbs(version = "example")
Commodity balance sheet items
Description
Defines name/code correspondences for commodity balance sheet (CBS) items.
Usage
items_cbs
Format
A tibble where each row corresponds to one CBS item. It contains the following columns:
item_cbs_code: A numeric code used to refer to the CBS item.
item_cbs_name: A natural language name for the item.
item_type: An ad-hoc grouping of items. This is a work in progress evolving depending on our needs, so for now it only has two possible values:
  livestock: The CBS item represents a live animal.
  other: Not any of the previous groups.
Source
Inspired by FAOSTAT data.
Primary production items
Description
Defines name/code correspondences for production items.
Usage
items_prod
Format
A tibble where each row corresponds to one production item. It contains the following columns:
item_prod_code: A numeric code used to refer to the item.
item_prod_name: A natural language name for the item.
item_type: An ad-hoc grouping of items. This is a work in progress evolving depending on our needs, so for now it only has two possible values:
  crop_product: The production item represents a crop product.
  other: Not any of the previous groups.
Source
Inspired by FAOSTAT data.
Fill gaps by linear interpolation, or carrying forward or backward.
Description
Fills gaps (NA values) in a time-dependent variable by linear interpolation between two points, or by carrying forward or backward the last or initial values, respectively. It also creates a new variable indicating the source of the filled values.
Usage
linear_fill(
  df,
  var,
  time_index,
  interpolate = TRUE,
  fill_forward = TRUE,
  fill_backward = TRUE,
  .by = NULL
)
Arguments
df | A tibble data frame containing one observation per row. |
var | The variable of df containing gaps to be filled. |
time_index | The time index variable (usually year). |
interpolate | Logical. If TRUE, gaps between two observed values are filled by linear interpolation. |
fill_forward | Logical. If TRUE, gaps after the last observed value are filled by carrying it forward. |
fill_backward | Logical. If TRUE, gaps before the first observed value are filled by carrying it backward. |
.by | A character vector with the grouping variables (optional). |
Value
A tibble data frame (ungrouped) where gaps in var have been filled, and a new "source" variable has been created indicating if the value is original or, in case it has been estimated, the gapfilling method that has been used.
Examples
sample_tibble <- tibble::tibble(
  category = c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
  year = c(
    "2015", "2016", "2017", "2018", "2019", "2020",
    "2015", "2016", "2017", "2018", "2019", "2020"
  ),
  value = c(NA, 3, NA, NA, 0, NA, 1, NA, NA, NA, 5, NA)
)
linear_fill(sample_tibble, value, year, .by = c("category"))
linear_fill(
  sample_tibble,
  value,
  year,
  interpolate = FALSE,
  .by = c("category")
)
Polities
Description
Defines name/code correspondences for polities (political entities).
Usage
polities
Format
A tibble where each row corresponds to one polity. It contains the following columns: TODO: On polities Pull Request, coming soon.
Fill gaps using a proxy variable
Description
Fills gaps in a variable based on changes in a proxy variable, using ratios between the filled variable and the proxy variable, and labels output accordingly.
Usage
proxy_fill(df, var, proxy_var, time_index, ...)
Arguments
df | A tibble data frame containing one observation per row. |
var | The variable of df containing gaps to be filled. |
proxy_var | The variable to be used as proxy. |
time_index | The time index variable (usually year). |
... | Optionally, additional arguments that will be passed to linear_fill(), e.g. .by. |
Value
A tibble dataframe (ungrouped) where gaps in var have been filled, a new proxy_ratio variable has been created, and a new "source" variable has been created indicating if the value is original or, in case it has been estimated, the gapfilling method that has been used.
Examples
sample_tibble <- tibble::tibble(
  category = c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
  year = c(
    "2015", "2016", "2017", "2018", "2019", "2020",
    "2015", "2016", "2017", "2018", "2019", "2020"
  ),
  value = c(NA, 3, NA, NA, 0, NA, 1, NA, NA, NA, 5, NA),
  proxy_variable = c(1, 2, 2, 2, 2, 2, 1, 2, 3, 4, 5, 6)
)
proxy_fill(sample_tibble, value, proxy_variable, year, .by = c("category"))
Fill gaps summing the previous value of a variable to the value of another variable
Description
Fills gaps in a variable with the sum of its previous value and the value of another variable. When a gap has multiple observations, the values are accumulated along the series. When there is a gap at the start of the series, it can either remain unfilled or assume an invisible 0 value before the first observation and start filling with cumulative sum.
Usage
sum_fill(df, var, change_var, start_with_zero = TRUE, .by = NULL)
Arguments
df | A tibble data frame containing one observation per row. |
var | The variable of df containing gaps to be filled. |
change_var | The variable whose values will be used to fill the gaps. |
start_with_zero | Logical. If TRUE, assumes an invisible 0 value before the first observation and fills with cumulative sum starting from the first change_var value. If FALSE (default), starting NA values remain unfilled. |
.by | A character vector with the grouping variables (optional). |
Value
A tibble dataframe (ungrouped) where gaps in var have been filled, and a new "source" variable has been created indicating if the value is original or, in case it has been estimated, the gapfilling method that has been used.
Examples
sample_tibble <- tibble::tibble(
  category = c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
  year = c(
    "2015", "2016", "2017", "2018", "2019", "2020",
    "2015", "2016", "2017", "2018", "2019", "2020"
  ),
  value = c(NA, 3, NA, NA, 0, NA, 1, NA, NA, NA, 5, NA),
  change_variable = c(1, 2, 3, 4, 1, 1, 0, 0, 0, 0, 0, 1)
)
sum_fill(
  sample_tibble,
  value,
  change_variable,
  start_with_zero = FALSE,
  .by = c("category")
)
sum_fill(
  sample_tibble,
  value,
  change_variable,
  start_with_zero = TRUE,
  .by = c("category")
)
External inputs
Description
The information needed for accessing external datasets used as inputs in our modeling.
Usage
whep_inputs
Format
A tibble where each row corresponds to one external input dataset. It contains the following columns:
alias: An internal name used to refer to this dataset, which is the expected name when trying to get the dataset with whep_read_file().
board_url: The public static URL where the data is found, following the concept of a board from the pins package, which is what we use for storing these input datasets.
version: The specific version of the dataset, as defined by the pins package. The version is a string similar to "20250714T123343Z-114b5". This version is the one used by default if no version is specified when calling whep_read_file(). If you want to use a different one, you can find the available versions of a file by using whep_list_file_versions().
Source
Created by the package authors.
Input file versions
Description
Lists all existing versions of an input file from whep_inputs.
Usage
whep_list_file_versions(file_alias)
Arguments
file_alias | Internal name of the requested file. You can find the possible values in the alias column of whep_inputs. |
Value
A tibble where each row is a version. For details about its format, see pins::pin_versions().
Examples
whep_list_file_versions("read_example")
Download, cache and read files
Description
Used to fetch input files that are needed for the package's functions and that were built in external sources and are too large to include directly. This is a public function for transparency purposes, so that users can inspect the original inputs of this package that were not directly processed here.
If the requested file doesn't exist locally, it is downloaded from a public link and cached before reading it. This is all implemented using the pins package. It supports multiple file formats and file versioning.
Usage
whep_read_file(file_alias, type = "parquet", version = NULL)
Arguments
file_alias | Internal name of the requested file. You can find the possible values in the alias column of whep_inputs. |
type | The extension of the file that must be read. Possible values are "parquet" (default) and "csv". Saving each file in both formats is for transparency and accessibility purposes, e.g., having to share the data with non-programmers who can easily import a CSV into a spreadsheet. You will most likely never have to set this option manually unless for some reason a file could not be supplied in e.g. parquet format. |
version | The version of the file that must be read. Possible values are NULL (default, which uses the version pinned in whep_inputs), "latest", or a specific version string as returned by whep_list_file_versions(). |
Value
A tibble with the dataset. Some information about each dataset can be found in the code where it's used as input for further processing.
Examples
whep_read_file("read_example")
whep_read_file("read_example", type = "parquet", version = "latest")
whep_read_file(
  "read_example",
  type = "csv",
  version = "20250721T152646Z-ce61b"
)