
Objectives

This vignette guides contributors on how to effectively use the KOSIS (Korean Statistical Information Service) interface to retrieve data for addition to the package’s bundled censuskor data.frame. In vignette 04, we introduced how the KOSIS API is used to retrieve the data of interest using KOSIS’ OpenAPI URL. In many cases, it is handy for users to download data directly from KOSIS web pages. By the end of this guide, contributors will be able to:

Navigate the KOSIS interface to locate relevant datasets.
Familiarize themselves with the various data download options available on the KOSIS platform.
Extract and format data for inclusion in the package.

Navigating KOSIS Interface

To begin, visit the KOSIS website at KOSIS. Use the search bar or browse through categories to find datasets relevant to your area of interest. Since tidycensuskr offers data at Si (city), Gun (county), and Gu (borough) levels, you could type “Si, Gun, Gu” in the search bar to query the list of datasets available. As of November 20, 2025, there are 11 datasets available with that keyword. If you click one of the datasets in the list, you will be directed to a new window with the dataset navigation tool. The screen capture below shows the KOSIS interface for the dataset “Deaths, Death rates, Age-standardized death rates by cause (50 item) and sex: Si, Gun, and Gu”.

Setting Download Options

The default view shows a selected set of variables or Si-Gun-Gu regions. You can customize the selection by clicking the “Setting” button on the rightmost side of the toolbar. This opens a sidebar where you can select the variables of interest, years, and regions. To select all Si-Gun-Gu regions, click the “Level 2 Selection” button under the “Region” tab. It activates all checkboxes for Si-Gun-Gu regions. Note that the “Level 1 Selection” button only selects Si (city/province) level regions, which is the default for many datasets. Another point to note is that some single-district cities like Sejong are sometimes not listed under “Level 2 Selection.” In this case, you need to manually check the box for such regions.

Notes on Size Restriction

Selecting too many combinations of regions, variables, and years produces a query that is too large for the KOSIS servers to handle, and KOSIS settings prohibit such queries. The default limit is 20,000 cells per query. If you exceed it, you might encounter an error message like the one in the screen capture below.

To avoid this issue, try to limit the number of selected years or variables. For example, if you are interested in only the most recent year, deselect all other years except for the latest one. Similarly, if you are only interested in a subset of variables, deselect the rest. This will lead to many separate files to download, requiring further post-processing steps to combine these files into one for cleaning.
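Before opening the download dialog, it can help to estimate how many separate downloads a selection will need. The sketch below assumes that the cell count of a query is simply the number of selected regions times the number of variables times the number of years; that formula, the helper names, and the counts are illustrative assumptions rather than documented KOSIS behavior. Only the 20,000-cell limit comes from the text above.

```r
# Hypothetical helper: estimate the number of cells in one KOSIS query.
# Assumption (not documented by KOSIS): cells = regions x variables x years.
estimate_cells <- function(n_regions, n_variables, n_years) {
  n_regions * n_variables * n_years
}

# Hypothetical helper: how many years fit in one query under the limit.
# A result of 0 means a single year is already too large, so the
# variables (or regions) would have to be split as well.
max_years_per_query <- function(n_regions, n_variables, cell_limit = 20000) {
  floor(cell_limit / (n_regions * n_variables))
}

# Illustrative counts only; replace them with your actual selection.
n_regions <- 250
n_variables <- 30
n_years <- 10

total_cells <- estimate_cells(n_regions, n_variables, n_years)
years_per_query <- max_years_per_query(n_regions, n_variables)
n_downloads <- ceiling(n_years / years_per_query)

cat("Total cells:", total_cells, "\n")
cat("Years per query:", years_per_query, "\n")
cat("Downloads needed:", n_downloads, "\n")
```

With these made-up counts the full selection is well over the limit, so the years would be split across several downloads and recombined during post-processing.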

Downloading Data

Once you have set your desired options, click the “Download” button at the top right of the sidebar. You will see a popup in the center of the screen with download format options. Most datasets support Excel Worksheet (.xls) and comma-separated values (.csv) formats; for some smaller datasets, additional formats like SAS or modern Excel Worksheet (.xlsx) are also available.

It is very important to select the “Including code” checkbox in the middle of the popup. This option ensures that the downloaded data includes the statistical codes for regions and variables, which are essential for proper data merging and analysis. For metadata information, you can download a text metadata file by clicking the “Download metadata (TXT)” button.

Frequently used datasets are pre-generated and stored on KOSIS servers for quick access. In this case, you will see an additional section in the popup named “Statistical Table File Service,” under which a “Shortcut” button is available. Clicking it opens another popup window with a list of direct download links for pre-generated files. These files are typically provided by year with auxiliary variables for records.

Post-processing Downloaded Data

After downloading the data files, you may need to perform some post-processing steps to clean and format the data for inclusion in the package. This may involve:

Reading the data into R using appropriate functions (e.g., read.csv() for CSV files or readxl::read_excel() for Excel files).
Renaming columns to match the naming conventions used in the package.

Standard column names include adm1, adm1_code, adm2, adm2_code, year, type, class1, class2, value, and unit.

Column name	Description
adm1	Si-Do (province) level administrative unit name
adm1_code	Si-Do (province) level administrative unit code
adm2	Si-Gun-Gu (district) level administrative unit name
adm2_code	Si-Gun-Gu (district) level administrative unit code
year	Year of the dataset
type	Data type (e.g., population, economy)
class1	First classification level
class2	Second classification level
value	Measured value
unit	Unit of measurement

Converting data types as necessary (e.g., ensuring numeric columns are of type numeric).
Merging multiple files if the data was downloaded in parts due to size restrictions.
Validating the data to ensure accuracy and completeness.
Appending the cleaned data to censuskor and registering the result as the bundled dataset (i.e., usethis::use_data(censuskor, overwrite = TRUE)).
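The reading, renaming, type conversion, and merging steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the two CSV files are generated on the fly to stand in for KOSIS downloads split by year, the raw column names (region_name, period, amount) are hypothetical, and treating "-" as a missing-value marker is also an assumption. Real KOSIS files will have different headers and may need different handling.

```r
# Illustration only: write two small CSV files that stand in for KOSIS
# downloads split by year. The raw column names are hypothetical.
tmp_dir <- tempfile("kosis_")
dir.create(tmp_dir)
raw_2019 <- data.frame(
  region_name = c("Jongno-gu", "Jung-gu"),
  period = c(2019, 2019),
  amount = c("1,234", "567")
)
raw_2020 <- data.frame(
  region_name = c("Jongno-gu", "Jung-gu"),
  period = c(2020, 2020),
  amount = c("1,300", "-")
)
write.csv(raw_2019, file.path(tmp_dir, "part_2019.csv"), row.names = FALSE)
write.csv(raw_2020, file.path(tmp_dir, "part_2020.csv"), row.names = FALSE)

# 1. Read every part and stack the parts into one data.frame
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
parts <- lapply(files, read.csv, stringsAsFactors = FALSE)
combined <- do.call(rbind, parts)

# 2. Rename columns to the package's standard names
names(combined)[names(combined) == "region_name"] <- "adm2"
names(combined)[names(combined) == "period"] <- "year"
names(combined)[names(combined) == "amount"] <- "value"

# 3. Convert types: strip thousands separators, then coerce to numeric.
#    Non-numeric markers such as "-" become NA (assumed to mean missing).
combined$value <- suppressWarnings(as.numeric(gsub(",", "", combined$value)))
combined$year <- as.integer(combined$year)

# 4. Basic validation: every row has a year; count missing values
stopifnot(!anyNA(combined$year))
n_missing <- sum(is.na(combined$value))

print(combined)
```

Columns such as adm1, adm2_code, type, and unit are left out here for brevity; they would be filled in the same way before appending the result to censuskor.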

Assigning Proper `adm2_code`

Cleaning KOSIS data requires special attention to ensure that the adm2_code values are correctly assigned. The adm2_code is a unique identifier for each Si-Gun-Gu (district) level administrative unit in South Korea. It is crucial for linking census data to spatial boundary files. We provide a reference table for adm2_code values in the package, namely in extdata/lookup_district_code.csv in the package installation directory or inst/extdata/lookup_district_code.csv if you cloned the GitHub repository. The lookup table contains the following columns:

Column name	Description
sido_kr	Province name in Korean
sigungu_kr	District name in Korean
sigungu_1_kr	Alternative district name in Korean
sigungu_2_kr	Alternative district name in Korean
sido_en	Province name in English
sigun_en	District name in English
sigungu_1_en	Alternative district name in English
sigungu_2_en	Alternative district name in English
sdsgg_en	Combined province and district name in English
base_year	Base year for the code
tax_exclude	Indicator for tax exclusion
adm2_code	Official Si-Gun-Gu (district) level administrative unit code
adm2_code_new	New Si-Gun-Gu (district) level administrative unit code
sgg_population	District code for population data
sgg_housing	District code for housing data
sgg_tax_global	District code for global tax data
sgg_tax_income	District code for income tax data
sigungu_doj	District code for Ministry of Justice data (i.e., marital migrants)
sigungu_dcee	District code for Ministry of Climate, Energy, and Environment data (i.e., wastewater data)

Note that the sigungu_kr, sigungu_1_kr, and sigungu_2_kr columns provide several versions of district names in Korean, with or without the name of the basic local government (기초지방자치단체, the upper unit of each district):

sigungu_kr: Standard district name with the basic local government for non-autonomous districts (e.g., “Ilsandong-gu, Goyang-si” (“고양시 일산동구”))
sigungu_1_kr: Name of the basic local government (e.g., “고양시” for all of “덕양구”, “일산동구”, and “일산서구”), filled in for non-autonomous districts
sigungu_2_kr: Standard district name without the basic local government for non-autonomous districts (e.g., “Ilsandong-gu” (“일산동구”))
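The relationship between the three name columns can be illustrated with a toy table. The rows below are typed in by hand from the Goyang-si examples above and are not read from the real lookup file, so the actual contents may differ. The rule for picking a join column (use whichever column covers more of the downloaded names) is only one possible heuristic, and downloaded_names is a hypothetical input.

```r
# Toy rows mimicking the described layout for three non-autonomous
# districts of Goyang-si (hand-typed, not read from the real file).
toy_lookup <- data.frame(
  sigungu_kr = c("고양시 덕양구", "고양시 일산동구", "고양시 일산서구"),
  sigungu_1_kr = c("고양시", "고양시", "고양시"),
  sigungu_2_kr = c("덕양구", "일산동구", "일산서구")
)

# For these rows, the full name is the city name plus the district name
rebuilt <- paste(toy_lookup$sigungu_1_kr, toy_lookup$sigungu_2_kr)
all_consistent <- all(rebuilt == toy_lookup$sigungu_kr)

# Hypothetical district names as they might appear in a downloaded file
downloaded_names <- c("일산동구", "덕양구")

# Share of downloaded names found in each candidate join column
share_full <- mean(downloaded_names %in% toy_lookup$sigungu_kr)
share_short <- mean(downloaded_names %in% toy_lookup$sigungu_2_kr)

# Heuristic: join on the column that covers more of the downloaded names
join_column <- if (share_short >= share_full) "sigungu_2_kr" else "sigungu_kr"
print(join_column)
```

Checking coverage before joining helps reveal whether a downloaded file writes district names with or without the city prefix.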

This table can be expanded when new datasets that use different district code systems are added. For contributors, the target code to assign is usually the adm2_code field value. Depending on the retrieved data file’s layout, contributors need to match the district names or other code systems to the adm2_code values in the lookup table. We reflect district changes over the years by including the base_year column in the lookup table. When joining the lookup table to the post-processed data, use the code or name columns and the base_year column to ensure accurate matching.

Year matching must be done with care. The base_year column indicates the year when the corresponding adm2_code was valid. When joining, ensure that the year column in your post-processed data is less than or equal to the base_year in the lookup table. This ensures that you are using the correct adm2_code for the specific year of your dataset.

Example Code for Post-processing

Here are example code snippets demonstrating how to read a downloaded CSV file, clean it, and assign proper adm2_code values:

Using Korean district name and base_year to join:

Assume that the post-processed data includes adm2kr (district name in Korean) and year columns.

    library(dplyr)

    # fixed path to the lookup table
    lookup_path <- system.file(
      "extdata/lookup_district_code.csv",
      package = "tidycensuskr"
    )
    lookup_district_code <- read.csv(lookup_path)

    # Read the postprocessed CSV file
    pratedata <- read.csv("path/to/downloaded_file.csv")

    joinby <- dplyr::join_by(
      adm2kr == sigungu_2_kr,
      year <= base_year
    )

    # join with lookup table to assign adm2_code
    cleaned_data <- pratedata |>
      dplyr::left_join(
        lookup_district_code,
        by = joinby
      )

Using alternative district code (e.g., sigungu_doj) and base_year to join:

Let’s say the post-processed data includes sggcd (alternative district code for Ministry of Justice data) and year columns.

    library(dplyr)

    # fixed path to the lookup table
    lookup_path <- system.file(
      "extdata/lookup_district_code.csv",
      package = "tidycensuskr"
    )
    lookup_district_code <- read.csv(lookup_path)

    # Read the postprocessed CSV file
    dojdata <- read.csv("path/to/downloaded_file.csv")

    joinby <- dplyr::join_by(
      sggcd == sigungu_doj,
      year <= base_year
    )

    # join with lookup table to assign adm2_code
    cleaned_data <- dojdata |>
      dplyr::left_join(
        lookup_district_code,
        by = joinby
      )
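After a join of this kind, it is worth checking the result before appending it to the package data. The sketch below uses small made-up tables in place of the real download and lookup file; the district names, years, and codes are invented for illustration, and "없는구" is a deliberately nonexistent district. It shows two checks: rows that received no adm2_code, and input rows that matched more than one lookup row, which can happen when a district has several base_year entries that all satisfy year <= base_year.

```r
library(dplyr)

# Toy stand-ins for the post-processed data and the lookup table.
# All names, years, and codes below are made up for illustration.
pratedata <- data.frame(
  adm2kr = c("일산동구", "종로구", "없는구"),
  year = c(2012, 2020, 2020),
  value = c(1.2, 3.4, 5.6)
)
lookup_district_code <- data.frame(
  sigungu_2_kr = c("일산동구", "일산동구", "종로구"),
  base_year = c(2015, 2020, 2020),
  adm2_code = c(90001, 90002, 90003)
)

# Same join pattern as in the examples above
cleaned_data <- pratedata |>
  left_join(
    lookup_district_code,
    by = join_by(adm2kr == sigungu_2_kr, year <= base_year)
  )

# Check 1: rows that found no code in the lookup table
unmatched <- cleaned_data |>
  filter(is.na(adm2_code))

# Check 2: input rows that matched more than one lookup row
multi_matched <- cleaned_data |>
  count(adm2kr, year, name = "n_matches") |>
  filter(n_matches > 1)

print(unmatched)
print(multi_matched)
```

If multi_matched is not empty, the duplicates need to be resolved by hand or by a rule chosen after inspecting the lookup table; which base_year entry is the right one depends on how the district changed over time.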


Using KOSIS Interface for Contributors

November 24, 2025
