Data wrangling

From Wikipedia, the free encyclopedia

Restructuring data into a desired format

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to ensure that the data is of good quality and useful. Data analysts typically spend the majority of their time on data wrangling rather than on the actual analysis of the data.

The process of data wrangling may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data wrangling typically follows a set of general steps, which begin with extracting the data in a raw form from the data source, "munging" the raw data (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.^[1] It is closely aligned with the ETL process.
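The general steps described above (extract, munge or parse, deposit into a sink) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the CSV text, the field names `sensor` and `reading`, and the use of an in-memory JSON buffer as the "data sink" are all hypothetical.

```python
import csv
import io
import json

# Hypothetical raw source: CSV text standing in for a file or API response.
RAW_SOURCE = "sensor,reading\nb,3.5\na,1.25\nc,2.0\n"


def extract(raw_text):
    """Extract: read the raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def munge(rows):
    """Munge: parse fields into typed values and sort the records by sensor."""
    parsed = [{"sensor": r["sensor"], "reading": float(r["reading"])} for r in rows]
    return sorted(parsed, key=lambda r: r["sensor"])


def load(rows, sink):
    """Load: deposit the structured records into a data sink (here, JSON text)."""
    json.dump(rows, sink)


sink = io.StringIO()  # stands in for a file, database, or data warehouse
load(munge(extract(RAW_SOURCE)), sink)
stored = json.loads(sink.getvalue())
print(stored)
```

Keeping the three stages as separate functions mirrors the extract, transform, load split and makes each stage testable on its own.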

Background

The non-technical term "wrangler" is often said to derive from work done by the United States Library of Congress's National Digital Information Infrastructure and Preservation Program (NDIIPP) and their program partner, the Emory University Libraries-based MetaArchive Partnership. The term "mung" has roots in munging as described in the Jargon File.^[2] The term "data wrangler" was also suggested as the best analogy to describe someone working with data.^[3]

One of the first mentions of data wrangling in a scientific context was by Donald Cline during the NASA/NOAA Cold Land Processes Experiment.^[4] Cline stated the data wranglers "coordinate the acquisition of the entire collection of the experiment data." Cline also specifies duties typically handled by a storage administrator for working with large amounts of data. This can occur in areas like major research projects and the making of films with a large amount of complex computer-generated imagery. In research, this involves both data transfer from research instrument to storage grid or storage facility as well as data manipulation for re-analysis via high-performance computing instruments or access via cyberinfrastructure-based digital libraries.

With the rise of artificial intelligence in data science, it has become increasingly important for automated data wrangling to have very strict checks and balances, which is why the munging process has not been automated by machine learning. Data munging requires more than an automated solution: it requires knowledge of what information should be removed, and artificial intelligence has not yet reached the point of understanding such things.^[5]

Connection to data mining

Data wrangling is a superset of data mining and involves processes that some, but not all, data mining uses. Data mining aims to find patterns within large data sets, whereas data wrangling transforms data in order to deliver insights about that data. That data wrangling is a superset of data mining does not mean that data mining makes no use of it; there are many use cases for data wrangling in data mining. Data wrangling can benefit data mining by removing data that does not benefit the overall set or is not formatted properly, which yields better results for the overall data mining process.

An example of data mining that is closely related to data wrangling is ignoring data from a set that is not connected to the goal. Suppose there is a data set related to the state of Texas and the goal is to get statistics on the residents of Houston. The data in the set related to the residents of Dallas is not useful for that goal and can be removed before processing to improve the efficiency of the data mining process.
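The Houston example can be illustrated with a short filter applied before any statistics are computed. The records, field names, and the `keep_city` helper below are hypothetical and exist only for this sketch.

```python
# Hypothetical records from a Texas data set; values are illustrative only.
records = [
    {"resident": "A", "city": "Houston", "age": 34},
    {"resident": "B", "city": "Dallas", "age": 51},
    {"resident": "C", "city": "houston ", "age": 28},  # inconsistent formatting
    {"resident": "D", "city": "Austin", "age": 45},
]


def keep_city(rows, city):
    """Keep only rows for the given city, ignoring case and stray whitespace."""
    wanted = city.strip().lower()
    return [row for row in rows if row["city"].strip().lower() == wanted]


houston = keep_city(records, "Houston")
mean_age = sum(row["age"] for row in houston) / len(houston)
print(len(houston), mean_age)
```

Filtering first means every later step touches only the rows that matter for the goal, which is the efficiency gain the paragraph describes.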

Benefits

Well-designed data wrangling can deliver the following benefits in analytic workflows:

Analysis readiness and improved data quality. By transforming datasets into consistent, "tidy" structures, wrangling reduces ad-hoc manipulation and makes data easier to model and visualize.^[6]

Reproducibility and auditability. Capturing transformations as scripts or notebooks creates an explicit record of how results were produced; interactive systems such as Wrangler generate editable histories and auditable transformation scripts, reducing manual, one-off editing.^[7]^[8]

Efficiency and reuse. Mixed-initiative tools can infer candidate transforms and support reuse of saved transformation scripts when data are updated, improving productivity.^[7]

Integration and enrichment. Data wrangling tools help merge heterogeneous sources and enrich records via reconciliation or web services. For example, OpenRefine supports clustering, transformation, and reconciliation against external sources.^[9]

Note: Benefits depend on documenting steps and using version-controlled workflows; otherwise, wrangling can be time-consuming and error-prone.^[8]

Core ideas

Turning messy data into useful statistics

The main steps in data wrangling are as follows:

Data discovery
This all-encompassing term describes how to understand your data. This is the first step to familiarize yourself with your data.
Structuring
The next step is to organize the data. Raw data is typically unorganized and much of it may not be useful for the end product. This step is important for easier computation and analysis in the later steps.
Cleaning
There are many different forms of cleaning data. For example, one form is catching dates that are formatted in different ways, another is removing outliers that will skew results, and a third is formatting null values. This step is important in assuring the overall quality of the data.
Enriching
At this step, determine whether additional data that could be easily added would benefit the data set.
Validating
This step is similar to structuring and cleaning. Use repetitive sequences of validation rules to assure data consistency as well as quality and security. An example of a validation rule is confirming the accuracy of fields by cross-checking data.
Publishing
Prepare the data set for use downstream, which could include use for users or software. Be sure to document any steps and logic during wrangling.

These steps are an iterative process that should yield a clean and usable data set that can then be used for analysis. This process is tedious but rewarding as it allows analysts to get the information they need out of a large set of data that would otherwise be unreadable.

Starting data
Name	Phone	Birth date	State
John, Smith	445-881-4478	August 12, 1989	Maine
Jennifer Tal	+1-189-456-4513	11/12/1965	Tx
Gates, Bill	(876)546-8165	June 15, 72	Kansas
Alan Fitch	5493156648	2-6-1985	Oh
Jacob Alan	156-4896	January 3	Alabama

Result
Name	Phone	Birth date	State
John Smith	445-881-4478	1989-08-12	Maine
Jennifer Tal	189-456-4513	1965-11-12	Texas
Bill Gates	876-546-8165	1972-06-15	Kansas
Alan Fitch	549-315-6648	1985-02-06	Ohio

Applying the data wrangling process to this small data set yields a data set that is significantly easier to read. All names are now formatted the same way, {first name last name}; phone numbers are also formatted the same way, {area code-XXX-XXXX}; dates are formatted numerically, {YYYY-mm-dd}; and states are no longer abbreviated. The entry for Jacob Alan did not have fully formed data (the area code on the phone number is missing and the birth date had no year), so it was discarded from the data set. Now that the resulting data set is cleaned and readable, it is ready to be either deployed or evaluated.
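One possible way to reproduce the transformation from the starting data to the result is sketched below in Python. Several parts are assumptions made for this illustration rather than anything prescribed by the example: the small `KNOWN_FIRST_NAMES` lookup (needed because "John, Smith" and "Gates, Bill" use different name orders), the `STATE_NAMES` mapping that covers only the abbreviations in the table, and the month-first reading of numeric dates.

```python
import re
from datetime import datetime

# Starting data from the table above: (name, phone, birth date, state).
RAW_ROWS = [
    ("John, Smith", "445-881-4478", "August 12, 1989", "Maine"),
    ("Jennifer Tal", "+1-189-456-4513", "11/12/1965", "Tx"),
    ("Gates, Bill", "(876)546-8165", "June 15, 72", "Kansas"),
    ("Alan Fitch", "5493156648", "2-6-1985", "Oh"),
    ("Jacob Alan", "156-4896", "January 3", "Alabama"),
]

# Assumptions for this sketch (hypothetical lookups, not general-purpose data).
KNOWN_FIRST_NAMES = {"John", "Jennifer", "Bill", "Alan", "Jacob"}
STATE_NAMES = {"Tx": "Texas", "Oh": "Ohio"}
DATE_FORMATS = ("%B %d, %Y", "%B %d, %y", "%m/%d/%Y", "%m-%d-%Y")


def clean_name(raw):
    """Return '{first name} {last name}', swapping 'Last, First' when detected."""
    parts = [part.strip() for part in raw.split(",")]
    if len(parts) == 2:
        first, second = parts
        # Swap only when the second part looks like the given name.
        if second in KNOWN_FIRST_NAMES and first not in KNOWN_FIRST_NAMES:
            first, second = second, first
        return f"{first} {second}"
    return raw.strip()


def clean_phone(raw):
    """Return 'AAA-XXX-XXXX', or None if the number lacks an area code."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the +1 country code
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def clean_date(raw):
    """Return the date as YYYY-mm-dd, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # e.g. the year is missing


def wrangle(rows):
    """Clean every field and discard rows that are not fully formed."""
    cleaned = []
    for name, phone, birth, state in rows:
        record = (clean_name(name), clean_phone(phone), clean_date(birth),
                  STATE_NAMES.get(state, state))
        if None not in record:
            cleaned.append(record)
    return cleaned


RESULT = wrangle(RAW_ROWS)
for row in RESULT:
    print("\t".join(row))
```

The name lookup shows why cleaning often needs domain knowledge: no purely syntactic rule can tell that "John, Smith" is already in first-last order while "Gates, Bill" is not.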

Typical use

The data transformations are typically applied to distinct entities (e.g. fields, rows, columns, data values, etc.) within a data set, and could include such actions as extractions, parsing, joining, standardizing, augmenting, cleansing, consolidating, and filtering to create desired wrangling outputs that can be leveraged downstream.

The recipients could be individuals, such as data architects or data scientists who will investigate the data further, business users who will consume the data directly in reports, or systems that will further process the data and write it into targets such as data warehouses, data lakes, or downstream applications.

Modus operandi

Depending on the amount and format of the incoming data, data wrangling has traditionally been performed manually (e.g. via spreadsheets such as Excel), with tools like KNIME, or via scripts in languages such as Python or SQL. R, a language often used in data mining and statistical data analysis, is now also sometimes used for data wrangling.^[10] Data wranglers typically have skill sets that include R or Python, SQL, PHP, Scala, and other languages typically used for analyzing data.

Visual data wrangling systems were developed to make data wrangling accessible for non-programmers, and simpler for programmers. Some of these also include embedded AI recommenders and programming by example facilities to provide user assistance, and program synthesis techniques to autogenerate scalable dataflow code. Early prototypes of visual data wrangling tools include OpenRefine and the Stanford/Berkeley Wrangler research system;^[11] the latter evolved into Trifacta.

Other terms for these processes have included data franchising,^[12] data preparation, and data munging.

Example

Given a set of data that contains information on medical patients, your goal is to find correlations for a disease. Before you start iterating through the data, ensure that you understand the desired result: are you looking for patients who have the disease? Are there other diseases that can be the cause? Once the outcome is understood, the data wrangling process can begin.

Start by determining the structure of the outcome, that is, what is important for understanding the disease diagnosis.

Once a final structure is determined, clean the data by removing any data points that are not helpful or are malformed. This could include patients who have not been diagnosed with any disease.

After cleaning, look at the data again: is there anything already known that could be added to the data set and would benefit it? An example could be the most common diseases in the area; America and India are very different when it comes to most common diseases.
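As a rough illustration of this enrichment step, already-known regional information can be joined onto each record through a lookup table. Everything in the sketch is hypothetical: the region names, the placeholder disease labels, and the `enrich` helper are invented for the example and carry no epidemiological meaning.

```python
# Hypothetical cleaned patient rows; disease labels are placeholders.
patients = [
    {"id": 1, "region": "north", "diagnosis": "disease_x"},
    {"id": 2, "region": "south", "diagnosis": "disease_x"},
    {"id": 3, "region": "west", "diagnosis": "disease_y"},
]

# Hypothetical lookup of the most common disease per region (known beforehand).
COMMON_BY_REGION = {"north": "disease_x", "south": "disease_y"}


def enrich(rows, lookup):
    """Add the regional most-common disease to each row without mutating it."""
    enriched = []
    for row in rows:
        common = lookup.get(row["region"])  # None when the region is unknown
        enriched.append({
            **row,
            "regional_common_disease": common,
            "matches_regional_common": common == row["diagnosis"],
        })
    return enriched


enriched_patients = enrich(patients, COMMON_BY_REGION)
for row in enriched_patients:
    print(row)
```

Rows whose region is missing from the lookup keep a `None` value rather than being dropped, so the decision about what to do with them is left to the validation step.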

Now comes the validation step. Determine validation rules for the data points that need to be checked for validity; this could include date of birth or checking for specific diseases.
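Validation rules of this kind can be expressed as named checks that are applied to every record. The sketch below is one possible arrangement under stated assumptions: the patient records, the two rules, and the fixed `REFERENCE_DATE` (used instead of the current date so the result is repeatable) are all hypothetical.

```python
from datetime import date

# Fixed reference date (an assumption) so the example gives repeatable results.
REFERENCE_DATE = date(2024, 1, 1)

# Hypothetical patient records; diagnosis labels are placeholders.
PATIENTS = [
    {"id": 1, "birth_date": date(1980, 5, 17), "diagnosis": "disease_x"},
    {"id": 2, "birth_date": date(2999, 1, 1), "diagnosis": "disease_x"},
    {"id": 3, "birth_date": date(1975, 9, 30), "diagnosis": "  "},
]

# Each rule is a (description, predicate) pair; a record must pass all of them.
RULES = [
    ("birth date not in the future", lambda r: r["birth_date"] <= REFERENCE_DATE),
    ("diagnosis present", lambda r: bool(r["diagnosis"].strip())),
]


def validate(records, rules):
    """Split records into valid ones and a report of failed rules per record id."""
    valid, rejected = [], {}
    for record in records:
        failures = [name for name, check in rules if not check(record)]
        if failures:
            rejected[record["id"]] = failures
        else:
            valid.append(record)
    return valid, rejected


valid_patients, rejected_patients = validate(PATIENTS, RULES)
print([p["id"] for p in valid_patients])
print(rejected_patients)
```

Returning the names of the failed rules, rather than silently dropping records, keeps a record of why each data point was excluded.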

After the validation step the data should now be organized and prepared for either deployment or evaluation. This process can be beneficial for determining correlations for disease diagnosis as it will reduce the vast amount of data into something that can be easily analyzed for an accurate result.

References

[1] "What Is Data Munging?". Archived from the original on 2013-08-18. Retrieved 2022-01-21.
[2] "mung". Mung. Jargon File. Archived from the original on 2012-09-18. Retrieved 2012-10-10.
[3] As coder is for code, X is for data. Archived 2021-04-15 at the Wayback Machine. Open Knowledge Foundation blog post.
[4] Parsons, M. A.; Brodzik, M. J.; Rutter, N. J. (2004). "Data management for the Cold Land Processes Experiment: improving hydrological science". Hydrological Processes. 18 (18): 3637–3653. Bibcode:2004HyPr...18.3637P. doi:10.1002/hyp.5801. S2CID 129774847.
[5] "What Is Data Wrangling? What are the steps in data wrangling?". Express Analytics. 2020-04-22. Archived from the original on 2020-11-01. Retrieved 2020-12-06.
[6] Wickham, Hadley (2014). "Tidy Data". Journal of Statistical Software. 59 (10). doi:10.18637/jss.v059.i10.
[7] Kandel, Sean; Paepcke, Andreas; Hellerstein, Joseph M.; Heer, Jeffrey (2011). "Wrangler: Interactive Visual Specification of Data Transformation Scripts" (PDF). Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 3363–3372. doi:10.1145/1978942.1979444.
[8] Sandve, Geir Kjetil; Nekrutenko, Anton; Taylor, James; Hovig, Eivind (2013). "Ten Simple Rules for Reproducible Computational Research". PLOS Computational Biology. 9 (10) e1003285. Bibcode:2013PLSCB...9E3285S. doi:10.1371/journal.pcbi.1003285. PMC 3812051.
[9] "OpenRefine documentation". OpenRefine. 29 December 2022. Retrieved 2025-08-15.
[10] Wickham, Hadley; Grolemund, Garrett (2016). "Chapter 9: Data Wrangling Introduction". R for data science: import, tidy, transform, visualize, and model data (First ed.). Sebastopol, CA: O'Reilly. ISBN 978-1-4919-1039-9. Archived from the original on 2021-10-11. Retrieved 2022-01-12.
[11] Kandel, Sean; Paepcke, Andreas (May 2011). "Wrangler: Interactive visual specification of data transformation scripts". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 3363–3372. doi:10.1145/1978942.1979444. ISBN 978-1-4503-0228-9. S2CID 11133756.
[12] What is Data Franchising? (2003 and 2017 IRI). Archived 2021-04-15 at the Wayback Machine.

External links

"What is Data Wrangling? Benefits, tools, and skills?". My Influencer Journey. Retrieved 2022-01-26.
