Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration). The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.
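The sequence described above (import, transformation, addition of metadata, export) can be illustrated with a minimal Python sketch. The input lines, the field names, the `sensor-log` source label, and the function names are all hypothetical and chosen only for illustration; real extraction systems vary widely.

```python
import json
import re
from datetime import datetime

# Hypothetical raw input: loosely formatted log lines from a measuring device.
RAW_LINES = [
    "temp=21.5C at 2023-08-01 10:00",
    "garbage line",
    "temp=22.1C at 2023-08-01 10:05",
]

# Assumed line layout; lines that do not match are treated as noise.
LINE_PATTERN = re.compile(
    r"temp=(?P<value>-?\d+(?:\.\d+)?)C at (?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2})"
)


def extract(lines):
    """Pull the matching fields out of each raw line, skipping non-matching lines."""
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match:
            yield match.groupdict()


def transform(record):
    """Convert the extracted strings into typed, normalized values."""
    measured = datetime.strptime(record["when"], "%Y-%m-%d %H:%M")
    return {
        "celsius": float(record["value"]),
        "measured_at": measured.isoformat(),
    }


def add_metadata(record, source):
    """Attach metadata describing where the record came from."""
    return {**record, "source": source}


def run_pipeline(lines, source="sensor-log"):
    """Extract, transform, and annotate every usable line."""
    return [add_metadata(transform(r), source) for r in extract(lines)]


if __name__ == "__main__":
    # Export step: serialize the structured records for the next workflow stage.
    print(json.dumps(run_pipeline(RAW_LINES), indent=2))
```

Keeping each stage in its own function means that a change in the source format only affects `extract`, while the later stages can stay as they are.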
Usually, the term data extraction is applied when (experimental) data is first imported into a computer from primary sources, like measuring or recording devices. Today's electronic devices will usually present an electrical connector (e.g. USB) through which 'raw data' can be streamed into a personal computer.
Typical unstructured data sources include web pages, emails, documents, PDFs, social media, scanned text, mainframe reports, spool files, and multimedia files. Extracting data from these sources has grown into a considerable technical challenge. Whereas historically data extraction had to deal with changes in physical hardware formats, the majority of current data extraction deals with unstructured data sources and with differing software formats. Extracting data from the web in particular is referred to as "Web data extraction" or "Web scraping".
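As a small illustration of web data extraction, the following sketch pulls hyperlinks and their visible text out of an HTML fragment using only Python's standard-library `html.parser` module. The sample markup, the `LinkExtractor` class, and the `extract_links` helper are hypothetical. A real scraper would also need to fetch pages, cope with malformed markup, and respect each site's terms of use.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href and visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"href": self._href, "text": text})
            self._href = None


# Hypothetical page fragment used as the unstructured source.
SAMPLE_HTML = """
<html><body>
<p>See <a href="/reports/2023.pdf">the 2023 report</a> and
<a href="https://example.org/data">raw <b>data</b></a>.</p>
</body></html>
"""


def extract_links(html):
    """Return a list of {'href', 'text'} records found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML):
        print(link["href"], "->", link["text"])
```

The result is a list of uniform records, which is the kind of structure that later processing or storage steps can rely on.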
The act of adding structure to unstructured data takes a number of forms