Data extraction

From Wikipedia, the free encyclopedia
Process in data storage

Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration). The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.
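The sequence described above (import, transformation, addition of metadata, export) can be sketched in a few lines. This is a minimal illustration and not a standard implementation: the "key: value" input layout, the function names, and the source name `example.txt` are all assumptions made for the example.

```python
import json
from datetime import datetime, timezone


def extract(raw_text):
    """Pull 'key: value' pairs out of loosely structured text (assumed layout)."""
    records = []
    for line in raw_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            records.append((key.strip(), value.strip()))
    return records


def transform(records):
    """Normalise keys to lowercase snake_case."""
    return {key.lower().replace(" ", "_"): value for key, value in records}


def add_metadata(data, source_name):
    """Wrap the data with provenance information before export."""
    return {
        "source": source_name,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }


def export(document):
    """Serialise the result for the next stage in the workflow."""
    return json.dumps(document, sort_keys=True)


# Hypothetical raw input; the second line has no separator and is skipped.
raw = "Sensor ID: A17\nnoise line without separator\nTemperature: 21.5\n"
document = add_metadata(transform(extract(raw)), source_name="example.txt")
exported = export(document)
```

Keeping each stage as a separate function mirrors the stages named in the text and lets any one of them be replaced without touching the others.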

Usually, the term data extraction is applied when (experimental) data is first imported into a computer from primary sources, like measuring or recording devices. Today's electronic devices will usually present an electrical connector (e.g. USB) through which 'raw data' can be streamed into a personal computer.

Data sources

Typical unstructured data sources include web pages, emails, documents, PDFs, social media, scanned text, mainframe reports, spool files, and multimedia files. Extracting data from these sources has grown into a considerable technical challenge. Whereas data extraction historically had to deal with changes in physical hardware formats, the majority of current data extraction deals with unstructured data sources and with differing software formats. Data extraction from the web is referred to as "Web data extraction" or "Web scraping".
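As a small illustration of web data extraction, the sketch below pulls link text and targets out of an HTML fragment using only Python's standard `html.parser` module. The sample markup and the `LinkExtractor` class name are assumptions made for the example; real pages are usually far less regular, and fetching them over the network is outside the scope of this sketch.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (link text, href) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _current_href as None and are ignored.
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((text, self._current_href))
            self._current_href = None


# Hypothetical page content used only for this example.
SAMPLE_HTML = """
<html><body>
  <h1>Reports</h1>
  <p>See <a href="/q1.pdf">Q1 report</a> and <a href="/q2.pdf">Q2 report</a>.</p>
</body></html>
"""

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
links = parser.links
```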

Imposing structure

The act of adding structure to unstructured data takes a number of forms:

  • Using text pattern matching, such as regular expressions, to identify small or large-scale structure, e.g. records in a report and their associated data from headers and footers;
  • Using a table-based approach to identify common sections within a limited domain, e.g. in emailed resumes, identifying skills, previous work experience, qualifications, etc. using a standard set of commonly used headings (these would differ from language to language); for instance, education might be found under Education/Qualification/Courses;
  • Using text analytics to attempt to understand the text and link it to other information.
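The first two approaches in the list above can be sketched as follows. Both the report layout and the heading table are hypothetical and chosen only to illustrate the idea: a regular expression picks record lines out of a report while skipping its header and footer, and a lookup table maps heading variants found in a resume to one canonical section name.

```python
import re

# Hypothetical report: a header line, "ID  NAME  AMOUNT" records, a footer line.
REPORT = """\
DAILY SALES REPORT   PAGE 1
1001  Widget      12.50
1002  Gadget       7.25
END OF REPORT
"""

# A record is: digits, whitespace, one word, whitespace, an amount with two decimals.
RECORD = re.compile(r"^(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<amount>\d+\.\d{2})$")


def extract_records(report):
    """Return one dict per line that matches the record pattern."""
    rows = []
    for line in report.splitlines():
        match = RECORD.match(line.strip())
        if match:
            rows.append({
                "id": int(match["id"]),
                "name": match["name"],
                "amount": float(match["amount"]),
            })
    return rows


# Table mapping heading variants to a canonical section name (assumed vocabulary).
HEADING_TABLE = {
    "education": "education",
    "qualification": "education",
    "qualifications": "education",
    "courses": "education",
    "skills": "skills",
    "work experience": "experience",
    "previous work experience": "experience",
}


def split_sections(text):
    """Group non-empty lines under the most recent recognised heading."""
    sections = {}
    current = None
    for line in text.splitlines():
        key = line.strip().rstrip(":").lower()
        if key in HEADING_TABLE:
            current = HEADING_TABLE[key]
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


# Hypothetical resume fragment.
RESUME = """\
Qualifications:
BSc Physics
Skills
Python, SQL
"""

records = extract_records(REPORT)
sections = split_sections(RESUME)
```

A separate heading table would be needed for each language, since the headings themselves differ from language to language.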

See also

  • Data mining, discovery of patterns in large data sets using statistics, database knowledge or machine learning
  • Data retrieval, obtaining data from a database management system, often using a query with a set of criteria
  • Extract, transform, load (ETL), procedure for copying data from one or more sources, transforming the data at the source system, and copying into a destination system
  • Information extraction, automated extraction of structured information from unstructured or semi-structured machine-readable data,[1] for example using natural language processing to extract content from images, audio or documents

References

  1. ^ Hartley, Miranda. "Using AI to Extract Unstructured Data From PDFs: Benefits & Considerations". Evolution AI. Retrieved 20 November 2024.
Retrieved from "https://en.wikipedia.org/w/index.php?title=Data_extraction&oldid=1299307341"