Data engineering is a software engineering approach to the building of data systems, to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science, which often involves machine learning.[1][2] Making the data usable usually involves substantial computing and storage, as well as data processing.
Around the 1970s/1980s the term information engineering methodology (IEM) was created to describe database design and the use of software for data analysis and processing.[3] These techniques were intended to be used by database administrators (DBAs) and by systems analysts based upon an understanding of the operational processing needs of organizations for the 1980s. In particular, these techniques were meant to help bridge the gap between strategic business planning and information systems. A key early contributor (often called the "father" of information engineering methodology) was the Australian Clive Finkelstein, who wrote several articles about it between 1976 and 1980, and also co-authored an influential Savant Institute report on it with James Martin.[4][5][6] Over the next few years, Finkelstein continued work in a more business-driven direction, which was intended to address a rapidly changing business environment; Martin continued work in a more data processing-driven direction. From 1983 to 1987, Charles M. Richter, guided by Clive Finkelstein, played a significant role in revamping IEM as well as helping to design the IEM software product (user data), which helped automate IEM.
In the early 2000s, the data and data tooling was generally held by the information technology (IT) teams in most companies.[7] Other teams then used data for their work (e.g. reporting), and there was usually little overlap in data skillset between these parts of the business.
In the early 2010s, with the rise of the internet, the massive increase in data volumes, velocity, and variety led to the term big data to describe the data itself, and data-driven tech companies like Facebook and Airbnb started using the phrase data engineer.[3][7] Due to the new scale of the data, major firms like Google, Facebook, Amazon, Apple, Microsoft, and Netflix started to move away from traditional ETL and storage techniques. They started creating data engineering, a type of software engineering focused on data, and in particular infrastructure, warehousing, data protection, cybersecurity, mining, modelling, processing, and metadata management.[3][7] This change in approach was particularly focused on cloud computing.[7] Data started to be handled and used by many parts of the business, such as sales and marketing, and not just IT.[7]
High-performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is dataflow programming, in which the computation is represented as a directed graph (dataflow graph); nodes are the operations, and edges represent the flow of data.[8] Popular implementations include Apache Spark, and the deep learning specific TensorFlow.[8][9][10] More recent implementations, such as Differential/Timely Dataflow, have used incremental computing for much more efficient data processing.[8][11][12]
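For illustration, the following is a minimal sketch (assuming PySpark is installed) of how such a dataflow graph is built: each transformation adds a node to a lazy execution plan, and nothing runs until an action such as sum() is invoked.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataflow-sketch").getOrCreate()

numbers = spark.sparkContext.parallelize(range(1_000_000))  # source node of the graph
squares = numbers.map(lambda x: x * x)                      # node: square each element
evens = squares.filter(lambda x: x % 2 == 0)                # node: keep only even values
total = evens.sum()                                         # action: triggers execution of the whole graph

print(total)
spark.stop()
```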
Data is stored in a variety of ways; one of the key deciding factors is how the data will be used. Data engineers optimize data storage and processing systems to reduce costs, using techniques such as data compression, partitioning, and archiving.
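As a hedged sketch of two of these techniques, the snippet below writes a small dataset partitioned by date and compressed, using pandas with the pyarrow engine; the dataset and column names are hypothetical.

```python
import pandas as pd

# Hypothetical event data; the column names are illustrative only.
events = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 14.50, 3.25],
})

# Write the data partitioned by date and compressed with Snappy
# (requires the pyarrow engine); old partitions can later be moved
# to cheaper archive storage.
events.to_parquet("events/", engine="pyarrow",
                  partition_cols=["event_date"], compression="snappy")
```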
If the data is structured and some form of online transaction processing is required, then databases are generally used.[13] Originally mostly relational databases were used, with strong ACID transaction correctness guarantees; most relational databases use SQL for their queries. However, with the growth of data in the 2010s, NoSQL databases also became popular, since they scale horizontally more easily than relational databases by giving up the ACID transaction guarantees, as well as reducing the object-relational impedance mismatch.[14] More recently, NewSQL databases, which attempt to allow horizontal scaling while retaining ACID guarantees, have become popular.[15][16][17][18]
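A minimal sketch of an ACID transaction, using Python's built-in sqlite3 module as a stand-in relational database (the table and column names are illustrative): either both updates are committed or neither is.

```python
import sqlite3

# A minimal sketch of an ACID transfer between two accounts; SQLite (from
# the standard library) stands in for a relational database here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    with conn:  # the with-block commits on success and rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")
except sqlite3.Error:
    print("transfer failed; neither update was applied")

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
conn.close()
```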
If the data is structured and online analytical processing is required (but not online transaction processing), then data warehouses are a main choice.[19] They enable data analysis, mining, and artificial intelligence on a much larger scale than databases can allow,[19] and indeed data often flows from databases into data warehouses.[20] Business analysts, data engineers, and data scientists can access data warehouses using tools such as SQL or business intelligence software.[20]
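The following hedged sketch shows the kind of analytical (OLAP-style) aggregation typically run against a warehouse; SQLite is used here only as a stand-in engine, and the sales table and its columns are hypothetical.

```python
import sqlite3

# SQLite stands in for a warehouse engine; the schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, revenue REAL);
    INSERT INTO sales VALUES ('EU', 'widget', 120.0), ('EU', 'gadget', 80.0),
                             ('US', 'widget', 200.0);
""")

# Aggregate revenue per region -- a typical analytical rollup query.
for region, total in conn.execute(
        "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"):
    print(region, total)
conn.close()
```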
A data lake is a centralized repository for storing, processing, and securing large volumes of data. A data lake can contain structured data from relational databases, semi-structured data, unstructured data, and binary data. A data lake can be created on premises or in a cloud-based environment using the services from public cloud vendors such as Amazon, Microsoft, or Google.
If the data is less structured, then it is often just stored as files. There are several options:
The number and variety of different data processes and storage locations can become overwhelming for users. This inspired the usage of a workflow management system (e.g. Airflow) to allow the data tasks to be specified, created, and monitored.[23] The tasks are often specified as a directed acyclic graph (DAG).[23]
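A minimal sketch of such a DAG, assuming a recent Apache Airflow 2.x installation; the DAG name, task names, and Python callables are illustrative only.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from a source system")

def load():
    print("write data to a target store")

with DAG(dag_id="example_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",
         catchup=False) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # edge in the DAG: extract runs before load
```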
Business objectives set by executives are defined in strategic business plans, refined in tactical business plans, and implemented in operational business plans. Most businesses today recognize the need for business plans that follow this hierarchy. It is often difficult to implement these plans because of a lack of transparency at the tactical and operational levels of organizations. This kind of planning requires feedback to allow for early correction of problems caused by miscommunication and misinterpretation of the business plan.
The design of data systems involves several components, such as architecting data platforms and designing data stores.[24][25]
Data modeling is the analysis and representation of data requirements for an organisation. It produces a data model—an abstract representation that organises business concepts and the relationships and constraints between them. The resulting artefacts guide communication between business and technical stakeholders and inform database design.[26][27]
A common convention distinguishes three levels of models: conceptual, logical, and physical.[26]
Approaches include entity–relationship (ER) modeling for operational systems,[28] dimensional modeling for analytics and data warehousing,[29] and the use of UML class diagrams to express conceptual or logical models in general-purpose modeling tools.[30]
Well-formed data models aim to improve data quality and interoperability by applying clear naming standards, normalisation, and integrity constraints.[27][26]
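As an illustration of dimensional modeling, the sketch below expresses a small star schema as DDL with explicit keys and integrity constraints; SQLite is used only as a stand-in target, and the table and column names are hypothetical.

```python
import sqlite3

# A small star schema: two dimension tables and one fact table referencing them.
ddl = """
CREATE TABLE dim_date (
    date_key     INTEGER PRIMARY KEY,
    full_date    TEXT NOT NULL
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL
);
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    quantity     INTEGER NOT NULL CHECK (quantity > 0),
    revenue      REAL    NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)  # create the dimensional model in the stand-in database
```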
A data engineer is a type of software engineer who creates big data ETL pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into insights.[31] They are focused on the production readiness of data and aspects such as formats, resilience, scaling, and security. Data engineers usually come from a software engineering background and are proficient in programming languages like Java, Python, Scala, and Rust.[32][3] They will be more familiar with databases, architecture, cloud computing, and Agile software development.[3]
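A minimal sketch of a single extract-transform-load (ETL) step using pandas and sqlite3; the file name, columns, and target table are hypothetical.

```python
import sqlite3
import pandas as pd

def run_etl(source_csv: str, target_db: str) -> None:
    # Extract: read raw records from the source file.
    raw = pd.read_csv(source_csv)

    # Transform: clean and reshape the data for downstream use.
    cleaned = raw.dropna(subset=["order_id"])
    cleaned["order_total"] = cleaned["quantity"] * cleaned["unit_price"]

    # Load: append the result into the target database table.
    with sqlite3.connect(target_db) as conn:
        cleaned.to_sql("orders", conn, if_exists="append", index=False)

run_etl("orders.csv", "analytics.db")
```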
Data scientists are more focused on the analysis of the data; they are more familiar with mathematics, algorithms, statistics, and machine learning.[3][33]