Data Engineering Process
RELEVANT FOR DATA SCIENCE
Data engineering consists of working with multiple types of data and performing operations on them through scripting or coding. The data engineer is the primary role responsible for integrating,
transforming, and consolidating data from various structured and
unstructured data systems into structures that are suitable for building
analytics solutions.
Types of Data
- Structured: table-based source systems, such as a relational database or a CSV file
- Semi-structured: JSON
- Unstructured: data that does not follow a standard relational model, such as key-value pairs, PDFs, documents, and images
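To make the three types above concrete, here is a minimal sketch using only the Python standard library. The sample values (`csv_text`, `json_text`, `blob`) are hypothetical and exist only for illustration.

```python
import csv
import io
import json

# Structured: every row follows the same fixed columns (hypothetical sample).
csv_text = "id,name,city\n1,Ana,Lima\n2,Ben,Oslo\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records share a general shape, but fields can be
# nested or missing from one record to the next.
json_text = '[{"id": 1, "name": "Ana", "tags": ["vip"]}, {"id": 2, "name": "Ben"}]'
records = json.loads(json_text)

# Unstructured: raw bytes with no schema. Without a format-specific parser,
# only metadata such as the size is directly available.
blob = b"%PDF-1.7 placeholder binary content"

print(rows[0]["city"])              # column access works for every row
print(records[1].get("tags", []))   # optional field needs a default
print(len(blob))                    # size in bytes
```

The structured rows can be read by column name without checks, while the semi-structured records need a default for the missing `tags` field.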
Data Operations
- Data Integration: establishing links between operational and analytical services and data sources
- Data Transformation: converting operational data into a structure and format suitable for analysis, using an ETL process or, for big data processing, its ELT variation
- Data Consolidation: combining data from multiple data sources into a consistent structure, usually in a store such as a data lake or data warehouse
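The three operations above can be sketched as three small functions. This is only an illustration under stated assumptions: the two sources (a CSV export and a JSON feed), the function names, and the field names are all hypothetical.

```python
import csv
import io
import json


def integrate():
    """Data integration: read from two hypothetical sources."""
    crm_csv = "customer_id,name\n1,Ana\n2,Ben\n"
    orders_json = (
        '[{"customer_id": 1, "amount": "19.90"},'
        ' {"customer_id": 1, "amount": "5.10"},'
        ' {"customer_id": 2, "amount": "7.00"}]'
    )
    customers = list(csv.DictReader(io.StringIO(crm_csv)))
    orders = json.loads(orders_json)
    return customers, orders


def transform(customers, orders):
    """Data transformation: cast text values to types suitable for analysis."""
    customers = [
        {"customer_id": int(c["customer_id"]), "name": c["name"]} for c in customers
    ]
    orders = [
        {"customer_id": o["customer_id"], "amount": float(o["amount"])} for o in orders
    ]
    return customers, orders


def consolidate(customers, orders):
    """Data consolidation: combine both sources into one consistent structure."""
    totals = {}
    for o in orders:
        totals[o["customer_id"]] = totals.get(o["customer_id"], 0.0) + o["amount"]
    return [
        {
            "customer_id": c["customer_id"],
            "name": c["name"],
            "total_spent": round(totals.get(c["customer_id"], 0.0), 2),
        }
        for c in customers
    ]


customers, orders = integrate()
customers, orders = transform(customers, orders)
report = consolidate(customers, orders)
for line in report:
    print(line)
```

In a real pipeline the consolidated rows would be written to a data lake or data warehouse rather than printed.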
Tools:
- SQL
- Python, R, Java and others
Key Steps (data modeling)
- Identify the entities
- Identify a key property
- Draw a rough draft
- Identify the various data attributes that need to be incorporated into the model
- Map the attributes
- Finalize and validate the data model (refine it)
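As a rough illustration of the steps above, the sketch below models two hypothetical entities (`Customer` and `Order`), gives each a key property, maps attributes to them, and runs a simple validation. The entity names, attributes, and the `validate` helper are assumptions made for this example, not part of any particular methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: int   # key property
    name: str          # attribute
    email: str         # attribute


@dataclass(frozen=True)
class Order:
    order_id: int      # key property
    customer_id: int   # attribute mapped to the Customer key
    amount: float      # attribute


def validate(customers, orders):
    """Return a list of problems found in the data (empty means valid)."""
    problems = []
    ids = [c.customer_id for c in customers]
    if len(ids) != len(set(ids)):
        problems.append("duplicate customer_id")
    known = set(ids)
    for o in orders:
        if o.customer_id not in known:
            problems.append(
                f"order {o.order_id} references unknown customer {o.customer_id}"
            )
    return problems


customers = [Customer(1, "Ana", "ana@example.com")]
orders = [Order(10, 1, 19.9), Order(11, 2, 5.0)]
print(validate(customers, orders))
# prints ['order 11 references unknown customer 2']
```

Checking key uniqueness and references between entities is one simple way to carry out the final validation step before refining the model.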
Key Concepts:
-------------------------------------------------------------------
Operational data: usually transactional data that is generated and stored by applications.
Analytical data: data that has been optimized for analysis and reporting, often in a data warehouse.
Streaming data: perpetual sources of data that generate data values in real-time.
Data pipelines: used to orchestrate activities that transfer and transform data.
Data lake: a storage repository that holds large amounts of data in native, raw formats, typically as files.
Data warehouse: a centralized repository of integrated data from one or more disparate sources.
Apache Spark: a parallel processing framework that takes advantage of in-memory processing and distributed file storage. It is a common open-source software (OSS) tool for big data scenarios.
Data Management: process
ETL: extract, transform, and load process.
ELT: extract, load, and transform.
SQL: Structured Query Language
NoSQL: Not Only SQL database.
---------------------------------------------------------------------
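The difference between the ETL and ELT entries above is where the transform step runs. The sketch below contrasts the two using an in-memory SQLite database from the Python standard library as a stand-in for the target store. The table names and the `raw` sample rows are hypothetical.

```python
import sqlite3

# Hypothetical extracted rows: everything arrives as untrimmed text.
raw = [("1", " Ana ", "19.90"), ("2", "Ben", "7.00")]

conn = sqlite3.connect(":memory:")

# ETL: transform in application code first, then load the cleaned rows.
conn.execute("CREATE TABLE sales_etl (id INTEGER, name TEXT, amount REAL)")
cleaned = [(int(i), n.strip(), float(a)) for i, n, a in raw]
conn.executemany("INSERT INTO sales_etl VALUES (?, ?, ?)", cleaned)

# ELT: load the raw rows as they are, then transform inside the store with SQL.
conn.execute("CREATE TABLE sales_raw (id TEXT, name TEXT, amount TEXT)")
conn.executemany("INSERT INTO sales_raw VALUES (?, ?, ?)", raw)
conn.execute(
    "CREATE TABLE sales_elt AS "
    "SELECT CAST(id AS INTEGER) AS id, "
    "TRIM(name) AS name, "
    "CAST(amount AS REAL) AS amount "
    "FROM sales_raw"
)

etl_rows = conn.execute("SELECT id, name, amount FROM sales_etl ORDER BY id").fetchall()
elt_rows = conn.execute("SELECT id, name, amount FROM sales_elt ORDER BY id").fetchall()
print(etl_rows)
print(elt_rows)
```

Both paths end with the same cleaned rows. With ELT the raw table is also kept in the store, which is one reason the approach is common when a data lake or warehouse can do the processing itself.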
