Data Modeling Process
Key Steps
RELEVANT FOR
DATA SCIENCE
- Identify the entities
- Identify the key property of each entity
- Draw a rough draft
- Identify the data attributes that need to be incorporated into the model
- Map the attributes to the entities
- Finalize and validate the data model (refine it)
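The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Entity class, the "ref:<Entity>" convention for relationships, and the Customer/Order example are all hypothetical and not part of any standard modeling tool.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str                      # step 1: the entity
    key: str                       # step 2: its key property
    attributes: dict[str, str] = field(default_factory=dict)  # steps 4-5: attribute -> type


def validate(entities: list[Entity]) -> list[str]:
    """Step 6: return the problems found in the draft model (empty list if none)."""
    problems = []
    names = {e.name for e in entities}
    for e in entities:
        # The key property must itself be a mapped attribute.
        if e.key not in e.attributes:
            problems.append(f"{e.name}: key '{e.key}' is not a mapped attribute")
        for attr, type_ in e.attributes.items():
            # "ref:<Entity>" is this sketch's own convention for a relationship.
            if type_.startswith("ref:") and type_[4:] not in names:
                problems.append(f"{e.name}.{attr}: unknown entity '{type_[4:]}'")
    return problems


# Step 3: a rough draft with two hypothetical entities.
customer = Entity("Customer", key="customer_id",
                  attributes={"customer_id": "int", "name": "str"})
order = Entity("Order", key="order_id",
               attributes={"order_id": "int", "customer_id": "ref:Customer", "total": "float"})

problems = validate([customer, order])
print(problems or "model is valid")
```

Refining the model then means fixing whatever validate reports and running it again until the list is empty.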
Data Engineering Process
Consists of working with multiple types of data and performing many operations on them using scripting or coding.
Types of Data
- Structured: data from table-based source systems, such as a relational database or a CSV file
- Semi-structured: data such as JSON files
- Unstructured: data stored as key-value pairs that do not follow standard relational models, as well as PDFs, documents and images
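A small sketch of how the three types look from code, using only the Python standard library. The sample records are made up for illustration.

```python
import csv
import io
import json

# Structured: every row has the same columns, fixed by the header.
csv_text = "id,name,city\n1,Ana,Lima\n2,Ben,Oslo\n"
structured = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records are self-describing, fields can differ or nest.
json_text = (
    '[{"id": 1, "name": "Ana", "tags": ["vip"]},'
    ' {"id": 2, "name": "Ben", "address": {"city": "Oslo"}}]'
)
semi_structured = json.loads(json_text)

# Unstructured: just bytes; any structure has to be inferred by a parser.
unstructured = b"%PDF-1.7 raw bytes with no schema"

csv_columns = [set(row) for row in structured]
json_fields = [set(record) for record in semi_structured]

print("CSV rows share one schema:", csv_columns[0] == csv_columns[1])
print("JSON records share one schema:", json_fields[0] == json_fields[1])
```

The CSV rows always expose the same columns, while the two JSON records carry different fields, which is why semi-structured data usually needs a flattening step before analysis.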
Data Operations
- Data Integration: establishing links between operational and analytical services and data sources
- Data Transformation: transforming operational data into a structure and format suitable for analysis, usually as part of an ETL process or, as a variation, an ELT process in which the data is loaded first and then transformed using big data processing
- Data Consolidation: combining data from multiple data sources into a consistent structure, usually stored in a data lake or data warehouse
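The three operations can be sketched as a tiny ETL run. Everything here is an assumption made for illustration: the two in-memory sources, their column names, and the use of an in-memory SQLite database as a stand-in for a data warehouse.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical operational sources with different shapes.
orders_csv = "order_id,customer,amount\n1,ana,10.50\n2,ben,4.00\n"
orders_json = '[{"id": 3, "customer": "Ana", "amount_cents": 725}]'


def extract():
    """Integration: read from each source."""
    csv_rows = list(csv.DictReader(io.StringIO(orders_csv)))
    json_rows = json.loads(orders_json)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Transformation: bring both sources to one (order_id, customer, amount) shape."""
    unified = [(int(r["order_id"]), r["customer"].title(), float(r["amount"]))
               for r in csv_rows]
    unified += [(r["id"], r["customer"].title(), r["amount_cents"] / 100)
                for r in json_rows]
    return unified


def load(rows, conn):
    """Consolidation: store the unified rows in a single table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(*extract()), conn)

totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Once the rows share one structure, a single SQL query can answer questions that span both original sources.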
Tools:
- SQL
- Python, R, Java and others
Key Concepts:
------------------------------------------------------------------------------------------------
- Operational data: usually transactional data that is generated and stored by applications.
- Analytical data: data that has been optimized for analysis and reporting, often in a data warehouse.
- Streaming data: perpetual sources of data that generate data values in real-time.
- Data pipelines: used to orchestrate activities that transfer and transform data.
- Data lake: a storage repository that holds large amounts of data in native, raw formats (files).
- Data warehouse: a centralized repository of integrated data from one or more disparate sources.
- Apache Spark: a parallel processing framework that takes advantage of in-memory processing and distributed file storage. It is a common open-source software (OSS) tool for big data scenarios.
------------------------------------------------------------------------------------------------
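To make the streaming data and data pipeline concepts concrete, here is a minimal sketch using Python generators. The sensor source, the step functions and run_pipeline are hypothetical names chosen for this example; real pipelines are normally built with an orchestration service rather than plain functions.

```python
import itertools
import random


def sensor_stream(seed=0):
    """Hypothetical perpetual source: yields one reading at a time, forever."""
    rng = random.Random(seed)
    for i in itertools.count():
        yield {"reading_id": i, "celsius": round(rng.uniform(15, 30), 1)}


def only_warm(events, threshold_c=20.0):
    """Pipeline step: keep readings at or above the threshold."""
    for e in events:
        if e["celsius"] >= threshold_c:
            yield e


def to_fahrenheit(events):
    """Pipeline step: add a derived column to each reading."""
    for e in events:
        yield {**e, "fahrenheit": round(e["celsius"] * 9 / 5 + 32, 1)}


def run_pipeline(source, steps):
    """Chain the steps in order; nothing runs until values are requested."""
    data = source
    for step in steps:
        data = step(data)
    return data


pipeline = run_pipeline(sensor_stream(), [only_warm, to_fahrenheit])

# The source never ends, so take a bounded slice of the output.
first_three = list(itertools.islice(pipeline, 3))
for event in first_three:
    print(event)
```

Because the source is perpetual, the pipeline processes values as they arrive instead of waiting for a complete dataset, which is the main difference from batch processing.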
Data Engineer: the primary role responsible for integrating, transforming, and consolidating data from various structured and unstructured data systems into structures that are suitable for building analytics solutions.
Data Management: process
ETL: extract, transform, and load process.
ELT: extract, load, and transform.
SQL: Structured Query Language
NoSQL: "Not Only SQL" database.
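The difference between ETL and ELT is the order of the last two steps. The sketch below shows the ELT order, with an in-memory SQLite database standing in for the analytical store: raw text rows are loaded untouched, and the transformation is then done inside the store with SQL. The table names and sample rows are assumptions made for this example.

```python
import sqlite3

# Extract: raw rows exactly as they came from a hypothetical source (all text).
raw_rows = [("1", " ana ", "10.50"), ("2", "BEN", "4.00"), ("3", "Ana", "7.25")]

conn = sqlite3.connect(":memory:")

# Load: store the raw data first, without cleaning it.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: clean and type the data inside the store, using SQL.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           LOWER(TRIM(customer))     AS customer,
           CAST(amount AS REAL)      AS amount
    FROM raw_orders
    """
)

result = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(result)
```

In an ETL process the cleaning would happen in application code before the insert; in ELT the raw table is kept, so the transformation can be rerun or changed later without extracting the data again.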