Databricks is a unified data analytics platform built on Apache Spark. It provides a collaborative environment for data science, data engineering, and business analytics. Here are some fundamental concepts in Databricks:
- Lakehouse Platform
- Databricks Lakehouse Platform combines the best of data lakes and data warehouses.
- It offers the scalability and flexibility of data lakes with the reliability, governance, and performance of data warehouses.
- It enables a single platform for all data workloads, including ETL/ELT, streaming analytics, data science, and machine learning.
- Key components include cloud storage, Delta Lake, Unity Catalog for governance, and an AI engine.
- Workspace
- A Databricks workspace is a collaborative environment for teams to access Databricks assets.
- It provides a web-based interface for notebooks, libraries, data, and other resources.
- Organizations can have multiple workspaces for different teams or environments.
- Clusters
- A cluster is a set of computation resources (virtual machines) where notebooks and jobs are executed.
- All-Purpose Clusters are interactive clusters that you create and terminate manually; they can be shared by multiple users for collaborative analysis.
- Job Clusters are created by the Databricks job scheduler to run a specific job and are terminated upon completion.
- Pools are sets of idle instances that reduce cluster start and auto-scaling times.
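To make the cluster concept concrete, here is a minimal sketch of creating an all-purpose cluster programmatically with the Databricks SDK for Python. The cluster name, runtime version, and instance type are placeholder assumptions, and exact method names may vary by SDK version:

```python
# Sketch: create an all-purpose cluster with the Databricks SDK for Python.
# Assumes the databricks-sdk package is installed and authentication is set up
# (e.g. DATABRICKS_HOST / DATABRICKS_TOKEN environment variables).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

cluster = w.clusters.create_and_wait(
    cluster_name="demo-all-purpose",      # hypothetical name
    spark_version="14.3.x-scala2.12",     # a Databricks Runtime version string
    node_type_id="i3.xlarge",             # cloud-specific VM instance type (placeholder)
    num_workers=2,
    autotermination_minutes=30,           # terminate after 30 idle minutes
)
print(cluster.cluster_id)
```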
- Databricks Units (DBUs)
- DBUs are the unit of processing capacity used for billing in Databricks.
- Billing is based on DBUs consumed per hour; the consumption rate depends on the VM instance type used by the cluster, and the price per DBU depends on the workload type and pricing tier.
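As a back-of-the-envelope illustration of how DBU billing composes (all rates below are made-up placeholders, not real Databricks or cloud pricing):

```python
# Illustrative arithmetic only -- the rates are placeholders, not actual pricing.
dbu_per_hour = 0.75    # DBU consumption rate for a hypothetical instance type
price_per_dbu = 0.40   # USD per DBU for a hypothetical workload/tier
hours = 8              # cluster uptime
num_nodes = 3          # driver + workers

databricks_cost = dbu_per_hour * price_per_dbu * hours * num_nodes
print(f"Estimated DBU cost: ${databricks_cost:.2f}")  # cloud VM cost is billed separately
```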
- Databricks Runtime
- The Databricks Runtime is a set of core components that run on the clusters.
- It includes Apache Spark and additional components and optimizations for usability, performance, and security.
- Databricks Runtime for Machine Learning includes pre-built machine learning libraries like TensorFlow, Keras, PyTorch, and XGBoost.
- Delta Lake
- Delta Lake is an open-source storage layer that provides ACID transactions and scalable metadata handling to data lakes.
- It extends Parquet data files with a transaction log, ensuring data reliability and consistency.
- It supports features like time travel (data versioning), schema evolution, and unified batch and streaming data processing.
- Delta Lake is the default table format in Databricks.
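Here is a small PySpark sketch of writing a Delta table and using time travel to read an earlier version; the schema and table names are hypothetical:

```python
# Sketch: Delta table writes and time travel on Databricks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in a notebook

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Delta is the default table format on Databricks, so saveAsTable writes Parquet
# data files plus a transaction log.
df.write.mode("overwrite").saveAsTable("demo.users")

# Appending rows records a new version in the transaction log.
spark.createDataFrame([(3, "carol")], ["id", "name"]) \
    .write.mode("append").saveAsTable("demo.users")

# Time travel: query the table as it was at an earlier version.
spark.sql("SELECT * FROM demo.users VERSION AS OF 0").show()
```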
- Databricks SQL
- Databricks SQL is a serverless data warehouse on the Databricks Lakehouse Platform.
- It allows running SQL queries and BI applications at scale with optimized price/performance.
- It offers a unified governance model through Unity Catalog and supports open formats and standard ANSI SQL.
- It provides a SQL editor and integrates with various BI tools.
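For example, a SQL warehouse can be queried from outside the workspace with the databricks-sql-connector package. The hostname, HTTP path, and token below are placeholders, and the sample table is the demo dataset shipped with many workspaces:

```python
# Sketch: querying a Databricks SQL warehouse with databricks-sql-connector.
from databricks import sql

with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",  # placeholder workspace hostname
    http_path="/sql/1.0/warehouses/xxxxxxxxxxxxxxxx",     # placeholder warehouse HTTP path
    access_token="<personal-access-token>",               # placeholder credential
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM samples.nyctaxi.trips LIMIT 5")
        for row in cursor.fetchall():
            print(row)
```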
- Unity Catalog
- Unity Catalog is a unified governance solution for data and AI assets across Databricks workspaces.
- It provides centralized metadata management, access control, auditing, data discovery, and data lineage.
- It allows for managing permissions on data and AI objects in a consistent way.
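A quick sketch of Unity Catalog's three-level namespace (catalog.schema.table) and SQL-based grants, run from a notebook where `spark` is predefined; the object and group names are hypothetical:

```python
# Sketch: Unity Catalog objects and permissions via SQL.
spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.sales.orders (
        order_id BIGINT,
        amount DOUBLE
    )
""")

# Grant read access on the table to a workspace group.
spark.sql("GRANT SELECT ON TABLE analytics.sales.orders TO `data-analysts`")
```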
- Notebooks
- Databricks notebooks are a web-based interface for creating and running code (Python, SQL, Scala, R).
- They facilitate collaboration, allowing multiple users to work on the same notebook.
- Notebooks can include code, visualizations, and markdown for documentation.
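A typical notebook cell might look like the sketch below; `spark`, `dbutils`, and `display()` are provided automatically in the notebook environment, and the table name is hypothetical:

```python
# Sketch: a Databricks notebook cell with a parameter widget and rich output.
dbutils.widgets.text("table_name", "analytics.sales.orders")  # notebook parameter

df = spark.table(dbutils.widgets.get("table_name"))
display(df.limit(10))  # renders an interactive table/chart in the notebook
```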
- Jobs
- Databricks Jobs allow you to run notebooks or other tasks (like Spark JARs or Python scripts) in an automated and scheduled manner.
- Jobs can be configured with dependencies and can be monitored through the Databricks UI or API.
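As an illustration, the sketch below defines a scheduled job with two dependent notebook tasks using the Databricks SDK for Python; the notebook paths, cluster ID, and cron schedule are placeholders, and exact class names may differ by SDK version:

```python
# Sketch: a two-task job with a dependency and a nightly schedule.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="nightly-etl",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest"),
            existing_cluster_id="<cluster-id>",  # or define a new job cluster instead
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="ingest")],
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/transform"),
            existing_cluster_id="<cluster-id>",
        ),
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",    # 02:00 daily
        timezone_id="UTC",
    ),
)
print(job.job_id)
```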
- Workflows
- The Workflows UI provides tools for orchestrating and scheduling data pipelines, including Jobs and Delta Live Tables (DLT) pipelines.
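For the DLT side of Workflows, a pipeline is defined declaratively in a notebook; the sketch below uses hypothetical table names, and the `dlt` module is only available when the notebook runs inside a DLT pipeline:

```python
# Sketch: a minimal Delta Live Tables pipeline definition.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events loaded from an existing table")
def raw_events():
    return spark.read.table("demo.raw_events")

@dlt.table(comment="Events with a valid customer id")
def clean_events():
    return dlt.read("raw_events").where(col("customer_id").isNotNull())
```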
- Data Ingestion
- Databricks supports data ingestion from various sources, including cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage), relational databases, open file and table formats in data lakes (Delta Lake, Parquet, Avro), and streaming platforms (Apache Kafka).
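A common ingestion pattern is Auto Loader, the `cloudFiles` streaming source, which incrementally picks up new files from cloud storage. The paths and table name below are placeholders:

```python
# Sketch: incremental file ingestion with Auto Loader into a Delta table.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/demo/raw/_schemas/events")
    .load("s3://my-bucket/landing/events/")   # hypothetical source path
)

query = (
    raw.writeStream
    .option("checkpointLocation", "/Volumes/demo/raw/_checkpoints/events")
    .trigger(availableNow=True)               # process all new files, then stop
    .toTable("demo.raw_events")               # target Delta table
)
query.awaitTermination()
```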
- Feature Store
- A centralized repository for data scientists to find and share features for machine learning models.
- It ensures consistent computation of feature values for both model training and inference.
- In Unity Catalog-enabled workspaces, any Delta table with a primary key can serve as a feature table.
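A minimal sketch of registering a feature table with the Feature Engineering client in a Unity Catalog-enabled workspace; the catalog, schema, and column names are hypothetical:

```python
# Sketch: creating a feature table keyed by customer_id.
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

customer_features = spark.table("demo.raw_events").groupBy("customer_id").count()

fe.create_table(
    name="analytics.ml.customer_features",  # catalog.schema.table
    primary_keys=["customer_id"],
    df=customer_features,
    description="Event counts per customer",
)
```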
These concepts provide a foundation for understanding and utilizing the capabilities of the Databricks platform for various data-related tasks.