
Databricks Interview Questions
❓ Overview
Databricks is a cloud-based data and AI platform built on Apache Spark that provides a unified environment for data engineering, analytics, and machine learning.
It implements the Lakehouse architecture, combining the capabilities of data lakes and data warehouses in a single platform.
Databricks is a fully managed service that simplifies big data processing, collaboration, governance, and AI workloads at scale.
+------------------------+
|  Databricks Workspace  |
+-----------+------------+
            |
            v
+------------------------+
|     Compute Layer      |
|  (Clusters / SQL WH)   |
+-----------+------------+
            |
    +-------+-------+
    |               |
    v               v
+-------------------+   +-------------------+
|    Delta Lake     |   |   Unity Catalog   |
| (Storage & ACID)  |   |   (Governance)    |
+-------------------+   +-------------------+
🔷 Core Components
1️⃣ Compute (Clusters & SQL Warehouses)
The Compute layer provides scalable processing power for notebooks, pipelines, and SQL analytics.
- Interactive Clusters : Used for development and exploration.
- Job Clusters : Used for scheduled pipelines and batch processing.
- SQL Warehouses : Optimized compute for BI and SQL queries.
2️⃣ Delta Lake (Storage Layer)
Delta Lake is an open storage layer that adds reliability and performance to data lakes.
- ACID Transactions : Ensures data consistency and reliability.
- Time Travel : Query historical versions of data.
- Schema Enforcement & Evolution : Handles schema changes safely.
- Efficient Storage : Built on Parquet with transaction logs.
3️⃣ Workspace & Notebooks
Databricks Workspace provides collaborative notebooks for multiple users.
- Supports Python, SQL, Scala, and R.
- Used by data engineers, analysts, and data scientists.
- Enables real-time collaboration and version control.
4️⃣ Unity Catalog
Unity Catalog is the centralized governance and metadata management layer.
- Manages catalogs, schemas, and tables.
- Provides fine-grained access control.
- Tracks data lineage and auditing.
5️⃣ Jobs & Workflows
Jobs and Workflows are used to orchestrate and schedule data pipelines.
- Supports multi-task pipelines.
- Handles retries, dependencies, and monitoring.
- Can integrate with Airflow and CI/CD tools.
6️⃣ Databricks SQL
Databricks SQL provides a SQL analytics engine and dashboards.
- Used for BI and reporting.
- Integrates with Power BI, Tableau, and Looker.
- Optimized with the Photon engine.
7️⃣ ML & AI Tools
- MLflow : Experiment tracking and model lifecycle management.
- Feature Store : Centralized feature management.
- Model Serving : Deploy models as APIs.
🚀 Execution Flow
Typical Databricks Pipeline Flow:
- Data Ingestion : Load raw data into Delta Lake (Bronze layer).
- Data Processing : Transform data into Silver and Gold tables.
- Orchestration : Schedule pipelines using Jobs or Workflows.
- Analytics : Query using Databricks SQL or BI tools.
- ML Training : Train and deploy models using MLflow.
❓ What problems does Databricks solve compared to traditional data warehouses?
Traditional data warehouses are optimized for structured analytics but have limitations in handling big data, AI, and unstructured workloads. Databricks solves these limitations using the Lakehouse architecture.
🔷 Problems with Traditional Data Warehouses ⚠️
- Limited Data Types : Mainly support structured data, poor support for semi-structured and unstructured data.
- High Cost : Expensive storage and compute with vendor lock-in.
- Poor ML Support : Not designed for machine learning or AI workloads.
- Data Silos : Separate systems needed for data lake, warehouse, and ML platforms.
- Scalability Issues : Scaling storage and compute independently is difficult.
✅ How Databricks Solves These Problems
- Unified Lakehouse : Combines data lake and warehouse in one platform.
- Open Data Formats : Uses Delta and Parquet instead of proprietary formats.
- Cost Efficient : Separation of compute and storage reduces cost.
- AI & ML Native : Built-in MLflow, Feature Store, and model serving.
- Scalable & Cloud Native : Easily scales to petabytes on cloud storage.
❓ What is Lakehouse Architecture?
Lakehouse architecture is a modern data architecture that unifies the capabilities of data lakes and data warehouses into a single platform, enabling analytics, streaming, and machine learning on the same data.
🔷 Why Lakehouse Was Needed ⚠️
- Data Lakes : Cheap storage but no transactions, governance, or fast BI queries.
- Data Warehouses : Fast analytics but expensive and limited to structured data.
- Data Silos : Separate systems for raw data, analytics, and ML pipelines.
✅ Key Principles of Lakehouse Architecture
- Unified Storage : All data stored in open formats like Delta or Parquet on cloud storage.
- ACID Transactions : Ensures data consistency, reliability, and concurrent reads/writes.
- Schema Enforcement & Evolution : Validates schemas and supports schema changes safely.
- Separation of Compute and Storage : Compute scales independently from storage, reducing cost.
- Data Governance : Centralized metadata, lineage, and access control (e.g., Unity Catalog).
- BI + ML on Same Data : Supports SQL analytics, streaming, and machine learning without data movement.
🔷 Core Components of Lakehouse Architecture
- Open Storage Layer : Cloud object storage (S3, ADLS, GCS) with Delta/Parquet files.
- Compute Engines : Distributed compute engines like Databricks clusters or SQL warehouses.
- Metadata & Governance : Catalogs, schemas, permissions, and lineage tracking.
- Query & Analytics Layer : SQL engines and BI tools for dashboards and reporting.
- ML & Streaming Layer : Tools for machine learning, feature stores, and real-time pipelines.
🚀 Benefits of Lakehouse Architecture
- Cost Efficient : Uses cheap cloud storage instead of expensive warehouse storage.
- Scalable : Handles petabyte-scale data with elastic compute.
- Unified Platform : No need for separate systems for BI, ETL, and ML.
- Open & Flexible : Avoids vendor lock-in using open file formats.
- Real-Time & Batch : Supports both streaming and batch workloads.
❓ What is Medallion Architecture?
Medallion architecture is a data modeling and pipeline design pattern used in Lakehouse platforms to organize data into layered stages that improve data quality, reliability, and analytics readiness.
🔷 Why Medallion Architecture is Needed ⚠️
- Raw Data is Messy : Ingested data is often incomplete, duplicate, or inconsistent.
- Data Quality Issues : Analytics and ML require clean and standardized data.
- Pipeline Maintainability : Complex transformations become hard to manage without layered design.
✅ Core Layers of Medallion Architecture 🥉🥈🥇
1️⃣ Bronze Layer (Raw Data) 🥉
- Stores raw ingested data from source systems.
- Minimal transformations applied.
- Used for auditing and reprocessing.
2️⃣ Silver Layer (Cleaned & Enriched Data) 🥈
- Data is cleaned, deduplicated, and standardized.
- Business rules and transformations applied.
- Used for intermediate analytics and ML features.
3️⃣ Gold Layer (Business-Ready Data) 🥇
- Aggregated and curated data for BI and reporting.
- Contains fact and dimension tables.
- Used by dashboards and business users.
🔷 Key Principles of Medallion Architecture
- Layered Transformation : Data flows from raw to refined to business-ready.
- Incremental Processing : Supports streaming and batch updates.
- Data Quality Improvement : Each layer improves data accuracy and consistency.
- Reproducibility : Raw data preserved for reprocessing.
🚀 Benefits of Medallion Architecture
- Improved Data Quality : Structured data refinement pipeline.
- Scalable Design : Works well for big data pipelines.
- Easy Maintenance : Clear separation of responsibilities per layer.
- Supports BI & ML : Gold and Silver layers feed analytics and ML models.
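The layered flow above can be sketched with plain Python collections standing in for Delta tables — a conceptual toy with illustrative table and column names, not a real pipeline:

```python
# Bronze: raw, messy ingested records (note the duplicate and string amounts)
bronze = [
    {"order_id": 1, "amount": "100", "country": "US"},
    {"order_id": 1, "amount": "100", "country": "US"},  # duplicate row
    {"order_id": 2, "amount": "250", "country": "IN"},
]

# Silver: deduplicate by key and standardize types
seen, silver = set(), []
for row in bronze:
    if row["order_id"] not in seen:
        seen.add(row["order_id"])
        silver.append({**row, "amount": float(row["amount"])})

# Gold: aggregate into a business-ready summary per country
gold = {}
for row in silver:
    gold[row["country"]] = gold.get(row["country"], 0.0) + row["amount"]

print(gold)  # {'US': 100.0, 'IN': 250.0}
```

Each layer consumes the previous one and the raw Bronze data is never mutated, so the pipeline can always be replayed from the start.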
❓ Difference between Data Lake, Data Warehouse, and Lakehouse
Data Lake, Data Warehouse, and Lakehouse are different data architectures designed for storage, analytics, and AI workloads, each with different capabilities and use cases.
🔷 Data Lake
- Purpose : Store raw structured, semi-structured, and unstructured data.
- Schema : Schema-on-read (schema applied when reading data).
- Cost : Very low-cost storage using cloud object storage.
- Performance : Slower for analytics without optimization layers.
- Use Cases : Raw data ingestion, logs, IoT, media, backups.
🔷 Data Warehouse
- Purpose : Fast analytics and BI reporting.
- Schema : Schema-on-write (data structured before loading).
- Cost : Expensive storage and compute.
- Performance : Highly optimized for SQL queries.
- Use Cases : Business dashboards, structured reporting, OLAP.
🔷 Lakehouse
- Purpose : Unified platform for analytics, streaming, and machine learning.
- Schema : Supports both schema-on-read and schema-on-write.
- Cost : Uses low-cost data lake storage with optimized compute.
- Performance : Warehouse-like performance using Delta/ACID and indexing.
- Use Cases : BI, AI/ML, streaming, and big data analytics on the same data.
🔷 Key Differences Summary
| Feature | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Data Type | All types | Structured | All types |
| Schema | Schema-on-read | Schema-on-write | Both |
| Cost | Low | High | Medium / Optimized |
| Analytics Performance | Low | High | High |
| ML Support | Limited | Limited | Native |
❓ What are Databricks Workspaces?
Databricks Workspace is a collaborative environment where users can develop, manage, and share notebooks, dashboards, jobs, and other assets for data engineering, analytics, and machine learning.
🔷 Purpose of Databricks Workspace
- Provides a central place to organize code, notebooks, and data assets.
- Enables collaboration between data engineers, analysts, and data scientists.
- Acts as the main interface for interacting with the Databricks platform.
🔷 Key Components in a Workspace
- Notebooks : Interactive notebooks supporting Python, SQL, Scala, and R.
- Repos : Git-integrated repositories for version control.
- Jobs & Workflows : Scheduled and orchestrated pipelines.
- Dashboards : Visualizations and BI reports.
- Libraries : Custom or third-party libraries for workloads.
🔷 Collaboration Features
- Multiple users can edit and run notebooks simultaneously.
- Role-based access control for folders and assets.
- Integration with Git for CI/CD workflows.
🚀 Key Benefits
- Centralized development environment.
- Improves team productivity and collaboration.
- Supports end-to-end workflows from data ingestion to ML.
❓ What is a Databricks Cluster?
A Databricks Cluster is a group of virtual machines configured to provide distributed compute resources for running notebooks, jobs, and data processing workloads.
🔷 Purpose of Databricks Clusters
- Provides scalable compute for data engineering, analytics, and machine learning.
- Executes Spark jobs and SQL queries in a distributed manner.
- Handles large-scale batch and streaming data processing.
🔷 Core Components of a Cluster
- Driver Node : Orchestrates the job, schedules tasks, and returns results.
- Worker Nodes : Execute tasks in parallel and process data partitions.
- Executors : Processes running on worker nodes that perform computations.
🔷 Types of Databricks Clusters
- Interactive Cluster : Used for development and exploration in notebooks.
- Job Cluster : Created automatically for scheduled jobs and terminated after execution.
- SQL Warehouse : Specialized compute for SQL analytics.
🚀 Key Features
- Auto-scaling to add or remove worker nodes based on workload.
- Auto-termination to reduce cost when idle.
- Support for different instance types and Spark configurations.
❓ Types of Databricks Clusters
Databricks provides different types of compute depending on the workload and usage pattern, mainly Interactive Clusters, Job Clusters, and SQL Warehouses.
🔷 1️⃣ Interactive Cluster (All-Purpose Cluster)
- Purpose : Development, testing, and exploratory analysis.
- Usage : Used with notebooks by data engineers and data scientists.
- Lifecycle : Manually created and shared by multiple users.
- Cost : Can be expensive if left running.
🔷 2️⃣ Job Cluster
- Purpose : Production batch pipelines and scheduled jobs.
- Usage : Created automatically when a job starts and terminated after completion.
- Lifecycle : Ephemeral (short-lived).
- Cost : More cost-efficient for production workloads.
🔷 3️⃣ SQL Warehouse (Databricks SQL Compute)
- Purpose : BI analytics and SQL queries.
- Usage : Used by analysts and BI tools like Power BI and Tableau.
- Lifecycle : Managed compute optimized for SQL workloads.
- Performance : Optimized using the Photon engine.
🔷 Key Differences Summary
| Feature | Interactive Cluster | Job Cluster | SQL Warehouse |
|---|---|---|---|
| Primary Use | Development & Exploration | Production Jobs | BI & SQL Analytics |
| Lifecycle | Long-running | Ephemeral | Managed / On-demand |
| Cost Efficiency | Low | High | Medium |
| Users | Engineers & Data Scientists | Automation Pipelines | Analysts & BI Tools |
❓ What is Databricks Runtime (DBR)?
Databricks Runtime is an optimized runtime environment provided by Databricks that includes Apache Spark, Delta Lake, and other libraries pre-configured for high performance and reliability.
🔷 Key Components of Databricks Runtime
- Apache Spark : Core distributed processing engine for batch and streaming workloads.
- Delta Lake : Storage layer that adds ACID transactions, schema enforcement, and versioning.
- Optimized Libraries : Pre-installed ML, SQL, Python, and Scala libraries.
- Photon Engine (in some versions) : Vectorized query engine for faster SQL analytics.
🔷 Types of Databricks Runtime
- Standard Runtime : General-purpose analytics and data engineering workloads.
- ML Runtime : Includes MLflow, TensorFlow, PyTorch, and other ML libraries.
- Photon Runtime : High-performance SQL and BI workloads.
🔷 Why Databricks Runtime is Important
- Improves performance compared to open-source Spark distributions.
- Reduces setup and dependency management effort.
- Ensures compatibility with Databricks features like Delta Lake and Unity Catalog.
❓ What is Photon Engine?
Photon is a high-performance query execution engine built by Databricks to accelerate SQL and data analytics workloads in the Lakehouse.
🔷 Key Characteristics of Photon
- Vectorized Execution : Processes data in batches (vectors) instead of row-by-row for faster computation.
- Native C++ Engine : Written in C++ for low-level performance instead of JVM-based Spark execution.
- Columnar Processing : Optimized for columnar data formats like Delta and Parquet.
- Automatic Integration : Works transparently with Spark SQL without changing user code.
🔷 Where Photon is Used
- Databricks SQL Warehouses : BI dashboards and interactive analytics.
- Batch Analytics : Large-scale aggregations and joins.
- Lakehouse Queries : Faster reads from Delta Lake tables.
🔷 Benefits of Photon
- Provides 2x to 10x faster query performance compared to standard Spark execution.
- Reduces compute cost due to faster query execution.
- Improves performance for BI tools like Power BI, Tableau, and Looker.
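The difference between row-at-a-time and vectorized (batched) execution can be illustrated in plain Python — a conceptual sketch of the idea only, since Photon itself is a native C++ engine:

```python
import array

values = array.array("d", range(100_000))  # one numeric column

# Row-by-row: one Python-level operation per value
def sum_rows(vals):
    total = 0.0
    for v in vals:
        total += v
    return total

# Vectorized: operate on whole batches, amortizing per-row overhead
def sum_batches(vals, batch_size=4096):
    total = 0.0
    for i in range(0, len(vals), batch_size):
        total += sum(vals[i:i + batch_size])  # one bulk call per batch
    return total

print(sum_rows(values) == sum_batches(values))  # True
```

Real vectorized engines go further: batches stay in CPU caches and are processed with SIMD instructions, which is where most of the speedup comes from.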
❓ What is Delta Lake?
Delta Lake is an open-source storage layer built on top of data lakes that adds reliability, performance, and governance features to data stored in Parquet files.
🔷 Core Features of Delta Lake
- ACID Transactions : Ensures data consistency and reliability for concurrent reads and writes.
- Schema Enforcement : Prevents bad or unexpected data from being written to tables.
- Schema Evolution : Allows controlled changes to table schema over time.
- Time Travel : Query historical versions of data using table versions or timestamps.
🔷 Performance Optimizations
- Data Skipping : Skips irrelevant files using metadata statistics.
- Z-Ordering : Co-locates related data to improve query performance.
- Compaction (OPTIMIZE) : Merges small files into larger files for efficient reads.
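Data skipping can be sketched in a few lines: the transaction log keeps per-file min/max statistics, and a query reads only files whose range can contain the filter value. File names and stats here are made up for illustration:

```python
# Per-file column statistics, as the Delta log conceptually stores them
files = [
    {"path": "part-0", "min_id": 1,   "max_id": 100},
    {"path": "part-1", "min_id": 101, "max_id": 200},
    {"path": "part-2", "min_id": 201, "max_id": 300},
]

def files_to_scan(file_stats, wanted_id):
    """Prune files whose [min, max] range cannot contain wanted_id."""
    return [f["path"] for f in file_stats
            if f["min_id"] <= wanted_id <= f["max_id"]]

print(files_to_scan(files, 150))  # ['part-1'] -- two of three files skipped
```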
🔷 Why Delta Lake is Important
- Brings data warehouse reliability to data lakes.
- Enables unified batch, streaming, and ML workloads on the same data.
- Foundation of the Databricks Lakehouse architecture.
❓ What is ACID compliance in Delta Lake?
ACID compliance in Delta Lake ensures that data operations are reliable, consistent, and safe even with concurrent reads and writes.
🔷 ACID Properties
- Atomicity : A transaction is all-or-nothing. Either all changes are committed or none are applied.
- Consistency : Data always moves from one valid state to another, following schema and constraints.
- Isolation : Concurrent transactions do not interfere with each other. Readers see a consistent snapshot.
- Durability : Once committed, data changes are permanently stored even after failures.
🔷 How Delta Lake Achieves ACID
- Transaction Log (_delta_log) : Maintains metadata and commit history for all operations.
- Optimistic Concurrency Control : Detects conflicts during concurrent writes.
- Snapshot Isolation : Readers always see a consistent version of the table.
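Optimistic concurrency control can be sketched as a version check at commit time — a toy model of the idea, not the actual Delta commit protocol:

```python
log = []  # stands in for the table's commit log; len(log) is the version

def try_commit(read_version, actions):
    """Commit succeeds only if no other writer committed since we read."""
    if len(log) != read_version:
        return False  # conflict detected: caller must re-read and retry
    log.append(actions)
    return True

v = len(log)                                            # both writers read version 0
print(try_commit(v, {"add": "file-1.parquet"}))         # True  -- first writer wins
print(try_commit(v, {"add": "file-2.parquet"}))         # False -- stale read version
print(try_commit(len(log), {"add": "file-2.parquet"}))  # True after re-reading
```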
❓ What is Delta Transaction Log?
The Delta Transaction Log is a metadata system that tracks all changes made to a Delta table, enabling ACID transactions, versioning, and reliability in data lakes.
🔷 What is _delta_log?
The _delta_log directory is a hidden folder in every Delta table that stores transaction history as JSON and Parquet files.
🔷 Key Functions
- Metadata Management : Stores schema, table properties, and partition information.
- Version Control : Tracks every commit as a new table version (time travel support).
- Transaction History : Logs inserts, updates, deletes, merges, and schema changes.
- Concurrency Control : Enables optimistic locking to handle multiple writers safely.
🔷 File Types in Delta Log
- JSON Files : Each commit is stored as a JSON transaction log entry.
- Checkpoint Parquet Files : Periodic snapshots of table state to speed up reads.
🔷 Why It Matters
- Enables ACID transactions on data lakes
- Supports Time Travel and Rollback
- Improves reliability and auditability
- Makes data lakes behave like databases
❓ What is Time Travel in Delta?
Time Travel in Delta allows you to query, audit, or restore previous versions of a Delta table using the transaction log.
🔷 What It Does
- Historical Queries : Read old data versions.
- Rollback : Restore table to a previous state.
- Auditing : Track how data changed over time.
🔷 How It Works
- Every write creates a new version in _delta_log.
- Delta never overwrites data immediately; it tracks file changes.
- You can query by version number or timestamp.
🔷 Examples
-- Read table at version 5
SELECT * FROM sales VERSION AS OF 5;

-- Read table as of a timestamp
SELECT * FROM sales TIMESTAMP AS OF '2025-02-10 10:00:00';

-- Restore table to an old version
RESTORE TABLE sales TO VERSION AS OF 5;
🔷 Real-World Use Cases
- Accidental Delete : Recover deleted data.
- Bad Pipeline Run : Roll back corrupted data.
- Compliance : Audit historical snapshots of data.
🔷 Why It Is Important
- Enables rollback and disaster recovery
- Supports governance and auditing
- Makes data lakes behave like databases
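Conceptually, VERSION AS OF works by replaying the commit log up to the requested version. A toy reconstruction with made-up file names, not real Delta internals:

```python
# Each commit records which data files it adds and removes
commits = [
    {"add": ["f1"], "remove": []},      # version 0
    {"add": ["f2"], "remove": []},      # version 1
    {"add": ["f3"], "remove": ["f1"]},  # version 2: f1 rewritten into f3
]

def snapshot(version):
    """Active data files of the table as of the given version."""
    files = set()
    for commit in commits[: version + 1]:
        files |= set(commit["add"])
        files -= set(commit["remove"])
    return sorted(files)

print(snapshot(1))  # ['f1', 'f2']
print(snapshot(2))  # ['f2', 'f3']
```

Because removed files stay on storage until VACUUM deletes them, old snapshots remain queryable.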
❓ What is Schema Enforcement and Schema Evolution?
Schema Enforcement ensures incoming data matches the table schema, while Schema Evolution allows Delta tables to automatically adapt to new columns or schema changes.
🔷 1️⃣ Schema Enforcement (Strict Schema)
Schema Enforcement rejects data if the schema does not match the existing table structure.
Example:
CREATE TABLE users (
  id INT,
  name STRING
) USING DELTA;

-- This will FAIL: the table has no third (age) column
INSERT INTO users VALUES (1, 'Aditya', 28);
Use Case: Prevents bad or unexpected data from corrupting production tables.
🔷 2️⃣ Schema Evolution (Automatic Schema Change)
Schema Evolution allows Delta to automatically add new columns or update schema when new data arrives.
Example using mergeSchema = true:
df.write.format("delta") \
  .mode("append") \
  .option("mergeSchema", "true") \
  .save("/mnt/delta/users")
If df contains a new column like age, Delta will automatically add it to the table.
🔷 Key Differences
- Schema Enforcement : Rejects mismatched schema (data quality control).
- Schema Evolution : Automatically updates schema (flexibility for changing data).
🔷 Why Important in Databricks?
- Handles evolving JSON/API/Kafka data sources
- Reduces pipeline failures due to schema drift
- Balances flexibility and governance in Lakehouse
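The two behaviors can be contrasted with plain dicts — a toy model using a made-up `write_rows` helper, not Delta's actual writer:

```python
schema = {"id", "name"}

def write_rows(rows, table_schema, merge_schema=False):
    """Reject unknown columns unless merge_schema (evolution) is enabled."""
    for row in rows:
        extra = set(row) - table_schema
        if extra and not merge_schema:
            raise ValueError(f"schema mismatch: unexpected columns {sorted(extra)}")
        table_schema |= set(row)  # evolution: absorb the new columns
    return table_schema

# Enforcement: the extra 'age' column is rejected
try:
    write_rows([{"id": 1, "name": "Aditya", "age": 28}], set(schema))
except ValueError as err:
    print(err)  # schema mismatch: unexpected columns ['age']

# Evolution (mergeSchema = true): 'age' is added to the schema
evolved = write_rows([{"id": 1, "name": "Aditya", "age": 28}],
                     set(schema), merge_schema=True)
print(sorted(evolved))  # ['age', 'id', 'name']
```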
❓ What is OPTIMIZE in Delta?
OPTIMIZE is a Delta Lake command used to compact many small files into fewer large files to improve query performance.
🔷 Why OPTIMIZE is Needed
In Delta tables, frequent writes (streaming, CDC, incremental loads) create many small files, which:
- Increase query latency
- Increase metadata overhead
- Cause inefficient I/O operations
OPTIMIZE merges these small files into larger files.
🔷 Basic Example
OPTIMIZE sales;
This compacts all small files in the sales Delta table.
🔷 Partition-Specific Optimize
OPTIMIZE sales WHERE date = '2025-02-10';
Only optimizes files in a specific partition.
🔷 OPTIMIZE with Z-Ordering (Advanced)
Z-Ordering colocates related data to speed up selective queries.
OPTIMIZE sales ZORDER BY (customer_id, country);
This improves query performance for filters on customer_id and country.
🔷 How OPTIMIZE Works Internally
- Merges small Parquet files into large files (typically 128 MB–1 GB)
- Updates the Delta Transaction Log with new file references
- Old files remain until VACUUM cleans them
🔷 Real-World Use Cases
- Streaming Pipelines : Micro-batches create many tiny files
- CDC Loads : Frequent MERGE operations fragment files
- BI Queries : Faster scans for dashboards and analytics
🔷 OPTIMIZE vs VACUUM
- OPTIMIZE : Compacts small files for performance
- VACUUM : Deletes old unused files to save storage
🔷 Why OPTIMIZE is Important in Databricks?
- Improves query speed significantly
- Reduces metadata overhead
- Essential for large-scale Delta Lake production systems
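The compaction step itself is essentially bin packing: many small files merged into files near a target size. A toy sketch with file sizes in MB (the 128 MB target is illustrative; real targets are configurable):

```python
TARGET_MB = 128

def compact(file_sizes, target=TARGET_MB):
    """Greedily pack small files into merged files of roughly `target` MB."""
    merged, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target:
            merged.append(current)  # close the current output file
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        merged.append(current)
    return merged

small_files = [10, 20, 60, 50, 40, 70, 30]
merged = compact(small_files)
print(f"{len(small_files)} files -> {len(merged)} files")  # 7 files -> 3 files
```

No data is lost: every input appears in exactly one output file, mirroring how OPTIMIZE rewrites files without changing table contents.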
❓ What is Z-Ordering in Delta?
Z-Ordering is a data layout optimization technique in Delta Lake that co-locates related data in the same set of files to improve query performance for selective filters.
🔷 Why Z-Ordering is Needed
In large Delta tables, data is spread across many files. When you filter on a column, Spark may scan many unnecessary files.
Z-Ordering physically reorganizes data so that similar values are stored together, reducing file scans.
🔷 Basic Example
OPTIMIZE sales ZORDER BY (customer_id);
This improves queries like:
SELECT * FROM sales WHERE customer_id = 123;
🔷 Z-Ordering on Multiple Columns
OPTIMIZE sales ZORDER BY (customer_id, country);
This speeds up filters on customer_id and country.
🔷 How Z-Ordering Works Internally
- Uses a space-filling curve (Z-curve) to map multi-dimensional data to 1D.
- Groups similar column values together in the same Parquet files.
- Improves data skipping and file pruning during queries.
🔷 Z-Ordering vs Partitioning
- Partitioning : Splits data into directories (good for low-cardinality columns like date).
- Z-Ordering : Optimizes file layout inside partitions (good for high-cardinality columns like user_id).
🔷 Real-World Use Cases
- Filtering on customer_id, product_id, order_id
- BI dashboards with selective queries
- Data science feature lookups
🔷 Best Practices
- Run Z-Order with OPTIMIZE (not alone)
- Use on columns frequently used in filters and joins
- Avoid too many Z-Order columns (2–4 max)
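The Z-curve itself is simple to illustrate: interleaving the bits of two column values produces a single sort key under which rows close in both dimensions land near each other (toy values, not a real layout algorithm):

```python
def z_value(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Z-curve key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return z

rows = [(3, 7), (3, 6), (200, 1), (2, 7), (201, 0)]
rows.sort(key=lambda r: z_value(*r))
print(rows)  # small (x, y) pairs cluster at the front, large ones at the back
```

Writing files in this order means a filter on either column touches only a few contiguous files, which is exactly what improves data skipping.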
❓ What is VACUUM in Delta?
VACUUM is a Delta Lake command used to permanently delete old and unused data files that are no longer referenced by the Delta transaction log.
🔷 Why VACUUM is Needed
Delta tables keep old files for features like Time Travel and ACID transactions. Over time, these unused files:
- Consume large storage space
- Increase cloud storage costs
- Slow down metadata operations
VACUUM cleans up these obsolete files.
🔷 Basic Example
VACUUM sales;
This removes files older than the default retention period (7 days).
🔷 Custom Retention Period Example
-- Delete files older than 24 hours
VACUUM sales RETAIN 24 HOURS;
🔷 How VACUUM Works Internally
- Checks the Delta Transaction Log to find active files
- Identifies orphan and obsolete files not referenced by any version
- Permanently deletes these files from storage
🔷 VACUUM vs OPTIMIZE
- OPTIMIZE : Compacts small files into larger files (performance).
- VACUUM : Deletes old unused files (storage cleanup).
🔷 Impact on Time Travel
After VACUUM, you cannot time travel to versions whose files were older than the retention period and have been deleted.
🔷 Real-World Use Cases
- Long-running streaming pipelines creating many obsolete files
- CDC pipelines with frequent MERGE/UPDATE/DELETE operations
- Reducing cloud storage costs in production
🔷 Best Practices
- Do not set a very low retention in production
- Keep the default 7 days for safety
- Run VACUUM periodically (weekly/monthly)
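The retention logic can be sketched as: delete a file only if it is no longer referenced by the current version and is older than the retention window (toy metadata and a fixed clock, for illustration):

```python
from datetime import datetime, timedelta

now = datetime(2025, 2, 17)
retention = timedelta(days=7)  # mirrors the 7-day default

files = [
    {"path": "f1", "referenced": True,  "modified": now - timedelta(days=30)},
    {"path": "f2", "referenced": False, "modified": now - timedelta(days=30)},
    {"path": "f3", "referenced": False, "modified": now - timedelta(days=1)},
]

def vacuum(file_list):
    """Keep files that are still referenced or inside the retention window."""
    kept, deleted = [], []
    for f in file_list:
        if not f["referenced"] and now - f["modified"] > retention:
            deleted.append(f["path"])
        else:
            kept.append(f["path"])
    return kept, deleted

print(vacuum(files))  # (['f1', 'f3'], ['f2'])
```

Note that `f3` survives even though the current version no longer references it: it is still inside the retention window, so recent historical versions that do reference it remain queryable.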
❓ What is Liquid Clustering?
Liquid Clustering is an automatic data layout optimization technique in Delta Lake that dynamically clusters data based on query patterns without requiring manual partitioning or Z-Ordering.
🔷 Why Liquid Clustering is Needed
Traditional optimization requires manual tuning:
- Partitioning needs predefined columns
- Z-Ordering must be manually executed
- Data distribution changes over time (query patterns evolve)
Liquid Clustering automatically adapts to changing workloads.
🔷 Basic Example
CREATE TABLE sales (
  customer_id INT,
  country STRING,
  amount DOUBLE
) USING DELTA
CLUSTER BY (customer_id, country);
Databricks will automatically reorganize data based on these columns.
🔷 How Liquid Clustering Works Internally
- Continuously monitors query access patterns
- Rewrites data files in the background for optimal layout
- Uses adaptive clustering instead of fixed partitions
- Improves data skipping and file pruning automatically
🔷 Liquid Clustering vs Partitioning vs Z-Ordering
- Partitioning : Static directory-based organization (manual)
- Z-Ordering : Manual file-level co-location optimization
- Liquid Clustering : Automatic, adaptive clustering without manual OPTIMIZE
🔷 Real-World Use Cases
- High-cardinality columns (user_id, transaction_id)
- Dynamic query workloads (BI + ML + ad-hoc analytics)
- Large-scale streaming and batch Delta tables
🔷 Benefits
- Eliminates manual partition tuning
- Automatically adapts to changing workloads
- Improves query performance and data skipping
- Reduces operational overhead for data engineers
❓ What is Delta Change Data Feed (CDF)?
Delta Change Data Feed (CDF) is a Delta Lake feature that captures and exposes row-level changes (inserts, updates, deletes) in a Delta table so downstream systems can consume only the changed data.
🔷 Why CDF is Needed
Traditionally, downstream pipelines reprocess entire tables to detect changes, which is expensive. CDF enables incremental processing by reading only changed rows.
- Reduces compute and cost
- Enables real-time replication
- Simplifies CDC pipelines
🔷 How to Enable CDF
ALTER TABLE sales SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
🔷 How to Read Change Data
-- Read changes between versions
SELECT * FROM table_changes('sales', 10, 20);

-- Read changes between two timestamps
SELECT * FROM table_changes('sales', '2025-02-10', '2025-02-15');
🔷 What Data CDF Captures
- INSERT : New rows added
- UPDATE_PREIMAGE : Old version of updated rows
- UPDATE_POSTIMAGE : New version of updated rows
- DELETE : Deleted rows
🔷 Example Output Columns
- _change_type → Type of change
- _commit_version → Delta table version
- _commit_timestamp → When the change happened
🔷 Real-World Use Cases
- Streaming CDC pipelines to Kafka or downstream Delta tables
- Syncing Delta tables with data warehouses (Snowflake, BigQuery)
- Auditing and debugging data changes
- Building incremental ML feature pipelines
🔷 Benefits
- Eliminates full table scans for CDC
- Near real-time incremental data consumption
- Native Delta Lake feature (no external CDC tool needed)
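What CDF exposes is essentially a row-level diff between table versions. A toy diff keyed by id, with change types named after the real CDF output values (the tables themselves are made up):

```python
v10 = {1: {"amount": 100}, 2: {"amount": 250}}  # table at version 10
v11 = {1: {"amount": 120}, 3: {"amount": 300}}  # version 11

def table_changes(old, new, version):
    """Emit (key, _change_type, _commit_version) rows for the diff."""
    changes = []
    for key in old.keys() | new.keys():
        if key not in old:
            changes.append((key, "insert", version))
        elif key not in new:
            changes.append((key, "delete", version))
        elif old[key] != new[key]:
            changes.append((key, "update_preimage", version))   # old row image
            changes.append((key, "update_postimage", version))  # new row image
    return sorted(changes)

for change in table_changes(v10, v11, 11):
    print(change)
```

Downstream consumers apply just these rows instead of rescanning the whole table.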
❓ What is Auto Loader in Databricks?
Auto Loader is a Databricks feature for incrementally and efficiently ingesting new files from cloud storage into Delta tables using Structured Streaming.
🔷 Why Auto Loader is Needed
Traditional file ingestion (batch jobs) scans the entire directory repeatedly, which is slow and expensive. Auto Loader detects only new files and processes them incrementally.
- Scales to millions of files
- Low-latency ingestion
- Fault-tolerant and exactly-once processing
🔷 Basic Example (Databricks PySpark)
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .load("/mnt/raw/events"))

(df.writeStream.format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .table("bronze_events"))
This continuously ingests new JSON files into a Delta Bronze table.
🔷 How Auto Loader Works
- Monitors cloud storage (S3, ADLS, GCS)
- Detects new files using notifications or directory listing
- Processes files exactly once using checkpoints
- Stores metadata in a schema location
🔷 File Discovery Modes
- Directory Listing : Periodically lists files (simple but slower)
- File Notification : Uses cloud events (SQS, Event Grid) for scalable ingestion
🔷 Schema Handling
- Schema Inference : Automatically detects schema
- Schema Evolution : Supports mergeSchema for new columns
- Schema Enforcement : Rejects bad schema if configured
🔷 Real-World Use Cases
- Ingesting IoT logs and event data
- Landing raw data into Bronze layer (Medallion)
- Streaming data lakes and CDC pipelines
- Replacing traditional batch ingestion jobs
🔷 Benefits
- Scales to billions of files
- Cost-efficient incremental ingestion
- Built-in fault tolerance and exactly-once guarantees
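Auto Loader's core trick — remember what was already processed so each listing ingests only new files — can be sketched with a plain set standing in for the checkpoint state (illustrative file names, not Auto Loader's real storage format):

```python
checkpoint = set()  # stands in for Auto Loader's persisted checkpoint state

def ingest_new_files(listing):
    """Process only files not seen before and record them in the checkpoint."""
    new_files = [f for f in listing if f not in checkpoint]
    checkpoint.update(new_files)
    return new_files

print(ingest_new_files(["a.json", "b.json"]))            # ['a.json', 'b.json']
print(ingest_new_files(["a.json", "b.json", "c.json"]))  # ['c.json']
```

Because the real checkpoint survives restarts, each file is processed exactly once even after failures.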
❓ What is Managed Table vs External Table in Databricks?
Managed and External tables differ in who controls the underlying data files and what happens when the table is dropped.
🔷 1️⃣ Managed Table (Internal Table)
In a Managed Table, Databricks manages both the metadata and the underlying data files.
Example:
CREATE TABLE sales_managed (
  id INT,
  amount DOUBLE
) USING DELTA;
Data is stored in the Databricks managed storage location.
Behavior:
- DROP TABLE deletes metadata + data files
- Databricks controls storage path
- Best for curated internal datasets
🔷 2️⃣ External Table (Unmanaged Table)
In an External Table, Databricks manages only metadata, while data remains in a user-defined storage location.
Example:
CREATE TABLE sales_external (
  id INT,
  amount DOUBLE
) USING DELTA
LOCATION '/mnt/data/sales_external';
Data is stored in the specified cloud storage path.
Behavior:
- DROP TABLE deletes only metadata (data files remain)
- User controls storage and lifecycle
- Can be shared across multiple tools (Databricks, Athena, Synapse)
🔷 Key Differences
- Storage Control : Managed → Databricks, External → User-defined path
- Drop Behavior : Managed → Data deleted, External → Data preserved
- Use Case : Managed → Curated analytics, External → Data lake sharing
🔷 Real-World Use Cases
- Managed Table : Gold layer curated business tables
- External Table : Bronze/Silver lake data shared across platforms
β What is Delta Live Tables (DLT)?
Delta Live Tables (DLT) is a Databricks framework for building reliable, declarative, and automated data pipelines that create and manage Delta tables with built-in data quality, orchestration, and monitoring.
π· Why DLT is Needed
Traditional ETL pipelines require manual orchestration, data quality checks, and monitoring. DLT automates these tasks and ensures reliable production pipelines.
- Automatic pipeline orchestration
- Built-in data quality checks
- Automatic dependency management
π· Basic Example (DLT Table Definition)
import dlt
@dlt.table
def bronze_users():
return spark.read.format("json").load("/mnt/raw/users")
This defines a Bronze Delta table managed by DLT.
π· Example with Data Quality Rules
@dlt.table
@dlt.expect("valid_id", "id IS NOT NULL")
@dlt.expect_or_drop("valid_age", "age >= 0")
def silver_users():
    return dlt.read("bronze_users")
DLT automatically enforces data quality rules.
π· How DLT Works
- Uses declarative SQL/Python to define tables
- Automatically builds DAG of dependencies
- Manages compute clusters and retries
- Tracks data lineage and pipeline health
π· DLT Pipeline Layers (Medallion)
- Bronze : Raw ingestion tables
- Silver : Cleaned and transformed tables
- Gold : Business-level aggregated tables
π· Real-World Use Cases
- Streaming and batch ETL pipelines
- Building Medallion architecture pipelines
- Data quality and governance pipelines
- Automated production data engineering workflows
π· Benefits
- Declarative pipelines (less orchestration code)
- Built-in data quality enforcement
- Automatic lineage, monitoring, and retries
- Reduces operational complexity
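The automatic dependency management described above boils down to building a DAG from each table's declared inputs and sorting it topologically. A minimal sketch with Python's standard library (table names are the illustrative ones from the examples above, not a DLT API):

```python
# Sketch of how a declarative framework like DLT can infer execution
# order: each table declares which tables it reads (as dlt.read would),
# and the engine topologically sorts the resulting dependency graph.
from graphlib import TopologicalSorter

# table -> set of upstream tables it reads
deps = {
    "bronze_users": set(),
    "silver_users": {"bronze_users"},
    "gold_user_counts": {"silver_users"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # upstream tables always come first
```

Because ordering is derived from the declarations, adding a new table with its inputs is enough; no manual orchestration code changes are needed.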
β What is Unity Catalog?
Unity Catalog is Databricks' centralized governance and metadata management system that provides unified access control, auditing, and data discovery across all data assets in a Lakehouse.
π· Why Unity Catalog is Needed
In traditional setups, each workspace had its own metastore and security model, making governance difficult. Unity Catalog provides a single governance layer across workspaces.
- Centralized security and permissions
- Unified metadata across workspaces
- Data lineage and auditing
π· Core Components of Unity Catalog
- Metastore : Central metadata repository for tables, views, and files
- Catalog : Top-level container for schemas and tables
- Schema (Database) : Logical grouping of tables and views
- Table / View : Actual data objects
- Volumes : Managed file storage for non-tabular data
π· Example Structure
metastore
βββ main_catalog
βββ sales_schema
βββ orders_table
π· Access Control Example
GRANT SELECT ON TABLE main_catalog.sales_schema.orders_table TO `data_analyst`;
This controls who can access data centrally.
π· Key Features
- Fine-grained permissions (catalog, schema, table, column, row)
- Cross-workspace data sharing
- Built-in audit logs and lineage
- Integration with Delta Sharing and BI tools
π· Unity Catalog vs Hive Metastore
- Hive Metastore : Workspace-level, limited governance
- Unity Catalog : Centralized, enterprise-grade governance
π· Real-World Use Cases
- Enterprise data governance and compliance
- Sharing data across teams and workspaces
- Managing multi-cloud lakehouse environments
- Auditing and security for sensitive data
π· Benefits
- Single source of truth for metadata
- Centralized security and governance
- Improved data discovery and collaboration
- Enterprise compliance (GDPR, HIPAA)
Catalogs, Schemas, Tables, and Volumes in Unity Catalog
1. Catalog
A catalog is the highest-level namespace in Unity Catalog. It is used to logically separate data across teams, projects, or environments and acts as a governance boundary for access control, auditing, and lineage tracking.
Example Use Case: Separate environments like development, staging, and production.
dev_catalog
prod_catalog
2. Schema (Database)
A schema is a logical container inside a catalog that groups related tables, views, and functions. It is similar to a database in traditional systems and is commonly used to organize data by domain or business unit.
Example Use Case: Organizing data by business domain.
prod_catalog.sales
prod_catalog.finance
3. Table
A table is a structured dataset stored in Delta format that contains rows and columns. Tables in Unity Catalog support ACID transactions, time travel, schema enforcement, and fine-grained access control.
Example:
prod_catalog.sales.transactions
4. Volume
A volume is a managed storage object used to store unstructured or semi-structured files such as CSV, JSON, Parquet, images, or logs. Volumes are governed by Unity Catalog, allowing secure file access without exposing cloud storage credentials.
Example:
prod_catalog.raw_data.file_volume
Hierarchy Structure
catalog.schema.table
catalog.schema.volume
This hierarchy enables centralized governance, data organization, and consistent security policies across all Databricks workloads.
Unity Catalog Table vs Hive Metastore Table
1. Unity Catalog Table
A Unity Catalog (UC) table is a table registered in Unity Catalog, which provides centralized governance, fine-grained access control, auditing, and data lineage across all Databricks workspaces and cloud storage.
Key Points:
- Centralized metastore shared across multiple workspaces
- Supports catalogs, schemas, and tables hierarchy
- Fine-grained permissions (catalog, schema, table, column, row-level)
- Built-in lineage and audit logging
Example:
prod_catalog.sales.transactions
2. Hive Metastore Table
A Hive Metastore (HMS) table is a legacy table registered in the workspace-level metastore. Governance is limited and mostly controlled at the workspace level.
Key Points:
- Metastore is scoped to a single workspace
- No catalogs (only database.schema.table structure)
- Limited fine-grained access control
- No centralized lineage across workspaces
Example:
sales.transactions
Key Differences
| Feature | Unity Catalog Table | Hive Metastore Table |
|---|---|---|
| Scope | Global across workspaces | Workspace-specific |
| Hierarchy | catalog.schema.table | schema.table |
| Governance | Centralized, fine-grained | Basic workspace-level |
| Lineage & Auditing | Built-in | Limited |
What is Data Lineage in Unity Catalog?
Data lineage in Unity Catalog is the ability to track how data flows across tables, views, notebooks, jobs, and dashboards: showing where data comes from, how it is transformed, and where it is used.
Key Capabilities
- Source Tracking: Shows upstream source tables and files used to create a dataset.
- Transformation Tracking: Captures SQL queries, notebooks, and pipelines that modify data.
- Downstream Impact: Identifies dashboards, ML models, or reports that depend on a table.
- Column-Level Lineage: Tracks how individual columns are derived and transformed.
Example
If a Delta table sales_summary is created from raw_sales and used in a BI dashboard:
raw_sales --> cleaned_sales --> sales_summary --> Power BI Dashboard
Why It Is Important
- Helps in impact analysis before changing tables
- Improves data governance and compliance
- Debugging data quality issues
- Supports auditing and regulatory requirements
Unity Catalog automatically captures lineage without manual instrumentation across SQL, notebooks, and Delta Live Tables.
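The impact analysis mentioned above is a graph walk over lineage edges. A small sketch (plain Python; the edge list mirrors the raw_sales example, and the names are illustrative):

```python
# Sketch of impact analysis over a lineage graph: walk downstream
# edges from a table to find every asset a change could affect.
from collections import defaultdict, deque

edges = [
    ("raw_sales", "cleaned_sales"),
    ("cleaned_sales", "sales_summary"),
    ("sales_summary", "powerbi_dashboard"),
]

downstream = defaultdict(set)
for src, dst in edges:
    downstream[src].add(dst)

def impacted(start):
    """Return every asset downstream of `start` (BFS over lineage edges)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in downstream[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(impacted("raw_sales")))  # everything fed by raw_sales
```

Running the same walk from `sales_summary` would surface only the dashboard, which is how a lineage tool answers "what breaks if I change this table?".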
How Does Unity Catalog Handle Permissions?
Unity Catalog uses a centralized, fine-grained access control model to manage permissions across all data assets like catalogs, schemas, tables, views, and volumes. It follows a GRANT/REVOKE privilege-based security model similar to enterprise databases.
Permission Levels in Unity Catalog
- Catalog Level: Controls access to an entire data domain (e.g., finance, marketing).
- Schema Level: Controls access to a database-like namespace inside a catalog.
- Table / View Level: Controls who can read or modify specific tables.
- Column Level: Restricts access to sensitive columns (PII, salary, SSN).
- Row-Level Security: Filters rows based on user roles (e.g., region-based access).
Common Privileges
- SELECT: Read data
- MODIFY: Insert, update, or delete records
- CREATE TABLE / CREATE SCHEMA: Create tables, views, schemas
- USE CATALOG / USE SCHEMA: Required to access objects inside a catalog or schema
Example (Grant Permission)
GRANT USE CATALOG ON CATALOG main TO `data_analyst`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analyst`;
GRANT SELECT ON TABLE main.sales.transactions TO `data_analyst`;
Column-Level Security Example
Unity Catalog restricts columns with dynamic views or column masks rather than column lists in GRANT statements. A masking function can hide a sensitive column from everyone outside a group:
CREATE FUNCTION main.sales.amount_mask(amount DOUBLE)
RETURN CASE WHEN is_account_group_member('finance_team') THEN amount ELSE NULL END;
ALTER TABLE main.sales.transactions ALTER COLUMN order_amount SET MASK main.sales.amount_mask;
Why Unity Catalog Permissions Are Important
- Centralized governance across all Databricks workspaces
- Enterprise-grade security for sensitive data
- Audit logs for compliance (HIPAA, GDPR, SOC2)
- Eliminates manual permission management per workspace
Unity Catalog integrates with cloud IAM (Azure AD, AWS IAM, GCP IAM) to enforce identity-based access control.
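The layered privilege model described above means a read succeeds only if the user holds privileges at every level of the hierarchy. A toy check (plain Python, not the Unity Catalog API; user and object names are made up):

```python
# Toy model of layered access control: reading a table requires
# USE CATALOG on the catalog, USE SCHEMA on the schema, and SELECT
# on the table itself.
grants = {
    ("data_analyst", "USE CATALOG", "main"),
    ("data_analyst", "USE SCHEMA", "main.sales"),
    ("data_analyst", "SELECT", "main.sales.transactions"),
}

def can_select(user, full_name):
    catalog, schema, _table = full_name.split(".")
    return (
        (user, "USE CATALOG", catalog) in grants
        and (user, "USE SCHEMA", f"{catalog}.{schema}") in grants
        and (user, "SELECT", full_name) in grants
    )

print(can_select("data_analyst", "main.sales.transactions"))  # True
print(can_select("intern", "main.sales.transactions"))        # False
```

Dropping any one of the three grants makes the check fail, which is why a bare table-level SELECT grant is not sufficient on its own.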
What is Row-Level Security (RLS) and Column-Level Security (CLS) in Databricks?
Row-Level Security (RLS) and Column-Level Security (CLS) are data governance features in Databricks (via Unity Catalog) that restrict data access at a granular level based on user roles or attributes.
1οΈβ£ Row-Level Security (RLS)
Row-Level Security restricts which rows a user can see in a table. Different users can see different subsets of the same table.
Example Use Case
A sales table contains data for all regions, but each regional manager should only see their region's data.
Databricks Example (Dynamic View for RLS)
CREATE VIEW main.sales.secure_sales AS
SELECT * FROM main.sales.transactions
WHERE CASE
  WHEN is_account_group_member('india_managers') THEN region = 'India'
  WHEN is_account_group_member('us_managers') THEN region = 'US'
  ELSE FALSE
END;
Now:
- Members of india_managers see only India rows
- Members of us_managers see only US rows
Key Points
- Filters rows dynamically at query time
- Used for multi-tenant data access
- Common in finance, healthcare, and SaaS platforms
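What such a dynamic view does at query time can be sketched as a filter driven by the querying user's group (plain Python; the user-to-group and group-to-region mappings are made up for illustration):

```python
# Sketch of row-level security: rows are filtered at query time based
# on which group the querying user belongs to.
user_groups = {
    "india_manager": "india_managers",
    "us_manager": "us_managers",
}
region_for_group = {"india_managers": "India", "us_managers": "US"}

rows = [
    {"region": "India", "amount": 100},
    {"region": "US", "amount": 250},
]

def secure_sales(user):
    """Return only the rows the user's group is allowed to see."""
    region = region_for_group.get(user_groups.get(user))
    return [r for r in rows if r["region"] == region]

print(secure_sales("india_manager"))  # only the India row
```

A user with no group mapping gets an empty result, mirroring the `ELSE FALSE` branch of a secure view.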
2οΈβ£ Column-Level Security (CLS)
Column-Level Security restricts access to specific columns in a table. Sensitive columns like salary, SSN, or credit card numbers can be hidden.
Example Use Case
HR team can see employee salary, but analysts cannot.
Databricks Example (Column Mask)
Unity Catalog does not support granting SELECT on individual columns; instead, attach a column mask so only HR sees real salary values:
CREATE FUNCTION main.hr.salary_mask(salary DOUBLE)
RETURN CASE WHEN is_account_group_member('hr_team') THEN salary ELSE NULL END;
ALTER TABLE main.hr.employees ALTER COLUMN salary SET MASK main.hr.salary_mask;
HR team gets full access:
GRANT SELECT ON TABLE main.hr.employees TO `hr_team`;
Key Points
- Protects sensitive PII and confidential data
- Enforced at Unity Catalog metadata layer
- Supports compliance like GDPR and HIPAA
πΉ RLS vs CLS (Quick Difference)
| Feature | Row-Level Security (RLS) | Column-Level Security (CLS) |
|---|---|---|
| Controls | Which rows are visible | Which columns are visible |
| Use Case | Region-based or tenant-based filtering | Hide sensitive fields like salary or SSN |
| Implementation | Dynamic views / row filter policies | Column masks / dynamic views |
In Databricks, RLS and CLS are key features of Unity Catalog for enterprise-grade data governance and security.
What are Databricks Jobs & Workflows?
Databricks Jobs are used to run automated data pipelines, ETL tasks, ML training, or batch workloads on a schedule or event trigger. Workflows orchestrate multiple tasks in a dependency-based DAG (Directed Acyclic Graph).
1οΈβ£ Databricks Jobs
A Job is a scheduled or triggered execution of code (Notebook, Python script, SQL, or JAR) on a Databricks cluster.
Key Features
- Run notebooks, Python files, SQL queries, or JARs
- Schedule jobs using cron (hourly, daily, weekly)
- Trigger jobs via API, UI, or events
- Auto-retry on failure and alerting
Example Use Case
Run a daily ETL pipeline at 2 AM to load data from raw to curated Delta tables.
2οΈβ£ Databricks Workflows
Workflows are a collection of tasks connected with dependencies. They allow you to build end-to-end data pipelines with multiple steps.
Workflow Task Types
- Notebook Task
- Python Script Task
- SQL Task
- Delta Live Tables Pipeline
- Job Trigger Task (job chaining)
Example Workflow DAG
Task 1: Ingest Raw Data
  ↓
Task 2: Transform (Silver)
  ↓
Task 3: Aggregations (Gold)
  ↓
Task 4: ML Model Training
Key Differences (Jobs vs Workflows)
| Feature | Jobs | Workflows |
|---|---|---|
| Scope | Single task execution | Multi-task orchestration |
| Dependencies | Not applicable (single task) | Supported (DAG-based) |
| Use Case | Simple scheduled ETL or script | End-to-end data pipelines |
Why Jobs & Workflows are Important
- Automates data pipelines without external schedulers (Airflow, Jenkins)
- Built-in monitoring, retries, and alerts
- Supports CI/CD and production pipelines
- Serverless execution with job clusters
In interviews: "Databricks Jobs run scheduled workloads, and Workflows orchestrate multiple dependent tasks in a DAG for end-to-end data pipelines."
How Do You Schedule Notebooks in Databricks?
In Databricks, notebooks are scheduled using Databricks Jobs. A Job allows you to run notebooks automatically based on a schedule, event trigger, or API call, making it suitable for production ETL pipelines and batch workloads.
Ways to Schedule a Notebook
- Time-Based Schedule: Run notebooks daily, hourly, weekly, or using a cron expression.
- Event/API Trigger: Trigger jobs using Databricks REST API, CI/CD pipelines, or external orchestrators (Airflow, ADF).
- Manual Trigger: Run jobs manually from UI or CLI for testing.
Scheduling via Databricks UI
- Create a Job
- Add a Task β Select Notebook
- Attach a Cluster or Job Cluster
- Configure Schedule (Cron or Interval)
Example (Daily ETL Job)
Job Name: daily_sales_etl
Task: notebooks/sales_pipeline
Schedule: Cron -> 0 2 * * *
Cluster: Job Cluster (Auto-terminate)
This runs the notebook every day at 2 AM to ingest and transform sales data.
Key Interview Points
- Scheduled via Jobs with cron or interval triggers
- Supports retries, alerts, and timeouts
- Can use dedicated Job Clusters for cost optimization
- Commonly used for production batch pipelines
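What a daily `0 2 * * *` trigger resolves to can be shown with a small next-run calculation (plain Python; this is a sketch of the scheduling logic, not a full cron parser or a Databricks API):

```python
# Sketch of a daily-at-02:00 schedule: the next run is today at 02:00,
# or tomorrow at 02:00 if that time has already passed.
from datetime import datetime, timedelta

def next_daily_run(now, hour=2, minute=0):
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)   # 02:00 already passed today
    return candidate

print(next_daily_run(datetime(2026, 2, 1, 14, 30)))  # 2026-02-02 02:00:00
print(next_daily_run(datetime(2026, 2, 1, 1, 0)))    # 2026-02-01 02:00:00
```

The scheduler's job is exactly this computation repeated after each run, plus handing the task to a (job) cluster at the computed time.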
What is a Multi-Task Job in Databricks?
A multi-task job in Databricks is a workflow that contains multiple tasks with defined dependencies, executed as a DAG (Directed Acyclic Graph). Each task can run a notebook, Python script, SQL query, JAR, or Delta Live Tables pipeline.
Key Characteristics
- Task Dependencies: Tasks can depend on previous tasks (sequential or parallel execution).
- Different Compute per Task: Each task can run on its own cluster or shared cluster.
- Retries & Timeouts: Retry and timeout policies can be configured per task.
- Parallel Execution: Independent tasks can run simultaneously.
Example Multi-Task Workflow
Task 1: Ingest Raw Data (Bronze)
  ↓
Task 2: Transform Data (Silver)
  ↓
Task 3: Aggregations (Gold)
  ↓
Task 4: ML Model Training
Real Production Use Case
- Daily data ingestion from S3/ADLS
- Data cleaning and transformations
- Business aggregations for dashboards
- Trigger ML pipeline after data is ready
Key Interview Points
- Multi-task jobs orchestrate end-to-end pipelines inside Databricks
- They replace external orchestrators for many use cases (Airflow, ADF)
- They provide built-in monitoring, retries, and dependency management
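The sequential-plus-parallel execution described above falls out of a topological sort over the task DAG. A sketch with the standard library (task names are illustrative; an extra `train_model` task is added alongside `aggregate` to show parallelism):

```python
# Sketch of DAG-based task orchestration: tasks whose dependencies are
# satisfied become "ready" at the same time and could run in parallel.
from graphlib import TopologicalSorter

# task -> upstream tasks it depends on
dag = {
    "ingest": set(),
    "transform": {"ingest"},
    "aggregate": {"transform"},
    "train_model": {"transform"},   # independent of "aggregate"
}

ts = TopologicalSorter(dag)
ts.prepare()
stages = []
while ts.is_active():
    ready = sorted(ts.get_ready())   # everything runnable right now
    stages.append(ready)
    ts.done(*ready)

print(stages)  # [['ingest'], ['transform'], ['aggregate', 'train_model']]
```

Each inner list is a stage whose tasks have no unmet dependencies; a real orchestrator would dispatch them to clusters concurrently.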
How Do You Handle Retries and Failures in Databricks Jobs?
Databricks Jobs provide built-in mechanisms to automatically handle task failures using retries, timeouts, alerts, and task dependencies. These features help ensure reliability in production pipelines.
Retry Configuration
Each task in a job can be configured with a retry policy so that transient failures (network issues, cluster crashes, temporary data issues) are automatically retried.
Retries: 3
Retry Interval: 5 minutes
This means if a task fails, Databricks will retry it up to 3 times with a 5-minute gap.
Timeout Handling
You can define a timeout to stop tasks that run too long and prevent runaway jobs.
Task Timeout: 2 hours
If the task exceeds 2 hours, it will be terminated automatically.
Failure Notifications
Databricks can send alerts when jobs fail using:
- Email notifications
- Webhooks (Slack, Teams, PagerDuty)
- Integration with monitoring tools
Task Dependency Failure Handling
In multi-task jobs, downstream tasks are automatically skipped if an upstream task fails.
Task 1 (Ingest) → Task 2 (Transform) → Task 3 (Aggregate)
If Task 1 fails, Task 2 and Task 3 will not run.
Cluster Failure Handling
- Jobs can use Job Clusters that auto-restart on failure
- Databricks automatically reschedules tasks on healthy nodes
Example Production Scenario
A daily ETL job is configured with:
- 3 retries
- 10-minute retry interval
- 1-hour timeout
- Email alert to data engineering team
If the cluster crashes or a temporary data source issue occurs, the pipeline retries automatically without manual intervention.
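The retry policy above can be sketched as a loop that re-invokes the task until it succeeds or the retry budget is exhausted (plain Python sketch; the interval is only recorded, not slept, and the flaky task is a stand-in for a transient source outage):

```python
# Sketch of a task retry policy: up to `max_retries` retries after the
# first attempt; transient failures are absorbed, persistent ones raise.
def run_with_retries(task, max_retries=3, interval_minutes=10):
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts   # success: result + attempts used
        except Exception:
            if attempts > max_retries:
                raise                 # budget exhausted: surface failure

# A flaky task that fails twice, then succeeds (simulated outage).
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient source outage")
    return "ok"

print(run_with_retries(flaky))  # ('ok', 3)
```

With 3 retries configured, two transient failures still end in success without any manual intervention, which is the reliability property the job settings buy you.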
What is Databricks SQL Warehouse?
A Databricks SQL Warehouse is a compute resource optimized for interactive SQL analytics and BI workloads. It is designed for low-latency, high-concurrency SQL queries and is commonly used by analysts and BI tools.
Key Characteristics
- Optimized for SQL queries and dashboards
- Supports high concurrency for many users
- Uses Photon engine for fast query execution
- Can be Serverless (no cluster management required)
- Integrates with BI tools like Power BI, Tableau, Looker
Example Use Case
Business analysts run ad-hoc queries and dashboards on curated Delta tables using SQL Warehouse without interfering with ETL workloads.
Difference Between Databricks SQL Warehouse and Cluster
| Feature | SQL Warehouse | Databricks Cluster |
|---|---|---|
| Primary Use Case | BI analytics, dashboards, ad-hoc SQL queries | ETL, streaming, ML, notebooks, batch processing |
| Latency | Low-latency, interactive queries | Higher latency, optimized for compute-heavy tasks |
| Concurrency | High (many users querying simultaneously) | Limited (not ideal for many concurrent users) |
| Compute Management | Fully managed / Serverless option available | User-managed (manual or auto-scaling) |
| Workload Type | Read-heavy analytical workloads | Read + Write heavy ETL and ML workloads |
| Cost Model | Charged per SQL compute usage | Charged per cluster uptime and resources |
Simple Summary
SQL Warehouse is optimized for analytics and BI queries, while Databricks Clusters are general-purpose compute for data engineering and machine learning workloads.
What is Databricks SQL Dashboard?
π A Databricks SQL Dashboard is a visualization layer in Databricks that allows users to build interactive reports and BI dashboards using SQL queries on Delta tables.
Dashboards run on Databricks SQL Warehouses and are mainly used by analysts, business users, and stakeholders.
πΉ Key Features
- π Create charts, tables, KPIs, and metrics using SQL queries
- ποΈ Interactive filters and parameters (date range, region, product)
- π Scheduled refresh for near real-time reporting
- π₯ Share dashboards with users or embed in apps
- β‘ Optimized for low-latency analytics using SQL Warehouse
π§© Components of a SQL Dashboard
- π SQL Queries: Define the logic for metrics and charts
- π Visualizations: Bar charts, line charts, pie charts, tables
- ποΈ Filters: Dropdowns, date pickers, parameter inputs
- π Dashboard Layout: Arrange multiple visuals in a single UI
π‘ Example Use Case
A company builds a Sales Analytics Dashboard showing daily revenue, top products, and region-wise sales.
Query 1: Total Revenue by Date
SELECT order_date, SUM(amount) AS revenue
FROM main.sales.transactions
GROUP BY order_date;

Query 2: Top Products
SELECT product_name, SUM(amount) AS revenue
FROM main.sales.transactions
GROUP BY product_name
ORDER BY revenue DESC;
These queries are visualized as:
- π Line chart for daily revenue
- π Bar chart for top products
- π Table for detailed transactions
π’ Typical Business Users
- π Business Analysts
- π¨βπΌ Product Managers
- π Executives and Stakeholders
π§ Summary
Databricks SQL Dashboard provides a native BI visualization layer on top of Delta Lake using SQL Warehouses, enabling interactive analytics without exporting data to external BI tools.
How Do You Manage Schema Changes in Databricks?
π§© Schema changes refer to adding, removing, or modifying columns in a table. In Databricks, schema changes are managed using Delta Lake features like Schema Enforcement, Schema Evolution, and Controlled Migrations.
π 1. Schema Enforcement (Strict Schema)
Schema Enforcement ensures that incoming data matches the table schema. If the schema does not match, the write operation fails.
π‘ Example
# This will fail if new columns appear in df
df.write.format("delta").mode("append").save("/mnt/delta/sales")
β Used in production to prevent accidental bad data ingestion.
π 2. Schema Evolution (Automatic Schema Update)
Schema Evolution allows Delta tables to automatically add new columns when data schema changes.
π‘ Example (mergeSchema = true)
(df.write.format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("/mnt/delta/sales"))
π If a new column like discount appears, it will be added to the table automatically.
π§ 3. Controlled Schema Migration (ALTER TABLE)
In production, schema changes are often applied manually using DDL to maintain governance.
π‘ Example
ALTER TABLE main.sales.transactions ADD COLUMN discount DOUBLE;
ALTER TABLE main.sales.transactions DROP COLUMN old_flag;
β This approach is preferred in enterprise environments with data governance.
π‘οΈ 4. Unity Catalog Governance
- π€ Only authorized users can change schema
- π All schema changes are audited
- π Data lineage tracks schema evolution
βοΈ 5. Best Practices in Production
- π« Avoid automatic schema evolution on Gold tables
- β Allow schema evolution in Bronze/Silver layers
- π§ͺ Test schema changes in dev before prod
- π Maintain schema versioning and documentation
π‘ Real Production Example
A new column customer_segment is added in source data:
(df.write.format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .saveAsTable("main.sales.customers"))
Delta automatically updates the table schema without downtime.
π§ Summary
Databricks manages schema changes using schema enforcement for data quality, schema evolution for flexibility, and controlled ALTER TABLE operations for governed production environments.
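Conceptually, mergeSchema makes the table schema the union of the existing schema and the incoming batch's schema. A toy sketch (plain Python dicts standing in for schemas; column names mirror the examples above):

```python
# Sketch of schema evolution: new columns from the incoming batch are
# added to the table schema instead of being rejected; existing column
# types are kept as-is.
table_schema = {"id": "INT", "amount": "DOUBLE"}
incoming = {"id": "INT", "amount": "DOUBLE", "customer_segment": "STRING"}

def merge_schema(current, new):
    merged = dict(current)
    for col, dtype in new.items():
        merged.setdefault(col, dtype)   # add only genuinely new columns
    return merged

print(merge_schema(table_schema, incoming))
```

Schema enforcement is the opposite policy: instead of merging, the write is rejected whenever `incoming` contains a column not already in `table_schema`.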
What is Cluster Auto-Scaling in Databricks?
π Cluster Auto-Scaling is a Databricks feature that automatically increases or decreases the number of worker nodes in a cluster based on workload demand. It helps optimize performance and cost by scaling compute resources dynamically.
βοΈ How Auto-Scaling Works
- Scale Out (up): Adds more worker nodes when workload increases (e.g., large data processing).
- Scale In (down): Removes idle worker nodes when workload decreases.
- π§ Uses metrics like CPU utilization, task backlog, and memory usage.
π§© Configuration Parameters
- πΉ Min Workers: Minimum number of worker nodes (e.g., 2)
- πΉ Max Workers: Maximum number of worker nodes (e.g., 20)
- πΉ Auto-scaling Enabled: Dynamic cluster size between min and max
π‘ Example
Min Workers: 2
Max Workers: 10
Auto-Scaling: Enabled
If a job starts processing a large dataset, the cluster scales from 2 to 10 workers. When the job finishes, it scales back to 2 workers to save cost.
π Real Production Use Case
- π¦ Daily batch ETL where data volume varies daily
- β‘ Ad-hoc analytics with unpredictable query load
- π Streaming pipelines with fluctuating event traffic
π§ Benefits
- π° Cost optimization (pay only for required compute)
- π Better performance during peak loads
- π€ No manual cluster resizing
π§ Summary
Cluster auto-scaling dynamically adjusts the number of worker nodes based on workload, ensuring high performance while minimizing infrastructure cost.
What is Cluster Auto-Termination in Databricks?
βΉοΈ Cluster Auto-Termination is a Databricks feature that automatically shuts down a cluster after a period of inactivity. It helps reduce cloud costs by stopping unused compute resources.
βοΈ How Auto-Termination Works
- β±οΈ Databricks monitors cluster activity (jobs, notebooks, SQL queries).
- π€ If no activity is detected for a configured time, the cluster is terminated.
- π° Users are not billed once the cluster is terminated.
π§© Configuration Parameter
- Auto-Termination Minutes: Idle time before cluster shuts down (e.g., 30, 60, 120 minutes)
π‘ Example
Auto-Termination = 60 minutes
If the cluster is idle for 60 minutes, Databricks automatically stops it.
π Real Production Use Cases
- π¨βπ» Development clusters used by engineers
- π Ad-hoc analytics clusters
- π§ͺ Testing and experimentation environments
β οΈ Important Behavior
- π Cluster must be restarted manually or by a job trigger
- π All running tasks will stop if termination happens
- πΎ Data is not lost because it is stored in external storage (S3/ADLS)
π§ Benefits
- π° Major cost savings by avoiding idle compute charges
- π€ No manual shutdown required
- π’ Enforced governance in enterprise environments
π§ Summary
Cluster auto-termination automatically shuts down idle clusters after a configured time to prevent unnecessary cloud costs while keeping data safe in external storage.
π How to Optimize Delta Table Performance (Databricks)
Delta Lake performance can degrade as data grows. Below are the most effective optimization techniques used in real production systems.
1οΈβ£ Use Partitioning (π)
Partitioning splits data into folders based on a column, reducing the amount of data scanned.
Example:
CREATE TABLE sales (
  id INT,
  country STRING,
  date DATE
) USING DELTA
PARTITIONED BY (country);
β Best for low-cardinality columns like country, date, year
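Why partitioning cuts scan cost can be shown with a toy file listing (plain Python; the Hive-style `country=XX` paths are illustrative, not a real table layout):

```python
# Sketch of partition pruning: with country as the partition column,
# a filter on country only touches files under the matching folder.
files = {
    "/sales/country=IN/part-0.parquet": 3,   # value = row count in file
    "/sales/country=US/part-0.parquet": 5,
    "/sales/country=UK/part-0.parquet": 2,
}

def scan(country):
    """Return only the files the query engine would actually read."""
    prefix = f"/sales/country={country}/"
    return {p: n for p, n in files.items() if p.startswith(prefix)}

print(scan("IN"))  # one of three files is read; the rest are pruned
```

A `WHERE country = 'IN'` filter reads one file instead of three; with hundreds of partitions the saving is proportionally larger.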
2οΈβ£ Run OPTIMIZE Command (β‘)
Delta tables create many small files. OPTIMIZE compacts them into fewer large files.
Example:
OPTIMIZE sales;
π Improves query speed by reducing file reads.
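The compaction behind OPTIMIZE is essentially bin-packing: many small files are combined into fewer files near a target size. A greedy sketch (plain Python; sizes in MB are made up, and real compaction is more sophisticated):

```python
# Sketch of small-file compaction: greedily pack small files into
# output files that stay at or under a target size.
def compact(file_sizes, target_mb=128):
    bins, current, total = [], [], 0
    for size in sorted(file_sizes, reverse=True):   # biggest first
        if total + size > target_mb and current:
            bins.append(current)                    # close this output file
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        bins.append(current)
    return bins

small_files = [10, 20, 30, 40, 50, 60, 70]
print(compact(small_files))  # 7 small files become 3 larger files
```

Fewer, larger files means fewer file-open and metadata operations per query, which is where the speedup comes from.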
3οΈβ£ Use Z-ORDER for Faster Filtering (π)
Z-Ordering clusters data on frequently filtered columns.
Example:
OPTIMIZE sales ZORDER BY (date, country);
β Great for columns used in WHERE filters.
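The idea behind Z-ordering is to interleave the bits of the clustered columns into a single "z-value", so rows close in both columns get close z-values and land in the same files. A minimal Morton-code sketch (plain Python on small integers; Delta's actual implementation differs in detail):

```python
# Sketch of a Morton (Z-order) code: interleave the bits of two values
# so that sorting by the combined code keeps both dimensions clustered.
def z_value(x, y, bits=8):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions from x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions from y
    return z

print(z_value(0b0011, 0b0101))  # bits of x and y interleaved
```

Sorting data files by such a code lets filters on either column skip most files, which is why ZORDER helps multi-column WHERE clauses more than a plain sort on one column would.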
4οΈβ£ Enable Delta Auto Optimize (π€)
Automatically compacts small files during writes.
Example:
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
5οΈβ£ Vacuum Old Files (π§Ή)
Delta keeps old versions for time travel. Vacuum deletes unused files.
Example:
VACUUM sales RETAIN 168 HOURS;
β οΈ Reduces storage cost but removes old history.
6οΈβ£ Use Efficient File Size (π¦)
Target file size: 128MB to 1GB
Too many small files = slow queries
7οΈβ£ Cache Hot Tables (π₯)
Cache frequently queried tables in memory.
Example:
CACHE TABLE sales;
8οΈβ£ Use Predicate Pushdown (π―)
Filter data early to reduce scanned rows.
Bad:
SELECT * FROM sales;
Good:
SELECT * FROM sales WHERE date = '2026-02-01';
9οΈβ£ Avoid Over-Partitioning (β οΈ)
Too many partitions = too many small files.
Rule: partitions should have at least 100MB data each.
β Quick Real-World Optimization Example
OPTIMIZE sales ZORDER BY (date); VACUUM sales RETAIN 168 HOURS;
Result: Query time reduced from 30s to 2s in production pipelines.
π§ Interview Tip
Say: "I optimize Delta tables using partitioning, OPTIMIZE with ZORDER, auto-compaction, and VACUUM to reduce small files and improve scan performance."
β‘ Photon vs Non-Photon Runtime in Databricks
Photon is a high-performance query engine built by Databricks to accelerate SQL and Spark workloads.
1οΈβ£ What is Non-Photon Runtime? (π’)
Non-Photon runtime is the default Apache Spark execution engine written in Scala/Java.
- Uses JVM-based Spark execution
- Good for general ETL and ML workloads
- Slower for SQL-heavy analytics
Databricks Runtime 12.2 (without Photon)
2οΈβ£ What is Photon Runtime? (π)
Photon is a vectorized C++ query engine built by Databricks for faster execution.
- Written in C++ (not JVM)
- Uses vectorized processing & SIMD
- Optimized for SQL, BI, and Delta queries
- 2x to 10x faster than standard Spark
Databricks Runtime 12.2 Photon
3οΈβ£ Key Differences (π)
| Feature | Photon Runtime | Non-Photon Runtime |
|---|---|---|
| Execution Engine | C++ Vectorized Engine | JVM Spark Engine |
| Performance | β‘ Very Fast | π’ Slower |
| Best For | SQL, BI, Analytics | ETL, ML, Streaming |
| Cost Efficiency | Lower (faster = fewer compute hours) | Higher for analytics workloads |
4οΈβ£ Real Production Example (π’)
A BI dashboard query took 40 seconds on non-Photon.
After switching to Photon:
- Query time reduced to 5 seconds
- Compute cost reduced by ~70%
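The vectorized-execution idea behind Photon can be illustrated in miniature: process values a batch at a time instead of one row at a time (plain Python toy; real Photon operates on columnar batches in C++ with SIMD, which Python cannot show directly):

```python
# Toy contrast of row-at-a-time vs. batch-at-a-time (vectorized)
# execution over the same data; both compute the same sum.
values = list(range(1_000))

def row_at_a_time(rows):
    total = 0
    for v in rows:                        # one value per "row" iteration
        total += v
    return total

def vectorized(rows, batch=256):
    total = 0
    for i in range(0, len(rows), batch):  # whole batch per iteration
        total += sum(rows[i:i + batch])   # tight inner loop over the batch
    return total

print(row_at_a_time(values) == vectorized(values))  # True: same result
```

Batch processing amortizes per-row interpretation overhead and lets the inner loop stay in optimized native code, which is the same principle that makes Photon faster than JVM row processing.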
5οΈβ£ When to Use Photon? (β )
- Databricks SQL Warehouses
- Interactive BI queries
- Large Delta table analytics
6οΈβ£ When NOT to Use Photon? (β)
- Heavy Python UDF workloads
- ML training pipelines
- Custom Spark transformations not supported by Photon
π§ Interview One-Liner
"Photon is Databricks' C++ vectorized engine that accelerates SQL and Delta queries compared to the JVM-based Spark runtime."
βοΈ What is Serverless Compute in Databricks?
Serverless compute in Databricks is a fully managed compute where Databricks automatically provisions, scales, and manages clusters for you.
1οΈβ£ How Serverless Compute Works (βοΈ)
You do NOT create clusters. Databricks handles:
- Cluster provisioning
- Auto-scaling
- Auto-termination
- Infrastructure management
You just run queries or notebooks.
2οΈβ£ Key Features (π)
- β‘ Instant startup (no cluster spin-up wait)
- π Automatic scaling based on workload
- π° Pay only for query execution time
- π No DevOps or cluster tuning needed
3οΈβ£ Serverless vs Cluster Compute (π)
| Feature | Serverless Compute | Dedicated Cluster |
|---|---|---|
| Cluster Management | Fully Managed by Databricks | User Managed |
| Startup Time | Seconds | Minutes |
| Scaling | Automatic | Manual / Auto-scaling config |
| Cost Model | Pay per query / workload | Pay for running cluster time |
| Use Case | SQL, BI, Dashboards | ETL, ML, Streaming |
4οΈβ£ Real Example (π’)
A BI team runs dashboards on Serverless SQL Warehouse.
- No cluster running 24/7
- Users run queries → compute spins up automatically
- Cost reduced by 60% compared to always-on clusters
5οΈβ£ When to Use Serverless Compute (β )
- Databricks SQL dashboards
- Ad-hoc analytics
- BI tools like Power BI / Tableau
6οΈβ£ When NOT to Use Serverless (β)
- Heavy ETL pipelines
- Streaming jobs
- Custom ML workloads needing GPU or libraries
π§ Interview One-Liner
"Serverless compute is Databricks-managed compute that auto-scales and charges per workload without requiring users to manage clusters."
π What is DLT (Delta Live Tables)?
Delta Live Tables (DLT) is a managed framework in Databricks to build reliable, declarative, and automated data pipelines.
You define what data should look like, and Databricks manages execution, monitoring, retries, and quality.
1οΈβ£ Key Features of DLT (π)
- π Declarative pipelines (SQL / Python)
- β Built-in data quality checks
- π Automatic retries & failure handling
- π Auto-scaling & optimized execution
- π§Ύ Lineage and monitoring UI
2οΈβ£ Simple DLT Example (π§ͺ)
SQL DLT Pipeline
CREATE LIVE TABLE bronze_users AS
SELECT * FROM raw_users;

CREATE LIVE TABLE clean_users AS
SELECT * FROM bronze_users
WHERE age > 0;
Python DLT Example
import dlt

@dlt.table
def bronze_users():
    return spark.read.format("json").load("/raw/users")

@dlt.table
def clean_users():
    return dlt.read("bronze_users").filter("age > 0")
3οΈβ£ Traditional Pipeline Example (Old Way) (βοΈ)
spark.read.format("json").load("/raw/users") \
.filter("age > 0") \
.write.format("delta").save("/tables/clean_users")
Here you manually manage scheduling, retries, and monitoring.
4οΈβ£ DLT vs Traditional Pipelines (π)
| Feature | Delta Live Tables | Traditional Spark Pipeline |
|---|---|---|
| Pipeline Type | Declarative | Imperative (manual code) |
| Data Quality | Built-in Expectations | Manual checks |
| Retries & Monitoring | Automatic | Manual Jobs setup |
| Lineage | Auto lineage graph | Manual documentation |
| Scaling & Optimization | Automatic | Manual cluster tuning |
| Use Case | Modern Lakehouse pipelines | Legacy Spark ETL |
5οΈβ£ Real Production Example (π’)
Raw Kafka data → Bronze → Silver → Gold tables using DLT:
- Bronze: raw ingestion
- Silver: cleaned & validated
- Gold: business aggregates
DLT automatically tracks lineage and failures.
π§ Interview One-Liner
"Delta Live Tables is a declarative pipeline framework in Databricks that automates ETL orchestration, data quality, and monitoring compared to manual Spark pipelines."