Databricks Interview Questions

✅ Overview 📋

Databricks is a cloud-based data and AI platform built on Apache Spark that provides a unified environment for data engineering, analytics, and machine learning.

It implements the Lakehouse architecture, combining the capabilities of data lakes and data warehouses in a single platform.

Databricks is a fully managed service that simplifies big data processing, collaboration, governance, and AI workloads at scale.


   +-------------------------+
   |   Databricks Workspace  |
   +------------+------------+
                |
                v
       +---------------------+
       |    Compute Layer    |
       | (Clusters / SQL WH) |
       +----------+----------+
                  |
          +-------+--------+
          |                |
          v                v
 +------------------+  +------------------+
 |    Delta Lake    |  |  Unity Catalog   |
 | (Storage & ACID) |  |  (Governance)    |
 +------------------+  +------------------+


🔷 Core Components 🔍


1️⃣ Compute (Clusters & SQL Warehouses) 🚀

The Compute layer provides scalable processing power for notebooks, pipelines, and SQL analytics.

  • Interactive Clusters : Used for development and exploration.
  • Job Clusters : Used for scheduled pipelines and batch processing.
  • SQL Warehouses : Optimized compute for BI and SQL queries.

2️⃣ Delta Lake (Storage Layer) 🗄️

Delta Lake is an open storage layer that adds reliability and performance to data lakes.

  • ACID Transactions : Ensures data consistency and reliability.
  • Time Travel : Query historical versions of data.
  • Schema Enforcement & Evolution : Handles schema changes safely.
  • Efficient Storage : Built on Parquet with transaction logs.

3️⃣ Workspace & Notebooks 🧑‍💻

Databricks Workspace provides collaborative notebooks for multiple users.

  • Supports Python, SQL, Scala, and R.
  • Used by data engineers, analysts, and data scientists.
  • Enables real-time collaboration and version control.

4️⃣ Unity Catalog 🔐

Unity Catalog is the centralized governance and metadata management layer.

  • Manages catalogs, schemas, and tables.
  • Provides fine-grained access control.
  • Tracks data lineage and auditing.

5️⃣ Jobs & Workflows ⚙️

Jobs and Workflows are used to orchestrate and schedule data pipelines.

  • Supports multi-task pipelines.
  • Handles retries, dependencies, and monitoring.
  • Can integrate with Airflow and CI/CD tools.

6️⃣ Databricks SQL 📊

Databricks SQL provides a SQL analytics engine and dashboards.

  • Used for BI and reporting.
  • Integrates with Power BI, Tableau, and Looker.
  • Optimized with Photon engine.

7️⃣ ML & AI Tools 🤖

  • MLflow : Experiment tracking and model lifecycle management.
  • Feature Store : Centralized feature management.
  • Model Serving : Deploy models as APIs.

🔄 Execution Flow

     Typical Databricks Pipeline Flow:

  • Data Ingestion : Load raw data into Delta Lake (Bronze layer).
  • Data Processing : Transform data into Silver and Gold tables.
  • Orchestration : Schedule pipelines using Jobs or Workflows.
  • Analytics : Query using Databricks SQL or BI tools.
  • ML Training : Train and deploy models using MLflow.
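The flow above can be sketched end-to-end in plain Python (a toy, Spark-free illustration of the Bronze to Silver to Gold refinement; all record shapes and values are made up):

```python
# Minimal pure-Python sketch of the Medallion flow
# (illustrative only; in Databricks these would be Delta tables).

raw_events = [  # Bronze: raw ingested records, with duplicates and bad rows
    {"order_id": 1, "amount": 100.0, "country": "US"},
    {"order_id": 1, "amount": 100.0, "country": "US"},   # duplicate
    {"order_id": 2, "amount": None,  "country": "DE"},   # missing amount
    {"order_id": 3, "amount": 50.0,  "country": "US"},
]

def to_silver(bronze):
    """Silver step: deduplicate on order_id and drop rows with missing amounts."""
    seen, silver = set(), []
    for row in bronze:
        if row["amount"] is not None and row["order_id"] not in seen:
            seen.add(row["order_id"])
            silver.append(row)
    return silver

def to_gold(silver):
    """Gold step: aggregate revenue per country for BI dashboards."""
    gold = {}
    for row in silver:
        gold[row["country"]] = gold.get(row["country"], 0.0) + row["amount"]
    return gold

silver = to_silver(raw_events)
gold = to_gold(silver)
print(gold)  # {'US': 150.0}
```

In a real pipeline each step would read from and write to a Delta table, but the layering logic is the same.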

❓ What problems does Databricks solve compared to traditional data warehouses?

Traditional data warehouses are optimized for structured analytics but have limitations in handling big data, AI, and unstructured workloads. Databricks solves these limitations using the Lakehouse architecture.


🔷 Problems with Traditional Data Warehouses ⚠️

  • Limited Data Types : Mainly support structured data, poor support for semi-structured and unstructured data.
  • High Cost : Expensive storage and compute with vendor lock-in.
  • Poor ML Support : Not designed for machine learning or AI workloads.
  • Data Silos : Separate systems needed for data lake, warehouse, and ML platforms.
  • Scalability Issues : Scaling storage and compute independently is difficult.

✅ How Databricks Solves These Problems 🚀

  • Unified Lakehouse : Combines data lake and warehouse in one platform.
  • Open Data Formats : Uses Delta and Parquet instead of proprietary formats.
  • Cost Efficient : Separation of compute and storage reduces cost.
  • AI & ML Native : Built-in MLflow, Feature Store, and model serving.
  • Scalable & Cloud Native : Easily scales to petabytes on cloud storage.

❓ What is Lakehouse Architecture?

Lakehouse architecture is a modern data architecture that unifies the capabilities of data lakes and data warehouses into a single platform, enabling analytics, streaming, and machine learning on the same data.


🔷 Why Lakehouse Was Needed ⚠️

  • Data Lakes : Cheap storage but no transactions, governance, or fast BI queries.
  • Data Warehouses : Fast analytics but expensive and limited to structured data.
  • Data Silos : Separate systems for raw data, analytics, and ML pipelines.

✅ Key Principles of Lakehouse Architecture 🏗️

  • Unified Storage : All data stored in open formats like Delta or Parquet on cloud storage.
  • ACID Transactions : Ensures data consistency, reliability, and concurrent reads/writes.
  • Schema Enforcement & Evolution : Validates schemas and supports schema changes safely.
  • Separation of Compute and Storage : Compute scales independently from storage, reducing cost.
  • Data Governance : Centralized metadata, lineage, and access control (e.g., Unity Catalog).
  • BI + ML on Same Data : Supports SQL analytics, streaming, and machine learning without data movement.

🔷 Core Components of Lakehouse Architecture 🔍

  • Open Storage Layer : Cloud object storage (S3, ADLS, GCS) with Delta/Parquet files.
  • Compute Engines : Distributed compute engines like Databricks clusters or SQL warehouses.
  • Metadata & Governance : Catalogs, schemas, permissions, and lineage tracking.
  • Query & Analytics Layer : SQL engines and BI tools for dashboards and reporting.
  • ML & Streaming Layer : Tools for machine learning, feature stores, and real-time pipelines.

📝 Benefits of Lakehouse Architecture 🚀

  • Cost Efficient : Uses cheap cloud storage instead of expensive warehouse storage.
  • Scalable : Handles petabyte-scale data with elastic compute.
  • Unified Platform : No need for separate systems for BI, ETL, and ML.
  • Open & Flexible : Avoids vendor lock-in using open file formats.
  • Real-Time & Batch : Supports both streaming and batch workloads.

❓ What is Medallion Architecture?

Medallion architecture is a data modeling and pipeline design pattern used in Lakehouse platforms to organize data into layered stages that improve data quality, reliability, and analytics readiness.


🔷 Why Medallion Architecture is Needed ⚠️

  • Raw Data is Messy : Ingested data is often incomplete, duplicate, or inconsistent.
  • Data Quality Issues : Analytics and ML require clean and standardized data.
  • Pipeline Maintainability : Complex transformations become hard to manage without layered design.

✅ Core Layers of Medallion Architecture 🥉🥈🥇

1️⃣ Bronze Layer (Raw Data) 🥉

  • Stores raw ingested data from source systems.
  • Minimal transformations applied.
  • Used for auditing and reprocessing.

2️⃣ Silver Layer (Cleaned & Enriched Data) 🥈

  • Data is cleaned, deduplicated, and standardized.
  • Business rules and transformations applied.
  • Used for intermediate analytics and ML features.

3️⃣ Gold Layer (Business-Ready Data) 🥇

  • Aggregated and curated data for BI and reporting.
  • Contains fact and dimension tables.
  • Used by dashboards and business users.

🔷 Key Principles of Medallion Architecture 🔍

  • Layered Transformation : Data flows from raw to refined to business-ready.
  • Incremental Processing : Supports streaming and batch updates.
  • Data Quality Improvement : Each layer improves data accuracy and consistency.
  • Reproducibility : Raw data preserved for reprocessing.

📝 Benefits of Medallion Architecture 🚀

  • Improved Data Quality : Structured data refinement pipeline.
  • Scalable Design : Works well for big data pipelines.
  • Easy Maintenance : Clear separation of responsibilities per layer.
  • Supports BI & ML : Gold and Silver layers feed analytics and ML models.

❓ Difference between Data Lake, Data Warehouse, and Lakehouse

Data Lake, Data Warehouse, and Lakehouse are different data architectures designed for storage, analytics, and AI workloads, each with different capabilities and use cases.


🔷 Data Lake 🗄️

  • Purpose : Store raw structured, semi-structured, and unstructured data.
  • Schema : Schema-on-read (schema applied when reading data).
  • Cost : Very low-cost storage using cloud object storage.
  • Performance : Slower for analytics without optimization layers.
  • Use Cases : Raw data ingestion, logs, IoT, media, backups.

🔷 Data Warehouse 🏢

  • Purpose : Fast analytics and BI reporting.
  • Schema : Schema-on-write (data structured before loading).
  • Cost : Expensive storage and compute.
  • Performance : Highly optimized for SQL queries.
  • Use Cases : Business dashboards, structured reporting, OLAP.

🔷 Lakehouse 🏗️

  • Purpose : Unified platform for analytics, streaming, and machine learning.
  • Schema : Supports both schema-on-read and schema-on-write.
  • Cost : Uses low-cost data lake storage with optimized compute.
  • Performance : Warehouse-like performance using Delta/ACID and indexing.
  • Use Cases : BI, AI/ML, streaming, and big data analytics on same data.

🔷 Key Differences Summary 📊

Feature                 Data Lake        Data Warehouse    Lakehouse
Data Type               All types        Structured        All types
Schema                  Schema-on-read   Schema-on-write   Both
Cost                    Low              High              Medium / Optimized
Analytics Performance   Low              High              High
ML Support              Limited          Limited           Native

❓ What are Databricks Workspaces?

Databricks Workspace is a collaborative environment where users can develop, manage, and share notebooks, dashboards, jobs, and other assets for data engineering, analytics, and machine learning.


🔷 Purpose of Databricks Workspace 🎯

  • Provides a central place to organize code, notebooks, and data assets.
  • Enables collaboration between data engineers, analysts, and data scientists.
  • Acts as the main interface for interacting with the Databricks platform.

🔷 Key Components in a Workspace 📂

  • Notebooks : Interactive notebooks supporting Python, SQL, Scala, and R.
  • Repos : Git-integrated repositories for version control.
  • Jobs & Workflows : Scheduled and orchestrated pipelines.
  • Dashboards : Visualizations and BI reports.
  • Libraries : Custom or third-party libraries for workloads.

🔷 Collaboration Features 🤝

  • Multiple users can edit and run notebooks simultaneously.
  • Role-based access control for folders and assets.
  • Integration with Git for CI/CD workflows.

📝 Key Benefits 🚀

  • Centralized development environment.
  • Improves team productivity and collaboration.
  • Supports end-to-end workflows from data ingestion to ML.

❓ What is a Databricks Cluster?

A Databricks Cluster is a group of virtual machines configured to provide distributed compute resources for running notebooks, jobs, and data processing workloads.


🔷 Purpose of Databricks Clusters 🎯

  • Provides scalable compute for data engineering, analytics, and machine learning.
  • Executes Spark jobs and SQL queries in a distributed manner.
  • Handles large-scale batch and streaming data processing.

🔷 Core Components of a Cluster 🧩

  • Driver Node : Orchestrates the job, schedules tasks, and returns results.
  • Worker Nodes : Execute tasks in parallel and process data partitions.
  • Executors : Processes running on worker nodes that perform computations.

🔷 Types of Databricks Clusters ⚙️

  • Interactive Cluster : Used for development and exploration in notebooks.
  • Job Cluster : Created automatically for scheduled jobs and terminated after execution.
  • SQL Warehouse : Specialized compute for SQL analytics.

📝 Key Features 🚀

  • Auto-scaling to add or remove worker nodes based on workload.
  • Auto-termination to reduce cost when idle.
  • Support for different instance types and Spark configurations.
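As a rough illustration of the auto-scaling idea, here is a toy sizing policy in plain Python (an assumed policy for illustration only, not Databricks' actual scaling algorithm; all names are hypothetical):

```python
import math

# Toy autoscaling decision: size the worker pool to cover the pending task
# backlog, clamped between the configured min and max workers.
def target_workers(pending_tasks, tasks_per_worker, min_workers, max_workers):
    needed = math.ceil(pending_tasks / tasks_per_worker) if pending_tasks else min_workers
    return max(min_workers, min(max_workers, needed))

print(target_workers(pending_tasks=900, tasks_per_worker=100,
                     min_workers=2, max_workers=8))  # 8 (capped at max_workers)
print(target_workers(pending_tasks=0, tasks_per_worker=100,
                     min_workers=2, max_workers=8))  # 2 (scaled down to min)
```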

❓ Types of Databricks Clusters

Databricks provides different types of compute depending on the workload and usage pattern, mainly Interactive Clusters, Job Clusters, and SQL Warehouses.


🔷 1️⃣ Interactive Cluster (All-Purpose Cluster) 🧑‍💻

  • Purpose : Development, testing, and exploratory analysis.
  • Usage : Used with notebooks by data engineers and data scientists.
  • Lifecycle : Manually created and shared by multiple users.
  • Cost : Can be expensive if left running.

🔷 2️⃣ Job Cluster ⚙️

  • Purpose : Production batch pipelines and scheduled jobs.
  • Usage : Created automatically when a job starts and terminated after completion.
  • Lifecycle : Ephemeral (short-lived).
  • Cost : More cost-efficient for production workloads.

🔷 3️⃣ SQL Warehouse (Databricks SQL Compute) 📊

  • Purpose : BI analytics and SQL queries.
  • Usage : Used by analysts and BI tools like Power BI and Tableau.
  • Lifecycle : Managed compute optimized for SQL workloads.
  • Performance : Optimized using Photon engine.

🔷 Key Differences Summary 📋

Feature           Interactive Cluster           Job Cluster            SQL Warehouse
Primary Use       Development & Exploration     Production Jobs        BI & SQL Analytics
Lifecycle         Long-running                  Ephemeral              Managed / On-demand
Cost Efficiency   Lower                         High                   Medium
Users             Engineers & Data Scientists   Automation Pipelines   Analysts & BI Tools

❓ What is Databricks Runtime (DBR)?

Databricks Runtime is an optimized runtime environment provided by Databricks that includes Apache Spark, Delta Lake, and other libraries pre-configured for high performance and reliability.


🔷 Key Components of Databricks Runtime

  • Apache Spark : Core distributed processing engine for batch and streaming workloads.
  • Delta Lake : Storage layer that adds ACID transactions, schema enforcement, and versioning.
  • Optimized Libraries : Pre-installed ML, SQL, Python, and Scala libraries.
  • Photon Engine (in some versions) : Vectorized query engine for faster SQL analytics.

🔷 Types of Databricks Runtime

  • Standard Runtime : General-purpose analytics and data engineering workloads.
  • ML Runtime : Includes MLflow, TensorFlow, PyTorch, and other ML libraries.
  • Photon Runtime : High-performance SQL and BI workloads.

🔷 Why Databricks Runtime is Important

  • Improves performance compared to open-source Spark distributions.
  • Reduces setup and dependency management effort.
  • Ensures compatibility with Databricks features like Delta Lake and Unity Catalog.

❓ What is Photon Engine?

Photon is a high-performance query execution engine built by Databricks to accelerate SQL and data analytics workloads in the Lakehouse.


🔷 Key Characteristics of Photon

  • Vectorized Execution : Processes data in batches (vectors) instead of row-by-row for faster computation.
  • Native C++ Engine : Written in C++ for low-level performance instead of JVM-based Spark execution.
  • Columnar Processing : Optimized for columnar data formats like Delta and Parquet.
  • Automatic Integration : Works transparently with Spark SQL without changing user code.

🔷 Where Photon is Used

  • Databricks SQL Warehouses : BI dashboards and interactive analytics.
  • Batch Analytics : Large-scale aggregations and joins.
  • Lakehouse Queries : Faster reads from Delta Lake tables.

🔷 Benefits of Photon

  • Provides 2x to 10x faster query performance compared to standard Spark execution.
  • Reduces compute cost due to faster query execution.
  • Improves performance for BI tools like Power BI, Tableau, and Looker.

❓ What is Delta Lake?

Delta Lake is an open-source storage layer built on top of data lakes that adds reliability, performance, and governance features to data stored in Parquet files.


🔷 Core Features of Delta Lake

  • ACID Transactions : Ensures data consistency and reliability for concurrent reads and writes.
  • Schema Enforcement : Prevents bad or unexpected data from being written to tables.
  • Schema Evolution : Allows controlled changes to table schema over time.
  • Time Travel : Query historical versions of data using table versions or timestamps.

🔷 Performance Optimizations

  • Data Skipping : Skips irrelevant files using metadata statistics.
  • Z-Ordering : Co-locates related data to improve query performance.
  • Compaction (OPTIMIZE) : Merges small files into larger files for efficient reads.

🔷 Why Delta Lake is Important

  • Brings data warehouse reliability to data lakes.
  • Enables unified batch, streaming, and ML workloads on the same data.
  • Foundation of Databricks Lakehouse architecture.

❓ What is ACID compliance in Delta Lake?

ACID compliance in Delta Lake ensures that data operations are reliable, consistent, and safe even with concurrent reads and writes.


🔷 ACID Properties

  • Atomicity : A transaction is all-or-nothing. Either all changes are committed or none are applied.
  • Consistency : Data always moves from one valid state to another, following schema and constraints.
  • Isolation : Concurrent transactions do not interfere with each other. Readers see a consistent snapshot.
  • Durability : Once committed, data changes are permanently stored even after failures.

🔷 How Delta Lake Achieves ACID

  • Transaction Log (_delta_log) : Maintains metadata and commit history for all operations.
  • Optimistic Concurrency Control : Detects conflicts during concurrent writes.
  • Snapshot Isolation : Readers always see a consistent version of the table.
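The optimistic-concurrency idea can be sketched in a few lines of plain Python (a toy stub, greatly simplified; real Delta conflict detection also compares which files and partitions the competing transactions touched):

```python
# Toy sketch of optimistic concurrency control in a Delta-like commit protocol.

class DeltaTableStub:
    def __init__(self):
        self.version = 0   # latest committed version
        self.log = []      # one list of actions per committed version

    def commit(self, actions, read_version):
        # A writer commits against the snapshot version it read. If another
        # writer committed first, the version has moved on and this commit
        # fails; the writer must re-read the table and retry.
        if read_version != self.version:
            raise RuntimeError("ConcurrentModificationException: please retry")
        self.log.append(actions)
        self.version += 1
        return self.version

table = DeltaTableStub()
snapshot = table.version                            # two writers read version 0
table.commit(["add part-a.parquet"], snapshot)      # writer 1 commits -> version 1
try:
    table.commit(["add part-b.parquet"], snapshot)  # writer 2 conflicts
except RuntimeError as err:
    print(err)
```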

❓ What is the Delta Transaction Log?

The Delta Transaction Log is a metadata system that tracks all changes made to a Delta table, enabling ACID transactions, versioning, and reliability in data lakes.


🔷 What is _delta_log?

The _delta_log directory is a hidden folder in every Delta table that stores transaction history as JSON and Parquet files.


🔷 Key Functions

  • Metadata Management : Stores schema, table properties, and partition information.
  • Version Control : Tracks every commit as a new table version (time travel support).
  • Transaction History : Logs inserts, updates, deletes, merges, and schema changes.
  • Concurrency Control : Enables optimistic locking to handle multiple writers safely.

🔷 File Types in Delta Log

  • JSON Files : Each commit is stored as a JSON transaction log entry.
  • Checkpoint Parquet Files : Periodic snapshots of table state to speed up reads.
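Conceptually, the current state of a table is recovered by replaying the log. A minimal pure-Python sketch (the real log format has richer action records; file names and the simplified JSON shape here are made up):

```python
import json

# Simplified stand-ins for 00000000000000000000.json, ...0001.json, ...0002.json
commits = [
    '{"add": ["part-000.parquet"]}',
    '{"add": ["part-001.parquet"]}',
    '{"remove": ["part-000.parquet"], "add": ["part-002.parquet"]}',
]

def snapshot(commit_lines, as_of_version=None):
    """Replay commits (optionally up to a version, i.e. Time Travel)
    to reconstruct the set of active data files."""
    active = set()
    upto = len(commit_lines) if as_of_version is None else as_of_version + 1
    for line in commit_lines[:upto]:
        entry = json.loads(line)
        active -= set(entry.get("remove", []))   # files removed by this commit
        active |= set(entry.get("add", []))      # files added by this commit
    return active

print(sorted(snapshot(commits)))                   # current table state
print(sorted(snapshot(commits, as_of_version=0)))  # state at version 0
```

Checkpoint Parquet files exist precisely so readers do not have to replay every JSON commit from version 0.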

🔷 Why It Matters

  • Enables ACID transactions on data lakes
  • Supports Time Travel and Rollback
  • Improves reliability and auditability
  • Makes data lakes behave like databases

❓ What is Time Travel in Delta?

Time Travel in Delta allows you to query, audit, or restore previous versions of a Delta table using the transaction log.


🔷 What It Does

  • Historical Queries : Read old data versions.
  • Rollback : Restore table to a previous state.
  • Auditing : Track how data changed over time.

🔷 How It Works

  • Every write creates a new version in _delta_log.
  • Delta never overwrites data immediately; it tracks file changes.
  • You can query by version number or timestamp.

🔷 Examples

-- Read table at version 5
SELECT * FROM sales VERSION AS OF 5;

-- Read table as of a timestamp
SELECT * FROM sales TIMESTAMP AS OF '2025-02-10 10:00:00';

-- Restore table to an old version
RESTORE TABLE sales TO VERSION AS OF 5;


🔷 Real-World Use Cases

  • Accidental Delete : Recover deleted data.
  • Bad Pipeline Run : Roll back corrupted data.
  • Compliance : Audit historical snapshots of data.

🔷 Why It Is Important

  • Enables rollback and disaster recovery
  • Supports governance and auditing
  • Makes data lakes behave like databases

❓ What are Schema Enforcement and Schema Evolution?

Schema Enforcement ensures incoming data matches the table schema, while Schema Evolution allows Delta tables to automatically adapt to new columns or schema changes.


🔷 1️⃣ Schema Enforcement (Strict Schema) 🚫

Schema Enforcement rejects data if the schema does not match the existing table structure.


Example:

CREATE TABLE users (
  id INT,
  name STRING
) USING DELTA;

-- This will FAIL (extra column age)
INSERT INTO users VALUES (1, 'Aditya', 28);

Use Case: Prevents bad or unexpected data from corrupting production tables.


🔷 2️⃣ Schema Evolution (Automatic Schema Change) 🔄

Schema Evolution allows Delta to automatically add new columns or update schema when new data arrives.


Example using mergeSchema = true:

df.write.format("delta") \
  .mode("append") \
  .option("mergeSchema", "true") \
  .save("/mnt/delta/users")

If df contains a new column like age, Delta will automatically add it to the table.
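Conceptually, mergeSchema evolves the table schema to the union of the old and incoming schemas. A toy pure-Python sketch of that rule (types shown as strings; this is an illustration, not Delta's actual implementation):

```python
# Toy sketch of schema evolution: existing columns keep their types,
# new incoming columns are appended, incompatible type changes are rejected.

def merge_schema(table_schema, incoming_schema):
    merged = dict(table_schema)
    for col, dtype in incoming_schema.items():
        if col in merged and merged[col] != dtype:
            raise TypeError(f"incompatible type change for column '{col}'")
        merged.setdefault(col, dtype)
    return merged

users = {"id": "int", "name": "string"}
incoming = {"id": "int", "name": "string", "age": "int"}
print(merge_schema(users, incoming))  # {'id': 'int', 'name': 'string', 'age': 'int'}
```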


🔷 Key Differences

  • Schema Enforcement : Rejects mismatched schema (data quality control).
  • Schema Evolution : Automatically updates schema (flexibility for changing data).

🔷 Why It Is Important in Databricks

  • Handles evolving JSON/API/Kafka data sources
  • Reduces pipeline failures due to schema drift
  • Balances flexibility and governance in Lakehouse

❓ What is OPTIMIZE in Delta?

OPTIMIZE is a Delta Lake command used to compact many small files into fewer large files to improve query performance.


🔷 Why OPTIMIZE is Needed

In Delta tables, frequent writes (streaming, CDC, incremental loads) create many small files, which:

  • Increase query latency
  • Increase metadata overhead
  • Cause inefficient I/O operations

OPTIMIZE merges these small files into larger files.


🔷 Basic Example

OPTIMIZE sales;

This compacts all small files in the sales Delta table.


🔷 Partition-Specific Optimize

OPTIMIZE sales WHERE date = '2025-02-10';

Only optimizes files in a specific partition.


🔷 OPTIMIZE with Z-Ordering (Advanced)

Z-Ordering colocates related data to speed up selective queries.

OPTIMIZE sales
ZORDER BY (customer_id, country);

This improves query performance for filters on customer_id and country.


🔷 How OPTIMIZE Works Internally

  • Merges small Parquet files into large files (typically 128MB–1GB)
  • Updates Delta Transaction Log with new file references
  • Old files remain until VACUUM cleans them
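The compaction step can be sketched as a simple bin-packing of small files into a target size (a toy greedy packer in plain Python; the target size and file sizes are made up, and real OPTIMIZE is considerably smarter):

```python
# Toy sketch of file compaction: greedily pack files into groups of roughly
# target_mb, the way OPTIMIZE merges many small Parquet files into fewer big ones.

def compact(file_sizes_mb, target_mb=128):
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current_size + size > target_mb and current:
            bins.append(current)              # close this output file
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = [10, 20, 5, 60, 40, 90, 3]      # sizes in MB
merged = compact(small_files, target_mb=128)
print(len(small_files), "files ->", len(merged), "files")  # 7 files -> 3 files
```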

🔷 Real-World Use Cases

  • Streaming Pipelines : Micro-batches create many tiny files
  • CDC Loads : Frequent MERGE operations fragment files
  • BI Queries : Faster scans for dashboards and analytics

🔷 OPTIMIZE vs VACUUM

  • OPTIMIZE : Compacts small files for performance
  • VACUUM : Deletes old unused files to save storage

🔷 Why OPTIMIZE is Important in Databricks

  • Improves query speed significantly
  • Reduces metadata overhead
  • Essential for large-scale Delta Lake production systems

❓ What is Z-Ordering in Delta?

Z-Ordering is a data layout optimization technique in Delta Lake that co-locates related data in the same set of files to improve query performance for selective filters.


🔷 Why Z-Ordering is Needed

In large Delta tables, data is spread across many files. When you filter on a column, Spark may scan many unnecessary files.

Z-Ordering physically reorganizes data so that similar values are stored together, reducing file scans.


🔷 Basic Example

OPTIMIZE sales
ZORDER BY (customer_id);

This improves queries like:

SELECT * FROM sales WHERE customer_id = 123;


🔷 Z-Ordering on Multiple Columns

OPTIMIZE sales
ZORDER BY (customer_id, country);

This speeds up filters on customer_id and country.


🔷 How Z-Ordering Works Internally

  • Uses a space-filling curve (Z-curve) to map multi-dimensional data to 1D.
  • Groups similar column values together in the same Parquet files.
  • Improves data skipping and file pruning during queries.
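The Z-curve itself is just bit interleaving. A minimal pure-Python sketch (illustrative; Delta's implementation handles arbitrary types and is more sophisticated):

```python
# Sketch of the Z-curve idea: interleave the bits of two column values so rows
# that are close in BOTH columns get nearby sort keys, and therefore tend to
# land in the same data files.

def z_value(x, y, bits=8):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions <- x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions  <- y
    return z

rows = [(3, 4), (3, 5), (200, 7), (2, 4)]    # (customer_id, country_code), made up
rows_sorted = sorted(rows, key=lambda r: z_value(*r))
print(rows_sorted)  # [(2, 4), (3, 4), (3, 5), (200, 7)]
```

Writing files in this sort order is what lets a filter on either column prune most files via min/max statistics.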

🔷 Z-Ordering vs Partitioning

  • Partitioning : Splits data into directories (good for low-cardinality columns like date).
  • Z-Ordering : Optimizes file layout inside partitions (good for high-cardinality columns like user_id).

🔷 Real-World Use Cases

  • Filtering on customer_id, product_id, order_id
  • BI dashboards with selective queries
  • Data science feature lookups

🔷 Best Practices

  • Run Z-Order with OPTIMIZE (not alone)
  • Use on columns frequently used in filters and joins
  • Avoid too many Z-Order columns (2–4 max)

❓ What is VACUUM in Delta?

VACUUM is a Delta Lake command used to permanently delete old and unused data files that are no longer referenced by the Delta transaction log.


🔷 Why VACUUM is Needed

Delta tables keep old files for features like Time Travel and ACID transactions. Over time, these unused files:

  • Consume large storage space
  • Increase cloud storage costs
  • Slow down metadata operations

VACUUM cleans up these obsolete files.


🔷 Basic Example

VACUUM sales;

This removes files older than the default retention period (7 days).


🔷 Custom Retention Period Example

-- Delete files older than 24 hours
-- (retention below the default 7 days also requires setting
-- spark.databricks.delta.retentionDurationCheck.enabled = false)
VACUUM sales RETAIN 24 HOURS;


🔷 How VACUUM Works Internally

  • Checks the Delta Transaction Log to find active files
  • Identifies orphan and obsolete files not referenced by any version
  • Permanently deletes these files from storage
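The selection rule can be sketched in plain Python (a toy model: delete only files that are both unreferenced and older than the retention window; file names and timestamps are made up):

```python
import time

# Toy sketch of the VACUUM file-selection logic.
def vacuum(all_files, active_files, retain_hours=168, now=None):
    """all_files: {path: modification_time_epoch_seconds}.
    Returns paths that are orphaned AND older than the retention cutoff."""
    now = now if now is not None else time.time()
    cutoff = now - retain_hours * 3600
    return sorted(
        path for path, mtime in all_files.items()
        if path not in active_files and mtime < cutoff
    )

now = 1_700_000_000
files = {
    "part-000.parquet": now - 10 * 24 * 3600,  # old and no longer referenced
    "part-001.parquet": now - 10 * 24 * 3600,  # old, but still active
    "part-002.parquet": now - 3600,            # orphan, but within retention
}
print(vacuum(files, active_files={"part-001.parquet"}, now=now))
# ['part-000.parquet']
```

The retention window is what preserves recent Time Travel versions from deletion.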

🔷 VACUUM vs OPTIMIZE

  • OPTIMIZE : Compacts small files into larger files (performance).
  • VACUUM : Deletes old unused files (storage cleanup).

🔷 Impact on Time Travel

After VACUUM, you cannot time travel to deleted versions older than the retention period.


🔷 Real-World Use Cases

  • Long-running streaming pipelines creating many obsolete files
  • CDC pipelines with frequent MERGE/UPDATE/DELETE operations
  • Reducing cloud storage costs in production

🔷 Best Practices

  • Do not set very low retention in production
  • Keep default 7 days for safety
  • Run VACUUM periodically (weekly/monthly)

❓ What is Liquid Clustering?

Liquid Clustering is a new automatic data layout optimization technique in Delta Lake that dynamically clusters data based on query patterns without requiring manual partitioning or Z-Ordering.


🔷 Why Liquid Clustering is Needed

Traditional optimization requires manual tuning:

  • Partitioning needs predefined columns
  • Z-Ordering must be manually executed
  • Data distribution changes over time (query patterns evolve)

Liquid Clustering automatically adapts to changing workloads.


🔷 Basic Example

CREATE TABLE sales
USING DELTA
CLUSTER BY (customer_id, country);

Databricks will automatically reorganize data based on these columns.


🔷 How Liquid Clustering Works Internally

  • Continuously monitors query access patterns
  • Rewrites data files in the background for optimal layout
  • Uses adaptive clustering instead of fixed partitions
  • Improves data skipping and file pruning automatically

🔷 Liquid Clustering vs Partitioning vs Z-Ordering

  • Partitioning : Static directory-based organization (manual)
  • Z-Ordering : Manual file-level co-location optimization
  • Liquid Clustering : Automatic, adaptive clustering without manual OPTIMIZE

🔷 Real-World Use Cases

  • High-cardinality columns (user_id, transaction_id)
  • Dynamic query workloads (BI + ML + ad-hoc analytics)
  • Large-scale streaming and batch Delta tables

🔷 Benefits

  • Eliminates manual partition tuning
  • Automatically adapts to changing workloads
  • Improves query performance and data skipping
  • Reduces operational overhead for data engineers

❓ What is Delta Change Data Feed (CDF)?

Delta Change Data Feed (CDF) is a Delta Lake feature that captures and exposes row-level changes (inserts, updates, deletes) in a Delta table so downstream systems can consume only the changed data.


🔷 Why CDF is Needed

Traditionally, downstream pipelines reprocess entire tables to detect changes, which is expensive. CDF enables incremental processing by reading only changed rows.

  • Reduces compute and cost
  • Enables real-time replication
  • Simplifies CDC pipelines

🔷 How to Enable CDF

ALTER TABLE sales
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);


🔷 How to Read Change Data

-- Read changes between versions
SELECT * FROM table_changes('sales', 10, 20);

-- Read changes since a timestamp
SELECT * FROM table_changes('sales', TIMESTAMP '2025-02-10', TIMESTAMP '2025-02-15');


🔷 What Data CDF Captures

  • INSERT : New rows added
  • UPDATE_PREIMAGE : Old version of updated rows
  • UPDATE_POSTIMAGE : New version of updated rows
  • DELETE : Deleted rows

🔷 Example Output Columns

  • _change_type – Type of change
  • _commit_version – Delta table version
  • _commit_timestamp – When the change happened
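The change rows CDF exposes can be modeled as a diff between two table snapshots keyed by primary key. A toy pure-Python sketch (illustrative only; real CDF is produced from the transaction log, not by diffing snapshots, and the sample data is made up):

```python
# Toy sketch of the change rows produced between two versions of a table.

def table_changes(old, new, version, ts):
    changes = []
    for key in old.keys() - new.keys():        # rows removed
        changes.append({**old[key], "_change_type": "delete",
                        "_commit_version": version, "_commit_timestamp": ts})
    for key in new.keys() - old.keys():        # rows added
        changes.append({**new[key], "_change_type": "insert",
                        "_commit_version": version, "_commit_timestamp": ts})
    for key in old.keys() & new.keys():        # rows modified: pre + post image
        if old[key] != new[key]:
            changes.append({**old[key], "_change_type": "update_preimage",
                            "_commit_version": version, "_commit_timestamp": ts})
            changes.append({**new[key], "_change_type": "update_postimage",
                            "_commit_version": version, "_commit_timestamp": ts})
    return changes

v10 = {1: {"id": 1, "amount": 100}, 2: {"id": 2, "amount": 50}}
v11 = {1: {"id": 1, "amount": 120}, 3: {"id": 3, "amount": 75}}
for c in table_changes(v10, v11, version=11, ts="2025-02-10 10:00:00"):
    print(c["_change_type"], c["id"])
```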

🔷 Real-World Use Cases

  • Streaming CDC pipelines to Kafka or downstream Delta tables
  • Syncing Delta tables with data warehouses (Snowflake, BigQuery)
  • Auditing and debugging data changes
  • Building incremental ML feature pipelines

🔷 Benefits

  • Eliminates full table scans for CDC
  • Near real-time incremental data consumption
  • Native Delta Lake feature (no external CDC tool needed)

❓ What is Auto Loader in Databricks?

Auto Loader is a Databricks feature for incrementally and efficiently ingesting new files from cloud storage into Delta tables using Structured Streaming.


🔷 Why Auto Loader is Needed

Traditional file ingestion (batch jobs) scans the entire directory repeatedly, which is slow and expensive. Auto Loader detects only new files and processes them incrementally.

  • Scales to millions of files
  • Low latency ingestion
  • Fault-tolerant and exactly-once processing

🔷 Basic Example (Databricks PySpark)

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events_schema")
      .load("/mnt/raw/events"))

df.writeStream.format("delta") \
  .option("checkpointLocation", "/mnt/checkpoints/events") \
  .table("bronze_events")

This continuously ingests new JSON files into a Delta Bronze table.


🔷 How Auto Loader Works

  • Monitors cloud storage (S3, ADLS, GCS)
  • Detects new files using notifications or directory listing
  • Processes files exactly once using checkpoints
  • Stores metadata in a schema location
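The checkpoint-based exactly-once file handling can be sketched in plain Python (a toy model with an in-memory checkpoint; real Auto Loader persists this state durably and scales it to millions of files):

```python
# Toy sketch of checkpoint-based incremental file discovery: only files not yet
# recorded in the checkpoint are returned for processing.

def discover_new_files(listing, checkpoint):
    """listing: files currently visible in storage;
    checkpoint: set of files already processed (persisted in a real system)."""
    new_files = sorted(set(listing) - checkpoint)
    checkpoint.update(new_files)
    return new_files

checkpoint = set()
print(discover_new_files(["a.json", "b.json"], checkpoint))
# first pass: ['a.json', 'b.json']
print(discover_new_files(["a.json", "b.json", "c.json"], checkpoint))
# next pass: only the new file, ['c.json']
```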

🔷 File Discovery Modes

  • Directory Listing : Periodically lists files (simple but slower)
  • File Notification : Uses cloud events (SQS, Event Grid) for scalable ingestion

🔷 Schema Handling

  • Schema Inference : Automatically detects schema
  • Schema Evolution : Supports mergeSchema for new columns
  • Schema Enforcement : Rejects bad schema if configured

🔷 Real-World Use Cases

  • Ingesting IoT logs and event data
  • Landing raw data into Bronze layer (Medallion)
  • Streaming data lakes and CDC pipelines
  • Replacing traditional batch ingestion jobs

🔷 Benefits

  • Scales to billions of files
  • Cost-efficient incremental ingestion
  • Built-in fault tolerance and exactly-once guarantees

❓ What is a Managed Table vs an External Table in Databricks?

Managed and External tables differ in who controls the underlying data files and what happens when the table is dropped.


🔷 1️⃣ Managed Table (Internal Table) 📦

In a Managed Table, Databricks manages both the metadata and the underlying data files.


Example:

CREATE TABLE sales_managed (
  id INT,
  amount DOUBLE
) USING DELTA;

Data is stored in the Databricks managed storage location.


Behavior:

  • DROP TABLE deletes metadata + data files
  • Databricks controls storage path
  • Best for curated internal datasets

🔷 2️⃣ External Table (Unmanaged Table) 📂

In an External Table, Databricks manages only metadata, while data remains in a user-defined storage location.


Example:

CREATE TABLE sales_external (
  id INT,
  amount DOUBLE
)
USING DELTA
LOCATION '/mnt/data/sales_external';

Data is stored in the specified cloud storage path.


Behavior:

  • DROP TABLE deletes only metadata (data files remain)
  • User controls storage and lifecycle
  • Can be shared across multiple tools (Databricks, Athena, Synapse)

πŸ”· Key Differences

  • Storage Control : Managed β†’ Databricks, External β†’ User-defined path
  • Drop Behavior : Managed β†’ Data deleted, External β†’ Data preserved
  • Use Case : Managed β†’ Curated analytics, External β†’ Data lake sharing
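The drop behavior can be sketched in a few lines of Python. This is a toy model, not Databricks internals: `metastore` stands in for the catalog and `storage` for cloud object storage.

```python
# Toy sketch of DROP TABLE semantics for managed vs external tables.
def drop_table(metastore, storage, name, managed):
    path = metastore.pop(name)  # metadata is always removed
    if managed:
        storage.pop(path)       # managed: data files are deleted too
    # external: data files stay at the user-controlled path

metastore = {"sales_managed": "/managed/sales",
             "sales_external": "/mnt/data/sales_external"}
storage = {"/managed/sales": "parquet+log",
           "/mnt/data/sales_external": "parquet+log"}

drop_table(metastore, storage, "sales_managed", managed=True)
drop_table(metastore, storage, "sales_external", managed=False)
print(storage)  # {'/mnt/data/sales_external': 'parquet+log'} -- only external data survives
```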

πŸ”· Real-World Use Cases

  • Managed Table : Gold layer curated business tables
  • External Table : Bronze/Silver lake data shared across platforms

❓ What is Delta Live Tables (DLT)?

Delta Live Tables (DLT) is a Databricks framework for building reliable, declarative, and automated data pipelines that create and manage Delta tables with built-in data quality, orchestration, and monitoring.


πŸ”· Why DLT is Needed

Traditional ETL pipelines require manual orchestration, data quality checks, and monitoring. DLT automates these tasks and ensures reliable production pipelines.

  • Automatic pipeline orchestration
  • Built-in data quality checks
  • Automatic dependency management

πŸ”· Basic Example (DLT Table Definition)

import dlt

@dlt.table
def bronze_users():
    return spark.read.format("json").load("/mnt/raw/users")

This defines a Bronze Delta table managed by DLT.


πŸ”· Example with Data Quality Rules

@dlt.table
@dlt.expect("valid_id", "id IS NOT NULL")
@dlt.expect_or_drop("valid_age", "age >= 0")
def silver_users():
    return dlt.read("bronze_users")

DLT automatically enforces data quality rules.


πŸ”· How DLT Works

  • Uses declarative SQL/Python to define tables
  • Automatically builds DAG of dependencies
  • Manages compute clusters and retries
  • Tracks data lineage and pipeline health
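The dependency DAG that DLT builds from `dlt.read()` references boils down to a topological sort. A minimal sketch using the standard library (toy table names, not the DLT engine):

```python
from graphlib import TopologicalSorter

# Each table maps to the tables it reads from (as dlt.read() would reference).
deps = {
    "bronze_users": [],
    "silver_users": ["bronze_users"],
    "gold_user_stats": ["silver_users"],
}

# static_order() yields tables so that every dependency runs first.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['bronze_users', 'silver_users', 'gold_user_stats']
```

Independent branches of the graph come out in an order that allows them to run in parallel, which is how DLT parallelizes unrelated tables.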

πŸ”· DLT Pipeline Layers (Medallion)

  • Bronze : Raw ingestion tables
  • Silver : Cleaned and transformed tables
  • Gold : Business-level aggregated tables

πŸ”· Real-World Use Cases

  • Streaming and batch ETL pipelines
  • Building Medallion architecture pipelines
  • Data quality and governance pipelines
  • Automated production data engineering workflows

πŸ”· Benefits

  • Declarative pipelines (less orchestration code)
  • Built-in data quality enforcement
  • Automatic lineage, monitoring, and retries
  • Reduces operational complexity

❓ What is Unity Catalog?

Unity Catalog is Databricks’ centralized governance and metadata management system that provides unified access control, auditing, and data discovery across all data assets in a Lakehouse.


πŸ”· Why Unity Catalog is Needed

In traditional setups, each workspace had its own metastore and security model, making governance difficult. Unity Catalog provides a single governance layer across workspaces.

  • Centralized security and permissions
  • Unified metadata across workspaces
  • Data lineage and auditing

πŸ”· Core Components of Unity Catalog

  • Metastore : Central metadata repository for tables, views, and files
  • Catalog : Top-level container for schemas and tables
  • Schema (Database) : Logical grouping of tables and views
  • Table / View : Actual data objects
  • Volumes : Managed file storage for non-tabular data

πŸ”· Example Structure

metastore
  └── main_catalog
        └── sales_schema
              └── orders_table


πŸ”· Access Control Example

GRANT SELECT ON TABLE main_catalog.sales_schema.orders_table TO `data_analyst`;

This controls who can access data centrally.


πŸ”· Key Features

  • Fine-grained permissions (catalog, schema, table, column, row)
  • Cross-workspace data sharing
  • Built-in audit logs and lineage
  • Integration with Delta Sharing and BI tools

πŸ”· Unity Catalog vs Hive Metastore

  • Hive Metastore : Workspace-level, limited governance
  • Unity Catalog : Centralized, enterprise-grade governance

πŸ”· Real-World Use Cases

  • Enterprise data governance and compliance
  • Sharing data across teams and workspaces
  • Managing multi-cloud lakehouse environments
  • Auditing and security for sensitive data

πŸ”· Benefits

  • Single source of truth for metadata
  • Centralized security and governance
  • Improved data discovery and collaboration
  • Enterprise compliance (GDPR, HIPAA)

Catalogs, Schemas, Tables, and Volumes in Unity Catalog

1. Catalog

A catalog is the highest-level namespace in Unity Catalog. It is used to logically separate data across teams, projects, or environments and acts as a governance boundary for access control, auditing, and lineage tracking.

Example Use Case: Separate environments like development, staging, and production.

dev_catalog
prod_catalog

2. Schema (Database)

A schema is a logical container inside a catalog that groups related tables, views, and functions. It is similar to a database in traditional systems and is commonly used to organize data by domain or business unit.

Example Use Case: Organizing data by business domain.

prod_catalog.sales
prod_catalog.finance

3. Table

A table is a structured dataset stored in Delta format that contains rows and columns. Tables in Unity Catalog support ACID transactions, time travel, schema enforcement, and fine-grained access control.

Example:

prod_catalog.sales.transactions

4. Volume

A volume is a managed storage object used to store unstructured or semi-structured files such as CSV, JSON, Parquet, images, or logs. Volumes are governed by Unity Catalog, allowing secure file access without exposing cloud storage credentials.

Example:

prod_catalog.raw_data.file_volume

Hierarchy Structure

catalog.schema.table
catalog.schema.volume

This hierarchy enables centralized governance, data organization, and consistent security policies across all Databricks workloads.

Unity Catalog Table vs Hive Metastore Table

1. Unity Catalog Table

A Unity Catalog (UC) table is a table registered in Unity Catalog, which provides centralized governance, fine-grained access control, auditing, and data lineage across all Databricks workspaces and cloud storage.

Key Points:

  • Centralized metastore shared across multiple workspaces
  • Supports catalogs, schemas, and tables hierarchy
  • Fine-grained permissions (catalog, schema, table, column, row-level)
  • Built-in lineage and audit logging

Example:

prod_catalog.sales.transactions

2. Hive Metastore Table

A Hive Metastore (HMS) table is a legacy table registered in the workspace-level metastore. Governance is limited and mostly controlled at the workspace level.

Key Points:

  • Metastore is scoped to a single workspace
  • No catalogs (two-level database.table naming only)
  • Limited fine-grained access control
  • No centralized lineage across workspaces

Example:

sales.transactions

Key Differences

  • Scope : Unity Catalog → Global across workspaces, Hive Metastore → Workspace-specific
  • Hierarchy : Unity Catalog → catalog.schema.table, Hive Metastore → database.table
  • Governance : Unity Catalog → Centralized, fine-grained, Hive Metastore → Basic workspace-level
  • Lineage & Auditing : Unity Catalog → Built-in, Hive Metastore → Limited

What is Data Lineage in Unity Catalog?

Data lineage in Unity Catalog is the ability to track how data flows across tables, views, notebooks, jobs, and dashboardsβ€”showing where data comes from, how it is transformed, and where it is used.

Key Capabilities

  • Source Tracking: Shows upstream source tables and files used to create a dataset.
  • Transformation Tracking: Captures SQL queries, notebooks, and pipelines that modify data.
  • Downstream Impact: Identifies dashboards, ML models, or reports that depend on a table.
  • Column-Level Lineage: Tracks how individual columns are derived and transformed.

Example

If a Delta table sales_summary is created from raw_sales and used in a BI dashboard:

raw_sales  -->  cleaned_sales  -->  sales_summary  -->  Power BI Dashboard
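Impact analysis over such a lineage chain is just a graph walk. A toy sketch (Unity Catalog captures these edges automatically; the `edges`/`downstream` names are illustrative):

```python
# Toy lineage graph: each asset maps to the assets derived from it.
edges = {
    "raw_sales": ["cleaned_sales"],
    "cleaned_sales": ["sales_summary"],
    "sales_summary": ["powerbi_dashboard"],
}

def downstream(node, edges):
    """All assets impacted if `node` changes (breadth-first walk)."""
    seen, frontier = [], [node]
    while frontier:
        nxt = []
        for n in frontier:
            for child in edges.get(n, []):
                if child not in seen:
                    seen.append(child)
                    nxt.append(child)
        frontier = nxt
    return seen

print(downstream("raw_sales", edges))
# ['cleaned_sales', 'sales_summary', 'powerbi_dashboard']
```

Walking the edges in reverse would answer the source-tracking question instead: which upstream tables feed a given dashboard.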

Why It Is Important

  • Helps in impact analysis before changing tables
  • Improves data governance and compliance
  • Debugging data quality issues
  • Supports auditing and regulatory requirements

Unity Catalog automatically captures lineage without manual instrumentation across SQL, notebooks, and Delta Live Tables.

How Does Unity Catalog Handle Permissions?

Unity Catalog uses a centralized, fine-grained access control model to manage permissions across all data assets like catalogs, schemas, tables, views, and volumes. It follows a GRANT/REVOKE privilege-based security model similar to enterprise databases.

Permission Levels in Unity Catalog

  • Catalog Level: Controls access to an entire data domain (e.g., finance, marketing).
  • Schema Level: Controls access to a database-like namespace inside a catalog.
  • Table / View Level: Controls who can read or modify specific tables.
  • Column Level: Restricts access to sensitive columns (PII, salary, SSN).
  • Row-Level Security: Filters rows based on user roles (e.g., region-based access).

Common Privileges

  • SELECT: Read data
  • INSERT: Add new records
  • UPDATE / DELETE: Modify or remove records
  • CREATE: Create tables, views, schemas
  • USE CATALOG / USE SCHEMA: Required to access catalogs and schemas (Unity Catalog's replacement for the legacy USAGE privilege)

Example (Grant Permission)

GRANT USE CATALOG ON CATALOG main TO `data_analyst`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analyst`;
GRANT SELECT ON TABLE main.sales.transactions TO `data_analyst`;
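The key point of the example is that access requires privileges at every level of the hierarchy, not just on the table. A toy Python check of that rule (the `grants` dict and function names are illustrative, not a Databricks API):

```python
# Toy privilege check: reading a table requires catalog, schema, AND table grants.
grants = {
    ("data_analyst", "CATALOG", "main"): {"USE CATALOG"},
    ("data_analyst", "SCHEMA", "main.sales"): {"USE SCHEMA"},
    ("data_analyst", "TABLE", "main.sales.transactions"): {"SELECT"},
}

def can_select(user, catalog, schema, table):
    checks = [
        ("CATALOG", catalog, "USE CATALOG"),
        ("SCHEMA", f"{catalog}.{schema}", "USE SCHEMA"),
        ("TABLE", f"{catalog}.{schema}.{table}", "SELECT"),
    ]
    return all(priv in grants.get((user, kind, name), set())
               for kind, name, priv in checks)

print(can_select("data_analyst", "main", "sales", "transactions"))  # True
print(can_select("intern", "main", "sales", "transactions"))        # False, no grants at any level
```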

Column-Level Security Example

Unity Catalog restricts columns through dynamic views or column masks rather than column lists in GRANT:

CREATE VIEW main.sales.transactions_limited AS
SELECT customer_id, order_amount
FROM main.sales.transactions;

GRANT SELECT ON VIEW main.sales.transactions_limited TO `finance_team`;

Why Unity Catalog Permissions Are Important

  • Centralized governance across all Databricks workspaces
  • Enterprise-grade security for sensitive data
  • Audit logs for compliance (HIPAA, GDPR, SOC2)
  • Eliminates manual permission management per workspace

Unity Catalog integrates with cloud IAM (Azure AD, AWS IAM, GCP IAM) to enforce identity-based access control.

What is Row-Level Security (RLS) and Column-Level Security (CLS) in Databricks?

Row-Level Security (RLS) and Column-Level Security (CLS) are data governance features in Databricks (via Unity Catalog) that restrict data access at a granular level based on user roles or attributes.


1️⃣ Row-Level Security (RLS)

Row-Level Security restricts which rows a user can see in a table. Different users can see different subsets of the same table.

Example Use Case

A sales table contains data for all regions, but each regional manager should only see their region's data.

Databricks Example (Dynamic View for RLS)

CREATE VIEW main.sales.secure_sales AS
SELECT *
FROM main.sales.transactions
WHERE (region = 'India' AND current_user() = 'india_manager')
   OR (region = 'US'    AND current_user() = 'us_manager');

(In production, the filter usually joins a user-to-region mapping table or uses is_account_group_member() instead of hard-coding users.)

Now:

  • User india_manager sees only India rows
  • User us_manager sees only US rows

Key Points

  • Filters rows dynamically at query time
  • Used for multi-tenant data access
  • Common in finance, healthcare, and SaaS platforms
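The filtering the dynamic view performs can be simulated in plain Python. This is a toy sketch: a real deployment resolves the user's region from a mapping table or group membership at query time, and the `user_region`/`secure_sales` names are illustrative.

```python
# Toy row-level security: each user only sees rows for their mapped region.
user_region = {"india_manager": "India", "us_manager": "US"}

rows = [
    {"order_id": 1, "region": "India", "amount": 100},
    {"order_id": 2, "region": "US", "amount": 250},
]

def secure_sales(current_user):
    region = user_region.get(current_user)
    return [r for r in rows if r["region"] == region]

print([r["order_id"] for r in secure_sales("india_manager")])  # [1]
print([r["order_id"] for r in secure_sales("us_manager")])     # [2]
```

An unmapped user simply sees no rows, which is the safe default for multi-tenant access.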

2️⃣ Column-Level Security (CLS)

Column-Level Security restricts access to specific columns in a table. Sensitive columns like salary, SSN, or credit card numbers can be hidden.

Example Use Case

HR team can see employee salary, but analysts cannot.

Databricks Example (Restricted View)

Unity Catalog does not support column lists in GRANT, so a restricted view (or a column mask) exposes only the non-sensitive columns:

CREATE VIEW main.hr.employees_public AS
SELECT employee_id, name, department
FROM main.hr.employees;

GRANT SELECT ON VIEW main.hr.employees_public TO `analyst_team`;

HR team gets full access to the underlying table:

GRANT SELECT ON TABLE main.hr.employees TO `hr_team`;

Key Points

  • Protects sensitive PII and confidential data
  • Enforced at Unity Catalog metadata layer
  • Supports compliance like GDPR and HIPAA

πŸ”Ή RLS vs CLS (Quick Difference)

  • Controls : RLS → Which rows are visible, CLS → Which columns are visible
  • Use Case : RLS → Region-based or tenant-based filtering, CLS → Hide sensitive fields like salary or SSN
  • Implementation : RLS → Dynamic views / row filter policies, CLS → Dynamic views / column masks

In Databricks, RLS and CLS are key features of Unity Catalog for enterprise-grade data governance and security.

What are Databricks Jobs & Workflows?

Databricks Jobs are used to run automated data pipelines, ETL tasks, ML training, or batch workloads on a schedule or event trigger. Workflows orchestrate multiple tasks in a dependency-based DAG (Directed Acyclic Graph).


1️⃣ Databricks Jobs

A Job is a scheduled or triggered execution of code (Notebook, Python script, SQL, or JAR) on a Databricks cluster.

Key Features

  • Run notebooks, Python files, SQL queries, or JARs
  • Schedule jobs using cron (hourly, daily, weekly)
  • Trigger jobs via API, UI, or events
  • Auto-retry on failure and alerting

Example Use Case

Run a daily ETL pipeline at 2 AM to load data from raw to curated Delta tables.


2️⃣ Databricks Workflows

Workflows are a collection of tasks connected with dependencies. They allow you to build end-to-end data pipelines with multiple steps.

Workflow Task Types

  • Notebook Task
  • Python Script Task
  • SQL Task
  • Delta Live Tables Pipeline
  • Job Trigger Task (job chaining)

Example Workflow DAG

Task 1: Ingest Raw Data
        ↓
Task 2: Transform (Silver)
        ↓
Task 3: Aggregations (Gold)
        ↓
Task 4: ML Model Training

Key Differences (Jobs vs Workflows)

  • Scope : Jobs → Single task execution, Workflows → Multi-task orchestration
  • Dependencies : Jobs → Not supported, Workflows → Supported (DAG-based)
  • Use Case : Jobs → Simple scheduled ETL or script, Workflows → End-to-end data pipelines

Why Jobs & Workflows are Important

  • Automates data pipelines without external schedulers (Airflow, Jenkins)
  • Built-in monitoring, retries, and alerts
  • Supports CI/CD and production pipelines
  • Serverless execution with job clusters

In interviews: "Databricks Jobs run scheduled workloads, and Workflows orchestrate multiple dependent tasks in a DAG for end-to-end data pipelines."

How Do You Schedule Notebooks in Databricks?

In Databricks, notebooks are scheduled using Databricks Jobs. A Job allows you to run notebooks automatically based on a schedule, event trigger, or API call, making it suitable for production ETL pipelines and batch workloads.

Ways to Schedule a Notebook

  • Time-Based Schedule: Run notebooks daily, hourly, weekly, or using a cron expression.
  • Event/API Trigger: Trigger jobs using Databricks REST API, CI/CD pipelines, or external orchestrators (Airflow, ADF).
  • Manual Trigger: Run jobs manually from UI or CLI for testing.

Scheduling via Databricks UI

  1. Create a Job
  2. Add a Task β†’ Select Notebook
  3. Attach a Cluster or Job Cluster
  4. Configure Schedule (Cron or Interval)

Example (Daily ETL Job)

Job Name: daily_sales_etl
Task: notebooks/sales_pipeline
Schedule: Cron -> 0 0 2 * * ?  (Quartz syntax: every day at 2:00 AM)
Cluster: Job Cluster (Auto-terminate)

This runs the notebook every day at 2 AM to ingest and transform sales data.

Key Interview Points

  • Scheduled via Jobs with cron or interval triggers
  • Supports retries, alerts, and timeouts
  • Can use dedicated Job Clusters for cost optimization
  • Commonly used for production batch pipelines

What is a Multi-Task Job in Databricks?

A multi-task job in Databricks is a workflow that contains multiple tasks with defined dependencies, executed as a DAG (Directed Acyclic Graph). Each task can run a notebook, Python script, SQL query, JAR, or Delta Live Tables pipeline.

Key Characteristics

  • Task Dependencies: Tasks can depend on previous tasks (sequential or parallel execution).
  • Different Compute per Task: Each task can run on its own cluster or shared cluster.
  • Retries & Timeouts: Retry and timeout policies can be configured per task.
  • Parallel Execution: Independent tasks can run simultaneously.

Example Multi-Task Workflow

Task 1: Ingest Raw Data (Bronze)
        ↓
Task 2: Transform Data (Silver)
        ↓
Task 3: Aggregations (Gold)
        ↓
Task 4: ML Model Training

Real Production Use Case

  • Daily data ingestion from S3/ADLS
  • Data cleaning and transformations
  • Business aggregations for dashboards
  • Trigger ML pipeline after data is ready

Key Interview Points

  • Multi-task jobs orchestrate end-to-end pipelines inside Databricks
  • They replace external orchestrators for many use cases (Airflow, ADF)
  • They provide built-in monitoring, retries, and dependency management

How Do You Handle Retries and Failures in Databricks Jobs?

Databricks Jobs provide built-in mechanisms to automatically handle task failures using retries, timeouts, alerts, and task dependencies. These features help ensure reliability in production pipelines.

Retry Configuration

Each task in a job can be configured with a retry policy so that transient failures (network issues, cluster crashes, temporary data issues) are automatically retried.

Retries: 3
Retry Interval: 5 minutes

This means if a task fails, Databricks will retry it up to 3 times with a 5-minute gap.
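The retry policy above can be sketched as a simple loop. This is a toy model only: Databricks Jobs apply the policy server-side, and the sleep between attempts is stubbed out here via a callback.

```python
# Toy retry loop: 1 initial attempt + up to `max_retries` retries.
def run_with_retries(task, max_retries=3, on_wait=lambda minutes: None):
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > max_retries:
                raise  # retries exhausted: surface the failure
            on_wait(5)  # the real policy waits the retry interval here

calls = {"n": 0}
def flaky():
    """Fails twice (e.g. transient network issue), then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky))  # ('ok', 3) -- succeeded on the 3rd attempt
```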


Timeout Handling

You can define a timeout to stop tasks that run too long and prevent runaway jobs.

Task Timeout: 2 hours

If the task exceeds 2 hours, it will be terminated automatically.


Failure Notifications

Databricks can send alerts when jobs fail using:

  • Email notifications
  • Webhooks (Slack, Teams, PagerDuty)
  • Integration with monitoring tools

Task Dependency Failure Handling

In multi-task jobs, downstream tasks are automatically skipped if an upstream task fails.

Task 1 (Ingest) β†’ Task 2 (Transform) β†’ Task 3 (Aggregate)
If Task 1 fails, Task 2 and Task 3 will not run.
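The skip behavior for a linear chain can be sketched as follows (toy code; the real scheduler evaluates a DAG, not a flat list, and the status names are illustrative):

```python
# Toy run of a linear task chain: a failure marks all downstream tasks SKIPPED.
def run_chain(tasks):
    results, failed = {}, False
    for name, fn in tasks:
        if failed:
            results[name] = "SKIPPED"
            continue
        try:
            fn()
            results[name] = "SUCCESS"
        except Exception:
            results[name] = "FAILED"
            failed = True
    return results

def ok(): pass
def boom(): raise RuntimeError("source unavailable")

print(run_chain([("ingest", boom), ("transform", ok), ("aggregate", ok)]))
# {'ingest': 'FAILED', 'transform': 'SKIPPED', 'aggregate': 'SKIPPED'}
```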

Cluster Failure Handling

  • Jobs can use Job Clusters that auto-restart on failure
  • Databricks automatically reschedules tasks on healthy nodes

Example Production Scenario

A daily ETL job is configured with:

  • 3 retries
  • 10-minute retry interval
  • 1-hour timeout
  • Email alert to data engineering team

If the cluster crashes or a temporary data source issue occurs, the pipeline retries automatically without manual intervention.

What is Databricks SQL Warehouse?

A Databricks SQL Warehouse is a compute resource optimized for interactive SQL analytics and BI workloads. It is designed for low-latency, high-concurrency SQL queries and is commonly used by analysts and BI tools.

Key Characteristics

  • Optimized for SQL queries and dashboards
  • Supports high concurrency for many users
  • Uses Photon engine for fast query execution
  • Can be Serverless (no cluster management required)
  • Integrates with BI tools like Power BI, Tableau, Looker

Example Use Case

Business analysts run ad-hoc queries and dashboards on curated Delta tables using SQL Warehouse without interfering with ETL workloads.


Difference Between Databricks SQL Warehouse and Cluster

  • Primary Use Case : SQL Warehouse → BI analytics, dashboards, ad-hoc SQL; Cluster → ETL, streaming, ML, notebooks, batch processing
  • Latency : SQL Warehouse → Low-latency, interactive queries; Cluster → Higher latency, optimized for compute-heavy tasks
  • Concurrency : SQL Warehouse → High (many simultaneous users); Cluster → Limited (not ideal for many concurrent users)
  • Compute Management : SQL Warehouse → Fully managed / Serverless option available; Cluster → User-managed (manual or auto-scaling)
  • Workload Type : SQL Warehouse → Read-heavy analytical workloads; Cluster → Read + write heavy ETL and ML workloads
  • Cost Model : SQL Warehouse → Charged per SQL compute usage; Cluster → Charged per cluster uptime and resources

Simple Summary

SQL Warehouse is optimized for analytics and BI queries, while Databricks Clusters are general-purpose compute for data engineering and machine learning workloads.

What is Databricks SQL Dashboard?

πŸ“Š A Databricks SQL Dashboard is a visualization layer in Databricks that allows users to build interactive reports and BI dashboards using SQL queries on Delta tables.

Dashboards run on Databricks SQL Warehouses and are mainly used by analysts, business users, and stakeholders.


πŸ”Ή Key Features

  • πŸ“ˆ Create charts, tables, KPIs, and metrics using SQL queries
  • πŸŽ›οΈ Interactive filters and parameters (date range, region, product)
  • πŸ”„ Scheduled refresh for near real-time reporting
  • πŸ‘₯ Share dashboards with users or embed in apps
  • ⚑ Optimized for low-latency analytics using SQL Warehouse

🧩 Components of a SQL Dashboard

  • πŸ“ SQL Queries: Define the logic for metrics and charts
  • πŸ“Š Visualizations: Bar charts, line charts, pie charts, tables
  • 🎚️ Filters: Dropdowns, date pickers, parameter inputs
  • πŸ“Œ Dashboard Layout: Arrange multiple visuals in a single UI

πŸ’‘ Example Use Case

A company builds a Sales Analytics Dashboard showing daily revenue, top products, and region-wise sales.

Query 1: Total Revenue by Date
SELECT order_date, SUM(amount) AS revenue
FROM main.sales.transactions
GROUP BY order_date;

Query 2: Top Products
SELECT product_name, SUM(amount) AS revenue
FROM main.sales.transactions
GROUP BY product_name
ORDER BY revenue DESC;

These queries are visualized as:

  • πŸ“‰ Line chart for daily revenue
  • πŸ“Š Bar chart for top products
  • πŸ“‹ Table for detailed transactions

🏒 Typical Business Users

  • πŸ“Š Business Analysts
  • πŸ‘¨β€πŸ’Ό Product Managers
  • πŸ“ˆ Executives and Stakeholders

🧠 Summary

Databricks SQL Dashboard provides a native BI visualization layer on top of Delta Lake using SQL Warehouses, enabling interactive analytics without exporting data to external BI tools.

How Do You Manage Schema Changes in Databricks?

🧩 Schema changes refer to adding, removing, or modifying columns in a table. In Databricks, schema changes are managed using Delta Lake features like Schema Enforcement, Schema Evolution, and Controlled Migrations.


πŸ”’ 1. Schema Enforcement (Strict Schema)

Schema Enforcement ensures that incoming data matches the table schema. If the schema does not match, the write operation fails.

πŸ’‘ Example

# This append will fail if the incoming data has extra or mismatched columns
df.write.format("delta").mode("append").save("/mnt/delta/sales")

βœ… Used in production to prevent accidental bad data ingestion.


πŸ”„ 2. Schema Evolution (Automatic Schema Update)

Schema Evolution allows Delta tables to automatically add new columns when data schema changes.

πŸ’‘ Example (mergeSchema = true)

df.write.format("delta") \
  .option("mergeSchema", "true") \
  .mode("append") \
  .save("/mnt/delta/sales")

πŸ“Œ If a new column like discount appears, it will be added to the table automatically.


🧠 3. Controlled Schema Migration (ALTER TABLE)

In production, schema changes are often applied manually using DDL to maintain governance.

πŸ’‘ Example

ALTER TABLE main.sales.transactions
ADD COLUMN discount DOUBLE;

ALTER TABLE main.sales.transactions
DROP COLUMN old_flag;

βœ… This approach is preferred in enterprise environments with data governance.


πŸ›‘οΈ 4. Unity Catalog Governance

  • πŸ‘€ Only authorized users can change schema
  • πŸ“œ All schema changes are audited
  • πŸ” Data lineage tracks schema evolution

βš™οΈ 5. Best Practices in Production

  • 🚫 Avoid automatic schema evolution on Gold tables
  • βœ… Allow schema evolution in Bronze/Silver layers
  • πŸ§ͺ Test schema changes in dev before prod
  • πŸ“Š Maintain schema versioning and documentation

πŸ’‘ Real Production Example

A new column customer_segment is added in source data:

df.write.format("delta") \
  .option("mergeSchema", "true") \
  .mode("append") \
  .saveAsTable("main.sales.customers")

Delta automatically updates the table schema without downtime.


🧠 Summary

Databricks manages schema changes using schema enforcement for data quality, schema evolution for flexibility, and controlled ALTER TABLE operations for governed production environments.

What is Cluster Auto-Scaling in Databricks?

πŸ“ˆ Cluster Auto-Scaling is a Databricks feature that automatically increases or decreases the number of worker nodes in a cluster based on workload demand. It helps optimize performance and cost by scaling compute resources dynamically.


βš™οΈ How Auto-Scaling Works

  • ⬆️ Scale Out: Adds more worker nodes when workload increases (e.g., large data processing).
  • ⬇️ Scale In: Removes idle worker nodes when workload decreases.
  • 🧠 Uses metrics like CPU utilization, task backlog, and memory usage.

🧩 Configuration Parameters

  • πŸ”Ή Min Workers: Minimum number of worker nodes (e.g., 2)
  • πŸ”Ή Max Workers: Maximum number of worker nodes (e.g., 20)
  • πŸ”Ή Auto-scaling Enabled: Dynamic cluster size between min and max

πŸ’‘ Example

Min Workers: 2
Max Workers: 10
Auto-Scaling: Enabled

If a job starts processing a large dataset, the cluster scales from 2 β†’ 10 workers. When the job finishes, it scales back to 2 workers to save cost.
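The scaling decision can be sketched as picking a worker count from the task backlog and clamping it to the configured bounds. This is a deliberately simplified toy: real autoscaling also weighs CPU, memory, and scale-down cooldowns, and `tasks_per_worker` is an assumed illustrative parameter.

```python
# Toy scaling decision: clamp the needed worker count between min and max.
def target_workers(pending_tasks, tasks_per_worker, min_workers, max_workers):
    needed = -(-pending_tasks // tasks_per_worker)  # ceiling division
    return max(min_workers, min(needed, max_workers))

print(target_workers(80, 10, 2, 10))   # 8  -- heavy backlog: scale out
print(target_workers(500, 10, 2, 10))  # 10 -- demand exceeds max: capped
print(target_workers(0, 10, 2, 10))    # 2  -- idle: scale back to min
```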


🏭 Real Production Use Case

  • πŸ“¦ Daily batch ETL where data volume varies daily
  • ⚑ Ad-hoc analytics with unpredictable query load
  • 🌊 Streaming pipelines with fluctuating event traffic

🧠 Benefits

  • πŸ’° Cost optimization (pay only for required compute)
  • πŸš€ Better performance during peak loads
  • πŸ€– No manual cluster resizing

🧠 Summary

Cluster auto-scaling dynamically adjusts the number of worker nodes based on workload, ensuring high performance while minimizing infrastructure cost.

What is Cluster Auto-Termination in Databricks?

⏹️ Cluster Auto-Termination is a Databricks feature that automatically shuts down a cluster after a period of inactivity. It helps reduce cloud costs by stopping unused compute resources.


βš™οΈ How Auto-Termination Works

  • ⏱️ Databricks monitors cluster activity (jobs, notebooks, SQL queries).
  • πŸ’€ If no activity is detected for a configured time, the cluster is terminated.
  • πŸ’° Users are not billed once the cluster is terminated.

🧩 Configuration Parameter

  • Auto-Termination Minutes: Idle time before cluster shuts down (e.g., 30, 60, 120 minutes)

πŸ’‘ Example

Auto-Termination = 60 minutes

If the cluster is idle for 60 minutes, Databricks automatically stops it.
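The termination check itself is just an idle-time comparison. A toy sketch (Databricks tracks activity server-side; timestamps here are in minutes for simplicity):

```python
# Toy idle check: terminate once idle time reaches the configured window.
def should_terminate(last_activity_min, now_min, auto_terminate_min=60):
    return (now_min - last_activity_min) >= auto_terminate_min

print(should_terminate(last_activity_min=0, now_min=45))  # False, still inside the window
print(should_terminate(last_activity_min=0, now_min=60))  # True, idle for 60 minutes
```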


🏭 Real Production Use Cases

  • πŸ‘¨β€πŸ’» Development clusters used by engineers
  • πŸ“Š Ad-hoc analytics clusters
  • πŸ§ͺ Testing and experimentation environments

⚠️ Important Behavior

  • πŸ”„ Cluster must be restarted manually or by a job trigger
  • πŸ“‰ All running tasks will stop if termination happens
  • πŸ’Ύ Data is not lost because it is stored in external storage (S3/ADLS)

🧠 Benefits

  • πŸ’° Major cost savings by avoiding idle compute charges
  • πŸ€– No manual shutdown required
  • 🏒 Enforced governance in enterprise environments

🧠 Summary

Cluster auto-termination automatically shuts down idle clusters after a configured time to prevent unnecessary cloud costs while keeping data safe in external storage.

πŸš€ How to Optimize Delta Table Performance (Databricks)

Delta Lake performance can degrade as data grows. Below are the most effective optimization techniques used in real production systems.


1️⃣ Use Partitioning (πŸ“‚)

Partitioning splits data into folders based on a column, reducing the amount of data scanned.

Example:
CREATE TABLE sales (
  id INT,
  country STRING,
  date DATE
)
USING DELTA
PARTITIONED BY (country);

βœ” Best for low-cardinality columns like country, date, year


2️⃣ Run OPTIMIZE Command (⚑)

Delta tables create many small files. OPTIMIZE compacts them into fewer large files.

Example:
OPTIMIZE sales;

πŸ‘‰ Improves query speed by reducing file reads.


3️⃣ Use Z-ORDER for Faster Filtering (πŸ”)

Z-Ordering clusters data on frequently filtered columns.

Example:
OPTIMIZE sales ZORDER BY (date, country);

βœ” Great for columns used in WHERE filters.


4️⃣ Enable Delta Auto Optimize (πŸ€–)

Automatically compacts small files during writes.

Example:
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

5️⃣ Vacuum Old Files (🧹)

Delta keeps old versions for time travel. Vacuum deletes unused files.

Example:
VACUUM sales RETAIN 168 HOURS;

⚠️ Reduces storage cost but removes old history.


6️⃣ Use Efficient File Size (πŸ“¦)

Target file size: 128MB–1GB

Too many small files = slow queries


7️⃣ Cache Hot Tables (πŸ”₯)

Cache frequently queried tables in memory.

Example:
CACHE TABLE sales;

8️⃣ Use Predicate Pushdown (🎯)

Filter data early to reduce scanned rows.

Bad:
SELECT * FROM sales;
Good:
SELECT * FROM sales WHERE date = '2026-02-01';
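On a partitioned table, such a filter also enables partition pruning: whole folders are skipped before any file is read. A toy sketch of the idea (illustrative file layout, not Delta's actual file-skipping logic, which additionally uses min/max statistics):

```python
# Toy partition pruning over a date-partitioned layout.
files_by_partition = {
    "date=2026-01-31": ["part-000.parquet", "part-001.parquet"],
    "date=2026-02-01": ["part-002.parquet"],
}

def files_to_scan(predicate_date=None):
    if predicate_date is None:  # SELECT * FROM sales -- full scan
        return [f for fs in files_by_partition.values() for f in fs]
    # equality filter on the partition column: read one folder only
    return files_by_partition.get(f"date={predicate_date}", [])

print(len(files_to_scan()))              # 3 files scanned without a filter
print(len(files_to_scan("2026-02-01")))  # 1 file after pruning
```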

9️⃣ Avoid Over-Partitioning (⚠️)

Too many partitions = too many small files.

Rule of thumb: partition only when each partition will hold a substantial amount of data (Databricks guidance suggests at least ~1 GB per partition).


βœ… Quick Real-World Optimization Example

OPTIMIZE sales
ZORDER BY (date);

VACUUM sales RETAIN 168 HOURS;

Result: Query time reduced from 30s β†’ 2s in production pipelines.


🧠 Interview Tip

Say: "I optimize Delta tables using partitioning, OPTIMIZE with ZORDER, auto-compaction, and VACUUM to reduce small files and improve scan performance."

⚑ Photon vs Non-Photon Runtime in Databricks

Photon is a high-performance query engine built by Databricks to accelerate SQL and Spark workloads.


1️⃣ What is Non-Photon Runtime? (🐒)

Non-Photon runtime is the default Apache Spark execution engine written in Scala/Java.

  • Uses JVM-based Spark execution
  • Good for general ETL and ML workloads
  • Slower for SQL-heavy analytics
Example:
Databricks Runtime 12.2 (without Photon)

2️⃣ What is Photon Runtime? (πŸš€)

Photon is a vectorized C++ query engine built by Databricks for faster execution.

  • Written in C++ (not JVM)
  • Uses vectorized processing & SIMD
  • Optimized for SQL, BI, and Delta queries
  • 2x–10x faster than normal Spark
Example:
Databricks Runtime 12.2 Photon

3️⃣ Key Differences (πŸ“Š)

  • Execution Engine : Photon → C++ vectorized engine, Non-Photon → JVM Spark engine
  • Performance : Photon → ⚡ Very fast, Non-Photon → 🐒 Slower
  • Best For : Photon → SQL, BI, analytics; Non-Photon → ETL, ML, streaming
  • Cost Efficiency : Photon → Lower cost (faster = fewer compute hours), Non-Photon → Higher cost for analytics workloads

4️⃣ Real Production Example (🏒)

A BI dashboard query took 40 seconds on non-Photon.

After switching to Photon:

  • Query time reduced to 5 seconds
  • Compute cost reduced by ~70%

5️⃣ When to Use Photon? (βœ…)

  • Databricks SQL Warehouses
  • Interactive BI queries
  • Large Delta table analytics

6️⃣ When NOT to Use Photon? (❌)

  • Heavy Python UDF workloads
  • ML training pipelines
  • Custom Spark transformations not supported by Photon

🧠 Interview One-Liner

"Photon is Databricks’ C++ vectorized engine that accelerates SQL and Delta queries compared to JVM-based Spark runtime."

☁️ What is Serverless Compute in Databricks?

Serverless compute in Databricks is a fully managed compute option in which Databricks automatically provisions, scales, and manages the underlying clusters for you.


1️⃣ How Serverless Compute Works (βš™οΈ)

You do NOT create clusters. Databricks handles:

  • Cluster provisioning
  • Auto-scaling
  • Auto-termination
  • Infrastructure management

You just run queries or notebooks.


2️⃣ Key Features (πŸš€)

  • ⚑ Instant startup (no cluster spin-up wait)
  • πŸ“ˆ Automatic scaling based on workload
  • πŸ’° Pay only for query execution time
  • πŸ›  No DevOps or cluster tuning needed

3️⃣ Serverless vs Cluster Compute (πŸ“Š)

  • Cluster Management : Serverless → Fully managed by Databricks, Dedicated → User managed
  • Startup Time : Serverless → Seconds, Dedicated → Minutes
  • Scaling : Serverless → Automatic, Dedicated → Manual / auto-scaling config
  • Cost Model : Serverless → Pay per query / workload, Dedicated → Pay for running cluster time
  • Use Case : Serverless → SQL, BI, dashboards; Dedicated → ETL, ML, streaming

4️⃣ Real Example (🏒)

A BI team runs dashboards on Serverless SQL Warehouse.

  • No cluster running 24/7
  • Users run queries β†’ compute spins up automatically
  • Cost reduced by 60% compared to always-on clusters

5️⃣ When to Use Serverless Compute (βœ…)

  • Databricks SQL dashboards
  • Ad-hoc analytics
  • BI tools like Power BI / Tableau

6️⃣ When NOT to Use Serverless (❌)

  • Heavy ETL pipelines
  • Streaming jobs
  • Custom ML workloads needing GPU or libraries

🧠 Interview One-Liner

"Serverless compute is Databricks-managed compute that auto-scales and charges per workload without requiring users to manage clusters."

πŸ”„ What is DLT (Delta Live Tables)?

Delta Live Tables (DLT) is a managed framework in Databricks to build reliable, declarative, and automated data pipelines.

You define what data should look like, and Databricks manages execution, monitoring, retries, and quality.


1️⃣ Key Features of DLT (πŸš€)

  • πŸ“œ Declarative pipelines (SQL / Python)
  • βœ… Built-in data quality checks
  • πŸ” Automatic retries & failure handling
  • πŸ“ˆ Auto-scaling & optimized execution
  • 🧾 Lineage and monitoring UI

2️⃣ Simple DLT Example (πŸ§ͺ)

SQL DLT Pipeline

CREATE OR REFRESH LIVE TABLE bronze_users
AS SELECT * FROM raw_users;

CREATE OR REFRESH LIVE TABLE clean_users
AS SELECT * FROM LIVE.bronze_users WHERE age > 0;

Python DLT Example

import dlt

@dlt.table
def bronze_users():
    return spark.read.format("json").load("/raw/users")

@dlt.table
def clean_users():
    return dlt.read("bronze_users").filter("age > 0")

3️⃣ Traditional Pipeline Example (Old Way) (βš™οΈ)

spark.read.format("json").load("/raw/users") \
  .filter("age > 0") \
  .write.format("delta").save("/tables/clean_users")

Here you manually manage scheduling, retries, and monitoring.


4️⃣ DLT vs Traditional Pipelines (πŸ“Š)

  • Pipeline Type : DLT → Declarative, Traditional → Imperative (manual code)
  • Data Quality : DLT → Built-in expectations, Traditional → Manual checks
  • Retries & Monitoring : DLT → Automatic, Traditional → Manual Jobs setup
  • Lineage : DLT → Auto lineage graph, Traditional → Manual documentation
  • Scaling & Optimization : DLT → Automatic, Traditional → Manual cluster tuning
  • Use Case : DLT → Modern Lakehouse pipelines, Traditional → Legacy Spark ETL

5️⃣ Real Production Example (🏒)

Raw Kafka data β†’ Bronze β†’ Silver β†’ Gold tables using DLT:

  • Bronze: raw ingestion
  • Silver: cleaned & validated
  • Gold: business aggregates

DLT automatically tracks lineage and failures.


🧠 Interview One-Liner

"Delta Live Tables is a declarative pipeline framework in Databricks that automates ETL orchestration, data quality, and monitoring compared to manual Spark pipelines."

More questions coming .....