Database Management Systems
Unit 3: Relational Operations
JOINs, Aggregates, Subqueries, Views, Set Operators & Relational Algebra โ the SQL skills that separate beginners from professionals.
๐ข Oracle & PostgreSQL | ๐ 15 MCQs (Bloom's) | ๐ฌ 5 Lab Exercises | ๐ผ Interview Prep
Why This Chapter Pays Your Salary
JOINs and subqueries are the #1 topic in SQL interviews. Every data analyst, backend developer, and DBA writes multi-table queries daily. If Unit 2 taught you to build one table, this unit teaches you to connect the entire database into meaningful reports. A Flipkart analyst writing a "Top 10 products by revenue per category per month" report uses JOINs, GROUP BY, HAVING, window functions, and CTEs โ all in one query.
๐ข Industry Snapshot
Flipkart โ Their sales analytics dashboard uses 50+ table JOINs, GROUP BY with ROLLUP for hierarchical subtotals, and materialized views refreshed every 15 minutes. A single report query connects: products โ orders โ order_items โ sellers โ categories โ warehouses.
PhonePe โ Their UPI transaction summary uses aggregate functions across billions of rows. Monthly reports to NPCI require: total volume, average transaction value, success rate โ all computed using GROUP BY, COUNT, AVG, and HAVING on partitioned tables.
IRCTC โ Waitlist processing uses correlated subqueries: "For each cancelled ticket, find the first waitlisted passenger on that train on that date." This runs thousands of times per minute during Tatkal booking.
๐ฆ Schema Setup โ E-Commerce System (Flipkart-style)
All examples in this chapter use an e-commerce domain โ familiar and interview-relevant.
SQL โ Schema Setup
CREATE TABLE categories (
category_id NUMBER(5) PRIMARY KEY,
category_name VARCHAR2(50) NOT NULL UNIQUE
);
CREATE TABLE customers (
customer_id NUMBER(10) PRIMARY KEY,
customer_name VARCHAR2(100) NOT NULL,
city VARCHAR2(50),
reg_date DATE DEFAULT SYSDATE
);
CREATE TABLE products (
product_id NUMBER(10) PRIMARY KEY,
product_name VARCHAR2(100) NOT NULL,
category_id NUMBER(5) REFERENCES categories(category_id),
price NUMBER(10,2) CHECK (price > 0)
);
CREATE TABLE orders (
order_id NUMBER(10) PRIMARY KEY,
customer_id NUMBER(10) REFERENCES customers(customer_id),
order_date DATE DEFAULT SYSDATE,
status VARCHAR2(15) CHECK (status IN ('PENDING','SHIPPED','DELIVERED','CANCELLED'))
);
CREATE TABLE order_items (
item_id NUMBER(10) PRIMARY KEY,
order_id NUMBER(10) REFERENCES orders(order_id),
product_id NUMBER(10) REFERENCES products(product_id),
quantity NUMBER(5) CHECK (quantity > 0),
unit_price NUMBER(10,2) NOT NULL
);
-- Sample Data
INSERT INTO categories VALUES (1,'Electronics');
INSERT INTO categories VALUES (2,'Clothing');
INSERT INTO categories VALUES (3,'Books');
INSERT INTO categories VALUES (4,'Home & Kitchen');
INSERT INTO customers VALUES (1,'Rahul Sharma','Mumbai',DATE '2023-01-15');
INSERT INTO customers VALUES (2,'Priya Patel','Pune',DATE '2023-03-20');
INSERT INTO customers VALUES (3,'Amit Joshi','Delhi',DATE '2023-06-10');
INSERT INTO customers VALUES (4,'Sneha Kulkarni','Bangalore',DATE '2024-01-05');
INSERT INTO customers VALUES (5,'Rajesh Gupta','Mumbai',DATE '2024-02-28');
INSERT INTO products VALUES (101,'iPhone 15',1,79999);
INSERT INTO products VALUES (102,'Samsung Galaxy S24',1,69999);
INSERT INTO products VALUES (103,'Levi''s Jeans',2,2499);
INSERT INTO products VALUES (104,'DBMS by Navathe',3,650);
INSERT INTO products VALUES (105,'Prestige Cooker',4,1850);
INSERT INTO products VALUES (106,'Allen Solly Shirt',2,1299);
INSERT INTO orders VALUES (1001,1,DATE '2024-11-15','DELIVERED');
INSERT INTO orders VALUES (1002,2,DATE '2024-11-20','DELIVERED');
INSERT INTO orders VALUES (1003,1,DATE '2024-12-01','SHIPPED');
INSERT INTO orders VALUES (1004,3,DATE '2024-12-10','CANCELLED');
INSERT INTO orders VALUES (1005,2,DATE '2025-01-05','PENDING');
INSERT INTO order_items VALUES (1,1001,101,1,79999);
INSERT INTO order_items VALUES (2,1001,104,2,650);
INSERT INTO order_items VALUES (3,1002,103,3,2499);
INSERT INTO order_items VALUES (4,1002,106,1,1299);
INSERT INTO order_items VALUES (5,1003,102,1,69999);
INSERT INTO order_items VALUES (6,1004,105,2,1850);
INSERT INTO order_items VALUES (7,1005,101,1,79999);
COMMIT;
Learning Outcomes โ Bloom's Taxonomy
| Bloom's Level | Outcome Statement |
|---|---|
| L1 โ Remember | List all JOIN types, aggregate functions, and set operators; recall the symbols for relational algebra operations |
| L2 โ Understand | Explain when LEFT JOIN produces NULLs, why HAVING filters after GROUP BY, and how correlated subqueries differ from regular subqueries |
| L3 โ Apply | Write multi-table JOIN queries, GROUP BY with HAVING, subqueries (single-row, multi-row, correlated), and CTEs for real business reports |
| L4 โ Analyze | Compare JOIN vs subquery performance; analyze when EXISTS outperforms IN; determine correct join type from business requirements |
| L5 โ Evaluate | Evaluate view design decisions (updatable vs read-only, materialized vs virtual); justify when to denormalize with materialized views |
| L6 โ Create | Design complex report queries combining JOINs, aggregates, subqueries, CTEs, and views for a complete analytics dashboard |
Concept Explanations
3.1 Aggregate Functions & GROUP BY
SQL โ Aggregate Functions
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- Example 1: Basic aggregates on e-commerce data
-- Business Context: Dashboard KPIs for an e-commerce platform
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
SELECT
COUNT(*) AS total_orders,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(oi.quantity * oi.unit_price) AS total_revenue,
AVG(oi.quantity * oi.unit_price) AS avg_order_value,
MIN(o.order_date) AS first_order,
MAX(o.order_date) AS last_order
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.status != 'CANCELLED';
SQL โ GROUP BY & HAVING
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- Example 2: Revenue per category (only categories with revenue > โน5000)
-- Business Context: Identify top-performing product categories
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
SELECT
c.category_name,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(oi.quantity) AS units_sold,
SUM(oi.quantity * oi.unit_price) AS total_revenue,
ROUND(AVG(oi.unit_price), 2) AS avg_price
FROM categories c
JOIN products p ON c.category_id = p.category_id
JOIN order_items oi ON p.product_id = oi.product_id
JOIN orders o ON oi.order_id = o.order_id
WHERE o.status != 'CANCELLED' -- WHERE filters rows BEFORE grouping
GROUP BY c.category_name
HAVING SUM(oi.quantity * oi.unit_price) > 5000 -- HAVING filters groups AFTER
ORDER BY total_revenue DESC;
-- Output:
-- category_name | orders | units_sold | revenue | avg_price
-- Electronics | 3 | 3 | 229997 | 76666.33
-- Clothing | 1 | 4 | 8796 | 2199.00
WHERE vs HAVING confusion. WHERE filters individual rows BEFORE grouping. HAVING filters groups AFTER GROUP BY. You cannot use aggregate functions in WHERE: WHERE SUM(price) > 1000 is ILLEGAL. Use HAVING SUM(price) > 1000 instead. Rule: if it involves an aggregate (SUM, COUNT, AVG), it goes in HAVING.
ROLLUP, CUBE & GROUPING SETS
SQL โ Advanced Grouping
-- ROLLUP: Hierarchical subtotals + grand total
SELECT
c.category_name,
TO_CHAR(o.order_date, 'YYYY-MM') AS order_month,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM categories c
JOIN products p ON c.category_id = p.category_id
JOIN order_items oi ON p.product_id = oi.product_id
JOIN orders o ON oi.order_id = o.order_id
GROUP BY ROLLUP(c.category_name, TO_CHAR(o.order_date, 'YYYY-MM'));
-- Output includes: category+month rows, category subtotals (month=NULL),
-- and grand total row (both NULL). Perfect for financial reports.
-- CUBE: All possible subtotal combinations
GROUP BY CUBE(category_name, city);
-- Generates: (cat,city), (cat,NULL), (NULL,city), (NULL,NULL)
-- GROUPING SETS: Custom subset of groupings
GROUP BY GROUPING SETS((category_name), (city), ());
-- Only: by category, by city, and grand total โ no cross combos
3.2 SQL JOINs โ Connecting Tables
๐ JOINs โ Why Data Lives in Multiple Tables
The relational model distributes data across normalized tables to avoid redundancy. JOINs reconstruct the complete picture by combining rows from two or more tables based on a related column (usually PK-FK relationship).
๐ REAL-WORLD ANALOGYYour Aadhaar card has basic info. Your bank statement has financial info. Your medical record has health info. A JOIN is like combining all three: "Show me Rahul's address (Aadhaar) + balance (bank) + blood group (hospital)" โ one row connecting three sources.
JOIN Types โ Visual Summary
Table A (orders) Table B (customers)
โโโโโโฌโโโโโโ โโโโโโฌโโโโโโโโโโโ
โ o_idโc_id โ โc_idโ name โ
โโโโโโผโโโโโโค โโโโโโผโโโโโโโโโโโค
โ1001โ 1 โ โ 1 โ Rahul โ
โ1002โ 2 โ โ 2 โ Priya โ
โ1003โ 1 โ โ 3 โ Amit โ
โ1004โ 3 โ โ 4 โ Sneha โโ No orders
โ1005โ 2 โ โ 5 โ Rajesh โโ No orders
โโโโโโดโโโโโโ โโโโโโดโโโโโโโโโโโ
INNER JOIN: Only matching rows (1001-Rahul, 1002-Priya, etc.) โ 5 rows
LEFT JOIN: All A rows + matching B (same 5 rows, all orders have customers)
RIGHT JOIN: All B rows + matching A โ includes Sneha(NULL), Rajesh(NULL) โ 7 rows
FULL JOIN: All rows from both โ 7 rows (all orders + all customers)
CROSS JOIN: Every A ร every B โ 5 ร 5 = 25 rows (Cartesian product)
INNER JOIN
SQL
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- Example 3: INNER JOIN โ Orders with customer details
-- Business Context: Order listing with customer names
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- SQL:1999 Standard (RECOMMENDED)
SELECT o.order_id, c.customer_name, c.city, o.order_date, o.status
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
ORDER BY o.order_date DESC;
-- Output:
-- order_id | customer_name | city | order_date | status
-- 1005 | Priya Patel | Pune | 2025-01-05 | PENDING
-- 1004 | Amit Joshi | Delhi | 2024-12-10 | CANCELLED
-- 1003 | Rahul Sharma | Mumbai | 2024-12-01 | SHIPPED
-- 1002 | Priya Patel | Pune | 2024-11-20 | DELIVERED
-- 1001 | Rahul Sharma | Mumbai | 2024-11-15 | DELIVERED
SELECT * FROM orders o, customers c WHERE o.customer_id = c.customer_id. This comma-separated syntax is legacy. Use ANSI JOIN ... ON syntax โ it's clearer, less error-prone (forgetting WHERE gives CROSS JOIN), and works everywhere.LEFT OUTER JOIN
SQL
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- Example 4: LEFT JOIN โ All customers, even those with no orders
-- Business Context: Find customers who haven't ordered yet (marketing target)
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
SELECT c.customer_name, c.city, o.order_id, o.status
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.customer_name;
-- Output:
-- customer_name | city | order_id | status
-- Amit Joshi | Delhi | 1004 | CANCELLED
-- Priya Patel | Pune | 1002 | DELIVERED
-- Priya Patel | Pune | 1005 | PENDING
-- Rahul Sharma | Mumbai | 1001 | DELIVERED
-- Rahul Sharma | Mumbai | 1003 | SHIPPED
-- Rajesh Gupta | Mumbai | NULL | NULL โ No orders!
-- Sneha Kulkarni | Bangalore | NULL | NULL โ No orders!
-- Find ONLY customers with NO orders:
SELECT c.customer_name, c.city
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL; -- The NULL-check trick
SELF JOIN & Multi-Table JOINs
SQL
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- Example 5: SELF JOIN โ Customers from the same city
-- Business Context: "People who live near you also bought..."
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
SELECT c1.customer_name AS customer_a,
c2.customer_name AS customer_b,
c1.city
FROM customers c1
JOIN customers c2 ON c1.city = c2.city
AND c1.customer_id < c2.customer_id; -- Avoid duplicates (Rahul-Rajesh, not Rajesh-Rahul)
-- 4-table JOIN: Order details with customer, product, and category
SELECT c.customer_name, p.product_name, cat.category_name,
oi.quantity, oi.unit_price, (oi.quantity * oi.unit_price) AS line_total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories cat ON p.category_id = cat.category_id
WHERE o.status != 'CANCELLED'
ORDER BY c.customer_name, o.order_date;
JOIN Summary Table
| Join Type | Returns | Use Case |
|---|---|---|
INNER JOIN | Only matching rows from both tables | Standard lookup โ orders WITH customers |
LEFT JOIN | All left rows + matching right (NULL if no match) | Find customers with NO orders, show all products even unsold |
RIGHT JOIN | All right rows + matching left | Rare โ usually rewrite as LEFT JOIN by swapping tables |
FULL OUTER JOIN | All rows from both (NULLs on non-matching side) | Reconciliation โ compare two datasets for mismatches |
CROSS JOIN | Cartesian product (every A ร every B) | Combination generation (size ร color for a product matrix) |
SELF JOIN | Table joined with itself | Hierarchies (employee-manager), same-city comparisons |
NATURAL JOIN | Implicit join on same-named columns | โ ๏ธ AVOID โ fragile, breaks if columns are added/renamed |
3.3 Set Operators โ UNION, INTERSECT, MINUS/EXCEPT
SQL โ Set Operators
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- Example 6: Combine active + archived customer records (bank audit)
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- UNION: combines and removes duplicates
SELECT customer_name, city FROM customers WHERE city = 'Mumbai'
UNION
SELECT customer_name, city FROM customers WHERE reg_date > DATE '2024-01-01';
-- UNION ALL: keeps duplicates โ faster (no sort needed)
SELECT customer_name FROM active_customers
UNION ALL
SELECT customer_name FROM archived_customers;
-- INTERSECT: rows present in BOTH queries
SELECT customer_id FROM orders WHERE status = 'DELIVERED'
INTERSECT
SELECT customer_id FROM orders WHERE order_date > DATE '2024-12-01';
-- Customers with delivered orders AND orders after Dec 2024
-- MINUS (Oracle) / EXCEPT (PostgreSQL): rows in first NOT in second
SELECT customer_id FROM customers
MINUS -- PostgreSQL: EXCEPT
SELECT customer_id FROM orders;
-- Customers who have NEVER placed an order (same as LEFT JOIN + IS NULL)
| Operator | Duplicates | Oracle | PostgreSQL |
|---|---|---|---|
UNION | Removed | โ | โ |
UNION ALL | Kept | โ | โ |
INTERSECT | Removed | โ | โ |
MINUS | Removed | โ MINUS | โ Use EXCEPT |
3.4 Views โ Virtual Tables
๐ Views โ A Saved Query Masquerading as a Table
A view is a stored SELECT query that behaves like a virtual table. It doesn't store data โ it executes the underlying query each time you SELECT from it. Views provide security (hide columns), simplicity (hide complex JOINs), and consistency (same definition reused).
SQL โ Views
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- Example 7: Views for different user roles
-- Business Context: Dashboard views for sales team vs management
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- Simple view: Order summary (hide internal IDs)
CREATE OR REPLACE VIEW v_order_summary AS
SELECT o.order_id, c.customer_name, c.city,
o.order_date, o.status,
SUM(oi.quantity * oi.unit_price) AS order_total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, c.customer_name, c.city, o.order_date, o.status;
-- Now use it like a table:
SELECT * FROM v_order_summary WHERE status = 'DELIVERED';
-- Read-only view (prevent DML)
CREATE VIEW v_product_catalog AS
SELECT p.product_name, c.category_name, p.price
FROM products p JOIN categories c ON p.category_id = c.category_id
WITH READ ONLY;
-- View with CHECK OPTION (prevent invisible row inserts)
CREATE VIEW v_mumbai_customers AS
SELECT * FROM customers WHERE city = 'Mumbai'
WITH CHECK OPTION;
-- INSERT into v_mumbai_customers with city='Delhi' is REJECTED
-- because the row wouldn't be visible through the view
SQL โ Materialized View (Oracle)
CREATE MATERIALIZED VIEW mv_category_sales
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT c.category_name, SUM(oi.quantity * oi.unit_price) AS revenue
FROM categories c
JOIN products p ON c.category_id = p.category_id
JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY c.category_name;
-- Refresh manually when needed:
-- EXEC DBMS_MVIEW.REFRESH('mv_category_sales');
3.5 Subqueries & CTEs
SQL โ Subquery Types
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- Example 8: Single-row subquery
-- "Find products priced above the average"
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
SELECT product_name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);
-- Multiple-row subquery (IN)
-- "Find customers who ordered Electronics"
SELECT customer_name FROM customers
WHERE customer_id IN (
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.category_id = 1
);
-- Correlated subquery (inner query references outer query)
-- "Find products priced above their category average"
SELECT p.product_name, p.price, p.category_id
FROM products p
WHERE p.price > (
SELECT AVG(p2.price) FROM products p2
WHERE p2.category_id = p.category_id -- References outer p!
);
-- EXISTS (often faster than IN for large datasets)
-- "Customers who have at least one delivered order"
SELECT c.customer_name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id AND o.status = 'DELIVERED'
);
-- Scalar subquery in SELECT clause
SELECT p.product_name, p.price,
(SELECT AVG(price) FROM products) AS overall_avg,
p.price - (SELECT AVG(price) FROM products) AS diff_from_avg
FROM products p;
WITH Clause โ Common Table Expressions (CTEs)
SQL โ CTEs
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
-- Example 9: CTE for readable complex queries
-- Business Context: Top customer per city by total spend
-- โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
WITH customer_spend AS (
SELECT c.customer_id, c.customer_name, c.city,
SUM(oi.quantity * oi.unit_price) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.status != 'CANCELLED'
GROUP BY c.customer_id, c.customer_name, c.city
),
ranked AS (
SELECT cs.*,
ROW_NUMBER() OVER (PARTITION BY city ORDER BY total_spent DESC) AS rn
FROM customer_spend cs
)
SELECT customer_name, city, total_spent
FROM ranked WHERE rn = 1;
-- Recursive CTE: Category hierarchy (Electronics โ Mobiles โ Smartphones)
WITH RECURSIVE cat_tree AS ( -- PostgreSQL; Oracle: no RECURSIVE keyword
SELECT category_id, category_name, parent_id, 1 AS level
FROM categories WHERE parent_id IS NULL -- Root categories
UNION ALL
SELECT c.category_id, c.category_name, c.parent_id, ct.level + 1
FROM categories c
JOIN cat_tree ct ON c.parent_id = ct.category_id
)
SELECT LPAD(' ', (level-1)*4) || category_name AS hierarchy
FROM cat_tree ORDER BY level;
3.6 Relational Algebra โ The Mathematical Foundation
๐ Relational Algebra โ SQL's Mathematical Backbone
Relational algebra is a formal system of operations on relations (tables). Every SQL query is internally translated into a relational algebra expression by the query optimizer. Understanding it helps you write better SQL and understand EXPLAIN PLANs.
| Operation | Symbol | Purpose | SQL Equivalent |
|---|---|---|---|
| Selection | ฯ (sigma) | Filter rows (horizontal) | WHERE |
| Projection | ฯ (pi) | Choose columns (vertical) | SELECT col1, col2 |
| Union | โช | Combine two compatible relations | UNION |
| Set Difference | โ | Rows in R but not in S | MINUS / EXCEPT |
| Cartesian Product | ร | All combinations of rows | CROSS JOIN |
| Rename | ฯ (rho) | Rename relation or attributes | AS alias |
| Natural Join | โ | Join on common attributes | NATURAL JOIN |
| Theta Join | โฮธ | Join on arbitrary condition | JOIN ON condition |
| Intersection | โฉ | Rows common to both | INTERSECT |
| Division | รท | "Find X related to ALL Y" | Complex subquery with NOT EXISTS |
Relational Algebra โ SQL Mapping
Query: "Names and cities of customers from Mumbai who have placed orders"
Relational Algebra:
ฯcustomer_name, city(ฯcity='Mumbai'(customers โ orders))
Step by step:
1. โ (Natural Join): customers JOIN orders ON customer_id
2. ฯ (Selection): WHERE city = 'Mumbai'
3. ฯ (Projection): SELECT customer_name, city
SQL:
SELECT DISTINCT c.customer_name, c.city
FROM customers c JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'Mumbai';
EXPLAIN PLAN (Oracle) or EXPLAIN ANALYZE (PostgreSQL) to see how the optimizer executes your query โ then optimize by adding indexes on frequently filtered/joined columns.Industry Problems
๐ข Industry Problem #1 โ PhonePe UPI Transaction Summary Report
Organization: Payments company (PhonePe-scale)
Scenario: Generate the monthly UPI transaction report for NPCI submission. Report must include: total transactions, successful vs failed, total volume (โน), average transaction value, top 5 receiving banks by volume โ all grouped by month.
Schema:
SQL
CREATE TABLE upi_transactions (
txn_id NUMBER(15) PRIMARY KEY,
sender_vpa VARCHAR2(50) NOT NULL,
receiver_vpa VARCHAR2(50) NOT NULL,
amount NUMBER(12,2) CHECK (amount > 0),
status VARCHAR2(10) CHECK (status IN ('SUCCESS','FAILED','PENDING')),
txn_date DATE,
receiver_bank VARCHAR2(30)
);
๐ก Complete Solution
SQL
WITH monthly_stats AS (
SELECT
TO_CHAR(txn_date, 'YYYY-MM') AS txn_month,
COUNT(*) AS total_txn,
SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS success_count,
SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS failed_count,
SUM(CASE WHEN status = 'SUCCESS' THEN amount ELSE 0 END) AS total_volume,
ROUND(AVG(CASE WHEN status = 'SUCCESS' THEN amount END), 2) AS avg_txn_value,
ROUND(SUM(CASE WHEN status='SUCCESS' THEN 1 ELSE 0 END)*100.0/COUNT(*),2) AS success_rate
FROM upi_transactions
GROUP BY TO_CHAR(txn_date, 'YYYY-MM')
)
SELECT * FROM monthly_stats ORDER BY txn_month DESC;
-- Top 5 receiver banks by volume per month (using window function)
WITH bank_volume AS (
SELECT TO_CHAR(txn_date,'YYYY-MM') AS month, receiver_bank,
SUM(amount) AS vol,
DENSE_RANK() OVER (PARTITION BY TO_CHAR(txn_date,'YYYY-MM')
ORDER BY SUM(amount) DESC) AS rnk
FROM upi_transactions WHERE status='SUCCESS'
GROUP BY TO_CHAR(txn_date,'YYYY-MM'), receiver_bank
)
SELECT * FROM bank_volume WHERE rnk <= 5;
DBA Insight: At PhonePe's scale (10 billion+ monthly), this query runs on partitioned tables (partition by month). Materialized views pre-compute daily summaries. The NPCI report is generated from the materialized view, not raw transactions.
๐ข Industry Problem #2 โ College Placement Analytics Dashboard
Organization: University placement cell
Scenario: Generate a placement report: company-wise offers, department-wise placement rate, highest/lowest packages, students with multiple offers.
๐ก Solution Highlights
SQL
-- Students with multiple offers
SELECT s.student_name, COUNT(po.offer_id) AS num_offers,
MAX(po.ctc_lpa) AS highest_ctc
FROM students s
JOIN placement_offers po ON s.student_id = po.student_id
GROUP BY s.student_name
HAVING COUNT(po.offer_id) > 1
ORDER BY highest_ctc DESC;
-- Department-wise placement rate
SELECT d.dept_name,
COUNT(DISTINCT s.student_id) AS total_students,
COUNT(DISTINCT po.student_id) AS placed,
ROUND(COUNT(DISTINCT po.student_id)*100.0/COUNT(DISTINCT s.student_id),1) AS placement_pct
FROM departments d
JOIN students s ON d.dept_id = s.dept_id
LEFT JOIN placement_offers po ON s.student_id = po.student_id
GROUP BY d.dept_name
ORDER BY placement_pct DESC;
๐ข Industry Problem #3 โ Flipkart Product Recommendation Query
Organization: E-commerce (Flipkart-scale)
Scenario: "Customers who bought product X also bought..." โ find co-purchased products using self-joins on order_items.
๐ก Solution
SQL
-- Find products frequently bought WITH iPhone 15 (product_id=101)
SELECT p.product_name, COUNT(*) AS co_purchase_count
FROM order_items oi1
JOIN order_items oi2 ON oi1.order_id = oi2.order_id -- Same order
AND oi1.product_id != oi2.product_id -- Different product
JOIN products p ON oi2.product_id = p.product_id
WHERE oi1.product_id = 101 -- iPhone 15
GROUP BY p.product_name
ORDER BY co_purchase_count DESC
FETCH FIRST 5 ROWS ONLY;
-- Output:
-- product_name | co_purchase_count
-- DBMS by Navathe | 1 (bought together in order 1001)
DBA Insight: Flipkart's actual recommendation engine uses Apache Spark on historical data, not live SQL queries. But the SQL pattern above is exactly how recommendation prototypes are built and tested. Index on order_items(order_id, product_id) is critical for performance.
Lab Exercises
Exercise 1: Aggregate Functions & GROUP BY
Schema: E-commerce schema from Section 1
Tasks:
- Find total number of orders, total revenue, and average order value (excluding cancelled)
- Count orders per status (PENDING, SHIPPED, DELIVERED, CANCELLED)
- Find revenue per category โ show only categories with revenue > โน5000
- Find the most expensive product in each category
- Use GROUP BY with ROLLUP on (category, month) for subtotals
Hints: Remember HAVING for group-level filters, not WHERE.
Exercise 2: All JOIN Types
Tasks:
- INNER JOIN: List all orders with customer names and product names
- LEFT JOIN: Show ALL customers including those with no orders
- LEFT JOIN + IS NULL: Find customers who never ordered
- SELF JOIN: Find pairs of customers from the same city
- 4-table JOIN: Full order details (customer โ order โ items โ product โ category)
- Verify: COUNT the rows returned by INNER vs LEFT vs FULL OUTER JOIN โ explain the differences
Exercise 3: Subqueries โ All Types
Tasks:
- Single-row subquery: Products priced above average
- Multi-row subquery (IN): Customers who bought Electronics
- Correlated subquery: Products priced above their category average
- EXISTS: Customers with at least one delivered order
- Rewrite each subquery as a JOIN โ compare which is more readable
Exercise 4: Views & Materialized Views
Tasks:
- Create
v_order_details: customer name, product name, quantity, total, order date - Create
v_city_customersfor Mumbai with CHECK OPTION โ try inserting a Delhi customer - Create a READ ONLY view for the product catalog
- Try INSERT/UPDATE on a view with JOINs โ observe which operations fail
- Create a materialized view for category-wise monthly revenue (Oracle) or use
CREATE TABLE ASequivalent
Exercise 5: Complete Analytics Dashboard
Tasks: Write a single CTE-based query that produces a complete e-commerce dashboard:
- CTE 1: Customer lifetime value (total spend per customer)
- CTE 2: Product popularity (total units sold per product)
- CTE 3: Category revenue with ranking
- Final SELECT: Top 3 customers by spend, joined with their most-purchased category
Extension: Add a "month-over-month growth" calculation using LAG() window function.
MCQ Assessment Bank โ 15 Questions
Hover to reveal answer and explanation.
Which aggregate function counts the number of non-NULL values in a column?
- COUNT(*)
- COUNT(column_name)
- SUM(column_name)
- TOTAL(column_name)
COUNT(column_name) counts only non-NULL values. COUNT(*) counts all rows regardless of NULLs. If a column has 100 rows but 10 are NULL: COUNT(*)=100, COUNT(column)=90. SUM adds values, doesn't count. TOTAL is not standard SQL.๐ข This distinction is tested in GATE CS and every campus placement SQL test.
Which JOIN type returns ALL rows from the left table even if there is no matching row in the right table?
- INNER JOIN
- LEFT OUTER JOIN
- CROSS JOIN
- SELF JOIN
๐ข LEFT JOIN is the most commonly used join in data analysis โ finding "all X, even those without Y."
Which set operator removes duplicates when combining two SELECT results?
- UNION ALL
- UNION
- CROSS JOIN
- MERGE
๐ข Performance tip: always prefer UNION ALL unless you specifically need duplicate removal.
Why can't you use WHERE to filter on aggregate results (e.g., WHERE COUNT(*) > 5)?
- It's a syntax limitation that will be fixed in future SQL versions
- Because WHERE filters individual rows BEFORE grouping occurs; aggregates don't exist yet at the WHERE stage of execution. Use HAVING to filter groups AFTER GROUP BY
- You can โ it works in PostgreSQL
- WHERE only works with string comparisons
๐ข Understanding WHERE vs HAVING is essential for every SQL developer. Tested in every assessment.
What is the difference between a correlated subquery and a regular subquery?
- They are identical
- A correlated subquery references columns from the outer query and is re-executed for each outer row; a regular subquery is independent and executes only once
- Correlated subqueries run faster
- Regular subqueries can't use WHERE
WHERE price > (SELECT AVG(price) FROM products) โ inner query runs once, returns one result. Correlated subquery: WHERE price > (SELECT AVG(price) FROM products p2 WHERE p2.category_id = p.category_id) โ inner query references outer p.category_id and re-executes for EACH outer row. Correlated subqueries are slower (N executions vs 1) but sometimes necessary.๐ข Correlated subqueries are asked in every advanced SQL interview โ know how to optimize them.
What is a materialized view and how does it differ from a regular view?
- They are the same thing
- A regular view stores only the query definition (re-executes each time). A materialized view physically stores the query result on disk and is refreshed periodically. MViews are faster for reads but may show stale data between refreshes
- Materialized views are virtual
- Regular views store data on disk
๐ข MViews are a key Oracle DBA concept. Also available in PostgreSQL (manual refresh).
Write a query: "Find customers who have never placed any order."
SELECT * FROM customers WHERE order_id IS NULLSELECT c.customer_name FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id WHERE o.order_id IS NULLSELECT * FROM customers WHERE customer_id NOT IN (SELECT * FROM orders)SELECT * FROM customers MINUS SELECT * FROM orders
WHERE o.order_id IS NULL gives only no-order customers. Option A: customers table doesn't have order_id column. Option C: subquery must select customer_id, not *. Option D: different column structures, can't use MINUS.๐ข This is the single most asked JOIN question in SQL interviews. Three ways to write it: LEFT JOIN + IS NULL, NOT EXISTS, NOT IN.
Write a query to find the second highest salary from a doctors table.
SELECT MAX(salary) FROM doctors WHERE salary < (SELECT MAX(salary) FROM doctors)SELECT salary FROM doctors ORDER BY salary DESC LIMIT 2SELECT salary FROM doctors WHERE ROWNUM = 2SELECT MIN(salary) FROM doctors
DENSE_RANK() OVER (ORDER BY salary DESC) = 2.๐ข This is the #1 most asked SQL interview question across all Indian IT companies. Know 3 ways to solve it.
Write a CTE to find the top-spending customer per city.
- This requires a stored procedure
WITH spend AS (SELECT c.customer_name, c.city, SUM(oi.quantity*oi.unit_price) AS total FROM customers c JOIN orders o ON ... JOIN order_items oi ON ... GROUP BY c.customer_name, c.city), ranked AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY city ORDER BY total DESC) AS rn FROM spend) SELECT * FROM ranked WHERE rn = 1- CTEs can't use window functions
- Use GROUP BY city only
๐ข This exact pattern is used at Swiggy, Zomato, and Flipkart for analytics dashboards.
A query uses LEFT JOIN but returns the same number of rows as INNER JOIN. What does this tell you about the data?
- The query has a bug
- LEFT JOIN and INNER JOIN are identical
- Every row in the left table has at least one matching row in the right table โ there are no "orphan" rows. The LEFT JOIN is technically unnecessary but not incorrect
- The right table is empty
๐ข Senior developers check this to optimize queries โ unnecessary LEFT JOINs add overhead.
A developer writes: SELECT * FROM customers WHERE customer_id NOT IN (SELECT customer_id FROM orders WHERE customer_id IS NULL). The query returns NO rows even though customers exist. What happened?
- The customers table is empty
- The orders subquery returns NULL values. NOT IN with any NULL in the list evaluates to UNKNOWN (not TRUE) for every row, so no rows are returned. Fix: use NOT EXISTS instead, or add WHERE customer_id IS NOT NULL in the subquery
- NOT IN doesn't work with subqueries
- The WHERE clause syntax is wrong
NOT IN checks: customer_id != 1 AND customer_id != 2 AND customer_id != NULL. Comparing anything with NULL yields UNKNOWN, and UNKNOWN AND TRUE = UNKNOWN. So NO rows pass the filter. NOT EXISTS handles NULLs correctly because it checks existence, not value equality. This is why experienced developers prefer NOT EXISTS over NOT IN.๐ข This NULL trap causes production bugs regularly. It's a frequent interview "gotcha" question.
A view v_order_summary uses JOINs and GROUP BY. A developer tries INSERT INTO v_order_summary VALUES (...). Evaluate this operation.
- It works normally
- It fails โ views with JOINs, GROUP BY, DISTINCT, aggregates, or UNION are NOT updatable. The DBMS cannot determine which underlying table to modify. Only simple single-table views without these constructs are updatable
- Only the first table in the JOIN is updated
- It works in PostgreSQL but not Oracle
๐ข View updatability rules are tested in Oracle OCA certification (1Z0-071).
A team debates: use a correlated subquery or a JOIN+GROUP BY to find "employees earning more than their department average." Evaluate both approaches.
- Subquery is always better
- JOIN is always better
- Correlated subquery is more readable (closer to natural language) but re-executes the inner query for each row (O(nยฒ)). JOIN+GROUP BY computes averages once, then joins (O(n log n)). For large tables, the JOIN approach is significantly faster. For small tables, difference is negligible. Modern optimizers may rewrite one into the other
- They can't solve the same problem
WHERE salary > (SELECT AVG(salary) FROM emp e2 WHERE e2.dept=e.dept) โ readable but slow for large tables. JOIN: WITH dept_avg AS (SELECT dept, AVG(salary) as avg_sal FROM emp GROUP BY dept) SELECT e.* FROM emp e JOIN dept_avg d ON e.dept=d.dept WHERE e.salary > d.avg_sal โ computes averages once. Modern query optimizers (Oracle's CBO, PG's planner) may automatically rewrite correlated subqueries into JOINs โ but don't rely on it.๐ข Query optimization is a senior-level interview topic. Know both approaches and their trade-offs.
Design queries for a Swiggy-like food delivery dashboard: (1) Total orders per restaurant with revenue, (2) Top 3 cuisines by order count, (3) Customers who ordered from 3+ different restaurants.
- Single query can do all three
- Three separate queries using JOINs + GROUP BY + HAVING: (1) restaurants JOIN orders JOIN order_items, GROUP BY restaurant, SUM(amount). (2) cuisines JOIN restaurants JOIN orders, GROUP BY cuisine, ORDER BY COUNT DESC, FETCH FIRST 3. (3) customers JOIN orders, GROUP BY customer HAVING COUNT(DISTINCT restaurant_id) >= 3
- This requires NoSQL
- Views can handle this automatically
๐ข These are real data analyst interview questions at Swiggy, Zomato, and food delivery startups.
Express this query in relational algebra: "Find names of customers from Mumbai who ordered products in the Electronics category."
- Just use SQL
- ฯcustomer_name(ฯcity='Mumbai' โง category_name='Electronics'(customers โ orders โ order_items โ products โ categories))
- ฯMumbai(customers)
- customers ร orders
SELECT DISTINCT c.customer_name FROM customers c JOIN orders o ON... JOIN order_items oi ON... JOIN products p ON... JOIN categories cat ON... WHERE c.city='Mumbai' AND cat.category_name='Electronics'.๐ข GATE CS exam always has 1-2 relational algebra questions. Map each symbol to SQL for easy solving.
Chapter Summary
๐ฏ 3 SQL Skills This Chapter Unlocks
- Multi-table Queries โ JOINs connect normalized data into meaningful reports. Every real-world query involves 2-5 table JOINs.
- Analytics Reports โ GROUP BY + HAVING + aggregates + ROLLUP = the exact toolkit for financial reports, dashboards, and KPI tracking.
- Subquery Thinking โ Correlated subqueries, EXISTS, CTEs โ these solve "find X where condition depends on another table's aggregate" problems that JOINs alone can't elegantly handle.
๐ SQL Quick Reference
JOINS:
FROM a INNER JOIN b ON a.id = b.a_id -- Matching only
FROM a LEFT JOIN b ON a.id = b.a_id -- All a + matching b
FROM a FULL OUTER JOIN b ON a.id = b.a_id -- All from both
FROM a CROSS JOIN b -- Cartesian product
FROM a a1 JOIN a a2 ON a1.col = a2.col -- Self join
AGGREGATES:
SELECT col, COUNT(*), SUM(x), AVG(x), MIN(x), MAX(x)
FROM t GROUP BY col HAVING COUNT(*) > 5
GROUP BY ROLLUP(col1, col2) -- Subtotals + grand total
SET OPS:
query1 UNION query2 -- Dedup
query1 UNION ALL query2 -- Keep dupes (faster)
query1 INTERSECT query2 -- Common rows
query1 MINUS query2 -- Oracle (EXCEPT in PG)
SUBQUERIES:
WHERE col > (SELECT AVG(col) FROM t) -- Single-row
WHERE col IN (SELECT col FROM t2) -- Multi-row
WHERE col > (SELECT AVG(x) FROM t2 WHERE t2.id = t.id) -- Correlated
WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.id = t.id) -- EXISTS
VIEWS:
CREATE [OR REPLACE] VIEW v AS SELECT ...
CREATE VIEW v AS ... WITH CHECK OPTION
CREATE VIEW v AS ... WITH READ ONLY
CREATE MATERIALIZED VIEW mv AS ... REFRESH ON DEMAND
RELATIONAL ALGEBRA:
ฯ = WHERE | ฯ = SELECT cols | โ = JOIN | โช = UNION
โ = MINUS/EXCEPT | ร = CROSS JOIN | ฯ = AS alias
Interview & Career Preparation
Q1: Explain INNER JOIN vs LEFT JOIN with an example.
Model Answer: INNER JOIN returns only matching rows from both tables. LEFT JOIN returns ALL rows from the left table and matching rows from the right โ non-matching right rows appear as NULLs. Example: customers LEFT JOIN orders shows all customers, including those with no orders (order columns = NULL). Use LEFT JOIN to find "customers without orders" by adding WHERE order_id IS NULL.
Q2: How do you find the second highest salary?
Model Answer: Three approaches: (1) Subquery: SELECT MAX(salary) FROM emp WHERE salary < (SELECT MAX(salary) FROM emp). (2) DENSE_RANK: WITH ranked AS (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM emp) SELECT salary FROM ranked WHERE rnk = 2. (3) OFFSET: SELECT DISTINCT salary FROM emp ORDER BY salary DESC OFFSET 1 FETCH FIRST 1 ROW ONLY. Approach 2 handles ties best โ if two people share the highest salary, it still returns the second-highest correctly.
Q3: What is the difference between WHERE and HAVING?
Model Answer: WHERE filters individual rows BEFORE GROUP BY. HAVING filters groups AFTER GROUP BY. WHERE cannot use aggregate functions (SUM, COUNT, AVG). HAVING can. Example: WHERE salary > 50000 filters individual employees. HAVING COUNT(*) > 5 filters departments with more than 5 employees. Both can coexist in one query.
Q4: What is a correlated subquery?
Model Answer: A correlated subquery references a column from the outer query, causing it to re-execute for each outer row. Example: "Find employees earning more than their department average": SELECT * FROM emp e WHERE salary > (SELECT AVG(salary) FROM emp e2 WHERE e2.dept_id = e.dept_id). The inner query depends on e.dept_id from the outer query. Performance: O(nยฒ) for naive execution; optimizers may rewrite as JOIN.
Q5: When is EXISTS better than IN?
Model Answer: EXISTS is better when: (1) the subquery returns many rows (EXISTS stops at first match; IN checks all), (2) the subquery may return NULLs (NOT IN with NULLs returns no rows โ a common bug; NOT EXISTS handles NULLs correctly). IN is simpler for small, known value sets: WHERE dept_id IN (1, 2, 3). Rule of thumb: use EXISTS for correlated checks, IN for static lists.
Q6: What is a CTE? Why use it over subqueries?
Model Answer: CTE (Common Table Expression) is a named temporary result set defined with the WITH clause. Advantages over subqueries: (1) Readability โ named steps vs nested brackets. (2) Reusability โ referenced multiple times in one query. (3) Recursive queries โ hierarchical data (org charts, category trees) impossible with regular subqueries. (4) Same performance as subqueries โ CTEs are "syntactic sugar," not physically materialized (unlike temp tables).
Q7: What is a view? Can you do INSERT on a view?
Model Answer: A view is a stored SELECT query that behaves like a virtual table. INSERT is allowed only on simple single-table views without GROUP BY, DISTINCT, aggregates, UNION, or complex expressions. Views with JOINs are generally not updatable (except "key-preserved" tables in Oracle). WITH READ ONLY explicitly prevents all DML. Use INSTEAD OF triggers to enable DML on complex views.
Q8: What is UNION vs UNION ALL?
Model Answer: UNION combines two SELECT results and removes duplicate rows (implicit DISTINCT + sort). UNION ALL keeps all rows including duplicates โ no sort, significantly faster. Use UNION ALL when: (1) duplicates are impossible (disjoint datasets), (2) duplicates are acceptable, (3) performance matters. Both require same number of columns with compatible data types. ORDER BY goes at the end of the last SELECT.
Q9: Explain relational algebra ฯ, ฯ, and โ.
Model Answer: ฯ (sigma/Selection): filters rows based on condition โ equivalent to SQL WHERE. ฯ (pi/Projection): selects specific columns โ equivalent to SQL SELECT column list. โ (natural join): combines two relations matching on common attribute names โ equivalent to SQL JOIN. Example: ฯname(ฯsalary>50000(employees)) = SELECT name FROM employees WHERE salary > 50000. The query optimizer translates SQL into these operations internally.
Q10: Write a query to find departments with more than 3 employees where the average salary exceeds โน2,00,000.
Model Answer: SELECT d.dept_name, COUNT(*) AS emp_count, ROUND(AVG(e.salary), 2) AS avg_salary FROM departments d JOIN employees e ON d.dept_id = e.dept_id GROUP BY d.dept_name HAVING COUNT(*) > 3 AND AVG(e.salary) > 200000 ORDER BY avg_salary DESC. Key points: JOIN for table connection, GROUP BY for aggregation, HAVING for two conditions on aggregates. WHERE would only filter individual rows, not groups.
๐ Practice Platforms
- LeetCode SQL 50 โ 50 curated SQL problems. Covers JOINs, subqueries, window functions.
- HackerRank SQL โ Free challenges from Easy to Hard. Earn SQL certificates.
- pgexercises.com โ PostgreSQL-specific exercises with a real schema.
- SQLZoo โ Interactive SQL tutorials with quizzes.
- Oracle Live SQL โ Free Oracle environment in the browser.