Starting out as a Data Engineer, I remember the moment I first tackled a complex SQL query. It felt like trying to solve a puzzle with missing pieces. But once I got the hang of it, the satisfaction of seeing everything click into place was incredible.
SQL, as I soon learned, is much more than just a technical skill — it’s the foundation of working with data effectively.
In a moment, I’ll walk you through the key SQL concepts every Data Engineer should know. From basic commands to advanced techniques, I’ll break down each topic with clear examples.
Whether you’re just starting out or looking to refresh your skills, this guide will help you understand how to use SQL to manage and analyze data like a pro.
1. Data Definition Language (DDL)
DDL commands are used to define and manage database objects such as tables, indexes, and schemas.
Create: Used to create a new table.
CREATE TABLE Employees (
EmployeeID INT PRIMARY KEY,
FirstName VARCHAR(50),
LastName VARCHAR(50),
HireDate DATE
);
Alter: Modifies an existing table’s structure.
ALTER TABLE Employees ADD COLUMN Email VARCHAR(100);
Drop: Deletes an entire table.
DROP TABLE Employees;
2. Data Manipulation Language (DML)
DML commands are used to manipulate data within existing database objects.
Select: Retrieves data from a table.
SELECT FirstName, LastName FROM Employees WHERE HireDate > '2023-01-01';
Insert: Adds a new row to a table.
INSERT INTO Employees (EmployeeID, FirstName, LastName, HireDate)
VALUES (1, 'John', 'Nnamdi', '2024-08-01');
Update: Modifies existing data.
UPDATE Employees SET Email = 'john.nnamdi@kreg.com' WHERE EmployeeID = 1;
Delete: Removes data from a table.
DELETE FROM Employees WHERE EmployeeID = 1;
3. Joins
Joins are used to combine data from two or more tables based on a related column.
Inner Join: Returns rows with matching values in both tables.
SELECT Employees.FirstName, Departments.DepartmentName
FROM Employees
INNER JOIN Departments ON Employees.DepartmentID = Departments.DepartmentID;
Left Join: Returns all rows from the left table and matching rows from the right table.
SELECT Employees.FirstName, Departments.DepartmentName
FROM Employees
LEFT JOIN Departments ON Employees.DepartmentID = Departments.DepartmentID;
Right Join: Returns all rows from the right table and matching rows from the left table.
SELECT Employees.FirstName, Departments.DepartmentName
FROM Employees
RIGHT JOIN Departments ON Employees.DepartmentID = Departments.DepartmentID;
Full Outer Join: Returns rows when there is a match in one of the tables.
SELECT Employees.FirstName, Departments.DepartmentName
FROM Employees
FULL OUTER JOIN Departments ON Employees.DepartmentID = Departments.DepartmentID;
4. Subqueries and Nested Queries
Subqueries are queries nested inside another query, often used to perform complex filtering or calculations.
Example: Find employees hired after the average hire date.
SELECT FirstName, LastName
FROM Employees
WHERE HireDate > (SELECT AVG(HireDate) FROM Employees);
5. Aggregate Functions
Aggregate functions perform calculations on a set of values and return a single value.
Count: Returns the number of rows.
SELECT COUNT(*) FROM Employees;
Sum: Returns the total sum of a numeric column.
SELECT SUM(Salary) FROM Employees;
AVG: Returns the average value of a numeric column.
SELECT AVG(Salary) FROM Employees;
MIN and MAX: Return the smallest and largest values.
SELECT MIN(HireDate), MAX(HireDate) FROM Employees;
6. GROUP BY and HAVING
“GROUP BY” is used to group rows that have the same values in specified columns, and “HAVING” filters groups.
Example: Count the number of employees in each department.
SELECT DepartmentID, COUNT(*) AS NumEmployees
FROM Employees
GROUP BY DepartmentID
HAVING COUNT(*) > 5; -- Only show departments with more than 5 employees
7. Window Functions
Window functions perform calculations across a set of table rows related to the current row.
ROW_NUMBER(): Assigns a unique number to each row.
SELECT FirstName, LastName,
ROW_NUMBER() OVER (ORDER BY HireDate) AS RowNumber
FROM Employees;
RANK(): Assigns a rank to each row within a partition.
SELECT FirstName, LastName,
RANK() OVER (ORDER BY Salary DESC) AS SalaryRank
FROM Employees;
PARTITION BY: Divides the result set into partitions.
SELECT FirstName, LastName,
RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS DeptSalaryRank
FROM Employees;
8. Normalization and Denormalization
Normalization involves organizing a database to reduce redundancy and improve data integrity.
1NF (First Normal Form): Ensure that each table cell contains a single value.
2NF (Second Normal Form): Ensure that each table is in 1NF and all non-key attributes are fully functional dependent on the primary key.
3NF (Third Normal Form): Ensure that the table is in 2NF and all the columns are only dependent on the primary key.
Denormalization is the process of adding redundancy to a database to improve read performance.
9. Indexes
Indexes are used to speed up the retrieval of rows by using pointers.
Creating an Index: Improves query performance.
CREATE INDEX idx_lastname ON Employees(LastName);
Clustered vs. Non-Clustered Indexes:
Clustered Index: The table rows are stored in the order of the index.
Non-Clustered Index: The index is stored separately from the table data.
10. Transactions
Transactions are used to ensure that a series of SQL operations are executed as a single unit.
BEGIN TRANSACTION: Starts a transaction.
COMMIT: Saves the changes made in the transaction.
ROLLBACK: Undoes changes made during the transaction
BEGIN TRANSACTION;
UPDATE Employees SET Salary = Salary + 5000 WHERE EmployeeID = 1;
COMMIT; -- Save the changes
11. Error Handling
Error handling in SQL involves managing errors to prevent a complete failure of a script.
TRY-CATCH: Used to handle errors and exceptions.
BEGIN TRY
-- SQL statements
END TRY
BEGIN CATCH
-- Error handling
END CATCH;
12. Performance Tuning
Performance tuning involves optimizing SQL queries for faster execution.
EXPLAIN: Provides information about how SQL statements are executed.
EXPLAIN SELECT * FROM Employees WHERE LastName = 'Dowen';
Query Optimization Techniques:
Use indexes wisely.
Avoid SELECT *; select only necessary columns.
Use JOINS instead of subqueries where possible.
13. Stored Procedures and Functions
Stored procedures and functions are reusable SQL code blocks.
Stored Procedure: Encapsulates SQL code.
CREATE PROCEDURE GetEmployeeDetails
@EmployeeID INT
AS
BEGIN
SELECT FirstName, LastName, Salary
FROM Employees
WHERE EmployeeID = @EmployeeID;
END;
Function: Returns a value or table.
CREATE FUNCTION GetEmployeeFullName(@EmployeeID INT)
RETURNS VARCHAR(100)
AS
BEGIN
DECLARE @FullName VARCHAR(100);
SELECT @FullName = FirstName + ' ' + LastName
FROM Employees
WHERE EmployeeID = @EmployeeID;
RETURN @FullName;
END;
14. Views
Views are virtual tables representing the result of a SQL query.
Creating a View:
CREATE VIEW EmployeeDetails AS
SELECT FirstName, LastName, HireDate
FROM Employees;
Using a View:
SELECT * FROM EmployeeDetails;
15. Common Table Expressions (CTEs)
CTEs simplify complex queries and recursive queries.
Simple CTE:
WITH EmployeeCTE AS (
SELECT FirstName, LastName, DepartmentID
FROM Employees
)
SELECT * FROM EmployeeCTE WHERE DepartmentID = 1;
Recursive CTE:
WITH EmployeeHierarchy (EmployeeID, ManagerID, Level) AS (
SELECT EmployeeID, ManagerID, 0 AS Level
FROM Employees
WHERE ManagerID IS NULL
UNION ALL
SELECT e.EmployeeID, e.ManagerID, eh.Level + 1
FROM Employees e
INNER JOIN EmployeeHierarchy eh ON e.ManagerID = eh.EmployeeID
)
SELECT EmployeeID, ManagerID, Level FROM EmployeeHierarchy;
16. Data Types
Data types define the kind of data that can be stored in a column.
Common Data Types:
INT: Integer values.
VARCHAR(N): Variable-length strings.
DATE: Dates.
FLOAT: Floating-point numbers.
BOOLEAN: True/False values.
17. Constraints
Constraints are rules applied to table columns to enforce data integrity.
PRIMARY KEY: Uniquely identifies each record in a table.
CREATE TABLE Employees (
EmployeeID INT PRIMARY KEY,
LastName VARCHAR(50)
);
FOREIGN KEY: Ensures referential integrity between tables.
CREATE TABLE Orders (
OrderID INT PRIMARY KEY,
EmployeeID INT,
FOREIGN KEY (EmployeeID) REFERENCES Employees(EmployeeID)
);
UNIQUE: Ensures all values in a column are unique.
ALTER TABLE Employees ADD CONSTRAINT unique_email UNIQUE (Email);
NOT NULL: Ensures a column cannot have a NULL value.
ALTER TABLE Employees ALTER COLUMN LastName VARCHAR(50) NOT NULL;
CHECK: Ensures a column’s values meet a specific condition.
ALTER TABLE Employees ADD CONSTRAINT check_salary CHECK (Salary > 0);
18. JSON and XML
Handling semi-structured data stored in JSON or XML format.
JSON Functions:
SELECT JSON_VALUE(Data, '$.Employee.FirstName') AS FirstName
FROM EmployeesJSON;
XML Functions:
SELECT EmployeeXML.value('(/Employee/FirstName)[1]', 'VARCHAR(50)') AS FirstName
FROM EmployeesXML;
19. ETL Processes
ETL (Extract, Transform, Load) processes involve moving data from one or more sources to a destination database.
Extract: Retrieving data from a source.
SELECT * FROM SourceTable;
Transform: Manipulating the data as needed.
INSERT INTO DestinationTable (Col1, Col2)
SELECT Col1, Col2 + 1 FROM SourceTable;
Load: Inserting the transformed data into the destination.
INSERT INTO DestinationTable SELECT * FROM TransformedData;
20. Partitioning
Partitioning splits a large table into smaller pieces, making it easier to manage and improving performance.
Range Partitioning:
CREATE TABLE Orders (
OrderID INT,
OrderDate DATE,
...
) PARTITION BY RANGE (OrderDate) (
PARTITION p1 VALUES LESS THAN ('2024-01-01'),
PARTITION p2 VALUES LESS THAN ('2025-01-01')
);
Mastering SQL is essential for any data engineer aiming to build robust, scalable, and efficient data systems. These topics, from advanced querying techniques to optimization strategies, form the bedrock of a data engineer’s toolkit. As you continue to refine your SQL skills, remember that practice and continuous learning are key to staying ahead in this rapidly evolving industry.
Thank you for reading! If you found this interesting, consider following and subscribing for more interesting articles. Catch me on LinkedIn and follow me on X (Formally Twitter).