Data cleaning in MySQL involves identifying and correcting inaccuracies, inconsistencies, and incomplete data in a database. Effective data cleaning ensures accuracy and reliability in database operations and analytics.
Key activities include:
- Removing Duplicates: Identifying and deleting duplicate records using `GROUP BY` or unique constraints.
- Handling Nulls: Updating or filling null values with appropriate default values or calculated data.
- Standardizing Data: Ensuring data formats are consistent (e.g., date formats, phone numbers) using string functions and regular expressions.
- Validating Data: Using constraints and checks to enforce data integrity and validate input values.
- Data Transformation: Normalizing or aggregating data to meet analytical or reporting requirements.
Sample SQL queries for various data cleaning tasks in MySQL These queries help clean and standardize data, ensuring consistency and improving the quality of the database:
- Removing Duplicate Records
Assume you have a table `employees` with potential duplicate rows:
“`sql
— Remove duplicates while keeping the row with the lowest ID
DELETE e1
FROM employees e1
JOIN employees e2
ON e1.email = e2.email
WHERE e1.id > e2.id;
- Handling Nulls
Update null values in the `phone_number` column with a default value:
“`sql
— Replace null phone numbers with ‘N/A’
UPDATE employees
SET phone_number = ‘N/A’
WHERE phone_number IS NULL;
- Standardizing Data
Convert all email addresses to lowercase:
“`sql
— Convert email addresses to lowercase
UPDATE employees
SET email = LOWER(email);
- Validating Data
Add a constraint to ensure that the `salary` column is always positive:
“`sql
— Add a check constraint for positive salary values
ALTER TABLE employees
ADD CONSTRAINT chk_salary_positive CHECK (salary > 0);
- Data Transformation
Normalize job titles to a consistent format:
“`sql
— Capitalize job titles
UPDATE employees
SET job_title = CONCAT(UCASE(SUBSTRING(job_title, 1, 1)), LCASE(SUBSTRING(job_title, 2)));