Data Profiling..All in Nutshell

Data Profiling is the process of analyzing (statistically/ systematically), assessing, and hence collecting statistics or informative summaries about the data.
Data profiling is essential first step and critical component of implementing a data strategy – as it helps;
- Discover, understand and organize Data (in nutshell).
- Perform the thorough assessment of data quality.
- Find out whether existing data can be easily used for other purposes.
- Discover business knowledge embedded within data.
- Reveal the relationships that span different sources (e.g. a database or a file).
- Detect outliers and anomalies.
- Tag data (with categories, descriptions or keywords) to improve the ability to search data.
Types of Data Analysis:-
- Structure Analysis:
- Validate data is consistent.
- Validate data is formatted correctly – using set of Scrub rules or Pattern matching to find valid set of formats and learn other format-specific information.
- Statistics – mean, median, mode, SD to give insights into the validity of data.
- Content Analysis:
- Individual elements are looked closely to check data quality in terms of
- Completeness : How often a given field/attribute is populated, verses blank or null?
- Uniqueness : How many unique values are found for a given filed/attribute across all records? Are there duplicates / Redundancies?
- Values Distribution : What is the distribution of records across different values for a given attribute?
- Range : What are the minimum, maximum, average and median values found for a given attribute?
- Pattern : What formats were found for a given attribute, and what is the distribution of records across these formats.
- Relationship Analysis:
- Gain a better understanding of the connection between data sets.
- Meta data analysis to determine key relationship (specifically where data overlaps).
Data Profiling Tasks & Techniques:
- Single Column Profiling:
- Most basic form of data profiling
- Useful for categorization /relevance of attribute, outlier detection, query optimization
- Cardinalities:
- Number of values.
- Number of distinct values.
- Number of NULLs/ Blank values.
- No of times each value shows up within each column.
- Min, Max, Average and Median values.
- Values Distribution:
- Probability distribution to detect whether data follows some well-known distribution.
- Frequency distribution for value intervals.
- Histograms – if no specific function is detectable.
- Data types and value patterns:
- String vs. number vs. date vs. complex.
- Categorical vs. continuous.
- Semantic domains (address, phone, name etc.)
- Regular expressions
- Assumption :
- All values are of same type.
- All values have some common properties.
- Multiple Columns Profiling:
- Dependency among data attributes.
- Relationship / structure / joint properties embedded in dataset.
- Useful for data integration, indexing, optimization, schema design.
- Uniqueness :
- Unique column – only unique values
- Key discovery – No Null values
- Inclusion dependencies:
- Foreign key
- All values of in A are also present in B.
- Functional dependencies:
- Normalization
- Two records have same X values; they also have same Y values.
- Partial dependencies:
- No perfect hold – only 10 or 10% or 1% of the records.
- Useful for data cleansing.
- Conditional dependencies:
- For which part do the hold?
- Useful for cross-source integration
- Multiple Tables Profiling:
- Data overlap
- Duplicate detection – detect multiple representation of the same real-world entity.
- Record linkage
- Orphan records detection.
- Identify data values sets that could be mapped together.
- Topic Discovery
- What is a data set about? Domains, topics, entity types.
- Data overlap
- Data Scrubbing:
- Verify that data and data sets conform to the predefined rules/business rules.
Use Case for Profiling;
- Query optimization
- Data cleansing
- Data integration
- Data analytics
- Database reverse engineering
- Data management tasks
Data profiling is all about knowing your data and hence it lets you discover the business knowledge that is actually buried in data itself.