Data Profiling..All in Nutshell

Extract. Eigenize. Evolve.

Data Profiling..All in Nutshell

Data Profiling is the process of analyzing (statistically/ systematically), assessing, and hence collecting statistics or informative summaries about the data.

Data profiling is essential first step and critical component of implementing a data strategy – as it helps;

  • Discover, understand and organize Data (in nutshell).
  • Perform the thorough assessment of data quality.
  • Find out whether existing data can be easily used for other purposes.
  • Discover business knowledge embedded within data.
  • Reveal the relationships that span different sources (e.g. a database or a file).
  • Detect outliers and anomalies.
  • Tag data (with categories, descriptions or keywords) to improve the ability to search data.

Types of Data Analysis:-

  • Structure Analysis:
    • Validate data is consistent.
    • Validate data is formatted correctly – using set of Scrub rules or Pattern matching to find valid set of formats and learn other format-specific information.
    • Statistics – mean, median, mode, SD to give insights into the validity of data.
  • Content Analysis:
  • Individual elements are looked closely to check data quality in terms of
    • Completeness : How often a given field/attribute is populated, verses blank or null?
    • Uniqueness : How many unique values are found for a given filed/attribute across all records? Are there duplicates / Redundancies?
    • Values Distribution : What is the distribution of records across different values for a given attribute?
    • Range : What are the minimum, maximum, average and median values found for a given attribute?
    • Pattern : What formats were found for a given attribute, and what is the distribution of records across these formats.
  • Relationship Analysis:
    • Gain a better understanding of the connection between data sets.
    • Meta data analysis to determine key relationship (specifically where data overlaps).


Data Profiling Tasks & Techniques:

  • Single Column Profiling:
    • Most basic form of data profiling
    • Useful for categorization /relevance of attribute, outlier detection, query optimization
    • Cardinalities:
      • Number of values.
      • Number of distinct values.
      • Number of NULLs/ Blank values.
      • No of times each value shows up within each column.
      • Min, Max, Average and Median values.
    • Values Distribution:
      • Probability distribution to detect whether data follows some well-known distribution.
      • Frequency distribution for value intervals.
      • Histograms – if no specific function is detectable.
    • Data types and value patterns:
      • String vs. number vs. date vs. complex.
      • Categorical vs. continuous.
      • Semantic domains (address, phone, name etc.)
      • Regular expressions
    • Assumption :
      • All values are of same type.
      • All values have some common properties.
  • Multiple Columns Profiling:
    • Dependency among data attributes.
    • Relationship / structure / joint properties embedded in dataset.
    • Useful for data integration, indexing, optimization, schema design.
    • Uniqueness :
      • Unique column – only unique values
      • Key discovery – No Null values
    • Inclusion dependencies:
      • Foreign key
      • All values of in A are also present in B.
    • Functional dependencies:
      • Normalization
      • Two records have same X values; they also have same Y values.
    • Partial dependencies:
      • No perfect hold – only 10 or 10% or 1% of the records.
      • Useful for data cleansing.
    • Conditional dependencies:
      • For which part do the hold?
      • Useful for cross-source integration
  • Multiple Tables Profiling:
    • Data overlap
      • Duplicate detection – detect multiple representation of the same real-world entity.
      • Record linkage
    • Orphan records detection.
    • Identify data values sets that could be mapped together.
    • Topic Discovery
      • What is a data set about? Domains, topics, entity types.
  • Data Scrubbing:
    • Verify that data and data sets conform to the predefined rules/business rules.

Use Case for Profiling;

  • Query optimization
  • Data cleansing
  • Data integration
  • Data analytics
  • Database reverse engineering
  • Data management tasks

Data profiling is all about knowing your data and hence it lets you discover the business knowledge that is actually buried in data itself.

Leave a Reply

Your email address will not be published. Required fields are marked *