Data Profiler - Standard, clean data is the foundation

Companies store large volumes of data, and clear visibility into that data at the enterprise level makes the business efficient, safe and compliant. Manual profiling and classification of data is costly and time consuming. The InsightLake solution enables companies to classify data using an intelligent, flexible and robust framework that combines a variety of data elements, glossaries and metadata with business rules and ML models.

Classification Manager

The Classification Manager is the central core engine. It provides the following functionality:

Interactive web console
Core Engine - Central engine for coordination
ML Solution - Pipelines, notebooks, labeling environment, model repo and catalog
Classification - Exploration of classification and correction
APIs - Enable external applications to retrieve classifications, profiles, workflows etc.

Profiler

The diagram shows the high-level architecture of the profiler. Multiple profiler instances can be deployed on-premises or in cloud environments, and each connects to the centralized Classification Manager.

Loads assigned data profiles, workflows, policies and rules from the Classification manager
Initiates the data and technical metadata collection from source systems using various data connectors
Connects to the data catalog and glossaries systems or relies on Classification manager to provide them
Performs data profiling to get the statistical distribution of the data and categorization of the data types
Builds a Data graph to hold rich information about data, metadata, lineage etc.
For structured data, performs high-level contextual analysis using rules and models

Data Profiling

The InsightLake Data Profiler is a Big Data based solution that enables companies to perform the following operations to create reliable data in both real-time and batch pipelines.

Profiles and classifies data using rules and ML models
Identifies personal and sensitive data
Performs data classification leveraging glossaries, catalog, technical metadata, business tags and actual data
Determines whether existing data can easily be reused for other purposes
Improves data searchability by tagging data with keywords and descriptions, or by assigning it to a category
Classifies critical data elements
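
As an illustration of the rule-based part of this list, the following is a minimal sketch of how a column could be tagged as personal or sensitive from its values. The rule names, regular expressions and the 0.8 threshold are assumptions made for this example and are not InsightLake's actual rules or API.

```python
import re

# Hypothetical rule set (tag -> pattern); illustrative only, not InsightLake's rules.
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
}


def classify_column(values, threshold=0.8):
    """Return tags whose rule matches at least `threshold` of the non-empty values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return []
    tags = []
    for tag, pattern in RULES.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            tags.append(tag)
    return tags


sample = ["ann@example.com", "bob@example.org", "", "carol@example.net"]
print(classify_column(sample))
```

Using a match ratio rather than requiring every value to match tolerates a few dirty values in an otherwise consistent column. An ML model could replace or complement the regular expressions for values that have no fixed pattern.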

Data profile

Context to store rules

Technical data types
Null or unsupported values
Mean (average), min, max
Sample values
Logical context types like currency, OS, geo information etc.
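
The profile attributes listed above can be sketched for a single column as follows. This is a simplified example and not the product's implementation: the function names, the set of null markers and the three technical types are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean

# Assumed markers for null or unsupported values (illustrative).
NULL_TOKENS = {"", "null", "n/a", "none"}


def infer_type(value):
    """Guess a technical type for one string value."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"


def profile_column(values, sample_size=3):
    """Build a small profile: counts, nulls, dominant type, samples, numeric stats."""
    present = [v for v in values
               if v is not None and v.strip().lower() not in NULL_TOKENS]
    types = Counter(infer_type(v) for v in present)
    profile = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "technical_type": types.most_common(1)[0][0] if types else "unknown",
        "samples": present[:sample_size],
    }
    numeric = [float(v) for v in present if infer_type(v) != "string"]
    if numeric:
        profile.update(mean=mean(numeric), min=min(numeric), max=max(numeric))
    return profile


print(profile_column(["10", "20", "N/A", None, "30"]))
```

Logical context types such as currency or geo information would be layered on top of this technical profile, for example by applying rules to the sample values.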

Data standardization

Organizations build clean data sets, often called data marts or subject areas, that draw data from various sources. Data from different sources can arrive in different formats and with quality issues. These inconsistent data elements can be standardized using quality rules in the quality center, backed by a rich function library.

Dates - convert to standard date format
Geo Locations - convert to ISO codes
Currency conversion
Brand name standardization
Default or null value standardization
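
A few of these standardizations can be sketched as small functions. The accepted date formats, the country lookup table and the null markers below are assumptions for this example; they do not represent the function library in the quality center.

```python
from datetime import datetime

# Assumed inputs for illustration: accepted date formats, a tiny country
# lookup, and the markers treated as "no value".
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")
COUNTRY_TO_ISO = {"united states": "US", "usa": "US",
                  "germany": "DE", "deutschland": "DE", "india": "IN"}
NULL_TOKENS = {"", "n/a", "null", "none", "-"}


def standardize_null(value):
    """Map all null-like markers to None and trim surrounding whitespace."""
    if value is None or value.strip().lower() in NULL_TOKENS:
        return None
    return value.strip()


def standardize_date(value):
    """Convert a date in any accepted format to ISO 8601 (YYYY-MM-DD)."""
    value = standardize_null(value)
    if value is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable dates are treated as missing


def standardize_country(value):
    """Convert a country name to its ISO 3166-1 alpha-2 code when known."""
    value = standardize_null(value)
    if value is None:
        return None
    return COUNTRY_TO_ISO.get(value.lower(), value)


print(standardize_date("05/03/2021"), standardize_country("USA "))
```

Unknown country names are returned unchanged so that they can be reviewed rather than silently dropped.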

Data Catalog

Automatically catalog and map sensitive & personal data with deep data insight, incorporating active metadata and classification. Gain additional privacy, security, and business insight – all within a single pane of glass.

Visualize your data in one place
Unified data inventory
Enhance technical metadata
Metadata exchange
Automate discovery
Manage risk

Cluster Analysis

Automate classification at scale for large data volumes, uncover duplicate, derivative and similar data, and rapidly deliver meaningful insight with InsightLake cluster analysis.

Find the data you need
Identify duplicate data
Automate labeling
Discover dark data
Classify quickly & accurately
Analyze data
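
One simple way to find duplicate and similar data is to compare content fingerprints and word-shingle overlap. The sketch below is an assumption-laden illustration, not InsightLake's cluster analysis: it uses SHA-256 over normalized text for exact duplicates and Jaccard similarity over two-word shingles for near duplicates, and it compares every pair, which is only practical for small collections.

```python
import hashlib
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def fingerprint(text):
    """Hash of the normalized text; equal fingerprints mean duplicate content."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=2):
    """Set of k-word sequences used to measure overlap between texts."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def find_similar(docs, threshold=0.5):
    """Return (name_a, name_b, kind) for duplicate or similar document pairs."""
    results = []
    for (name_a, text_a), (name_b, text_b) in combinations(docs.items(), 2):
        if fingerprint(text_a) == fingerprint(text_b):
            results.append((name_a, name_b, "duplicate"))
        elif jaccard(text_a, text_b) >= threshold:
            results.append((name_a, name_b, "similar"))
    return results


docs = {
    "a": "Customer record for Jane Doe",
    "b": "customer   record for jane doe",
    "c": "Customer record for Jane Roe",
    "d": "Quarterly revenue report",
}
print(find_similar(docs))
```

At large data volumes the pairwise loop would typically be replaced by an approximate technique such as MinHash with locality-sensitive hashing.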

Correlation

Add context to classification and surface relationships between data points. Build identity and entity profiles, associate whose data it is, and visualize how data is interconnected across data sources.

Discover any data
Correlation-based classifiers
Uncover relationships
Visualize connections
Uncover dark data
ML data analysis
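
Building an identity profile across data sources can be sketched as grouping records by a shared identifier. In this example the identifier is an email address, and the source names and record fields are hypothetical; a real correlation engine would use more identifiers and fuzzier matching.

```python
from collections import defaultdict


def build_entity_profiles(sources, key="email"):
    """Group records from several sources by a shared, normalized identifier.

    sources: dict mapping a source name to a list of record dicts.
    """
    profiles = defaultdict(lambda: {"sources": set(), "attributes": {}})
    for source_name, records in sources.items():
        for record in records:
            identifier = (record.get(key) or "").strip().lower()
            if not identifier:
                continue  # nothing to correlate on
            profile = profiles[identifier]
            profile["sources"].add(source_name)
            for field, value in record.items():
                if field != key and value:
                    # Keep the first value seen for each attribute.
                    profile["attributes"].setdefault(field, value)
    return dict(profiles)


sources = {
    "crm": [{"email": "Jane@Example.com", "name": "Jane Doe"}],
    "billing": [{"email": "jane@example.com", "card_last4": "4242"},
                {"email": "", "card_last4": "0000"}],
}
print(build_entity_profiles(sources))
```

Each resulting profile records which sources hold data about the same person, which is the information needed to answer "whose data is it" and to draw connections between sources.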

Classification

Classify all types of data across data stores and in the cloud: discover sensitive and personal data, analyze activity, meet compliance, and protect personal and sensitive data – all with a data-centric approach.

Person & identity
Sensitive data
Regulation
Document type
Policy
Security attributes
ML models

Data Quality

Actively monitor the consistency, accuracy, completeness and validity of your data with InsightLake, across all of your data sources in a single platform.

Visualize data quality
Dynamic profiling
Modern & scalable
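
Two of the quality dimensions named above, completeness and validity, can be expressed as simple ratios. The sketch below assumes records are dictionaries of strings and that validity is defined by a regular expression; the field name and pattern are illustrative only.

```python
import re


def completeness(rows, field):
    """Share of rows in which `field` is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def validity(rows, field, pattern):
    """Share of non-empty `field` values that fully match `pattern`."""
    values = [row[field] for row in rows if row.get(field) not in (None, "")]
    if not values:
        return 0.0
    regex = re.compile(pattern)
    return sum(1 for value in values if regex.fullmatch(value)) / len(values)


rows = [
    {"id": "1", "zip": "94105"},
    {"id": "2", "zip": ""},
    {"id": "3", "zip": "ABCDE"},
    {"id": "4", "zip": "10001"},
]
print(completeness(rows, "zip"))       # 3 of 4 rows have a value
print(validity(rows, "zip", r"\d{5}")) # 2 of 3 values are five digits
```

Tracking these ratios per field over time is one way to monitor quality and to flag a source whose scores drop.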

Inventory

InsightLake takes a machine learning based approach to data discovery that focuses on personal information. It leverages identity intelligence and machine learning to deliver an accurate, scalable method for finding personal and sensitive data and satisfying data privacy requirements.

Person & identity
Sensitive data
Regulation
Document type
Policy