Data Profiler - Standard, clean data is the foundation

Companies store large volumes of data, and clear visibility of that data at the enterprise level makes the business efficient, safe and compliant. Manual profiling and classification of data is costly and time consuming. The InsightLake solution enables companies to classify data using an intelligent, flexible and robust framework that combines a variety of data elements, glossaries and metadata with business rules and ML models.

Classification Manager

The Classification Manager is the central core engine. It provides the following functionality:

  • Interactive web console
  • Core Engine - Central engine for coordination
  • ML Solution - Pipelines, notebooks, labeling environment, model repo and catalog
  • Classification - Exploration of classification and correction
  • APIs - Enable external applications to retrieve classifications, profiles, workflows etc.

Profiler

The diagram shows the high-level architecture of the profiler. Multiple profiler instances can be deployed in on-premise or cloud environments, and they connect to the centralized Classification Manager.

  • Loads assigned data profiles, workflows, policies and rules from the Classification Manager
  • Initiates the collection of data and technical metadata from source systems using various data connectors
  • Connects to the data catalog and glossary systems, or relies on the Classification Manager to provide them
  • Performs data profiling to get the statistical distribution of the data and to categorize the data types
  • Builds a data graph to hold rich information about data, metadata, lineage etc.
  • For structured data, performs high-level contextual analysis using rules and models

Data Profiling

The InsightLake Data Profiler is a Big Data based solution that enables companies to perform the following operations to create reliable data in both real-time and batch pipelines.

  • Profiles and classifies data using rules and ML models
  • Identifies personal and sensitive data
  • Performs data classification leveraging glossaries, catalog, technical metadata, business tags and actual data
  • Finds out whether existing data can easily be used for other purposes
  • Improves the ability to search data by tagging it with keywords or descriptions, or by assigning it to a category
  • Classifies critical data elements
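
As an illustration of rule-based identification of personal and sensitive data, the following is a minimal Python sketch. It is not InsightLake's implementation: the RULES table, the classify_column function and the 0.8 threshold are all hypothetical, and real rules would need checksums, context words and locale-specific formats.

```python
import re

# Hypothetical rule set: label -> regular expression (illustrative only).
RULES = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
}


def classify_column(values, threshold=0.8):
    """Return the labels whose pattern matches at least `threshold`
    of the non-empty values in a column."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return []
    labels = []
    for label, pattern in RULES.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            labels.append(label)
    return labels
```

Calling classify_column on the sampled values of a column returns a list of candidate labels. A threshold below 1.0 tolerates a few malformed values in an otherwise sensitive column.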

Data profile

A data profile provides the context in which rules are stored. It captures:

  • Technical data types
  • Null or unsupported values
  • Mean (average), min, max
  • Sample values
  • Logical context types like currency, OS, geo information etc.
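
The profile attributes listed above could be computed for a single column roughly as follows. This is a sketch under stated assumptions, not the product's profiler: NULL_TOKENS, infer_type and profile_column are hypothetical names, values are assumed to arrive as strings (or None), and logical context types are left out.

```python
import statistics
from collections import Counter

# Assumed markers that count as null or unsupported values.
NULL_TOKENS = {"", "null", "none", "n/a", "na", "nan"}


def infer_type(value):
    """Guess a technical type for a single string value."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"


def profile_column(values, sample_size=3):
    """Build a small profile: counts, types, samples and basic statistics."""
    cleaned = [v.strip() for v in values if v is not None]
    present = [v for v in cleaned if v.lower() not in NULL_TOKENS]
    profile = {
        "count": len(values),
        "null_count": len(values) - len(present),
        "types": dict(Counter(infer_type(v) for v in present)),
        "samples": present[:sample_size],
    }
    # Statistics are computed over the values that parse as numbers.
    numeric = [float(v) for v in present if infer_type(v) != "string"]
    if numeric:
        profile.update(
            mean=statistics.mean(numeric), min=min(numeric), max=max(numeric)
        )
    return profile
```

A column with mixed types, such as ["10", "20", None, "n/a", "abc", "2.5"], yields a profile whose "types" entry shows how many values fall into each technical type, which is often the first hint of a quality problem.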

Data standardization

Organizations build clean data sets, which they call data marts, subject areas etc. These data sets draw data from various sources, which can arrive in different formats and with quality issues. The inconsistent data elements can be standardized using quality rules in the quality center, which provides a rich function library.

  • Dates - convert to standard date format
  • Geo Locations - convert to ISO codes
  • Currency conversion
  • Brand name standardization
  • Default or null value standardization
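
A few of these standardization rules can be sketched in Python. The function names, the accepted date formats, the country lookup table and the null markers below are assumptions made for illustration, and they do not represent the quality center's function library. Note that the sketch reads slash-separated dates as day/month/year, which is itself an assumption about the source.

```python
from datetime import datetime

# Assumed input formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")
# Illustrative subset of a country-name to ISO 3166-1 alpha-2 mapping.
COUNTRY_TO_ISO = {"united states": "US", "usa": "US", "germany": "DE", "india": "IN"}
# Assumed markers that should be treated as null.
NULL_TOKENS = {"", "n/a", "null", "none", "-"}


def standardize_date(text):
    """Return the date in ISO format (YYYY-MM-DD), or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize_country(text):
    """Return the ISO code for a known country name, or None."""
    return COUNTRY_TO_ISO.get(text.strip().lower())


def standardize_null(text, default=None):
    """Replace null markers with a single default value."""
    if text is None or text.strip().lower() in NULL_TOKENS:
        return default
    return text.strip()
```

Returning None for unrecognized input, rather than guessing, lets a downstream rule decide whether to reject the record or route it for correction.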

Data Catalog

Automatically catalog and map sensitive & personal data with deep data insight, incorporating active metadata and classification. Gain additional privacy, security, and business insight – all within a single pane of glass.

  • Visualize your data in one place
  • Unified data inventory
  • Enhance technical metadata
  • Metadata exchange
  • Automate discovery
  • Manage risk

Cluster Analysis

Automate classification at scale for large data volumes, uncover duplicate, derivative and similar data, and rapidly deliver meaningful insight with InsightLake cluster analysis.

  • Find the data you need
  • Identify duplicate data
  • Automate labeling
  • Discover dark data
  • Classify quickly & accurately
  • Analyze data
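
One common way to find duplicate and similar text is Jaccard similarity over word shingles. The sketch below uses that technique purely as an illustration, since the source does not say which method InsightLake cluster analysis uses. The function names and the 0.5 threshold are hypothetical, and the pairwise comparison is only practical for small collections.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_pairs(docs, threshold=0.5):
    """docs: mapping of id -> text.
    Returns (id1, id2, score) for every pair at or above the threshold."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    pairs = []
    for (id1, s1), (id2, s2) in combinations(sets.items(), 2):
        score = jaccard(s1, s2)
        if score >= threshold:
            pairs.append((id1, id2, score))
    return pairs
```

Exact duplicates score 1.0 and derivative texts score somewhere below that, so the threshold controls how loosely "similar" is interpreted.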

Correlation

Add context to classification and surface relationships between data points. Build identity and entity profiles, associate whose data it is, and visualize how data is interconnected across data sources.

  • Discover any data
  • Correlation-based classifiers
  • Uncover relationships
  • Visualize connections
  • Uncover dark data
  • ML data analysis
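
Building an identity profile across data sources can be sketched as grouping records by a shared, normalized identifier. This is a simplified illustration and not the product's correlation engine: build_identity_profiles is a hypothetical function, the use of an email field as the join key is an assumption, and real correlation would need fuzzy matching and multiple identifiers.

```python
from collections import defaultdict


def build_identity_profiles(sources, key="email"):
    """sources: mapping of source name -> list of record dicts.
    Groups records by the normalized `key` field and records which
    sources hold data about each identity."""
    profiles = defaultdict(lambda: {"sources": set(), "attributes": {}})
    for source_name, records in sources.items():
        for record in records:
            raw = record.get(key)
            if not raw:
                continue  # cannot correlate a record without the key
            identity = raw.strip().lower()
            profile = profiles[identity]
            profile["sources"].add(source_name)
            for field, value in record.items():
                if field != key and value is not None:
                    # Keep the first value seen for each attribute.
                    profile["attributes"].setdefault(field, value)
    return dict(profiles)
```

The "sources" set in each profile answers the question of where a given person's data lives, which is the starting point for visualizing how data is interconnected.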

Classification

Classify all types of data across data stores and in the cloud: discover sensitive and personal data, analyze activity, meet compliance, and protect personal and sensitive data – all with a data-centric approach.

  • Person & identity
  • Sensitive data
  • Regulation
  • Document type
  • Policy
  • Security attributes
  • ML models

Data Quality

Actively monitor the consistency, accuracy, completeness and validity of your data with InsightLake, across all of your data sources in one single platform.

  • Visualize data quality
  • Dynamic profiling
  • Modern & scalable
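
Two of the quality dimensions named above, completeness and validity, can be expressed as simple ratios. The sketch below is an assumption about how such metrics might be defined, with hypothetical function names. Consistency and accuracy need reference data and are not covered.

```python
import re


def _is_present(value):
    """A value counts as present if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def completeness(values):
    """Share of values that are present."""
    if not values:
        return 0.0
    return sum(1 for v in values if _is_present(v)) / len(values)


def validity(values, pattern):
    """Share of present values that fully match a regular expression."""
    present = [str(v).strip() for v in values if _is_present(v)]
    if not present:
        return 0.0
    regex = re.compile(pattern)
    return sum(1 for v in present if regex.fullmatch(v)) / len(present)
```

For example, validity(column, r"\d{5}") measures the share of present values that look like five-digit postal codes. Tracking both ratios per column over time is one way to monitor quality.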

Inventory

InsightLake takes a new, machine learning based approach to personal data discovery. It leverages identity intelligence and machine learning to deliver an accurate and scalable method for finding personal and sensitive data, in order to satisfy data privacy requirements at scale.

  • Person & identity
  • Sensitive data
  • Regulation
  • Document type
  • Policy