Company - Reference Data Integration Methodology

Most of the companies that we have worked for have significant data disparity issues. Typical problems include

  1. Lack of knowledge regarding reference data
  2. Low quality of the reference data
  3. Multiple copies of the reference data
  4. Lack of synchronization of reference data
  5. Absence of reference data stewards

In these projects, our goal is to deliver a single unified view of the reference data.
We employ the following steps to integrate the reference data.

STEP I: REFERENCE DATA DISCOVERY

Most companies are unaware of where all their reference data is located.
For example, customer-related data may be distributed across different applications and different databases in a company's systems environment.
To discover all the reference data in the data landscape, we scan every database in the landscape and create a comprehensive catalog of the data environment (data, metadata, IDs).
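As a rough illustration of what such a catalog could look like, the sketch below scans two small in-memory SQLite databases and records one row per column. The database names, table definitions, and the `catalog_database` helper are hypothetical stand-ins chosen for this example; they are not the scanner described above, and a real landscape would involve many database engines.

```python
import sqlite3

def catalog_database(name, conn):
    """Return one catalog row per column: (database, table, column, declared type)."""
    rows = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            rows.append((name, table, column, col_type))
    return rows

# Two hypothetical databases standing in for the data landscape.
sales = sqlite3.connect(":memory:")
sales.execute("CREATE TABLE Cust (cust_id INTEGER, cust_name TEXT)")
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE Customer (id INTEGER, full_name TEXT, email TEXT)")

catalog = []
for db_name, conn in [("Sales", sales), ("CRM", crm)]:
    catalog.extend(catalog_database(db_name, conn))

for entry in catalog:
    print(entry)
```

The resulting list of tuples is the simplest possible form of a catalog; in practice it would be stored in a table of its own so that later steps can query it.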

STEP II: REFERENCE DATA IDENTIFICATION

With a full catalog of the data landscape, it becomes important to identify all the tables that contain the reference data.

E.g. If there are 100 databases, each containing 100 tables, each containing 10 columns, one must understand the semantic metadata associated with up to 100,000 columns (100 x 100 x 10) to find where the customer-related information lies.

To identify the columns containing the reference data, we automate the profiling of all the data objects that have been cataloged, using a parallel processing approach. Using this approach we can identify all the reference data objects in the data landscape.
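A minimal sketch of parallel column profiling follows, assuming each cataloged column has already been read into a list of values. The column names, the sample values, and the chosen statistics (row count, null ratio, distinct count) are illustrative assumptions, not the profiling rules of the technology described here. A thread pool is used for simplicity; a process pool or a distributed job would be the natural choice for larger volumes.

```python
from concurrent.futures import ThreadPoolExecutor

def profile_column(item):
    """Compute simple profile statistics for one (name, values) pair."""
    name, values = item
    non_null = [v for v in values if v is not None]
    return {
        "column": name,
        "rows": len(values),
        "null_ratio": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
    }

# Hypothetical sample of cataloged columns and their values.
columns = {
    "Sales.Cust.cust_name": ["Acme", "Globex", None, "Acme"],
    "Sales.Orders.amount": [10.0, 12.5, 10.0, 99.0],
}

# Each column is profiled independently, so the work parallelizes cleanly.
with ThreadPoolExecutor(max_workers=4) as pool:
    profiles = list(pool.map(profile_column, columns.items()))

for profile in profiles:
    print(profile)
```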

STEP III: REFERENCE DATA CLASSIFICATION

After obtaining a list of all the reference objects, it is important to classify the objects into the appropriate categories like customer, vendor, employee etc.

E.g. If we have determined that there are 200 reference data categories associated with the 100,000 columns in the landscape, we need to classify all the reference data into the appropriate categories.

To classify the columns containing a specific category of reference data, we profile the columns of all the data objects that have been identified as reference data. In this manner we can classify the reference data into the appropriate buckets.
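The idea of sorting columns into category buckets can be sketched with a simple rule table. This example is a deliberate simplification: it looks only at qualified column names, whereas the approach described above profiles the column contents. The keyword lists and the `classify` function are assumptions made up for illustration.

```python
# Hypothetical keyword rules; a real classifier would also inspect column values.
CATEGORY_KEYWORDS = {
    "customer": ("cust", "client", "cif"),
    "vendor": ("vendor", "supplier"),
    "employee": ("emp", "staff"),
}

def classify(qualified_name):
    """Return the first category whose keyword appears in the name."""
    lowered = qualified_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "unclassified"

for name in [
    "Sales.Cust.cust_name",
    "Financials.CIF.name",
    "Purchasing.Supplier.name",
    "HR.Staff.emp_id",
    "Ops.Log.message",
]:
    print(name, "->", classify(name))
```

Substring rules like these produce false positives (for example, "emp" also occurs in "temp"), which is one reason content profiling is the more reliable signal.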

STEP IV: REFERENCE DATA INTEGRATION

After scanning the data landscape and obtaining a list of all the reference objects in a particular category (e.g. customer), it is possible to integrate this information to create a single unified view of enterprise reference data.

E.g. Suppose we have determined that the customer records are dispersed across 100 databases, some of them being:

Database (Sales)-Table(Cust) contains 3 million customer records

Database (Financials)-Table(CIF) contains 2 million customer records

Database (CRM)-Table(Customer) contains 4 million customer records

Database (Marketing)-Table(Cust) contains 5 million customer records

To obtain a definitive picture of enterprise customers, the records have to be integrated to create a single table that

  1. Cross-references all the customer records
  2. Builds a current and historical view of the customers
  3. Synchronizes this information automatically to create an up-to-date version

Our technology can largely automate the integration of this data through sophisticated record-matching algorithms. This proprietary technology allows us to perform matching using 5 types of algorithms.

  1. Uni-dimensional exact matches
  2. Multidimensional exact matches
  3. Uni-dimensional inexact matches
  4. Multidimensional inexact matches
  5. Globalids™ matches, using unique global identifiers.
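The first four match types can be illustrated with a short sketch. It assumes that "uni-dimensional" means comparing a single attribute and "multidimensional" means comparing several attributes at once; that reading, the sample records, the 0.85 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions for this example and not the proprietary algorithms. The fifth type relies on proprietary identifiers and is not sketched.

```python
from difflib import SequenceMatcher

def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(str(value).lower().split())

def exact_match(a, b, fields):
    """Exact match: every listed field must be equal after normalization."""
    return all(normalize(a[f]) == normalize(b[f]) for f in fields)

def inexact_match(a, b, fields, threshold=0.85):
    """Inexact match: every listed field must reach the similarity threshold."""
    return all(
        SequenceMatcher(None, normalize(a[f]), normalize(b[f])).ratio() >= threshold
        for f in fields
    )

# Hypothetical records for the same customer held in two systems.
sales = {"name": "Jon Smith", "city": "Boston"}
crm = {"name": "John Smith", "city": "boston"}

print(exact_match(sales, crm, ["city"]))            # one attribute, exact
print(exact_match(sales, crm, ["name", "city"]))    # several attributes, exact
print(inexact_match(sales, crm, ["name"]))          # one attribute, inexact
print(inexact_match(sales, crm, ["name", "city"]))  # several attributes, inexact
```

Requiring every field to pass is the strictest way to combine attributes; a weighted score across fields is a common alternative when some attributes are less reliable than others.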

STEP V: REFERENCE DATA SYNCHRONIZATION

Once the Integrated Reference Data table has been created, it must be synchronized with the different databases, so that the information is continuously updated.

E.g. If we have a single Enterprise Customer Reference Table, the data in that table has to be updated with all the changes in customer data in all the customer tables across all the databases, on a periodic basis (hourly, daily, weekly etc.).

Our technology allows us to continuously monitor all the reference data tables in enterprise databases, and create alerts based on changes in the data. The software can independently keep track of changes and update the appropriate data in the Enterprise Reference Data tables.
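One simple way to keep track of changes between two periodic runs is to compare snapshots of a source table. The sketch below assumes each snapshot is a dictionary keyed by record ID; the `row_fingerprint` and `detect_changes` helpers and the sample data are hypothetical, and production systems often rely on database triggers or change logs instead of full snapshot comparison.

```python
import hashlib

def row_fingerprint(row):
    """Stable hash of a row's values, used to detect modifications."""
    payload = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_changes(previous, current):
    """Compare two snapshots keyed by record ID and list what changed."""
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    modified = sorted(
        key
        for key in current.keys() & previous.keys()
        if row_fingerprint(current[key]) != row_fingerprint(previous[key])
    )
    return {"added": added, "deleted": deleted, "modified": modified}

# Hypothetical snapshots of one customer table on two consecutive days.
yesterday = {
    1: {"name": "Acme", "city": "Boston"},
    2: {"name": "Globex", "city": "Austin"},
}
today = {
    1: {"name": "Acme", "city": "Chicago"},
    3: {"name": "Initech", "city": "Dallas"},
}

changes = detect_changes(yesterday, today)
print(changes)
```

The three lists map directly onto the updates that would be applied to the enterprise reference table, and any non-empty list could also raise an alert.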

STEP VI: REFERENCE DATA STEWARDSHIP

With an up-to-date Repository of Reference Data, organizations can get an enterprise-wide view of their customers, vendors, employees etc. This would allow companies to assign Data Stewards who would be responsible for ensuring and certifying the quality of the reference data.

E.g. A Customer Reference Data Steward would be responsible for validating the quality of the customer reference data and making sure that the process for maintaining data quality is being adhered to. Our software can provide the Reference Data Steward with the tools required to do the job. Our applications can help the Data Steward

  1. Define the Reference Data Stewardship Process
    • Scan
    • Discover
    • Profile
    • Identify
    • Classify
    • Extract
    • Transform
    • Parse
    • Compute Quality Metrics
    • Analyze
    • Standardize
    • Correct
    • DeDupe
    • Enhance
    • Aggregate
    • Match
    • Report
    • Synchronize
    • Add / Delete / Modify
    • Archive

  2. Document the Reference Data Stewardship Process
  3. Report on the Integrated Enterprise Reference Data
  4. Define Multiple Hierarchies (Taxonomies)
  5. Compute Reference Data Quality Metrics
  6. Approve additions, deletions and modifications to Reference Data
  7. Archive Reference Data
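
To make the idea of reference data quality metrics concrete, the sketch below computes two commonly used measures for a single field: completeness (share of records with a value) and uniqueness (share of filled values that are distinct). The choice of metrics, the `quality_metrics` helper, and the sample records are assumptions for illustration; the metrics a steward actually certifies would be defined in the stewardship process.

```python
def quality_metrics(records, field):
    """Return completeness and uniqueness ratios for one field."""
    values = [record.get(field) for record in records]
    filled = [value for value in values if value not in (None, "")]
    completeness = len(filled) / len(values) if values else 0.0
    uniqueness = len(set(filled)) / len(filled) if filled else 0.0
    return {"completeness": completeness, "uniqueness": uniqueness}

# Hypothetical customer records; one email is missing and one is duplicated.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "a@example.com"},
    {"id": 3, "email": None},
    {"id": 4, "email": "b@example.com"},
]

metrics = quality_metrics(customers, "email")
print(metrics)
```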

 
