The issues associated with performing QA and data validation in an enterprise-wide business intelligence initiative are complex, and in many ways transcend the QA tenets that are appropriate for online transaction processing systems (OLTP). Attempts at establishing effective data validation often fail due to the enormity of the task.
The growth of business intelligence in insurance is driven by a number of key factors; one of the most important being the compelling need for an insurer to have a "single version of the truth." In order to achieve this lofty goal, an overall BI strategy must be defined, organized, developed and tested. It is the testing component or more specifically, the validation of the data that is the subject of this paper. We use the term "QA" throughout the paper. For our purposes, QA refers to the process of testing and validating the data that is used to populate the BI system that is being implemented. It does not refer to functional validation that is more appropriate to an OLTP environment.
A BI initiative is difficult by definition. It typically requires a multi-faceted process that takes disparate data from multiple, heterogeneous sources and organizes it in a way that it becomes useful to all levels and functional reaches within the organization. To be useful, this data must earn credibility and must be beyond reproach. Sounds like a reasonable goal, but why is that so difficult?
Let's start out with the source data. It will come from multiple systems. Each of these systems will have had multiple instances of code and data changes in its past as performed by multiple people, many of whom are no longer with the organization. Each of these systems treats each facet of a premium and loss transaction a little differently. Codes will be different among systems and there is need to create conformity.
And, there is the size issue for a QA person to contend with. In our joint effort with Millbrook and the Montana State Fund, there were approximately 2,500 individual data elements consisting of facts and attributes. These were spread among 30 dimensions. The facts are oriented by multiple time dimensions (Transaction Dates, Policy Year, Calendar Year, Accident Year) and exposed in multiple aggregates and roll-up data marts. Our math determines that the number of possible combinations of these data elements that can occur within the installation is in the trillions!
So the question becomes how can you implement a BI environment and test it thoroughly enough that it becomes the accepted single version of the truth?
This is a Business Initiative - Not Just an IT EffortIT does not create the data, nor use it in a business context. IT is asked to store the data and to provide the mechanisms to get it safely into the computer systems. IT is also required to provide the necessary reporting assistance for using the company's data. Because only the business actually creates, uses and understands the data, it is logical that they must have a significant ownership in the process of validating the data that will fuel the company.
Our joint findings from the project confirm this notion along with;
? Create Data Stewardship - This should be implemented at an early stage in the project or better, before the project begins. Typically a data steward is not an individual, but a shared responsibility and "ownership" of data integrity, between key functions in the company. This typically should involve Finance, Underwriting, Claims, Actuarial as well as Marketing and Premium Audit. The validation of data is an important component of such a program. Most often, the business process and rules that drive the collection of the data are the chief culprits for poor data. The ownership of the business functions facilitates the ability to recognize root cause of the data quality issues and implement remediated processes.
? Get on the Same Page - At the outset of the project, it is critical that organizationally, every one should agree to what the business terms of the data mean to everyone. It is possible to spend weeks of effort agreeing on definitions. Typically, counts such as "Incurred Losses", "Claim Count" or even "Written Premium" may have assumed different definitions for different functions. For example, is a claim count the count of open claims for a single claim number, or is it a feature count, or even a count of claimants? ?Be prepared to spend a lot of the time on counts. It is vastly more difficult to validate data unless there is a common definition of what that data actually is.
? Finding the Right Resources - We find that the ideal QA person in this environment is a person who knows and uses the data, is familiar with the source system idiosyncrasies, and who has a background as a functional analyst. Additionally, a working knowledge of SQL is very helpful to find and research anomalies in the data and trace issues back to their source. Unfortunately this is not a common skill set, so we have found that the pairing of more technical resources with functional data experts works well. These functional staff members are typically the "go to" people when a report is needed.
In order to maximize the chance for a successful data validation effort, it is critical that the appropriate expectations are set regarding the process and the metrics defining success or failure.
? Is Perfection Achievable? - Let's start with the simple fact that there is no such thing as "perfect" data. Because a human being will be recording the data at the outset of the process, there will be expected human errors. Additionally, there are many fields in most processing systems that are unedited, or poorly edited and therefore prone to error. To prevent issues during data validation, critical steps at the outset are:
o Define what the fields are that require perfection. In this case, this will typically be financial fields such as written premium, earned premium, paid losses, outstanding reserves, etc. Textual attributes or other poorly controlled fields that may have known accuracy problems may be correctly put on the non-critical list and addressed with a systemic clean up as part of an ongoing data stewardship program.
o Define what is perfect. For example, if outstanding reserves are expected to be perfect, it must be explicitly understood that the reserve is the best estimation of the claims adjuster of the expected closing value of the claim given the information that is available. Obviously this will move over time. Therefore, it is not "perfect" by definition. However, the accurate handling of that reserve by the BI system needs to be spot on. This should be the definition of perfect for that data element.
o Address long term systemic errors. In many systems, historic data may simply be wrong. Unfortunately, if this is a key data element, say, earned premium, that error is already embedded in the company's financials and actuarial projections and reports. By "correcting" this error, significant expense and damage may be done. The defined methods for handling these situations need to be decided by the business team and addressed before validation commences.
? Reload or Repair? - It is expected and inevitable that issues will be found during the validation process. One of the key questions regarding these findings is whether to repair the data or reload. If the data anomaly found is the result of a flaw in the ETL logic, then the choice has to be made to make the ETL change and re-load the data or to run a script. While many of these issues will depend on the degree of severity, it is good practice to make a determination up front as to the general rules regarding such an event.
In some installations, a reload is not a major issue because of a limited data-set size. In the case of large sets of data, each re-load may consume days, making iterative re-loads impractical. In the case of large loads, we have found it advisable to "batch" issues and have a larger re-load at pre-defined intervals. A loss of type 2 data from a type 1 source system needs to also be considered.
? Prep the Team - One of the larger issues that we see with integrated QA teams is that there is great deal of enthusiasm when the process starts and it diminishes steadily over the process. One of the key culprits here is the lack of level setting with the team concerning the stringent nature of the process. The QA team will be likely comprised of staff that is new to a BI data validation effort. Establishing realistic goals and expected results are important before commencement.
Given the trillions of data combinations that exist in the company data it is not possible to check everything. That is a given. Instead, be prepared to arm the staff with the test scripts necessary to validate the data in a controlled fashion. The following should be included in the script and "QA Kit" for use by the team members:
o Agreed upon definitions of all key measurements and attributes.
o Searchable document that details business and data decisions made during the process.
o Tools to profile data in existing processing source systems. Profiling data generally consists of listing the distinct occurrences of data elements within a single field within a single table. This simple visual inspection is an excellent means of identifying the quality of the processing source system data. The tools can be simple SQL statements or commercially available data profiling tools.
o Production reports that are the "bible" reports for the organization. In order to establish credibility, data must reconcile to these reports.
o Scripts for identifying unexpected results (anomalies) due to logical data errors or the absence of documented business rules vs. the "bible" reports. In many cases, we find that the BI data is right and the original source reports are logically incorrect due to poor filtering in the report source code. In most of these cases we have found that it has been wrong for years but nobody except the original programmer knew about the aberrant filtering conditions.
o Scripts for visual verification of data at a premium and loss level. These scripts are ideally implemented as pre-set reports or subject areas within the BI tool chosen.
? Revisit Expectations - In the heat of data validation, it is common to lose sight of the basic objectives. During the process, it will be necessary to revisit priorities and refine expectations. Condition the team to relate data issues back to whether it impacts critical data and record and define ongoing efforts that may improve data at the source systems. Moving this process from the emotional to systematic allows you to steady the ship and keep enthusiasm high
? Peel the Onion - The process of data validation is best approached by starting at a high level and moving to lower levels of detail. Each layer should be tested, with identified anomalies analyzed by drilling into successively lower levels of granularity. This approach will identify whether the data anomaly is systemic and wide-spread or is confined to one or just a few records with no associated pattern
? Leave Enough Time in Project Plan - Data validation and QA is generally short-changed in the project planning process. In a average system implementation (approximately 300 or so data elements), this is typically afforded only 4 weeks. We would argue that this is not enough. There will be problems and there will be a need to reload/repair data and retest. This is always an iterative process and must be considered. A more appropriate time is 6 to 8 weeks, with a contingency built in if more serious errors are encountered.
This is often a surprise to team members who are familiar with OLTP implementations. Data requires more time to validate and the implications of not allowing sufficient time can lead to project failure.
Once the business side is engaged and the level setting process complete, it's time to start the data validation process. There are several key best practices in the process that should be understood and employed to maximize the possibility for success:
? How Much Data? - We have found that it makes sense to "smoke test" the earliest phase of QA by creating a sampling of approximately 1,000 policies and their associated claims. The policies selected should represent all lines of business, product types and associated coverages. The QA team should interface with the business functions to determine if there are unique conditions that should be identified and brought into this initial test bed. This "smoke test" should test the transformations and should be used to validate against the source systems on a policy by policy and claim by claim basis. As the processing source systems are a moving target, a snapshot, or a "frozen data state" of each source system should be taken and exposed to QA staff so that there is a "apples to apples" comparison between the processing system data and the BI data.
Once initial testing is completed, you will want to create an initial load of data for validation testing. This is necessary as the reports that the team will use to validate will contain the same data. As in the smoke test load, at the time the full load is run, a snapshot version of the related processing source systems should be taken to ensure uniformity in reconciliation. The initial load should contain enough data from the last several calendar years and enough historical data should be present to reconcile to the prior year end reports.
This approach will also provide the ability to do load testing and begin the process of fine tuning your ETL code.
? What Tools To Validate? - The most effective means of validating data is to use the BI tool that is being implemented as part of the initiative. This tool should allow for easy aggregation of data for verifying the "bible reports" of the organization. More importantly, properly structured, they should provide drill down capabilities to quickly identify anomalies in the data.
Another key benefit of using the BI tool as the validation point is that the user will be able to visually scan data at the required aggregation level. In many installations, this will be the first time staff may have this capability and the discovery possibilities are enormous. The simple act of scrolling through filtered and categorized data often yields the most "ah-ha" moments of the data validation process.
Using the BI tool as the validation mechanism also provides the key benefit of providing validation of the BI semantic layer. The semantic layer is the "brain" of the BI system where all data elements, table joining and other configuration is defined so that reports and dashboards may be easily prepared.
Of course, the use of the BI solution as the validation tool pre-supposes that the semantic layer in the tool has been completed to the extent required to support the effort. As a best practice, it becomes important to pre-build the semantic layer before the data validation effort commences.
? Metadata Use - In a best practices BI environment, the use of a complete source to target metadata management layer becomes a major asset in the data validation stage. A key success component in our mutual success with the Montana State Fund was the development and implementation of a metadata solution that provided complete data lineage detail including transformation activities from the individual source through the BI, or reporting layer. This provided the data validation team with an ability to instantly assess the path the data travelled to its ultimate reporting destination. It also provided an enforcement to the business rules and definitions finalized at the beginning of the process.
? Common Stumbling Blocks - The QA plan should recognize that there are several key areas that are especially problematic and should have recognized concentration:
o Status Changes - A best practice BI installation will provide the ability to understand and detail the status of a policy or a claim at any point in its life-cycle. There are typically multiple rules that must be constructed in the ETL code to govern this process. Where this becomes problematic is that the full totality of the rules that are managed in the processing source systems are not fully known or documented. Careful validation of these status changes are required on a policy and claim basis vs. the processing source system record.
o Counts - The initial business sessions where the definitions are set will discuss and define counts (policy and claim side). As noted earlier in this paper, counts may have evolved to different definitions and understandings by different functions within an organization for the "same" count. As result, these sessions, most likely, will result in an expanded and/or changed set of definitions where no existing reports are available to validate against. Ongoing sessions with the business representatives should focus on the definitions and validation activities required:
? A reasonability review of whether the counts make practical sense.
? A deeper dive where a segment of counts can be validated at a detail level vs. detail from the processing source systems.
o Transformations - Where there is transformation activity in the ETL process, there will be potential data problems. The effective use of a best practice metadata management process that provides business definitions and source information to the tester facilitates the validation efforts.
o Converted Data/ Changing Business Rules - Historical data being moved from a single source system may have undergone multiple metamorphoses during the length of its journey. The data that you see in your processing source system today, may have started in another system (or more than one system) long ago. At the time of data conversion(s) business rules may have been enacted on the data in order to get it into the new processing source system cleanly. This may have included "dummy" records and/or data fields being used. In short, while allowing the processing source system to function, the data may not accurately reflect the goal of the "single source of the truth". Unfortunately, that data is what it is. It is important to detail up front what the expected anomalies will yield at key inflection points in the data's points and make sure that validation activities test this data thoroughly.
If an organization business rules associated with a conversion during the requirements phase, sometimes these business rules can be embedded in the ETL logic for the period of time when these business rules apply.
We hope that we successfully made the point that data validation in an enterprise-wide BI initiative is a challenging task. The sheer weight of the validation of potentially trillions of data combinations, demands that an appropriate process be planned and put in place to address this critical need.
To maximize your success potential:
? Get the business side involved up front and with the right people.
? Make sure that you establish the right. expectations and provide the right tools.
? Create and follow a well-planned process.
It goes without saying that the down-side to not skillfully executing a BI installation is that it suffers from credibility issues. When this happens, business users quickly lose confidence in the data and recapturing user confidence can become a herculean effort. While difficult, we have found that the data validation can be very successful when planned and implemented following the recommendations of this best practices approach.
(Jack Plunkett is president of Millbrook)