The issues associated with performing QA and data validation in an enterprise-wide business intelligence initiative are complex, and in many ways transcend the QA tenets that are appropriate for online transaction processing (OLTP) systems. Attempts at establishing effective data validation often fail due to the enormity of the task.

The growth of business intelligence in insurance is driven by a number of key factors, one of the most important being the compelling need for an insurer to have a "single version of the truth." In order to achieve this lofty goal, an overall BI strategy must be defined, organized, developed and tested. It is the testing component, or more specifically the validation of the data, that is the subject of this paper. We use the term "QA" throughout the paper. For our purposes, QA refers to the process of testing and validating the data that is used to populate the BI system being implemented. It does not refer to the functional validation that is more appropriate to an OLTP environment.

A BI initiative is difficult by definition. It typically requires a multi-faceted process that takes disparate data from multiple, heterogeneous sources and organizes it in a way that it becomes useful to all levels and functional reaches within the organization. To be useful, this data must earn credibility and must be beyond reproach. Sounds like a reasonable goal, but why is that so difficult?

Let's start out with the source data. It will come from multiple systems. Each of these systems will have had multiple instances of code and data changes in its past, performed by multiple people, many of whom are no longer with the organization. Each of these systems treats each facet of a premium and loss transaction a little differently. Codes will be different among systems, and there is a need to create conformity.

And there is the size issue for a QA person to contend with. In our joint effort with Millbrook and the Montana State Fund, there were approximately 2,500 individual data elements consisting of facts and attributes. These were spread among 30 dimensions. The facts are oriented by multiple time dimensions (Transaction Dates, Policy Year, Calendar Year, Accident Year) and exposed in multiple aggregates and roll-up data marts. Our math determines that the number of possible combinations of these data elements that can occur within the installation is in the trillions!

So the question becomes: how can you implement a BI environment and test it thoroughly enough that it becomes the accepted single version of the truth?

This is a Business Initiative - Not Just an IT Effort

Our joint findings from the project confirm this notion, along with the following:

• Create Data Stewardship - This should be implemented at an early stage in the project or, better, before the project begins. Typically a data steward is not an individual, but a shared responsibility for and "ownership" of data integrity between key functions in the company. This typically should involve Finance, Underwriting, Claims and Actuarial, as well as Marketing and Premium Audit. The validation of data is an important component of such a program. Most often, the business processes and rules that drive the collection of the data are the chief culprits for poor data. Ownership by the business functions facilitates the ability to recognize the root cause of data quality issues and implement remediated processes.

• Get on the Same Page - At the outset of the project, it is critical that, organizationally, everyone agrees on what the business terms of the data mean. It is possible to spend weeks of effort agreeing on definitions. Typically, counts such as "Incurred Losses", "Claim Count" or even "Written Premium" may have assumed different definitions for different functions. For example, is a claim count the count of open claims for a single claim number, or is it a feature count, or even a count of claimants? Be prepared to spend a lot of time on counts. It is vastly more difficult to validate data unless there is a common definition of what that data actually is. (A sketch of how these competing claim-count definitions translate into different queries appears after this list.)

• Finding the Right Resources - We find that the ideal QA person in this environment is someone who knows and uses the data, is familiar with the source system idiosyncrasies, and has a background as a functional analyst. Additionally, a working knowledge of SQL is very helpful for finding and researching anomalies in the data and tracing issues back to their source. Unfortunately, this is not a common skill set, so we have found that pairing more technical resources with functional data experts works well. These functional staff members are typically the "go to" people when a report is needed.
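
As an illustration of the competing claim-count definitions discussed above, and of the kind of simple SQL the QA team will lean on, the sketch below expresses each definition as its own query. The table and column names (claim, claim_feature, claimant, claim_status) are illustrative assumptions only, not the schema of any particular processing system.

    -- Definition 1: count of claim numbers with an open status
    SELECT COUNT(DISTINCT c.claim_number) AS open_claim_count
    FROM claim c
    WHERE c.claim_status = 'Open';

    -- Definition 2: feature count, one per open coverage/feature on a claim
    SELECT COUNT(*) AS open_feature_count
    FROM claim_feature f
    WHERE f.feature_status = 'Open';

    -- Definition 3: claimant count, one per claimant on an open claim
    SELECT COUNT(DISTINCT cl.claimant_id) AS open_claimant_count
    FROM claimant cl
    JOIN claim c ON c.claim_number = cl.claim_number
    WHERE c.claim_status = 'Open';

Running all three against the same snapshot and publishing the results side by side is often the fastest way to get the business functions to converge on a single definition.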

Setting Expectations

In order to maximize the chance for a successful data validation effort, it is critical that the appropriate expectations are set regarding the process and the metrics defining success or failure.

• Is Perfection Achievable? - Let's start with the simple fact that there is no such thing as "perfect" data. Because a human being will be recording the data at the outset of the process, there will be expected human errors. Additionally, there are many fields in most processing systems that are unedited, or poorly edited, and therefore prone to error. To prevent issues during data validation, critical steps at the outset are:

o Define the fields that require perfection. In this case, these will typically be financial fields such as written premium, earned premium, paid losses, outstanding reserves, etc. Textual attributes or other poorly controlled fields that may have known accuracy problems may correctly be put on the non-critical list and addressed with a systemic clean-up as part of an ongoing data stewardship program. (A reconciliation sketch for these critical financial fields appears after this list.)

o Define what is perfect. For example, if outstanding reserves are expected to be perfect, it must be explicitly understood that the reserve is the claims adjuster's best estimate of the expected closing value of the claim, given the information that is available. Obviously this will move over time. Therefore, it is not "perfect" by definition. However, the accurate handling of that reserve by the BI system needs to be spot on. This should be the definition of perfect for that data element.

o Address long-term systemic errors. In many systems, historic data may simply be wrong. Unfortunately, if this is a key data element, say, earned premium, that error is already embedded in the company's financials and actuarial projections and reports. By "correcting" this error, significant expense and damage may be done. The defined methods for handling these situations need to be decided by the business team and addressed before validation commences.
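
As referenced above, the critical financial fields are the ones where the BI figures must tie out exactly. A minimal reconciliation sketch is shown below; it compares earned premium by policy year between the BI layer and the frozen source snapshot. The table and column names (bi_premium_fact, src_premium_snapshot, policy_year, earned_premium) are assumptions for illustration.

    -- Earned premium by policy year: BI layer vs. frozen source snapshot.
    -- Any row returned is a defect for a field on the "perfect" list.
    SELECT COALESCE(bi.policy_year, src.policy_year) AS policy_year,
           bi.earned_premium  AS bi_earned_premium,
           src.earned_premium AS source_earned_premium
    FROM  (SELECT policy_year, SUM(earned_premium) AS earned_premium
           FROM bi_premium_fact GROUP BY policy_year) bi
    FULL OUTER JOIN
          (SELECT policy_year, SUM(earned_premium) AS earned_premium
           FROM src_premium_snapshot GROUP BY policy_year) src
      ON src.policy_year = bi.policy_year
    WHERE COALESCE(bi.earned_premium, 0) <> COALESCE(src.earned_premium, 0);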

• Reload or Repair? - It is expected and inevitable that issues will be found during the validation process. One of the key questions regarding these findings is whether to repair the data or reload it. If the data anomaly found is the result of a flaw in the ETL logic, then the choice has to be made to make the ETL change and reload the data, or to run a repair script. While many of these decisions will depend on the degree of severity, it is good practice to make a determination up front as to the general rules regarding such an event.

In some installations, a reload is not a major issue because of a limited data-set size. In the case of large sets of data, each reload may consume days, making iterative reloads impractical. In the case of large loads, we have found it advisable to "batch" issues and perform a larger reload at pre-defined intervals. The potential loss of type 2 history when reloading from a type 1 source system also needs to be considered, as sketched below.
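
The type 2 concern above refers to slowly changing dimensions: the BI layer keeps effective-dated history that a current-state (type 1) source no longer holds, so a truncate-and-reload can silently collapse that history. The sketch below is illustrative only; the dimension and columns (policy_dim, agency_code, effective_from, effective_to) are assumptions rather than a reference to any particular model.

    -- A type 2 policy dimension keeps one row per version of the tracked attributes.
    CREATE TABLE policy_dim (
        policy_key     INTEGER     NOT NULL,  -- surrogate key, one per version
        policy_number  VARCHAR(20) NOT NULL,  -- natural key from the source system
        agency_code    VARCHAR(10) NOT NULL,  -- attribute tracked as type 2
        effective_from DATE        NOT NULL,
        effective_to   DATE        NOT NULL,
        is_current     CHAR(1)     NOT NULL   -- 'Y' on the row matching the source today
    );

    -- Before a full reload, measure how much history exists only in the BI layer
    -- and would be lost by rebuilding the dimension from the type 1 source.
    SELECT policy_number, COUNT(*) AS version_count
    FROM policy_dim
    GROUP BY policy_number
    HAVING COUNT(*) > 1;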

• Prep the Team - One of the larger issues that we see with integrated QA teams is that there is a great deal of enthusiasm when the process starts, and it diminishes steadily over the course of the process. One of the key culprits here is a lack of level setting with the team concerning the stringent nature of the process. The QA team will likely be comprised of staff that is new to a BI data validation effort. Establishing realistic goals and expected results is important before commencement.

Given the trillions of data combinations that exist in the company data, it is not possible to check everything. That is a given. Instead, be prepared to arm the staff with the test scripts necessary to validate the data in a controlled fashion. The following should be included in the script and "QA Kit" for use by the team members:

o Agreed-upon definitions of all key measurements and attributes.

o Searchable document that details business and data decisions made during the process.

o Tools to profile data in existing processing source systems. Profiling data generally consists of listing the distinct occurrences of data elements within a single field within a single table. This simple visual inspection is an excellent means of identifying the quality of the processing source system data. The tools can be simple SQL statements or commercially available data profiling tools. (A minimal profiling query is sketched after this list.)

o Production reports that are the "bible" reports for the organization. In order to establish credibility, data must reconcile to these reports.

o Scripts for identifying unexpected results (anomalies) due to logical data errors or the absence of documented business rules vs. the "bible" reports. In many cases, we find that the BI data is right and the original source reports are logically incorrect due to poor filtering in the report source code. In most of these cases we have found that the report has been wrong for years, but nobody except the original programmer knew about the aberrant filtering conditions.

o Scripts for visual verification of data at a premium and loss level. These scripts are ideally implemented as pre-set reports or subject areas within the chosen BI tool.
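
The profiling step in the kit above can start as nothing more than a distinct-value listing. A minimal sketch, assuming a source policy table with a state_code field (both names are illustrative):

    -- Profile a single field: list each distinct value and how often it occurs.
    -- Unexpected codes, blanks and misspellings surface immediately.
    SELECT state_code,
           COUNT(*) AS occurrence_count
    FROM policy
    GROUP BY state_code
    ORDER BY occurrence_count DESC;

Repeating the same pattern over each candidate field, and keeping the output with the QA Kit, gives the team a baseline view of source data quality before any reconciliation begins.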

• Revisit Expectations - In the heat of data validation, it is common to lose sight of the basic objectives. During the process, it will be necessary to revisit priorities and refine expectations. Condition the team to relate data issues back to whether they impact critical data, and record and define ongoing efforts that may improve data at the source systems. Moving this process from the emotional to the systematic allows you to steady the ship and keep enthusiasm high.

• Peel the Onion - The process of data validation is best approached by starting at a high level and moving to lower levels of detail. Each layer should be tested, with identified anomalies analyzed by drilling into successively lower levels of granularity. This approach will identify whether a data anomaly is systemic and widespread or is confined to one or just a few records with no associated pattern, as the drill-down sketch below illustrates.
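
A minimal drill-down sketch in that spirit: compare an aggregate first, then pull the detail for the one slice that does not tie out. Table and column names (bi_loss_fact, src_loss_snapshot, accident_year, claim_number, paid_loss) are assumptions, and the accident year in the second query is just an example of a slice that failed the first check.

    -- Level 1: paid losses by accident year, BI layer vs. frozen source snapshot.
    SELECT bi.accident_year,
           bi.paid_loss  AS bi_paid_loss,
           src.paid_loss AS source_paid_loss
    FROM  (SELECT accident_year, SUM(paid_loss) AS paid_loss
           FROM bi_loss_fact GROUP BY accident_year) bi
    JOIN  (SELECT accident_year, SUM(paid_loss) AS paid_loss
           FROM src_loss_snapshot GROUP BY accident_year) src
      ON src.accident_year = bi.accident_year
    WHERE bi.paid_loss <> src.paid_loss;

    -- Level 2: drill into a failing accident year at claim level to see whether
    -- the anomaly is systemic or confined to a handful of records.
    SELECT COALESCE(bi.claim_number, src.claim_number) AS claim_number,
           bi.paid_loss  AS bi_paid_loss,
           src.paid_loss AS source_paid_loss
    FROM  (SELECT claim_number, SUM(paid_loss) AS paid_loss
           FROM bi_loss_fact WHERE accident_year = 2010 GROUP BY claim_number) bi
    FULL OUTER JOIN
          (SELECT claim_number, SUM(paid_loss) AS paid_loss
           FROM src_loss_snapshot WHERE accident_year = 2010 GROUP BY claim_number) src
      ON src.claim_number = bi.claim_number
    WHERE COALESCE(bi.paid_loss, 0) <> COALESCE(src.paid_loss, 0);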

• Leave Enough Time in the Project Plan - Data validation and QA are generally short-changed in the project planning process. In an average system implementation (approximately 300 or so data elements), this is typically afforded only 4 weeks. We would argue that this is not enough. There will be problems, and there will be a need to reload/repair data and retest. This is always an iterative process and must be considered. A more appropriate time is 6 to 8 weeks, with a contingency built in if more serious errors are encountered.

This is often a surprise to team members who are familiar with OLTP implementations. Data requires more time to validate, and not allowing sufficient time can lead to project failure.

Process Lessons

Once the business side is engaged and the level-setting process is complete, it's time to start the data validation process. There are several key best practices that should be understood and employed to maximize the possibility of success:

• How Much Data? - We have found that it makes sense to "smoke test" the earliest phase of QA by creating a sample of approximately 1,000 policies and their associated claims. The policies selected should represent all lines of business, product types and associated coverages. The QA team should interface with the business functions to determine if there are unique conditions that should be identified and brought into this initial test bed. This "smoke test" should test the transformations and should be used to validate against the source systems on a policy-by-policy and claim-by-claim basis. As the processing source systems are a moving target, a snapshot, or "frozen data state," of each source system should be taken and exposed to the QA staff so that there is an "apples to apples" comparison between the processing system data and the BI data. (A sketch of one way to pull such a sample follows.)
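
One way to pull a stratified sample of roughly 1,000 policies across all lines of business, assuming an illustrative policy table with policy_number and line_of_business columns; the cap of 25 policies per line is arbitrary and would be tuned to the size of the book and the number of lines written.

    -- Take up to 25 policies from each line of business so every line is represented.
    -- Claims are then pulled for the selected policy numbers.
    WITH ranked AS (
        SELECT policy_number,
               line_of_business,
               ROW_NUMBER() OVER (PARTITION BY line_of_business
                                  ORDER BY policy_number) AS rn
        FROM policy
    )
    SELECT policy_number, line_of_business
    FROM ranked
    WHERE rn <= 25;

A random ordering can be substituted in the ROW_NUMBER() window if the platform supports it, and the unique conditions identified by the business functions can be added to the sample by explicit policy number.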

Once initial testing is completed, you will want to create an initial load of data for validation testing. This is necessary because the reports that the team will use to validate will contain the same data. As in the smoke test load, at the time the full load is run, a snapshot version of the related processing source systems should be taken to ensure uniformity in reconciliation. The initial load should contain enough data from the last several calendar years, and enough historical data should be present to reconcile to the prior year-end reports.

This approach will also provide the ability to do load testing and begin the process of fine-tuning your ETL code.

• What Tools To Validate? - The most effective means of validating data is to use the BI tool that is being implemented as part of the initiative. This tool should allow for easy aggregation of data for verifying the "bible reports" of the organization. More importantly, properly structured, it should provide drill-down capabilities to quickly identify anomalies in the data.

Another key benefit of using the BI tool as the validation point is that the user will be able to visually scan data at the required aggregation level. In many installations, this will be the first time staff have had this capability, and the discovery possibilities are enormous. The simple act of scrolling through filtered and categorized data often yields the most "ah-ha" moments of the data validation process.

Using the BI tool as the validation mechanism also provides the key benefit of validating the BI semantic layer. The semantic layer is the "brain" of the BI system, where all data elements, table joins and other configuration are defined so that reports and dashboards may be easily prepared.

Of course, the use of the BI solution as the validation tool presupposes that the semantic layer in the tool has been completed to the extent required to support the effort. As a best practice, it is important to pre-build the semantic layer before the data validation effort commences.

• Metadata Use - In a best-practices BI environment, the use of a complete source-to-target metadata management layer becomes a major asset in the data validation stage. A key component of our mutual success with the Montana State Fund was the development and implementation of a metadata solution that provided complete data lineage detail, including transformation activities, from the individual source through the BI, or reporting, layer. This provided the data validation team with the ability to instantly assess the path the data travelled to its ultimate reporting destination. It also provided enforcement of the business rules and definitions finalized at the beginning of the process. (The sketch below illustrates the kind of lineage detail involved.)
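
A minimal illustration of the kind of lineage detail such a layer gives a tester, assuming a hypothetical metadata_lineage table (all names here are assumptions, not a reference to any particular metadata product):

    -- For one reporting element, show where it came from and what was done to it.
    SELECT target_table,
           target_column,
           source_system,
           source_table,
           source_column,
           transformation_rule
    FROM metadata_lineage
    WHERE target_column = 'earned_premium'
    ORDER BY source_system, source_table, source_column;

With this in hand, a tester chasing an earned premium discrepancy can see every contributing source field and transformation without reverse-engineering the ETL code.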

• Common Stumbling Blocks - The QA plan should recognize that there are several key areas that are especially problematic and deserve focused attention:

o Status Changes - A best-practice BI installation will provide the ability to understand and detail the status of a policy or a claim at any point in its life cycle. There are typically multiple rules that must be constructed in the ETL code to govern this process. Where this becomes problematic is that the full set of rules managed in the processing source systems is often not fully known or documented. Careful validation of these status changes is required on a policy and claim basis vs. the processing source system record (see the sketch at the end of this list).

o Counts - The initial business sessions where the definitions are set will discuss and define counts (policy and claim side). As noted earlier in this paper, counts may have evolved to different definitions and understandings by different functions within an organization for the "same" count. As a result, these sessions most likely will result in an expanded and/or changed set of definitions for which no existing reports are available to validate against. Ongoing sessions with the business representatives should focus on the definitions and the validation activities required:

- A reasonability review of whether the counts make practical sense.

- A deeper dive where a segment of counts can be validated at a detail level vs. detail from the processing source systems.

o Transformations - Where there is transformation activity in the ETL process, there will be potential data problems. The effective use of a best-practice metadata management process that provides business definitions and source information to the tester facilitates the validation effort.

o Converted Data / Changing Business Rules - Historical data being moved from a single source system may have undergone multiple metamorphoses during the length of its journey. The data that you see in your processing source system today may have started in another system (or more than one system) long ago. At the time of the data conversion(s), business rules may have been applied to the data in order to get it into the new processing source system cleanly. This may have included the use of "dummy" records and/or data fields. In short, while allowing the processing source system to function, the data may not accurately reflect the goal of the "single version of the truth." Unfortunately, that data is what it is. It is important to detail up front the anomalies expected at key inflection points in the data's history and make sure that validation activities test this data thoroughly.

If an organization identifies the business rules associated with a conversion during the requirements phase, these business rules can sometimes be embedded in the ETL logic for the period of time in which they apply.
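
As referenced in the Status Changes and Counts items above, a useful pattern is to reconcile both the counts and the individual statuses against the frozen source snapshot. A minimal sketch, with illustrative table names (bi_claim_dim, src_claim_snapshot):

    -- 1) Reasonability: compare claim counts by status between the BI layer
    --    and the frozen source snapshot.
    SELECT COALESCE(bi.claim_status, src.claim_status) AS claim_status,
           bi.claim_count  AS bi_claim_count,
           src.claim_count AS source_claim_count
    FROM  (SELECT claim_status, COUNT(DISTINCT claim_number) AS claim_count
           FROM bi_claim_dim GROUP BY claim_status) bi
    FULL OUTER JOIN
          (SELECT claim_status, COUNT(DISTINCT claim_number) AS claim_count
           FROM src_claim_snapshot GROUP BY claim_status) src
      ON src.claim_status = bi.claim_status;

    -- 2) Deeper dive: list individual claims whose current status disagrees,
    --    so the team can trace the ETL status rule that produced the mismatch.
    SELECT bi.claim_number,
           bi.claim_status  AS bi_status,
           src.claim_status AS source_status
    FROM bi_claim_dim bi
    JOIN src_claim_snapshot src
      ON src.claim_number = bi.claim_number
    WHERE bi.claim_status <> src.claim_status;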

We hope that we have successfully made the point that data validation in an enterprise-wide BI initiative is a challenging task. The sheer weight of validating potentially trillions of data combinations demands that an appropriate process be planned and put in place to address this critical need.

To maximize your success potential:


• Get the business side involved up front and with the right people.

• Make sure that you establish the right expectations and provide the right tools.

• Create and follow a well-planned process.

It goes without saying that the downside of not skillfully executing a BI installation is that it suffers from credibility issues. When this happens, business users quickly lose confidence in the data, and recapturing user confidence can become a herculean effort. While difficult, we have found that data validation can be very successful when planned and implemented following the recommendations of this best-practices approach.

(Jack Plunkett is president of Millbrook)
