Traditionally, insurance fraud detection strategies focus on identifying fraudulent claims after the claim has been paid to the claimant.

It is easier to mitigate the losses, however, when the fraud is identified before the claim is paid.

With the advancement in computing and data analytics, it is now possible to adopt a predictive approach to fraud detection. As a result, insurers are turning to data-driven fraud detection programs aimed at the prevention, detection, and management of fraudulent claims.

Advanced analytics in fraud detection

Since insurers have a large amount of data, it makes sense to evaluate internal and external data to identify claims with a higher propensity of being fraudulent. By carefully analyzing this accumulated data, insurers can identify patterns and anomalies with the help of advanced analytical tools and techniques. This helps in determining the characteristics of a fraudster and whether a claim needs further investigation.

The key lies in employing predictive techniques such as statistical modeling and machine learning algorithms, which provide proactive insights into potential fraud events. In this article, we will discuss two advanced analytics techniques used in fraud detection: logistic regression and the gradient boosting model (GBM).

Before we move on to explaining the inner mechanics of these advanced analytics techniques, it is critical to understand the flow of information in an insurance claim procedure. This includes:

  • First Notice of Loss (FNOL): The stage at which the claimant first notifies the insurer that a loss has occurred.
  • First Contact (FC): The stage at which the insurer contacts the claimant, after FNOL, asking for more information about the loss that has occurred.
  • Ongoing (OG): The continuous back and forth of information between the claimant and insurer after FC until the claim is closed.

This flow of information makes it most valuable to carry out a robust identification of potential fraud right at the first stage; the next two stages can then be used to steer investigations in a particular direction. The following process describes a stair-step approach to deploying data analytics to identify fraud at the different stages of the insurance claim process.

Step No. 1: Collating the right data

To uncover the factors/KPIs indicating fraudulent behavior, an exhaustive data sourcing exercise needs to be undertaken, considering both internal and external data. Internal data comprises information centered on customers, claims, claimants and policies. External data, on the other hand, consists of information not captured by the insurer. This includes regional demographics, industry-accepted standard scores, information on the weather conditions that prevailed when the loss occurred, and information on catastrophes that may have occurred during the time period of interest. The end result of this step is a 'master dataset' created by weaving together the collected internal and external data.

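As a minimal sketch of that weaving step, the pandas snippet below joins hypothetical internal claim records to external demographic and weather tables on region and loss date. All table names, column names, and values are illustrative assumptions, not an insurer's actual schema:

```python
import pandas as pd

# Internal data: claim records with customer/policy attributes (illustrative).
claims = pd.DataFrame({
    "claim_id": [101, 102, 103],
    "zip_code": ["10001", "60601", "94105"],
    "loss_date": pd.to_datetime(["2023-01-05", "2023-02-11", "2023-03-20"]),
    "claim_amount": [12000.0, 4500.0, 30000.0],
})

# External data: regional demographics keyed by zip code ...
demographics = pd.DataFrame({
    "zip_code": ["10001", "60601", "94105"],
    "median_income": [65000, 58000, 91000],
})

# ... and weather conditions keyed by zip code and date of loss.
weather = pd.DataFrame({
    "zip_code": ["10001", "60601", "94105"],
    "loss_date": pd.to_datetime(["2023-01-05", "2023-02-11", "2023-03-20"]),
    "severe_weather_flag": [1, 0, 0],
})

# Weave the internal and external sources into one master dataset.
master = (
    claims
    .merge(demographics, on="zip_code", how="left")
    .merge(weather, on=["zip_code", "loss_date"], how="left")
)
print(master.shape)  # (3, 6)
```

Left joins keep every claim even when an external source has no matching record, which is usually what a claims-scoring pipeline needs.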

The variables are classified by their availability during the various claim stages and made available accordingly during the three stages of model building.


Step No. 2: Applying analytics techniques

Once the master dataset is ready and the variables are identified, we apply two analytics techniques to identify fraudulent behavior:

Logistic regression: A statistical method for analyzing a dataset in which one or more independent variables determine a binary outcome. This predictive analytics technique produces an outcome measured with a dichotomous variable (one with only two possible outcomes). Plausibly fraudulent claims are a rare event, typically less than 1% of all claims. Because logistic regression underestimates the probability score for rare events, an oversampled dataset in which the event rate is at least 5% needs to be created to ensure unbiased results.

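The oversampling step can be sketched in plain Python. The `oversample` helper below is a hypothetical illustration, not a standard library routine; it duplicates the rare fraud records until the event rate clears the 5% floor:

```python
import random

random.seed(0)

# Toy labels: 1 = fraudulent claim (rare event), 0 = genuine claim.
claims = [1] * 5 + [0] * 995          # raw event rate = 0.5%

def oversample(labels, target_rate=0.05):
    """Duplicate the rare fraud records until the event rate is >= target_rate."""
    events = [y for y in labels if y == 1]
    non_events = [y for y in labels if y == 0]
    # Solve n_events / (n_events + n_non_events) >= target_rate for n_events.
    needed = int(target_rate * len(non_events) / (1 - target_rate)) + 1
    copies = -(-needed // len(events))         # ceiling division
    boosted = (events * copies)[:needed]
    sample = non_events + boosted
    random.shuffle(sample)
    return sample

balanced = oversample(claims)
print(sum(balanced) / len(balanced))   # ~0.051, above the 5% floor
```

In practice the same duplication is applied to whole claim records, not bare labels, and the model's intercept is later corrected for the artificial event rate.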

Since the flow of information is in three stages (FNOL, FC, and OG), a residual modeling technique is applied for logistic regression: the logistic score from one stage is used as an offset variable in the subsequent stage. Hence, under logistic regression, the information gains from one stage are passed on to the next. As a result, as claims move from one stage to the next, insurers gain more clarity on whether they are genuine or fraudulent.


Gradient boosting model (GBM): A machine-learning technique that improves on a single model by fitting many models and combining them for prediction. The need to create oversampled data doesn't arise, and the modeling exercise can be performed by gradient boosting of classification trees.


GBM doesn't support sequential modeling; therefore, a parallel development approach is followed at each of the three stages: FNOL, FC, and OG.

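A minimal sketch of this parallel approach, assuming scikit-learn and synthetic stage-wise feature sets, fits an independent gradient-boosted classifier per stage on the raw (non-oversampled) labels:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 1500

# Feature matrices by stage: FNOL variables, plus extra columns that only
# become available at First Contact (all synthetic and illustrative).
X_fnol = rng.normal(size=(n, 3))
X_fc = np.hstack([X_fnol, rng.normal(size=(n, 2))])
# Rare fraud flag driven by a non-linear interaction; no oversampling applied.
y = ((X_fnol[:, 0] * X_fnol[:, 1] > 1.2) & (X_fnol[:, 2] > 0)).astype(int)

# Parallel development: an independent GBM of classification trees per stage.
models = {}
for stage, X in [("FNOL", X_fc[:, :3]), ("FC", X_fc)]:
    models[stage] = GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3
    ).fit(X, y)

# Fraud-propensity scores from the FC-stage model.
scores = models["FC"].predict_proba(X_fc)[:, 1]
```

An OG-stage model would be trained the same way on the still-wider OG feature set; the stages share no parameters, unlike the offset chain used for logistic regression.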

Step No. 3: Running the analysis and analyzing the results

Under logistic regression, a standard approach to variable selection is carried out. Variables can be eliminated on the basis of fill rates, correlation analysis and clustering. Tools like SAS are used for step-wise selection of variables in the LOGISTIC procedure. Further shortlisting can be done to eliminate multicollinearity. No such treatment is required under GBM. The output of these two techniques can be measured and analyzed in terms of 'lift', 'K-S' and precision values for each of the three stages.

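The step-wise selection itself runs in SAS; purely as an illustration, the Python sketch below shows the fill-rate and correlation screens on a toy slice of the master dataset (variable names and thresholds are assumptions):

```python
import numpy as np
import pandas as pd

# Toy master-dataset slice (variable names hypothetical).
df = pd.DataFrame({
    "var_a": [1.0, 2.0, None, 4.0, 5.0],
    "var_b": [1.1, 2.1, 3.0, 4.2, 5.1],      # nearly collinear with var_a
    "var_c": [None, None, None, 1.0, None],  # poorly filled
    "var_d": [3.0, 1.0, 4.0, 1.0, 5.0],
})

# 1) Eliminate variables with a low fill rate (60% threshold assumed).
fill_rate = df.notna().mean()
kept = df.loc[:, fill_rate >= 0.6]

# 2) Reduce multicollinearity: drop one variable of each highly correlated pair.
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
selected = kept.drop(columns=to_drop)
print(list(selected.columns))   # ['var_a', 'var_d']
```

Here `var_c` falls to the fill-rate screen and `var_b` to the correlation screen, leaving a shortlist for the step-wise procedure.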

Logistic Regression and GBM: A comparison

Both techniques have different algorithms running in the background. Logistic regression requires human intervention at different stages, whereas GBM is based on machine-learning algorithms that require minimal human involvement. In terms of output, GBM produces a scored dataset based on probability values for all observations, whereas logistic regression provides both scored data and a mathematical equation (which can then be used to score new incoming claims). Hence, logistic regression allows insurers to establish causality between predictor and predicted variables. In the case of logistic regression, interaction terms and variable transformations are subject to the discretion of the data scientist building the model, whereas GBM itself introduces and tests for interactions and variable transformations.

| Criterion | Logistic Regression | GBM |
| --- | --- | --- |
| Human intervention | Yes, at every step | Minimal |
| Output | Scored dataset + mathematical equation | Scored dataset (probability values) |
| Handles non-linear data | No | Yes |
| Handles observations with missing variables | Requires imputation | Yes |

The takeaway

Both techniques have their own merits and limitations. Depending on what the business wants to accomplish, an appropriate technique can be selected. Nor is it necessary to use these techniques independently: using logistic regression in tandem with GBM on the same dataset can provide a better perspective on the authenticity of claims.

Insurers can use these techniques to fast-track the claim-handling process at FNOL and realign claim resources with more complex claim-handling activities.

