Relying on traditional web search engine queries to mine information may omit critical data needed to break your case.
Unsophisticated users of Google, Bing, or Yahoo believe those search engines locate everything on the web. As web research exploded because of the ease of typing a search query into Google or Bing, specialized search results lay hidden.
However, Google and Bing search results are better compared to ice floating on top of much deeper water. Estimates are that Google indexes 40 billion web pages, but another 450 billion web pages (often database-related) are hidden from search engines. Where is the hidden web data? Below, we explore the “hidden web,” why not everything is visible, strategies for conducting more effective searches, and deep web resources to consider when performing a database query in your claims investigation.
The Hidden Web: Who is the Real Walter White?
The hidden web is data not indexed by a search engine because it is closed off. Hidden web data is typically specialized content on a specific topic, and that content changes over time. Data that general searches failed to report was originally called “invisible,” but it is now referred to as part of the hidden or deep web.
Why is data not indexed? One reason is that some search engines seek static HTML pages that are also linked to other pages. However, estimates are that dynamic web pages outnumber static pages 100 to 1.
Hidden web pages are dynamic search results generated from searching a database with a customized query. Once those results are viewed and closed, the results cease to exist on the web. Analysts, experts or researchers at a vast array of worldwide institutions compile databases that are not normally recoverable by a general Google search. Much like the character Walter White in the TV series “Breaking Bad,” Google search results can be deceiving. On the surface, Walter White is a mild-mannered high school science teacher diagnosed with cancer. In fact, he starts making methamphetamine, initially to pay his medical bills, and eventually becomes a murderous drug kingpin.
Why do some search engines miss this hidden web data when they use sophisticated computer crawling technology? Search engines use web crawlers, also called spiders, that follow hyperlinks from one page to the next. Spiders are automated programs that search the public Internet reading static web pages. The spider reports its results back to its home database. Those results are cataloged for general searching by users.
That technique is effective for identifying resources on the surface web. However, spidering technology often returns links based on popularity, not content. Those results do not necessarily show recent data or relevant information. Some web data contains robot/spider exclusions that block certain pages from being indexed. Password-protected material is also not indexed. Moreover, some creators submit their own web pages directly for listing with search engines. Nevertheless, search engines do not run database queries, because the number of potential queries against a database is effectively unlimited. Those web crawler programs do not type. They do not think. They do not input key words in separate search boxes in databases. Nor do they enter passwords.
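The robot exclusions mentioned above can be illustrated with a short Python sketch. It uses the standard library's robots.txt parser to decide which pages a rule-abiding crawler may fetch. The robots.txt text, the example.com addresses, the crawler name "ExampleBot" and the helper allowed_urls are all made up for illustration; they do not describe any real site or search engine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion file for an imaginary site (an assumption, not real data).
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /search
"""

# Hypothetical pages a crawler has discovered by following links.
CANDIDATES = [
    "https://example.com/index.html",
    "https://example.com/members/profile.html",
    "https://example.com/search?q=fraud",
]


def allowed_urls(robots_txt, urls, user_agent="ExampleBot"):
    """Return only the URLs that the exclusion rules let this crawler fetch."""
    parser = RobotFileParser()
    # Parse the rules from text so the sketch runs without any network access.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    for url in allowed_urls(ROBOTS_TXT, CANDIDATES):
        print(url)
```

In this sketch the members area and the site's own search results are excluded, so a compliant crawler would never pass them along for indexing, even though a person with a browser could reach them.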
Thus, databases that require individualized searches generate pages on demand and are not accurately reflected in Google web search results. If you want to fly to San Francisco, you can search Google, and you will be directed to airlines or to services offering discount airfares. You will not initially get the flight times and days you need, because you have not entered the appropriate search terms in the airline’s own search form. The actual flights you need to get to San Francisco are not shown in a Google search result, requiring you to do a deeper query at the airline web site. Hence, if you assume your Google or Bing searches will pull all data responsive to your investigation, your investigation will likely never “get off the ground.”
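The airline example can be sketched in Python to show what a crawler sees on such a page. The sample page, its addresses and the class name LinkAndFormScanner are hypothetical, and the scanner is a simplified stand-in, not how any particular search engine works. It collects the hyperlinks a spider could follow and notes the search form it would leave unfilled.

```python
from html.parser import HTMLParser

# Hypothetical airline page: two ordinary links and one flight search form.
PAGE = """
<html><body>
  <a href="/deals.html">Discount fares</a>
  <a href="/about.html">About us</a>
  <form action="/flights/search" method="post">
    <input name="destination"><input name="date">
  </form>
</body></html>
"""


class LinkAndFormScanner(HTMLParser):
    """Record hyperlinks (followable) and forms (require typed input)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            # A spider can queue this address and visit it later.
            self.links.append(attributes["href"])
        elif tag == "form":
            # Results behind a form exist only after someone submits a query.
            self.forms.append(attributes.get("action", ""))


def scan(html):
    """Return (links, forms) found in the given HTML text."""
    scanner = LinkAndFormScanner()
    scanner.feed(html)
    return scanner.links, scanner.forms


if __name__ == "__main__":
    links, forms = scan(PAGE)
    print("Followable links:", links)
    print("Forms left unfilled:", forms)
```

The two links lead to static pages that can be indexed. The flight results behind the form are generated only when a destination and date are typed in, which is why they stay out of general search results.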
Preparing An Effective Search: Seeking Walter’s Lab Location
Hidden web data does not mean the information cannot be accessed. In “Breaking Bad,” Walter White and his partner Jesse Pinkman, a former high school student of his, first start cooking methamphetamine in an RV and later in an underground room, all the while staying one step ahead of the DEA’s search for “Heisenberg” (Walter White’s pseudonym in the drug business). To pursue the location of Walter White’s lab, DEA Agent Hank Schrader had to establish search parameters in Albuquerque, New Mexico. Similarly, if you want to conduct an effective search of the hidden web, you have to plan your search rather than just type search words into Google or Bing.
Analyze your search topic. Where do you begin? Are there unique terms, jargon or phrases that describe your issue? For example, suppose your search involves “organizational fraud intelligence.” Are there equivalent terms or different ways to spell your search query? Should you consider using bureaucratic, departmental, or managerial in place of the word “organizational”? For “fraud,” should you instead use extortion, deceit, or scam? Compile a list of all terms and potential search queries using the alternate terms.
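Compiling that list by hand gets tedious once each word has several alternates. As a minimal sketch, the Python below combines the alternate terms from the example above into every possible query. The dictionary name SYNONYMS and the function build_queries are illustrative assumptions, and the word lists should be replaced with terms that fit your own claim.

```python
from itertools import product

# Alternate terms for each word of the example query "organizational fraud intelligence".
# The original word comes first in each list so the original query is generated too.
SYNONYMS = {
    "organizational": ["organizational", "bureaucratic", "departmental", "managerial"],
    "fraud": ["fraud", "extortion", "deceit", "scam"],
    "intelligence": ["intelligence"],
}


def build_queries(synonyms):
    """Return one query string for every combination of alternate terms."""
    return [" ".join(combo) for combo in product(*synonyms.values())]


if __name__ == "__main__":
    queries = build_queries(SYNONYMS)
    print(len(queries), "queries to try")
    for query in queries:
        print(query)
```

With four alternates for each of the first two words and one for the third, the sketch yields 16 queries, which can then be run one at a time against each database you identify.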
Start your search in the right place. Identify specific databases you want to search about your claim. Is there a directory for the data you are seeking? Are there organizations, people, groups or societies that may have the information you want? Do those organizations have databases you can access? Some databases may be pay for access only. Sometimes you get what you pay for. The University of Michigan developed OAIster, a searchable database that provides access to public materials from research and academic institutions.
Are there experts in your field of interest? What organizations do they belong to? Is there a discussion group/blog for those organizations? Are there searchable databases at those organizations?
Continue to refine your search terms and the databases you are individually searching. If your search strategy does not work, try another approach.
Deep Web Resources: Searching for the “Blue Sky” Formula
In “Breaking Bad,” Heisenberg develops a formula for “Blue Sky” methamphetamine that is 99.1% pure, described by his partner Jesse as “the bomb.” Assuming the hypothetical formula was on the hidden web, what resources need to be assessed to identify a database that might help explain the likely properties of “Blue Sky”?
If you need a mega portal to jump start your hidden web search, consider InfoMine. It has thousands of links to hundreds of databases collected by the University of California, Riverside under subject categories Bio, Ag & Med Services, Business & Economics, PhysSci, Engr, CS & Math to name just a few.
Another option is The Complete Planet, which provides what it calls a “comprehensive listing” of dynamic searchable databases that are not crawled or indexed by search engines, broken out into topic categories.
The Virtual Library provides a quick search option and category jumping-off points, while the Library Spot can be used to obtain an overview of the subject. Claims professionals should also consider the following resources: scientific material on Intute; global scientific updates at WorldWide Science; Science.gov; Google Scholar; and other similar specialized search engines. If you don’t know of a specific database to search, then consider a metasearch engine that combines results of several top search engines, such as Clusty.
The “Blue Sky” methamphetamine sold by Heisenberg in “Breaking Bad” took time to penetrate the drug market, and ultimately its success led to the downfall of the operation. Competitors wanted to take over the product or put Heisenberg and his partner Jesse out of business. Similarly, piercing the hidden web by searches takes trial and error. The old saying “Nothing ventured, nothing gained” comes to mind. The failure to pursue information on the hidden web means viable claim information about your issue of interest, claimant or insured will remain undiscovered. A false picture, much like that of Heisenberg as a meek high school teacher, will color perceptions of your claims investigation results. Go the extra mile by going beyond a generic Google or Bing search to specialized databases.
Peter A. Lynch is a partner in the subrogation and recovery department at Cozen O’Connor and a legal columnist for interfire.org and the California Conference of Arson Investigators. He can be reached at firstname.lastname@example.org or followed on Twitter @firesandrain.