Getting Content Under Control Is Simply Complicated

I spent last weekend reconfiguring and rebuilding the search indexes for a client. We indexed something like 3.5 million items, consisting of about three terabytes of data. It was a lengthy process, which provided me a lot of time to think about best practices for document management, document storage, and enterprise search. The first thing that struck me was the sheer magnitude of the document store. Granted, my client is a pretty big company, but I find it difficult to believe all those files are necessary and required for the day-to-day operations of the firm. In fact, we weren't indexing and searching the totality of documents in the organization, only those items contained within a particular collaboration/content management system.

Most of the content did not have any metadata associated with it beyond the obvious: author, time of creation, type of file, etc. The net result was a full-text index of all those documents, which is not all that useful. Corporate documents tend to use the same terms over and over, so searches return results that are difficult to deal with.

Paper to Bits?

I have another client–a municipal government. One of its projects that I have been invited to participate in has a stated goal of rendering all existing historical documents into electronic format. The folks there currently have rooms stacked floor to ceiling with bankers' boxes stuffed with pieces of paper spanning back 80 years. I think all records before that were destroyed in a fire. When I asked them why they wanted them in electronic format, they could not formulate a truly excellent reason for the process. Since the scanning would render most of the records as images, they can't be searched and indexed. Optical character recognition is not a viable option because much of the information is handwritten and is not readily recognized with available OCR software. And even if it were rendered using OCR, the results then would need to be proofed for accuracy. This would put the project seriously over budget.

The argument can be made that electronic data lasts longer than physical information, but that is only valid if the media is upgraded and the files are transferred to new media on a regular basis. Hard drives have a limited life span, as do all other commonly used electronic storage media. I recently spoke with an application manager for a firm that has its most critical information stored on first-generation optical drives that are so old the staff is afraid to take them offline to transfer the data elsewhere. That is an extreme case, but if you truly want to persist electronic data far into the future, you had better plan on a new storage mechanism every five to seven years. Then there is the little problem of constantly changing file formats. I am not sure whether I still could read my old WordStar files, even if they weren't stuck on five-and-a-quarter-inch floppies.

Love the Tree

I hear a lot of talk about companies going paperless. And that is a noble sentiment–very green and all. In fact, my insurance carrier has been paperless for years, and I have a great deal of respect for the company. But don't try to go paperless just for the sake of going paperless. I saw a demonstration recently of a system that would scan (OCR) invoices into a content management system. It kept an image of the invoice in addition to the recognized data from the OCR scan. At that point, someone had to manipulate the data manually so it was assigned to the proper fields in the data list. All of a sudden going paperless seemed to be creating more work. My suggestion is the proper way to "go paperless" is to use electronic forms or electronic billing and e-commerce. Scanning a paper document into an electronic format may really not be the most efficient way to do things.

Smart Search?

The point I have been talking around here is that we have become so obsessed with document management and storage, we want to store and retain and search everything, whether the content is useful or not. We tend to dump all of our documents indiscriminately into some sort of ECM or storage system and then rely on a search engine to help track down the document we want or need. The fact is search engines aren't intelligent. They operate on rule sets created by humans, analyzing data provided by humans. If search engines were as smart as some people seem to think they are, why do we spend so much money and time on search engine optimization (SEO)? Shouldn't Google just "know" we provide the best property coverage in the Midwest and automatically put us above the fold? It doesn't work that way. I worked with a firm that literally dumped all its file shares into its new ECM system and then was puzzled why still no one could find anything.

Just the Facts

The simple fact is this: If a piece of information truly is necessary for the operation of your business, then that information needs to be tagged and classified in such a way it rises above the rubble and the rabble. Let's first look at the documents you need to run your business–those internal forms and documents that ensure employees get paid, can administer their 401(k)s, and can access their health benefits. Each of these documents should exist in one location–there must be one and only one "alpha" document. (When I use the word document, I do so in a broad generic sense to refer to any electronic file that contains some useful information.)

You probably will have other versions of a document somewhere in your system; obviously, some person or team had to work to create the document. But those prior versions should not be generally accessible. You should have secure work spaces where these documents/works-in-progress are stored. And by stored, I do not mean on someone's desktop or attached to an e-mail. Even a work in progress needs a single, true copy. These preliminary documents need to be readily accessible to the people working on them, via search or by navigating through a well-designed information architecture, but they should not be returned in the same search as the alpha document.

And this is where human intervention ensures our employees can access the information they need to be employees. A search scope for "human resources" should include only those pieces of information we have been calling the alpha copy. That means when I type 401(k) using the human resources search scope, only the current, correct documents are returned. It sounds so simple, and yet it is rarely accomplished. It takes work to apply the correct metadata to a document so it can be easily retrieved. And in my experience, very few organizations are willing to take the time or the effort to do that work.
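To make the idea of a search scope concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Document` fields (`department`, `is_alpha`, `tags`) and the `search` function are hypothetical and do not correspond to any particular ECM product or search engine. It only shows how a scope built on metadata narrows a plain full-text match down to the alpha copy.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    text: str
    # Hypothetical metadata fields; a real ECM system defines its own schema.
    department: str = ""
    is_alpha: bool = False
    tags: set = field(default_factory=set)


def search(docs, query, scope=None):
    """Return titles of documents whose text or tags match the query.

    When a scope (a department name) is given, only alpha copies tagged
    with that department are considered at all.
    """
    q = query.lower()
    results = []
    for doc in docs:
        if scope is not None and not (doc.is_alpha and doc.department == scope):
            continue  # outside the scope: drafts and other departments
        if q in doc.text.lower() or q in {t.lower() for t in doc.tags}:
            results.append(doc.title)
    return results


docs = [
    Document("401(k) Enrollment Form", "Enroll in the 401(k) plan.",
             department="human resources", is_alpha=True,
             tags={"401(k)", "benefits"}),
    Document("401(k) Enrollment Form DRAFT v3", "Enroll in the 401(k) plan (draft).",
             department="human resources", is_alpha=False),
    Document("Q3 sales memo", "Reminder: 401(k) match changes affect commissions.",
             department="sales", is_alpha=True),
]

scoped = search(docs, "401(k)", scope="human resources")  # alpha HR copy only
everything = search(docs, "401(k)")                       # all three documents
```

The scoped query returns only the current enrollment form, while the unscoped query also returns the draft and the unrelated sales memo. The filtering itself is trivial; the work is in getting `department` and `is_alpha` set correctly on every document.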

Inventory

Every group within the organization needs to perform an intellectual property inventory. All the file shares and ECM systems and e-mail storage must be sifted through. The purpose of this exercise is to determine what documents we really need to run our business. I guarantee my client in the opening paragraph does not require those 3.5 million items to operate efficiently. That doesn't mean they aren't important; it just means they don't need to be floating around with the truly "necessary" documents. I am, of course, speaking metaphorically–every file, every bit of information can successfully exist and coexist in the same ECM system. They just need to be separated through the use of metadata or stored in specialized silos defined in the taxonomy and information architecture of the organization.
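One mechanical piece of such an inventory is spotting exact duplicate files on a file share. The sketch below is one possible approach, not a prescribed tool: it assumes the share is reachable as an ordinary directory tree, and the function names `file_digest` and `find_duplicates` are made up for this example. It groups files by a SHA-256 hash of their content, so it catches byte-for-byte copies only, not near-duplicates or edited versions.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group files under root by content.

    Returns a list of groups; each group is a sorted list of paths whose
    contents are byte-for-byte identical. Files with unique content are
    left out.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Each group it reports is a candidate for collapsing down to one alpha copy. Deciding which copy survives, and how it is tagged, still takes a person who knows the content.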

Archive

I have 300 GB of storage on my primary work laptop with about 20 GB free. Now, about half of that storage is virtual machines, so I won't count those. I probably have 50 GB to 60 GB of actual data on the machine. I back up my stuff all the time–I have more external hard drives than I can enumerate. But the interesting thing is when I get a new machine or have a hard drive crash, I rarely need to fetch much of anything from my backups. The reason is all the data I really need is the data pertinent to my current project list. If I would practice what I preach, all my old project information would be archived to our project portal, and my machine always would be clean.

It probably doesn't matter if individuals clutter up their machines with unnecessary data as long as the alpha copies of the documents they own are in their proper place and properly tagged. But when you look at this same issue from the enterprise level, it does matter. Every document should have a clearly defined life cycle. When a project is completed, all project documents should be correctly archived and marked as final. At some defined interval, those archived files should be moved to some other sort of records storage and marked for permanent retention or deletion at the appropriate time.
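The life cycle described above can be sketched as a small state machine. This is an illustration under stated assumptions, not a records-management implementation: the stage names, the `ProjectDocument` fields, and the one-year `ARCHIVE_TO_RECORDS_AFTER` interval are all hypothetical, and a real retention schedule would come from your own policies and any regulations that apply.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Stage(Enum):
    ACTIVE = "active"
    ARCHIVED = "archived"
    RECORDS = "records storage"


# Hypothetical interval; pick whatever your retention policy defines.
ARCHIVE_TO_RECORDS_AFTER = timedelta(days=365)


@dataclass
class ProjectDocument:
    title: str
    stage: Stage = Stage.ACTIVE
    is_final: bool = False
    archived_on: Optional[date] = None
    retain_permanently: bool = False  # False means "delete when due"


def close_project(docs, today):
    """Archive every document in a finished project and mark it final."""
    for doc in docs:
        doc.stage = Stage.ARCHIVED
        doc.is_final = True
        doc.archived_on = today


def sweep(docs, today):
    """Move documents archived for at least the interval into records storage.

    Returns the titles of the documents that were moved.
    """
    moved = []
    for doc in docs:
        due = (doc.stage is Stage.ARCHIVED
               and today - doc.archived_on >= ARCHIVE_TO_RECORDS_AFTER)
        if due:
            doc.stage = Stage.RECORDS
            moved.append(doc.title)
    return moved


project = [ProjectDocument("Design spec"),
           ProjectDocument("Contract", retain_permanently=True)]
close_project(project, date(2024, 1, 1))
moved_early = sweep(project, date(2024, 6, 1))  # too soon, nothing moves
moved_later = sweep(project, date(2025, 1, 1))  # both documents move
```

The value of writing the rules down this way is that every document is always in exactly one stage, and the move to records storage happens on a schedule rather than whenever someone remembers to clean up.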

Compliance

Many industries are regulated, and regulated industries have clearly specified document retention, storage, and deletion rules and regulations. Regulated documents must be properly tagged and stored to meet compliance requirements. That does not imply only those documents should have special handling. Careful document tagging and creating search scopes based on that tagging will ensure only those documents that are in constant or current use will be returned in the primary searches. Even the most sophisticated and powerful search engines will provide more meaningful results when querying a smaller set of data. You still can have your massive set–search everything in the organization, including file shares and e-mails and the old intranet when you really need to track down some old, arcane document–but that search should prove the exception rather than the rule.

Where Is That Darn Form?

One of the biggest reasons corporate intranets (and by extension, enterprise content management systems) fail is that information workers can't find what they need. And that is why they squirrel things away on their personal file shares and hard drives when they do find them. I was talking with some workers from the human resources department of a client of mine. When I asked them what they spent most of the day doing, they said they took phone calls from people hunting for a particular document or form.

Out of that conversation grew a project to rearchitect the human resources intranet so users could find exactly what they wanted in two clicks or a single search. It took a little work to design and tag everything, but the end result is these same workers now can do their real jobs instead of fielding questions all day.

Now or Never

This is simple stuff. Inventory your intellectual property. Set aside the things that are truly important. Get rid of duplicates (only one alpha copy). Define and implement life cycles for data. Properly tag documents so they can be easily found. Design an information architecture and taxonomy for your intranet that matches the discoveries you made when you did your inventory. Create meaningful search scopes. Easy enough. Wonder why no one does it?


© Touchpoint Markets, All Rights Reserved. Request academic re-use from www.copyright.com. All other uses, submit a request to TMSalesOperations@arc-network.com. For more information visit Asset & Logo Licensing.
