Guide to Information Management: Data Classification, Search and Management

Six steps to ILM implementation

By Mike Karp, Network World, 10/28/06

Many things can spur a company to kick off an ILM project, but two reasons lead all the rest: a desire to implement storage tiers to reduce costs and the need to align corporate IT practices with regulatory compliance demands.

There is no need to do ILM if you do not have to, though the odds are that you do.

First, determine whether your company's data is answerable to regulatory demands. If you work for a U.S. company, this likely means checking out the California Privacy Law Compliance (SB1), Gramm-Leach-Bliley, Health Insurance Portability and Accountability Act, Real Estate Settlement Procedures Act or the Sarbanes-Oxley Act.

If you do business in Europe, also check out Basel II compliance. This is complicated stuff, and no one expects an IT manager to know about it. Immediately set up a meeting with someone in your legal department to discuss this.

Large organizations have compliance officers who are there for this sort of discussion. Be prepared to learn that the list of regulatory requirements above does not begin to scratch the surface.

Second, determine whether your company uses its storage in an optimal manner. If you have lots of data online, some of which is quite old, and if you have only one kind of storage, there's a good chance that high-value and low-value data are intermixed at your site. That's a pretty good indicator that you need to re-examine your storage strategy. Keep in mind that if your shop is run according to service-level agreements and you frequently fall out of conformance with those objectives, something is very wrong.

If either of the above typifies your situation, go on to Stage 2.

Understanding the value of data lies at the heart of ILM, which raises the question: What do you need to know in order to value the data properly? At the very least, identify the following: the file type, the users accessing the data and the key words used. This can be accomplished only by meeting with the data owners.

First, make a list of regulatory requirements that may apply. Get this from your legal department or compliance office. Don't assume that the people involved in the next step are aware of these requirements. In many cases, they are not. Bring this list with you during the next step.

Second, define stakeholder needs. You must understand what users need and what they consider to be nonnegotiable. Engaging early with the various lines of business helps focus everyone on what is really necessary and sufficient, and puts a human face on IT (sometimes a good idea). The result of such engagements is a set of SLAs that are driven by user requirements and give you a service-level objective, a targeted service level.

Third, verify the data life cycles. Everyone understands that data life cycles are a function of business value, but not everyone can take it from there. In some instances, key players can't even agree on when the data's value to the business shifts. Therefore, verify the value change for each life cycle with at least two other sources, a second source within the department that owns the data (if that is politically impossible, raise the issue through management), and someone familiar with the potential legal issues.

Fourth, define success criteria and get them widely accepted. Useful criteria are simple and easy to understand, and include cost savings, well-defined improvements in application or data availability, improved performance or recoverability and lowered incidences of being out of compliance with SLAs.

At this point it is time to identify the business value of each type of data object, which means understanding three things: what kind of data you are dealing with, who will be using it and what its keywords are. This stage is preliminary to doing the classifications. The strategy you deliver to your various organizations must emphasize mitigating risks in three areas: data security, data availability and data integrity. In the new nomenclature of IT, the way you go about this will be your policies.

First, create classification rules. This means assigning value based on criteria such as business importance, availability and performance requirements, and legal/regulatory/corporate governance rules. By doing this, you will be creating guidance that indicates where data is to be stored. The list, in the chart below, shows one way to go about this. The classification scheme will be your own, reflecting local needs, but whatever terminology you use, it will be useful to identify at least three classes of data.

* Mission-critical -- The most valuable, most heavily accessed data. Attributes: high performance, highest possible availability/protection; may require continuous data protection.

* Business-critical -- Valuable data with average access. Attributes: good performance, good availability, less-than-eight-hour recovery.

* Fixed content -- Compliance or reference data. Attributes: good performance, high availability; recovery typically depends on regulatory issues.

* Nonsensitive -- Rarely accessed, low-value data that still should be kept online. Attributes: low performance, low investment in hardware and protection services.
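
A classification rule set along the lines of the chart can be sketched as a simple rule function. The class names follow the chart; the metadata fields and thresholds (a regulated flag, access counts, business value) are illustrative assumptions, not any vendor's actual schema.

```python
# Illustrative sketch of rule-based data classification.
# Class names follow the chart above; the rule criteria
# (regulated flag, access frequency, business value) are
# assumed examples for illustration only.

def classify(record):
    """Assign one of the four data classes to a file's metadata record."""
    if record.get("regulated"):  # compliance or reference data
        return "fixed-content"
    if record.get("business_value") == "high" and record.get("accesses_per_day", 0) > 100:
        return "mission-critical"
    if record.get("business_value") in ("high", "medium"):
        return "business-critical"
    return "nonsensitive"

example = {"name": "q3_forecast.xls", "business_value": "high",
           "accesses_per_day": 250, "regulated": False}
print(classify(example))
```

In practice the criteria would come out of the stakeholder meetings and the regulatory checklist described earlier, not from IT alone.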

Second, build retention policies. Establish linkage between each class of data and the hardware and services they require, including rules governing when each data class should be moved to the next storage tier. At the very least, this means determining correct storage tiers, security levels, degree of data protection and migration strategies. For example, files answerable to Sarbanes-Oxley will typically require disk-based backup for rapid retrieval; for others, backup to tape is fine. Don't forget the lowest level of the hierarchy, the offline archives. You will find that in many cases a data class may really refer to only a single data set, and that archiving for one class of data may not be at all suitable for another class.
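
The linkage between data classes and the hardware and services they require might be expressed as a policy table plus a demotion check. The tier names, backup methods and retention periods here are illustrative assumptions, not prescribed values.

```python
# Sketch of retention policies linking each data class to a storage
# tier, a backup method and a migration rule. All tier names and
# periods below are assumed examples.

RETENTION_POLICIES = {
    "mission-critical": {"tier": "tier-1", "backup": "continuous", "demote_after_days": None},
    "business-critical": {"tier": "tier-2", "backup": "disk", "demote_after_days": 90},
    "fixed-content":     {"tier": "worm",   "backup": "disk", "demote_after_days": None},
    "nonsensitive":      {"tier": "tier-3", "backup": "tape", "demote_after_days": 30},
}

def next_tier_action(data_class, days_since_access):
    """Decide whether a file should be demoted to the next storage tier."""
    limit = RETENTION_POLICIES[data_class]["demote_after_days"]
    if limit is not None and days_since_access > limit:
        return "demote"
    return "keep"

print(next_tier_action("nonsensitive", 45))
```

Note how the Sarbanes-Oxley example from the text maps naturally onto such a table: a regulated class gets disk-based backup for rapid retrieval, while a low-value class can go to tape.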

Aligning data classes with the data life cycle predefines the events that will drive ILM policies, and plays a key role in enabling both automated resource provisioning and automated data migration. With few exceptions, this process will be the same at all sites - classification simply means aligning your stakeholders' business requirements to the IT infrastructure. Create a formal procedure that identifies each group's requirements, how it values its data and how satisfactory current performance levels are.

Talk is cheap, but working with vendors rarely stays that way. Most IT folks have been doing this for as long as they can remember, and the rules haven't changed just because they pertain to ILM products. In some cases, one vendor can supply all your needs. In most cases, however, that will not be true. Concentrate on vendors whose solutions can incorporate your legacy systems, and whose offerings include data classification capabilities. Don't be afraid to talk to consultants, read what the analysts have to say and compare notes with colleagues at other sites.

When you engage with the vendors, make sure to understand their products' capabilities in each of the following areas:

* Ability to tag files as compliant for each required regulation.

* Data classification.

* Data deduplication.

* Disaster recovery and business continuity.

* Discovery of compliance-answerable files across Windows, Linux, Unix and any other operating systems you may have.

* Fully automated file migration based on locally set migration policies.

* Integration with backup, recovery and archiving solutions already on-site.

* Searching (both tag-based and other metadata-based).

* Security (access control, identity management and encryption).

* Security (antivirus).

* Policy-based movement of files to appropriate storage devices (content-addressed storage, WORM tape).

* Identification and tagging of outdated, unused and unwanted files for demotion to a lower storage tier.

* Tracking access to and lineage of objects through their life cycle.

The point in the process at which you bring a vendor's product on board will depend on the product's capabilities. If it does automatic data classification, a good rule of thumb is to inject it into the process sooner rather than later. The ability to install and manage all of this in a manner that is nondisruptive to the workers at your site will carry significant value.

Do not be misled by the idea of a "data half-life." The value of data is not like a radioactive isotope's: it almost never decays at a constant rate throughout its life cycle. In fact, many data life cycles show that some data regains value after having lost it over time.

The data-classification stage must be revisited periodically for each set of data. A likely time for this: when the next year's set of SLAs is being written. If the SLAs don't change, there isn't likely to be a need to change the classification criteria. By doing this you ensure that the ILM policies you build today will continue to align with future application and data availability, performance and other requirements.

In theory, most information within organizations is easy to classify. In reality, classifying data can become quite subjective. Nowhere will this be more evident than with unstructured data.

Fortunately, several products from such vendors as Abrevity, Index Engines, Kazeon, Njini and StoredIQ, among others, are able to classify data. Just about every IT site will find that a product that classifies data and then automates its migration across the infrastructure provides excellent ROI. Once these steps have been completed you can take any actions necessary, and can look for solutions to automate the needed processes.

First comes the pilot project. In one sense at least, ILM projects are no different from any other: Test the waters before jumping in. A well-tested IT rule of thumb is as applicable here as it is with any other large IT initiative: Validate everything (strategy, procedures, the whole lot) with a pilot project before full corporate cutover.

A helpful hint: Identify an IT service that everyone interacts with (the most likely candidate at just about every site is e-mail) and begin there. Another way to view this is to choose a project that offers aggressive ROI, because it is likely to reduce costs of storage or management, or because it will likely provide measurable improvements in performance or in meeting service levels.

Phase in the next data sets and move to full implementation. After incorporating what you learned during the previous step, things get much easier. Determine some appropriate order for the phase-in. Again, this will be site-dependent. An encouraging word: Successful early deployments will have made your team comfortable with the process, and will enable them to extend implementation to business-critical systems with greater ease. Success will have bred success.

Vetting the ILM players

These leading players offer many of the hardware and software components you'll need for ILM.

By Mike Karp, Network World, 10/28/06

Can a single vendor supply everything you need for ILM? The short answer is an emphatic no. Unfortunately, investing in ILM is not as easy as buying a pastrami sandwich at the corner deli, where all the works are behind the same glass case.

It's likely that each vendor you speak with will have a different approach to ILM and that no one vendor will address the gamut of considerations and solution components you require.

Your solution will likely encompass components from multiple vendors and from your own legacy systems. Especially when it comes to data classification, the big vendors will typically approach you with a product mix that includes offerings from such specialty vendors as Abrevity, Index Engines, Kazeon, Mimosa, Njini and StoredIQ.

A list of the leading ILM vendors must include CA, EMC, HP, Hitachi Data Systems, IBM, Sun and Symantec, though many smaller companies offer key parts such as data classification and continuous data protection that may deliver exactly the value you need.

The following snapshots show each vendor's approach to ILM and some of their key offerings. All offer some degree of automation in their approaches.

CA pays lots of attention to the interrelationship between storage and data security, securing information at each phase of its life through its eTrust suite of identity and access management software. A secure infrastructure driven by ILM policies means that information can move around the enterprise based upon its business value and usefulness, while maintaining complete confidentiality and integrity.

CA sells software only, but has close alliances with most hardware vendors. Key strengths include records management, e-mail and other messaging, security, backup and recovery products. A year ago, CA extended its range of ILM products through the acquisition of iLumin and earlier this year with the acquisition of records-management firm MDY. Records and e-mail management are now key parts of the company's information management offering.

EMC takes a three-phased approach to ILM. Step one covers infrastructure tiering, data classification and process definition. Specific projects might include consolidating data, establishing a business-continuity system or putting a backup-to-disk method into place.

The second phase addresses application-specific issues such as identifying low-value data and defining policies for backup, recovery, archiving, regulatory compliance and managing unstructured content.

Phase three focuses on creating a unified approach to accessing, manipulating and protecting information across a site's applications, building enterprisewide information repositories that should provide integrated views of information assets. It's here that enterprise content-management or policy-based automation is implemented.

EMC recently acquired leading security firm RSA Security, and subsequently changed its tag line from "where information lives" to "where information lives securely." These days, secure ILM is a part of just about every EMC pitch.

HP positions its offerings not as a storage strategy but as an extension of a business strategy, linking processes, policies, technologies and products in specific implementations to control information capture, management, retention and delivery. Not surprisingly, ILM addresses more than digital data storage for HP. It also includes a wide range of other nondigital content from such business tools as mobile phones and PDAs, a slant that leverages the company's expertise in imaging, printing and personal systems as well as ILM and traditional computer products and services.

HP's products include a hardware and software mix of both homegrown and partner-developed applications. Particular attention is paid to data capture (handheld, laptop, imaging products), management (data protection, data migration, resource management), retention (backup, e-mail, CDP, electronic vaulting) and delivery (document delivery and hosted solutions). By partnering with Adic, HP also has a special focus on digital archiving for rich media.

HDS has tiered hardware and management products for open systems, mainframes and network-attached storage (NAS). The vendor emphasizes data replication and moving data nondisruptively across the various storage tiers, using its virtualization controllers as a front end to the external storage systems.

When it comes to ILM, HDS collaborates extensively with Arkivio, StoredIQ and a number of other partners to provide a comprehensive offering. HDS' contribution is the hardware (particularly storage devices and virtualization controllers) and the storage-resource management (its HiCommand suite, which manages discovery, tuning, tiering and a number of other storage aspects) that enable policy-based automated storage movement across storage tiers. The company also provides a suite of business-continuity tools that create data copies, replicating them across local heterogeneous tiers of storage and out to remote recovery sites.

For the last several years, on-demand computing and data access have been the basis of IBM's enterprise business strategy. The company recognizes that ILM plays a key role in successfully delivering information on demand, and particularly addresses regulatory compliance and general data-growth issues.

ILM offerings are broken into four solution areas: application and database archiving, data life-cycle management, e-mail archiving, and enterprise content management (imaging, digital asset management and digital content integration).

IBM also has created a number of hardware-software packages, some offering generalized ILM services and others specifically attuned to the requirements of vertical markets. The DR550, for example, comes preconfigured and integrated to help store, retrieve and manage regulated and nonregulated data. It features automatic provisioning, migration, expiration and archiving capabilities.

Sun approaches ILM by linking business intent (as expressed in plans, requirements and service-level agreements), operational processes (policy management and data classification) and storage management. The company offers both software and hardware products, but third-party offerings are also part of the mix.

Sun sorts its ILM offerings into four categories: security management (including role-based access management, authentication and encryption), retention management (compliance requirements, managing reference information, archiving, tiered storage and access controls), continuous data protection and infrastructure optimization.

Sun's Virtual Storage Manager underpins much of this, as does a content-management product that provides a single point of access for managing most types of electronic records. Probably most important is Sun's SAM-FS software, which provides data classification, centralized metadata management, policy-based data placement, protection, migration, long-term retention and recovery technology. A bundled compliance archiving system puts the company's compliance archiving software on an NAS appliance.

Symantec, mostly through its Veritas-developed products, has focused on archiving, storage management and data protection, though not within the context of ILM. Each of these, however, is an important aspect of ILM, and Symantec is changing to meet the times. It has recently concentrated much effort on managing e-mail and other forms of unstructured electronic content, building a product set to provide secure, searchable online archives, particularly in the e-mail space.

This centralized archive is crucial to Symantec's approach to ILM - no surprise considering the company's history in volume management. From this centralized archive it can address several of the forces that drive ILM adoption, particularly regulatory compliance, legal requirements management and improving the organization's ability to find information.

The heart of the Symantec ILM offering is an e-mail capture and management package integrating mailbox management, regulatory compliance and eDiscovery. Symantec solutions are software-only, but they interoperate with just about all vendor hardware.

Tracking Information Management Market Trends

By Deni Connor

Managing the glut of information is a huge undertaking made larger by the requirement to retain data and be able to classify it and recover it for e-discovery, compliance, data privacy or general business intelligence purposes.

"The biggest driver of any type of information retention and search in North America is electronic discovery [e-discovery] right now," says Brian Babineau, senior analyst for the Enterprise Strategy Group.

"There's a lot more to data than just how to keep it, prevent it from being deleted and keep it from being modified during the legal preservation process," Babineau says. "There is a need to analyze it and classify it based on taxonomies so that corporate counsels can actually take millions and millions of content information -- files, e-mail messages and databases -- and get to a reasonable corpus of information that is relevant evidence."

In December 2006, the Federal Rules of Civil Procedure (FRCP) were amended to include a company's obligation to readily disclose how electronically stored information was retained and how it could be retrieved as part of the discovery process in any legal proceeding that crossed state lines or was filed in a federal court.

The new e-discovery guidelines are forcing companies to adopt methods for retaining any information that might be part of a legally actionable claim, identifying it and retrieving it readily within the discovery process.

The privacy of information such as Social Security numbers and credit card data is also a strong focus for businesses involved in e-commerce and, as importantly, in protecting the confidential information of their employees and clients, Babineau says.

"The second biggest driver for information management systems is information privacy -- the ability to identify the relevant subset of information inside a corporation that contains sensitive or confidential data -- Social Security numbers or credit cards -- and then take the appropriate action whether it is to encrypt it, delete it, move it somewhere else, put it behind a firewall, whatever," says Babineau.

Enterprise Strategy Group research estimates that 47% of users believe that half of their organization's data could be considered confidential. But in many cases data isn't treated that way, and studies are finding that customers are ill-prepared to recover and protect that information in a timely and cost-efficient manner.

Whether it's the Payment Card Industry (PCI) standard, breach-notification laws in California and other states, or U.S. House Resolution 4127 (the Data Accountability and Trust Act), all of which call for notification when a data breach occurs, protecting this information from unauthorized eyes is important.

"The third area fostering information management is the general business intelligence -- how corporations use information to improve their business processes," says Babineau. "Businesses can learn more about their customers or take advantage of systems like Sharepoint and content management because they can search and see that a lot of invoices are stored on file servers that should be stored in a content management system where they can be managed more effectively. This gives businesses insight to say operations would be better off if they organized it. And then upon organization, the usefulness of the information becomes much greater."

Babineau says that the last, but not least important area of data retention, discovery and retrieval, is the storage utilization and information resource angle. "If a company can establish retention policies and periods for records retention and can move the data to the most appropriate tier of infrastructure based on its class or whatever, they will realize better utilization of corporate IT assets," says Babineau.

How exactly does information management software work?

By Deni Connor

Information management software incorporates a variety of functions that make useful information out of the glut of data stored on heterogeneous network attached storage (NAS) appliances, file servers and storage area networks (SANs).

As much as 80% of this data -- intellectual property, product plans, customer information, sales forecasts and personnel records -- exists as unstructured and semi-structured content in the form of word processing documents, e-mails, Adobe PDF files, spreadsheets, images, video, audio and content from Web sites. The remainder of the data consists of structured transactional data from databases.

In order for this data to be useful to the organization for litigation discovery, archiving, life-cycle management or business-intelligence purposes, it must be able to be discovered, indexed, classified, searched and reported on according to policies an IT administrator sets.

Discovery of data consists of electronically scanning for information on file servers, NAS and SAN devices and gathering information into a common repository that can be indexed, classified and searched.
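
As a rough sketch, the discovery step amounts to walking the mounted file-server paths and gathering per-file metadata into a common repository. The paths scanned and the fields collected here are illustrative assumptions; real products scan NAS and SAN devices over the network rather than local mounts.

```python
# Minimal sketch of the discovery step: walk file-system paths and
# gather per-file metadata into a common repository (here, a list of
# dicts). Paths and collected fields are illustrative assumptions.
import os
import time

def discover(roots):
    """Scan the given root paths and return a metadata repository."""
    repository = []
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue  # skip unreadable or vanished files
                repository.append({
                    "path": path,
                    "size": st.st_size,
                    "last_access": time.ctime(st.st_atime),
                })
    return repository
```

The repository built this way is what the classification and indexing stages described next would operate on.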

Classify and index
Once data has been discovered, it is classified and assigned a metadata reference that contains the name of the file, its size and other information that identifies it to the system. From there it can be acted on depending on rules IT sets. For medical images, IT would set a retention date for the data – in the case of pediatric records, for instance, the retention period may be 21 years. Adult images that have not been accessed in more than a year may have a rule applied to them that specifies when they are migrated to secondary storage.

Data is then indexed, and information such as the creator of the file, the date of last access and its format is stored. It is then grouped into categories, once again based on rules the IT manager sets, so that it can be managed, moved, copied, deleted, encrypted or otherwise acted upon.

With the classification groups created, the next step is creating policies that manage them. A policy consists of rules that define the characteristics of the data – ownership, age or content -- and the actions that must be performed on data matching the filter. Actions might include copying the data to a different storage tier, copying or moving it off to an archive device for compliance or packaging it for e-discovery processing by corporate legal or human resources representatives.
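
A policy as described here, a filter over data characteristics plus an action to perform on matches, can be sketched as follows. The field names and action labels are assumptions for illustration, not any product's actual rule syntax.

```python
# Sketch of policy evaluation: each policy pairs a filter over file
# characteristics (ownership, age, content) with an action to perform
# on matching records. Field names and actions are assumed examples.

def apply_policies(repository, policies):
    """Run each policy's filter over every metadata record and
    collect the (path, action) pairs for all matches."""
    actions_taken = []
    for record in repository:
        for policy in policies:
            if policy["matches"](record):
                actions_taken.append((record["path"], policy["action"]))
    return actions_taken

stale = {"matches": lambda r: r["age_days"] > 180, "action": "migrate-to-archive"}
legal = {"matches": lambda r: r["owner"] == "legal", "action": "package-for-ediscovery"}

repo = [{"path": "/fs1/old_invoice.doc", "age_days": 400, "owner": "finance"},
        {"path": "/fs1/contract.doc",    "age_days": 10,  "owner": "legal"}]
print(apply_policies(repo, [stale, legal]))
```

A production policy engine would of course execute the actions rather than merely recording them, but the filter-plus-action structure is the same.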

Search and analyze
Searches of data can also be performed, with either separate software or integrated search capabilities. These searches are done based on parameters the IT administrator sets. In e-discovery, for instance, the search may consist of finding all e-mails from 'Frank Green' to the manager of a hazardous waste disposal facility. For information lifecycle management, the search may target any file that has not been accessed in 180 days and can therefore be deleted. A search of structured database information may pull together all invoices for a customer in the last year or establish links between customer data.
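
The 180-day example above amounts to a simple filter over the metadata index. The index structure used here is an illustrative assumption.

```python
# Sketch of a metadata search: filter the index for files whose last
# access predates a cutoff (e.g. 180 days). The index layout is an
# assumed example.
from datetime import datetime, timedelta

def stale_files(index, days=180, now=None):
    """Return paths of files not accessed within the last `days` days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    return [entry["path"] for entry in index if entry["last_access"] < cutoff]

index = [{"path": "/nas/a.pdf", "last_access": datetime(2006, 1, 1)},
         {"path": "/nas/b.pdf", "last_access": datetime(2006, 10, 1)}]
print(stale_files(index, days=180, now=datetime(2006, 10, 28)))
```

An e-discovery search over sender and recipient fields, or a tag-based search, would follow the same filter pattern with different predicates.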

Once data has been collected, classified and searched, detailed reports may be necessary. In the case of e-discovery or compliance these reports provide an auditable chain of custody for evidence. They also may highlight files that are not appropriately secured.

Most information management software is supplied as a software-based appliance that attaches to the network via Gigabit Ethernet. A Web console lets users define classifications, rules and the actions to be taken on the collected information.

Most software packages from vendors such as StoredIQ, Kazeon and Njini are also tailored to a specific user benefit, whether it is e-discovery, information lifecycle management, business intelligence or information privacy.
