Revealing the hidden knowledge in the data


PS Consultant, Teradata, a division of NCR

Aug 30 - Sep 05, 2004





Data mining is the process of identifying valid, novel, potentially useful, and ultimately comprehensible knowledge in a large volume of raw data, knowledge that is then used to make crucial business decisions. More broadly, it is a way of exploring, learning from and developing knowledge from detailed data.

The process of data mining discovers and interprets patterns in the data in order to solve a business problem. A business problem can be anything related to making the organization more profitable: improving customer satisfaction, reducing costs, increasing the number of profitable customers and, eventually, growing the organization.

Data mining enables businesses to explore diverse strategies in marketing, telecommunications, banking, sales, medicine and other research areas. The process looks for data patterns that help in identifying fraudulent transactions, increasing sales, targeting customers, gaining valued customers, and discovering unusual or disruptive customer behavior. Every year organizations save millions of dollars by identifying such hidden patterns in their data and taking proactive measures to reduce the losses they cause.

Ten years ago, Teradata, a division of NCR, pioneered the field of data mining by looking at sales data from a retailer and discovering that in the evening hours, beer and diapers are often purchased together. This relationship, called a data mining affinity, captured the imagination of industry watchers, setting off a legend that has been recounted hundreds of times and is frequently cited as the textbook example of data mining.
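To make the idea of an affinity concrete, here is a minimal sketch of how the strength of an item pairing is commonly measured with support, confidence and lift. The baskets are invented for illustration and the function name `pair_affinities` is hypothetical; this is not the method used in the retailer study described above.

```python
from collections import Counter
from itertools import combinations


def pair_affinities(baskets):
    """Return {(a, b): (support, confidence, lift)} for each item pair.

    support    = share of baskets containing both a and b
    confidence = share of baskets with a that also contain b
    lift       = confidence divided by the overall share of baskets with b
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates, fix pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    result = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        confidence = count / item_counts[a]
        lift = confidence / (item_counts[b] / n)
        result[(a, b)] = (support, confidence, lift)
    return result


# Hypothetical evening baskets, invented for illustration only.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "diapers", "milk"],
    ["bread", "chips"],
]

affinities = pair_affinities(baskets)
support, confidence, lift = affinities[("beer", "diapers")]
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than they would if purchases were independent, which is the kind of signal an affinity analysis looks for.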

In order to gain a practical perspective on how some of the world's leading industries used data mining to enhance their operations and product/service provisions, let's take a look at the following instances.

A European health care provider used data mining to discover that one patient was obtaining prescriptions at such a rate that he would have died had he actually consumed the prescribed dosage. On investigation, the authorities found out that the patient was feeding the medication to his ailing horse.

A national welfare agency used data mining to identify welfare frauds and identity thefts. The agency estimates it saves millions every year by avoiding fraudulent welfare claims and overpayments.

A financial institution used data mining to discover that credit applicants who used pencil on the form were much more likely to default than those who filled out their applications using ink.

A tax organization collected millions of dollars using data mining to uncover individuals and corporations that were avoiding or underpaying state tax.
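Cases like the prescription example come down to spotting a value far outside the normal range. The following is a minimal sketch of one common approach, the modified z-score based on the median absolute deviation. The patient identifiers, the counts and the 3.5 cut-off are assumptions chosen for illustration; the source does not say which technique the health care provider used.

```python
from statistics import median


def flag_outliers(values_by_id, threshold=3.5):
    """Return the ids whose value lies unusually far from the median.

    Uses the modified z-score: 0.6745 * |x - median| / MAD, where MAD is
    the median absolute deviation. The threshold of 3.5 is a common rule
    of thumb, assumed here rather than taken from the article.
    """
    values = list(values_by_id.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread in the data, nothing can be scored
    return [
        key
        for key, value in values_by_id.items()
        if 0.6745 * abs(value - med) / mad > threshold
    ]


# Hypothetical prescriptions per patient per month, invented for illustration.
prescriptions = {
    "P001": 4,
    "P002": 6,
    "P003": 5,
    "P004": 3,
    "P005": 5,
    "P006": 48,
}

suspicious = flag_outliers(prescriptions)
print(suspicious)
```

The median-based score is used instead of a mean and standard deviation because a single extreme value would otherwise inflate the very yardstick it is measured against.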



In general, organizations in many industries use data mining to gain and retain valued customers and to obtain a precise picture of their business. Banks, for instance, use data mining to monitor the movement of funds among accounts and to identify credit card fraud and other fraudulent transactions. Marketing and sales oriented organizations adopt data mining for targeting customers and for evaluating the effectiveness of different campaigns, cross-selling activities and other innovative ways of attracting customers. Telecom companies use data mining to identify customers who are likely to churn and to detect fraudulent use of the services offered by the Communication Service Provider (CSP).

Having gained some insight into the benefits that organizations have achieved through data mining, let us now examine a few fundamental issues related to its implementation.

The data mining process is initiated by identifying business problems and specifying the business objectives associated with each problem. This requires sound knowledge of, and deep involvement in, the business areas of the organization. Each problem then needs to be clarified, operationalised and prioritized. Before any of this can happen, an environment for the data mining exercise has to be defined, designed and built. Business-wide access to the data is recommended: the more users have access to the data, the greater the chance of identifying relationships between data items. The environment should therefore be built with scalability in mind, because the growth of the business and the emergence of new ideas resulting from the data mining implementation will lead to more users, more data and more sophisticated queries.

The data mining exercise conducts an exploratory analysis of the data and builds analytical models that put the findings of that exploration to use. During implementation, the analytical approach is validated and all analyses, procedures and results are documented. After proper verification and approval, the process is released to the production environment and monitored. Transfer of knowledge to the business users is essential, as they should be empowered to perform these analyses themselves and to incorporate new ideas and findings into the existing analytical models and processes.

The selection of the data mining tool is also important. The tool must offer the right choice of analytical algorithms, for example regression, time series, neural networks, decision trees and rule induction, together with a user-friendly presentation of the data analysis and results. The tool should also make it easy to incorporate and modify exploration methods so that further hypothetical analyses can be performed and more useful information revealed from the data. Moreover, it is important to select an appropriate source for the data mining. Detailed data is a good input for data mining applications.
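As an illustration of the simplest algorithm family named above, here is a minimal sketch of an ordinary least-squares regression with one input variable. The function `fit_line` and the campaign figures are hypothetical and invented for illustration; a real data mining tool would provide this and the other algorithms ready-made.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical campaign spend and resulting sales, invented for illustration.
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(spend, sales)
predicted = slope * 6 + intercept  # estimate sales at a spend of 6
print(slope, intercept, predicted)
```

Fitting on detailed records rather than pre-aggregated totals gives the model more observations to work with, which is one reason detailed data is recommended as the input.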

In short, a successful data mining process requires, above all, a thorough understanding of the business and a clear definition of the underlying problem, followed by careful selection of the data source (i.e. detailed data), attention to data quality, and the right choice of data mining tool, of the algorithms the tool supports and of the process model adopted for data mining. All factors that can be of use to the business should be taken into consideration in order to reveal everything that is hidden in the data.