Comprehensive Guide on Data Mining (and Data Mining Techniques)
Just hearing the phrase “data mining” is enough to make the average aspiring entrepreneur or new business owner cower in fear or, at least, approach the subject warily. It sounds too technical and too complex to understand, even for an analytical mind.
Out of nowhere, the thought of having to learn about highly technical, data-related subjects haunts many people. Many cave in and simply find other people to take care of that aspect for them. Worse, in other cases, they pay little attention to it, thinking they can get away with having nothing to do with data mining in their business.
Once they try to understand what data mining really is, they will realize that it is something that cannot be ignored or overlooked, since it is part and parcel of the management of a business or organization.
Businesses cannot do away with implementing or applying various business intelligence methodologies, applications and technologies in order to gather and analyze data providing relevant information about the market, the industry, or the operations of the business. It just so happens that data mining is one of the most important aspects of business intelligence.
WHAT IS DATA MINING?
Forget about any highly technical definition you may associate with data mining and let us look at it for the relatively simple concept that it truly is. Data mining is basically the process of subjecting available data to analysis by looking at it from different perspectives, to convert it into information that will be useful in the management of a business and its operations.
A simple way to describe data mining is that it is a process that aims to make sense of data by looking for patterns and relationships, so that it can be used in making business decisions.
For the longest time, many people have associated data mining with the image of a set of high-end computers utilizing equally high-end software and technology to obtain data and process them. This isn’t entirely wrong, because technology is definitely a huge and integral part of data mining. However, data mining is actually a broader concept, not just limited to the use of technology and similar tools.
Perhaps one of the biggest reasons why many are intimidated by the very mention and idea of data mining is the fact that it involves more than one or two disciplines. When we talk of data mining, we are talking about database management and maintenance, which automatically means the involvement or use of database software and technologies. It also often entails machine learning and a heavy reliance on information science and technology.
Further, the analysis of data, especially of the numerical kind, is bound to make use of statistics, which is another area that some people find complicated. This will also demand a lot in terms of visualization.
In short, being involved in data mining implies dipping one’s fingers and toes in more than a few rivers, so to speak, since it entails the use or application of multiple disciplines. This is what often makes data mining a challenge in the eyes of most people.
We can gain a deeper understanding of what data mining is by talking about its five major elements.
- Extraction, transformation and uploading of the data to a data warehouse system.
- Data storage and management in a database system.
- Data access to analysts and other users.
- Data analysis using various software, tools and technologies.
- Data presentation in a useful and comprehensible format.
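Under hypothetical data and naming assumptions, the five elements above can be sketched as a toy pipeline; real systems use databases and warehouses rather than in-memory lists.

```python
# Toy sketch of the five elements: extract, transform, load, access, present.
# All records, field names and rules here are hypothetical placeholders.

def extract(source):
    """Pull raw records from a source (here, just an in-memory list)."""
    return list(source)

def transform(records):
    """Normalize field names and types before loading."""
    return [{"product": r["product"].strip().title(), "units": int(r["units"])}
            for r in records]

def load(warehouse, records):
    """Store transformed records in the 'warehouse' (a list standing in for a database)."""
    warehouse.extend(records)
    return warehouse

def present(warehouse):
    """Summarize units sold per product in a readable form."""
    totals = {}
    for r in warehouse:
        totals[r["product"]] = totals.get(r["product"], 0) + r["units"]
    return totals

raw = [{"product": " soap ", "units": "3"}, {"product": "SOAP", "units": "2"}]
warehouse = load([], transform(extract(raw)))
print(present(warehouse))  # {'Soap': 5}
```

Note how the transformation step alone makes the two spellings of the same product aggregate correctly, which is the point of cleaning data before presenting it.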
IMPORTANCE OF DATA MINING
Businesses, organizations and industries share the same problems when it comes to data. Either they aren’t able to find the data that they require or, even if they know where to find it, they have difficulty actually getting their hands on it. In other cases, they may have access to the data, but they cannot understand it. Worse, the data may be readily available and comprehensible to them, yet, for one reason or another, they find that they are unable to use it.
This is where data mining comes in.
The main reason why data mining is very important is to facilitate the conversion of raw data into information that, in turn, will be converted into knowledge applicable for decision-making processes of businesses.
Data mining has become increasingly important, especially in recent years, as nearly all industries and sectors all over the world face a data explosion. All of a sudden, there is simply too much data, and this rapid rise in the amount of data demands a corresponding increase in the amount of information and knowledge. Thus, there is a need to quickly, efficiently and effectively process all that data into usable information, and data mining offers the solution. In fact, you could say that data mining is the solution.
You will find data mining to be most often used or applied in organizations or businesses that maintain fairly large to massive databases. The sheer size of their databases and the amount of information contained within them require more than a small measure of organization and analysis, which is where data mining comes in. Through data mining, users are able to look at data from multiple perspectives in their analysis. It will also make it easier to categorize the information processed and identify relevant patterns, relationships or correlations among the various fields the data or information belong to.
Therefore, we can deduce that data mining involves tasks of a descriptive and predictive nature. Descriptive, because it involves the identification of patterns, relationships and correlations within large amounts of data, and predictive, because its application utilizes variables that are used to predict their future or unknown values.
APPLICATIONS OF DATA MINING
The application of data mining is apparent across sectors and industries.
Retail and Service
The sale of consumer goods and services in the retail and service industries results in the collection of large amounts of data. The primary purpose of using data mining in these industries is to improve the firm’s customer relationship management, its supply chain management and procurement processes, its financial management, and also its core operations (which is sales).
The most common areas where data mining becomes highly effective among retail and service provider companies include:
- Promotion Effectiveness Analysis, where the company will gather and analyze data on past successful (and unsuccessful or moderately successful) campaigns or promotions, and the costs and benefits that the campaigns provided to the company. This will give the firm an insight on what elements will increase the chances of a campaign or promotion being successful.
- Customer Segmentation Analysis, where the firm will take a look at the responses of the customers – classified in appropriate segments – to shifts or any changes in demographics or some other segmentation basis.
- Product Pricing, where data mining will play a vital role in the firm’s product pricing policies and price models.
- Inventory Control, where data mining is used in monitoring and analyzing the movements in inventory levels with respect to safety stock and lot size. Lead time analysis also relies greatly on data mining.
- Budgetary Analysis, where companies will need to compare actual expenditures to the budgeted expenses. Incidentally, knowledge obtained through data mining will be used in budgeting for subsequent periods.
- Profitability Analysis, where data mining is used to compare and evaluate the profitability of the different branches, stores, or any appropriate business unit of the company. This will enable management to identify the most profitable areas of the business, and decide accordingly.
Manufacturing
Essentially, the areas where data mining is applied in manufacturing companies are similar to those in retail and service companies. However, manufacturing businesses also use data mining for their quality improvement (QI) initiatives, where data obtained through quality improvement programs such as Six Sigma and Kaizen, to name a few, are analyzed in order to solve any issues or problems that the company may be having with regard to product quality.
Finance and Insurance
Banks, insurance companies, and other financial institutions and organizations are also actively using data mining in their business intelligence initiatives. Risk management is generally the area where data mining is most utilized. Here, data mining is used to recognize and subsequently reduce the credit and market risks that financial institutions are almost always faced with. Other risks assessed with the help of data mining include liquidity risk and operational risk.
For example, banks and credit card companies use data mining for credit analysis of customers. Insurance companies are mostly concerned with gaining knowledge through claims and fraud analysis.
Telecommunication and Utilities
Organizations engaged in providing utilities services are also recipients of the benefits of data mining. For example, telecommunication companies are most likely to conduct call record analysis. Electric and water companies also perform power usage or consumption analysis through data mining.
The global popularity of cellular phones in almost all transactions has made them a playground for hackers and other security threats. This spurred Coral Systems, a Colorado-based company, to create FraudBuster, which is described as being able to “track down” types of fraud through data mining, specifically through analysis of cellular phone usage patterns in relation to fraud.
Transportation
In the transport industry, it is mainly all about logistics, which is why that is the area where data mining is most applied. Thus, logistics management benefits greatly from data mining. State or government transport agencies also use data mining for various projects, such as road construction and rehabilitation, traffic control, and the like.
Real Estate
The real estate industry relies heavily on information gleaned from property valuations which, in turn, result from the application of data mining. The focus is not entirely on the bottom line or the sales. Instead, data on property valuation trends over the years, as well as comparisons of appraisals, are tackled.
Healthcare and Medical Industry
Every day, research studies and experiments are conducted in the healthcare and medical industry, which means that tons of data are being generated every single day. Data mining is often an integral part of those studies.
STEPS IN DATA MINING
Data mining is a process, which means that anyone using it should go through a series of iterative steps or phases. The number of steps varies, with some sources packing the whole process into 5 steps. The outline below involves 8 steps, primarily because we have broken the phases down into smaller parts. For example, steps #2 through #5 are lumped together by other sources as a single step, which they call “Data Pre-processing”.
For purposes of this discussion, however, let us take each step one at a time.
Step #1: Defining the Problem
Before you can get started on anything, you have to define the objectives of the data mining process you are about to embark on. What do you hope to accomplish with the data mining process? What problems do you want to address? What will the organization or business ultimately obtain from it as benefit?
Step #2: Data Integration
It starts with the data, or the raw tidbit about an item, event, transaction or activity. The goal is to provide the users (those who are performing data mining) a unified view of the data, regardless of whether it comes from a single source or from multiple sources.
This step involves:
- Identification of all possible sources of data. Chances are high that the initial list of sources will be quite long and heterogeneous. Integrating these data sources will save you a lot of time and resources later on in the process.
- Collection of data. Data are gathered from the sources previously identified and integrated. Usually, data obtained from multiple sources are merged.
Data integration aims to lower the potential number and frequency of data redundancy and duplications in the data set and, consequently, improve the efficiency (speed) and effectiveness (accuracy) of the data mining process.
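As a sketch, integrating two hypothetical customer lists into a unified, de-duplicated view might look like this; real integration also reconciles schemas and formats.

```python
# Merging customer records from two hypothetical sources into one unified
# view, dropping duplicates by customer ID (a simple integration sketch).

crm_export = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]
web_signups = [{"id": 2, "name": "Ben"}, {"id": 3, "name": "Cara"}]

def integrate(*sources):
    unified = {}
    for source in sources:
        for record in source:
            unified.setdefault(record["id"], record)  # first source wins on conflicts
    return list(unified.values())

print(integrate(crm_export, web_signups))  # three unique customers: ids 1, 2, 3
```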
Step #3: Data Selection
After the first step, it is highly probable that you will be faced with a mountain of data, a large chunk of which are not really relevant or even useful for data mining purposes. You have to weed out those that you won’t need, so you can focus on the data that will be of actual use later on.
- Create a target data set. The target data set establishes the parameters of the data that you will need or require for data mining.
- Select the data. From all the data gathered, identify those that fall within the data set you just targeted. Those are the data you will subject to pre-processing.
Step #4: Data Cleaning
Also called “data cleansing” and “data scrubbing”, this is where the selected data will be prepared and pre-processed, which is very important before it can undergo any data mining technique or approach. Some data mining processes refer to data cleaning as the first of a two-step data pre-processing phase.
Data obtained, in their raw form, have a tendency to contain errors, inaccuracies and inconsistencies. Some may even prove to be incomplete or missing some values. Basically, the quality of the data is compromised. It is for these reasons that various techniques are employed to “clean” them up. After all, poor or low quality data is unreliable for data mining.
One of the biggest reasons for these errors is the data source. If data came from a single source, the most common quality problems that require cleaning up are:
- Data entry errors, mostly attributed to the ‘human’ factor, or errors by the person in charge of inputting data into the data warehouse. They could range from simple misspellings to duplication of entries and data redundancy.
- Lack of integrity constraints, such as uniqueness and referential integrity. Since there is only one source of data, there is no way of ascertaining whether the data is unique or not. In the same way, duplication and inconsistency may arise due to the lack of referential integrity.
Similarly, data obtained from multiple sources also have quality problems.
- Naming conflicts, often resulting from the fact that there are multiple sources of the same data, but named differently. The risk is that there may be data duplication brought about by the different names. Or it could be the other way around. More than one or two sources may use the same name for two sets of data that are completely unrelated or different from each other.
- Inconsistent aggregating, or contradictions arising from data being obtained from different sources. Duplications of data may result in the entries canceling each other out.
- Inconsistent timing, where data may tend to overlap, resulting in more confusion. The data then becomes unreliable. For example, data on the shopping history of a customer may overlap when sourced from various shopping sites or portals.
Cleaning up data often involves performing data profiling, or examining the available data and their related statistics and information, to determine their actual content, quality and structure. Other techniques used are clustering and various statistical approaches. Once the data has been cleaned, the record must be updated with the clean version.
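A minimal data-profiling pass, assuming hypothetical records, could count missing values and duplicate entries before cleaning begins:

```python
from collections import Counter

# A minimal data-profiling pass over hypothetical records: count missing
# values per field and flag duplicate entries before cleaning starts.

records = [
    {"name": "Ana", "city": "Lyon"},
    {"name": "Ben", "city": None},     # missing value
    {"name": "Ana", "city": "Lyon"},   # duplicate entry
]

def profile(records):
    missing = Counter()
    for r in records:
        for field, value in r.items():
            if value is None:
                missing[field] += 1
    seen = Counter(tuple(sorted(r.items())) for r in records)
    duplicates = sum(count - 1 for count in seen.values())
    return {"missing": dict(missing), "duplicates": duplicates}

print(profile(records))  # {'missing': {'city': 1}, 'duplicates': 1}
```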
Step #5: Data Transformation
This is considered to be the second data pre-processing step. Other authors even describe data transformation as part of the data cleaning process.
Despite having “cleaned” the data, they may still be incapable of being mined. To make the clean data ready for mining, they have to be transformed and consolidated accordingly. Basically, the source data format is converted into “destination data”, a format recognizable and usable when using data mining techniques later on.
The most common data transformation techniques used are:
- Smoothing. This method removes “noise” or inconsistencies in data. “Noise” is defined as a “random error or variance in a measured variable”. Smoothing often entails performing tasks or operations that are also performed in data cleaning, such as:
- Binning. In this method, smoothing is done by referring to the ‘neighborhood’ of the chosen data value, and categorically distribute them in ‘bins’. This neighborhood essentially refers to the values around the chosen data value. Sorting the values in bins or buckets will smooth out the noise.
- Clustering. This operation is performed by organizing values into clusters or groups, ordinarily according to a certain characteristic or variable. In short, data values that are similar will belong to one cluster. This will smooth and remove any data noise.
- Regression. As a method for smoothing noise in data values, linear regression works by determining the best line to fit two variables and, in the process, improve their predictive value. Multiple regression, on the other hand, also works, but involves more than two variables.
- Aggregation. This involves the application of summarization tactics on data to further reduce its bulk and streamline processes. Usually, this operation is used to create a data cube, which will then be used later for data analysis. A common example is how a retail company aggregates its sales data per period, so that it has figures on daily, weekly, monthly and annual sales.
- Generalization. Much like aggregation, generalization also leads to reduction of data size. The low-level or raw data are identified and subsequently replaced with higher-level data. An example is when data values on customer age is replaced by the higher level data concept of grouping them as pre-teen, teen, middle-aged, and senior. In a similar manner, raw data on families’ annual income may be generalized and transformed into higher-level concepts such as low-level, mid-level, or high-income level families.
- Normalization or Standardization. Data variations and differences can also have an impact on data quality. Large gaps can cause problems when data mining techniques are finally applied. Thus, there is a need to normalize the data. Normalization is performed by specifying a small and acceptable range (the standard), and scaling the data to ensure they fall within that range. Examples of normalization tactics employed are Min-Max Normalization, Z-Score Normalization, and Normalization by Decimal Scaling.
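Two of the normalization tactics named above, Min-Max and Z-Score, can be sketched in a few lines; the income figures below are hypothetical.

```python
import statistics

# Two common normalization tactics applied to hypothetical annual incomes:
# min-max scales into a fixed range, z-score centers on the mean.

incomes = [20_000, 35_000, 50_000, 80_000]

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

print(min_max(incomes))   # scaled into [0, 1]
print(z_score(incomes))   # centered at 0, in units of standard deviation
```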
Step #6: Data Mining
Data mining techniques will now be employed to identify the patterns, correlations and relationships within the data. This is the heart of the entire data mining process, involving the extraction of data patterns using various methods and operations.
The choice on which data mining approach or operation to use will largely depend on the objective of the entire data mining process.
The most common data mining techniques will be discussed later in the article.
Step #7: Pattern Evaluation
The patterns, correlations and relationships identified through data mining techniques are inspected, evaluated and analyzed. Evaluation is done by using “interestingness” parameters or measures to figure out which patterns are truly interesting, relevant or impactful enough to become a body of useful knowledge.
The interpretation in this step will formally mark the transformation of mere information into an entire “bag of knowledge”.
Step #8: Knowledge Presentation
The knowledge resulting from the evaluation and interpretation will now have to be presented to stakeholders. Presentation is usually done through visualization techniques and other knowledge representation mechanisms. Once presented, the knowledge may, or will, be used in making sound business decisions.
DATA MINING TECHNIQUES
Over the years, as the concept of data mining evolved and technology became more advanced, more and more techniques and tools were introduced to facilitate the process of data analysis. In Step #6 of the data mining process, the mining of the transformed data will make use of various techniques, as applicable.
Below are some of the most commonly used techniques or tasks in data mining, classified according to whether they are descriptive or predictive in nature.
Descriptive Mining Techniques
Clustering or Cluster Analysis
Clustering is, quite possibly, one of the oldest data mining techniques, and also one of the most effective and simplest to perform. As briefly described earlier, it involves grouping data values that have something in common, or have a similarity, together in a meaningful subset or group, which are referred to as “clusters”.
The grouping or clustering in this technique is natural, meaning there are no predefined classes or groups where the data values are distributed or clustered into.
Perhaps the most recognizable example of clustering used as a data mining tool is in market research, particularly in market segmentation, where the market is divided into unique segments. For instance, a manufacturer of cosmetic and skin care products for females may cluster its customer data values into segments based on the age of the users. Most likely the main clusters may include teens, young adults, middle age and mature.
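The age-based segmentation above can be sketched with a bare-bones k-means routine. This is an illustrative one-dimensional version with hypothetical ages and starting centers, not a production clustering tool.

```python
import statistics

# Bare-bones 1-D k-means: assign each age to its nearest center, then move
# each center to the mean of its cluster, and repeat. Data are hypothetical.

def k_means_1d(values, centers, iterations=10):
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [statistics.mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

ages = [13, 15, 16, 34, 36, 41]
centers, clusters = k_means_1d(ages, centers=[10, 40])
print(clusters)  # [[13, 15, 16], [34, 36, 41]]
```

The two clusters emerge naturally from the data; nothing in the code pre-labels an age as “teen” or “middle age”, which is exactly the distinction drawn above.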
Association Rule Discovery
The purpose of this technique is to provide insight on the relationships and correlations that associate or bind a set of items or data values in a large database. Analysis of data is done mostly by looking for patterns and correlations.
Customer behavior is a prime example of the application of Association Rules in data mining. Businesses analyze customer behavior in order to make decisions on key areas such as product price points and product features to be offered.
Incidentally, this technique may also be predictive, such as when it is used to predict customer behavior in response to changes. For example, if the company decides to launch a new product in the market, how will the consumers receive it? Association Rules may help in making hypotheses on how the customers will accept the new product.
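A rule such as “customers who buy bread also buy butter” is typically scored by its support and confidence; the market-basket data below is hypothetical.

```python
# Scoring an association rule over hypothetical market baskets.
# support(X): fraction of baskets containing X.
# confidence(X -> Y): support of X and Y together, divided by support of X.

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def support(itemset, baskets):
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(support({"bread", "butter"}, baskets))       # 0.5
print(confidence({"bread"}, {"butter"}, baskets))  # 2 of 3 bread buyers also buy butter
```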
Sequential Pattern Discovery
This mining technique is slightly similar to the Association Rule technique, in the sense that the focus is on the discovery of interesting relationships or associations among data values in a database. However, unlike Association Rule, Sequential Pattern Discovery considers order or sequence within a transaction and even within an organization.
Sequence Discovery or Sequence Rules is often applied to data contained in sequence databases, where the values are presented in order. In the example about customer behavior, this technique may be used to get a detailed picture of the sequence of events that a customer follows when making a purchase. He may have a specific sequence on what product he purchases first, then second, then third, and so on.
Concept or Class Description
This technique is straightforward enough, focusing on “characterization” and “discrimination” (which is why it is also often referred to as the Characterization and Discrimination technique). Data, or its characteristics, are generalized and summarized, and subsequently compared and contrasted.
A data mining system is expected to be able to come up with a descriptive summary of the characteristics or data values. That is the data characterization aspect.
For example, a company planning to expand its operations overseas is wondering which location would be most appropriate. Should they open an overseas branch in a country that experiences precipitation and storms for the greater part of the year, or should they pick a location that is mostly dry and arid throughout the year? Data characteristics of these two regions will be looked into for their descriptions, and then compared (or discriminated) for similarities and differences.
Predictive Mining Techniques
Classification
This method has several similarities with Clustering, which leads many to assume that they are one and the same. However, what makes Classification different is that there are already predetermined and pre-labeled instances, groups or classes. In clustering, the groups emerge naturally from the data; in classification, the groups are pre-defined, and it is into these groups that the data values will be sorted.
In Classification, the data values will be segregated into the predefined groups or instances and used in making predictions on how each of the data values will behave, depending on the behavior of the other items within the class.
An example is in medical research, when analyzing the most common diseases that a country’s population suffers from. The classifications of diseases already exist, and all that is left is for the researchers to collect data on the symptoms suffered by the population and classify them under the appropriate types of diseases.
Nearest Neighbor Analysis
This predictive technique is also similar to clustering in the sense that it involves taking the chosen data value in the context of the other values around it. However, while clustering groups data values in extremely close proximity to each other into the same cluster, nearest neighbor analysis is concerned with how near the data values being matched or compared are to the chosen data value.
In the cosmetic and skin care product manufacturing company example cited above, this technique may be used when the company wants to figure out which of their products are the bestsellers in their many locations or branches. If Product A is the bestseller in Location 1, and Location 10 is where Product J is selling like hot cakes, then the chances are greater that Location 2, which is nearer to Location 1 than Location 10 is, will also record higher sales for Product A more than Product J.
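The location example above can be sketched as a nearest-neighbor prediction; the coordinates, product labels and choice of k are all hypothetical.

```python
from collections import Counter

# Predicting the likely bestseller at a new branch from the bestsellers at
# its k nearest known branches (coordinates and products are hypothetical).

branches = [
    ((0, 0), "Product A"),
    ((1, 1), "Product A"),
    ((9, 9), "Product J"),
    ((10, 8), "Product J"),
]

def nearest_neighbor_predict(point, branches, k=3):
    def dist(p, q):
        # squared Euclidean distance is enough for ranking neighbors
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = sorted(branches, key=lambda b: dist(point, b[0]))[:k]
    labels = Counter(label for _, label in nearest)
    return labels.most_common(1)[0][0]

print(nearest_neighbor_predict((2, 2), branches))  # 'Product A'
```

The new branch at (2, 2) sits closer to the two Product A locations than to the Product J ones, so the majority vote among its three nearest neighbors predicts Product A.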
Regression Analysis
Regression techniques come in handy when trying to determine relationships between dependent and independent variables. It is a popular technique primarily because of its predictive capabilities, which is why you are likely to see it applied in business planning, marketing, budgeting, and financial forecasting, among others.
- Simple linear regression, which contains only one predictor (independent variable) and one dependent variable, resulting in a prediction. Presented graphically, the model finds the straight line that lies closest to the observed data points; that fitted line is then used for predictive purposes.
- Multiple linear regression, which aims to predict the value of the response with respect to multiple independent variables or predictors. Compared to simple regression, this is more complicated and work-intensive, since it deals with a larger data set.
Regression analysis is often used in data mining for purposes of predicting customer behavior in making purchases using their credit cards, or making an estimate of how long a manufacturing equipment will remain serviceable before it requires a major overhaul or repair. In the latter example, the company may plan and budget its expenditure on repairs and maintenance of equipment accordingly, and maybe even assess the feasibility of purchasing a new equipment instead of repeatedly spending more money on maintenance of the old one.
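The equipment example above can be sketched with a simple least-squares fit; the machine ages and maintenance-cost figures below are hypothetical.

```python
import statistics

# Simple linear regression by least squares: predict maintenance cost (y)
# from machine age in years (x). All figures are hypothetical.

x = [1, 2, 3, 4, 5]
y = [120, 150, 190, 210, 260]

def fit_line(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = fit_line(x, y)
print(f"predicted cost at age 6: {slope * 6 + intercept:.1f}")  # 288.0
```

With the fitted slope and intercept, management can project next year's maintenance budget and weigh it against the cost of replacing the equipment.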
Decision Trees
What makes this predictive technique very popular is its visual presentation of data values in a tree. The tree represents the original set of data, which is then segmented or divided into branches, with each leaf representing a segment. The prediction is the result of a series of decisions, presented in the tree diagram as Yes/No questions.
What makes this model even more preferred is how the segments come with descriptions. This versatility – offering both descriptive and predictive value in an easy-to-understand presentation – is the main reason why decision trees are gaining much traction in data mining and database management, in general.
Outlier Analysis
In instances where there are already established models or general behavior expected from data objects, data mining may be done by taking a look at the exceptions or, in this case, what we call the “outliers”. These are the data objects that do not fall within the established model or do not comply with the expected general behavior. The analysis of these deviations may yield data that can be used as a body of knowledge later on.
A classic example of applying outlier analysis is in credit card fraud detection. The shopping history of a specific customer already provides an e-tailer (online retail store) a set of general behavioral data to base judgments on. When trying to determine whether fraudulent purchases have been made using the credit card of that customer, the focus of the analysis will be unusual purchases in his shopping history, such as surprisingly large amounts spent on a single purchase, or the purchase of a specific item that is completely unrelated to all previous purchases.
If the customer, for the past three years, has made a purchase at least once every 2 months, a single month in which the customer purchases more than two or three times is enough to raise a red flag that his credit card may have been stolen and is being improperly and fraudulently used.
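The red-flag rule above can be sketched as a z-score outlier check over hypothetical monthly purchase counts.

```python
import statistics

# Flagging outliers in monthly purchase counts with a z-score rule:
# anything more than two standard deviations from the mean is suspicious.
# The counts below are hypothetical.

monthly_purchases = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 6]  # last month spikes

def outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

print(outliers(monthly_purchases))  # [6]
```

Only the month with six purchases is flagged; the ordinary alternation of zero and one purchase per month stays within the expected behavior.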
Evolution Analysis
When the data to be subjected to mining inherently changes or evolves over time, and the goal is to establish a clear pattern that will help in predicting the future behavior of the data object, a recommended approach is evolution analysis.
Evolution analysis involves the identification, description and modeling of trends, patterns and other regularities in the behavior of data objects as they evolve or change. Thus, you will often find it applied in the mining and analysis of time-series data. Stock prices, for example, are subjected to time-series analysis. The output will enable investors and stock market analysts to predict the future trend of the market, which will ultimately guide them in making their investment decisions.
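A minimal way to expose a trend in time-series data is a moving average, which smooths short-term noise so the underlying direction is easier to read; the daily price series below is hypothetical.

```python
# A simple moving average over a hypothetical daily price series: each
# output value is the mean of a sliding window of consecutive prices.

prices = [100, 102, 101, 105, 107, 106, 110]

def moving_average(series, window=3):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

print(moving_average(prices))  # smoothed series rises steadily
```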
There are a lot of other techniques used in data mining; we have named only a few of the most popular and most commonly used approaches. Applying these techniques also requires the use of other disciplines and tools, such as statistics, mathematics, and software management.
The success of a business rides a lot on how good management is at decision-making. And let us not forget that a decision will only be as good as the quality of the information or knowledge tapped by the decision-makers. High-quality information relies heavily on how the data is collected, processed and evaluated. If data mining was unsuccessful or less than effective in the first place, then there is a great chance that the resulting “bag of knowledge” will not be accurate or effective either, and poor business decisions may be arrived at.