Text Mining Classification Implementation in Java Using WEKA


This tutorial walks through an implementation of text mining classification using the WEKA tool.
The basic steps are given below:

1. Preprocessing:
   1. Remove special characters.
   2. Remove stop words.
   3. Tokenize data.
   4. Stem using WEKA's LovinsStemmer.
2. Identify distinct words.
3. Generate the document matrix.
4. Calculate TF(d,t).
5. Calculate IDF(t).
6. Calculate TF-IDF(d,t).
7. Generate the TF-IDF matrix.
8. Apply the weka.classifiers.trees.J48 classifier using the WEKA API.
9. Print the classifier results.

1. Preprocessing:
The first and most important step of text mining is data preprocessing. Since we do not know in advance what the structure or format of the selected data set will be, we preprocess the data to remove noise and make it consistent before any further analysis.
1. Remove Special Characters:
The first step is to get rid of all the special symbols that appear in the text, such as $, @, ., _, -, + etc. Since we are going to look for relations between words, we should have only textual and numeric data in hand before proceeding further.
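
A minimal sketch of this step in plain Java; the regular expression and the sample string are illustrative, not prescribed by the tutorial:

public class CleanDemo {
    // Replace everything except letters, digits and whitespace with a space.
    static String removeSpecialChars(String text) {
        return text.replaceAll("[^A-Za-z0-9\\s]", " ");
    }

    public static void main(String[] args) {
        System.out.println(removeSpecialChars("price: $25.99, e-mail: a_b@c.com!"));
        // -> "price   25 99  e mail  a b c com "
    }
}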
2. Remove Stop Words:
In computing, stop words are words which are filtered out before or after processing of natural language data (text). There is no single definitive list of stop words that all tools use, and such a filter is not always applied; some tools specifically avoid removing them to support phrase search.
Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as 'the', 'is', 'at', 'which', and 'on'. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as 'The Who', 'The The', or 'Take That'.
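A simple set-based sketch of stop word removal. The stop list here is a tiny illustrative sample; recent WEKA releases also ship ready-made stop word handlers (e.g. weka.core.stopwords.Rainbow):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopwordDemo {
    // Tiny illustrative stop list; real lists are far longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "at", "which", "on", "a", "an"));

    public static void main(String[] args) {
        StringBuilder kept = new StringBuilder();
        for (String token : "the cat is on the mat".split("\\s+")) {
            if (!STOP_WORDS.contains(token.toLowerCase())) {
                kept.append(token).append(' ');
            }
        }
        System.out.println(kept.toString().trim()); // cat mat
    }
}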

3. Tokenize Data:
In text mining, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing such as stemming and frequency counting. Here, each document is split on whitespace and punctuation so that every remaining word becomes a token in that document's bag of words.
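
A small sketch using WEKA's weka.core.tokenizers.WordTokenizer; the input sentence and the delimiter set are illustrative:

import weka.core.tokenizers.WordTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        WordTokenizer tokenizer = new WordTokenizer();
        // Split on whitespace and common punctuation.
        tokenizer.setDelimiters(" \r\n\t.,;:'\"()?!");
        tokenizer.tokenize("Text mining turns raw documents into feature vectors.");
        while (tokenizer.hasMoreElements()) {
            System.out.println(tokenizer.nextElement());
        }
    }
}
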
4. Stemming:
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms, a kind of query expansion called conflation.
Stemming programs are commonly referred to as stemming algorithms or stemmers.
We have used the WEKA stemmer LovinsStemmer (weka.core.stemmers.LovinsStemmer), as shown below.
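
A minimal example of stemming individual words with LovinsStemmer; the sample words are illustrative:

import weka.core.stemmers.LovinsStemmer;

public class StemDemo {
    public static void main(String[] args) {
        LovinsStemmer stemmer = new LovinsStemmer();
        // stem() maps each word to its Lovins stem.
        for (String word : new String[] {"playing", "played", "accuracy", "performances"}) {
            System.out.println(word + " -> " + stemmer.stem(word));
        }
    }
}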

2. Identify Distinct Keywords:
After all the preprocessing we have a bag of words for each of the documents selected as our data source. Once we have these bags of words, we check for occurrences of the same words across all the documents, e.g. 'play', 'accuracy', 'performance'.
This gives us the distinct words and their occurrences in all documents; the next step is to create the document matrix.
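
A sketch of collecting the distinct words and counting in how many documents each one occurs; the three toy documents are illustrative, and the per-term document counts computed here are reused later for IDF:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.TreeMap;

public class VocabularyDemo {
    public static void main(String[] args) {
        // Three toy documents, already preprocessed.
        String[] documents = {
            "play accuracy performance",
            "play play performance",
            "accuracy matters"
        };
        // Count in how many documents each distinct word occurs.
        Map<String, Integer> docFrequency = new TreeMap<>();
        for (String doc : documents) {
            for (String word : new HashSet<>(Arrays.asList(doc.split("\\s+")))) {
                docFrequency.merge(word, 1, Integer::sum);
            }
        }
        System.out.println(docFrequency);
        // {accuracy=2, matters=1, performance=2, play=2}
    }
}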

3. Generate Document Matrix:
Generate the document matrix with respect to all the keywords found across the documents: one row per document, one column per distinct term, where each cell holds the raw frequency of that term in that document.
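
A minimal sketch of building the raw document matrix, assuming the documents are already preprocessed and the vocabulary was collected in the previous step:

import java.util.Arrays;
import java.util.List;

public class DocumentMatrixDemo {
    public static void main(String[] args) {
        // Toy preprocessed documents and the vocabulary from the previous step.
        String[] documents = { "play accuracy", "play play performance" };
        List<String> vocabulary = Arrays.asList("accuracy", "performance", "play");

        // counts[i][j] = raw frequency of vocabulary term j in document i
        int[][] counts = new int[documents.length][vocabulary.size()];
        for (int i = 0; i < documents.length; i++) {
            for (String word : documents[i].split("\\s+")) {
                int j = vocabulary.indexOf(word);
                if (j >= 0) counts[i][j]++;
            }
        }
        for (int[] row : counts) System.out.println(Arrays.toString(row));
        // [1, 0, 1]
        // [0, 1, 2]
    }
}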

4. Calculate TF(d,t) – Term Frequency:
We have d documents and t terms in our document matrix. Using the raw frequencies, our TF(d,t) can be calculated as

TF(d,t) = 0                                   if freq(d,t) = 0
        = 1 + log(1 + log(freq(d,t)))         if freq(d,t) > 0
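
The formula translates directly into Java; the base of the logarithm is not specified above, so the natural logarithm is assumed here:

public class TermFrequencyDemo {
    // TF(d,t) = 0 if freq(d,t) = 0, else 1 + log(1 + log(freq(d,t)))
    static double tf(int freq) {
        return (freq == 0) ? 0.0 : 1.0 + Math.log(1.0 + Math.log(freq));
    }

    public static void main(String[] args) {
        for (int freq : new int[] {0, 1, 2, 10}) {
            System.out.printf("freq=%d -> TF=%.4f%n", freq, tf(freq));
        }
    }
}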

5. Calculate IDF(t) – Inverse Document Frequency:
The IDF of any term t can be calculated as

IDF(t) = log( (1 + |d|) / |dt| )

where
|d| is the total number of documents, and
|dt| is the number of documents in which term t occurs.
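
The IDF formula in Java, again assuming the natural logarithm; note that |dt| must be greater than zero, which holds for every term actually present in the collection:

public class IdfDemo {
    // IDF(t) = log((1 + |d|) / |dt|); requires |dt| > 0.
    static double idf(int totalDocs, int docsContainingTerm) {
        return Math.log((1.0 + totalDocs) / docsContainingTerm);
    }

    public static void main(String[] args) {
        System.out.printf("rare term:   %.4f%n", idf(1000, 10));
        System.out.printf("common term: %.4f%n", idf(1000, 900));
    }
}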

6. Calculate TF-IDF(d,t):
TF-IDF can be calculated as:
                        TF-IDF(d,t) = TF(d,t) * IDF(t)

7. Generate TF-IDF Matrix:
Generate the final matrix with respect to all the keywords found across the documents. Each cell (i,j) holds the TF-IDF value of document i and term j.
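
Putting the two formulas together, a sketch that turns the raw document matrix from step 3 into the TF-IDF matrix; the small counts array is illustrative:

import java.util.Arrays;

public class TfIdfMatrixDemo {
    static double tf(int freq) {
        return (freq == 0) ? 0.0 : 1.0 + Math.log(1.0 + Math.log(freq));
    }

    static double idf(int totalDocs, int docsContainingTerm) {
        return Math.log((1.0 + totalDocs) / docsContainingTerm);
    }

    public static void main(String[] args) {
        // Raw document matrix from step 3: rows = documents, columns = terms.
        int[][] counts = { {1, 0, 1}, {0, 1, 2} };
        int d = counts.length, t = counts[0].length;

        // Document frequency per term: number of documents with a non-zero count.
        int[] df = new int[t];
        for (int[] row : counts)
            for (int j = 0; j < t; j++)
                if (row[j] > 0) df[j]++;

        // Cell (i,j) = TF(i,j) * IDF(j)
        double[][] tfidf = new double[d][t];
        for (int i = 0; i < d; i++)
            for (int j = 0; j < t; j++)
                tfidf[i][j] = tf(counts[i][j]) * idf(d, df[j]);

        for (double[] row : tfidf) System.out.println(Arrays.toString(row));
    }
}
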
8. Apply weka.classifiers.trees.J48 classifier using WEKA API:
Feed the generated ARFF file to the WEKA API for classification. We have used the weka.classifiers.trees.J48 classifier of WEKA.
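A minimal sketch of the classification step with the WEKA API. The file name, the position of the class attribute and the evaluation setup are assumptions made for illustration; the -U and -M 2 flags visible in the output below are mirrored via setUnpruned and setMinNumObj:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        // Load the TF-IDF feature vectors (file name is illustrative).
        Instances data = DataSource.read("TFIDF.arff");
        // Assume the class attribute (ham/spam) is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setUnpruned(true); // mirrors the -U flag in the output below
        tree.setMinNumObj(2);   // mirrors -M 2

        // One common setup: 10-fold cross-validation with a fixed seed.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(42));

        // Build on the full data set so the tree itself can be printed.
        tree.buildClassifier(data);
        System.out.println(tree);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}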

Classifier Result:

Classifier...: weka.classifiers.trees.J48 -U -M 2
Filter.......: weka.filters.unsupervised.instance.Randomize -S 42
Training file: IFIDF.arff

J48 unpruned tree
------------------
: ham (1000.0/152.0)

Number of Leaves  : 1

Size of the tree :   1


Correctly Classified Instances         848               84.8    %
Incorrectly Classified Instances       152               15.2    %
Kappa statistic                          0    
Mean absolute error                      0.2578
Root mean squared error                  0.359
Relative absolute error                 99.7921 %
Root relative squared error             99.9998 %
Total Number of Instances             1000    

=== Confusion Matrix ===

   a   b   <-- classified as
 848   0 |   a = ham
 152   0 |   b = spam

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 1         1          0.848     1         0.918      0.494    ham
                 0         0          0         0         0          0.494    spam
Weighted Avg.    0.848     0.848      0.719     0.848     0.778      0.494






