Text Mining Classification Implementation in Java Using WEKA
This tutorial walks through a text-mining classification implementation in Java using the WEKA tool.
The basic steps are given below:
1. Preprocessing:
   1. Remove special characters.
   2. Remove stop words.
   3. Tokenize data.
   4. Stemming using WEKA's LovinsStemmer.
2. Identify distinct words.
3. Generate the document matrix.
4. Calculate TF(d,t).
5. Calculate IDF(t).
6. Calculate TF-IDF(d,t).
7. Generate the TF-IDF matrix.
8. Apply the weka.classifiers.trees.J48 classifier using the WEKA API.
9. Print the classifier results.
1. Preprocessing:

The first and most important step of text mining is data preprocessing. Since we cannot know in advance the structure or format of the selected data set, we preprocess the data to remove noise and unstructured content, leaving a consistent, cleaner corpus to work with.
1. Remove Special Characters:

The first step is to get rid of all the special symbols in the text, such as $, @, ., _, -, and +. Since we are going to look for relationships between terms, we should have only textual and numeric data in hand before proceeding further.
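A minimal sketch of this step, assuming a simple regex-based cleaner (the class name and pattern are ours, not from the original code):

import java.util.regex.Pattern;

/** Strips everything except letters, digits and whitespace, then collapses runs of spaces. */
public class SpecialCharRemover {

    private static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^a-zA-Z0-9\\s]");

    public static String clean(String text) {
        String lettersAndDigits = NON_ALPHANUMERIC.matcher(text).replaceAll(" ");
        return lettersAndDigits.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // "Win $100 now!!! e-mail us" -> "Win 100 now e mail us"
        System.out.println(clean("Win $100 now!!! e-mail us"));
    }
}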
2. Remove Stop Words:

In computing, stop words are words that are filtered out prior to, or after, processing of natural-language data (text). There is no single definitive list of stop words that all tools use, and such a filter is not always applied; some tools specifically avoid removing stop words to support phrase search. Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common short function words, such as "the", "is", "at", "which", and "on". In such cases, stop words can cause problems when searching for phrases that include them, particularly in names such as 'The Who', 'The The', or 'Take That'.
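A minimal sketch, assuming a small hand-picked stop-word list (recent WEKA versions also ship configurable stop-word handlers, but the plain version below makes the idea explicit):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

/**
 * Drops common English stop words from a whitespace-separated string.
 * The stop-word list here is a tiny illustrative subset, not a definitive list.
 */
public class StopWordRemover {

    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "is", "at", "which", "on", "a", "an", "and", "of", "to"));

    public static String removeStopWords(String text) {
        StringJoiner kept = new StringJoiner(" ");
        for (String word : text.split("\\s+")) {
            if (!STOP_WORDS.contains(word.toLowerCase())) {
                kept.add(word);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // -> "quick fox jumps lazy dog"
        System.out.println(removeStopWords("The quick fox jumps on the lazy dog"));
    }
}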
3. Tokenize Data:

In text mining, tokenization is the process of breaking a stream of text up into individual words, phrases, or other meaningful elements called tokens. The resulting list of tokens becomes the input for the later steps, such as stemming and term counting. Because the special characters were already removed in the first step, a simple tokenizer that splits the text on whitespace and punctuation is sufficient here.
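A sketch using WEKA's built-in WordTokenizer (the sample sentence is ours):

import weka.core.tokenizers.WordTokenizer;

/**
 * Splits a cleaned string into tokens using WEKA's WordTokenizer,
 * which breaks on whitespace and common punctuation by default.
 */
public class TokenizeExample {

    public static void main(String[] args) {
        WordTokenizer tokenizer = new WordTokenizer();
        tokenizer.tokenize("win 100 now email us today");
        while (tokenizer.hasMoreElements()) {
            System.out.println(tokenizer.nextElement());
        }
    }
}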
4. Stemming:

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms, a kind of query expansion known as conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers. We have used the WEKA stemmer named LovinsStemmer.
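A sketch applying the LovinsStemmer to a few tokens (the sample words are ours; the exact stems depend on the Lovins rule set):

import weka.core.stemmers.LovinsStemmer;

/** Reduces each word to its stem using WEKA's implementation of the Lovins algorithm. */
public class StemmingExample {

    public static void main(String[] args) {
        LovinsStemmer stemmer = new LovinsStemmer();
        for (String word : new String[] {"playing", "played", "plays"}) {
            System.out.println(word + " -> " + stemmer.stem(word));
        }
    }
}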
2. Identify Distinct Keywords:

After all the preprocessing, we have a bag of words for each of the documents selected as our data source. Once we have the bags of words, we check for occurrences of the same words across all the documents, for example 'Play', 'Accuracy', 'Performance'. This gives us the distinct words and their occurrences in all documents; the next step is to create the document matrix.
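A minimal sketch of collecting the distinct terms, assuming each document has already been preprocessed into a list of tokens:

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Collects the vocabulary: every distinct term seen across all tokenized documents. */
public class VocabularyBuilder {

    public static Set<String> distinctTerms(List<List<String>> tokenizedDocs) {
        Set<String> vocabulary = new LinkedHashSet<>(); // keeps first-seen order for stable columns
        for (List<String> doc : tokenizedDocs) {
            vocabulary.addAll(doc);
        }
        return vocabulary;
    }
}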
3. Generate Document Matrix:

Generate the document matrix with respect to all the keywords available across the documents: each row represents a document, each column represents a distinct term, and each cell holds the raw frequency freq(d,t) of term t in document d.
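A minimal sketch of building the frequency matrix, reusing the vocabulary from the previous step:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Builds the document-term matrix: freq[d][t] is the count of term t in document d. */
public class DocumentMatrixBuilder {

    public static int[][] buildMatrix(List<List<String>> tokenizedDocs, Set<String> vocabulary) {
        List<String> terms = new ArrayList<>(vocabulary);
        int[][] freq = new int[tokenizedDocs.size()][terms.size()];
        for (int d = 0; d < tokenizedDocs.size(); d++) {
            for (String token : tokenizedDocs.get(d)) {
                int t = terms.indexOf(token); // a Map<String, Integer> index would be faster
                if (t >= 0) {
                    freq[d][t]++;
                }
            }
        }
        return freq;
    }
}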
4. Calculate TF(d,t) - Term Frequency:

We have d documents and t terms in the document matrix. Using the raw frequencies, TF(d,t) is calculated as:

TF(d,t) = 0                                if freq(d,t) = 0
TF(d,t) = 1 + log(1 + log(freq(d,t)))      if freq(d,t) > 0
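A direct translation of the formula into Java:

/**
 * Log-smoothed term frequency: 0 when the term is absent,
 * otherwise 1 + log(1 + log(freq(d,t))).
 */
public class TermFrequency {

    public static double tf(int freq) {
        if (freq == 0) {
            return 0.0;
        }
        return 1.0 + Math.log(1.0 + Math.log(freq));
    }
}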
5. Calculate IDF(t) - Inverse Document Frequency:

The IDF of any term t is calculated as:

IDF(t) = log( (1 + |d|) / |dt| )

where
|d| is the total number of documents, and
|dt| is the number of documents in which term t occurs.
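The same formula in Java (the guard for terms that occur in no document is our own convention; the formula assumes |dt| > 0):

/** Inverse document frequency: IDF(t) = log((1 + |d|) / |dt|). */
public class InverseDocumentFrequency {

    public static double idf(int totalDocs, int docsContainingTerm) {
        if (docsContainingTerm == 0) {
            return 0.0; // avoids division by zero for unseen terms
        }
        return Math.log((1.0 + totalDocs) / docsContainingTerm);
    }
}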
6. Calculate TF-IDF(d,t):

TF-IDF is calculated as:

TF-IDF(d,t) = TF(d,t) * IDF(t)
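Composing the two helpers above:

/** TF-IDF(d,t) = TF(d,t) * IDF(t). */
public class TfIdf {

    public static double tfIdf(int freq, int totalDocs, int docsContainingTerm) {
        return TermFrequency.tf(freq) * InverseDocumentFrequency.idf(totalDocs, docsContainingTerm);
    }
}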
7. Generate TF-IDF Matrix:

Generate the matrix with respect to all the keywords available across the documents. Each cell (i,j) holds the TF-IDF value of document i and term j.
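A minimal sketch that turns the frequency matrix from step 3 into the TF-IDF matrix, using the helpers above:

/** Builds the TF-IDF matrix from the raw document-term frequency matrix. */
public class TfIdfMatrixBuilder {

    public static double[][] buildTfIdf(int[][] freq) {
        int numDocs = freq.length;
        int numTerms = freq[0].length; // assumes at least one document
        double[][] tfidf = new double[numDocs][numTerms];
        for (int t = 0; t < numTerms; t++) {
            int docsContainingTerm = 0;
            for (int[] row : freq) {
                if (row[t] > 0) {
                    docsContainingTerm++;
                }
            }
            double idf = InverseDocumentFrequency.idf(numDocs, docsContainingTerm);
            for (int d = 0; d < numDocs; d++) {
                tfidf[d][t] = TermFrequency.tf(freq[d][t]) * idf; // TF-IDF(d,t) = TF(d,t) * IDF(t)
            }
        }
        return tfidf;
    }
}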
8. Apply weka.classifiers.trees.J48 classifier using WEKA API:

Feed the generated ARFF file to the WEKA API for classification. We have used the weka.classifiers.trees.J48 classifier of WEKA.
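A sketch of this step through the WEKA API. The file name IFIDF.arff and the classifier options -U -M 2 come from the run below; the 10-fold cross-validation and the class attribute being last are our assumptions, since the original evaluation setup is not shown:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Randomize;

/** Loads the TF-IDF ARFF file, shuffles it, and evaluates an unpruned J48 tree. */
public class J48TextClassifier {

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("IFIDF.arff");
        data.setClassIndex(data.numAttributes() - 1); // assumes the class is the last attribute

        // Mirror the weka.filters.unsupervised.instance.Randomize -S 42 filter from the run below.
        Randomize randomize = new Randomize();
        randomize.setRandomSeed(42);
        randomize.setInputFormat(data);
        data = Filter.useFilter(data, randomize);

        // J48 with the options from the run below: -U (unpruned), -M 2 (min instances per leaf).
        J48 tree = new J48();
        tree.setOptions(new String[] {"-U", "-M", "2"});

        // 10-fold cross-validation (an assumption; the original evaluation mode is not shown).
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        // Build on the full data set so the tree itself can be printed.
        tree.buildClassifier(data);
        System.out.println(tree);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
        System.out.println(eval.toClassDetailsString());
    }
}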
9. Print Classifier Results:

Classifier...: weka.classifiers.trees.J48 -U -M 2
Filter.......: weka.filters.unsupervised.instance.Randomize -S 42
Training file: IFIDF.arff

J48 unpruned tree
------------------
: ham (1000.0/152.0)

Number of Leaves  : 1
Size of the tree  : 1

Correctly Classified Instances         848               84.8   %
Incorrectly Classified Instances       152               15.2   %
Kappa statistic                          0
Mean absolute error                      0.2578
Root mean squared error                  0.359
Relative absolute error                 99.7921 %
Root relative squared error             99.9998 %
Total Number of Instances             1000

=== Confusion Matrix ===

   a   b   <-- classified as
 848   0 |   a = ham
 152   0 |   b = spam

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1        1        0.848      1       0.918      0.494     ham
               0        0        0          0       0          0.494     spam
Weighted Avg.  0.848    0.848    0.719      0.848   0.778      0.494