Monday, October 30, 2006

Weka

Weka is a Java program that provides a collection of data mining algorithms for data in arff (attribute-relation file format) files.

It was the main software used in a Machine Learning subject taught at honours and masters level at my university. The subject covers the theory and some practical aspects of many basic machine learning and data mining algorithms. I've had a play around with Weka on unix and on Windows. I prefer the unix environment because there Weka runs like any other Java program.

For example, running the Apriori association miner from the command line, followed by the results it produces:
>> java weka.associations.Apriori -t weather.nominal.arff

Apriori
=======

Minimum support: 0.15 (2 instances)
Minimum metric : 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12

Size of set of large itemsets L(2): 47

Size of set of large itemsets L(3): 39

Size of set of large itemsets L(4): 6

Best rules found:

1. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. outlook=overcast 4 ==> play=yes 4 conf:(1)
4. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
5. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
8. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2 conf:(1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2 conf:(1)

Installation on unix is fairly simple; the main thing is to make sure your CLASSPATH is set to include weka.jar.
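
Something like the following should do it under bash (the actual path depends on where you unpacked the distribution):

>> export CLASSPATH=$CLASSPATH:/path/to/weka/weka.jar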

I've also installed it on Windows and had a play around with that. The tools provided offer much more than I really need. The Explorer, KnowledgeFlow and Experimenter look interesting, but I never really had a play with them. I mostly used the SimpleCLI because I wanted to run some different basic mining over my data to see if it produced sensible results; this gives pretty much the same results as running from unix. Windows installation is fairly simple: there's a .exe file that installs everything for you. Just make sure you have Java 1.4/1.5 somewhere.

You can do all sorts of clever things with Weka. You can filter attributes, cluster data, build decision trees, do some linear regression, naive Bayes and so on. As long as you have suitable data in the format Weka was designed for, it will work.
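
For instance, building a J48 decision tree on the weather data is just another one-line call (the package path may differ between Weka releases):

>> java weka.classifiers.trees.J48 -t weather.nominal.arff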

My data was not quite built for Weka, so I had some problems. Weka takes files in arff format, where arff stands for attribute-relation file format. An example of the generic weather data set included with the download is
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no


So you say what the relation is called, then you describe the attributes and their possible values, and once those are defined the transactions can be listed. The format is fairly simple and not too difficult to reproduce.
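
If you want to get at the data from your own Java code rather than the command line, Weka's API makes this easy. A minimal sketch, assuming weka.jar is on the CLASSPATH and the weather file is in the current directory:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Parse the arff header and data section into Weka's in-memory representation
        Instances data = new Instances(new BufferedReader(new FileReader("weather.nominal.arff")));
        // Tell Weka which attribute is the class; here the last one (play)
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.numInstances() + " instances, " + data.numAttributes() + " attributes");
    }
}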

I tried to use it on my user-object permission assignment data to see whether any of the mining algorithms could produce anything interesting. The one I was most interested in was frequent pattern mining, which can be obtained through association mining.

While fundamentally an access control matrix can be put into the required format, the transactions produced would be very large. Arff transactions are all the same length, with every allowable attribute value defined up front. So in an access control matrix, either the users are attributes or the permissions are attributes; either way, the fixed transaction length would be enormous. Even with only a couple of transactions, Weka was unable to process the data, due to the long transaction length. Also, association rules in Weka do not accept numerical values, which is fine; I just need another way to represent 0 and 1 (yes/no).
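
To make this concrete, a hypothetical cut-down version with users as transactions and just three permissions as yes/no attributes would look like this; a real access control matrix needs one such attribute per permission (or per user), which is where the enormous transaction length comes from:

@relation user.permissions

@attribute perm1 {yes, no}
@attribute perm2 {yes, no}
@attribute perm3 {yes, no}

@data
yes,no,yes
no,yes,yes
yes,yes,no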

Under unix, the following error was produced:


>> java weka.associations.Apriori -t group.short.arff
Exception in thread "main" java.lang.OutOfMemoryError:
JVMXE006:OutOfMemoryError, stAllocArray for executeJava failed
at weka.associations.AprioriItemSet.mergeAllItemSets(Unknown Source)
at weka.associations.Apriori.findLargeItemSets(Unknown Source)
at weka.associations.Apriori.buildAssociations(Unknown Source)
at weka.associations.Apriori.main(Unknown Source)


The process is simply […killed] under Windows, even when I give the JVM more memory.
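
(For reference, the heap size can be raised with the JVM's -Xmx flag, e.g.

>> java -Xmx512m weka.associations.Apriori -t group.short.arff

but even generous settings did not get Apriori through this file.)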

This is unfortunate for our data: Weka, implemented in Java, is unable to process data sets with lengthy attribute lists. I had hoped that Weka could process this arff file, because it offers a suite of machine learning tools and had the potential to offer us more insight into what different mining techniques could find.

Although I do not need all the functionality of Weka, some supervisors suggested using a profiler in the NetBeans IDE to have a look at what Weka is doing wrong and why it chews through so much memory. Maybe I will, but since I only really need frequent pattern mining for now, I might investigate alternative methods for extracting what I need.

Comments:

K K W I E E R Computer Engg said...

I have used WEKA and found it to be very useful. But I did not understand the following.

Suppose I have some dataset D consisting of 1000 instances. I create two files: in the first file, say file1.arff, I copy all 1000 instances, and in the second file, say file2.arff, I randomly select only 200 instances from the same dataset D. If I now build two neural network models using the following two commands in the WEKA CLI, the size of the first model file is larger than the second. Why so? In fact, since the models are identical (number of layers, number of neurons in the input, hidden and output layers, and number of interconnections), the file sizes should also have been identical.

The first command is

java weka.classifiers.MultilayerPerceptron -t c:\file1.arff -d c:\file1.mod

and the second command is

java weka.classifiers.MultilayerPerceptron -t c:\file2.arff -d c:\file2.mod

You may check this with any dataset and look at the size of the ".mod" files on disk; they are different. Also note that the MultilayerPerceptron classifier, unlike the "J48" classifier, does not have the scheme-specific option "-L". If -L is used with J48, the sizes are different, as the model file also stores instance information for future visualization; if it is not used, instance information is not stored and thus the model file sizes are identical. Does this mean that NN models in WEKA always store instance information? Please comment.
