Data Mining Blog

Tuesday, October 30, 2012

De-duplicating, merging customer records with clustering

Frustrated with multiple records of the same customer that differ only because of a typo, an abbreviation, or a different representation of the same address?

Duplicate customer records can be very tricky. They arise from abbreviated addresses, typos, and the many possible representations of the same name and address.

For example, both of these addresses refer to the same place:

John Street 23
John st. 23

Similarly, in the example below both names refer to the same person, but a typo and an abbreviation stop computers from easily identifying that they are in fact the same person.

Alphan Majar
Alp. Major

Even with powerful computers, it is difficult to identify these duplicates. We have developed a simple tool to address this problem.

Try Deduper !!

Deduper is a simple command line tool to merge duplicates in customer records. It is based on advanced string matching techniques and clustering. The technique is called blocked nearest neighbor clustering, and the tool further optimizes this general technique for the problem of merging customer records.

Deduper is a wrapper around the simile-vicino library. The open source tool Google Refine uses this library, and how the clustering works can be read in more detail from this page.
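
To give a feel for the idea of blocked nearest neighbor clustering, here is a minimal Python sketch. It is not Deduper's or simile-vicino's actual code: the blocking key (first letter of the record), the similarity measure (difflib's ratio) and the 0.7 threshold are all assumptions picked for illustration. Records are first split into blocks, and only records inside the same block are compared with each other, which avoids comparing every record with every other record.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(record):
    """Hypothetical blocking key: first alphanumeric character, lowercased."""
    cleaned = "".join(ch for ch in record.lower() if ch.isalnum())
    return cleaned[:1]


def similar(a, b, threshold):
    """Illustrative similarity test based on difflib's matching ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster(records, threshold=0.7):
    """Group records that look like duplicates, comparing only within blocks."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)

    clusters = []
    for members in blocks.values():
        block_clusters = []
        for record in members:
            # Join the first cluster that already holds a close neighbour.
            for group in block_clusters:
                if any(similar(record, other, threshold) for other in group):
                    group.append(record)
                    break
            else:
                block_clusters.append([record])
        clusters.extend(block_clusters)
    return clusters


records = ["John Street 23", "John st. 23", "Alphan Majar", "Alp. Major", "Baker Road 7"]
for group in cluster(records):
    print(group)
```

A real tool would need a smarter blocking key, since two duplicates that land in different blocks are never compared.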

Give it a try. We will be happy to hear how it helped you.

Deduper can be downloaded from the link: http://sourceforge.net/projects/deduper/

Friday, October 12, 2012

Optimization plugin for RapidMiner

Optimization in general means selecting the best choice out of various alternatives, the one that minimizes the cost or disadvantage measured by an objective. Optimization problems are very common in fields such as economics, finance and logistics. Optimization is a science of its own, while machine learning or data mining is a diverse, growing field that applies techniques from various other areas to find useful insights in data. Many machine learning problems can be modeled and solved as optimization problems, which means optimization already provides a set of well established methods and algorithms for solving them. Because optimization is so important to machine learning, machine learning researchers have in recent times been contributing remarkable improvements to the field of optimization. We have implemented several popular optimization strategies and algorithms as a plugin for RapidMiner, which adds an optimization toolkit to RapidMiner's existing arsenal of operators.
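
As a small illustration of a machine learning problem posed as an optimization problem, the Python sketch below fits a straight line by minimizing the mean squared error with plain gradient descent. It is a generic textbook example, not code from the plugin, and the function name, learning rate and step count are assumptions chosen for this toy data.

```python
def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Fit y ~ w*x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the objective (1/n) * sum((w*x + b - y)^2).
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
w, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(w, b)
```

The objective here is the cost, and the alternatives are all possible pairs (w, b); the learning step is nothing more than the search for the cheapest pair.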

The optimization plugin for RapidMiner is available for download from the link
https://bitbucket.org/venkatesh20/optimization-extension

Wednesday, June 20, 2012

Understanding Job market using Probabilistic Graphical Models

Here, I present my idea on understanding the job market using probabilistic graphical models. It is also a simple and practical example of Bayesian networks.

[Embedded slide presentation by vumaasha]

Wednesday, October 5, 2011

25th Place in Hearst Challenge

In my earlier post, I shared the code for data preparation for the Hearst Challenge. The final results of the Hearst Challenge have been announced. Glad to know that my simple model got 25th place in the challenge.

A few lines on the model: I used upsampling to handle the class imbalance, since the number of negative samples was very high compared to the number of positive ones. I created 10 subsets of the original data, each containing upsampled positive samples and randomly sampled negative ones. The final model was an ensemble with one SVM classifier per training subset. I used SVM perf for the SVM training. This shows that a simple model does help, though it is not the best.
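
The subset construction can be sketched roughly as follows in Python. This is a reconstruction of the idea only, not the competition code: the subset size (number of negatives divided by the number of subsets), the equal number of positives and negatives per subset, and averaging the scores are assumptions, and the per-subset models are represented by plain callables instead of real SVM perf models.

```python
import random


def make_balanced_subsets(positives, negatives, n_subsets=10, seed=0):
    """Build training subsets with upsampled positives and sampled negatives.

    Assumption: every subset holds the same number of positives and negatives.
    """
    rng = random.Random(seed)
    size = max(1, len(negatives) // n_subsets)
    subsets = []
    for _ in range(n_subsets):
        pos = [rng.choice(positives) for _ in range(size)]  # with replacement
        neg = rng.sample(negatives, size)                   # without replacement
        subsets.append((pos, neg))
    return subsets


def ensemble_score(models, example):
    """Average the decision scores of the per-subset models (assumed rule)."""
    return sum(model(example) for model in models) / len(models)
```

Each (pos, neg) pair would then be written out as one training file and handed to the SVM trainer, giving one model per subset.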

Wednesday, September 28, 2011

R gets closer to Hadoop

Revolution Analytics, a provider of Enterprise R, is partnering with Cloudera to improve the integration of R and Hadoop that it has already developed. Through this integration, R will be available by default to Cloudera Hadoop users. For more details: http://blog.revolutionanalytics.com/2011/09/revolution-analytics-partners-with-cloudera.html

[Embedded presentation: Revolution Analytics, from templedf]

Friday, September 23, 2011

DataMining Tools Catching up with Big Data

With the data explosion growing every day, the well established data mining tools are getting ready to attack Big Data with the help of the Hadoop framework. Hadoop is an open source implementation, written in Java, of Google's MapReduce. It provides a framework for massively parallel and distributed computing on commodity hardware.
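
For readers new to MapReduce, the programming model can be imitated in a few lines of plain Python. The sketch below does not use Hadoop or any of its APIs; it only mimics the map, shuffle and reduce phases in a single process, with word counting as the usual toy problem. All function names are made up for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big tools"]))
```

On a real cluster the map and reduce calls run on different machines and the framework takes care of the shuffle, which is why an algorithm has to fit this shape before it can benefit from Hadoop.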

Mahout is a machine learning framework built on top of the Hadoop framework, which implements a few of the machine learning algorithms.

R is a well known animal among statisticians and is also widely used by data miners. R is being integrated with Hadoop by Revolution Analytics. For more details visit the link below, where a white paper and a presentation are available for download: REVOLUTION WEBINAR: LEVERAGING R IN HADOOP ENVIRONMENTS

RapidMiner is another good data mining tool that is available for free to the community and practitioners. Rapid-I is working on integrating Hadoop with RapidMiner and has come up with Radoop, which integrates RapidMiner with Hadoop and Mahout and aims to provide an easy user interface to Mahout and Big Data analytics.

Everyone is in a rush to keep pace with Big Data. It would be great if Weka, a widely used open source machine learning tool that is memory based and implemented in Java, also got rewritten to leverage the advantages of Hadoop. Of course, we should keep in mind that not all algorithms can be implemented using MapReduce, and integrating Weka with Hadoop could be a daunting task.

Friday, August 5, 2011

Data preparation code for Hearst Challenge Data Mining Competition

http://code.google.com/p/hearstchallenge/

This code project contains the data preprocessing scripts, written in Python, for the Hearst Challenge. They convert the Hearst Challenge data into svm light format. The script currently supports only the svm light format, but it can easily be modified to write the data in some other format.

1. Concatenate the Modeling files into a single file
Modify the directory path in concatFiles.py to the path that contains the Modeling_n.csv files and run
python concatFiles.py
This creates a concatenated file newModeling.csv
2. Data preprocessing
(i) Most of the categorical attributes and IDs are converted into binary nominal attributes,
e.g. ETECH_GROUP, ETHNICITY_DETAIL, EXPERIAN_INCOME_CD_V4, etc.
(ii) The attribute City contains 15478 distinct values. To convert it into a numerical value, I used the distance between each city
and the geographic center of the USA. To see how I did it, look at distinctCityState.py and getCityCoordinates.py
(iii) Trait attributes are converted into 73 binary nominal attributes for the traits listed in the data dictionary
(iv) The code contains a lot of comments and is self explanatory
(v) Place the concatenated Modeling file, the Validation file and the files distFromCentreOfUSA_new.csv and traits.csv in a directory,
configure this directory as basedir in the file hearst2svmlite.py and then run
python hearst2svmlite.py
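
The city-to-distance idea from step (ii) can be sketched as below. This is not the code from getCityCoordinates.py; it is a minimal illustration that assumes the city coordinates are already known and uses the haversine formula. The center coordinates (near Lebanon, Kansas, the center of the contiguous states) and the function names are assumptions, and the original script may have used a different center or distance formula.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumption: approximate geographic center of the contiguous USA.
CENTER_LAT, CENTER_LON = 39.8283, -98.5795


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_from_center(lat, lon):
    """Numeric replacement for a city: its distance to the assumed center."""
    return haversine_km(lat, lon, CENTER_LAT, CENTER_LON)
```

This replaces a nominal attribute with thousands of values by a single number, at the price of mapping cities on the same circle around the center to the same value.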

You will get the below files:
distinctVal.txt -
open_flag.train - svm light training file for open_flag
click_flag.train - svm light training file for click_flag
valid.test - scaled validation file in svm light format for prediction
valid_id.test - contains new_id,new_mailing_id for each line in the validation test file
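
For those who have not seen the svm light format before: every line is a target (+1 or -1) followed by feature:value pairs with increasing feature numbers. The sketch below shows how categorical attributes could be turned into binary features and written as such a line. It is a simplified illustration, not an excerpt from hearst2svmlite.py, and the function names and the attribute values "A", "B" and "X" are made up.

```python
def build_feature_index(rows, categorical_columns):
    """Assign a 1-based feature number to every (column, value) pair seen."""
    index = {}
    for row in rows:
        for column in categorical_columns:
            key = (column, row[column])
            if key not in index:
                index[key] = len(index) + 1
    return index


def to_svmlight_line(row, label, index, categorical_columns):
    """Format one row as '<target> <feature>:1 ...' with ascending features."""
    target = "+1" if label else "-1"
    feature_ids = sorted(
        index[(column, row[column])]
        for column in categorical_columns
        if (column, row[column]) in index  # unseen values are skipped
    )
    return " ".join([target] + ["%d:1" % fid for fid in feature_ids])


columns = ["ETECH_GROUP", "ETHNICITY_DETAIL"]
rows = [
    {"ETECH_GROUP": "X", "ETHNICITY_DETAIL": "A"},  # made-up values
    {"ETECH_GROUP": "X", "ETHNICITY_DETAIL": "B"},
]
index = build_feature_index(rows, columns)
for row, label in zip(rows, [True, False]):
    print(to_svmlight_line(row, label, index, columns))
```

Only the features that are switched on are written, which keeps the files small even with more than a thousand binary attributes.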

The general problem with the data is that it contains a very small percentage of positives: 92% of the training data are negatives and only 8% are positives, so there is the challenge of handling a high class imbalance.

You are welcome to use the code and I will be happy to hear from you if you have anything to say.

I got this question from Vivek (vivek.vichare@gmail.com); you can find my answer below:

Hey Venkatesh,

I am a novice in the field of data mining and came across your Python code for the Hearst challenge. I have tried to participate in the challenge myself and submitted using the GBM algo, but without great results.

I tried your data preprocessing Python code and it works to create datasets for SVM. I wished to try out SVM myself but am facing challenges as to how to run svmlight: what platform to use and how to go about it. It would be great if you could provide some direction.

Thanks,

Vivek

Answer:
Hi Vivek,
Please post this question to my blog comments, as it will be useful for others who are following the code as well.

1. I updated the scripts yesterday; make sure you generate the svmlight files with the latest version
2. There is a huge class imbalance in the data, 92% -ve and only 8% +ve classes, hence you need to use some technique such as sampling to handle the class imbalance
3. There are more than 1000 features in the generated svmlight file, so you may need to do some feature extraction / feature selection before training the model
4. For general instructions on running SVM, refer to this link: A Practical Guide to Support Vector Classification

Good Luck !!
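
For anyone else with the same question, a rough Python sketch of how the two SVMlight programs could be called on the generated files is given below. It assumes the svm_learn and svm_classify binaries are already built and on the PATH; the model and prediction file names are hypothetical, and using the -j cost factor with a value of 11.5 (92 / 8) is only one possible starting point for the class imbalance, not the setting I used.

```python
import subprocess


def svm_learn_command(train_file, model_file, cost_factor=None):
    """Build the svm_learn call; -j weights errors on positive examples."""
    command = ["svm_learn"]
    if cost_factor is not None:
        command += ["-j", str(cost_factor)]
    return command + [train_file, model_file]


def svm_classify_command(test_file, model_file, predictions_file):
    """Build the svm_classify call that writes one score per test line."""
    return ["svm_classify", test_file, model_file, predictions_file]


def run(command):
    """Run one command and fail loudly if the binary returns an error."""
    subprocess.run(command, check=True)


# Hypothetical file names for the model and the predictions.
learn = svm_learn_command("open_flag.train", "open_flag.model", cost_factor=11.5)
classify = svm_classify_command("valid.test", "open_flag.model", "open_flag.pred")
print(" ".join(learn))
print(" ".join(classify))
# run(learn); run(classify)  # uncomment once the binaries are installed
```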

update:
I am currently in 49th place on the leader-board with a score of 0.26307; the score of the 1st rank is 0.2272. You can see that the top 50 scores are very close. This is probably due to the difficulty of predicting click_flag: my models seem to perform decently in predicting open_flag, but fail miserably on click_flag. Any model that manages to predict click_flag even to a small extent will surely top the leader-board.

*** Moved to 44th Place in the leader-board with score 0.25126