Tuesday, October 30, 2012

De-duplicating and merging customer records with clustering


Frustrated with multiple records of the same customer that differ only by a typo, an abbreviation, or a different way of writing the same address?

Duplicate customer records can be very tricky. They arise from abbreviated addresses, typos, and the many possible ways of writing the same name or address.

For example, both of these addresses refer to the same place:


  • John Street 23
  • John st. 23


Similarly, in the example below both names refer to the same person, but a typo and an abbreviation stop computers from easily identifying that they are in fact the same person.


  • Alphan Majar
  • Alp. Major
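To make the problem concrete, here is a minimal sketch of one common string distance, the Levenshtein (edit) distance. It is only an illustration of why exact comparison fails and approximate matching helps; it is an assumption for this post, not necessarily the measure Deduper uses internally, and the function name `levenshtein` is our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from `a`
                current[j - 1] + 1,      # insert a character into `a`
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # The two records are not equal, but they are only a few edits apart.
    print("John Street 23" == "John st. 23")
    print(levenshtein("john street 23", "john st. 23"))
    print(levenshtein("alphan majar", "alp. major"))
```

A small distance relative to the length of the strings suggests that two records may be duplicates, which is something an exact equality check can never tell you.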

Even with powerful computers, it is difficult to identify these duplicates. We have developed a simple tool to address this problem.

Try Deduper!!

Deduper is a simple command-line tool to merge duplicate customer records. It is based on advanced string matching techniques and clustering. The technique is called blocked nearest neighbor clustering, and Deduper further optimizes this general technique for the problem of merging customer records.
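The idea behind blocked nearest neighbor clustering can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not Deduper's actual code: records are first placed into blocks that share a short substring, and only records that share a block are compared, which avoids comparing every record with every other one. The names `build_blocks` and `cluster_records`, the block size of 3, and the 0.75 threshold are all hypothetical, and the similarity measure is the standard library's `difflib.SequenceMatcher`, used here as a stand-in for whatever distance the real tool applies.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def build_blocks(records, block_size=3):
    """Map every substring of length `block_size` to the indices of the
    records containing it. Records sharing a key land in the same block."""
    blocks = defaultdict(set)
    for index, record in enumerate(records):
        key = record.lower()
        for start in range(len(key) - block_size + 1):
            blocks[key[start:start + block_size]].add(index)
    return blocks


def cluster_records(records, block_size=3, min_ratio=0.75):
    """Group records that share a block and are similar enough.
    `min_ratio` is an illustrative threshold, not a tuned value."""
    parent = list(range(len(records)))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    compared = set()
    for members in build_blocks(records, block_size).values():
        for a, b in combinations(sorted(members), 2):
            if (a, b) in compared:
                continue  # this pair already met in another block
            compared.add((a, b))
            ratio = SequenceMatcher(
                None, records[a].lower(), records[b].lower()
            ).ratio()
            if ratio >= min_ratio:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(list)
    for index, record in enumerate(records):
        clusters[find(index)].append(record)
    # Only groups with more than one record are duplicate candidates.
    return [group for group in clusters.values() if len(group) > 1]


if __name__ == "__main__":
    addresses = ["John Street 23", "John st. 23", "Baker Road 7"]
    print(cluster_records(addresses))
```

In this sketch "Baker Road 7" shares no three-character substring with the other two addresses, so it is never compared with them at all. A larger block size means fewer comparisons and a faster run, at the risk of missing duplicates whose typos break every shared substring.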

Deduper is a wrapper around the simile-vicino library. The open source tool Google Refine uses this library, and how the clustering works can be read in more detail on this page.

Give it a try; we would be happy to hear how it helped you.

Deduper can be downloaded from the link: http://sourceforge.net/projects/deduper/
