Data Preprocessing – Normalization – A Real-World Example
In this article, we are going to see normalization in action in a popular web application. If you are not familiar with normalization, please refer to my previous post.
We all know how well Google exploits available technology to give us innovative products, and Google Insights for Search is one such great product. The application is built almost entirely on normalization concepts. Let us see what it allows us to do. Suppose I want to find out who was the more popular tennis player in 2009: Serena Williams or Venus Williams? Insights lets me answer this question based on the web traffic (news articles, searches) for these two keywords.
In the screenshot above, the chart shows that Serena Williams was more popular than Venus Williams in 2009 based on web search interest. But how is this relevant to normalization? Yes, I can hear your question. Look at the totals section: the actual number of search hits is not displayed. Google normalizes the raw search volume counts so that the data fits the scale [0-100]. Normalization often maps all the values in a data set into the range [0.0-1.0], but we can also choose the lower and upper bounds to suit our needs.
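For readers who want to see the mechanics, here is a minimal sketch of min-max normalization in Python. The bounds (0 and 100 by default) are parameters we choose; this is an illustration of the general technique, not Google's actual implementation.

```python
def min_max_normalize(values, new_min=0.0, new_max=100.0):
    """Linearly map `values` into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - old_min) * (new_max - new_min) / span
            for v in values]

# Raw counts of any magnitude land on the same [0, 100] scale:
print(min_max_normalize([10, 20, 30]))   # [0.0, 50.0, 100.0]
```

The same function with `new_min=0.0, new_max=1.0` gives the classic [0.0-1.0] scaling mentioned above.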
Now let us see another interesting thing. Let us rephrase our query: who was the more popular tennis player in 2009 among Roger Federer, Serena Williams and Venus Williams? You may think this is obvious, since everyone knows Federer is going to be the popular one, so what is interesting here? Come on, be patient, I swear it's going to be interesting. Look at the new graph displayed below.
Look at the normalized scores now. In the first case, when only Serena and Venus were compared, Serena got a score of 19; when Federer is also included in the query, Serena's score drops to 10. So what is the inference from this? The normalized scores do not make any sense individually, as they are not the actual measure; they only tell us who is more popular when the relative differences are analyzed. It's interesting, isn't it?
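One way to reproduce this effect is to scale every count relative to the largest one, so the peak keyword always scores 100. The counts below are invented purely for illustration, but they show how adding a bigger keyword pushes everyone else's score down even though their raw counts have not changed at all.

```python
def relative_scores(counts, scale=100):
    """Scale counts so the largest maps to `scale`; others stay proportional."""
    peak = max(counts)
    return [round(c * scale / peak) for c in counts]

# Hypothetical search volumes: [Serena, Venus]
print(relative_scores([1800, 950]))        # Serena is the peak here
# Add a larger keyword (Federer): the same raw counts now score lower,
# because everything is rescaled against the new peak.
print(relative_scores([1800, 950, 3600]))
```

Serena's raw count (1800) is identical in both calls, yet her score falls once a bigger keyword joins the comparison, which is exactly what we observed in the two charts.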
It doesn't end with this; I am going to share a critical secret about normalization with you. We programmers and computer scientists are very familiar with the concept of encapsulation, which is nothing but hiding the actual implementation details. Think about what Google is doing here: Google doesn't want to share the actual search volume counts, but it does want to convey the relative significance of the keywords. Yes, I think you got it: normalization can be considered a method to encapsulate the actual data. Say, for example, you want a data mining expert to analyse your business, but you do not want to reveal the actual revenue your business generates. Just normalize the data on a scale of, say, [0-100000] and hand the data to him; you get your analysis done, and he never gets an opportunity to know your actual revenue, as long as you keep your normalization parameters secret. Of course, we need to understand that some error may be introduced depending on the normalization technique we adopt.
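As a sketch of this "normalization as encapsulation" idea, suppose the scaling is derived from your private peak revenue, which is never shared. The revenue figures below are invented for illustration: the analyst sees the shape of the data, including ratios and trends, but cannot recover the absolute amounts without knowing the hidden peak.

```python
def encapsulate(values, target_max=100_000):
    """Rescale so the peak maps to `target_max`; the true peak stays private."""
    peak = max(values)                  # never shared with the analyst
    return [v * target_max / peak for v in values]

# Hypothetical quarterly revenue (kept private):
revenue = [2_000_000, 1_000_000, 500_000]
shared = encapsulate(revenue)
print(shared)                           # [100000.0, 50000.0, 25000.0]
```

Note that the relative structure survives the transformation: the first quarter is still exactly twice the second, so the analysis remains valid even though the absolute figures are hidden.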
I appreciate your feedback and comments.