Wednesday, October 26, 2016

Regularization, Standardization, and Normalization, Oh My!


I decided that instead of only posting my class projects as I go, I also want to periodically post about the topics we're learning. These posts will give a more in-depth look at each topic, and the projects will show those topics put into practice.

When dealing with data sets, all data does not come equal. You have to account for and deal with discrepancies in values and scaling, and therefore in the relationships throughout the data set. Those three words at the top are pretty confusing to distinguish at first; adding '-ization' to a word seems to automatically turn it into some kind of action or computation. First I will talk about normalization and standardization, since those are somewhat similar. After that, I will discuss regularization. Then I will try to put into perspective why these skills are incredibly useful, and in many cases mandatory.

Let's say you have a data set that describes a professional soccer team. It's very basic, with only a few columns: age, jersey number, and salary per week.


A quick detailing of the code: I made a dataframe from a dictionary of values and simply called mean() on the whole thing. As an aside, the average jersey number will never be a useful statistic, but it serves the purpose here.
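Since the original screenshots are not reproduced here, here is a minimal sketch of what that code might look like; the player values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical squad data -- these values are made up for illustration,
# not taken from the original post's screenshot.
team = pd.DataFrame({
    'age':             [19, 23, 27, 31, 35],
    'jersey_number':   [7, 10, 1, 23, 11],
    'salary_per_week': [25000, 80000, 40000, 150000, 60000],
})

# Calling mean() on the whole dataframe returns the mean of each column.
print(team.mean())
```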



As you can see, these means sit on significantly different scales; the range (max - min) across them is massive. This is where normalization and standardization come into play. If you keep the numbers as they are, you will never be able to properly see the relationships between the columns, because the differences in scale are just too large.

Before considering normalization and standardization, you have to at least be aware of the normal distribution curve. Put briefly, this curve represents data centered around a mean of 0 with a standard deviation of 1, and the percentage under each region of the curve is the probability that a data point will fall within that range.




Ok, so far so good? Normalization rescales the values you are using into a range of 0 to 1. The unfortunate side effect is that outliers tend to get lost in the translation; if they were kept, the non-outliers would all be squeezed into too small a slice of the scale. Standardization scales the data to have a mean of 0 and a standard deviation of 1. You want to use standardization when you are concerned with how many standard deviations the data points are from the mean.

To normalize the data use the following:
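x_normalized = (x - x_min) / (x_max - x_min)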


X represents a data value. Again, outliers get lost in this process because they throw off the scale too much, so they cannot be used as the x-max or x-min points, or else the values produced for everything else will sit either very close to 0 or very close to 1. Looking at the data set above, the salary column would have to be left out, otherwise it would dominate the scaling process.
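For a concrete sense of the mechanics, here is a sketch of min-max normalization applied column by column, assuming the hypothetical team dataframe from the earlier sketch:

```python
# Min-max normalization, column by column, on the hypothetical `team`
# dataframe from above: every value ends up between 0 and 1.
normalized = (team - team.min()) / (team.max() - team.min())
print(normalized)
```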

To find the z-score, which represents the number of standard deviations a value is from the mean, use the following formula:
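z = (x - μ) / σ, where μ is the population mean and σ is the population standard deviation.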

Put plainly, the z-score is the value's relationship to the mean of the group. For each point x, subtract the mean of the population and divide by the standard deviation of the population. The tradeoff here is that to use this formula, you must already have the population information, and typically you will not find yourself in that position. When that is the case, you can substitute the analogous t-statistic by subtracting the mean of the sample and dividing by the standard deviation of the sample.
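Here is the same kind of sketch for standardization, again assuming the hypothetical team dataframe from before:

```python
# Standardization (z-scores) on the hypothetical `team` dataframe: each
# column ends up with mean 0 and standard deviation 1. Note that pandas'
# std() uses the sample standard deviation by default, the usual
# substitute when the population parameters are unknown.
standardized = (team - team.mean()) / team.std()
print(standardized)
```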

Both of these techniques rescale the data so that the values are weighted more equally and the relationships between columns in the data are easier to see and compare.

Regularization refers to techniques that help prevent overfitting of a model. As we add parameters to our model to better predict the data, the model becomes more complex, and we slowly inch closer to overfitting it to the specific data we are training on. The result is a reduced ability to predict in a more generalized fashion: the model becomes highly sensitive to small changes in the data set.

Two techniques used are Lasso regression and Ridge regression. Getting into those is for another post; just know that they help you find the point at which more parameters become a bad thing, where you begin reducing your model's ability to predict from more generalized data.
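Just to give a taste before that post, here is a minimal sketch of what using them looks like with scikit-learn (which the original post does not cover); the data and the alpha values here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data: 100 samples, 5 features, a linear signal plus noise.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 3.0]) + 0.5 * rng.randn(100)

# alpha controls the strength of the penalty on large coefficients:
# the higher the alpha, the stronger the regularization.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)  # coefficients shrunk toward zero
print(lasso.coef_)  # some coefficients can be driven exactly to zero
```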

When you first receive a data set, the first thing to do is clean the data. Next, you need to determine whether or not you should normalize, standardize, and/or regularize it. Doing this is very important if you want to extract meaningful and significant analysis from your data set; otherwise, any results you find will not hold up.



