Tuesday, November 15, 2016

Project 6

Back again with another project. This time, the task was to use the IMDb top 250 movies list to determine which features make a movie popular. A lot of the info was readily available on the IMDb site and easy to scrape, but some of it I had to hunt down specifically. The end goal is to better predict which movies will be popular and to recommend what to watch next, as a service like Netflix does.

Getting the data

The first thing I had to do was scrape the data off the website. There is an API for this, but it kept returning an empty list that threw everything off, so I used BeautifulSoup for the job.
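As a rough sketch of that step (the URL and class name are from memory and may have changed since), something along these lines pulls the titles off the chart page:

    import requests
    from bs4 import BeautifulSoup

    # grab the top-250 chart page and parse it
    resp = requests.get('http://www.imdb.com/chart/top')
    soup = BeautifulSoup(resp.text, 'html.parser')

    # each movie sits in a <td class="titleColumn"> cell (class name is an assumption)
    title_cells = soup.find_all('td', class_='titleColumn')
    titles = [cell.a.text for cell in title_cells]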


After getting all the data, I cleaned out the bad values and added everything to a DataFrame.



OK, everything looks good for the data I scraped. Next we want to add a CountVectorizer. This converts text into a binary representation of its words: each word gets its own column, and a row gets a 1 in that column if it contains the word and a 0 if not.
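A minimal sketch of that step, assuming the scraped text lives in a column such as 'Director' (the column name is a guess):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # binary=True gives the 1/0 "contains this word" columns described above
    cv = CountVectorizer(binary=True)
    word_matrix = cv.fit_transform(df['Director'])
    word_df = pd.DataFrame(word_matrix.todense(), columns=cv.get_feature_names())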


This will allow us to see if any particular words stand out, such as directors, actors, language, etc.

Visualize

It's good to plot a few things to see if any visualizations tell you something about the relationships in the data. This is a swarmplot of the Rating against the total number of IMDb votes. As you can see, a significant portion of the movies are rated R.
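The plot itself is a one-liner with seaborn; a sketch, assuming OMDb-style column names 'Rated' and 'imdbVotes':

    import seaborn as sns
    import matplotlib.pyplot as plt

    # MPAA rating on the x-axis, vote counts on the y-axis
    sns.swarmplot(x='Rated', y='imdbVotes', data=df)
    plt.show()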


We won't use this information (rating vs. votes) any further, but it could be a good tangent to explore. 'Not Rated' and 'NOT RATED' show up as separate categories because of capitalization, but since I wasn't using the column further, I did not bother to merge them; the effect would be largely insignificant anyway.

This swarmplot shows votes as a function of Metascore. There is no rhyme or reason to this graph, which is still helpful: it tells us there is no obvious correlation with Metascore. Even the highest and lowest scored movies received similar numbers of votes.

Evaluation

OK, we have all our data, we've cleaned it, and we've taken a few looks at it; now it's time to do the evaluations. The first thing we have to do is normalize and scale the data properly.
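A minimal sketch of the scaling step, assuming X holds the numeric feature matrix:

    from sklearn.preprocessing import StandardScaler

    # put every column on the same scale (mean 0, standard deviation 1)
    X_scaled = StandardScaler().fit_transform(X)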


Because imdbVotes and gross run into very large numbers, we need everything on the same scale. The next thing on the list is to train-test split the data so we can fit and evaluate models on it. Going through a decision tree, bagged decision tree, AdaBoost, random forest, and gradient boosting, it turns out gradient boosting scored the highest.
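A sketch of that comparison, assuming y holds the target column and X_scaled the scaled features from above, with default hyperparameters throughout:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import (BaggingRegressor, AdaBoostRegressor,
                                  RandomForestRegressor, GradientBoostingRegressor)

    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)

    models = {'decision tree': DecisionTreeRegressor(),
              'bagged tree': BaggingRegressor(),
              'adaboost': AdaBoostRegressor(),
              'random forest': RandomForestRegressor(),
              'gradient boost': GradientBoostingRegressor()}

    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_test, y_test))  # R^2 on the held-out set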


Since the gradient boosting regressor (gbr) scored highest, we will use it for our predictions. The next thing we want to do is find the best number of features (k) to make the predictions with. Without flooding the page, we can see below:
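A sketch of the search over k, continuing from the split above and using univariate regression scores (the scoring function and step size are assumptions):

    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.ensemble import GradientBoostingRegressor

    # try a range of k values and see which gives the best gradient boosting score
    for k in range(5, X_train.shape[1] + 1, 5):
        selector = SelectKBest(f_regression, k=k).fit(X_train, y_train)
        gbr = GradientBoostingRegressor().fit(selector.transform(X_train), y_train)
        print(k, gbr.score(selector.transform(X_test), y_test))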


The best number of features is 52. We will use this to make our final predictions and identify the 5 most useful features.

So we initialize a new train-test split with that number of features.


From the plot we can see that the predicted values tend to run a little high compared to the actual values, but overall they are pretty close. Our score increased by roughly 8 percentage points, which is a nice jump.

Viewing it another way:
It's hard to see the detail here, but this is a visual representation of the decisions our model was making and where it split the data based on certain conditions. It gives a lot of insight into how the model functions on the inside.
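One way to produce a diagram like that is to export a single tree from the ensemble with graphviz; a sketch, where gbr is the fitted gradient boosting model and selected_features is the list of 52 kept column names:

    from sklearn.tree import export_graphviz

    # pull the first individual tree out of the gradient boosting ensemble
    one_tree = gbr.estimators_[0][0]
    export_graphviz(one_tree, out_file='tree.dot', feature_names=selected_features)
    # render it with: dot -Tpng tree.dot -o tree.png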

Results

In the end, we were able to determine the most useful features. imdbVotes is the clear winner, the next three follow with decent importance, and from the fifth feature on, the importance falls off very quickly.
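The ranking can be read straight off the fitted model's feature_importances_; a sketch, reusing the names from the earlier sketches:

    import pandas as pd

    importances = pd.Series(gbr.feature_importances_, index=selected_features)
    print(importances.sort_values(ascending=False).head(5))  # the 5 most useful features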




Monday, October 31, 2016

Project 4


For our project this week, we were tasked with using BeautifulSoup to scrape a website, and then within that data we scraped, do some kind of prediction with logistic regression.

The project called for us to web scrape Indeed.com for data science jobs. Because so few of the jobs actually list the salary data, we were to predict the salaries using the ones that did provide it.


This is what we needed to search for on Indeed. All the job listings start out in a <div> tag with a class of "row results"; finding that was our starting point. After scraping the relevant data, we wrote functions to extract specific parts of it, such as job title, location, salary (if present), and company.
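A sketch of that starting point (the search URL is an assumption, and the class string follows the description above; Indeed's markup may differ now):

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.indeed.com/jobs?q=data+scientist&l=New+York'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    # every job listing lives in one of these <div> tags
    listings = soup.find_all('div', class_='row results')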


So far so good. Above is an example of one of the extraction functions; they were all more or less the same. After extracting the pertinent data, the next move is to place it into a pandas DataFrame.
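For reference, an extraction function along these lines (the tag and class here are guesses) pulls one field out of a single listing:

    def extract_company(listing):
        # company name assumed to sit in a <span class="company"> inside the listing div
        tag = listing.find('span', class_='company')
        return tag.text.strip() if tag else None

    companies = [extract_company(div) for div in listings]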


Great. We have our data and the table we need, but it clearly needs a lot of cleaning up. The salaries all look different, and there's a lot of junk that needs to be removed in multiple columns.

The first thing is to drop the NaNs (not-a-number values) and remove the duplicates.
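In pandas that is a one-liner; a sketch, with the column name assumed:

    df = df.dropna(subset=['salary']).drop_duplicates()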


For salary, all the extra text was removed and the numbers had to be presented uniformly. After that, some of the values had to be averaged because the listing gave a range of possible salaries. We need cold, hard numbers.
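A sketch of that cleanup, assuming the raw strings look something like '$80,000 - $100,000 a year' and the column is named 'salary':

    import re

    def parse_salary(text):
        # keep only the numbers, then average them when a range is given
        numbers = [float(n.replace(',', '')) for n in re.findall(r'\d[\d,]*', text)]
        return sum(numbers) / len(numbers) if numbers else None

    df['salary'] = df['salary'].apply(parse_salary)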


The job and company columns contained a lot of '\n' newline characters. They were also Unicode strings, which made them harder to deal with, particularly when writing to CSV.
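Cleaning those up is a couple of pandas string operations; a sketch, with the column names assumed:

    for col in ['job', 'company']:
        # drop the newlines and surrounding whitespace
        df[col] = df[col].str.replace('\n', ' ').str.strip()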


Looks much better now. No newlines, and salaries are uniformly presented.

After all the clean up, we had to start prepping to do some regression analysis on the data.

This is so we can see where the individual salaries fall relative to the rest of the salary column.

Finding some common break points is a good idea. This splits the salaries by quartile and the median, all useful pieces of information, so we know what falls in certain ranges: the top 25%, the middle 50% (split into 25% each), and the bottom 25%.
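A sketch of those break points with pandas:

    import pandas as pd

    # quartile boundaries (25%, 50%, 75%) come straight out of describe()
    print(df['salary'].describe())

    # or bin each salary directly into its quartile
    df['salary_bin'] = pd.qcut(df['salary'], 4,
                               labels=['bottom 25%', 'lower mid', 'upper mid', 'top 25%'])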

Now we are ready to do analysis.  Doing a basic logistic regression with statsmodels, we get the following:
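The statsmodels version might look like this minimal sketch, assuming a 0/1 target such as "salary above the median" and a location column named city (both assumptions; the actual bins may differ):

    import statsmodels.formula.api as smf

    # logistic regression on location only; the summary includes the Pseudo R-squared
    model = smf.logit('high_salary ~ C(city)', data=df).fit()
    print(model.summary())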



Lots of numbers here, but the important one is the Pseudo R-squared value. Using only the location, it is 0.4243. While this number is not super useful by itself, it's a benchmark we can use to determine whether extra features increase or decrease the Pseudo R-squared, which is essentially a 'goodness of fit' measure for the model.


We are checking how the Pseudo R-squared changed compared to the previous model.

Adding in some features based on keywords that appeared often in the job titles improved the model's 'goodness of fit' slightly.
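Those keyword features are simple 0/1 flags; a sketch, with the keyword list and column name assumed:

    for keyword in ['senior', 'manager', 'machine learning']:
        # 1 if the job title mentions the keyword, 0 otherwise
        df[keyword] = df['job_title'].str.lower().str.contains(keyword).astype(int)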

Rebuilding the model, but this time with scikit-learn (another Python library useful for regression analysis):

First we set up our extra features and target: 


Then we perform cross-validation to determine the best C (the regularization parameter) and penalty type:
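A sketch of that search, assuming X and y are the features and salary-bin target set up above, and using the liblinear solver since it supports both penalties (the grid of C values is an assumption):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    param_grid = {'C': [0.25, 0.5, 0.75, 1.0, 2.0], 'penalty': ['l1', 'l2']}
    grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_)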



After we get the best parameters with best_params_, we use those values going forward.

Here we see that our best C value is 0.75 and the penalty is L1. This means 0.75 is the best value to use for the model's regularization in order to prevent overfitting. L1 is the penalty type, the same penalty Lasso regression uses.


Next we run a classification report, followed by a confusion matrix. The classification report gives us precision, recall, and the F1-score: respectively, how many of the selected items are relevant (is it grabbing a lot of stuff it doesn't need?), how many of the relevant items are selected (is it grabbing the right stuff?), and a combined measure of the two.

The confusion matrix shows the predicted values in columns compared to the actual values in the rows. So you can see how the model predicted for each bin. 
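Both come from sklearn.metrics; a sketch, continuing from the grid search above:

    from sklearn.metrics import classification_report, confusion_matrix

    predictions = grid.predict(X_test)
    print(classification_report(y_test, predictions))
    print(confusion_matrix(y_test, predictions))  # rows = actual bins, columns = predicted bins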

After running the intervening code, we can see the classification reports for L1 versus L2, respectively:

L1 (Lasso) Regression


L2 (Ridge) Regression



What is interesting here is that our L2 model performed better than our L1 model. In this specific case, Ridge (L2) regularization outperformed Lasso (L1): the L1 penalty brought many of the coefficients down to 0, essentially labeling those features as useless.

Most importantly, our model was able to decently predict the salaries of the jobs that didn't provide any salary information, as the company requested. Here are the top 10:




The top predicted job is a Cognitive Software Developer, which sounds cool, at a whopping $200k.

Wednesday, October 26, 2016

Regularization, Standardization, and Normalization, Oh My!


I decided that instead of only posting my class projects as I go, I also want to periodically post about the topics we're learning. These posts will provide more in-depth looks at the concepts, and the projects will show them put into practice.

When dealing with data sets, not all data comes equal. You have to account for discrepancies in values and scaling, and therefore in the relationships throughout the data set. Those three words at the top are pretty confusing to distinguish at first; adding '-ization' to any word seems to automatically turn it into some kind of action or computation. First I will talk about normalization and standardization, since those are somewhat similar. After that, I will discuss regularization. I will then try to put into perspective why these skills are incredibly useful, and in many cases mandatory.

Let’s say you have a data set that details a professional Soccer team. It’s very basic and it only has a few columns: age, jersey number, and salary per week.


A quick rundown of the code: I made a DataFrame from a dictionary with appropriate values and simply called mean() on the whole thing. As an aside, the average jersey number will never be a useful statistic, but it serves the purpose here.
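A sketch of that setup, with made-up numbers:

    import pandas as pd

    squad = pd.DataFrame({'age': [19, 23, 27, 31, 34],
                          'jersey_number': [7, 10, 1, 22, 4],
                          'weekly_salary': [20000, 150000, 80000, 55000, 95000]})
    print(squad.mean())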



As you can see, these numbers are significantly different. The range (max - min) of these numbers is massive. This is where normalization and standardization come into play. Keeping these numbers as they are, you would never be able to properly see the relationships between the columns, because the scales are just too different.

Before considering normalization and standardization, you have to at least be aware of the normal distribution curve. Put briefly, this curve represents data distributed around a mean of 0 with a standard deviation of 1, where the percentages under the curve give the probability that a value falls within each region of the distribution.




OK, so far so good? Normalization rescales the values you are using into a range of 0 to 1. The unfortunate side effect is that outliers tend to get lost in the translation; if they are kept, they squeeze the non-outliers together into too small a range. Standardization scales the data to have a mean of 0 and a standard deviation of 1. You want to use standardization when you care about how many standard deviations the data points are from the mean.

To normalize the data use the following:
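This is the standard min-max formula, where x_min and x_max are the smallest and largest values in the column:

    x_normalized = (x - x_min) / (x_max - x_min)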


Here x represents a data value. Again, outliers get lost in this process because they throw off the scale too much; if an outlier is used as the x_max or x_min point, the values produced for everything else end up very close to 0 or very close to 1. Looking at the data set above, the salary column would have to be left out; otherwise it would dominate the scaling process.

To find the z-score, which represents the number of standard deviations a value is from the mean, use the following formula:
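With μ as the population mean and σ as the population standard deviation:

    z = (x - μ) / σ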

Put plainly, the z-score is the value's relationship to the mean of the group. For each point x, subtract the population mean and divide by the population standard deviation. The tradeoff is that to use this formula you must already have the population information, and typically you will not find yourself in that position. When that is the case, you can substitute an analogous t-statistic by subtracting the sample mean and dividing by the sample standard deviation.

Both of these techniques rescale the data so that the values are weighted equally and better represent the relationships between the columns.
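For reference, scikit-learn ships both rescalings; a sketch using the toy squad DataFrame from above:

    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    normalized = MinMaxScaler().fit_transform(squad)      # every column squeezed into 0-1
    standardized = StandardScaler().fit_transform(squad)  # every column to mean 0, std 1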

Regularization refers to techniques that help prevent overfitting of the model. As we add parameters to better predict the data, the model becomes more complex, and we slowly inch closer to overfitting it to the specific data we are training on. The result is a reduced ability to predict in a generalized fashion: the model becomes highly sensitive to small changes in the data set.

Two commonly used techniques are Lasso regression and Ridge regression. Getting into those is for another post; just know that they help you find the point at which more parameters become a bad thing, where you begin reducing your model's ability to predict well on more general data.

When you first receive a data set, the first task is to clean it up. Next, you need to determine whether you should normalize, standardize, and/or regularize your data. Doing this is very important if you want to extract meaningful and significant analysis from your data set; otherwise, any results you find will be called into question.




Monday, October 24, 2016

Back again with another project. This time we are exploring a data set out of Iowa covering liquor sales in the state. It has 18 columns, including sale dates, associated costs, and locations, most of which weren't needed. The first thing I had to do was look through the data and check that the columns had appropriate types and values.

Surprisingly, most of the columns are appropriate types. A few things needed to be changed, though.
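A sketch of the typical fixes for this file (the column names follow the Iowa data set, but treat them as assumptions):

    import pandas as pd

    # dates come in as strings; dollar columns come in as strings like "$12.50"
    df['Date'] = pd.to_datetime(df['Date'])
    for col in ['State Bottle Cost', 'State Bottle Retail', 'Sale (Dollars)']:
        df[col] = df[col].str.lstrip('$').str.replace(',', '').astype(float)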


Voila! After tidying up, the data is ready to be worked with. A good next step is to explore it and get a rough picture of what we're dealing with. I decided to see which counties were performing best in sales; this would give a quick overview of which stores to focus on.

Well, that doesn't give us much. Looking at the code:


This returns the highest and lowest 10 by total sales in dollars and by number of transactions. Using this treemap, we can see that a handful of stores represent a large portion of the liquor sales in Iowa for this time period, and many smaller stores fill in the rest.
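A sketch of that aggregation (column names assumed from the Iowa file):

    by_store = df.groupby('Store Number')['Sale (Dollars)'].agg(['sum', 'count'])
    print(by_store.sort_values('sum', ascending=False).head(10))  # top 10 by total dollars
    print(by_store.sort_values('sum').head(10))                   # bottom 10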

This could lead to more questions, such as why certain stores perform better, what the surrounding neighborhoods look like, and so on. We also need to consider what kind of store it is: grocery stores that sell liquor will probably sell higher volumes overall, because people are already there and just have to add it to their cart, while a small gas station on a remote highway would not fare as well. However, many of these questions are beyond the scope of this project, so moving on.



Using linear regression, I set out to determine whether the sales for 2015 would be a good predictor of the sales for 2016. I made a new DataFrame with the columns I wanted to use as features for the prediction (average bottles sold, average volume sold, average bottle_volume, and total sales), and another one for the target (sales). I did not remove outliers.
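A sketch of the fit, with the aggregated 2015/2016 frames and column names assumed:

    from sklearn.linear_model import LinearRegression

    features = ['avg_bottles_sold', 'avg_volume_sold', 'avg_bottle_volume', 'total_sales']
    lr = LinearRegression().fit(sales_2015[features], sales_2015['sales'])
    print(lr.score(sales_2015[features], sales_2015['sales']))  # R^2 on the 2015 data
    predicted_2016 = lr.predict(sales_2016[features])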


It has a pretty high R^2 score, so chances were good it would predict 2016 well.


So it looks like 2016 will be a better selling year than 2015, and 2015 was a pretty good representation of what we could expect to get in 2016. The difference is about $100k.

Monday, October 17, 2016

Looking over the data for the Billboard top 100 chart for the year 2000, we immediately see some concerns. First, the genres seem to be very broad. There are many songs whose assigned category doesn't seem to match their sound; one glaring example is Sisqo being labeled 'Rock'.

Second, there are an incredible number of NaNs in the data, including every column after the 66th week. From that we can quickly deduce that no song from the year 2000 lasted longer than 66 weeks on the Billboard top 100 chart. However, a year is only 52 weeks, so it's safe to assume that some songs came into 2000 already on the chart and possibly remained on it for the entire year.
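That observation is quick to verify; a sketch, assuming the weekly ranking columns have 'week' in their names:

    week_cols = [c for c in billboard.columns if 'week' in c.lower()]
    empty_weeks = [c for c in week_cols if billboard[c].isnull().all()]
    print(empty_weeks)  # per the above, everything after the 66th week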

Looking at a quick graph, we can see how many songs each artist had on the top 100. Many artists had more than one song, showing that plenty of them had more success than a 'one hit wonder'.

Number of tracks per artist (10 = 1 track for some reason)
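A chart like that is a one-liner; a sketch, with the column name 'artist' assumed:

    import matplotlib.pyplot as plt

    billboard['artist'].value_counts().plot(kind='bar')
    plt.show()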




By far, most artists had only one track make it that year, but a few bucked the trend and produced multiple hits. Switching to a view of which genres are most represented, we can see:



Rock, Country, and Rap all but dominated the charts that year. Other genres have some slight representation, but by and large, songs fell into one of those three. Genre seems to be arbitrarily assigned, as I could not find any documentation on how it is determined. To me, some of the songs seem to be classified indirectly, or automatically placed into a category based on something specific, like "this song has a drum and guitar set."

If we look at genre and compare week 1 to week 25, we can see that week 1 has a much more diverse representation of genres. Much smaller categories, like Gospel, Jazz, and Reggae, all have at least one song. This can probably be attributed to the era's version of "going viral": a song finds incredible popularity, but it is very short lived, and once it fades out of the public eye and popular media, it is all but forgotten.

As you can see, by the 25th week, only a couple of the genres remain, with Rock taking the lion's share. A lot of the genres cater to a specific sound, such as Country or Latin. Rock is so broadly perceived, so diversely interpreted, that almost anyone can find a song they enjoy in the Rock category.


The graph above shows the songs' popularity from week 25 to week 27. The shorter the bar, the higher the ranking on the Billboard 100. Once a song made it this far, its position tended not to change too quickly; it had sustained popularity that simply fades over time as new songs come into the picture.

The picture below shows the initial week 1 ranking for each artist entering the Billboard 100. The closer to the bottom, the higher the ranking. There is an average line running through it, showing that the average debut spot is about 79 or 80 on the Billboard 100.


Comparing that with week 27 on the chart:


It's much harder to tell here, but the average is about 27. This tells us a couple of things. First, for a song to stay on the Top 100, it has to keep growing in popularity; it isn't enough to stagnate and level off at a comfortable spot. People need to be actively enjoying it, and new listeners need to keep finding it, to keep it relevant. Second, once a song reaches high popularity, it drops off the chart much more slowly than songs that never make it far past the initial average.

Some other questions I had that I'd like to answer eventually:

  • If a song hit number 1, how long, on average, did it tend to stay there?
  • Would any song stay on the chart at no higher than a certain number (different thresholds, such as the average) for a number of weeks?
  • How often would genres outside the top 3 (Rock, Country, Rap) break into the top 20?
  • How did the Billboard top 100 team include Sisqo in the Rock category?

Monday, October 10, 2016

first test post

This is a test post



describing data


heat maps n bar charts

a'hyuck