Back again with another project. This time, I used the IMDb Top 250 movies list to work out which features best indicate what makes a movie popular. A lot of the info was readily available on the IMDb site and easy to scrape, but some of it I had to hunt down specifically. The end goal is to better predict which movies will be popular and to recommend what to watch next, such as on Netflix.
Getting the data
The first thing I had to do was scrape the data off the website. There is an API for this, but it kept returning an empty list that threw everything off, so I used BeautifulSoup for the job.
After getting all the data, I cleaned out the bad values and added everything to a dataframe.
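For reference, here is a minimal sketch of what that scrape-and-clean step can look like. The URL is the real Top 250 chart, but the CSS selectors and the columns I pull are assumptions on my part (IMDb's markup changes over time, and the other fields like votes and gross would be scraped the same way):

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# IMDb blocks the default requests user agent, so send a browser-like one.
resp = requests.get(
    "https://www.imdb.com/chart/top/",
    headers={"User-Agent": "Mozilla/5.0"},
)
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
# The selectors below are assumptions; inspect the live page and adjust.
for item in soup.select("li.ipc-metadata-list-summary-item"):
    title = item.select_one("h3").get_text(strip=True)
    rating = item.select_one("span.ipc-rating-star").get_text(strip=True)
    rows.append({"Title": title, "imdbRating": rating.split("(")[0]})

df = pd.DataFrame(rows)
# Basic cleanup: coerce ratings to numbers and drop rows that fail to parse.
df["imdbRating"] = pd.to_numeric(df["imdbRating"], errors="coerce")
df = df.dropna(subset=["imdbRating"]).reset_index(drop=True)
```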
OK, everything looks good with the data I scraped. Next we will want to add a CountVectorizer, which converts text into a binary representation of its words: if a row contains a given word, that word's column gets a 1, and a 0 if not.
This will allow us to see if any particular words stand out, such as directors, actors, language, etc.
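Here's a small sketch of that step, assuming the dataframe has a text column named Director (the same pattern applies to actors, language, and so on):

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# binary=True marks presence/absence instead of counting occurrences.
cv = CountVectorizer(binary=True)
matrix = cv.fit_transform(df["Director"].fillna(""))  # assumed column name
word_df = pd.DataFrame(
    matrix.toarray(), columns=cv.get_feature_names_out(), index=df.index
)
df = pd.concat([df, word_df], axis=1)  # one 0/1 column per word
```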
Visualize
It's good to plot a few things to see if there are any visualizations that tell you something about the relationships in the data. This is a swarmplot of the rating against the total number of IMDb votes. As you can see, a significant portion of the movies are rated R.
We won't use this information any further (regarding rating vs. votes), but it could be a good tangent to explore. Not Rated and NOT RATED show up as separate categories because of capitalization; since I wasn't using the column further, I didn't bother merging them, and the effect would be largely insignificant anyway.
This swarmplot shows votes as a function of Metascore. There is no rhyme or reason to this graph, which is still helpful: it tells us vote counts have no obvious correlation with Metascore. Even the movies with the highest and lowest Metascores received similar numbers of votes.
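Both plots come from the same seaborn call. A sketch, assuming the dataframe has Rated, Metascore, and imdbVotes columns:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.swarmplot(x="Rated", y="imdbVotes", data=df, ax=axes[0])      # rating vs. votes
sns.swarmplot(x="Metascore", y="imdbVotes", data=df, ax=axes[1])  # metascore vs. votes
axes[0].tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()
```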
Evaluation
OK, we have all our data, we've cleaned it, and we've taken a few looks at it; now it's time to do the evaluations. The first thing we have to do is normalize and scale the data properly.
Because imdbVotes and gross run into very large numbers, we need everything to be on the same scale. The next thing on the list is to train-test split the data in order to fit and score models on it. Going through a decision tree, a bagged decision tree, AdaBoost, a random forest, and gradient boosting, it turns out gradient boosting scored the highest.
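A sketch of that pipeline is below. The target column (imdbRating) is an assumption on my part, and each model reports its default R² score on the test set:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (
    AdaBoostRegressor, BaggingRegressor,
    GradientBoostingRegressor, RandomForestRegressor,
)

# Keep numeric columns only; the target column name is an assumption.
X = df.drop(columns=["imdbRating"]).select_dtypes("number")
y = df["imdbRating"]

# Scale everything to the same range (votes and gross dwarf the rest).
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42
)

models = {
    "decision tree": DecisionTreeRegressor(random_state=42),
    "bagged decision tree": BaggingRegressor(random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "random forest": RandomForestRegressor(random_state=42),
    "gradient boost": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")  # R^2 on the test set
```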
Since the gradient boosting regressor (gbr) scored highest, we will use it for our predictions. The next thing we want to do is find the best number of features (k) to predict with. Without flooding the page, we can see below:
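The search itself can look roughly like this; f_regression is one reasonable scoring function for SelectKBest, though using it here is an assumption:

```python
from sklearn.feature_selection import SelectKBest, f_regression

# Refit the gradient boost on the top-k features for every k,
# keeping whichever k gives the best test score.
best_k, best_score = None, -float("inf")
for k in range(1, X_train.shape[1] + 1):
    selector = SelectKBest(f_regression, k=k).fit(X_train, y_train)
    gbr = GradientBoostingRegressor(random_state=42)
    gbr.fit(selector.transform(X_train), y_train)
    score = gbr.score(selector.transform(X_test), y_test)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)
```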
The best number of features is 52. We will use that to make our final predictions and pull out the 5 most useful features.
So we initialize a new train-test split with that number of features.
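In sketch form, keeping the 52 selected columns and refitting:

```python
# Keep only the 52 best columns, then refit the gradient boost on them.
selector = SelectKBest(f_regression, k=52).fit(X_train, y_train)
X_train_k = selector.transform(X_train)
X_test_k = selector.transform(X_test)

gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train_k, y_train)
print(gbr.score(X_test_k, y_test))
```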
With the plot we can see that the predictions tend to run a little high compared to the actual values, but overall they are pretty close. Our score increased by roughly 8 percentage points, which is a nice jump.
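A plot along those lines can be produced like so, reusing the refit model from above:

```python
import matplotlib.pyplot as plt

y_pred = gbr.predict(X_test_k)
plt.scatter(y_test, y_pred, alpha=0.6)
# Dashed line marks where predicted would equal actual.
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, "r--")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.show()
```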
Viewing it another way:
It's hard to see the detail here, but this is a visual representation of the decisions our model was making and where it would split the data based on certain conditions. It really gives a lot of insight into how the model works on the inside.
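One way to draw a tree like this is to plot a single tree out of the fitted ensemble (gbr.estimators_ holds the individual regression trees):

```python
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
# Plot the first tree in the boosted ensemble, truncated for readability.
plot_tree(gbr.estimators_[0][0], filled=True, max_depth=3, fontsize=8)
plt.show()
```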
Results
In the end, we were able to determine the most useful features. imdbVotes is the clear winner, followed at respectable levels by the next three, and from the fifth feature on, importance falls off very quickly.
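For completeness, here's a sketch of how that ranking can be read off the fitted model, using the selector and column names from earlier:

```python
import pandas as pd

# Map importances back to the names of the 52 columns the selector kept.
selected_cols = X.columns[selector.get_support()]
importances = pd.Series(gbr.feature_importances_, index=selected_cols)
print(importances.sort_values(ascending=False).head(5))
```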