Project 4
For this week's project, we were tasked with using BeautifulSoup to scrape a website and then using the scraped data to make predictions with logistic regression.
The project called for scraping Indeed.com for data science jobs. Because so few of the listings actually include salary data, we were to use the ones that did to predict salaries for the ones that did not.
This is what we needed to search for on Indeed. Every job listing starts out in a <div> tag with a class of "row results", so finding those tags was our starting point. After scraping all the relevant data, we wrote functions to extract the specific parts: job title, location, salary (if present), and company.
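To make the idea concrete, here is a minimal sketch of that kind of extraction, run against a small hand-written HTML snippet rather than the live site. Only the "row results" class comes from the project; the inner tag and class names (jobtitle, company, location, salary) and the extract_text helper are assumptions for illustration, and the real Indeed markup may well differ.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a scraped results page. The inner class names
# are hypothetical; only "row results" is taken from the write-up.
SAMPLE_HTML = """
<div class="row results">
  <h2 class="jobtitle">Data Scientist</h2>
  <span class="company">Acme Corp</span>
  <span class="location">Boston, MA</span>
  <span class="salary">$90,000 - $110,000 a year</span>
</div>
<div class="row results">
  <h2 class="jobtitle">Machine Learning Engineer</h2>
  <span class="company">Initech</span>
  <span class="location">Austin, TX</span>
</div>
"""


def extract_text(result, tag, class_name):
    """Return the stripped text of the first matching child, or None if absent."""
    node = result.find(tag, class_=class_name)
    return node.get_text(strip=True) if node else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Select every <div> that carries both the "row" and "results" classes.
results = soup.select("div.row.results")

jobs = [
    {
        "title": extract_text(r, "h2", "jobtitle"),
        "company": extract_text(r, "span", "company"),
        "location": extract_text(r, "span", "location"),
        "salary": extract_text(r, "span", "salary"),  # None when not listed
    }
    for r in results
]

for job in jobs:
    print(job)
```

Returning None for a missing field keeps listings without a salary in the data, which matters here because those are the ones we want to predict later.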
So far so good. Above is an example of one of the extraction functions; they were all more or less the same. After extracting the pertinent data, the next move was to place it into a pandas DataFrame.
Great. We have our data and the table we need, but it clearly needs a lot of cleaning up. The salaries all look different, and there's a lot of junk that needs to be removed in multiple columns.
The first thing to do is drop the NaNs (missing values; NaN stands for "not a number") and remove the duplicates.
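As a rough sketch of that step, assuming a DataFrame with title, company, and salary columns (the column names and rows below are made up for illustration), it could look like this:

```python
import pandas as pd

# Hypothetical scraped rows; None marks a listing with no salary shown.
jobs = pd.DataFrame(
    {
        "title": ["Data Scientist", "Data Scientist", "ML Engineer", "Data Analyst"],
        "company": ["Acme Corp", "Acme Corp", "Initech", "Globex"],
        "salary": ["$90,000 a year", "$90,000 a year", None, "$70,000 a year"],
    }
)

# Keep only the rows that actually list a salary.
with_salary = jobs.dropna(subset=["salary"])

# Remove rows that are exact copies of an earlier row, then renumber the index.
deduplicated = with_salary.drop_duplicates().reset_index(drop=True)

print(deduplicated)
```

If the listings without a salary are needed later for prediction, they would have to be set aside before the dropna call.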
For salary, all the extra text was stripped out so the numbers were presented uniformly. Where a listing gave a salary range, the two endpoints were summed and averaged to get a single figure. We need cold, hard numbers.
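A minimal sketch of that cleanup, assuming the raw salaries are strings like "$90,000 - $110,000 a year" (the parse_salary name and the sample strings are ours, not from the original code). It does not try to convert hourly or monthly figures to yearly ones.

```python
import re


def parse_salary(raw):
    """Pull the numbers out of a salary string and average them.

    A single figure comes back unchanged; a range comes back as its midpoint.
    Returns None when the string contains no digits at all.
    """
    matches = re.findall(r"\d[\d,]*(?:\.\d+)?", raw)
    numbers = [float(m.replace(",", "")) for m in matches]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


print(parse_salary("$90,000 - $110,000 a year"))
print(parse_salary("$75,000 a year"))
print(parse_salary("Competitive"))
```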
The job and company columns contained a lot of '\n' newline characters. The text was also in unicode, which made it harder to deal with and, in particular, harder to send to CSV.
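One way to handle the newlines, sketched here with made-up rows and a hypothetical collapse_whitespace helper, is to split each string on whitespace and join it back together with single spaces:

```python
import pandas as pd

# Hypothetical raw values with the kind of stray newlines described above.
raw = pd.DataFrame(
    {
        "title": ["\nData Scientist\n", "Senior\nData Analyst"],
        "company": ["\n\n    Acme Corp", "Initech\n"],
    }
)


def collapse_whitespace(column):
    """Replace every run of whitespace (newlines included) with one space."""
    return column.str.split().str.join(" ")


# Apply the cleanup to each text column.
cleaned = raw.apply(collapse_whitespace)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = cleaned.to_csv(index=False)
print(csv_text)
```

In Python 3 every string is already unicode, so the encoding trouble mentioned above should mostly go away; passing encoding="utf-8" when writing to a file is a reasonable default.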
Looks much better now. No newlines, and salaries are uniformly presented.
After all the cleanup, we had to start prepping the data for regression analysis.
This is so we can know where each individual salary falls relative to the rest of the salary column.
Finding some common break points is a good idea. This splits the salaries at the quartiles and the median, all useful pieces of information, so we know which salaries fall in which range: the top 25%, the middle 50% (split into two 25% halves), and the bottom 25%.
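Here is a small sketch of finding those break points with pandas, using invented salary figures. Whether the original code used quantile, qcut, or something else is not recoverable from the post, so treat the calls and the bin labels below as one possible approach.

```python
import pandas as pd

# Made-up cleaned salaries, one number per listing.
salaries = pd.Series(
    [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000], name="salary"
)

# The 25th, 50th (median), and 75th percentiles are the break points.
breaks = salaries.quantile([0.25, 0.5, 0.75])
print(breaks)

# qcut assigns each salary to one of four equally populated bins.
labels = ["bottom 25%", "lower middle 25%", "upper middle 25%", "top 25%"]
salary_bin = pd.qcut(salaries, q=4, labels=labels)
print(pd.concat([salaries, salary_bin.rename("bin")], axis=1))
```

The bin column can then serve as the target for a classifier, which is how a salary problem ends up being handled by logistic regression.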
Now we are ready to do analysis. Doing a basic logistic regression with statsmodels, we get the following:
Lots of numbers here, but the important one is the Pseudo R-squared value. Using only the location, it is 0.4243. While this number is not super useful by itself, it's a mark we can use to determine whether extra features increase or decrease the Pseudo R-squared value. That value is essentially a 'goodness of fit' measure for the model.
We are checking how Pseudo R-squared changed relative to the previous model.
Adding some features based on keywords that appeared often in the job titles gave the model a slight boost in 'goodness of fit'.
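For intuition about what that number measures, here is a hand-rolled sketch of McFadden's pseudo R-squared, which is the version statsmodels documents for its logit results. It compares the log-likelihood of the fitted model with that of a "null" model that always predicts the overall base rate. The labels and fitted probabilities are invented, and the function names are ours.

```python
import math


def log_likelihood(y_true, probabilities):
    """Sum of log-probabilities the model assigned to the observed labels."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, probabilities)
    )


def mcfadden_pseudo_r2(y_true, probabilities):
    """1 - (log-likelihood of the model / log-likelihood of the null model)."""
    base_rate = sum(y_true) / len(y_true)
    ll_model = log_likelihood(y_true, probabilities)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    return 1.0 - ll_model / ll_null


y = [1, 1, 0, 0]                # invented labels
fitted = [0.8, 0.7, 0.3, 0.2]   # invented predicted probabilities of class 1

r2 = mcfadden_pseudo_r2(y, fitted)
print(round(r2, 4))
```

A model that does no better than the base rate scores 0, and the value climbs toward 1 as the predicted probabilities get closer to the true labels, which is why it works as a yardstick for comparing feature sets.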
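A sketch of how keyword features like these might be built from the job titles. The keyword list and the titles are hypothetical; the post does not say which words were actually used.

```python
import pandas as pd

# Invented job titles standing in for the scraped ones.
titles = pd.Series(
    [
        "Senior Data Scientist",
        "Data Analyst",
        "Machine Learning Engineer",
        "Senior Research Scientist",
    ],
    name="title",
)

# Hypothetical keywords; in practice these would come from counting
# the most frequent words across all titles.
keywords = ["senior", "scientist", "engineer"]

# One 0/1 column per keyword: 1 when the title contains the word.
features = pd.DataFrame(
    {
        f"title_has_{word}": titles.str.contains(word, case=False, regex=False).astype(int)
        for word in keywords
    }
)

print(features)
```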
Next we rebuild the model, this time with scikit-learn (another Python library useful for regression analysis).
First we set up our extra features and target:
Then we perform cross-validation to determine the best C (the inverse of the regularization strength) and penalty type:
After we get the best parameters with best_params_, we use those values going forward.
Here we see that our best C value is 0.75 and the best penalty is L1. C controls how strongly the model is regularized to prevent overfitting (in scikit-learn it is the inverse of the regularization strength, so smaller values mean stronger regularization), and 0.75 was the value that scored best in cross-validation. L1 is the penalty type selected; it is the same penalty used in Lasso regression.
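For reference, a search like this can be sketched with scikit-learn's GridSearchCV. The data below is synthetic (make_classification), and the grid of C values, the five folds, and the liblinear solver are assumptions on our part; liblinear is chosen only because it supports both the L1 and L2 penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the job features and the salary-bin target.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Candidate values to try; every combination is scored by cross-validation.
param_grid = {
    "C": [0.25, 0.5, 0.75, 1.0],
    "penalty": ["l1", "l2"],
}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    cv=5,  # 5-fold cross-validation
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds a model refit on all the data with the winning settings, so it can be used directly for predictions.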
Next we run a classification report, followed by a confusion matrix. A classification report gives us precision, recall, and f1-score. Respectively: how many selected items are relevant (is it grabbing a lot of stuff you don't need?), how many relevant items are selected (is it missing stuff you do need?), and the harmonic mean of precision and recall, which balances the two in a single score.
The confusion matrix shows the predicted values in columns compared to the actual values in rows, so you can see how the model predicted for each bin.
After running the code in between, we can see the classification reports for L1 and L2 respectively:
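To see how these numbers relate, here is a tiny sketch with six invented labels and predictions, using scikit-learn's metrics functions:

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Invented actual and predicted classes for six listings.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Rows are the actual classes, columns are the predicted classes.
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

# Scores for class 1: two of the three predicted 1s are right (precision),
# and two of the three actual 1s were found (recall).
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(classification_report(y_true, y_pred))
```

The diagonal of the matrix holds the correct predictions, so anything off the diagonal shows which bins the model confuses with one another.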
L1 (Lasso) Regression
L2 (Ridge) Regression
What is interesting here is that our L2 (Ridge) model performed better than our L1 (Lasso) model in this specific case. We are not sure why, but L1 brought many of the coefficients down to 0, which essentially labels those features as useless.
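The coefficient behavior is easy to check for yourself. The sketch below fits both penalties on synthetic data (not the job data) and counts how many coefficients each one sets exactly to zero; the C value of 0.05 and the liblinear solver are arbitrary choices made for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with 20 features, only a few of which carry signal.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=3, random_state=0
)

# Same regularization strength for both models, different penalty types.
l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X, y)
l2_model = LogisticRegression(penalty="l2", C=0.05, solver="liblinear").fit(X, y)

# Count coefficients that are exactly zero under each penalty.
zeros_l1 = int(np.sum(l1_model.coef_ == 0))
zeros_l2 = int(np.sum(l2_model.coef_ == 0))

print("L1 zero coefficients:", zeros_l1)
print("L2 zero coefficients:", zeros_l2)
```

L1 tends to drop features outright while L2 only shrinks them, so if some of the dropped features carried a little signal, that could be one reason an L2 model scores higher.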
Most importantly, our model was able to decently predict the salaries of the jobs that didn't provide any salary information, as the company requested. Here are the top 10:
The top predicted job is a Cognitive Software Developer, which sounds cool, at a whopping 200k.