Scrape Indeed using Python
Indeed is one of the largest job listing platforms. It claims around 300M visitors to its website every month. As a data engineer, you may want to identify which jobs are in high demand. To do that, you have to collect data from websites like Indeed and draw conclusions from it.
In this article, we are going to scrape Indeed using Python 3.x. We will scrape Python jobs in New York from Indeed. By the end of this tutorial, we will have the jobs in New York that require Python as a skill.
Why Scrape Indeed?
Scraping Indeed can help you in multiple ways. Some of them are:
- With this much data, you can train an AI model to predict salaries in the future for any given skill.
- Companies can use this data to analyze what salaries their rival companies are offering for a particular skill set. This will help them improve their recruitment strategy.
- You can also analyze what jobs are in high demand and what kind of skill set one needs to qualify for jobs in the future.
Setting up the prerequisites
We will need Python 3.x for this project, and our target page will be this one from Indeed.
I am assuming that you have already installed Python on your machine. So, let’s move forward with the rest of the installation.
We will need two libraries to help us extract data. We will install them with pip.
- Requests: Using this library we are going to make a GET request to the target URL.
- BeautifulSoup: Using this library we are going to parse HTML and extract all the crucial data that we need from the page. It is also known as BS4.
Installation
pip install requests
pip install beautifulsoup4
You can create a dedicated folder for Indeed on your machine and then create a Python file where we will write the code.
Let’s decide what we are going to scrape
Whenever you start a scraping project, it is always better to decide in advance what exactly we need to extract from the target page.
We are going to scrape all the highlighted parts in the above image.
- Name of the job
- Name of the company
- Company rating
- The salary offered
- Job details
Let’s scrape Indeed
Before even writing the first line of code, let’s find the exact element location in the DOM.
Every job box is an li tag. You can see this in the above image. There are 18 of them on each page, and all of them fall under the ul tag with class jobsearch-ResultsList. So, our first job is to find this ul tag.
Let’s first import all the libraries in the file.
import requests
from bs4 import BeautifulSoup
Now, let’s declare the target URL and make an HTTP connection to that website.
l=[]
o={}
target_url = "https://www.indeed.com/jobs?q=python&l=New+York%2C+NY&vjk=8bf2e735050604df"
head= {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
"Accept-Encoding": "gzip, deflate, br",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Connection": "keep-alive",
"Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}
resp = requests.get(target_url, headers=head)
We have declared an empty list and an empty dictionary to store the data at the end.
Sometimes (in fact, the majority of the time) you might get a 403 status code. To avoid getting blocked you will need a web scraping API.
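Before parsing, it helps to confirm that the request actually succeeded. The following is a minimal sketch and not part of the original script: html_or_none is a hypothetical helper name, and it only assumes that the response object exposes status_code and text, as the resp returned by requests.get does.

```python
def html_or_none(resp):
    """Return the page HTML for a 200 response, otherwise None.

    `resp` is expected to behave like the object returned by requests.get,
    i.e. it exposes `status_code` and `text`.
    """
    if resp.status_code != 200:
        # On Indeed, a 403 here usually means the request was blocked.
        print(f"Request failed with status {resp.status_code}")
        return None
    return resp.text
```

In the script, you could call html = html_or_none(resp) right after the request and stop early when it returns None, instead of passing an error page to BS4.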
Now, let’s find the ul tag using BS4.
soup = BeautifulSoup(resp.text, 'html.parser')
allData = soup.find("ul",{"class":"jobsearch-ResultsList css-0"})
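Now, we have to iterate over each of these li tags and extract all the data one by one using a for loop. First, we collect the job cards: inside each li tag, the job data sits in a div tag with class cardOutline.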
alllitags = allData.find_all("div",{"class":"cardOutline"})
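The two lookups above filter on class in slightly different ways, which matters when a selector stops matching. The sketch below uses made-up markup that only imitates the structure described in this tutorial; it shows that a multi-class string has to match the class attribute exactly, while a single class name matches any tag that carries it.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure described in this tutorial.
html = """
<ul class="jobsearch-ResultsList css-0">
  <li><div class="cardOutline tapItem">Job A</div></li>
  <li><div class="cardOutline tapItem">Job B</div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string must match the whole class attribute exactly.
exact = soup.find("ul", {"class": "jobsearch-ResultsList css-0"})

# A single class name matches any tag that has that class among others.
loose = soup.find("ul", {"class": "jobsearch-ResultsList"})

# If the generated suffix changes (css-0 to css-1 here), the exact form finds nothing.
missing = soup.find("ul", {"class": "jobsearch-ResultsList css-1"})

# Guard against None before calling find_all on the result.
cards = loose.find_all("div", {"class": "cardOutline"}) if loose else []
print(exact is not None, loose is not None, missing is None, len(cards))
```

If allData comes back as None in the real script, searching on the stable part of the class name alone may be the more robust choice, since suffixes such as css-0 look auto-generated and may change.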
Now, we will run a for loop on this list alllitags.
As you can see in the image above, the name of the job is under the a tag. So, we will find this a tag and then extract the text out of it using the .text attribute of BS4.
for i in range(0,len(alllitags)):
    try:
        o["name-of-the-job"]=alllitags[i].find("a",{"class":"jcs-JobTitle css-jspxzf eu4oa1w0"}).text
    except AttributeError:
        # find() returned None because the tag is missing from this card
        o["name-of-the-job"]=None
Let’s find the name of the company with the same method.
The name of the company can be found under the div tag with class heading6 company_location tapItem-gutter companyInfo. Let’s extract this too.
    try:
        o["name-of-the-company"]=alllitags[i].find("div",{"class":"companyInfo"}).find("span",{"class":"companyName"}).text
    except AttributeError:
        o["name-of-the-company"]=None
Here we have first found the div tag and then used the .find() method to find the span tag inside it. You can check the image above for more clarity.
Let’s extract the rating now.
The rating can be found under the same div tag as the name of the company. Only the class of the span tag changes. The new class is ratingsDisplay.
    try:
        o["rating"]=alllitags[i].find("div",{"class":"companyInfo"}).find("span",{"class":"ratingsDisplay"}).text
    except AttributeError:
        o["rating"]=None
The salary offer can be found under the div tag with class metadata salary-snippet-container.
    try:
        o["salary"]=alllitags[i].find("div",{"class":"salary-snippet-container"}).text
    except AttributeError:
        o["salary"]=None
The last thing we have to extract is the job details. This is a list that can be found under the div tag with class metadata taxoAttributes-container.
    try:
        o["job-details"]=alllitags[i].find("div",{"class":"metadata taxoAttributes-container"}).find("ul").text
    except AttributeError:
        o["job-details"]=None
    # Store this job and start a fresh dictionary for the next one
    l.append(o)
    o={}
In the end, we have appended our dictionary o to the list l and reset o to an empty dictionary so that when the loop runs again it can store the data of the next job.
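The repeated try/except blocks can also be folded into one small function. This is only a sketch of an alternative, not the code this tutorial builds: parse_card and FIELD_SELECTORS are hypothetical names, the CSS selectors are derived from the class names used above, and the sample markup is made up. It uses BS4's select_one, which returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# CSS selectors for each field, derived from the class names used above.
FIELD_SELECTORS = {
    "name-of-the-job": "a.jcs-JobTitle",
    "name-of-the-company": "div.companyInfo span.companyName",
    "rating": "div.companyInfo span.ratingsDisplay",
    "salary": "div.salary-snippet-container",
    "job-details": "div.taxoAttributes-container ul",
}


def parse_card(card):
    """Return one job dictionary; fields missing from the card become None."""
    job = {}
    for field, selector in FIELD_SELECTORS.items():
        tag = card.select_one(selector)  # None when nothing matches
        job[field] = tag.get_text(strip=True) if tag is not None else None
    return job


# Made-up card without rating, salary, or details, to show the None handling.
sample_html = """
<div class="cardOutline">
  <a class="jcs-JobTitle css-jspxzf eu4oa1w0">Python Developer</a>
  <div class="heading6 company_location tapItem-gutter companyInfo">
    <span class="companyName">Example Corp</span>
  </div>
</div>
"""
card = BeautifulSoup(sample_html, "html.parser").find("div", {"class": "cardOutline"})
job = parse_card(card)
print(job)
```

With such a helper, the whole loop could shrink to l = [parse_card(card) for card in alllitags]. Note that get_text(strip=True) also trims surrounding whitespace, which the plain .text attribute does not.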
Let’s print it and see the results.
print(l)
Complete Code
You can make further changes to extract other details as well. You can even change the URL of the page to scrape jobs from the next pages.
But for now, the complete code will look like this.
import requests
from bs4 import BeautifulSoup
l=[]
o={}
target_url = "https://www.indeed.com/jobs?q=python&l=New+York%2C+NY&vjk=8bf2e735050604df"
head= {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
"Accept-Encoding": "gzip, deflate, br",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Connection": "keep-alive",
"Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}
resp = requests.get(target_url, headers=head)
print(resp.status_code)
soup = BeautifulSoup(resp.text, 'html.parser')
allData = soup.find("ul",{"class":"jobsearch-ResultsList css-0"})
alllitags = allData.find_all("div",{"class":"cardOutline"})
print(len(alllitags))
for i in range(0,len(alllitags)):
    try:
        o["name-of-the-job"]=alllitags[i].find("a",{"class":"jcs-JobTitle css-jspxzf eu4oa1w0"}).text
    except AttributeError:
        o["name-of-the-job"]=None
    try:
        o["name-of-the-company"]=alllitags[i].find("div",{"class":"companyInfo"}).find("span",{"class":"companyName"}).text
    except AttributeError:
        o["name-of-the-company"]=None
    try:
        o["rating"]=alllitags[i].find("div",{"class":"companyInfo"}).find("span",{"class":"ratingsDisplay"}).text
    except AttributeError:
        o["rating"]=None
    try:
        o["salary"]=alllitags[i].find("div",{"class":"salary-snippet-container"}).text
    except AttributeError:
        o["salary"]=None
    try:
        o["job-details"]=alllitags[i].find("div",{"class":"metadata taxoAttributes-container"}).find("ul").text
    except AttributeError:
        o["job-details"]=None
    l.append(o)
    o={}
print(l)
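To scrape the next pages, as mentioned at the start of this section, the search URL can be built from its parts instead of being hard-coded. The sketch below makes one assumption that does not come from this tutorial: that the results offset is passed in a start parameter that grows by 10 per page. Check how the URL changes in your browser before relying on it. build_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(query, location, page=0, page_size=10):
    """Build the search URL for a zero-based results page.

    Assumption: the offset is sent as `start` and grows by `page_size`
    per page. Verify this against the real site first.
    """
    params = {"q": query, "l": location}
    if page > 0:
        params["start"] = page * page_size
    # urlencode percent-encodes the values (space becomes +, comma becomes %2C).
    return f"{BASE_URL}?{urlencode(params)}"


for page in range(3):
    print(build_search_url("python", "New York, NY", page))
```

Each generated URL can then be passed to requests.get in place of target_url, with the parsing code left unchanged.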
Using Scrapingdog for scraping Indeed
You have to sign up for the free account to start using it. It will take just 10 seconds to get you started with Scrapingdog.
Once you sign up, you will be redirected to your dashboard. The dashboard will look somewhat like this.
You have to use your own API key.
Now, you can paste your target Indeed page link on the left and then select JS Rendering as No. After this, click on Copy Code on the right. Now use this API in your script to scrape Indeed.
You will notice the code will remain somewhat the same as above. We just have to change one thing and that is our target URL.
import requests
from bs4 import BeautifulSoup
l=[]
o={}
target_url = "https://api.scrapingdog.com/scrape?api_key=xxxxxxxxxxxxxxxx&url=https://www.indeed.com/jobs?q=python&l=New+York%2C+NY&vjk=8bf2e735050604df&dynamic=false"
resp = requests.get(target_url)
print(resp.status_code)
soup = BeautifulSoup(resp.text, 'html.parser')
allData = soup.find("ul",{"class":"jobsearch-ResultsList css-0"})
alllitags = allData.find_all("div",{"class":"cardOutline"})
print(len(alllitags))
for i in range(0,len(alllitags)):
    try:
        o["name-of-the-job"]=alllitags[i].find("a",{"class":"jcs-JobTitle css-jspxzf eu4oa1w0"}).text
    except AttributeError:
        o["name-of-the-job"]=None
    try:
        o["name-of-the-company"]=alllitags[i].find("div",{"class":"companyInfo"}).find("span",{"class":"companyName"}).text
    except AttributeError:
        o["name-of-the-company"]=None
    try:
        o["rating"]=alllitags[i].find("div",{"class":"companyInfo"}).find("span",{"class":"ratingsDisplay"}).text
    except AttributeError:
        o["rating"]=None
    try:
        o["salary"]=alllitags[i].find("div",{"class":"salary-snippet-container"}).text
    except AttributeError:
        o["salary"]=None
    try:
        o["job-details"]=alllitags[i].find("div",{"class":"metadata taxoAttributes-container"}).find("ul").text
    except AttributeError:
        o["job-details"]=None
    l.append(o)
    o={}
print(l)
As you can see we have replaced the target URL of Indeed with the API URL of Scrapingdog. You have to use your own API Key in order to successfully run this script.
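One detail to watch: the Indeed URL contains its own & and = characters, so when it is pasted raw into the API URL, parts of it (such as l and vjk) may be read as parameters of the API call rather than as part of the url value. The sketch below percent-encodes the nested URL with the standard library. The endpoint and the parameter names api_key, url, and dynamic are taken from the URL above; build_api_url is a hypothetical helper, and I am assuming the API accepts a percent-encoded url value, as is standard for query strings.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.scrapingdog.com/scrape"


def build_api_url(api_key, page_url, dynamic="false"):
    """Return the API URL with the target page URL safely percent-encoded."""
    query = urlencode({"api_key": api_key, "url": page_url, "dynamic": dynamic})
    return f"{API_ENDPOINT}?{query}"


indeed_url = "https://www.indeed.com/jobs?q=python&l=New+York%2C+NY&vjk=8bf2e735050604df"
# Replace the placeholder with your own API key.
target_url = build_api_url("xxxxxxxxxxxxxxxx", indeed_url)
print(target_url)
```

The resulting target_url can be passed to requests.get exactly as in the script above.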
With this script, you will be able to scrape Indeed at high speed without getting blocked.
Conclusion
In this tutorial, we were able to scrape Indeed job postings with Requests and BS4. Of course, you can modify the code a little to extract other details as well.
You can change the page URL to scrape jobs from the next page. To do that, click a page number at the bottom of the results page and note how the URL changes. For scraping millions of such postings you can always use Scrapingdog😜.
I hope you like this little tutorial and if you do then please do not forget to share it with your friends and on your social media.