Scrape Indeed using Python

Scrapingdog
7 min read · Feb 11, 2023


Indeed is one of the biggest job listing platforms on the market. The company claims around 300M visitors to its website every month. As a data engineer, if you want to identify which jobs are in high demand, you have to collect data from websites like Indeed to analyze it and draw a conclusion.

In this article, we are going to scrape Indeed using Python 3.x. We will scrape Python jobs in New York from Indeed. By the end of this tutorial, we will have a list of all the New York jobs that require Python as a skill.

Why Scrape Indeed?

Scraping Indeed can help you in multiple ways. Some of them are:

  • With this much data, you can train an AI model to predict salaries in the future for any given skill.
  • Companies can use this data to analyze what salaries their rival companies are offering for a particular skill set. This will help them improve their recruitment strategy.
  • You can also analyze what jobs are in high demand and what kind of skill set one needs to qualify for jobs in the future.

Setting up the prerequisites

We will need Python 3.x for this project, and our target page will be this one from Indeed.

I am assuming that you have already installed Python on your machine. So, let’s move forward with the rest of the setup.

We will need two libraries to help us extract the data. We will install them with the help of pip.

  1. Requests — Using this library we are going to make a GET request to the target URL.
  2. BeautifulSoup — Using this library we are going to parse HTML and extract all the crucial data that we need from the page. It is also known as BS4.

Installation

pip install requests 
pip install beautifulsoup4

You can create a dedicated folder for Indeed on your machine and then create a Python file inside it where we will write the code.

Let’s decide what we are going to scrape

Whenever you start a scraping project, it is always better to decide in advance exactly what you need to extract from the target page.

We are going to scrape all the highlighted parts in the above image.

  • Name of the job
  • Name of the company
  • Company rating
  • Offered salary
  • Job details

Let’s scrape Indeed

Before even writing the first line of code, let’s find the exact element location in the DOM.

Every job card is an li tag. You can see this in the above image. There are 18 of them on each page, and all of them fall under a ul tag with the class jobsearch-ResultsList. So, our first job is to find this ul tag.

Let’s first import all the libraries in the file.

import requests
from bs4 import BeautifulSoup

Now, let’s declare the target URL and make an HTTP connection to that website.

l=[]
o={}
target_url = "https://www.indeed.com/jobs?q=python&l=New+York%2C+NY&vjk=8bf2e735050604df"
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}
resp = requests.get(target_url, headers=head)

We have declared an empty list and an empty object to store data at the end.

Sometimes (in fact, the majority of the time) you might get a 403 status code. To avoid getting blocked, you will need a web scraping API.
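Before reaching for an API, a simple retry with exponential backoff can at least smooth over transient failures (it will not defeat a hard block). Here is a minimal sketch; the `fetch_with_retry` helper and its parameters are my own, not part of the original script, and the `get` argument is injectable only so the logic can be exercised without a live request:

```python
import time
import requests

def fetch_with_retry(url, headers=None, retries=3, backoff=1.0, get=requests.get):
    """Fetch a URL, retrying on non-200 responses with exponential backoff.

    `get` defaults to requests.get but can be swapped out for testing.
    """
    resp = None
    for attempt in range(retries):
        resp = get(url, headers=headers)
        if resp.status_code == 200:
            return resp
        time.sleep(backoff * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    return resp  # last (failed) response, so the caller can inspect it
```

In the script above, `resp = fetch_with_retry(target_url, headers=head)` could stand in for the plain `requests.get` call.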

Now, let’s find the ul tag using BS4.


soup = BeautifulSoup(resp.text, 'html.parser')

allData = soup.find("ul",{"class":"jobsearch-ResultsList css-0"})

Now, we have to iterate over each of these job cards and extract all the data one by one using a for loop.

alllitags = allData.find_all("div",{"class":"cardOutline"})

Now, we will run a for loop on this list alllitags.

As you can see in the image above, the name of the job is under an a tag. So, we will find this a tag and then extract the text out of it using the .text property of BS4.

for i in range(0, len(alllitags)):
    try:
        o["name-of-the-job"] = alllitags[i].find("a", {"class": "jcs-JobTitle css-jspxzf eu4oa1w0"}).text
    except:
        o["name-of-the-job"] = None

Let’s find the name of the company with the same method.

The name of the company can be found under a div tag with the class heading6 company_location tapItem-gutter companyInfo. Let’s extract this too.

    try:
        o["name-of-the-company"] = alllitags[i].find("div", {"class": "companyInfo"}).find("span", {"class": "companyName"}).text
    except:
        o["name-of-the-company"] = None

Here we have first found the div tag and then used the .find() method to locate the span tag inside it. You can check the image above for more clarity.

Let’s extract the rating now.

The rating can be found under the same div tag as the name of the company; only the class of the span tag changes. The new class is ratingsDisplay.

    try:
        o["rating"] = alllitags[i].find("div", {"class": "companyInfo"}).find("span", {"class": "ratingsDisplay"}).text
    except:
        o["rating"] = None

The salary offer can be found under the div tag with class metadata salary-snippet-container.

    try:
        o["salary"] = alllitags[i].find("div", {"class": "salary-snippet-container"}).text
    except:
        o["salary"] = None

The last thing we have to extract is the job details.

This is a list that can be found under the div tag with class metadata taxoAttributes-container.

    try:
        o["job-details"] = alllitags[i].find("div", {"class": "metadata taxoAttributes-container"}).find("ul").text
    except:
        o["job-details"] = None

    l.append(o)
    o = {}

In the end, we push our object o into the list l and reset o to an empty dictionary so that when the loop runs again, it can store the data of the next job.

Let’s print the list and see the results.

print(l)
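Instead of just printing the list, you may want to keep it for later analysis. A small sketch of dumping the results to a JSON file; the sample records here stand in for the scraped list `l`, and the filename is my own choice:

```python
import json

# Sample records standing in for the scraped list `l`.
l = [
    {"name-of-the-job": "Python Developer", "name-of-the-company": "Acme",
     "rating": "4.1", "salary": None, "job-details": None},
]

# Write the jobs out so they can be loaded again without re-scraping.
with open("indeed_jobs.json", "w") as f:
    json.dump(l, f, indent=2)
```

Since the values are plain strings and None, the list serializes to JSON as-is.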

Complete Code

You can make further changes to extract other details as well. You can even change the URL of the page to scrape jobs from the following pages.

But for now, the complete code will look like this.

import requests
from bs4 import BeautifulSoup

l=[]
o={}


target_url = "https://www.indeed.com/jobs?q=python&l=New+York%2C+NY&vjk=8bf2e735050604df"
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}

resp = requests.get(target_url, headers=head)
print(resp.status_code)
soup = BeautifulSoup(resp.text, 'html.parser')

allData = soup.find("ul",{"class":"jobsearch-ResultsList css-0"})

alllitags = allData.find_all("div",{"class":"cardOutline"})
print(len(alllitags))
for i in range(0, len(alllitags)):
    try:
        o["name-of-the-job"] = alllitags[i].find("a", {"class": "jcs-JobTitle css-jspxzf eu4oa1w0"}).text
    except:
        o["name-of-the-job"] = None

    try:
        o["name-of-the-company"] = alllitags[i].find("div", {"class": "companyInfo"}).find("span", {"class": "companyName"}).text
    except:
        o["name-of-the-company"] = None

    try:
        o["rating"] = alllitags[i].find("div", {"class": "companyInfo"}).find("span", {"class": "ratingsDisplay"}).text
    except:
        o["rating"] = None

    try:
        o["salary"] = alllitags[i].find("div", {"class": "salary-snippet-container"}).text
    except:
        o["salary"] = None

    try:
        o["job-details"] = alllitags[i].find("div", {"class": "metadata taxoAttributes-container"}).find("ul").text
    except:
        o["job-details"] = None

    l.append(o)
    o = {}


print(l)

Using Scrapingdog for scraping Indeed

You have to sign up for a free account to start using it. It will take just 10 seconds to get started with Scrapingdog.

Once you sign up, you will be redirected to your dashboard. The dashboard will look somewhat like this.

You have to use your own API key.

Now, paste your target Indeed page URL on the left, select No for JS Rendering, and click Copy Code on the right. Then use this API URL in your script to scrape Indeed.

You will notice that the code remains almost the same as above. We just have to change one thing: our target URL.

import requests
from bs4 import BeautifulSoup

l=[]
o={}


target_url = "https://api.scrapingdog.com/scrape?api_key=xxxxxxxxxxxxxxxx&url=https://www.indeed.com/jobs?q=python&l=New+York%2C+NY&vjk=8bf2e735050604df&dynamic=false"


resp = requests.get(target_url)
print(resp.status_code)
soup = BeautifulSoup(resp.text, 'html.parser')

allData = soup.find("ul",{"class":"jobsearch-ResultsList css-0"})

alllitags = allData.find_all("div",{"class":"cardOutline"})
print(len(alllitags))
for i in range(0, len(alllitags)):
    try:
        o["name-of-the-job"] = alllitags[i].find("a", {"class": "jcs-JobTitle css-jspxzf eu4oa1w0"}).text
    except:
        o["name-of-the-job"] = None

    try:
        o["name-of-the-company"] = alllitags[i].find("div", {"class": "companyInfo"}).find("span", {"class": "companyName"}).text
    except:
        o["name-of-the-company"] = None

    try:
        o["rating"] = alllitags[i].find("div", {"class": "companyInfo"}).find("span", {"class": "ratingsDisplay"}).text
    except:
        o["rating"] = None

    try:
        o["salary"] = alllitags[i].find("div", {"class": "salary-snippet-container"}).text
    except:
        o["salary"] = None

    try:
        o["job-details"] = alllitags[i].find("div", {"class": "metadata taxoAttributes-container"}).find("ul").text
    except:
        o["job-details"] = None

    l.append(o)
    o = {}


print(l)

As you can see, we have replaced the target URL of Indeed with the API URL of Scrapingdog. You have to use your own API key in order to run this script successfully.

With this script, you will be able to scrape Indeed at lightning-fast speed without getting blocked.

Conclusion

In this tutorial, we scraped Indeed job postings with Requests and BS4. Of course, you can modify the code a little to extract other details as well.

You can change the page URL to scrape jobs from the next page. Obviously, you have to find out how the URL changes when you move to the next page by clicking a number at the bottom of the page. For scraping millions of such postings, you can always use Scrapingdog😜.
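To sketch that pagination idea: at the time of writing, Indeed's result pages appear to be driven by a `start` query parameter that advances in steps of 10, but verify this in your own browser before relying on it, as it may change. The `page_urls` helper below is my own illustration, not part of the original script:

```python
def page_urls(base_url, pages, step=10):
    """Yield one search URL per result page by appending a `start` offset."""
    for page in range(pages):
        yield f"{base_url}&start={page * step}"

# Build URLs for the first three result pages of the same search.
base = "https://www.indeed.com/jobs?q=python&l=New+York%2C+NY"
urls = list(page_urls(base, 3))
```

Each of these URLs could then be fetched and parsed with the same loop shown above.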

I hope you liked this little tutorial, and if you did, please do not forget to share it with your friends and on your social media.
