Web Scraping Amazon with Python

Scrapingdog
13 min read · Mar 22, 2023


Price monitoring has become crucial in e-commerce. It can help you cut costs and increase profitability, and knowing the prices your competitors have set gives you a competitive advantage in the market.

With this data you can identify the price point that increases your sales. The technique is just as effective in the travel business, where you can lower your prices to gain an edge. In businesses like these, pricing strategy is very important.

In this article, we are going to scrape Amazon with Python to keep track of price changes for a particular item.

Setting up the prerequisites

I am assuming that you have already installed Python 3.x on your machine. If not, you can download it from here. Apart from this, we will require two third-party Python libraries.

  • Requests: Using this library we will make an HTTP connection with the Amazon page. This library will help us download the raw HTML of the target page.
  • BeautifulSoup: This is a powerful data parsing library. Using it we will extract the necessary data from the raw HTML we get through the requests library.

Before we install these libraries we will have to create a dedicated folder for our project.

mkdir amazonscraper

Now, we will have to install the above two libraries. Here is how you can do it.

pip install beautifulsoup4
pip install requests

Now, you can create a Python file with any name you wish. This will be the main file where we will keep our code. I am naming it amazon.py.

Downloading raw data from amazon.com

Let’s make a normal GET request to our target page and see what happens. For the GET request we are going to use the requests library.

import requests
from bs4 import BeautifulSoup

target_url="https://www.amazon.com/dp/B0BSHF7WHW"

resp = requests.get(target_url)

print(resp.text)

Once you run this code, you might see this.

This is a captcha from amazon.com. It appears when their system detects that the incoming request comes from a bot or script and not from a real human being.
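
Before parsing a response, it can help to check whether you received the captcha page instead of the product page. The sketch below is one hypothetical way to do that. The helper name looks_like_captcha and the marker strings are assumptions (phrases that Amazon's bot-check page has typically contained), so adjust them if the page wording differs.

```python
# Marker strings are assumptions about the wording of the bot-check page;
# they are not taken from the article and may need updating.
CAPTCHA_MARKERS = (
    "Enter the characters you see below",
    "api-services-support@amazon.com",
)


def looks_like_captcha(html: str) -> bool:
    """Return True if the HTML appears to be a bot-check page, not a product page."""
    return any(marker in html for marker in CAPTCHA_MARKERS)


# Tiny made-up pages so the example runs without any network access.
sample_block_page = "<html><body><h4>Enter the characters you see below</h4></body></html>"
sample_product_page = '<html><body><h1 id="title"> Some product </h1></body></html>'

print(looks_like_captcha(sample_block_page))    # True
print(looks_like_captcha(sample_product_page))  # False
```

In the scraper you would call looks_like_captcha(resp.text) right after the request and skip parsing when it returns True.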

To get past this on-site protection we can send some headers, such as User-Agent. You can check which headers your browser sends to amazon.com by opening the URL and looking at the request in the network tab of the developer tools.

Once you pass these headers, your request will look like one coming from a real browser. This can get you past the anti-bot wall of amazon.com. Let’s pass a few headers to our request.

import requests
from bs4 import BeautifulSoup

target_url="https://www.amazon.com/dp/B0BSHF7WHW"

headers={"accept-language": "en-US,en;q=0.9","accept-encoding": "gzip, deflate, br","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36","accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"}

resp = requests.get(target_url, headers=headers)

print(resp.text)

Once you run this code you might be able to bypass the anti-scraping protection wall of amazon.

Now let’s decide what exact information we want to scrape from the page.

What are we going to scrape?

It is always a good idea to decide in advance what you are going to extract from the target page. This way we can analyze in advance where each element is placed inside the DOM.

We are going to scrape five data elements from the page.

  • Name of the product
  • Images
  • Price (Most important😜)
  • Rating
  • Specs

First, we are going to make the GET request to the target page using the requests library, and then we are going to parse out this data using BS4. Of course, other libraries such as lxml can be used in place of BS4, but BS4 has a powerful and easy-to-use API.

Before making the request we are going to analyze the page and find the location of each element inside the DOM. You should always do this exercise before writing any parsing code.

We are going to do this with the browser's developer tools. You can open them by right-clicking on the target element and then clicking Inspect. This is the most common method, and you might already know it.

Identify the location of each element

Location of the title tag.

Once you inspect the title you will find that the title text is located inside the h1 tag with the id title.

Coming back to our amazon.py file, we will write the code to extract this information from amazon.

import requests
from bs4 import BeautifulSoup

l=[]
o={}

url="https://www.amazon.com/dp/B0BSHF7WHW"

headers={"accept-language": "en-US,en;q=0.9","accept-encoding": "gzip, deflate, br","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36","accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"}

resp = requests.get(url, headers=headers)
print(resp.status_code)

soup=BeautifulSoup(resp.text,'html.parser')

# soup.find() returns None when the tag is missing, so .text raises AttributeError
try:
    o["title"]=soup.find('h1',{'id':'title'}).text.strip()
except AttributeError:
    o["title"]=None

print(o)

Here the line soup=BeautifulSoup(resp.text,'html.parser') uses the BeautifulSoup library to create a BeautifulSoup object from the HTTP response text, with the specified HTML parser.

Then the soup.find() method returns the first occurrence of the h1 tag with the id title. We use the .text property to get the text from that element. Finally, the .strip() method removes the leading and trailing whitespace from the text we receive.
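
To see these three steps in isolation, here is a minimal sketch that runs on a small made-up HTML string instead of the live Amazon page. The sample markup and the product name in it are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure described above.
sample_html = """
<html><body>
  <h1 id="title">
      Example Laptop 16-inch
  </h1>
  <h1 id="other">Not the title</h1>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
heading = soup.find("h1", {"id": "title"})

raw_text = heading.text        # still contains the surrounding newlines and spaces
clean_text = raw_text.strip()  # leading and trailing whitespace removed

print(repr(raw_text))
print(repr(clean_text))

# A missing element gives None, which is why the scraper wraps .text in try/except.
print(soup.find("h1", {"id": "missing"}))
```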

Once you run this code you will get this.

{'title': 'Apple 2023 MacBook Pro Laptop M2 Pro chip with 12‑core CPU and 19‑core GPU: 16.2-inch Liquid Retina XDR Display, 16GB Unified Memory, 1TB SSD Storage. Works with iPhone/iPad; Space Gray'}

The above code builds on the earlier section about downloading the HTML of the target page. If you skipped that section, please read it before moving ahead.

Location of the image tag.

This might be the trickiest part of this tutorial. Let’s inspect the page and find out why.

As you can see, the img tag that holds the image sits inside a div tag with the class imgTagWrapper.

allimages = soup.find_all("div",{"class":"imgTagWrapper"})
print(len(allimages))

This prints 3. The page shows 6 images, but we are getting just 3. The reason behind this is JS rendering. Amazon loads its images through an AJAX request at the backend. That’s why we never receive these images when we make a plain HTTP request to the page through the requests library.

Finding the high-resolution images is not as simple as finding the title tag. But I will explain step by step how you can find all the images of the product🔥.

  1. Copy any product image URL from the page.
  2. Then click on the view page source to open the source page of the target webpage.
  3. Then search for this image.

You will find that all the images are stored as values of the hiRes key.

All this information is stored inside a script tag. Here we will use a regular expression to find the pattern "hiRes":"image_url".

We could still use BS4, but that would make the process lengthier and might slow down our scraper. Instead, we will use the non-greedy group (.+?), which matches one or more characters. Let me explain what each character in this expression means.

  • The . matches any character except a newline
  • The + matches one or more occurrences of the preceding element (here, the .).
  • The ? makes the match non-greedy, meaning that it will match the minimum number of characters needed to satisfy the pattern.

Basically, the regular expression will return all the matched sequences of characters from the HTML string we are going to pass.

import re
images = re.findall(r'"hiRes":"(.+?)"', resp.text)
o["images"]=images

This will return the URLs of all the high-resolution images of the product in a list. In general, regular expressions are not recommended for parsing HTML, but for a simple, well-defined pattern like this one they can do wonders.
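
To show why the ? matters, the sketch below runs the non-greedy pattern and a greedy variant on a short made-up string that imitates the script content. The example.com URLs and the extra keys are assumptions for illustration.

```python
import re

# Made-up fragment imitating the JSON-like data found in the page source.
sample = (
    '{"hiRes":"https://example.com/a.jpg","thumb":"https://example.com/a_t.jpg",'
    '"hiRes":"https://example.com/b.jpg","large":"x"}'
)

# Non-greedy: each match stops at the first closing quote after "hiRes":".
lazy = re.findall(r'"hiRes":"(.+?)"', sample)

# Greedy: the single match runs on to the last quote in the string.
greedy = re.findall(r'"hiRes":"(.+)"', sample)

print(lazy)         # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
print(len(greedy))  # 1
```

The greedy version swallows everything between the first hiRes key and the final quote, which is why the scraper uses (.+?).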

Parsing the price tag

There are two price tags on the page, but we will only extract the one which is just below the rating.

We can see that the price is stored inside a span tag with the class a-price. Once you find this tag, its first child span tag contains the price. Here is how you can do it.

try:
    o["price"]=soup.find("span",{"class":"a-price"}).find("span").text
except AttributeError:
    o["price"]=None

Once you print object o, you will get to see the price.

{'price': '$2,499.00'}
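
For price monitoring you will usually want the price as a number and not as a string, so that two prices can be compared. The following is a minimal sketch of such a conversion. The helper name parse_price is hypothetical, and it assumes a US-style format with commas as thousands separators and a dot as the decimal separator.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(price_text: Optional[str]) -> Optional[Decimal]:
    """Convert a string such as '$2,499.00' to a Decimal; return None if no number is found."""
    if not price_text:
        return None
    # Assumes US formatting: digits, optional thousands commas, optional decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))


print(parse_price("$2,499.00"))  # 2499.00
print(parse_price(None))         # None
```

Decimal is used here instead of float so that currency values are not affected by binary rounding.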

Extract rating

You can find the rating in the first i tag with class a-icon-star. Let’s see how to scrape this too.

try:
    o["rating"]=soup.find("i",{"class":"a-icon-star"}).text
except AttributeError:
    o["rating"]=None

It will return this.

{'rating': '4.1 out of 5 stars'}
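
If you want the rating as a number, the text can be converted with a small helper. This is a sketch only: parse_rating is a hypothetical name, and it assumes the text always follows the "X out of 5 stars" wording shown above.

```python
import re
from typing import Optional


def parse_rating(rating_text: Optional[str]) -> Optional[float]:
    """Extract 4.1 from a string such as '4.1 out of 5 stars'; return None otherwise."""
    if not rating_text:
        return None
    # Assumes the English wording "<number> out of 5".
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of\s+5", rating_text)
    return float(match.group(1)) if match else None


print(parse_rating("4.1 out of 5 stars"))  # 4.1
print(parse_rating(None))                  # None
```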

In the same manner, we can scrape the specs of the device.

Extract specs of the device

These specs are stored inside tr tags with the class a-spacing-small. Once you find these rows, you have to find both span tags under each of them to get the text. You can see this in the above image. Here is how it can be done.

specs_arr=[]
specs_obj={}

specs = soup.find_all("tr",{"class":"a-spacing-small"})

for u in range(0,len(specs)):
    spanTags = specs[u].find_all("span")
    # skip rows that lack a label span or a value span
    if len(spanTags) >= 2:
        specs_obj[spanTags[0].text]=spanTags[1].text

specs_arr.append(specs_obj)
o["specs"]=specs_arr

Using .find_all() we find all the tr tags with the class a-spacing-small. Then we run a for loop to iterate over these tr tags. Inside the loop we find all the span tags of the row: the first span holds the name of the spec and the second holds its value. Finally, we extract the text from both.
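
The same loop can be tried offline on a small made-up table. In this sketch the sample rows and the function name parse_specs are assumptions for illustration; the third row is deliberately incomplete to show why it helps to check the number of span tags before indexing.

```python
from bs4 import BeautifulSoup

# Made-up table imitating the structure described above.
sample_html = """
<table>
  <tr class="a-spacing-small"><td><span>Brand</span></td><td><span>Apple</span></td></tr>
  <tr class="a-spacing-small"><td><span>Screen Size</span></td><td><span> 16.2 Inches </span></td></tr>
  <tr class="a-spacing-small"><td><span>Incomplete row</span></td></tr>
</table>
"""


def parse_specs(html: str) -> dict:
    """Map the first span of each spec row (label) to the second span (value)."""
    soup = BeautifulSoup(html, "html.parser")
    specs = {}
    for row in soup.find_all("tr", {"class": "a-spacing-small"}):
        spans = row.find_all("span")
        if len(spans) < 2:
            continue  # a row without both label and value is skipped
        specs[spans[0].text.strip()] = spans[1].text.strip()
    return specs


print(parse_specs(sample_html))  # {'Brand': 'Apple', 'Screen Size': '16.2 Inches'}
```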

Once you print the object o it will look like this.

Throughout the tutorial, we have used try/except statements to avoid runtime errors. We have now managed to scrape all the data we decided to scrape at the beginning of the tutorial.

Complete Code

You can of course make a few changes to the code to extract more data, because the page is filled with a large amount of information. You can even use cron jobs to mail yourself an alert when the price drops. Or you can integrate this technique into your app, so that it mails your users when the price of any item on Amazon drops.

But for now, the code will look like this.

import requests
from bs4 import BeautifulSoup
import re

l=[]
o={}
specs_arr=[]
specs_obj={}

target_url="https://www.amazon.com/dp/B0BSHF7WHW"

headers={"accept-language": "en-US,en;q=0.9","accept-encoding": "gzip, deflate, br","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36","accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"}

resp = requests.get(target_url, headers=headers)
print(resp.status_code)
if(resp.status_code != 200):
    print(resp)
soup=BeautifulSoup(resp.text,'html.parser')

# soup.find() returns None when a tag is missing, so .text raises AttributeError
try:
    o["title"]=soup.find('h1',{'id':'title'}).text.strip()
except AttributeError:
    o["title"]=None

# every high-resolution image URL is stored under a "hiRes" key in a script tag
images = re.findall(r'"hiRes":"(.+?)"', resp.text)
o["images"]=images

try:
    o["price"]=soup.find("span",{"class":"a-price"}).find("span").text
except AttributeError:
    o["price"]=None

try:
    o["rating"]=soup.find("i",{"class":"a-icon-star"}).text
except AttributeError:
    o["rating"]=None

specs = soup.find_all("tr",{"class":"a-spacing-small"})

for u in range(0,len(specs)):
    spanTags = specs[u].find_all("span")
    # skip rows that lack a label span or a value span
    if len(spanTags) >= 2:
        specs_obj[spanTags[0].text]=spanTags[1].text

specs_arr.append(specs_obj)
o["specs"]=specs_arr
l.append(o)

print(l)
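
The price-alert idea mentioned earlier needs a way to remember the last price seen. Here is a minimal sketch of that part, assuming the price has already been converted to a Decimal. The function name price_dropped, the JSON state file, and the second price in the demo are all made up for illustration; sending the mail and scheduling the script with cron are left out.

```python
import json
import tempfile
from decimal import Decimal
from pathlib import Path


def price_dropped(state_file: Path, product_id: str, current_price: Decimal) -> bool:
    """Store current_price and return True if it is lower than the previously stored price."""
    history = json.loads(state_file.read_text()) if state_file.exists() else {}
    previous = history.get(product_id)
    history[product_id] = str(current_price)  # JSON has no Decimal type, so store a string
    state_file.write_text(json.dumps(history))
    return previous is not None and current_price < Decimal(previous)


# Demo in a temporary directory; the prices are made-up values.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "prices.json"
    first = price_dropped(state, "B0BSHF7WHW", Decimal("2499.00"))   # nothing stored yet
    second = price_dropped(state, "B0BSHF7WHW", Decimal("2299.00"))  # lower than before
    third = price_dropped(state, "B0BSHF7WHW", Decimal("2299.00"))   # unchanged
    print(first, second, third)  # False True False
```

A cron job could run the scraper, pass the parsed price to a function like this, and send the alert only when it returns True.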

Changing Headers on every request

With the above code, your scraping journey will come to a halt once Amazon recognizes a pattern in your requests. To avoid this, you can rotate a set of headers, changing them between requests to keep the scraper running. Here is how it can be done.

import requests
from bs4 import BeautifulSoup
import re
import random

l=[]
o={}
specs_arr=[]
specs_obj={}

useragents=['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4894.117 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4855.118 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4892.86 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4854.191 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4859.153 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36/null',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36,gzip(gfe)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4895.86 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_13) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4860.89 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4885.173 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4864.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_12) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4877.207 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_2_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML%2C like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4872.118 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_13) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4876.128 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML%2C like Gecko) Chrome/100.0.4896.127 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36']

target_url="https://www.amazon.com/dp/B0BSHF7WHW"

headers={"User-Agent":random.choice(useragents),"accept-language": "en-US,en;q=0.9","accept-encoding": "gzip, deflate, br","accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"}

resp = requests.get(target_url,headers=headers)
print(resp.status_code)
if(resp.status_code != 200):
    print(resp)
soup=BeautifulSoup(resp.text,'html.parser')

# soup.find() returns None when a tag is missing, so .text raises AttributeError
try:
    o["title"]=soup.find('h1',{'id':'title'}).text.strip()
except AttributeError:
    o["title"]=None

images = re.findall(r'"hiRes":"(.+?)"', resp.text)
o["images"]=images

try:
    o["price"]=soup.find("span",{"class":"a-price"}).find("span").text
except AttributeError:
    o["price"]=None

try:
    o["rating"]=soup.find("i",{"class":"a-icon-star"}).text
except AttributeError:
    o["rating"]=None

specs = soup.find_all("tr",{"class":"a-spacing-small"})

for u in range(0,len(specs)):
    spanTags = specs[u].find_all("span")
    # skip rows that lack a label span or a value span
    if len(spanTags) >= 2:
        specs_obj[spanTags[0].text]=spanTags[1].text

specs_arr.append(specs_obj)
o["specs"]=specs_arr
l.append(o)

print(l)

We are using the random library here to pick one of the 31 user agents in the useragents list on every run. Using realistic and reasonably recent user agents improves your chances of getting past the anti-scraping wall, so keep the list up to date with current browser versions.
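
One detail worth knowing when picking a random item: random.randint(a, b) includes both endpoints, so an upper bound equal to the length of the list can produce an index one past the end. random.choice avoids index arithmetic altogether. The sketch below shows the idea with a shortened pool; the names USER_AGENT_POOL, BASE_HEADERS, and build_headers are hypothetical.

```python
import random

# Shortened pool for the example; in the scraper this would be the full list.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36",
]

BASE_HEADERS = {"accept-language": "en-US,en;q=0.9"}


def build_headers(pool=USER_AGENT_POOL):
    """Return a fresh headers dict with a randomly chosen User-Agent."""
    headers = dict(BASE_HEADERS)  # copy so the shared base dict is never modified
    headers["User-Agent"] = random.choice(pool)
    return headers


print(build_headers()["User-Agent"] in USER_AGENT_POOL)  # True
```

Calling build_headers() before each request gives every request its own User-Agent without any risk of an IndexError.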

But again, this technique is not enough to scrape Amazon at scale. What if you want to scrape millions of such pages? Then rotating headers will not be enough, because all the requests still come from the same IP and that IP will get blocked. So, for mass scraping you have to use a web scraping proxy API to avoid getting blocked.

Using Scrapingdog for scraping Amazon

The advantages of using Scrapingdog are:

  • You won’t have to manage headers anymore.
  • Every request will go through a new IP. This keeps your IP anonymous.
  • Our API will automatically retry on its own if the first hit fails.
  • Scrapingdog uses residential proxies to scrape Amazon. This increases the success rate of scraping Amazon or any other such website.

You have to sign up for a free account to start using it. It takes just 10 seconds to get started with Scrapingdog.

Once you sign up, you will be redirected to your dashboard. The dashboard will look somewhat like this.

You have to use your own API key.

Now, you can paste your target page link on the left and then select JS Rendering as No. After this, click on Copy Code on the right. Now use this API in your script to scrape Amazon.

import requests
from bs4 import BeautifulSoup
import re


l=[]
o={}
specs_arr=[]
specs_obj={}



target_url="https://api.scrapingdog.com/scrape?api_key=xxxxxxxxxxxxxxxxxxxx&url=https://www.amazon.com/dp/B0BSHF7WHW&dynamic=false"



resp = requests.get(target_url)
print(resp.status_code)
if(resp.status_code != 200):
    print(resp)
soup=BeautifulSoup(resp.text,'html.parser')

# soup.find() returns None when a tag is missing, so .text raises AttributeError
try:
    o["title"]=soup.find('h1',{'id':'title'}).text.strip()
except AttributeError:
    o["title"]=None

images = re.findall(r'"hiRes":"(.+?)"', resp.text)
o["images"]=images

try:
    o["price"]=soup.find("span",{"class":"a-price"}).find("span").text
except AttributeError:
    o["price"]=None

try:
    o["rating"]=soup.find("i",{"class":"a-icon-star"}).text
except AttributeError:
    o["rating"]=None

specs = soup.find_all("tr",{"class":"a-spacing-small"})

for u in range(0,len(specs)):
    spanTags = specs[u].find_all("span")
    # skip rows that lack a label span or a value span
    if len(spanTags) >= 2:
        specs_obj[spanTags[0].text]=spanTags[1].text

specs_arr.append(specs_obj)
o["specs"]=specs_arr
l.append(o)

print(l)

You will notice that the code remains almost the same as above. We just have to change one thing, and that is our target URL. I am not even passing headers anymore. Isn’t that hassle-free scraping?

With this script, you will be able to scrape Amazon quickly and without getting blocked.
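
If you scrape more than one product, you may prefer to build the API URL in code and not paste it by hand. The sketch below uses the endpoint and the api_key, url, and dynamic parameters from the URL shown above. The helper name build_api_url is hypothetical, and percent-encoding the target URL is standard query-string practice, not something confirmed by the dashboard output.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://api.scrapingdog.com/scrape"


def build_api_url(api_key: str, page_url: str, dynamic: bool = False) -> str:
    """Build the API request URL, percent-encoding the target page URL."""
    params = {
        "api_key": api_key,
        "url": page_url,
        "dynamic": "true" if dynamic else "false",  # "false" means JS Rendering set to No
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


# The key below is the same placeholder as in the script above; use your own.
api_url = build_api_url("xxxxxxxxxxxxxxxxxxxx", "https://www.amazon.com/dp/B0BSHF7WHW")
print(api_url)

# Decoding the query string gives back the original target URL.
print(parse_qs(urlsplit(api_url).query)["url"][0])  # https://www.amazon.com/dp/B0BSHF7WHW
```

Encoding matters most when the target URL carries its own query string, since an unencoded & would otherwise be read as a separator between the API's parameters.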

Conclusion

In this tutorial we scraped various data elements from Amazon. First we used the requests library to download the raw HTML, and then we parsed the data we wanted using BS4. You can also use lxml in place of BS4 to extract data. Python and its libraries make scraping very simple, even for a beginner. Once you scale, you can switch to web scraping APIs to scrape millions of such pages.

The combination of requests and Scrapingdog can help you scale your scraper🔥. You will get a success rate of more than 99% while scraping Amazon with Scrapingdog.

I hope you liked this little tutorial. If you did, please do not forget to share it with your friends and on your social media❤️.


Scrapingdog

I usually talk about web scraping and yes web scraping only. You can find a web scraping API at www.scrapingdog.com