Why do people web scrape?
Before we look at why individuals and companies scrape the web, let us first understand what web scraping is.
What is Web Scraping?
Many websites around the globe hold a tremendous amount of valuable data, such as product pricing, hotel pricing, and financial data. We can use that data to get ahead of our competitors or to create a report on market sentiment. To get access to this data, you either have to copy and paste it manually or use web scraping services. If you choose to do it manually, extracting data from millions of pages would be practically impossible. This is where web scraping comes in.
Extracting data from a website automatically with a script, rather than by hand, is known as web scraping. The collected data can then be stored in a database or exported to CSV files. For example, you can use web scraping to export a list of product names and prices from an eCommerce website into a CSV file. If the website does not block you while scraping, you can write a Python or Node.js script to scrape it; if it does block you, opt for dedicated web scraping tools.
Frankly speaking, web scraping can be a challenge if you are a beginner or if you are facing a website with top-notch anti-bot detection, like LinkedIn. Modern websites are also built in many different ways, and many load their content through JavaScript rather than plain HTML. Let us understand how we can scrape websites of all kinds.
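As a rough illustration of the product export described above, here is a minimal Python sketch that uses only the standard library. The HTML snippet, the class names `product`, `name`, and `price`, and the helper `products_to_csv` are all assumptions made for this example; a real eCommerce page will have different markup and would first have to be downloaded.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded product listing page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">$24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">$39.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects [name, price] rows from spans with class 'name' and 'price'."""

    def __init__(self):
        super().__init__()
        self.rows = []       # completed [name, price] rows
        self._field = None   # field that the current text belongs to, if any
        self._current = {}   # fields collected for the product in progress

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li":
            # Keep the product only if both fields were found, then start over.
            if "name" in self._current and "price" in self._current:
                self.rows.append([self._current["name"], self._current["price"]])
            self._current = {}


def products_to_csv(html):
    """Parse the product list and return it as CSV text with a header row."""
    parser = ProductParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "price"])
    writer.writerows(parser.rows)
    return buffer.getvalue()


csv_text = products_to_csv(SAMPLE_HTML)
print(csv_text)
```

To save the result as a file instead of printing it, the same text can be written to something like `products.csv` with `open("products.csv", "w", newline="")`.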
How do web scrapers work?
Any web scraper is given a target URL. The scraper then either makes a normal HTTP GET request to extract the data, or it renders JavaScript if the website loads its content through several API calls. You can check which case applies by opening the network tab of your browser's developer tools on that website. Once this is done, the scraper can return the data as JSON, HTML, CSV, etc.
You can find many web scrapers on the internet, and they are available in many different forms:
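The simpler of the two paths, a plain HTTP GET followed by parsing, might look roughly like the Python sketch below. It only uses the standard library. The names `fetch_html`, `TitleAndLinks`, and `scrape`, as well as the `example-scraper/0.1` User-Agent string, are assumptions for illustration. It does not render JavaScript, so it will only see content that is present in the initial HTML response.

```python
import json
import urllib.request
from html.parser import HTMLParser


def fetch_html(url, timeout=10):
    """Make a plain HTTP GET request and return the response body as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleAndLinks(HTMLParser):
    """Collects the page title and the href of every link on the page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def scrape(url, fetch=fetch_html):
    """Fetch the target URL, parse it, and return the result as a JSON string."""
    parser = TitleAndLinks()
    parser.feed(fetch(url))
    return json.dumps(
        {"url": url, "title": parser.title.strip(), "links": parser.links}
    )
```

Calling `scrape("https://example.com")` would perform a real request. The `fetch` parameter is there so the download step can be swapped out, for example for a headless browser when a page needs JavaScript rendering, or for a stub when testing.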
- Browser Extension
- Desktop app
- APIs
- Proxies
Why do we need to scrape data?
- Many financial companies scrape data from the web so that they can buy and sell stocks at the right time. This data gives them a clear picture of trends and of where the next investments can be made.
- Many restaurants scrape reviews so that they can analyze which dish or department is not working well. This lets them make important decisions in time and improve their service.
- Travel companies scrape pricing from niche websites to keep track of competitors' pricing. To gain a competitive edge in the market, you need pricing data from your competitors' websites.
- Many enterprise businesses scrape Yelp to generate cold leads. They extract names and contact details into a sheet and then contact those leads to convert them into paying customers.
- eCommerce websites scrape the web to analyze which products are in demand or how to price a particular product.
- Many governments also scrape data before elections to analyze the mood of the nation, usually by outsourcing the job. This helps them pick topics for rallies.
Is Price Scraping even legal?
Well, the correct answer is yes, but only up to a certain extent. You can legally scrape publicly available pages. Scraping is generally considered legal if:
- The page is not behind an authentication wall.
- The data does not include any private information of a user.
- You follow the rules of the robots.txt file.
- You do not overload the host server with unnecessary calls.
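The last two points can be checked in code. The Python sketch below uses the standard library's `urllib.robotparser` to filter out disallowed URLs and to pause between requests. The robots.txt content, the `example-scraper` user agent, and the helpers `build_rules`, `allowed_urls`, `polite_delay`, and `polite_crawl` are assumptions for illustration; a real scraper would load the site's actual robots.txt.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content; a real scraper would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, user_agent, urls):
    """Keep only the URLs that robots.txt allows this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def polite_delay(rules, user_agent, default=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def polite_crawl(rules, user_agent, urls, fetch, sleep=time.sleep):
    """Fetch only the allowed URLs, pausing between consecutive requests."""
    delay = polite_delay(rules, user_agent)
    pages = {}
    for index, url in enumerate(allowed_urls(rules, user_agent, urls)):
        if index > 0:
            sleep(delay)
        pages[url] = fetch(url)
    return pages
```

Here `fetch` is whatever function downloads a page. Passing it in, along with `sleep`, keeps the politeness logic separate from the download logic, so either can be replaced without touching the other.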
Recently, LinkedIn filed a case against Mantheos, a Singapore-based company. This company was illegally selling LinkedIn member data to other companies and was also using the data for sentiment analysis. This is a perfect example of illegal scraping. You cannot scrape somebody’s private information and then sell it.
There have been many cases like this in the past, and not all of them were as clear-cut. Take the eBay vs Bidder’s Edge case: Bidder’s Edge (BE) was a price comparison website that regularly crawled product prices from eBay (an online auction company). The court granted eBay a preliminary injunction against BE. BE later filed an appeal, arguing that if all websites stopped scraping then the internet would cease to exist. This was an interesting case for the scraping industry.