Web Scraping Sports Data
Sports data scraped from the web is a common input for performance analytics. You can find it on sites such as the NBA, FIFA, NFL, and Yahoo Sports. The same data can power your own sports app: with web scraping you can show near-real-time scores in a mobile or web app. In this post we will learn to scrape FIFA World Cup 2022 data from Yahoo Sports.
We will use Python, one of the most popular languages for web scraping. By the end of this article, you will be able to scrape live FIFA match data. The process is simple, and we’ll go through it step by step.
Why use Python to Scrape Sports Data
Python is a versatile language that is used extensively for web scraping, and it has dedicated libraries for the job.
It also has a large community, so you can usually find help when you run into trouble. If you are new to web scraping with Python, I recommend going through this comprehensive guide to web scraping with Python.
Requirements for Scraping Yahoo Sports
We need Python 3.x for this tutorial, and I am assuming that you have already installed it on your computer. You also need to install two libraries that we will use later in this tutorial.
- Requests will help us to make an HTTP connection with Yahoo Sports.
- BeautifulSoup will help us to create an HTML tree for smooth data extraction.
Setup
First, create a folder and then install the libraries mentioned above.
mkdir sports
pip install requests
pip install beautifulsoup4
Inside this folder, create a Python file where we will write the code. This is the data point that we are going to scrape from the target website.
- Live Game Data
How to Scrape Yahoo Sports
First, we should make a normal GET request to the target URL and check whether it returns 200 or not.
import requests
from bs4 import BeautifulSoup
l=list()
o={}
target_url="https://sports.yahoo.com/soccer/world-cup/scoreboard/"
resp=requests.get(target_url)
print(resp.status_code)
Let me explain what we have done here. We have declared a target URL and made an HTTP GET request to it. The list l and the dictionary o will hold the extracted data later.
If it prints 200, the request succeeded. Otherwise, you can pass a User-Agent header so that the request looks like it comes from a real browser. Now we can use BS4 to extract useful data.
soup=BeautifulSoup(resp.text, 'html.parser')
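If the status check above does not print 200, sending browser-like headers is one thing to try. The following is a minimal sketch, not a guaranteed fix: the User-Agent string is just an example value, and the fetch_html helper, the HEADERS name, and the 10-second timeout are assumptions made for illustration.

```python
import requests

# Assumption: an example desktop-browser User-Agent string; any current one can be used.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails or is not a 200."""
    try:
        resp = requests.get(url, headers=HEADERS, timeout=timeout)
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and malformed URLs
        print(f"Request failed: {exc}")
        return None
    if resp.status_code != 200:
        print(f"Unexpected status code: {resp.status_code}")
        return None
    return resp.text
```

Calling fetch_html(target_url) returns the HTML text that can be handed to BeautifulSoup, or None when the site refuses the request.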
Let’s find the DOM location of live data.
All the live data is located under an a (anchor) tag with the class gamecard-in_progress. Let’s declare a variable where we can hold all this data in one place.
allData = soup.find("a",{"class":"gamecard-in_progress"})
teams= allData.find_all("li",{"class":"team"})
The allData variable stores the complete tree of the element with the class gamecard-in_progress. teams is a list that holds the information for the two teams shown inside the box.
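One thing to keep in mind is that find() returns None when no game is in progress, so calling find_all on the result would raise an AttributeError. Here is a small sketch of a guarded lookup. The find_live_teams helper and the SAMPLE_HTML snippet are hypothetical: the markup is a heavily simplified stand-in for the real page, used only so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the structure described above.
SAMPLE_HTML = """
<a class="gamecard-in_progress">
  <ul>
    <li class="team"><span data-tst="first-name">Team One</span></li>
    <li class="team"><span data-tst="first-name">Team Two</span></li>
  </ul>
</a>
"""


def find_live_teams(html):
    """Return the two team <li> elements of the first live game, or None."""
    soup = BeautifulSoup(html, "html.parser")
    card = soup.find("a", {"class": "gamecard-in_progress"})
    if card is None:
        # No game is in progress right now
        return None
    teams = card.find_all("li", {"class": "team"})
    if len(teams) < 2:
        # The card is incomplete or the markup has changed
        return None
    return teams


teams = find_live_teams(SAMPLE_HTML)
print("live game found" if teams else "no live game")
```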
Now, you can inspect to find where the names and scores are located.
As you can see the name is stored under the span tag with an attribute of data-tst which has a value of first-name. Let’s see where the score is stored.
The score is stored under the div tag with the class “Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)”. We have all the information to extract all the data we need for live score updates.
o["Team-A"]=teams[0].find("span",{"data-tst":"first-name"}).text
o["Score-A"]=teams[0].find("div",{"class":"Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)"}).text
o["Team-B"]=teams[1].find("span",{"data-tst":"first-name"}).text
o["Score-B"]=teams[1].find("div",{"class":"Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)"}).text
l.append(o)
print(l)
When you run the script, it should print the live score of the FIFA game currently in progress.
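A single run gives one snapshot of the score. To keep an app close to real time, the scrape can be repeated at a fixed interval. The sketch below shows one possible polling loop. The poll_scores function and its parameters are assumptions, and fetch_scores stands for any callable that wraps the scraping steps above and returns a dictionary (or None when there is no live game).

```python
import time


def poll_scores(fetch_scores, interval_seconds=60, max_polls=5):
    """Call fetch_scores repeatedly and collect every result that differs from the last one."""
    updates = []
    last = None
    for attempt in range(max_polls):
        current = fetch_scores()
        if current is not None and current != last:
            # Only report a score when it has changed
            print(current)
            updates.append(current)
            last = current
        if attempt < max_polls - 1:
            # Wait between requests so the site is not hammered
            time.sleep(interval_seconds)
    return updates
```

A longer interval keeps the request rate low, which also reduces the chance of being blocked.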
Complete Code
With just a few changes to the code, you can also extract data for upcoming and past games. For now, the complete code looks like this.
import requests
from bs4 import BeautifulSoup

# Class string of the score element, copied from the page markup
score_class = "Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)"
l = list()
o = {}
target_url = "https://sports.yahoo.com/soccer/world-cup/scoreboard/"

# Fetch the scoreboard page and parse it into an HTML tree
resp = requests.get(target_url)
soup = BeautifulSoup(resp.text, 'html.parser')

# find() returns None when no game is in progress
allData = soup.find("a", {"class": "gamecard-in_progress"})
if allData is None:
    print("No live game found")
else:
    teams = allData.find_all("li", {"class": "team"})
    o["Team-A"] = teams[0].find("span", {"data-tst": "first-name"}).text
    o["Score-A"] = teams[0].find("div", {"class": score_class}).text
    o["Team-B"] = teams[1].find("span", {"data-tst": "first-name"}).text
    o["Score-B"] = teams[1].find("div", {"class": score_class}).text
    l.append(o)
    print(l)
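The script above reads only the first live game on the page. If several games are in progress at once, find_all can be used to walk over every card. This is a sketch under the assumption that each live card uses the same markup as the one inspected earlier; scrape_live_games and text_or_none are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Class string of the score element, as found in the page markup earlier
SCORE_CLASS = "Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)"


def text_or_none(parent, tag, attrs):
    """Return the stripped text of the first match under parent, or None."""
    node = parent.find(tag, attrs)
    return node.get_text(strip=True) if node else None


def scrape_live_games(html):
    """Return one dictionary per live game card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    games = []
    for card in soup.find_all("a", {"class": "gamecard-in_progress"}):
        teams = card.find_all("li", {"class": "team"})
        if len(teams) < 2:
            continue  # skip cards that do not list two teams
        games.append({
            "Team-A": text_or_none(teams[0], "span", {"data-tst": "first-name"}),
            "Score-A": text_or_none(teams[0], "div", {"class": SCORE_CLASS}),
            "Team-B": text_or_none(teams[1], "span", {"data-tst": "first-name"}),
            "Score-B": text_or_none(teams[1], "div", {"class": SCORE_CLASS}),
        })
    return games
```

Passing resp.text to scrape_live_games returns a list that is empty when no game is live, so the caller does not need a separate None check.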
Conclusion
Extracting player and score information is essential if you want to build your own sports app or website. Scraping data from non-English websites and presenting it in English can also give your app an edge.
Python makes it easy to pull all this information. It has strong community support and a long list of libraries that make web scraping approachable for beginners.
However, scraping at scale is not possible with this process alone. After some time, Yahoo Sports may block your IP address, and your data pipeline will stop working. For seamless scraping, use a Web Scraping API that rotates IPs on every new request and uses headless Chrome to reduce the chance of being blocked.
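To illustrate what rotating IPs on every request means, here is a rough sketch of manual round-robin proxy rotation. It is not how any particular scraping API works internally. The proxy URLs are hypothetical placeholders, and proxy_rotation is an assumed helper name; the dictionaries it yields follow the shape that the proxies argument of requests.get accepts.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace them with proxies you actually control.
PROXY_URLS = [
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
]


def proxy_rotation(proxy_urls):
    """Yield proxy mappings in round-robin order, one per request."""
    for url in cycle(proxy_urls):
        yield {"http": url, "https": url}


rotation = proxy_rotation(PROXY_URLS)
# Usage with requests (not executed here):
#   resp = requests.get(target_url, proxies=next(rotation), timeout=10)
```

Each call to next(rotation) hands out the next proxy in the list and wraps around at the end, so consecutive requests leave from different addresses.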