Exploring FIFA World Cup Analytics with Python and Data Science
Chapter 1: Introduction to World Cup Analytics
The 2022 FIFA World Cup in Qatar is nearing its conclusion, which makes it a great time to reflect on the tournament's history. Brazil holds the record for the most World Cup victories with five titles, followed by Germany and Italy with four each. But do you know which country has played the most matches or scored the most goals? Or which nations have been the most and least successful in terms of wins and losses?
Join me on this journey to uncover these statistics using Python's BeautifulSoup and Pandas libraries to scrape and analyze data from Wikipedia’s World Cup pages. This project is part of my Python Data Science December series, and all necessary resources, datasets, and Python libraries can be found at the end of the article. ⚽️
Section 1.1: Scraping Data from the 2002 World Cup
To kick off our exploration, I decided to revisit the 2002 FIFA World Cup held in Japan and South Korea, a tournament I fondly remember watching in my youth.
After examining the Wikipedia page, I noticed that the matches are categorized into two sections: the Group Stage and the Knockout Stage. Each match entry follows a consistent format encapsulated in a div with the class name footballbox.
Within this div, we aim to extract three crucial pieces of information: the home team, the away team, and the score. These details sit in three th elements, identified by the class names fhome, faway, and fscore. Additionally, the year of the World Cup is found at the start of the page title.
Let’s dive into the code. We will import BeautifulSoup and requests, set the URL for the 2002 World Cup page, and use a GET request to retrieve the HTML content. We’ll then parse this content with BeautifulSoup, which allows us to navigate the HTML structure and extract the required data.
# Importing necessary libraries
import requests
from bs4 import BeautifulSoup

# Setting the URL for the 2002 World Cup page
# (assumed: the English Wikipedia article)
url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup'
response = requests.get(url)
response.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(response.text, 'html.parser')

# Extracting the year from the title (its first four characters)
year = soup.title.string[:4]

# Looping through the matches to retrieve teams and scores
for match in soup.find_all('div', class_='footballbox'):
    teamA = match.find('th', class_='fhome').text.strip()
    teamB = match.find('th', class_='faway').text.strip()
    score = match.find('th', class_='fscore').text.strip()
    print(f'{year} Match: {teamA} vs {teamB} - Score: {score}')
This initial phase is straightforward, but we need to enhance our code to calculate additional statistics. Specifically, we will differentiate between goals scored by Team A and Team B and determine the winning team for each match.
By splitting the score string at the "–" character, we can ascertain the goals for each team. The logic is simple:
- Team A wins if they score more than Team B.
- Team B wins if they score more than Team A.
- If both teams score the same, it’s a draw.
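The three rules above can be sketched as a small helper. This is a minimal illustration, not the article's final code: the names `parse_score` and `match_result` are hypothetical, and it assumes the score string is a plain pair such as "0–2" with no extra-time or penalty annotation.

```python
import re


def parse_score(score: str) -> tuple[int, int]:
    """Split a score such as '0–2' into integer goals for Team A and Team B."""
    # Wikipedia separates the goals with an en dash; a plain hyphen is accepted too.
    goals_a, goals_b = re.split(r"\s*[–-]\s*", score.strip(), maxsplit=1)
    return int(goals_a), int(goals_b)


def match_result(team_a: str, team_b: str, score: str) -> str:
    """Return the winning team's name, or 'Draw' when the goals are level."""
    goals_a, goals_b = parse_score(score)
    if goals_a > goals_b:
        return team_a
    if goals_b > goals_a:
        return team_b
    return "Draw"


# The 2002 final: Germany 0–2 Brazil
print(match_result("Germany", "Brazil", "0–2"))
```

Returning the goals as integers, rather than comparing the raw strings, keeps the comparison numeric, so a score like "10–2" is handled correctly.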
However, the situation gets tricky in matches that go to extra time or penalties, as seen in the Spain vs. Republic of Ireland match, which finished 1–1 after extra time and was decided 3–2 on penalties. To address this, we will introduce a variable that records whether a match went to extra time and adjust our logic accordingly.
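One way to handle such matches is sketched below. It rests on assumptions: the score text may carry an "(a.e.t.)" suffix, and the penalty shoot-out score, if any, has already been extracted into its own string. How the penalty score is located in the page HTML is not covered here, and the function name `decide_match` is hypothetical.

```python
import re

# Two integers separated by an en dash or a hyphen, e.g. "1–1" or "3-2".
SCORE_RE = re.compile(r"(\d+)\s*[–-]\s*(\d+)")


def decide_match(team_a, team_b, score_text, penalties_text=None):
    """Return (goals_a, goals_b, extra_time, winner) for one match."""
    found = SCORE_RE.search(score_text)
    if found is None:
        raise ValueError(f"no score found in {score_text!r}")
    goals_a, goals_b = int(found.group(1)), int(found.group(2))

    # Assumption: extra time is flagged by "a.e.t." in the score text.
    extra_time = "a.e.t" in score_text.lower()

    winner = "Draw"
    if goals_a != goals_b:
        winner = team_a if goals_a > goals_b else team_b
    elif penalties_text:
        # Level after extra time: the shoot-out decides who advances.
        pens = SCORE_RE.search(penalties_text)
        if pens is not None:
            pens_a, pens_b = int(pens.group(1)), int(pens.group(2))
            if pens_a != pens_b:
                winner = team_a if pens_a > pens_b else team_b
    return goals_a, goals_b, extra_time, winner


print(decide_match("Spain", "Republic of Ireland", "1–1 (a.e.t.)", "3–2"))
```

Whether a shoot-out win should count as a win or as a draw in the final statistics is a design choice; this sketch credits the shoot-out winner while leaving the goal totals untouched.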
Section 1.2: Automating Data Scraping for All World Cups
Once we understand how to scrape data from one World Cup page, the next step is to automate the process for all tournaments. Wikipedia has a comprehensive page listing all FIFA World Cups, which we can utilize.
Using BeautifulSoup, we will locate the table containing the World Cup data. Since all tables share the same class, we must find a unique identifier, such as the header labeled "Attendance," to pinpoint our target table.
The following code will help us gather all World Cup URLs:
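# Importing additional libraries
import csv
import re

# URL for the main World Cup page
# (assumed: the English Wikipedia article)
url = 'https://en.wikipedia.org/wiki/FIFA_World_Cup'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Locating the table via its "Attendance" header cell
# (a regex tolerates whitespace around the header text)
attendance_header = soup.find('th', string=re.compile('Attendance'))
table = attendance_header.find_parent('table')

# Extracting World Cup links
worldCups = []
for link in table.find_all('a'):
    href = link.get('href')
    # Skip anchors without an href, which would otherwise raise a TypeError
    if href and 'FIFA_World_Cup' in href:
        worldCups.append(href)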
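The links collected this way are site-relative paths such as /wiki/2002_FIFA_World_Cup, and the substring check can also let through related pages or duplicates. The sketch below shows one possible clean-up step. It assumes each tournament page follows the "<year>_FIFA_World_Cup" naming pattern, and the helper name `tournament_urls` is hypothetical.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"

# Assumption: tournament pages are named "/wiki/<year>_FIFA_World_Cup".
TOURNAMENT_RE = re.compile(r"^/wiki/(\d{4})_FIFA_World_Cup$")


def tournament_urls(hrefs):
    """Keep one absolute URL per tournament year, sorted by year."""
    by_year = {}
    for href in hrefs:
        found = TOURNAMENT_RE.match(href or "")
        if found:
            # A dict keyed by year drops duplicate links to the same page.
            by_year[int(found.group(1))] = urljoin(BASE_URL, href)
    return [by_year[year] for year in sorted(by_year)]


sample = [
    "/wiki/2002_FIFA_World_Cup",
    "/wiki/FIFA_World_Cup_Trophy",
    "/wiki/1930_FIFA_World_Cup",
    "/wiki/2002_FIFA_World_Cup",
]
print(tournament_urls(sample))
```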
With the URLs collected, we can apply the same logic we used for the 2002 World Cup to scrape data from each tournament.
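Once every match is stored as a row, Pandas can answer the questions from the introduction. The sketch below is only an illustration: the column names and the `team_totals` helper are assumptions, and the three hand-entered rows (all from 2002) stand in for the full scraped dataset. Wins and losses are counted from goals alone, so penalty shoot-outs are ignored here.

```python
import pandas as pd

# Three 2002 results entered by hand as a stand-in for the scraped data.
matches = pd.DataFrame(
    [
        (2002, "France", "Senegal", 0, 1),
        (2002, "Germany", "Saudi Arabia", 8, 0),
        (2002, "Germany", "Brazil", 0, 2),
    ],
    columns=["year", "team_a", "team_b", "goals_a", "goals_b"],
)


def team_totals(matches: pd.DataFrame) -> pd.DataFrame:
    """Aggregate matches played, goals scored, wins, and losses per team."""
    cols = ["team", "goals_for", "goals_against"]
    # Reshape so that each match contributes one row per participating team.
    as_a = matches.rename(
        columns={"team_a": "team", "goals_a": "goals_for", "goals_b": "goals_against"}
    )[cols]
    as_b = matches.rename(
        columns={"team_b": "team", "goals_b": "goals_for", "goals_a": "goals_against"}
    )[cols]
    per_team = pd.concat([as_a, as_b], ignore_index=True)
    per_team["win"] = per_team["goals_for"] > per_team["goals_against"]
    per_team["loss"] = per_team["goals_for"] < per_team["goals_against"]

    totals = per_team.groupby("team").agg(
        matches=("goals_for", "size"),
        goals=("goals_for", "sum"),
        wins=("win", "sum"),
        losses=("loss", "sum"),
    )
    return totals.sort_values(["matches", "goals"], ascending=False)


totals = team_totals(matches)
print(totals)
```

Reshaping to one row per team per match means a single groupby covers both sides of every fixture, instead of aggregating Team A and Team B separately and merging the results.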
Chapter 2: Video Insights and Practical Applications
To further enhance your understanding of World Cup analytics using Python, check out these insightful videos:
FIFA World Cup Match Analysis using Python by Ayushi Sahu.
In this video, Ayushi Sahu conducts an analysis of FIFA World Cup matches using Python, highlighting various data science techniques.
Loading and Investigating World Cup Data in Python.
This video walks through the process of loading and exploring World Cup data in Python, providing practical examples and coding tips.
📓 Summary & Resources
This marks the end of the 14th installment in my Python Data Science December series, where we explored the Wikipedia pages to gather and analyze football match data. For further details, feel free to engage in the comments, and don’t hesitate to follow my work on Medium for more insights. You can also access the complete code and dataset on GitHub.
✔️ GitHub — Full Code & Dataset