Python Web Scraping Using Beautiful Soup: A Step-by-Step Tutorial

Compared to other Python web scraping libraries and frameworks, BeautifulSoup has an easy-to-moderate learning curve. This makes it ideal for web scraping beginners as well as experts. Why?

BeautifulSoup’s syntax is straightforward, and you get support from a large community of developers plus extensive documentation to help you along. On top of that, real-world web pages are notorious for invalid HTML, which BeautifulSoup handles gracefully.

Despite the pros, note that BeautifulSoup shines in scraping small to medium, well-structured websites with relatively straightforward HTML.

BeautifulSoup can slow down when scraping complex websites with large, intricate HTML documents. It also cannot make HTTP requests on its own, so you pair it with a library that can. With that in mind, follow these steps to scrape a website with BeautifulSoup successfully:

1. Install Python and Create a Virtual Environment

Visit Python’s official website and download the latest version for your operating system (Linux, macOS, or Windows). Run the Python installer and follow the installation instructions. Make sure to check the “Add Python to PATH” box during installation.

Next, verify that Python is correctly installed by opening the terminal on Linux or macOS (or Command Prompt on Windows) and typing this command:

Terminal Command: python --version
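You should see output like ‘Python 3.12.4’ (the exact number depends on the version you installed).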

If you have installed Python correctly, proceed to create a virtual environment. With a virtual environment, you can separate project dependencies, ensuring that your Python web scraping projects do not conflict with other projects on your computer system.

Virtualenv is a popular tool for creating isolated environments when web scraping with Python. You can install this tool using Python’s package installer, pip.

Terminal Command: pip install virtualenv

After installing virtualenv, create a new directory for your scraping project. Then move into that directory and create a virtual environment named ‘scraping_env.’

Terminal Commands:

mkdir my_first_scraping_project
cd my_first_scraping_project
virtualenv scraping_env

The last command creates a new directory, ‘scraping_env,’ containing a Python executable and its own copy of ‘pip,’ so you can install packages in isolation.

Remember why we created a virtual environment? Yes, to keep your project’s dependencies separate from the rest of your system. For that isolation to take effect, you need to work from within the virtual environment. So, activate the one you just created using this command:

Terminal Command: source scraping_env/bin/activate
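On Windows, the activation script lives in a ‘Scripts’ folder instead:

Terminal Command: scraping_env\Scripts\activate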

After activation, your terminal prompt should change to indicate that you are now working within ‘scraping_env.’ Don’t close the terminal, though. Proceed to the next step.

2. Install Necessary Libraries and Set Up Your Coding Environment

Once you have a virtual environment set up and activated, install the necessary libraries: BeautifulSoup (your web scraping library) and Requests (a library for making HTTP requests to fetch HTML content).

Terminal Command: pip install beautifulsoup4 requests
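As an optional sanity check, confirm that both libraries import cleanly:

Terminal Command: python -c "import bs4, requests"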

From here, you are ready to begin web scraping with BeautifulSoup. You can write your scripts in a plain text editor and run them from the terminal, or opt to set up a dedicated coding environment.

Using an IDE or code editor to write HTML fetching and parsing scripts enhances your experience by granting you access to features like code completion and syntax highlighting. IDEs also come with built-in tools for debugging and more. Popular IDEs or code editors include PyCharm, Jupyter Notebook, and Visual Studio Code.

3. Inspect the Data Source (Website)

BeautifulSoup is unlikely to help if you want to scrape a dynamic website. It parses only the HTML content that’s statically available in the web page source. So, use browser developer tools to check whether a page is static. If you inspect a page and see its full content in the raw HTML, the page is static.

Conversely, if the page source contains <div> tags with no content or template placeholders like ‘{{ data }},’ the page is rendered dynamically with JavaScript.

Other than inspecting a website using developer tools, you can check its URL to determine whether it is static or dynamic.

Static page URLs often end in an ‘.html’ extension, while dynamic pages frequently use server-side extensions like ‘.php’ or ‘.asp’ and query parameters like ‘?id=65837.’ The URL alone is not a reliable indicator, though; what matters for BeautifulSoup is whether the content is rendered client-side with JavaScript, so inspecting the page source remains the surer test.
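As a final check, you can fetch the raw HTML programmatically and see whether the data you want appears in it. Here is a minimal sketch, assuming a hypothetical URL and a target phrase you can see in the browser:

Code:

import requests

# hypothetical target URL and a phrase visible in the browser
url = 'http://example.com'
target_phrase = 'Example Domain'

# fetch the raw HTML, exactly as BeautifulSoup would receive it
raw_html = requests.get(url).text

# True means the content is in the static HTML and BeautifulSoup can parse it;
# False suggests the content is rendered by JavaScript after the page loads
print(target_phrase in raw_html)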

4. Fetch HTML Content From a Page

Before extracting specific data from an HTML page, you must download (or fetch) it from your target website. To do that, you’ll use the Requests library.

First, make sure your virtual environment is activated. If you have closed the terminal, navigate to your project directory (my_first_scraping_project) and reactivate the virtual environment.

From the terminal, within your project directory, create a Python file for the script that retrieves HTML content. Then open the file in your text editor of choice and write this piece of code:

Code:

import requests
from bs4 import BeautifulSoup

url = 'http://example.com'

# send an HTTP GET request and store the server's response
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Replace ‘http://example.com’ with the target website’s URL. After writing the code, head back to your terminal and run the script. Doing so should fetch the HTML page. How? The requests.get() function sends an HTTP GET request to the URL you specify and stores the server’s response.
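To run it, make sure you’re still in your project directory. For example, if you saved the script as ‘scrape.py’ (the file name is up to you):

Terminal Command: python scrape.py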

The ‘response’ object captures the server’s reply, and its text attribute holds the page’s HTML as a string, so the HTML ends up in ‘response.text.’ Fetching the HTML content is successful when the status code is 200; otherwise, the script prints a failure message.

5. Parse HTML Using Beautiful Soup and Extract Specific Data

BeautifulSoup scrapes data from HTML pages by creating a BeautifulSoup object from the downloaded HTML content. The object represents the HTML page in a nested data structure called a parse tree, which is why BeautifulSoup is known as an HTML/XML parsing library.

Only after parsing the downloaded HTML page can you extract various data with the help of built-in BeautifulSoup methods.

Here is an example of code that you can add to your script to extract the title and paragraph of a web page after parsing it with BeautifulSoup:

if response.status_code == 200:
    html_content = response.text

    # parse the server's response into a BeautifulSoup object
    soup = BeautifulSoup(html_content, 'html.parser')

    # then you can use various methods to extract particular data
    # here is how you extract the title of the HTML page
    title = soup.title.string
    print('Title:', title)

    # and here is how you extract the text of every paragraph
    paragraphs = soup.find_all('p')
    for para in paragraphs:
        print(para.text)
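The same pattern works for other elements. For instance, here is a short sketch that collects every link on the page (every <a> tag with an href attribute), reusing the ‘soup’ object created above:

# print the destination of every hyperlink on the page
for link in soup.find_all('a', href=True):
    print(link['href'])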

Closing Words

BeautifulSoup is a handy web scraping Python library that allows you to quickly parse and navigate HTML or XML documents without the need for complex code. Whether a beginner or an expert, you’ll find its simplicity and ease of use charming. Moreover, BeautifulSoup integrates seamlessly with other Python libraries and easily handles “broken” HTML.

So, keep learning with the help of this blog post and more resources from the BeautifulSoup community. And while you’re at it, remember to scrape websites ethically!