How to Extract Phone Numbers from Internal Web Pages Using BeautifulSoup (Python)


Extracting specific data, such as phone numbers embedded in tel: links, is a common web scraping task. This guide walks through a Python solution that uses the requests library to fetch internal web pages and the BeautifulSoup library to parse them and pull out the phone numbers.

Why Use BeautifulSoup for Web Scraping?

BeautifulSoup is a Python library for parsing HTML and XML documents. It allows you to navigate and extract data from webpages efficiently, making it an excellent choice for web scraping tasks.
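
As a quick illustration of that navigation, the sketch below parses a small HTML string and reads a tag's text and an attribute. The markup and variable names are invented for this example and are not part of the client pages discussed later.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
    <a id="home" href="https://example.com/">Home</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the <title> tag
page_title = soup.title.string
# Text of the paragraph, with the nested <b> tag flattened into plain text
intro_text = soup.find("p", class_="intro").get_text()
# Attributes of a tag are read like dictionary keys
home_href = soup.find(id="home")["href"]

print(page_title)
print(intro_text)
print(home_href)
```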

The Problem

Suppose you have a dataset with customer IDs and want to extract phone numbers stored in web pages as tel: links. Each customer’s phone number is stored on a unique page, accessible via a URL parameter like this:
https://example.com/client/?id=<customer_id>
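
If the IDs might contain characters that are not URL-safe (spaces or ampersands, for example), building the query string with the standard library avoids malformed URLs. A minimal sketch, assuming the URL pattern above; build_client_url and BASE_URL are illustrative names chosen here:

```python
from urllib.parse import urlencode

BASE_URL = "https://example.com/client/"  # placeholder host from the pattern above

def build_client_url(client_id):
    """Return the client page URL with the ID percent-encoded."""
    return f"{BASE_URL}?{urlencode({'id': client_id})}"

print(build_client_url("R_111"))   # https://example.com/client/?id=R_111
print(build_client_url("R 1&2"))   # https://example.com/client/?id=R+1%262
```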

For instance, the HTML of a sample webpage looks like this:

<!doctype html>
<html>
<head>
    <title>id 111</title>
</head>
<body>
    <div>
        <div id="contactButton" class="bg-primary-subtle py-2 px-3 rounded-3 text-primary fw-medium" style="cursor: pointer">
            Contact
        </div>
        <div class="d-flex flex-column position-relative mt-2 d-none" id="contactBlock">
            <div id="phone" class="position-absolute end-0 text-nowrap">
                <a href="tel:+77777777777" class="btn btn-lg btn-outline-primary fw-medium"></a>
            </div>
        </div>
    </div>
</body>
</html>

Your goal is to extract the phone number +77777777777 from this HTML structure and add it to a dataframe.

The Solution

Below is a Python script that extracts phone numbers from such pages using BeautifulSoup.

import requests
from bs4 import BeautifulSoup

def get_client_phone(client_id):
    # URL template for client pages
    url = f"https://example.com/client/?id={client_id}"

    # Make a GET request to fetch the webpage (the timeout prevents hanging on a stalled server)
    response = requests.get(url, timeout=10)
    
    # Check for request success
    if response.status_code != 200:
        print(f"Error: HTTP {response.status_code}")
        return None

    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find the phone container by its ID
    phone_element = soup.find(id='phone')

    # Look for a tel: link inside the container, if the container exists
    phone_link = None
    if phone_element is not None:
        phone_link = phone_element.find('a', href=lambda h: h and h.startswith('tel:'))

    # Covers both a missing container and a container without a tel: link
    if phone_link is None:
        print("Phone number not found.")
        return None

    # Remove the 'tel:' prefix from the link to get the raw phone number
    return phone_link['href'][len('tel:'):]

# Test the function
client_id = 'R_111'
phone_number = get_client_phone(client_id)

if phone_number:
    print(f"Client ID: {client_id}, Phone: {phone_number}")
else:
    print("No phone number retrieved.")
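
The script above only strips the tel: prefix, which is enough for the sample page. Links on real pages are not always that clean: the href may contain spaces, dashes, or parentheses, and the tel URI scheme (RFC 3966) allows parameters such as ;ext=123 after the number. The following is a sketch of a stricter cleanup step, assuming you want only the digits and an optional leading plus sign. normalize_tel_href is a hypothetical helper, not part of the script above.

```python
import re
from urllib.parse import unquote

def normalize_tel_href(href):
    """Turn a tel: href into a compact number such as '+77777777777'.

    Returns None when the value is not a tel: link or contains no digits.
    """
    if not href:
        return None
    href = unquote(href).strip()           # decode escapes such as %20
    if not href.lower().startswith("tel:"):
        return None
    number = href[len("tel:"):]
    number = number.split(";", 1)[0]       # drop parameters such as ;ext=123
    digits = re.sub(r"\D", "", number)     # keep digits only
    if not digits:
        return None
    prefix = "+" if number.strip().startswith("+") else ""
    return prefix + digits

print(normalize_tel_href("tel:+7 (777) 777-77-77"))
```

Whether extensions should be dropped or kept in a separate column depends on how the numbers will be used, so treat the parameter handling here as one option among several.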

Explanation of the Code

  1. URL Construction
    The URL is dynamically generated using the customer ID.
  2. HTTP Request
    The requests.get() function fetches the webpage content. It’s essential to check for a successful status code before proceeding.
  3. HTML Parsing
    The BeautifulSoup object parses the HTML content, enabling easy navigation of the DOM structure.
  4. Extracting the Phone Number
    • Locate the <div> with the id="phone".
    • Find the <a> tag inside this <div> with an href attribute.
    • Strip the tel: prefix from the link to get the raw phone number.
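
The request step can also be wrapped in a small helper that treats network failures and bad status codes the same way. This is one possible sketch rather than the script's actual code: fetch_html is a hypothetical name, the 10-second timeout is an arbitrary choice, and the optional session argument exists so that a requests.Session (or a test double) can be passed in. Note that raise_for_status() only raises for 4xx and 5xx responses, which is slightly more permissive than the status_code != 200 check in the script.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Return the page HTML, or None if the request fails."""
    # Use the given session if there is one, otherwise the module-level requests.get
    http = session if session is not None else requests
    try:
        response = http.get(url, timeout=timeout)
        response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and HTTP error statuses
        print(f"Request failed for {url}: {exc}")
        return None
    return response.text
```

The returned string can be passed straight to BeautifulSoup(html, 'html.parser'), with a None check in between.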

Expected Output

When run against a page like the sample above, the script should print:

Client ID: R_111, Phone: +77777777777
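
To check the parsing logic without any network access, the same extraction can be run against the sample HTML directly. The sketch below uses a CSS selector as an alternative to chained find() calls. extract_phone and SAMPLE_HTML are illustrative names, and the markup is a trimmed copy of the sample page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample page's contact block
SAMPLE_HTML = """
<div class="d-flex flex-column position-relative mt-2 d-none" id="contactBlock">
    <div id="phone" class="position-absolute end-0 text-nowrap">
        <a href="tel:+77777777777" class="btn btn-lg btn-outline-primary fw-medium"></a>
    </div>
</div>
"""

def extract_phone(html):
    """Return the number from the first tel: link inside #phone, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Match only <a> tags under id="phone" whose href starts with "tel:"
    link = soup.select_one('#phone a[href^="tel:"]')
    if link is None:
        return None
    return link["href"][len("tel:"):]

print(extract_phone(SAMPLE_HTML))               # +77777777777
print(extract_phone("<div id='phone'></div>"))  # None
```

The contact block carries the d-none class, so a browser hides it until the button is clicked, but the link is still present in the served HTML and the parser finds it regardless of CSS visibility.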

Creating a Dataframe

To build a dataframe with multiple customer IDs, you can extend the script as follows:

import pandas as pd

# List of customer IDs
client_ids = ['R_111', 'R_112', 'R_113']

# Fetch phone numbers
data = {'ID': [], 'Phone': []}
for client_id in client_ids:
    phone = get_client_phone(client_id)
    data['ID'].append(client_id)
    data['Phone'].append(phone)

# Create a dataframe
df = pd.DataFrame(data)
print(df)
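
The same loop can be written so that the lookup function is passed in as a parameter, which makes it possible to try the dataframe step with a stand-in before hitting the real site. A sketch under that assumption: build_phone_frame and fake_lookup are hypothetical names, and the phone values are made up for the example.

```python
import pandas as pd

def build_phone_frame(client_ids, lookup):
    """Build a dataframe with one row per ID; failed lookups stay as missing values."""
    rows = [{"ID": client_id, "Phone": lookup(client_id)} for client_id in client_ids]
    return pd.DataFrame(rows, columns=["ID", "Phone"])

# Stand-in for get_client_phone, returning made-up data for illustration
def fake_lookup(client_id):
    known = {"R_111": "+77777777777", "R_112": "+77777777778"}
    return known.get(client_id)  # None for unknown IDs

df = build_phone_frame(["R_111", "R_112", "R_113"], fake_lookup)
print(df)
print(df["Phone"].isna().sum(), "ID(s) without a phone number")
```

With the real function, the call would be build_phone_frame(client_ids, get_client_phone). Keeping the numbers as strings preserves the leading plus sign.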

Conclusion

By combining requests for fetching pages with BeautifulSoup for parsing them, you can extract phone numbers from tel: links in a few lines of Python and collect the results in a dataframe. The same pattern (fetch the page, locate an element by ID, read an attribute) can be adapted to scrape other kinds of data beyond phone numbers.

