How to Extract Phone Numbers from Internal Web Pages Using BeautifulSoup
In web scraping, extracting specific data such as phone numbers embedded in webpage links is a common task. This guide walks you through a Python-based solution that uses the BeautifulSoup library to scrape and extract phone numbers from internal web pages.
Why Use BeautifulSoup for Web Scraping?
BeautifulSoup is a Python library for parsing HTML and XML documents. It allows you to navigate and extract data from webpages efficiently, making it an excellent choice for web scraping tasks.
The Problem
Suppose you have a dataset of customer IDs and want to extract phone numbers that are stored in web pages as tel: links. Each customer’s phone number is on its own page, reachable through a URL parameter like this: https://example.com/client/?id=<customer_id>
For instance, the HTML of a sample webpage looks like this:
<!doctype html>
<html>
  <head>
    <title>id 111</title>
  </head>
  <body>
    <div>
      <div id="contactButton" class="bg-primary-subtle py-2 px-3 rounded-3 text-primary fw-medium" style="cursor: pointer">
        Contact
      </div>
      <div class="d-flex flex-column position-relative mt-2 d-none" id="contactBlock">
        <div id="phone" class="position-absolute end-0 text-nowrap">
          <a href="tel:+77777777777" class="btn btn-lg btn-outline-primary fw-medium"></a>
        </div>
      </div>
    </div>
  </body>
</html>
Your goal is to extract the phone number +77777777777 from this HTML structure and add it to a dataframe.
The Solution
Below is a Python script that extracts phone numbers from such pages using BeautifulSoup.
import requests
from bs4 import BeautifulSoup


def get_client_phone(client_id):
    # URL template for client pages
    url = f"https://example.com/client/?id={client_id}"
    # Make a GET request to fetch the webpage (the timeout avoids hanging forever)
    response = requests.get(url, timeout=10)
    # Check for request success
    if response.status_code != 200:
        print(f"Error: HTTP {response.status_code}")
        return None
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the phone container by its ID
    phone_element = soup.find(id='phone')
    if phone_element:
        # Extract the phone link within the container
        phone_link = phone_element.find('a', href=True)
        if phone_link:
            # Remove 'tel:' from the link to get the raw phone number
            phone_number = phone_link['href'].replace('tel:', '')
            return phone_number
    # Reached when the container or the link inside it is missing
    print("Phone number not found.")
    return None


# Test the function
client_id = 'R_111'
phone_number = get_client_phone(client_id)
if phone_number:
    print(f"Client ID: {client_id}, Phone: {phone_number}")
else:
    print("No phone number retrieved.")
Explanation of the Code
- URL Construction
  The URL is generated dynamically from the customer ID.
- HTTP Request
  The requests.get() function fetches the webpage content. It’s essential to check for a successful status code before proceeding.
- HTML Parsing
  The BeautifulSoup object parses the HTML content, enabling easy navigation of the DOM structure.
- Extracting the Phone Number
  - Locate the <div> with id="phone".
  - Find the <a> tag inside this <div> that has an href attribute.
  - Strip the tel: prefix from the link to get the raw phone number.
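The extraction steps above can be tried without any network access by parsing the sample HTML directly. The following is a minimal sketch rather than the script's own method: the helper name extract_phone is hypothetical, and a single select_one call with a CSS selector stands in for the two find calls.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample page from this guide
SAMPLE_HTML = """
<div id="contactBlock">
  <div id="phone">
    <a href="tel:+77777777777" class="btn"></a>
  </div>
</div>
"""


def extract_phone(html):
    """Return the number from the first tel: link inside #phone, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # One CSS selector covers both lookup steps: the #phone container,
    # then an <a> whose href starts with "tel:"
    link = soup.select_one('#phone a[href^="tel:"]')
    if link is None:
        return None
    # Drop only the leading "tel:" scheme and keep the rest untouched
    return link["href"][len("tel:"):]


print(extract_phone(SAMPLE_HTML))
print(extract_phone("<p>no phone here</p>"))
```

Separating the parsing step from the HTTP request like this makes it easy to check the selector against saved pages before running it on live URLs.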
Expected Output
When executed, the script should print:
Client ID: R_111, Phone: +77777777777
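This output assumes the href holds a bare number. Real tel: links sometimes carry visual separators or parameters such as an extension, so a slightly more defensive cleanup may be worth considering. The helper below is a hypothetical sketch: the name normalize_tel and the exact cleanup rules are assumptions, not part of the original script.

```python
import re


def normalize_tel(href):
    """Turn a tel: href into a compact number, or None if it is not a tel: link."""
    if not href.lower().startswith("tel:"):
        return None
    number = href[len("tel:"):]
    # Discard parameters such as ";ext=12" that may follow the number
    number = number.split(";", 1)[0]
    # Remove visual separators: whitespace, hyphens, dots, and parentheses
    number = re.sub(r"[\s\-.()]", "", number)
    return number or None


print(normalize_tel("tel:+77777777777"))
print(normalize_tel("tel:+7 (777) 777-77-77;ext=12"))
print(normalize_tel("mailto:someone@example.com"))
```

Unlike a plain replace('tel:', ''), this only strips the prefix at the start of the string and rejects links that use a different scheme.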
Creating a Dataframe
To build a dataframe with multiple customer IDs, you can extend the script as follows:
import pandas as pd

# List of customer IDs
client_ids = ['R_111', 'R_112', 'R_113']
# Fetch phone numbers (uses get_client_phone from the script above)
data = {'ID': [], 'Phone': []}
for client_id in client_ids:
    phone = get_client_phone(client_id)
    data['ID'].append(client_id)
    data['Phone'].append(phone)
# Create a dataframe
df = pd.DataFrame(data)
print(df)
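Some lookups will fail and return None, and the table-building step is easier to test when it does not depend on the network. A minimal sketch under those assumptions follows; build_phone_table and fake_fetch are hypothetical names, and in real use you would pass get_client_phone as the fetch argument.

```python
import pandas as pd


def build_phone_table(client_ids, fetch):
    """Build an ID/Phone dataframe, calling fetch(client_id) for each ID."""
    rows = [{"ID": cid, "Phone": fetch(cid)} for cid in client_ids]
    return pd.DataFrame(rows, columns=["ID", "Phone"])


# Stand-in for get_client_phone so the example runs offline (made-up data)
def fake_fetch(client_id):
    known = {"R_111": "+77777777777"}
    return known.get(client_id)


df = build_phone_table(["R_111", "R_112"], fake_fetch)
print(df)
# Count the IDs whose lookup returned nothing
print(df["Phone"].isna().sum())
```

Keeping failed lookups as missing values, rather than dropping them, makes it clear which IDs may need a retry.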
Conclusion
By combining Python’s BeautifulSoup library with robust HTML parsing techniques, you can efficiently extract data from webpages. This approach can be adapted to scrape various types of data beyond phone numbers, opening the door to a wide range of web scraping applications.