A Practical Approach to Web Crawling (Python)

I’ve recently had several opportunities to build crawlers through my freelance work. While learning about and building these scrapers (albeit simple ones), I came up with a simple framework (or approach) for building your very own crawler. This article outlines and explores some of these concepts. Hope it helps! 🙂

Outline

  • Dependencies
  • The 2 Types of Crawling
  • Exception handling
  • Optimization
  • Benchmarking
  • Additional Considerations

Dependencies used

Packages used in this article —

import pandas as pd # data formatting/reading (reading .xlsx also needs the openpyxl package)
import os # operating system module
import requests # enables HTTP requests
from requests import exceptions
from bs4 import BeautifulSoup as soup # parsing library to help navigate HTML
import re # regular expression library
import xlsxwriter # for writing to xlsx file type
import math
import queue
import traceback # prints stack traces in the exception handlers
from threading import Thread

Types of Scraping — ‘Blind’ & Targeted

While working on these crawlers, I noticed that they tended to fall into 2 broad categories: Blind & Targeted. The following sections explore each type in further detail with some working code.

Type 1: Blind

  • Starting point — base URL of a website
  • Example use case: Retrieve all internal links from a given website (or, conversely, all external links)

As you may have guessed, Blind scraping refers to the process whereby the scraper is allowed to navigate through the website on its own and retrieve data independently, without any input beyond the starting URL. We’ll be running this on the lipsum website, the lorem ipsum generator.

pages = [] # module-level list that keeps track of every page found so far
base_url = 'http://www.lipsum.com/'
pages.append(base_url)

def get_path(url_dirty):
    # strip the trailing '/#a' anchor so the same page is not recorded twice
    suffix = url_dirty
    if suffix.endswith('/#a'):
        suffix = suffix.removesuffix('/#a') # str.removesuffix needs Python 3.9+
    return suffix if len(suffix) != 0 else None

def get_all_internal_links(url):
    webpage = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    content_soup = soup(webpage.content, 'html.parser')
    # retrieve all '<a>' (anchor) tags whose href mentions lipsum.com
    for tag in content_soup.find_all('a', href=re.compile(r'lipsum\.com')):
        # some filtering to remove non-URLs and duplicates
        link = get_path(tag['href'])
        if link is not None and link not in pages:
            pages.append(link)
            # the sample output below suggests these hrefs are already absolute URLs,
            # so follow them as-is instead of appending them to base_url
            get_all_internal_links(link)
    return pages

pages = get_all_internal_links(base_url)
print("Total internal links: {}".format(len(pages))) # 42
print(pages) # ['http://www.lipsum.com/', 'http://hy.lipsum.com/', 'http://sq.lipsum.com/', 'http://ar.lipsum.com/', ...]

Code breakdown

get_all_internal_links: Makes the request and parses the webpage at the given URL. It then loops through each link it finds and checks whether that link has already been recorded. If it has not, the link is added to pages and the function calls itself recursively on it, until all internal URLs have been identified.
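For larger sites, deep recursion can run into Python's recursion limit, and checking membership in a list gets slower as the list grows. The following is a minimal sketch of the same idea written as a loop, not the code used for the results above. The names crawl_iteratively and get_links are hypothetical; get_links stands in for "fetch the page and return the links found on it", and a small made-up dictionary plays the role of the website so the example runs without network access.

```python
from collections import deque

def crawl_iteratively(start_url, get_links):
    """Breadth-first crawl. get_links(url) must return the URLs found on that page."""
    seen = {start_url}             # set gives fast "already found?" checks
    order = [start_url]            # keeps the order in which pages were discovered
    to_visit = deque([start_url])  # pages whose links still need to be read
    while to_visit:
        current = to_visit.popleft()
        for link in get_links(current):
            if link not in seen:
                seen.add(link)
                order.append(link)
                to_visit.append(link)
    return order

# made-up site map standing in for real HTTP requests
fake_site = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['d'], 'd': []}
print(crawl_iteratively('a', lambda url: fake_site.get(url, [])))  # ['a', 'b', 'c', 'd']
```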

get_path: cleans and returns ‘dirty’ URLs which might lead to duplication if not handled. For example, a URL ending in '/#a' points to the same page as the same URL without that suffix.
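A more general way to guard against this kind of duplication is to normalise every link before comparing it. The sketch below is one possible approach using the standard library's urllib.parse, not the code used for the results above; the helper name normalize_url is hypothetical.

```python
from urllib.parse import urldefrag, urljoin

def normalize_url(base_url, href):
    """Return an absolute URL without its '#fragment', or None for an empty href."""
    if not href:
        return None
    absolute = urljoin(base_url, href)        # resolves relative links against base_url
    cleaned, _fragment = urldefrag(absolute)  # drops everything from '#' onwards
    return cleaned or None

print(normalize_url('http://www.lipsum.com/', 'http://www.lipsum.com/#a'))  # http://www.lipsum.com/
print(normalize_url('http://www.lipsum.com/', '/feed/html'))                # http://www.lipsum.com/feed/html
```

Because the fragment is removed for any anchor name, this covers more cases than checking for one specific suffix.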

Type 2: Targeted

In my experience, clients usually approach me with a specific list of items for which they need detailed information. Take the following scenario:

  • Jacob has a list of 50,000 electronic part numbers and wishes to extract the manufacturer, description and lead time of each part from a given website.

As you can imagine, if this list were 10 or even 100 items long, manually visiting the page for each component could work, but this approach is not practical for longer lists (much less 50k). This is where the ‘Targeted’ approach comes in! From here on, we will use Jacob’s situation to illustrate the different steps involved in this process.

There are 3 key steps in this approach:

  1. Extracting search terms from input file
  2. Identifying patterns in URL(s)
  3. Writing to an Output file

Extracting Information from an Input File

The following code snippet illustrates how you might extract the search terms (i.e. part numbers) from an xlsx/txt file.

# extract search terms from file
def get_search_terms_from_file(file_path):
    file_extension = os.path.splitext(file_path)[1]
    search_terms = []
    if file_extension == '.xlsx': # xlsx file detected
        try:
            data = pd.read_excel(file_path, engine='openpyxl')
            df = pd.DataFrame(data, columns=['Search Terms']) # column header
            for idx, search_term in df.iterrows():
                search_terms.append(search_term['Search Terms'])
        except FileNotFoundError as e:
            # invalid input file
            return None
    elif file_extension == '.txt': # text file detected
        try:
            with open(file_path, "r") as f: # closes the file automatically
                search_terms = f.read().split(",") # comma separated terms
        except FileNotFoundError as e:
            # invalid input file
            return None
    else:
        # unsupported file type
        return None
    return search_terms
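Splitting a text file on commas can leave stray spaces, newlines and empty entries in the list, and the same part number may appear more than once. A small clean-up step along these lines can help before any requests are made. This is a sketch, and the helper name clean_search_terms is hypothetical.

```python
def clean_search_terms(raw_terms):
    """Strip whitespace, drop empty entries and remove duplicates, keeping the original order."""
    cleaned = []
    seen = set()
    for term in raw_terms:
        term = str(term).strip()
        if term and term not in seen:
            seen.add(term)
            cleaned.append(term)
    return cleaned

raw = "7BB-12-9, 490-7709-ND,\n7BB-12-9,".split(",")
print(clean_search_terms(raw))  # ['7BB-12-9', '490-7709-ND']
```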

Finding patterns in URL(s)

Now that we have read our input file into a list of search terms, our next step is to identify URL patterns. If the website you’re crawling was designed with any form of structure, it should follow a standard URL naming convention. The pattern might not be immediately obvious, but if you’re willing to spend some time exploring, you never know what you might find. Let’s say Jacob wishes to extract the part information from the following website:

https://www.digikey.sg/

(The above website belongs to a company which distributes electronic components.)

With a little exploring around, you might find that a Product page’s URL adopts the following naming convention:

https://www.digikey.sg/product-detail/en/<manufacturer_name>/<manufacturer_part_number>/<digikey_part_number>/<digikey_part_id>

Sample Product Page

https://www.digikey.sg/product-detail/en/murata-electronics/7BB-12-9/490-7709-ND/4358149

Where —

  • manufacturer_name = murata-electronics
  • manufacturer_part_number = 7BB-12-9
  • digikey_part_number = 490-7709-ND
  • digikey_part_id = 4358149

But what happens if you only have the manufacturer’s part number to work with? How are we supposed to programmatically construct the correct URL? Fret not. This is where it’s important to spend time tinkering with the URL, to try to find a workable pattern. You might have found that the following URLs all routed to the same page listed above —

#1 Replaced manufacturer_part_number & digikey_part_number with random values

https://www.digikey.sg/product-detail/en/murata-electronics/1/1/4358149

#2 Removed a path segment

https://www.digikey.sg/product-detail/en/murata-electronics/random/4358149

#3 Removed digikey_part_id segment

https://www.digikey.sg/product-detail/en/murata-electronics/7BB-12-9/490-7709-ND

#4 Found through search bar

https://www.digikey.sg/products/en?keywords=7BB-12-9

As you can see above, #1-#3 still require an identifier unique to Digikey (i.e. digikey_part_number/digikey_part_id). However, #4 shows us that we are able to retrieve the product page with only the manufacturer_part_number. Awesome! This is exactly what we needed: we can now write some code to loop through the list of part numbers, making requests directly to the product page.

*Before we move on, it is important to note that a pattern might not always be usable at the start. Sometimes you’ll need to exercise some creativity to retrieve additional information from the website (perhaps through another crawling process) before being able to make use of these patterns.

# search_terms => list of part numbers
# headers => list of header names used when filtering scraped data & writing to file
# e.g. ['Part Number','Manufacturer','Lead time','Description']
def get_part_information(headers, search_terms):
    part_information_dict = {}
    base_url = 'https://www.digikey.sg/products/en?keywords='
    for search_term in search_terms:
        # make request to url
        webpage = requests.get(base_url + search_term, headers={'User-Agent': 'Mozilla/5.0'})
        # parse page
        content_soup = soup(webpage.content, 'html.parser')
        table = content_soup.find('table', id='product-overview')
        table_rows = table.find_all('tr')
        part_information = clean_table_info(headers, table_rows)
        # add to dict
        part_information_dict[search_term] = part_information
    return part_information_dict

# extract relevant information based on headers provided
def clean_table_info(headers, rows):
    cleaned_info = {}
    # initialize default value
    for header in headers:
        cleaned_info[header] = "-"
    for row in rows:
        header = row.find('th').text
        if header in headers:
            row_info = row.find('td').text.strip()
            cleaned_info[header] = row_info
    return cleaned_info
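The code above builds each request URL with base_url + search_term, which works as long as a part number contains only URL-safe characters. If a part number contains spaces, slashes or '#', it is safer to encode it first. A minimal sketch using the standard library follows; the names SEARCH_URL and build_search_url are hypothetical, and the second part number is made up to show the encoding.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://www.digikey.sg/products/en'

def build_search_url(part_number):
    """Return the search URL for one part number, with the query string properly encoded."""
    return SEARCH_URL + '?' + urlencode({'keywords': part_number})

print(build_search_url('7BB-12-9'))     # https://www.digikey.sg/products/en?keywords=7BB-12-9
print(build_search_url('AB 12/3#x'))    # https://www.digikey.sg/products/en?keywords=AB+12%2F3%23x
```

The requests library can do the same encoding if the term is passed through the params argument of requests.get, e.g. params={'keywords': search_term}.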

Writing to an Output File

The following is a sample of how to export your data into an xlsx file type.

def write_to_file(part_information_dict, headers):
    # create the workbook and a worksheet to write into
    workbook = xlsxwriter.Workbook('output_file.xlsx')
    worksheet = workbook.add_worksheet()

    # worksheet.write params => (row, col, value)
    # header row
    for idx, val in enumerate(headers):
        worksheet.write(0, idx, val)
    # one row per search term, starting below the header row
    for idx, val in enumerate(part_information_dict):
        rn = idx+1
        col = 0
        for header in headers:
            worksheet.write(rn, col, part_information_dict[val][header])
            col+=1
    workbook.close()
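If a plain CSV file is acceptable instead of xlsx, the standard library's csv module can write the same dictionary without any third-party package. This is an alternative sketch, not a replacement for the function above; write_to_csv is a hypothetical name and the sample row is made up.

```python
import csv
import io

def write_to_csv(part_information_dict, headers, stream):
    """Write one header row, then one row per search term, to an open text stream."""
    # extrasaction='ignore' skips any keys that are not listed in headers
    writer = csv.DictWriter(stream, fieldnames=headers, extrasaction='ignore')
    writer.writeheader()
    for part_information in part_information_dict.values():
        writer.writerow(part_information)

# made-up sample data; io.StringIO stands in for a real file
sample = {'7BB-12-9': {'Manufacturer': 'Murata Electronics', 'Description': '-'}}
buffer = io.StringIO()
write_to_csv(sample, ['Manufacturer', 'Description'], buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with open('output_file.csv', 'w', newline='') in place of the StringIO buffer.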

Exceptions

Now that we’ve covered the crawling process, it’s time to turn our attention to the more mundane side of things — Exception handling. While it might not sound all that exciting, it is essential in building a robust web scraper. The last thing you want happening is to go to bed with your crawler running only to wake up the next morning to find out that your code has error-ed out with all the previously crawled data lost.

As you may have realised, the code covered above does not handle exceptions all that well, making it susceptible to exiting before completion. I’ve filled in the additional code and necessary exception handling below:

def get_part_information(headers, search_terms):
    part_information_dict = {}
    retry_q = [] # holds failed search terms
    base_url = 'https://www.digikey.sg/products/en?keywords='
    for search_term in search_terms:
        try:
            # make request to url
            webpage = requests.get(base_url + search_term, headers={'User-Agent': 'Mozilla/5.0'})
            webpage.raise_for_status() # throws exception for error code 4xx/5xx
            # parse page
            content_soup = soup(webpage.content, 'html.parser')
            table = content_soup.find('table', id='product-overview')
            table_rows = table.find_all('tr')
            part_information = clean_table_info(headers, table_rows)
            # add to dict
            part_information_dict[search_term] = part_information
        except requests.exceptions.RequestException as e: # exception from requests
            # add in error fallback logic
            retry_q.append(search_term)
            print(e)
        except AttributeError as e: # e.g. the expected table was not found on the page
            retry_q.append(search_term)
            traceback.print_exc() # prints stack trace (needs 'import traceback')
    return (part_information_dict, retry_q)

The above shows you how you can go about handling potential exceptions. Whenever an exception occurs, it is caught by the handler and the search term which triggered it is added to a ‘retry queue’, which holds all items whose information could not be retrieved due to an exception. From here you could implement additional retry logic (not handled in code).
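One possible shape for that retry logic is sketched below: each failed term is tried again for a limited number of rounds, with a growing pause between rounds. This is an illustration under assumptions, not part of the original crawler. retry_failed_terms is a hypothetical name, fetch stands in for whatever function retrieves the information for one search term, and the broad except should be narrowed to the exceptions your fetch function can actually raise.

```python
import time

def retry_failed_terms(retry_q, fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Re-run fetch(term) for each failed term; returns (recovered results, terms still failing)."""
    recovered = {}
    pending = list(retry_q)
    for attempt in range(max_attempts):
        if not pending:
            break
        still_failing = []
        for term in pending:
            try:
                recovered[term] = fetch(term)
            except Exception:  # assumption: narrow this in a real crawler
                still_failing.append(term)
        pending = still_failing
        if pending and attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # pause 1s, then 2s, ... before the next round
    return recovered, pending
```

Terms that are still failing after the last round are returned separately, so they can be written to a file and inspected by hand.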

**Do note that the sample code provided does not cover edge cases for the specified website, so if you are looking to crawl this website in particular, you will need to make some modifications to cater to them!

Optimization

As you have probably realised, all the crawling we have handled above has been done synchronously — which means to say that each part number was processed one at a time, waiting for the previous one to complete (or throw an exception) before moving onto the next. While this might be acceptable for smaller sets of data (<1000), it certainly isn’t ideal for larger sets. With each URL taking up to 2 seconds to process, 50,000 would take a whole 28 hours to complete! That’s way too long if you ask me.
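The 28-hour figure is simple arithmetic: 50,000 items × 2 seconds = 100,000 seconds, or roughly 27.8 hours. The small helper below reproduces it; estimated_hours is a hypothetical name, and the workers parameter assumes the work divides perfectly across workers, which real crawls rarely achieve.

```python
def estimated_hours(item_count, seconds_per_item, workers=1):
    """Rough run-time estimate in hours, assuming every item takes the same time."""
    return item_count * seconds_per_item / workers / 3600

print(round(estimated_hours(50_000, 2), 1))  # 27.8
```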

Enter Multithreading, a way to parallelise processing. I’m not going to explain multithreading here, as there are many existing resources which adequately cover it. The goal of multithreading here is to divide our search terms into chunks and handle the chunks concurrently, one thread per chunk. Putting it all together, we get something like this -

# will handle chunks given to it
def get_part_information_concurrent(headers, chunk, result_queue, retry_queue):
    base_url = 'https://www.digikey.sg/products/en?keywords='
    for search_term in chunk:
        try:
            # make request to url
            webpage = requests.get(base_url + search_term, headers={'User-Agent': 'Mozilla/5.0'})
            webpage.raise_for_status() # throws exception for error code 4xx/5xx
            # parse page
            content_soup = soup(webpage.content, 'html.parser')
            table = content_soup.find('table', id='product-overview')
            table_rows = table.find_all('tr')
            part_information = clean_table_info(headers, table_rows)
            # adds retrieved information into queue
            result_queue.put(part_information)
        except requests.exceptions.RequestException as e:
            # add in error fallback logic
            retry_queue.put(search_term)
            print(e)
        except AttributeError as e:
            retry_queue.put(search_term)
            traceback.print_exc() # needs 'import traceback'

def thread_handler(headers, search_terms, chunk_size):
    # initialize workbook up front so the finally block can always close it
    workbook = xlsxwriter.Workbook('output_file.xlsx')
    worksheet = workbook.add_worksheet()
    try:
        # chunk size: number of search terms each thread will handle
        thread_count = math.ceil(len(search_terms)/chunk_size)
        threads_store = []
        # store part information
        q = queue.Queue()
        # fallback logic not implemented
        retry_q = queue.Queue()
        for segment in range(1, thread_count+1):
            # calculates start & end index for each segment
            start = (segment-1) * chunk_size
            end = segment * chunk_size
            # chunk - represents array of search_terms to be handled by thread
            chunk = search_terms[start:end]
            thread = Thread(target=get_part_information_concurrent, args=(headers, chunk, q, retry_q))
            thread.start()
            threads_store.append(thread)
        # wait for threads to complete
        for thread in threads_store:
            thread.join()
        # establish row headers
        # worksheet.write params => (row, col, value)
        for idx, val in enumerate(headers):
            worksheet.write(0, idx, val)
        # poll queue & write to file
        rn = 1
        while not q.empty():
            part_information = q.get()
            for col, header in enumerate(headers):
                worksheet.write(rn, col, part_information[header])
            rn += 1
    except Exception as e:
        print(e)
    finally:
        # close workbook to ensure crawled data is not lost on error
        workbook.close()
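Note that thread_handler starts one thread per chunk, so 50,000 search terms with a chunk size of 20 would start 2,500 threads at once. One way to cap the number of threads is the standard library's ThreadPoolExecutor. The sketch below is an alternative under assumptions, not the code that was benchmarked; crawl_with_pool is a hypothetical name and fetch stands in for a function that retrieves the information for one search term.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_with_pool(search_terms, fetch, max_workers=8):
    """Run fetch(term) for every term on at most max_workers threads."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map each submitted job back to the search term it belongs to
        futures = {pool.submit(fetch, term): term for term in search_terms}
        for future in as_completed(futures):
            term = futures[future]
            try:
                results[term] = future.result()  # re-raises any exception from fetch
            except Exception:  # assumption: narrow this in a real crawler
                failed.append(term)
    return results, failed
```

The pool hands a new term to a thread as soon as that thread is free, so there is no need to pick a chunk size by hand.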

Benchmarking

To get a rough sense of how much faster this approach is, we are going to measure the time taken by both approaches with 100 search terms over 3 cycles, with chunk size = 20 (performance will vary with chunk size).

Synchronous approach — 100s | Asynchronous approach — 21s

As you can see, a multithreaded approach has greatly improved the overall run time of the scraper. You can play around with the chunk size to better optimize your programme! But do note that there will be a limit to how much you can optimize your crawler based on the hardware you are running it from.
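The timing code itself is not shown in this article. One simple way to take such measurements is sketched below with time.perf_counter; average_runtime is a hypothetical helper, and the lambda in the usage comment assumes the functions and variables defined earlier.

```python
import time

def average_runtime(func, cycles=3):
    """Call func() the given number of times and return the mean duration in seconds."""
    durations = []
    for _ in range(cycles):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)

# usage (assumes headers and search_terms are already defined):
# average_runtime(lambda: thread_handler(headers, search_terms, 20))
```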

Additional Considerations

  • Ethical Scraping: While multithreading will help speed up your scraping process, it will definitely put a larger load on the targeted site. Please do consider the frequency of your requests and exercise some consideration! Your IP might also get blacklisted if you are indiscriminate with your scraping activity.
  • Multi-Worker Nodes: What we’ve discussed above is a ‘single node’ approach, where our crawler runs on only one system (our own). However, as the complexity of the use case grows, you might need to adopt other approaches to maintain, or even improve, the reliability and speed of your crawler. A Multi-Worker node approach can help with that.
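On the first point, one simple way to limit request frequency across threads is a shared throttle that each thread calls before making a request. The class below is a minimal sketch, not part of the original crawler; Throttle is a hypothetical name, and a suitable interval depends on the site you are crawling.

```python
import threading
import time

class Throttle:
    """Allow at most one request every min_interval seconds across all threads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self):
        """Block until this caller's turn and return how long it had to wait."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # reserve the next slot so that other threads queue up behind this one
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            self._sleep(delay)
        return delay

# usage: create one shared instance, e.g. throttle = Throttle(0.5),
# then call throttle.wait() right before each requests.get(...)
```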

There you have it, a simple practical guide on web crawling, hopefully some of you have made it this far. If you’ve enjoyed this article or have found it useful to you, I would appreciate if you left a tasty clap cause that helps the algo ;) Feel free to ask any questions, or leave suggestions 😊
