A generic bisection class for any function

A generic bisection class for any function

Bisection on a monotonic function

·

4 min read

The built-in Python bisect library works on a list. We can reload the __getitem__ method of an object to mimic the behaviour of a list so it can be used for bisect module off the shelf.

Input: a function that returns True for all n<=n0 (n0 is unknown value we want to find) and False otherwise.

Module

Following class wraps up the function and searches the maximum n that makes function(n)==True.

import bisect

class FunctionBisector(object):
  '''
  Generic List style class for built-in bisect module and any function.

  function: A predicate that decides a given number n is True or False.
            There is a latent integer n0. When n <= n0, it returns True.
            Else, it returns False.

  Essentially, this allows a bisection on an array like
  [True, True, True, ... False, False]
  without actually constructing it.
  '''
  def __init__(self, function):
      self.function = function

  def __len__(self):
      return self.length

  def __getitem__(self, k):
    if self.function(k):
      return 0
    else:
      return 1

  def search(self, left, right):
    self.length = right
    first_one_index = bisect.bisect_left(self, 1, left, right)
    return first_one_index - 1

Application on scraping

In web scraping project, we can often construct actual URLs from a template with increasing sequence number, such as post ID.

For example, https://www.discuss.com.hk/viewthread.php?tid={n} is such a template.

Suppose the n's are consecutive. How can we efficiently find the maximum n that is valid?

We can do this within logarithm time. The idea are two folds:

  1. Find a loose right bound. We can start from any small n and test if the URL is valid. Increase n by an exponetial sequence like 100, 200, 400, 800, .... In this way we can reach a number larger than max ID in log time, and use that as the right
  2. Run a bisection algorithm between leftand right . We need to find an n such that is_valid_id(n)==True and is_valid_id(n+1)==False. This step right fits the module defined in previous section.

Max ID test class

Note that each website may need different definition of URL validity. Even when the HTTP layer returns 200, the page could still be invalid and one needs to check the returned content to determine validity.

To make this logic more generalizable, we define a Website class that builds on our generic function bisector. Also attached can find a quick test case of the methods.

import requests

class Website():
  template = ''
  increase_base = 1000

  def http_get_by_id(self, n):
    url = template.format(n=n)
    return requests.get(url)

  def is_valid_content(self, text):
    # Some website returns 200 with a valid page but shows a human readable
    # message to indicate page is invalid. One needs to implement a 
    # predicate function here.
    return True

  def is_valid_id(self, n):
    r = self.http_get_by_id(n)
    if r.status_code != 200:
      valid = False
    else:
      valid = self.is_valid_content(r.text)
    print(f'ID {n} is valid: {valid}')
    return valid

  def search_max_id(self, start_id, stop_id):
    # Search the max_id between the range: [start_id, stop_id)
    # start_id must be valid
    right = start_id
    increase = self.increase_base
    # Exponentially increase the possible upper bound
    while self.is_valid_id(right) and (right + increase) < stop_id:
      right += increase
      increase *= 2
      print(f'Exponentially increase right to {right}')

    # Bisection
    left = start_id
    fb = FunctionBisector(self.is_valid_id)
    return fb.search(left, right)

class TestWebsite(Website):
  def is_valid_id(self, n):
    if n <= 12345:
      return True
    else:
      return False

tw = TestWebsite()
tw.search_max_id(10, 1000000)

Test output:

Exponentially increase right to 1010
Exponentially increase right to 3010
Exponentially increase right to 7010
Exponentially increase right to 15010
12345

Application on HK Discuss

We can derive the following class with content checking:

class DiscussHK(Website):
  template = 'https://www.discuss.com.hk/viewthread.php?tid={n}'

  def is_valid_content(self, text):
    if '找不到內容,請返回。' in text:
      return False
    else:
      return True

discuss_hk = DiscussHK()
print(discuss_hk.is_valid_id(70924275))
print(discuss_hk.is_valid_id(30924275))

Result:

ID 70924275 is valid: False
False
ID 30924275 is valid: True
True

Now we can finally use it to find the max ID:

discuss_hk.search_max_id(30924275, 70924275)

Result:

ID 30924275 is valid: True
Exponentially increase right to 30925275
ID 30925275 is valid: True
(...omitted output)
ID 30987275 is valid: True
Exponentially increase right to 31051275
ID 31051275 is valid: True
Exponentially increase right to 31179275
ID 31179275 is valid: False
ID 31051775 is valid: False
ID 30988025 is valid: True
(...omitted output)
ID 31051772 is valid: True
ID 31051774 is valid: True
31051774

It is easy to verify that 31051774 is a valid ID as of this writing (2023-09-08).

Notes

It is easy to implement the bisection from scratch, but I like to make use of built-in functions. The generic bisection over a function is useful when the possible ID space is large. By using our module, we do not have to pre-compute the range of a function over entire domain.

Did you find this article valuable?

Support HU, Pili by becoming a sponsor. Any amount is appreciated!