The built-in Python bisect
library works on a list. We can reload the __getitem__
method of an object to mimic the behaviour of a list so it can be used for bisect
module off the shelf.
Input: a function
that returns True for all n<=n0
(n0
is unknown value we want to find) and False otherwise.
Module
Following class wraps up the function and searches the maximum n
that makes function(n)==True
.
import bisect
class FunctionBisector(object):
'''
Generic List style class for built-in bisect module and any function.
function: A predicate that decides a given number n is True or False.
There is a latent integer n0. When n <= n0, it returns True.
Else, it returns False.
Essentially, this allows a bisection on an array like
[True, True, True, ... False, False]
without actually constructing it.
'''
def __init__(self, function):
self.function = function
def __len__(self):
return self.length
def __getitem__(self, k):
if self.function(k):
return 0
else:
return 1
def search(self, left, right):
self.length = right
first_one_index = bisect.bisect_left(self, 1, left, right)
return first_one_index - 1
Application on scraping
In web scraping project, we can often construct actual URLs from a template with increasing sequence number, such as post ID.
For example, https://www.discuss.com.hk/viewthread.php?tid={n}
is such a template.
Suppose the n's are consecutive. How can we efficiently find the maximum n
that is valid?
We can do this within logarithm time. The idea are two folds:
- Find a loose right bound. We can start from any small
n
and test if the URL is valid. Increasen
by an exponetial sequence like100, 200, 400, 800, ...
. In this way we can reach a number larger than max ID in log time, and use that as theright
- Run a bisection algorithm between
left
andright
. We need to find ann
such thatis_valid_id(n)==True
andis_valid_id(n+1)==False
. This step right fits the module defined in previous section.
Max ID test class
Note that each website may need different definition of URL validity. Even when the HTTP layer returns 200, the page could still be invalid and one needs to check the returned content to determine validity.
To make this logic more generalizable, we define a Website
class that builds on our generic function bisector. Also attached can find a quick test case of the methods.
import requests
class Website():
template = ''
increase_base = 1000
def http_get_by_id(self, n):
url = template.format(n=n)
return requests.get(url)
def is_valid_content(self, text):
# Some website returns 200 with a valid page but shows a human readable
# message to indicate page is invalid. One needs to implement a
# predicate function here.
return True
def is_valid_id(self, n):
r = self.http_get_by_id(n)
if r.status_code != 200:
valid = False
else:
valid = self.is_valid_content(r.text)
print(f'ID {n} is valid: {valid}')
return valid
def search_max_id(self, start_id, stop_id):
# Search the max_id between the range: [start_id, stop_id)
# start_id must be valid
right = start_id
increase = self.increase_base
# Exponentially increase the possible upper bound
while self.is_valid_id(right) and (right + increase) < stop_id:
right += increase
increase *= 2
print(f'Exponentially increase right to {right}')
# Bisection
left = start_id
fb = FunctionBisector(self.is_valid_id)
return fb.search(left, right)
class TestWebsite(Website):
def is_valid_id(self, n):
if n <= 12345:
return True
else:
return False
tw = TestWebsite()
tw.search_max_id(10, 1000000)
Test output:
Exponentially increase right to 1010
Exponentially increase right to 3010
Exponentially increase right to 7010
Exponentially increase right to 15010
12345
Application on HK Discuss
We can derive the following class with content checking:
class DiscussHK(Website):
template = 'https://www.discuss.com.hk/viewthread.php?tid={n}'
def is_valid_content(self, text):
if '找不到內容,請返回。' in text:
return False
else:
return True
discuss_hk = DiscussHK()
print(discuss_hk.is_valid_id(70924275))
print(discuss_hk.is_valid_id(30924275))
Result:
ID 70924275 is valid: False
False
ID 30924275 is valid: True
True
Now we can finally use it to find the max ID:
discuss_hk.search_max_id(30924275, 70924275)
Result:
ID 30924275 is valid: True
Exponentially increase right to 30925275
ID 30925275 is valid: True
(...omitted output)
ID 30987275 is valid: True
Exponentially increase right to 31051275
ID 31051275 is valid: True
Exponentially increase right to 31179275
ID 31179275 is valid: False
ID 31051775 is valid: False
ID 30988025 is valid: True
(...omitted output)
ID 31051772 is valid: True
ID 31051774 is valid: True
31051774
It is easy to verify that 31051774 is a valid ID as of this writing (2023-09-08).
Notes
It is easy to implement the bisection from scratch, but I like to make use of built-in functions. The generic bisection over a function is useful when the possible ID space is large. By using our module, we do not have to pre-compute the range of a function over entire domain.