Some time ago I happened upon a class of algorithms called
streaming algorithms. They piqued my curiosity greatly, and I implemented
several, initially in C++. I have since started to implement some of them in other languages (currently only python and go). I decided
that I'd next implement a bloom filter in python. A generic bloom filter is incredibly easy to write up, provided you have a few
underlying libraries. It turns out that there is not a standard bit array library for python. I was kind of surprised, since python
has a lot of built in libraries. I had to write one for C++ (ok, I didn't have to, but I'd rather write my own that force
a dependency on something like Boost), and so to get bloom filter working, I'd need one for python as well. I spend about 10 minutes
writing this up, so don't be surprised if there are some lingering issues. If you find one, I'd love to hear about it. I based this off
of the one I wrote in C++, but this is much simpler. The C++ variant allows for each element to be more than 1 bit. You can have elements
of size n-bits, which is nice for some of the other forms of bloom filters (like counting bloom filters). Without further ado, here is the
code:
code:
You can find the rest of the source (like test files and the license) in the repo on github.
import array
from math import log
class BitArray(object):
def __init__(self, bits):
# L = 4 bytes minimum, but could be 8 bytes. No need to waste all that space
# so lets look up the size of L on this machine
self.elem_size = int(log(array.array('L').itemsize * 8, 2))
self.elem_mask = int(pow(2, self.elem_size) - 1)
size = bits >> self.elem_size;
if bits & self.elem_mask:
size += 1
self.bits = array.array('L', (0,) * size)
def set(self, bit):
# first look up the entry in the array
array_index = bit >> self.elem_size
# find position of bit in the individual element
bit_position = bit & self.elem_mask
# generate a mask to apply to the array element
mask = 1 << bit_position
# apply the mask
self.bits[array_index] |= mask
def clear(self, bit):
# first look up the entry in the array
array_index = bit >> self.elem_size
# find position of bit in the individual element
bit_position = bit & self.elem_mask
# generate a mask to apply to the array element,
# but invert it since we are clearing
mask = 1 << bit_position
mask = ~mask
self.bits[array_index] &= mask
def bit(self, bit):
# first look up the entry in the array
array_index = bit >> self.elem_size
# find position of bit in the individual element
bit_position = bit & self.elem_mask
# generate a mask to apply to the array element
mask = 1 << bit_position
return 1 if self.bits[array_index] & mask > 0 else 0
You can find the rest of the source (like test files and the license) in the repo on github.
Bloom Filter
Bit Array
Python