Determine whether the element exists in the collection (using Bloom filter)

Tao song is still the same 2020-11-13 06:21:48
determine element exists collection using

Original published in ：

today , Let's talk Bloom Filter, It is 1970 Year by year Bloom Invented , When determining whether an element exists in a collection , It has a wide range of applications .

In the front of the written interview questions analysis , We introduced bitmap Principle ：

Use bitmap We can judge whether a nonnegative integer exists in the set , We need to 512M To mark all nonnegative integers (4 Byte unsigned integer ), as follows ：

stay bitmap in , If you want to mark n, It's actually at the n individual bit Bit mark 1, there hash Function is actually hash(x)=x, There will never be hash Conflict .

therefore , For a given set of nonnegative integers S for , Traversable S The nonnegative integer in , And mark them one by one bitmap in . For a given nonnegative integer k, Direct access to BitMap[k] Value , According to this value is 0 still 1, You can judge k Does it exist in S in .

However , If the target object is a string , What shall I do? ？  Obviously not hash(x)=x such hash function , Instead, we should transform the string into a non negative integer , as follows ：

You can see , adopt hash function , Strings can also be marked to bitmap in . and , You can choose a reasonable hash function , To control hash The range of values , There is no need to let hash It's worth as much as 4294967295, So as to save space greatly .

therefore , For a given set of strings S for , Traversable S String in , And put the corresponding hash Values are marked to bitmap in . For a given string str, Direct access to hash(str) Corresponding bitmap Value , According to this value is 0 still 1, You can judge str Does it exist in S in .

However , When string sets S When there are more elements in , After mapping to a nonnegative integer , There must be hash Conflict , And it can't be avoided , such as , May appear hash("xyz") and hash("abc") The values of are just 4, So it's impossible to judge "xyz" Does it exist in S in , What shall I do? ？

Since once hash Easy to conflict , Then consider introducing many times hash, The probability of multiple simultaneous conflicts is much lower , This is it. Bloom Filter Thought ：

therefore , For a given set of strings S for , Traversable S String in , And put each string corresponding to hash1、hash2 and hash3 The values of are marked with bitmap in . For a given string str, Direct access to hash1(str)、hash2(str) and hash3(str) Corresponding bitmap value , When this 3 individual bitmap Any one of the values is not 1 when , You know str It must not exist in S in . And when this 3 individual bitmap Values are 1 when , Think str Exist in S in ( Not at all 100% accuracy ).

You can see , Above Bloom Filter There will be a certain rate of miscalculation , The false positive rate can be analyzed . For a given set S for , Just choose the right one hash Number of functions m and bitmap The length of n, You can make Bloom Filter Its accuracy is quite reliable , Specific mathematical analysis can refer to Mr. Wu Jun's 《 The beauty of Mathematics 》.

introduce Bloom Filter after , Can tag strings , And it saves a lot of money bitmap Length and space , Just at the expense of some accuracy . In the related applications of big data , Will often see Bloom Filter The figure of .