BIP-0158: allow filters to define values for P and M, reparameterize default filter

commit 1c2ed6dce3
parent 4a85759f02
@@ -65,11 +65,14 @@ For each block, compact filters are derived containing sets of items associated
 with the block (eg. addresses sent to, outpoints spent, etc.). A set of such
 data objects is compressed into a probabilistic structure called a
 ''Golomb-coded set'' (GCS), which matches all items in the set with probability
-1, and matches other items with probability <code>2^(-P)</code> for some integer
-parameter <code>P</code>.
+1, and matches other items with probability <code>2^(-P)</code> for some
+integer parameter <code>P</code>. We also introduce a parameter <code>M</code>
+which allows each filter to tune the range that items are hashed onto
+before compressing. Each defined filter also selects distinct parameters for P
+and M.
 
 At a high level, a GCS is constructed from a set of <code>N</code> items by:
-# hashing all items to 64-bit integers in the range <code>[0, N * 2^P)</code>
+# hashing all items to 64-bit integers in the range <code>[0, N * M)</code>
 # sorting the hashed values in ascending order
 # computing the differences between each value and the previous one
 # writing the differences sequentially, compressed with Golomb-Rice coding
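As an illustration of steps 2 and 3 above, here is a minimal Go sketch that sorts a set of already-hashed values and produces the per-item differences that are later Golomb-Rice coded. The function name and inputs are illustrative only and are not taken from any reference implementation.

<pre>
package main

import (
	"fmt"
	"sort"
)

// deltasFromHashes sorts the hashed set values and returns the difference
// between each value and its predecessor (steps 2 and 3 above). These
// deltas are what is subsequently compressed with Golomb-Rice coding.
func deltasFromHashes(hashedSet []uint64) []uint64 {
	sorted := append([]uint64(nil), hashedSet...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	deltas := make([]uint64, 0, len(sorted))
	var prev uint64
	for _, v := range sorted {
		deltas = append(deltas, v-prev)
		prev = v
	}
	return deltas
}

func main() {
	// Toy values standing in for hashes in [0, N * M).
	fmt.Println(deltasFromHashes([]uint64{42, 7, 100})) // [7 35 58]
}
</pre>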
@@ -80,9 +83,13 @@ The following sections describe each step in greater detail.
 
 The first step in the filter construction is hashing the variable-sized raw
 items in the set to the range <code>[0, F)</code>, where <code>F = N *
-2^P</code>. Set membership queries against the hash outputs will have a false
-positive rate of <code>2^(-P)</code>. To avoid integer overflow, the number of
-items <code>N</code> MUST be <2^32 and <code>P</code> MUST be <=32.
+M</code>. Customarily, <code>M</code> is set to <code>2^P</code>. However, if
+one is able to select both parameters independently, then more optimal values
+can be
+selected<ref>https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc845</ref>.
+Set membership queries against the hash outputs will have a false positive rate
+of <code>2^(-P)</code>. To avoid integer overflow, the
+number of items <code>N</code> MUST be <2^32 and <code>M</code> MUST be <2^32.
 
 The items are first passed through the pseudorandom function ''SipHash'', which
 takes a 128-bit key <code>k</code> and a variable-sized byte vector and produces
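The overflow constraint works because <code>N < 2^32</code> and <code>M < 2^32</code> together guarantee <code>N * M < 2^64</code>, so <code>F</code> always fits in a 64-bit unsigned integer. A hedged Go sketch of that validation (the function name is illustrative, not from any reference implementation):

<pre>
package main

import (
	"errors"
	"fmt"
)

// filterRange checks the BIP's bounds and returns F = N * M. Because both
// inputs are below 2^32, their product is below 2^64 and cannot overflow
// a uint64.
func filterRange(N, M uint64) (uint64, error) {
	const limit = uint64(1) << 32
	if N >= limit {
		return 0, errors.New("too many items: N must be < 2^32")
	}
	if M >= limit {
		return 0, errors.New("invalid parameter: M must be < 2^32")
	}
	return N * M, nil
}

func main() {
	F, err := filterRange(5000, 784931)
	fmt.Println(F, err) // 3924655000 <nil>
}
</pre>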
@@ -104,9 +111,9 @@ result.
 hash_to_range(item: []byte, F: uint64, k: [16]byte) -> uint64:
     return (siphash(k, item) * F) >> 64
 
-hashed_set_construct(raw_items: [][]byte, P: uint, k: [16]byte) -> []uint64:
+hashed_set_construct(raw_items: [][]byte, k: [16]byte, M: uint) -> []uint64:
     let N = len(raw_items)
-    let F = N << P
+    let F = N * M
 
     let set_items = []
 
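For comparison, a Go sketch of the two routines above. The 128-bit product in the fast-range reduction is taken with <code>math/bits.Mul64</code>; since SipHash-2-4 is not in the Go standard library, the keyed hash is passed in as a function value rather than assuming any particular package.

<pre>
package gcs

import "math/bits"

// hashToRange maps a raw item onto [0, F) using the fast-range technique:
// (siphash(k, item) * F) >> 64, i.e. the high 64 bits of the 128-bit product.
func hashToRange(item []byte, F uint64, k [16]byte,
	siphash func(k [16]byte, b []byte) uint64) uint64 {

	hi, _ := bits.Mul64(siphash(k, item), F)
	return hi
}

// hashedSetConstruct hashes every raw item into [0, N * M).
func hashedSetConstruct(rawItems [][]byte, k [16]byte, M uint64,
	siphash func(k [16]byte, b []byte) uint64) []uint64 {

	N := uint64(len(rawItems))
	F := N * M

	setItems := make([]uint64, 0, N)
	for _, item := range rawItems {
		setItems = append(setItems, hashToRange(item, F, k, siphash))
	}
	return setItems
}
</pre>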
@@ -197,8 +204,8 @@ with Golomb-Rice coding. Finally, the bit stream is padded with 0's to the
 nearest byte boundary and serialized to the output byte vector.
 
 <pre>
-construct_gcs(L: [][]byte, P: uint, k: [16]byte) -> []byte:
-    let set_items = hashed_set_construct(L, P, k)
+construct_gcs(L: [][]byte, P: uint, k: [16]byte, M: uint) -> []byte:
+    let set_items = hashed_set_construct(L, k, M)
 
     set_items.sort()
 
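Below is a minimal Go sketch of the compression step inside <code>construct_gcs</code>: each sorted delta is written as a unary quotient followed by <code>P</code> remainder bits, and the final byte is implicitly padded with zero bits. The bit writer is hand-rolled for illustration and is not from any reference implementation.

<pre>
package gcs

// bitWriter accumulates bits most-significant-bit first; nbits tracks how
// many free bit positions remain in the last byte.
type bitWriter struct {
	bytes []byte
	nbits uint
}

func (w *bitWriter) writeBit(bit uint64) {
	if w.nbits == 0 {
		w.bytes = append(w.bytes, 0)
		w.nbits = 8
	}
	w.nbits--
	if bit != 0 {
		w.bytes[len(w.bytes)-1] |= 1 << w.nbits
	}
}

func (w *bitWriter) writeBits(v uint64, n uint) {
	for i := int(n) - 1; i >= 0; i-- {
		w.writeBit((v >> uint(i)) & 1)
	}
}

// golombRiceEncode writes each delta as its quotient in unary (q ones then
// a terminating zero) followed by the P low bits of the remainder.
func golombRiceEncode(deltas []uint64, P uint) []byte {
	w := &bitWriter{}
	for _, d := range deltas {
		q := d >> P
		for ; q > 0; q-- {
			w.writeBit(1)
		}
		w.writeBit(0)
		w.writeBits(d&((1<<P)-1), P)
	}
	// Unused bits in the final byte are already zero, giving the required
	// zero padding to the byte boundary.
	return w.bytes
}
</pre>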
@@ -224,8 +231,8 @@ against the reconstructed values. Note that querying does not require the entire
 decompressed set be held in memory at once.
 
 <pre>
-gcs_match(key: [16]byte, compressed_set: []byte, target: []byte, P: uint, N: uint) -> bool:
-    let F = N << P
+gcs_match(key: [16]byte, compressed_set: []byte, target: []byte, P: uint, N: uint, M: uint) -> bool:
+    let F = N * M
     let target_hash = hash_to_range(target, F, k)
 
     stream = new_bit_stream(compressed_set)
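A corresponding Go sketch of the query side, assuming the bit layout produced by the encoder sketch above. The target hash is taken as already mapped onto <code>[0, N * M)</code> (via <code>hash_to_range</code>); values are decoded and accumulated one at a time, so the full set never has to be held in memory.

<pre>
package gcs

// bitReader consumes a compressed set most-significant-bit first.
type bitReader struct {
	data []byte
	pos  uint // absolute bit position
}

func (r *bitReader) readBit() (uint64, bool) {
	if r.pos >= uint(len(r.data))*8 {
		return 0, false
	}
	bit := uint64(r.data[r.pos/8]>>(7-r.pos%8)) & 1
	r.pos++
	return bit, true
}

func (r *bitReader) readBits(n uint) (uint64, bool) {
	var v uint64
	for i := uint(0); i < n; i++ {
		bit, ok := r.readBit()
		if !ok {
			return 0, false
		}
		v = v<<1 | bit
	}
	return v, true
}

// gcsMatch reports whether targetHash could be a member of the compressed
// set of N values, decoding Golomb-Rice deltas lazily.
func gcsMatch(compressedSet []byte, targetHash uint64, P uint, N uint64) bool {
	r := &bitReader{data: compressedSet}
	var lastValue uint64
	for i := uint64(0); i < N; i++ {
		// Unary-coded quotient: count 1 bits up to the terminating 0.
		var q uint64
		for {
			bit, ok := r.readBit()
			if !ok {
				return false
			}
			if bit == 0 {
				break
			}
			q++
		}
		rem, ok := r.readBits(P)
		if !ok {
			return false
		}
		lastValue += q<<P | rem
		switch {
		case lastValue == targetHash:
			return true
		case lastValue > targetHash:
			return false
		}
	}
	return false
}
</pre>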
@@ -260,6 +267,8 @@ against the decompressed GCS contents. See
 
 This BIP defines one initial filter type:
 * Basic (<code>0x00</code>)
+** <code>M = 784931</code>
+** <code>P = 19</code>
 
 ==== Contents ====
 
@@ -271,24 +280,27 @@ items for each transaction in a block:
 
 ==== Construction ====
 
-Both the basic and extended filter types are constructed as Golomb-coded sets
-with the following parameters.
+The basic type is constructed as Golomb-coded sets with the following
+parameters.
 
-The parameter <code>P</code> MUST be set to <code>20</code>. This value was
-chosen as simulations show that it minimizes the bandwidth utilized, considering
-both the expected number of blocks downloaded due to false positives and the
-size of the filters themselves. The code along with a demo used for the
-parameter tuning can be found
-[https://github.com/Roasbeef/bips/blob/83b83c78e189be898573e0bfe936dd0c9b99ecb9/gcs_light_client/gentestvectors.go here].
+The parameter <code>P</code> MUST be set to <code>19</code>, and the parameter
+<code>M</code> MUST be set to <code>784931</code>. Analysis has shown that if
+one is able to select <code>P</code> and <code>M</code> independently, then
+setting <code>M=1.497137 * 2^P</code> is close to optimal
+<ref>https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc845</ref>.
+
+Empirical analysis also shows that these parameters minimize the
+bandwidth utilized, considering both the expected number of blocks downloaded
+due to false positives and the size of the filters themselves.
 
 The parameter <code>k</code> MUST be set to the first 16 bytes of the hash of
 the block for which the filter is constructed. This ensures the key is
 deterministic while still varying from block to block.
 
 Since the value <code>N</code> is required to decode a GCS, a serialized GCS
-includes it as a prefix, written as a CompactSize. Thus, the complete
-serialization of a filter is:
-* <code>N</code>, encoded as a CompactSize
+includes it as a prefix, written as a <code>CompactSize</code>. Thus, the
+complete serialization of a filter is:
+* <code>N</code>, encoded as a <code>CompactSize</code>
 * The bytes of the compressed filter itself
 
 ==== Signaling ====
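As a worked check of these constants, the Go sketch below verifies that <code>784931</code> is indeed <code>1.497137 * 2^19</code> rounded to the nearest integer, and shows the filter serialization: <code>N</code> as a CompactSize followed by the compressed bytes. The CompactSize encoder is written out by hand here for illustration; a real implementation would typically reuse its wire-encoding library.

<pre>
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"math"
)

const (
	basicP = 19
	basicM = 784931
)

// writeCompactSize encodes v in Bitcoin's CompactSize (variable-length) format.
func writeCompactSize(buf *bytes.Buffer, v uint64) {
	switch {
	case v < 0xfd:
		buf.WriteByte(byte(v))
	case v <= 0xffff:
		buf.WriteByte(0xfd)
		binary.Write(buf, binary.LittleEndian, uint16(v))
	case v <= 0xffffffff:
		buf.WriteByte(0xfe)
		binary.Write(buf, binary.LittleEndian, uint32(v))
	default:
		buf.WriteByte(0xff)
		binary.Write(buf, binary.LittleEndian, v)
	}
}

// serializeFilter prefixes the compressed GCS with N, as the spec requires.
func serializeFilter(n uint64, compressed []byte) []byte {
	var buf bytes.Buffer
	writeCompactSize(&buf, n)
	buf.Write(compressed)
	return buf.Bytes()
}

func main() {
	// M for the Basic filter is the near-optimal value round(1.497137 * 2^P).
	fmt.Println(uint64(math.Round(1.497137*math.Exp2(basicP))) == basicM) // true

	// A filter with N = 300 items: 0xfd + little-endian uint16, then the bytes.
	fmt.Printf("%x\n", serializeFilter(300, []byte{0xde, 0xad})) // fd2c01dead
}
</pre>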
@@ -311,7 +323,8 @@ though it requires implementation of the new filters.
 
 We would like to thank bfd (from the bitcoin-dev mailing list) for bringing the
 basis of this BIP to our attention, Greg Maxwell for pointing us in the
-direction of Golomb-Rice coding and fast range optimization, and Pedro
+direction of Golomb-Rice coding and fast range optimization, Pieter Wuille for
+his analysis of optimal GCS parameters, and Pedro
 Martelletto for writing the initial indexing code for <code>btcd</code>.
 
 We would also like to thank Dave Collins, JJ Jeffrey, and Eric Lombrozo for
@@ -363,8 +376,8 @@ easier to understand.
 === Golomb-Coded Set Multi-Match ===
 
 <pre>
-gcs_match_any(key: [16]byte, compressed_set: []byte, targets: [][]byte, P: uint, N: uint) -> bool:
-    let F = N << P
+gcs_match_any(key: [16]byte, compressed_set: []byte, targets: [][]byte, P: uint, N: uint, M: uint) -> bool:
+    let F = N * M
 
     // Map targets to the same range as the set hashes.
     let target_hashes = []
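The core of the multi-match is a zipper walk over two sorted sequences. The hedged Go sketch below assumes the set values have already been decoded in ascending order (for example, by the decoder sketch earlier) and the targets already hashed into the same range; it advances whichever side is smaller and stops at the first equality.

<pre>
package gcs

import "sort"

// gcsMatchAny reports whether any target hash appears among the decoded,
// ascending set values, walking both sorted lists once.
func gcsMatchAny(setValues, targetHashes []uint64) bool {
	sorted := append([]uint64(nil), targetHashes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	i, j := 0, 0
	for i < len(setValues) && j < len(sorted) {
		switch {
		case setValues[i] == sorted[j]:
			return true
		case setValues[i] < sorted[j]:
			i++ // set value too small, advance the set side
		default:
			j++ // target too small, advance the target side
		}
	}
	return false
}
</pre>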