mirror of https://github.com/bitcoin/bips.git synced 2026-05-11 16:51:51 +00:00

script restoration: fix MUL cost to round B up to a word boundary.

Julian points out that the implementation does this, which improves accuracy
for small B, because the term is multiplied. For normal OP_ADD etc. we don't
bother, since the difference is tightly bounded.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell
2026-03-29 14:33:12 +10:30
parent 977342a943
commit 32035058b4

@@ -9,7 +9,7 @@
 Assigned: ?
 License: BSD-3-Clause
 Discussion: https://groups.google.com/g/bitcoindev/c/GisTcPb8Jco/m/8znWcWwKAQAJ
-Version: 0.1.0
+Version: 0.2.1
 Requires: Varops BIP
 </pre>
@@ -624,6 +624,7 @@ Work in progress:
 ==Changelog==
+* 0.2.1: 2023-03-27: fix OP_MUL cost to round length(B) up
 * 0.2.0: 2025-02-21: change costs to match those in varops budget
 * 0.1.0: 2025-09-27: first public posting
@@ -659,18 +660,22 @@ using multiple instructions).
 For multiplication, the steps break down like so:
 # Allocate and zero the result: cost = (length(A) + length(B)) * 2 (ZEROING)
 # For each word in A:
-#* Multiply by each word in B, into a scratch vector: cost = 6 * length(B) (ARITH)
-#* Sum scratch vector at the word offset into the result: cost = 6 * length(B) (ARITH)
+#* Multiply by each word in B, into a scratch vector: cost = 6 * ((length(B) + 7) / 8) * 8 (ARITH)
+#* Sum scratch vector at the word offset into the result: cost = 6 * ((length(B) + 7) / 8) * 8 (ARITH)
+We increase the length of B here to the next word boundary, using
+"((length(B) + 7) / 8) * 8", as the multiplication below makes the
+difference of that from the simple "length(B)" significant.
 Note: we do not assume Karatsuba, Toom-Cook or other optimizations.
-The theoretical cost is: (length(A) + length(B)) * 2 + (length(A) + 7) / 8 * length(B) * 12.
+The theoretical cost is: (length(A) + length(B)) * 2 + (length(A) + 7) / 8 * ((length(B) + 7) / 8) * 8 * 12.
 However, benchmarking reveals that the inner loop overhead (branch
 misprediction, cache effects on small elements) is undercosted by the
 theoretical model. A 2.25× multiplier on the quadratic term accounts for
 this, giving a cost of: (length(A) + length(B)) * 3 + (length(A) + 7) / 8 *
-length(B) * 27.
+((length(B) + 7) / 8) * 8 * 27.
 This is slightly asymmetric: in practice an implementation usually finds that
 CPU pipelining means choosing B as the larger operand is optimal.