[nasm:nasm-2.15.xx] phash: bloat the hashes somewhat, reducing the likelihood of false positives

nasm-bot for H. Peter Anvin (Intel) hpa at zytor.com
Mon Jul 27 13:27:03 PDT 2020


Commit-ID:  671f204ed1c0ba4d7834124f0cadd5cf792ffde6
Gitweb:     http://repo.or.cz/w/nasm.git?a=commitdiff;h=671f204ed1c0ba4d7834124f0cadd5cf792ffde6
Author:     H. Peter Anvin (Intel) <hpa at zytor.com>
AuthorDate: Thu, 9 Jul 2020 23:39:58 -0700
Committer:  H. Peter Anvin (Intel) <hpa at zytor.com>
CommitDate: Mon, 27 Jul 2020 13:24:59 -0700

phash: bloat the hashes somewhat, reducing the likelihood of false positives

Set the hash size scaling constant to 1.6, signifying 3.2 times the
hash load. This both reduces the convergence time and makes it less
likely (< 25%) that a non-entry will require a secondary comparison,
and after all, in most of our use cases non-entries are by far the
more common.

Signed-off-by: H. Peter Anvin (Intel) <hpa at zytor.com>


---
 asm/pptok.pl     | 2 +-
 perllib/phash.ph | 6 ++++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/asm/pptok.pl b/asm/pptok.pl
index 5498cb46..de1fac22 100755
--- a/asm/pptok.pl
+++ b/asm/pptok.pl
@@ -223,7 +223,7 @@ if ($what eq 'c') {
     # Put a large value in unused slots.  This makes it extremely unlikely
     # that any combination that involves unused slot will pass the range test.
     # This speeds up rejection of unrecognized tokens, i.e. identifiers.
-    print OUT "#define UNUSED_HASH_ENTRY (65535/3)\n";
+    print OUT "\n#define UNUSED_HASH_ENTRY (65535/3)\n";
 
     print OUT "    static const int16_t hash1[$n] = {\n";
     for ($i = 0; $i < $n; $i++) {
diff --git a/perllib/phash.ph b/perllib/phash.ph
index 644e13c7..133a28b0 100644
--- a/perllib/phash.ph
+++ b/perllib/phash.ph
@@ -145,8 +145,10 @@ sub gen_perfect_hash($) {
 
     # Minimal power of 2 value for N with enough wiggle room.
     # The scaling constant must be larger than 0.5 in order for the
-    # algorithm to ever terminate.
-    my $room = int(scalar(@keys)*0.8);
+    # algorithm to ever terminate. The higher the scaling constant,
+    # the more space does the hash take up, but the less likely is it
+    # that an invalid token will require a string comparison.
+    my $room = int(scalar(@keys)*1.6);
     $n = 1;
     while ($n < $room) {
 	$n <<= 1;


More information about the Nasm-commits mailing list