Ilya Zakharevich on Sun, 18 Aug 2024 15:17:57 +0200
[PATCH 2.16-2-beta]: about 20x speedup of forprime() in 10^12 .. 2^64
On a contemporary low-midlevel CPU (13th Gen Intel(R) Core(TM) i5-13500T), with

  timeit(s,f) = my(t=gettime(), o=f()); if(#o, o=Str("\t",o)); printf("%s%7.3f%s\n",s,gettime()/1000,o);
  my(c, L=2*10^9); for(k=8,19, c=0; timeit(Strprintf("10^%d\t",k), (()->forprime(p=10^k,10^k+L,c++);c)))

one can get the following timings¹⁾ ²⁾ (of sieving an interval of length 2 billion starting at the given numbers), in seconds:

             2.16-2-beta      |      patched and -p 3.2G       |  found
           default |  -p 370M | no options | ar=900K |  ar=24M |  primes
  —————————————————————————————————————————————————————————————————————————
  10^8       8.045     7.768  |    5.200      4.885     5.016  | 97125071
  10^9       7.828     7.826  |    5.005      4.744     4.828  | 93602003
  10^10      7.503     7.494  |    6.902      6.668     8.150  | 86503340
  10^11      7.281     7.306  |    6.513      6.291     7.979  | 78934825
  10^12      7.915     7.917  |    6.219      5.991     7.812  | 72381054
  10^13    140.761     9.682  |    6.058      5.804     7.713  | 66815381
  10^14    148.895    13.751  |    6.244      5.728     7.633  | 62036118
  10^15    157.631    24.830  |    7.093      6.344     7.587  | 57901748
  10^16    163.787    56.966  |    8.792      7.498     7.588  | 54290341
  10^17    170.536   151.720  |   13.064     10.115     7.672  | 51100445
  10^18    177.302   177.388  |   26.452     17.955     8.401  | 48254877
  10^19    185.912   185.914  |   79.479     48.287    10.218  | 45708757

¹⁾ The overhead of my(c); for(k=1,66000000,c++) is 3.9 sec; throwing in another “c++” increases it to 6.9 sec. In other words: the timings at the start of the table are “essentially 0”, and the timings below 7 should be characterized as “practically free”.

²⁾ Currently, the patch handles forprime() only in the range 2^32 .. 2^64.

Here “default” means stock 2.16.2-beta without any options. In the next column, -p 370000000 is the optimized count of pre-calculated primes (used in sieving) — this is the break-even point between the two possible algorithms used by the stock (buggy) code of forprime(). (I discussed this in my message of a couple of days ago.)
The opposite happens with -p 3200000000: then all rows in the table above correspond to the sieving method of forprime(). (We do not show what happens with this option and the stock 2.16.2-beta: with this parameter, the sqrt-growth observed with -p 370M up to 10^17 just continues to the end of the table.)

For the patched version we show only the results of -p 3200000000: precalculating primes up to 3.2 billion. We show the timings with the default arena size (now 512K — it was set to 32K before due to a bug), as well as with the arena size tuned to the size of the L2 cache³⁾ and/or the L3 cache.

³⁾ lscpu shows the L2 size as 1.3M per core. I found that the arena works slightly better if it assumes a 900K cache. (Code/data caching conflicts???) Although on this CPU the L3 cache is shared between cores, on an idle machine the 24M value works best (of the values above 1.3M).

With the patch, one can tune the arena size by

  install(set_optimize,lLDG);
  set_optimize(5, 24<<20);

SUMMARY: with this patch, the overhead of forprime() in the range 2^32 .. 2^64 is a constant overhead of “1.5 increments (as in c++)” per found prime, plus an overhead of “√(N/10^19) increments” for sieving near N.

EXTRAPOLATION: it is natural to assume that crossing the 2^64 barrier would increase the sqrt()-like overhead about 6 times (3 times due to the 3-step “naive remainder” for 2-word BigInts, and 2 times due to the need to use the “honest 64-bit⁴⁾ %” processor instructions).

⁴⁾ Vs. the 64/32-bit remainder which the patch uses most of the time.

Given that near 10^22 forprime()'s overhead is about 300 “increments”, with this patch the break-even point between the two strategies of doing forprime() moves from 370 million to 150 billion.

CONCLUSION: there is every sense in a framework for storing pre-calculated prime numbers way above 2^32 — this can improve the speed of forprime() by “up to” more than an order of magnitude⁵⁾ in the range 2^64 .. 2^70.
The only practical way I know of is the “primediffs-with-hiccups” scheme — as it was⁶⁾ actually used in PARI for decades.

⁵⁾ This is the expected advantage just above 2^64. It would gradually decrease due to the sqrt() overhead (the competing method has almost constant overhead).

⁶⁾ For unfathomable (to me) reasons, this functionality was removed from PARI.

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜

I will comment on this patch in a separate email. (It is a draft only.)

Hope this helps,
Ilya

--- src/language/forprime.c-ini	2024-07-09 15:42:01.000000000 -0700
+++ src/language/forprime.c	2024-08-18 02:27:53.952923105 -0700
@@ -173,9 +173,10 @@ typedef struct {
   ulong arena;
   double power;
   double cutoff;
+  ulong bigarena;
 } cache_model_t;
 
-static cache_model_t cache_model = { CACHE_ARENA, CACHE_ALPHA, CACHE_CUTOFF };
+static cache_model_t cache_model = { CACHE_ARENA, CACHE_ALPHA, CACHE_CUTOFF, 0 };
 
 /* Assume that some calculation requires a chunk of memory to be
    accessed often in more or less random fashion (as in sieving).
@@ -282,7 +283,8 @@ good_arena_size(ulong slow2_size, ulong
 /* Use as in
     install(set_optimize,lLDG)	\\ Through some M too?
     set_optimize(2,1)		\\ disable dependence on limit
-    \\ 1: how much cache usable, 2: slowdown of setup, 3: alpha, 4: cutoff
+    \\ 1: how much cache usable, 2: slowdown of setup, 3: alpha, 4: cutoff,
+    \\ 5: cache size (typically whole L2 or L3) in bytes to use in forprime()
     \\ 2,3,4 are in units of 0.001
 
     { time_primes_arena(ar,limit) =	\\ ar = arena size in K
@@ -312,6 +314,9 @@ set_optimize(long what, GEN g)
     case 4:
       ret = (long)(cache_model.cutoff * 1000);
       break;
+    case 5:
+      ret = (long)(cache_model.bigarena);
+      break;
     default:
       pari_err_BUG("set_optimize");
       break;
@@ -324,25 +329,46 @@ set_optimize(long what, GEN g)
       case 2: slow2_in_roots = (double)val / 1000.; break;
       case 3: cache_model.power = (double)val / 1000.; break;
       case 4: cache_model.cutoff = (double)val / 1000.; break;
+      case 5: cache_model.bigarena = val; break;
     }
   }
   return ret;
 }
 
+#if 1
+#define rem_half(a,b) \
+__extension__ ({ ulong __arg1 = (a); unsigned int __value, __hiremainder, __arg2 = (b); \
+    __asm__ ("divl %k4" \
+	   : "=a" /* %eax */ (__value), "=d" /* %edx */ (__hiremainder) \
+	   : "ka" /* %eax */ (__arg1), "kd" /* %edx */ (__arg1>>32), "kmr" (__arg2)); \
+    __hiremainder; \
+}) /* call only if the return must fit into 32 bits!!! */
+# define HAVE_rem_half
+#endif	/* !( 1 ) */
+
 /* s is odd; prime (starting from 3 = known_primes[2]), terminated by a 0 byte.
  * Checks n odd numbers starting at 'start', setting bytes to 0 (composite)
  * or 1 (prime), starting at data */
 static void
 sieve_chunk(pari_prime *known_primes, ulong s, byteptr data, ulong n)
 {
-  ulong p, cnt = n-1, start = s;
+  ulong p, cnt = n-1, start = s, swtch;
   pari_prime *q;
 
   memset(data, 0, n);
   start >>= 1; /* (start - 1)/2 */
   start += n; /* Corresponds to the end */
 
   /* data corresponds to start, q runs over primediffs */
+#ifdef HAVE_rem_half
+// printf(" \tst=%lu\tn=%lu\n", start, n); fflush(stdout);
+  swtch = (2*start)/((((unsigned long)1)<<33)-1); /* For p less than swtch, the operation %
+                                                   * taken below has quotient <2^32. */
+// printf("sw=%lu\tst=%lu\tn=%lu\n", swtch, start, n); fflush(stdout);
+
+  for (q = known_primes + 1, p = 3; p && p < swtch; p = *++q)
+#else
   for (q = known_primes + 1, p = 3; p; p = *++q)
+#endif
   { /* first odd number >= start > p and divisible by p
        = last odd number <= start + 2p - 2 and 0 (mod p)
        = p + last number <= start + p - 2 and 0 (mod 2p)
@@ -351,6 +377,15 @@ sieve_chunk(pari_prime *known_primes, ul
     long off = cnt - ((start+(p>>1)) % p);
     while (off >= 0) { data[off] = 1; off -= p; }
   }
+#ifdef HAVE_rem_half
+// printf(" \tp=%lu\n", p); fflush(stdout);
+  for (; p; p = *++q)
+  {
+    long off = cnt - rem_half((start+(p>>1)), p);
+    while (off >= 0) { data[off] = 1; off -= p; }
+  }
+// printf(" \tp=%lu\n", p); fflush(stdout);
+#endif
 }
 
 static void
@@ -404,6 +439,8 @@ initprimes0(ulong maxnum, long *lenp, pa
    * thus we do not include it in fixed_to_cache */
   asize = good_arena_size((ulong)(rootnum * slow2_in_roots), remains+1, 0,
                           &cache_model) - 1;
+// printf("...Arena size=%lu", asize);
+
   /* enough room on the stack ? */
   alloced = (((byteptr)avma) <= ((byteptr)bot) + asize);
   p = (byteptr)(alloced? pari_malloc(asize+1): stack_malloc(asize+1));
@@ -493,7 +530,7 @@ optimize_chunk(ulong a, ulong b)
 {
   /* TODO: Optimize size (surely < 512k to stay in L2 cache, but not so large
    * as to force recalculating too often). */
-  ulong chunk = 0x80000UL;
+  ulong chunk = (cache_model.bigarena ? cache_model.bigarena : 0x80000UL)<<4; /* bigarena is in bytes, we use bits, and only odds */
   ulong tmp = (b - a) / chunk + 1;
 
   if (tmp == 1)
@@ -724,6 +761,11 @@ static void
 sieve_block(ulong a, ulong b, ulong maxpos, unsigned char* sieve)
 {
   ulong i, lim = usqrt(b), sz = (b-a) >> 1;
+#ifdef HAVE_rem_half
+  ulong swtch = a>>32; /* If p > swtch, then a/p < 2^32 */
+// printf(" a=%lu\tb=%lu\tsw=%lu\n", a, b, swtch);
+#endif /* defined HAVE_rem_half */
+
   (void)memset(sieve, 0, maxpos+1);
   for (i = 2;; i++)
   { /* p is odd */
@@ -731,6 +773,11 @@ sieve_block(ulong a, ulong b, ulong maxp
     if (p > lim) break;
 
     /* solve a + 2k = 0 (mod p) */
+#ifdef HAVE_rem_half
+    if (p > swtch)
+      r = rem_half(a, p);
+    else
+#endif /* defined HAVE_rem_half */
     r = a % p;
     if (r == 0)
       k = 0;