Ilya Zakharevich on Fri, 30 Aug 2024 07:36:18 +0200
Re: [PATCH 2.16-2-beta]: about 20x speedup of forprime() in 10^12 .. 2^64 (try III)
============ REMARK ============
The main body of this message was already sent to this list, but all
previous attempts were lost.  The differences w.r.t. the first version:
  • gives a link to a conjectural 5x speedup of forprime-via-nextprime;
  • emphasizes that <<4 corrects a bug.
================================

On Sun, Aug 18, 2024 at 06:17:51AM -0700, Ilya Zakharevich wrote:

I will comment on this patch in a separate email.

> timeit(s,f) = my(t=gettime(), o=f()); if(#o, o=Str("\t",o)); printf("%s%7.3f%s\n",s,gettime()/1000,o);
> my(c, L=2*10^9); for(k=8,19, c=0; timeit(Strprintf("10^%d\t",k), (()->forprime(p=10^k,10^k+L,c++);c)))
>
> one can get the following timings¹⁾ ²⁾ (of sieving an interval of
> length 2 billion starting at the given numbers) in seconds:
>
>          2.16-2-beta        |    patched and -p 3.2G        |  found
>        default | -p 370M    | no options | ar=900K | ar=24M |  primes
> —————————————————————————————————————————————————————————————————————
> 10^8     8.045    7.768     |   5.200      4.885     5.016  | 97125071
> 10^9     7.828    7.826     |   5.005      4.744     4.828  | 93602003
> 10^10    7.503    7.494     |   6.902      6.668     8.150  | 86503340
> 10^11    7.281    7.306     |   6.513      6.291     7.979  | 78934825
> 10^12    7.915    7.917     |   6.219      5.991     7.812  | 72381054
> 10^13  140.761    9.682     |   6.058      5.804     7.713  | 66815381
> 10^14  148.895   13.751     |   6.244      5.728     7.633  | 62036118
> 10^15  157.631   24.830     |   7.093      6.344     7.587  | 57901748
> 10^16  163.787   56.966     |   8.792      7.498     7.588  | 54290341
> 10^17  170.536  151.720     |  13.064     10.115     7.672  | 51100445
> 10^18  177.302  177.388     |  26.452     17.955     8.401  | 48254877
> 10^19  185.912  185.914     |  79.479     48.287    10.218  | 45708757

¹⁾ This is a part of a large data-collection effort.
   Done separately for various ranges of the arena size “ar”, like

     forstep(d=20,40,5, my(s=(1+2001\/d)*10^6/16); set_optimize(5,s);
       print("\n\n\t",s); my(c,L=2*10^9);
       for(k=8,19, c=0;
         timeit(Strprintf("10^%d\t",k), (()->forprime(p=10^k,10^k+L,c++);c))))

• The speed for ranges up to 10^12 does not depend much on the arena
  size — even down to 32K arenas.  (It seems that the processor’s
  predictive auto-prefetching is working fine!)
• The speed is quite consistent if one does not change “the
  environment”.  However, relocations of PARI’s stack can lead to
  variations as large as 15%.
• The arenas are allocated on PARI’s stack.  So with larger arenas,
  one needs to increase the stack.  (I do not know why, but it seems
  that an extra 25% overhead w.r.t. “ar” is required; so for a 24M
  arena I need allocatemem(30*10^6).)

> Here “default” means stock 2.16.2-beta without any argument.  In the
> next column, -p 370000000 is the optimized value of the count of used
> pre-calculated primes (used in sieving).

— This is the break-even point, the same as the “crossover point”,

> between two possible algorithms used by forprime() with the stock
> (buggy) code of forprime().  (I did discuss it in my message of a
> couple days ago.)

> The opposite happens with -p 3200000000: then all the rows in the
> table above correspond to the sieving method used by forprime().

The two methods used by PARI are “sieve-by-a-chunk-of-the-given-range”
and “slightly optimized²⁾ nextprime()”.

²⁾ This means pre-sieving using primes ≤ 7 (i.e., mod 210).  Earlier
   this year, I wrote that (my conjecture is that) pre-sieving using
   primes ≤ 10⁶ would lead³⁾ to a 5x speedup:
     http://megrez.math.u-bordeaux.fr/archives/pari-dev-2401/msg00049.html

³⁾ Such a speedup would decrease the crossover point 25x down.  Still,
   sieving is going to be ∼3.5 times quicker above 2^64, and the
   advantage would persist up to 2^67.

> (It is a draft only.)

This means:
• The “#if 1” chunk should go into the assembler part of the kernel.
• The printf()s below are commented.
  In the final version, they should be removed.

> --- src/language/forprime.c-ini	2024-07-09 15:42:01.000000000 -0700
> +++ src/language/forprime.c	2024-08-18 02:27:53.952923105 -0700

…………

> +#if 1
> +#define rem_half(a,b) \
> +__extension__ ({ ulong __arg1 = (a); unsigned int __value, __hiremainder, __arg2 = (b); \
> +  __asm__ ("divl %k4" \
> +	 : "=a" /* %eax */ (__value), "=d" /* %edx */ (__hiremainder) \
> +	 : "ka" /* %eax */ (__arg1), "kd" /* %edx */ (__arg1>>32), "kmr" (__arg2)); \
> +  __hiremainder; \
> +}) /* call only if the return must fit into 32 bits!!! */
> +# define HAVE_rem_half
> +#endif /* !( 1 ) */

> /* s is odd; prime (starting from 3 = known_primes[2]), terminated by a 0 byte.
>  * Checks n odd numbers starting at 'start', setting bytes to 0 (composite)
>  * or 1 (prime), starting at data */

This (not mine!) comment is completely bonkers.  So are many comments
in the bodies of the related functions.  They mention bytes (both
instead of words and of bits) and primediff; the meanings of bytes
0 and 1 are inverted.  Moreover, it seems that some code assumes that
prime=3 is at offset 2, and some that it is at offset 3.

> @@ -493,7 +530,7 @@ optimize_chunk(ulong a, ulong b)
>  {
>    /* TODO: Optimize size (surely < 512k to stay in L2 cache, but not so large
>     * as to force recalculating too often). */
> -  ulong chunk = 0x80000UL;
> +  ulong chunk = (cache_model.bigarena ? cache_model.bigarena : 0x80000UL)<<4; /* bigarena is in bytes, we use bits, and only odds */
>    ulong tmp = (b - a) / chunk + 1;
>
>    if (tmp == 1)

This would better be reimplemented by making 0x800000UL the default
value for “bigarena”.

!!! Moreover: PAY ATTENTION to the added “<<4”.  Fixing THIS bug gives
!!! the major part of the speedup of the patch I discuss.

Hope this helps,
Ilya