Ilya Zakharevich on Fri, 30 Aug 2024 07:36:18 +0200
Re: [PATCH 2.16-2-beta]: about 20x speedup of forprime() in 10^12 .. 2^64 (try III)
============ REMARK ============
The main body of this message was already sent to this list, but all
previous attempts were lost.  The differences w.r.t. the first version:
  • gives a link to a conjectural 5x speedup of forprime-via-nextprime;
  • emphasizes that <<4 corrects a bug.
================================

On Sun, Aug 18, 2024 at 06:17:51AM -0700, Ilya Zakharevich wrote:

I will comment on this patch in a separate email.

> timeit(s,f) = my(t=gettime(), o=f()); if(#o, o=Str("\t",o)); printf("%s%7.3f%s\n",s,gettime()/1000,o);
> my(c, L=2*10^9); for(k=8,19, c=0; timeit(Strprintf("10^%d\t",k), (()->forprime(p=10^k,10^k+L,c++);c)))
>
> one can get the following timings¹⁾ ²⁾ (of sieving an interval of
> length 2 billion starting at the given numbers) in seconds:
>
>          2.16-2-beta        |    patched and -p 3.2G        |  found
>        default | -p 370M    | no options | ar=900K | ar=24M |  primes
> —————————————————————————————————————————————————————————————————————
> 10^8     8.045    7.768     |   5.200      4.885     5.016  | 97125071
> 10^9     7.828    7.826     |   5.005      4.744     4.828  | 93602003
> 10^10    7.503    7.494     |   6.902      6.668     8.150  | 86503340
> 10^11    7.281    7.306     |   6.513      6.291     7.979  | 78934825
> 10^12    7.915    7.917     |   6.219      5.991     7.812  | 72381054
> 10^13  140.761    9.682     |   6.058      5.804     7.713  | 66815381
> 10^14  148.895   13.751     |   6.244      5.728     7.633  | 62036118
> 10^15  157.631   24.830     |   7.093      6.344     7.587  | 57901748
> 10^16  163.787   56.966     |   8.792      7.498     7.588  | 54290341
> 10^17  170.536  151.720     |  13.064     10.115     7.672  | 51100445
> 10^18  177.302  177.388     |  26.452     17.955     8.401  | 48254877
> 10^19  185.912  185.914     |  79.479     48.287    10.218  | 45708757

¹⁾ This is a part of a large data-collection effort.
   Done separately for various ranges of the arena size “ar”, like

     forstep(d=20,40,5, my(s=(1+2001\/d)*10^6/16); set_optimize(5,s);
       print("\n\n\t",s); my(c,L=2*10^9);
       for(k=8,19, c=0;
         timeit(Strprintf("10^%d\t",k), (()->forprime(p=10^k,10^k+L,c++);c))))

• The speed for ranges up to 10^12 does not depend much on the arena
  size — even down to 32K arenas.  (It seems that the processor’s
  predictive auto-prefetching is working fine!)
• The speed is quite consistent if one does not change “the
  environment”.  However, relocations of PARI’s stack can lead to
  variations as large as 15%.
• The arenas are allocated on PARI’s stack.  So with larger arenas,
  one needs to increase the stack.  (I do not know why, but it seems
  that an extra 25% overhead w.r.t. “ar” is required; so for a 24M
  arena I need allocatemem(30*10^6).)

> Here “default” means stock 2.16.2-beta without any argument.  In the
> next column, -p 370000000 is the optimized value of the count of used
> pre-calculated primes (used in sieving).

— This is the break-even point, the same as the “crossover point”,

> between two possible algorithms used by forprime() with the stock
> (buggy) code of forprime().  (I did discuss it in my message of a
> couple days ago.)

> The opposite happens with -p 3200000000: then all the rows in the
> table above correspond to the sieving method used by forprime().

The two methods used by PARI are “sieve-by-a-chunk-of-the-given-range”
and “slightly optimized²⁾ nextprime()”.

²⁾ This means pre-sieving using primes ≤ 7 (i.e., mod 210).  Earlier
   this year, I wrote that (my conjecture is that) pre-sieving using
   primes ≤ 10⁶ would lead³⁾ to a 5x speedup:
     http://megrez.math.u-bordeaux.fr/archives/pari-dev-2401/msg00049.html

³⁾ Such a speedup would decrease the crossover point 25x down.  Still,
   sieving is going to be ∼3.5 times quicker above 2^64, and the
   advantage would persist up to 2^67.

> (It is a draft only.)

This means:
• The “#if 1” chunk should go into the assembler part of the kernel.
• The printf()s below are commented.
  In the final version, they should be removed.

> --- src/language/forprime.c-ini	2024-07-09 15:42:01.000000000 -0700
> +++ src/language/forprime.c	2024-08-18 02:27:53.952923105 -0700

…………

> +#if 1
> +#define rem_half(a,b) \
> +__extension__ ({ ulong __arg1 = (a); unsigned int __value, __hiremainder, __arg2 = (b); \
> +  __asm__ ("divl %k4" \
> +	 : "=a" /* %eax */ (__value), "=d" /* %edx */ (__hiremainder) \
> +	 : "ka" /* %eax */ (__arg1), "kd" /* %edx */ (__arg1>>32), "kmr" (__arg2)); \
> +  __hiremainder; \
> +}) /* call only if the return must fit into 32 bits!!! */
> +# define HAVE_rem_half
> +#endif /* !( 1 ) */

> /* s is odd; prime (starting from 3 = known_primes[2]), terminated by a 0 byte.
>  * Checks n odd numbers starting at 'start', setting bytes to 0 (composite)
>  * or 1 (prime), starting at data */

This (not mine!) comment is completely bonkers.  So are many comments
in the bodies of the related functions.  They mention bytes (both
instead of words and of bits) and primediff; the meanings of bytes
0 and 1 are inverted.  Moreover, it seems that some code assumes that
prime=3 is at offset 2, and some that it is at offset 3.

> @@ -493,7 +530,7 @@ optimize_chunk(ulong a, ulong b)
>  {
>    /* TODO: Optimize size (surely < 512k to stay in L2 cache, but not so large
>     * as to force recalculating too often). */
> -  ulong chunk = 0x80000UL;
> +  ulong chunk = (cache_model.bigarena ? cache_model.bigarena : 0x80000UL)<<4; /* bigarena is in bytes, we use bits, and only odds */
>    ulong tmp = (b - a) / chunk + 1;
>
>    if (tmp == 1)

This would better be reimplemented by making 0x800000UL the default
value for “bigarena”.

!!! Moreover: PAY ATTENTION to the added “<<4”.  Fixing THIS bug gives
!!! the major part of the speedup of the patch I discuss.

Hope this helps,
Ilya