On Mon, Jun 05, 2023 at 03:47:11PM +0200, Loïc Grenié wrote:
> I've pushed a "loic-parsum" to pari git. It does not change the number of
> threads (I still think it's not optimal right now), however it addresses most
> of the problems I've illustrated before, should not hurt performance, as far
> as I can tell, and passes the tests involving parsum (export, parallel,
> programming).
>
> The drawback is that it exports one more function (+1 line in paripriv, and
> +7 non-empty lines in src/functions/programming/parsum) and substitutes
> two functions by a longer one (+14 non-empty lines in src/language/eval.c).
>
> Could you consider it for inclusion, eventually modified?
Sure. Do you have some tests where it makes a difference ?
My gp here is configured as follows:
parisizemax = 2000003072, primelimit = 500000, nbthreads = 16
? default(threadsizemax,)
%1 = 1000000000
With a7bed2c7d7 (patch not applied):
? parsum(a=1,1000000000,a)
cpu time = 1min, 1,997 ms, real time = 23,720 ms.
%1 = 500000000500000000
? parsum(a=1,10000,print(a);setrand(a);matrix(10^3,2*10^2,i,j,random(100)))
(lots of "increasing stack size")
cpu time = 4min, 17,486 ms, real time = 47,195 ms.
%2 = cancelled
? parsum(a=1,1000,setrand(a);matrix(10^4,10^3,i,j,random(100)))
(lots of "increasing stack size")
*** at top-level: parsum(a=1,1000,print(a);setrand(a);matrix(10^
*** ^----------------------------------------------
*** parsum: the thread stack overflows !
current stack size: 1000000000 (953.674 Mbytes)
[hint] you can increase 'threadsizemax' using default()
*** Break loop: type 'break' to go back to GP prompt
With 54a9a89f9f (patch applied):
? parsum(a=1,1000000000,a)
cpu time = 1min, 54,142 ms, real time = 7,633 ms.
%1 = 500000000500000000
? parsum(a=1,10000,setrand(a);matrix(10^3,2*10^2,i,j,random(100)))
cpu time = 8min, 7,098 ms, real time = 33,973 ms.
%2 = cancelled
? parsum(a=1,1000,setrand(a);matrix(10^4,10^3,i,j,random(100)))
(lots of "increasing stack size")
cpu time = 53min, 8,210 ms, real time = 3min, 45,107 ms.
%3 = cancelled
The first sum has lots of easy elements. The copying and central summing
take most of the time, and ultimately prevent parallelism on the unpatched
pari (fewer than 3 threads are used on average).
The second has a reasonable number of moderately large objects. The parallelism
is not very good on the unpatched pari (fewer than 5.5 threads are used on average).
The third has objects that are too large.
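The thread figures can be cross-checked against the timings, assuming they are
estimated as cpu time divided by real time (that reading is an assumption, the
measurement method is not stated here). A small sketch in gp, with the
timings copied from the sessions above and converted to seconds:

```gp
\\ average number of busy threads, estimated as cpu time / real time (seconds)
\\ the "about" values are rounded to two decimals
(1*60 + 1.997) / 23.720             \\ unpatched, first sum:  about 2.61
(4*60 + 17.486) / 47.195            \\ unpatched, second sum: about 5.46
(1*60 + 54.142) / 7.633             \\ patched, first sum:    about 14.95
(8*60 + 7.098) / 33.973             \\ patched, second sum:   about 14.34
(53*60 + 8.210) / (3*60 + 45.107)   \\ patched, third sum:    about 14.16
```

Under that assumption, the patched runs keep roughly 14 to 15 of the 16
configured threads busy. The unpatched third sum has no timing because it
stopped on the stack overflow.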
In each case, I can do better with parfor. However, since parsum exists, I expect
it to perform reasonably well in reasonable situations. Clearly, if all the objects
I sum fall in a single thread, parallelism will be bad, but that is not the
situation I'm presenting here: all the threads are roughly equally loaded and
the objects are not awful, being either very simple or relatively large.
I still think that the number of threads should also be modified for a small
number of objects: if I want to compute
parsum(a=1,20,bnfinit(x^(a+50)-2).reg), I fail to understand why I should do
it using vecsum(parvector), even though it works.
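
For concreteness, the two workarounds mentioned above (parfor and
vecsum(parvector)) would look roughly as follows for that last example. This
is only a sketch: the accumulator name s is chosen for illustration, and no
timings are claimed for either form.

```gp
\\ vecsum(parvector): compute the 20 regulators in parallel, then add them up
vecsum(parvector(20, a, bnfinit(x^(a+50)-2).reg))

\\ parfor: compute each regulator in a worker thread; the last expression
\\ (s += r) is evaluated sequentially in the main thread for each result r
s = 0; parfor(a = 1, 20, bnfinit(x^(a+50)-2).reg, r, s += r); s
```

Both forms do the summation in the main thread, so they avoid whatever
splitting parsum does internally, at the cost of being less direct to write.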
Best,
Loïc