David Cleaver on Sat, 31 Jan 2026 01:11:45 +0100



Question about parallel execution...


Hello,

I've created a function, find_n, that performs calculations over a range of inputs [a,b].

I've adapted the code to use parfor, so that:
- it sets the PARI/GP default nbthreads to num_threads
- it runs num_threads copies of find_n
- each copy of find_n uses forstep with step num_threads, so it only works on its own share of the inputs in [a,b]
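The split done by the forstep loop can be checked on its own. Below is a sketch using a hypothetical helper, work_per_instance (not part of the original code), that assumes a rough cost model: one input k costs about (k-2)*#digits(k), i.e. the m = 2..k-1 inner loop times the number of digits handled per iteration. It reports the estimated work for each instance:

```gp
/* Hypothetical helper (not from the original code): rough work estimate for
   each of the 'threads' instances. Instance h takes
   k = a+h-1, a+h-1+threads, ... <= b, exactly as the forstep loop in find_n.
   ASSUMED cost model: k costs (k-2) * #digits(k), i.e. the m = 2..k-1 loop
   times the number of digits processed per iteration. */
work_per_instance(threads, a, b) = {
  my(W = vector(threads));
  for (h = 1, threads,
    forstep (k = a + h - 1, b, threads,
      W[h] += (k - 2) * #digits(k)));
  W
}

/* Usage: ratio between the busiest and the least busy instance */
W = work_per_instance(10, 10, 1000);
print("max/min estimated work: ", vecmax(W) * 1. / vecmin(W));
```

Because consecutive k values are dealt out round-robin, the estimates should come out close to each other, which would suggest that load imbalance alone is unlikely to explain the timings (under the assumed cost model).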

The function produces the expected output, whether run single-threaded or multi-threaded.

However, the runtime (both CPU time and real time) of the parallel code does not behave the way I would expect.

For example, on a computer with 56 physical cores (two 28-core processors), I get these runtimes:

  instances of find_n   real time
   1                    about 8.9 s
   2                    about 9.2 s
   5                    about 6.0 s
  10                    about 7.5 s
  20                    about 7.5 s
  30                    about 9.4 s
  40                    about 9.8 s
  50                    about 9.5 s

I expected that splitting the task into N parts would cut the runtime to roughly 1/N of the single-instance time, but the measured runtimes don't show that.
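To put numbers on that expectation: with T_1 = 8.9 s for one instance, ideal scaling would give about 0.89 s for N = 10. The speedup S_N and parallel efficiency E_N below are computed from the rounded real times listed above:

```latex
\begin{aligned}
T_N^{\text{ideal}} &\approx \frac{T_1}{N}, \qquad S_N = \frac{T_1}{T_N}, \qquad E_N = \frac{S_N}{N} \\
N = 5:&\quad S_5 = \frac{8.9}{6.0} \approx 1.48, \qquad E_5 \approx 0.30 \\
N = 10:&\quad S_{10} = \frac{8.9}{7.5} \approx 1.19, \qquad E_{10} \approx 0.12 \\
N = 50:&\quad S_{50} = \frac{8.9}{9.5} \approx 0.94, \qquad E_{50} \approx 0.02
\end{aligned}
```

An S_N below 1, as for N = 50, means the parallel run is slower than the single instance.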

The CPU load shows N cores at 100% for the entire run, yet the overall runtime doesn't improve.

I've read that the multithreading interface has some overhead, but I hoped that launching only nbthreads instances (one large task per thread) would mitigate it.

I'm seeing the same behavior on two different 64-bit Windows computers, one with 24 cores and the other with 56 cores.

I've tried Pari64-2-15-4-pthread.exe, Pari64-2-17-1-pthread.exe, and gppthread64-2-17-3.exe; all three show the same behavior.

Am I doing something wrong in the code?

Could there be a function I'm using that is not multi-thread friendly?

I've written a small example function that works like the larger function I'm actually developing, and I've included its output and runtimes below.

Could someone review this code and help me understand why I'm not getting a speedup with a large number of cores?

Thanks for any help or insights you can provide!

-David C.


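/* instance h of 'threads' instances: tests k = a+h-1, a+h-1+threads, ... up to b */
find_n(h, threads, a, b) = {
  my(j,k,m,x,y,z);
  forstep(k = a + h - 1, b, threads,
    x = digits(k);                           /* decimal digits of k */
    z = vector(#x);
    for(m = 2, k-1,
      for(j = 1, #x, z[j] = x[j]*m);         /* multiply each digit by m */
      y = "";
      for(j = 1, #z, y = concat(y, z[j]));   /* join the products as a string */
      y = eval(y); /* convert the string to a number */
      y = (y\1000000)%10000;                 /* keep the 4 digits just above the lowest 6 */
      if(y == 9999, print("Thread ",h," found ",k," and ",m," : ",y));
    );
  );
}

export(find_n)  /* make find_n visible to the parallel worker threads */

find_n_in_range(num_threads, a, b) = {
  /* set num threads here */
  default(nbthreads, num_threads);
  /* one parfor iteration per thread; iteration i handles every num_threads-th k */
  parfor(i = 1, num_threads, find_n(i, num_threads, a, b));
}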
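One way to test whether the string round trip (concat() followed by eval()) is the part that doesn't scale would be to build the same number with integer arithmetic instead. A sketch with a hypothetical helper, concat_digits (not from the original code), assuming every entry of z is a nonnegative integer, as it is in find_n:

```gp
/* Hypothetical helper (not from the original code): returns the integer whose
   decimal string is the concatenation of the decimal strings of z[1], z[2], ...,
   i.e. the same value as building y with concat() and calling eval(y),
   using only integer arithmetic. ASSUMES every z[j] is a nonnegative integer. */
concat_digits(z) = {
  my(y = 0, nd);
  for (j = 1, #z,
    /* digits(0) is [], but concat() writes "0", so count 0 as one digit */
    nd = if (z[j], #digits(z[j]), 1);
    y = y * 10^nd + z[j]);
  y
}
```

In find_n, the lines from y = "" through y = eval(y) would then collapse to y = concat_digits(z); and concat_digits would need its own export(concat_digits) so the worker threads can call it. Whether this changes the scaling has not been measured here.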


? find_n_in_range(1, 10, 1000)
Thread 1 found 334 and 333 : 9999
Thread 1 found 335 and 333 : 9999
Thread 1 found 336 and 333 : 9999
Thread 1 found 337 and 333 : 9999
Thread 1 found 338 and 333 : 9999
Thread 1 found 339 and 333 : 9999
time = 8,907 ms.


? find_n_in_range(2, 10, 1000)
Thread 1 found 334 and 333 : 9999
Thread 1 found 336 and 333 : 9999
Thread 1 found 338 and 333 : 9999
Thread 2 found 335 and 333 : 9999
Thread 2 found 337 and 333 : 9999
Thread 2 found 339 and 333 : 9999
cpu time = 9,422 ms, real time = 9,257 ms.


? find_n_in_range(5, 10, 1000)
Thread 1 found 335 and 333 : 9999
Thread 5 found 334 and 333 : 9999
Thread 3 found 337 and 333 : 9999
Thread 5 found 339 and 333 : 9999
Thread 2 found 336 and 333 : 9999
Thread 4 found 338 and 333 : 9999
cpu time = 94 ms, real time = 6,068 ms.


? find_n_in_range(10, 10, 1000)
Thread 8 found 337 and 333 : 9999
Thread 6 found 335 and 333 : 9999
Thread 9 found 338 and 333 : 9999
Thread 7 found 336 and 333 : 9999
Thread 5 found 334 and 333 : 9999
Thread 10 found 339 and 333 : 9999
cpu time = 63 ms, real time = 7,545 ms.


? find_n_in_range(20, 10, 1000)
Thread 5 found 334 and 333 : 9999
Thread 7 found 336 and 333 : 9999
Thread 10 found 339 and 333 : 9999
Thread 9 found 338 and 333 : 9999
Thread 6 found 335 and 333 : 9999
Thread 8 found 337 and 333 : 9999
cpu time = 15 ms, real time = 7,578 ms.


? find_n_in_range(30, 10, 1000)
Thread 27 found 336 and 333 : 9999
Thread 29 found 338 and 333 : 9999
Thread 25 found 334 and 333 : 9999
Thread 30 found 339 and 333 : 9999
Thread 28 found 337 and 333 : 9999
Thread 26 found 335 and 333 : 9999
cpu time = 564 ms, real time = 9,485 ms.


? find_n_in_range(40, 10, 1000)
Thread 6 found 335 and 333 : 9999
Thread 10 found 339 and 333 : 9999
Thread 8 found 337 and 333 : 9999
Thread 7 found 336 and 333 : 9999
Thread 9 found 338 and 333 : 9999
Thread 5 found 334 and 333 : 9999
cpu time = 359 ms, real time = 9,878 ms.


? find_n_in_range(50, 10, 1000)
Thread 26 found 335 and 333 : 9999
Thread 28 found 337 and 333 : 9999
Thread 30 found 339 and 333 : 9999
Thread 25 found 334 and 333 : 9999
Thread 29 found 338 and 333 : 9999
Thread 27 found 336 and 333 : 9999
cpu time = 267 ms, real time = 9,503 ms.