Mike Day on Sun, 14 Jul 2019 18:30:02 +0200
Re: Using local variable for memoizing in recursion
Thanks. Quite a lot to study. I wondered about your definition of a vector
within a loop, as I'd assumed its creation would be expensive in time and
space, but I'm happy to accept that its sort-of-implied loop is more
efficient than an explicit for loop.

Cheers,
Mike

Please reply to mike_liz.day@tiscali.co.uk.
Sent from my iPad

> On 14 Jul 2019, at 17:11, Karim Belabas <Karim.Belabas@math.u-bordeaux.fr> wrote:
>
> * Mike Day [2019-07-14 15:03]:
>> I've just noticed an edge-condition failure in Karim's version (below):
>> need to replace
>>     vecsum(t[a-1]);  \\ at end of function
>> by
>>     vecsum(t[kx]);   \\ was a problem when a=1 !
>
> Good catch :-)
>
>> I'd accepted Dmitry's assertion that execution of a particular line took
>> half the execution time, so sought a way to avoid the calculation.
>
> The assertion is plausible: the innermost loop is going to take 100% of CPU
> time and contains only 2 lines
>
>     ky = 1 + (k1-j+i) % (k1);  /* to be optimized, takes half the CPU time */
>     t[k2][j] = t[kx][j-1] + t[ky][j];
>
> which look of comparable cost (see below: more atomic steps in the first
> one, but on tiny inputs, compared to the large addition and assignment in
> the second line).
>
>> I still wonder, though, how I can _examine_ a function's performance,
>> line by line, in order to optimise by hand/eye/brain...
>
> I don't see an easy way in GP (esp. under Windows). The following usually
> works but is cumbersome: replace in the program the instruction block
> to be timed by a loop repeating the same block enough times that an
> individual timing becomes meaningful, and surround the loop by something
> like
>
>     gettime();
>     for (cnt = 1, MAXCNT, ...);
>     T += gettime();
>
> Make sure the loop operates on dummy variables (say, my_t) so as to have no
> effect on the program flow. Then print T / MAXCNT and compare. One must
> also check that an empty loop for (cnt = 1, MAXCNT, ) takes negligible time
> [will be the case here], or take that into account as well...
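[Editorial sketch: the timing recipe above, made concrete. This is a
hypothetical harness, not code from the thread; MAXCNT, my_ky and the sample
values of k1, i, j are illustrative names only. It times the first line of
the innermost loop in isolation and subtracts empty-loop overhead.]

    \\ Micro-timing sketch for one candidate line, per the recipe above.
    MAXCNT = 10^6;
    k1 = 2001; i = 57; j = 3;          \\ arbitrary sample operands
    gettime();                          \\ reset GP's millisecond timer
    for (cnt = 1, MAXCNT,
      my_ky = 1 + (k1 - j + i) % k1    \\ dummy variable: no effect on program flow
    );
    T = gettime();                      \\ elapsed ms for MAXCNT repetitions
    gettime(); for (cnt = 1, MAXCNT, ); T0 = gettime();  \\ empty-loop overhead
    print("per-iteration cost ~ ", (T - T0) / MAXCNT, " ms");

Timing the other innermost line the same way (on dummy copies of t) would
let one compare the two costs directly.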
> What I sometimes do is have a look at gprof [external profiling tool] and
> decide from there. Just did it: in this case, it's hard to identify what
> takes time from the C function names:
>
> < gprof excerpt, run on the original pnkv_3(10000,2000) >
>
> index  % time  self  children  called                name
> [3]      92.0  3.05      3.64  20010027+1            closure_eval <cycle 2> [3]
>                0.94      0.53  60014000/60014000       gadd [4]
>                0.05      0.35  20008000/20008000       change_compo [7]
>                0.01      0.25  20008003/20008003       changelex [9]
>                0.07      0.17  20007999/20007999       gmod [10]
>                0.12      0.07  40016003/80028023       gunclone_deep [8]
>                0.19      0.00  120008001/120008001     check_array_index [12]
>                0.09      0.09  40011998/40011998       copylex [13]
>                0.14      0.00  80002000/80002000       closure_castgen [16]
>                0.10      0.00  40016010/40016010       stoi [22]
>                0.05      0.00  39996001/39996001       gsub [28]
>
> (N.B. The timings from gprof are usually inaccurate but give a rough idea
> of relative cost; the number of calls is more useful.)
>
> A line such as
>
>     ky = 1 + (k1-j+i) % (k1);  /* to be optimized, takes half the CPU time */
>
> translates to 2 gadd, 1 gsub, 1 gmod and 1 changelex. It is run ~ 2.10^7
> times (the number of 'gmod' calls in the gprof output, but also ~ n * k
> from direct code inspection).
>
> The other line in the innermost loop
>
>     t[k2][j] = t[kx][j-1] + t[ky][j];
>
> translates to 1 gadd, 1 gsub, 1 change_compo (=) and 6 check_array_index
> (the [] constructs). Note that this latter gadd operates on much larger
> inputs than the preceding ones [or the gsub and gmod from the other line],
> but it's impossible to estimate that from gprof's output alone.
> My version returns this:
>
> < gprof excerpt, run on my pnk_4(10000,2000) >
>
> index  % time  self  children  called              name
> [3]      91.0  3.06      3.09  40018025+1          closure_eval <cycle 2> [3]
>                0.96      0.25  19988002/19990001     gadd [4]
>                0.02      0.31  20038000/20038000     changelex [6]
>                0.17      0.07  40055999/40074019     gunclone_deep [8]
>                0.19      0.00  79972003/79972003     closure_castgen [10]
>                0.04      0.13  19998000/19998000     geq [11]
>                0.14      0.00  40028027/40028027     pari_stack_alloc [13]
>                0.03      0.10  19998000/19998000     gsub1e [16]
>                0.11      0.00  79982004/79982004     check_array_index [18]
>                0.10      0.00  19988002/19988002     gsub [19]
>                0.08      0.00  39986017/39986017     stoi [22]
>
> and 100% of the CPU time is now concentrated on the following three lines:
>
>     if (!(ky--), ky = k1);
>     if (j == 1, t[ky][1]
>               , t[kx][j-1] + t[ky][j]));
>
> The reasons it's a little faster:
> 1) I removed the assignments to the t[k2] accumulator, using the
>    t[a] = vector(...) construct [which replaces t[a] in one atomic step
>    instead of changing the t[a][j] in place, which wouldn't work since the
>    loop still needs the old values];
> 2) I removed many operations on small operands, letting the implicit loop
>    in vector() do some of them in a faster way (+ and assignment).
>
>> And while we're at it, is Dmitry's use of a vector of vectors, t, to be
>> preferred to my earlier use of a matrix, also t, with similar dimensions?
>
> A vector of vectors will be a little faster, but it should not make a major
> difference. What made a difference is replacing a for() loop by a vector()
> construct in the innermost loop and removing the "deep" assignments
>
>     t[k2][j] = ...
>
> (we still have an equivalent of the global copy t[kx] = t[k2] in the
> t[a] = vector(...)).
>
> Cheers,
>
>     K.B.
> --
> Karim Belabas, IMB (UMR 5251)      Tel: (+33) (0)5 40 00 26 17
> Universite de Bordeaux             Fax: (+33) (0)5 40 00 21 23
> 351, cours de la Liberation        http://www.math.u-bordeaux.fr/~kbelabas/
> F-33405 Talence (France)           http://pari.math.u-bordeaux.fr/  [PARI/GP]
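[Editorial sketch: the two speedups Karim describes, on a toy example. This
is NOT the original pnk code; the prefix-sum recurrence and the names rowA,
rowB, n, t are illustrative only. It contrasts "deep" element-wise
assignments against one atomic vector() construct.]

    \\ Toy row update: new row = running sums of the old row t[1].
    n = 2000;
    t = [vector(n, j, j), vector(n)];  \\ t[1] = old row, t[2] = accumulator

    \\ Style A: explicit for() loop, with a deep assignment t[2][j] = ...
    \\ on every iteration (the pattern the rewrite removes).
    rowA() =
    {
      t[2][1] = t[1][1];
      for (j = 2, n, t[2][j] = t[2][j-1] + t[1][j]);
      t[2];
    }

    \\ Style B: one vector() construct builds the whole new row; the old
    \\ row t[1] stays intact while it is built, and the result can replace
    \\ a row in one atomic step: t[2] = rowB().
    rowB() = my(s = 0); vector(n, j, s += t[1][j]);

Both styles compute the same row; B trades n deep assignments for vector()'s
implicit loop, which is the shape of the change described in 1) and 2).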