Karim Belabas on Sun, 14 Jul 2019 18:12:01 +0200
Re: Fwd: Re: Using local variable for memoizing in recursion
* Mike Day [2019-07-14 15:03]:
> I've just noticed an edge-condition failure in Karim's version (below):
> need to replace
>   vecsum(t[a-1]); \\ at end of function
> by
>   vecsum(t[kx]);  \\ was problem when a=1 !

Good catch :-)

> I'd accepted Dmitry's assertion that execution of a particular line took
> half the execution time, so sought a way to avoid the calculation.

The assertion is plausible: the innermost loop is going to take 100% of the
CPU time and contains only 2 lines

  ky = 1 + (k1-j+i) % (k1);  /* to be optimized, takes half the CPU time */
  t[k2][j] = t[kx][j-1] + t[ky][j];

which look of comparable cost (see below: more atomic steps in the first
one, but on tiny inputs, compared to the large addition and assignment in
the second line).

> I still wonder, though, how I can _examine_ a function's performance,
> line by line, in order to optimise by hand/eye/brain...

I don't see an easy way in GP (esp. under Windows). The following usually
works but is cumbersome: replace in the program the instruction block to be
timed by a loop repeating the same block enough times that an individual
timing becomes meaningful, and surround the loop by something like

  gettime();
  for (cnt = 1, MAXCNT, ...);
  T += gettime();

Make sure the loop operates on dummy variables (say, my_t) so as to have no
effect on the program flow. Then print T / MAXCNT and compare. One must also
check that an empty loop

  for (cnt = 1, MAXCNT, )

takes negligible time [this will be the case here], or take that into
account as well...

What I sometimes do is have a look at gprof [an external profiling tool]
and decide from there.
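To make the recipe concrete, here is a minimal sketch of that timing idiom;
MAXCNT, the dummy values of k1, i, j, and the block being timed are all
placeholders for whatever you want to measure, not part of any real program:

```gp
\\ Sketch of the timing idiom described above (all names are dummies).
MAXCNT = 10^6;
k1 = 2000; i = 3; j = 5;           \\ arbitrary stand-in operands
T = 0;
gettime();                          \\ reset GP's timer
for (cnt = 1, MAXCNT,
  my_ky = 1 + (k1 - j + i) % k1    \\ the block under test, on dummies
);
T += gettime();                     \\ elapsed time in milliseconds
print("per iteration: ", T / MAXCNT, " ms");
\\ Calibrate: an empty loop should take a negligible fraction of T.
gettime(); for (cnt = 1, MAXCNT, ); print("empty loop: ", gettime(), " ms");
```

Assigning to a dummy my_ky rather than ky keeps the measurement from
disturbing the real program state, as suggested above.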
Just did it: in this case, it's hard to identify what takes time from the C
function names:

< gprof excerpt, run on the original pnkv_3(10000,2000) >

index % time    self  children    called                name
[3]     92.0    3.05    3.64    20010027+1            closure_eval <cycle 2> [3]
                0.94    0.53    60014000/60014000     gadd [4]
                0.05    0.35    20008000/20008000     change_compo [7]
                0.01    0.25    20008003/20008003     changelex [9]
                0.07    0.17    20007999/20007999     gmod [10]
                0.12    0.07    40016003/80028023     gunclone_deep [8]
                0.19    0.00    120008001/120008001   check_array_index [12]
                0.09    0.09    40011998/40011998     copylex [13]
                0.14    0.00    80002000/80002000     closure_castgen [16]
                0.10    0.00    40016010/40016010     stoi [22]
                0.05    0.00    39996001/39996001     gsub [28]

(N.B. The timings from gprof are usually inaccurate but give a rough idea of
relative cost; the number of calls is more useful.)

A line such as

  ky = 1 + (k1-j+i) % (k1);  /* to be optimized, takes half the CPU time */

translates to 2 gadd, 1 gsub, 1 gmod and 1 changelex. It is run ~ 2.10^7
times (the number of 'gmod' calls in the gprof output, but also ~ n * k from
direct code inspection).

The other line in the innermost loop

  t[k2][j] = t[kx][j-1] + t[ky][j];

translates to 1 gadd, 1 gsub, 1 change_compo (=) and 6 check_array_index
([] construct). Note that this latter gadd operates on much larger inputs
than the preceding ones [or the gsub and gmod from the other line], but it's
impossible to estimate that from gprof's output alone.
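Since ky only ever steps cyclically through 1..k1, the gmod (and the gadd's
feeding it) can be avoided by maintaining ky with a decrement-and-wrap, as
in the pnk_4 excerpt below. A minimal sketch of just that update, with k1
and the starting ky as arbitrary stand-ins:

```gp
\\ Sketch: ky cycles k1, k1-1, ..., 1, k1, ... with no gmod at all.
k1 = 7; ky = 3;              \\ arbitrary stand-in values
for (step = 1, 10,
  if (!(ky--), ky = k1);     \\ decrement; wrap from 0 back to k1
  print1(ky, " ")
);
```

The wrap costs one comparison per iteration instead of a modular reduction
on freshly built small integers.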
My version returns this:

< gprof excerpt, run on my pnk_4(10000,2000) >

index % time    self  children    called                name
[3]     91.0    3.06    3.09    40018025+1            closure_eval <cycle 2> [3]
                0.96    0.25    19988002/19990001     gadd [4]
                0.02    0.31    20038000/20038000     changelex [6]
                0.17    0.07    40055999/40074019     gunclone_deep [8]
                0.19    0.00    79972003/79972003     closure_castgen [10]
                0.04    0.13    19998000/19998000     geq [11]
                0.14    0.00    40028027/40028027     pari_stack_alloc [13]
                0.03    0.10    19998000/19998000     gsub1e [16]
                0.11    0.00    79982004/79982004     check_array_index [18]
                0.10    0.00    19988002/19988002     gsub [19]
                0.08    0.00    39986017/39986017     stoi [22]

and 100% of the CPU time is now concentrated in the following three lines:

  if (!(ky--), ky = k1);
  if (j == 1, t[ky][1],
              t[kx][j-1] + t[ky][j]));

The reasons it's a little faster:

1) I removed the assignments to the t[k2] accumulator using the
   t[a] = vector() construct [which replaces t[a] in one atomic step,
   instead of changing the t[a][j] in place, which wouldn't work since the
   loop still needs the values];

2) I removed many operations on small operands, letting the implicit loop
   in vector() do some of them in a faster way (+ and assignment).

> And while we're at it, is Dmitry's use of a vector of vectors, t, to be
> preferred to my earlier use of a matrix, also t, with similar dimensions?

A vector of vectors will be a little faster, but it should not make a major
difference. What made a difference is replacing a for() loop by a vector()
construct in the innermost loop and removing the "deep" assignments
t[k2][j] = ... (we still have an equivalent of the global copy
t[kx] = t[k2] in the t[a] = vector(...)).

Cheers,

    K.B.
--
Karim Belabas, IMB (UMR 5251)  Tel: (+33) (0)5 40 00 26 17
Universite de Bordeaux         Fax: (+33) (0)5 40 00 21 23
351, cours de la Liberation    http://www.math.u-bordeaux.fr/~kbelabas/
F-33405 Talence (France)       http://pari.math.u-bordeaux.fr/  [PARI/GP]