Ilya Zakharevich on Tue, 8 Apr 2003 02:49:24 -0700 |
[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]
[PATCH CVS] prime sieving again |
This patch: a) Optimizes sieve_chunk() function so that gcc 2.8.1 can fit both loops into registers; b) Protects sieve_chunk against inlining (which breaks most of the benefits of "a"); [The speedup of sieving on Athlon is circa 1.5x.] c) Updates the "slowdown model" to the new algorithm (note that up to maximal possible 32bit primelimit everything fits into 64K L1 cache of Athlon; thus the model is not used with the default values of parameters); d) Updates some comments. This is as good as one can get with primes with the current restricted API of PARI. Until t_EXT is implemented (e.g., by incorporation of my patch), one is limited to what can be calculated by 15sec of sieving on contemporary hardware. When t_EXT is included, one could access primelist via an array-like interface; in my guestimate, on a contemporary hardware it is very easy to e.g., calculate 1e9 primes about 1e21 in a day or so. Enjoy, Ilya --- ./src/basemath/arith2.c-good Mon Mar 10 13:03:54 2003 +++ ./src/basemath/arith2.c Tue Apr 8 02:34:08 2003 @@ -119,70 +119,92 @@ initprimes1(ulong size, long *lenp, long /* Timing in ms (Athlon/850; reports 512K of secondary cache; looks like there is 64K of quickier cache too). - The algorithm of allocation starts to work regularly from - 2pi(sqrt(lim)); we skip or double-quote what is less than this. + arena| 30m 100m 300m 1000m 2000m <-- primelimit + ================================================= + 16K 1.1053 1.1407 1.2589 1.4368 1.6086 + 24K 1.0000 1.0625 1.1320 1.2443 1.3095 + 32K 1.0000 1.0469 1.0761 1.1336 1.1776 + 48K 1.0000 1.0000 1.0254 1.0445 1.0546 + 50K 1.0000 1.0000 1.0152 1.0345 1.0464 + 52K 1.0000 1.0000 1.0203 1.0273 1.0362 + 54K 1.0000 1.0000 1.0812 1.0216 1.0281 + 56K 1.0526 1.0000 1.0051 1.0144 1.0205 + 58K 1.0000 1.0000 1.0000 1.0086 1.0123 + 60K 0.9473 0.9844 1.0051 1.0014 1.0055 + 62K 1.0000 0.9844 0.9949 0.9971 0.9993 + 64K 1.0000 1.0000 1.0000 1.0000 1.0000 + 66K 1.2632 1.2187 1.2183 1.2055 1.1953 + 68K 1.4211 1.4844 1.4721 1.4425 1.4188 + 70K 1.7368 1.7188 1.7107 1.6767 1.6421 + 72K 1.9474 1.9531 1.9594 1.9023 1.8573 + 74K 2.2105 2.1875 2.1827 2.1207 2.0650 + 76K 2.4211 2.4219 2.4010 2.3305 2.2644 + 78K 2.5789 2.6250 2.6091 2.5330 2.4571 + 80K 2.8421 2.8125 2.8223 2.7213 2.6380 + 84K 3.1053 3.1875 3.1776 3.0819 2.9802 + 88K 3.5263 3.5312 3.5228 3.4124 3.2992 + 92K 3.7895 3.8438 3.8375 3.7213 3.5971 + 96K 4.0000 4.1093 4.1218 3.9986 3.9659 + 112K 4.3684 4.5781 4.5787 4.4583 4.6115 + 128K 4.7368 4.8750 4.9188 4.8075 4.8997 + 192K 5.5263 5.7188 5.8020 5.6911 5.7064 + 256K 6.0000 6.2187 6.3045 6.1954 6.1033 + 384K 6.7368 6.9531 7.0405 6.9181 6.7912 + 512K 7.3158 7.5156 7.6294 7.5000 7.4654 + 768K 9.1579 9.4531 9.6395 9.5014 9.1075 + 1024K 10.368 10.7497 10.9999 10.878 10.8201 + 1536K 12.579 13.3124 13.7660 13.747 13.4739 + 2048K 13.737 14.4839 15.0509 15.151 15.1282 + 3076K 14.789 15.5780 16.2993 16.513 16.3365 + + Now the same number relative to the model + + (1 + 0.36*sqrt(primelimit)/arena) * (arena <= 64 ? 1.05 : (arena-64)**0.38) + + [SLOW2_IN_ROOTS = 0.36, ALPHA = 0.38] + + arena| 30m 100m 300m 1000m 2000m <-- primelimit + ================================================= + 16K 1.014 0.9835 0.9942 0.9889 1.004 + 24K 0.9526 0.9758 0.9861 0.9942 0.981 + 32K 0.971 0.9939 0.9884 0.9849 0.9806 + 48K 0.9902 0.9825 0.996 0.9945 0.9885 + 50K 0.9917 0.9853 0.9906 0.9926 0.9907 + 52K 0.9932 0.9878 0.9999 0.9928 0.9903 + 54K 0.9945 0.9902 1.064 0.9939 0.9913 + 56K 1.048 0.9924 0.9925 0.993 0.9921 + 58K 0.9969 0.9945 0.9909 0.9932 0.9918 + 60K 0.9455 0.9809 0.9992 0.9915 0.9923 + 62K 0.9991 0.9827 0.9921 0.9924 0.9929 + 64K 1 1 1 1 1 + 66K 1.02 0.9849 0.9857 0.9772 0.9704 + 68K 0.8827 0.9232 0.9176 0.9025 0.8903 + 70K 0.9255 0.9177 0.9162 0.9029 0.8881 + 72K 0.9309 0.936 0.9429 0.9219 0.9052 + 74K 0.9715 0.9644 0.967 0.9477 0.9292 + 76K 0.9935 0.9975 0.9946 0.9751 0.9552 + 78K 0.9987 1.021 1.021 1.003 0.9819 + 80K 1.047 1.041 1.052 1.027 1.006 + 84K 1.052 1.086 1.092 1.075 1.053 + 88K 1.116 1.125 1.133 1.117 1.096 + 92K 1.132 1.156 1.167 1.155 1.134 + 96K 1.137 1.177 1.195 1.185 1.196 + 112K 1.067 1.13 1.148 1.15 1.217 + 128K 1.04 1.083 1.113 1.124 1.178 + 192K 0.9368 0.985 1.025 1.051 1.095 + 256K 0.8741 0.9224 0.9619 0.995 1.024 + 384K 0.8103 0.8533 0.8917 0.9282 0.9568 + 512K 0.7753 0.8135 0.8537 0.892 0.935 + 768K 0.8184 0.8638 0.9121 0.9586 0.9705 + 1024K 0.8241 0.8741 0.927 0.979 1.03 + 1536K 0.8505 0.9212 0.9882 1.056 1.096 + 2048K 0.8294 0.8954 0.9655 1.041 1.102 - arena| 10m 30m 100m 300m 1000m 2000m 4000m - ================================================================= - 1K 360 - 1.5 250 1150 - 2K 210 840 - 3K 180 660 3160 - 4K 160 580 2530 11930 "36680" "84730" "262850" - 12K 150 460 1730 5590 26010 65500 "184520" - 20K 150 430 1500 4900 25450 56420 143220 - ... - 64K 140 410 1400 4360 22150 32850 122820 - - Timing relative to 64K: - - 20K 1.071 1.049 1.071 1.124 1.149 1.718 1.166 - 28K 1.071 1.000 1.043 1.067 1.115 1.256 1.326 - 36K 1.000 1.024 1.029 1.034 0.946 1.075 1.087 - 44K 1.000 1.049 1.007 1.016 1.006 1.056 0.974 - 52K 1.000 0.976 1.007 1.009 1.023 1.035 0.910 - 56K 0.929 1.024 1.000 1.002 1.023 1.022 0.893 - 58K 1.143 0.976 1.000 1.002 0.995 1.016 0.888 - 60K 1.000 1.024 1.000 0.998 1.016 1.010 0.865 - 62K 1.000 1.000 1.000 0.995 1.023 1.004 0.922 - 64K 1.000 1.000 1.000 1.000 1.000 1.000 1.000 - 66K 1.071 1.049 1.014 0.993 1.025 0.995 0.938 - 68K 1.071 1.098 1.086 1.050 0.939 0.991 0.926 - 70K 1.214 1.171 1.164 1.124 1.061 1.023 0.860 - 72K 1.214 1.244 1.221 1.181 1.070 1.081 0.852 - 76K 1.357 1.415 1.371 1.305 1.154 1.260 0.841 - 84K 1.643 1.610 1.614 1.546 1.340 1.426 0.954 - 92K 1.857 1.878 1.850 1.773 1.475 1.640 1.063 - 96K 2.000 2.073 2.057 2.016 1.380 1.710 1.127 - 128K 2.214 2.341 2.371 2.328 1.624 2.118 1.429 - 192K 2.500 2.707 2.743 2.729 1.901 2.495 1.626 - 256K 2.786 2.902 2.979 2.977 2.077 2.680 1.791 - 384K 3.143 3.293 3.371 3.392 2.344 3.050 1.906 - 512K 3.286 3.488 3.521 3.537 2.445 3.441 2.130 - 768K 3.857 4.171 4.264 4.319 2.977 4.045 2.490 - 1024K 4.429 4.707 4.879 4.947 3.435 4.900 2.941 - 1536K 5.071 5.659 6.029 6.197 4.393 6.237 3.662 - 2048K 5.357 6.195 6.593 6.995 4.857 6.886 4.039 - 3072K 5.786 6.488 7.029 7.356 5.264 7.425 4.402 - - Values after 92K are from different run... Matters much for 4000m - one, when swapping starts; also 1000m run looks not very much - reproducible... - - The stats for small arena lead to the value of ARENA_IN_ROOTS for i386. - - Similar data for Sparc leads to 10. - - TODO: need to create macro for (small) OVERHEAD_100_IN_ROOTS where the - overhead of subdivision is 100%; likewise one needs to estimate - the overhead of having the arena larger than the cache size. Then - one can intelligently optimize the arena size taking both effects - into the account... */ #ifndef SLOW2_IN_ROOTS /* SLOW2_IN_ROOTS below 3: some slowdown starts to be noticable - * when things fit into the cache. + * when things fit into the cache on Sparc. * XXX The choice of 2.6 gives a slowdown of 1-2% on UltraSparcII, * but makes calculations for "maximum" of 436273009 (not applicable anymore) * fit into 256K cache (still common for some architectures). @@ -190,7 +212,7 @@ initprimes1(ulong size, long *lenp, long * One may change it when small caches become uncommon, but the gain * is not going to be very noticable... */ # ifdef i386 /* gcc defines this? */ -# define SLOW2_IN_ROOTS 0.4 +# define SLOW2_IN_ROOTS 0.36 # else # define SLOW2_IN_ROOTS 2.6 # endif @@ -204,8 +226,8 @@ initprimes1(ulong size, long *lenp, long # endif #endif -#define CACHE_ALPHA (0.29) /* Cache performance model parameter */ -#define CACHE_CUTOFF (0.055) /* Cache performance not smooth here */ +#define CACHE_ALPHA (0.38) /* Cache performance model parameter */ +#define CACHE_CUTOFF (0.018) /* Cache performance not smooth here */ static double slow2_in_roots = SLOW2_IN_ROOTS; @@ -248,17 +270,17 @@ good_arena_size(ulong slow2_size, ulong 1 + slow2_size/arena due to initialization overhead; - max(1, 2.33 * overhead^0.29 ) due to footprint > cache size. + max(1, 4.63 * overhead^0.38 ) due to footprint > cache size. [The latter is hard to substantiate theoretically, but this function describes benchmarks pretty close; it does not hurt that one can minimize it explicitly too ;-). The switch between - diffenent choices of max() happens whe overhead=0.055.] + diffenent choices of max() happens whe overhead=0.018.] Thus the problem is minimizing (1 + slow2_size/arena)*overhead**0.29. This boils down to F=((X+A)/(X+B))X^alpha, X=overhead, B = (1 - fixed_to_cache/cache_arena), A = B + - slow2_size/cache_arena, alpha = 0.29, and X>=0.055, X>-B. It + slow2_size/cache_arena, alpha = 0.38, and X>=0.018, X>-B. It turns out that the minimization problem is not very nasty. (As usual with minimization problems which depend on parameters, different values of A,B lead to different algorithms. However, @@ -266,8 +288,8 @@ good_arena_size(ulong slow2_size, ulong algorithms work are explicitly calculatable.) Thus we need to find the rightmost root of (X+A)*(X+B) - - alpha(A-B)X to the right of 0.055 (if such exists and is below - Xmax). Then we manually check the remaining region [0, 0.055]. + alpha(A-B)X to the right of 0.018 (if such exists and is below + Xmax). Then we manually check the remaining region [0, 0.018]. Since we cannot trust the purely-experimental cache-hit slowdown function, as a sanity check always prefer fitting into the @@ -393,7 +415,7 @@ set_optimize(long what, GEN g) } static void -sieve_chunk(byteptr known_primes, ulong start, byteptr data, ulong cnt) +sieve_chunk(byteptr known_primes, ulong s, byteptr data, ulong count) { /* start must be odd; prime differences (starting from (5-3)=2) start at known_primes[2], are terminated by a 0 byte; @@ -402,22 +424,39 @@ sieve_chunk(byteptr known_primes, ulong starting at data to 0 or 1 depending on whether the corresponding odd number is prime or not */ ulong p; - byteptr q, start_sieve, end_sieve = data + cnt; + byteptr q; + register byteptr write_to = data; /* Better code with gcc 2.8.1 */ + register ulong cnt = count; /* Better code with gcc 2.8.1 */ + register ulong start = s; /* Better code with gcc 2.8.1 */ + register ulong delta = 1; /* Better code with gcc 2.8.1 */ memset(data, 0, cnt); + start >>= 1; /* (start - 1)/2 */ + start += cnt; /* Corresponds to the end */ + cnt -= 1; /* data corresponds to start. q runs over primediffs. */ /* Don't care about DIFFPTR_SKIP: false positives provide no problem */ - for (q = known_primes + 2, p = 3; *q; p += *q++) { + for (q = known_primes + 1, p = 3; delta; delta = *++q, p += delta) { /* first odd number which is >= start > p and divisible by p = last odd number which is <= start + 2p - 1 and 0 (mod p) = p + the last even number which is <= start + p - 1 and 0 (mod p) = p + the last even number which is <= start + p - 2 and 0 (mod p) - = p + start + p - 2 - (start + p - 2) % 2p. */ - start_sieve = data - (((start+p-2) % (p << 1)) >> 1) + p - 1; - for (; start_sieve < end_sieve; start_sieve += p) *start_sieve = 1; + = p + start + p - 2 - (start + p - 2) % 2p + = start + 2(p - 1 - ((start-1)/2 + (p-1)/2) % p). */ + long off = cnt - ((start+(p>>1)) % p); + + while (off >= 0) { + write_to[off] = 1; + off -= p; + } } } +/* Do not inline sieve_chunk()! It fits into registers in ix86 non-inlined */ +extern void (*sieve_chunk_p)(byteptr known_primes, ulong s, byteptr data, ulong count); +void (*sieve_chunk_p)(byteptr known_primes, ulong s, byteptr data, ulong count) + = sieve_chunk; + /* Here's the workhorse. This is recursive, although normally the first recursive call will bottom out and invoke initprimes1() at once. (Not static; might conceivably be useful to someone in library mode) */ @@ -436,6 +475,10 @@ initprimes0(ulong maxnum, long *lenp, ul maxnum |= 1; /* make it odd. */ /* Checked to be enough up to 40e6, attained at 155893 */ + /* Due to multibyte representation of large gaps, this estimate will + be broken by large enough maxnum. However, assuming exponential + distribution of gaps with the average log(n), we are safe up to + circa exp(-256/log(1/0.09)) = 1.5e46. OK with LONG_BITS <= 128. ;-) */ size = (long) (1.09 * maxnum/log((double)maxnum)) + 145; p1 = (byteptr) gpmalloc(size); rootnum = (ulong) sqrt((double)maxnum); /* cast it back to a long */ @@ -480,7 +523,7 @@ initprimes0(ulong maxnum, long *lenp, ul was_delta = *p_prime_above; *p_prime_above = 0; /* Put a 0 sentinel for sieve_chunk */ - sieve_chunk(p1, curlow, p, asize); + (*sieve_chunk_p)(p1, curlow, p, asize); *p_prime_above = was_delta; /* Restore */ p[asize] = 0; /* Put a 0 sentinel for ZZZ */