Answer Key
Upload answers in one file, and each program separately, to the course website under "Exam 1".

A particular program has been parallelized, and the following times have been measured for it:
# of processes   problem size   setup time   working time   cleanup time
       1             1e6           76 ms          84 ms         49 ms
       4             1e6           92 ms          26 ms         56 ms
      12             1e6           97 ms          44 ms         54 ms
       1           100e6           70 ms       8,000 ms         50 ms
       4           100e6           90 ms       2,000 ms         55 ms
      12           100e6          100 ms         884 ms         53 ms
       1        10,000e6           70 ms     825,000 ms         50 ms
       4        10,000e6           90 ms     209,000 ms         55 ms
      12        10,000e6          108 ms      72,902 ms         49 ms
[6 pts] Assume that setup time and cleanup time are roughly constant. Calculate the speedup and efficiency of the parallel program as the number of processors increases from 1 to 12, for a problem size of 100e6 (100 million). Make a table similar to Table 2.4 in the textbook.
[6 pts] Assume that setup time and cleanup time are roughly constant. Calculate the speedup and efficiency of the parallel program as the problem size increases from 1e6 (1 million) to 10,000e6 (10 billion). Make a table similar to Table 2.5 in the textbook.
[4 pts] Estimate the total running time for the program if 16 processors were available, and if 24 processors were available, for each of the three problem sizes shown.
[4 pts] Does this program appear to be strongly scalable, weakly scalable, or neither?

A research paper described an experiment with a seven-dimensional hypercube interconnect.
[2 pts] How many processors can be connected using a seven-dimensional hypercube?
[2 pts] How many connecting wires must each switch in the interconnect support, and what is the bisection width?
[2 pts] Is a hypercube a direct or an indirect interconnect?
[4 pts] How many switches would be required to connect the same number of processors in an Omega network, and what is the bisection width of this network?
[4 pts] How many switches would be required to connect the same number of processors in a crossbar network, and what is the bisection width of this network?
[4 pts] What is the maximum distance, in number of switches, between two processors connected by a seven-dimensional hypercube?
 [6 pts]
Describe Flynn's taxonomy and how the acronym "SPMD" relates to it. Be specific. Give a (generic) example of a computer system that represents each element in the taxonomy.
 [6 pts]
Is OpenMPI used with shared-memory systems, distributed-memory systems, or both? Give an example that demonstrates your answer.
 [6 pts]
Describe how cache coherence can be maintained in both a shared-memory system and a distributed-memory system.
 [20 pts]
Write a program that uses MPI_Send and MPI_Recv to calculate the "hailstone sequence" (sometimes called the "Collatz-conjecture sequence") for a number of starting values. The sequence is defined by a recurrence relation:
    next n = n / 2        if n is even
    next n = 3 * n + 1    if n is odd
for any integer n greater than 0. Collatz's conjecture is that this sequence always converges to the value 1, for any starting integer.
Your program's root process must choose random starting integers, one per process, and send them to the other processes. Each process uses its received integer as the starting point, counts the number of steps until the hailstone sequence reaches 1, then returns that as the sequence length.
The root process must receive the lengths from each process and display a report of the chosen starting values and their sequence lengths, along with the mean length and its standard deviation. The calculations for the mean length and the standard deviation are shown in the serial version of the program, below. They depend on summing the sequence lengths, and the squares of the sequence lengths, as shown below.
Limit the range of starting values
For testing convenience, you can limit the range of starting values by using this statement in place of the one in the program below:
starts[rank] = <my_range> * (double)random() / (double)RAND_MAX;
[4 pts extra credit] For extra credit, use MPI_ANY_SOURCE at the root to receive messages in any order from the minions.
An example of the parallel output, and a serial version of the program, are shown below. You need only parallelize the serial program. (HOWEVER, if you find it easier to rewrite the program yourself as a parallel program, you are free to do so.)
$> mpicc -Wall -o hailstone-mpi hailstone-mpi.c -lm
$>
$> # maximum start value is 10:
$> mpirun -np 4 ./hailstone-mpi
root/caterpillar reports start=3, length=7
1: process 1 reports start=4, length=2
2: process 3 reports start=8, length=3
3: process 2 reports start=7, length=16
4 samples: mean 7.000, s.d. 5.523
$>
$> mpirun -np 4 ./hailstone-mpi
root/caterpillar reports start=672269183, length=430
1: process 3 reports start=1060641648, length=87
2: process 1 reports start=1047747167, length=237
3: process 2 reports start=2098411702, length=313
4 samples: mean 266.750, s.d. 124.484
$>
/* the hailstone sequence / Collatz conjecture -- 2020-09-25 */
#include <stdio.h>
#include <stdlib.h>  // random()/srandom()
#include <time.h>    // time()
#include <math.h>    // sqrt()

long unsigned next_hailstone(long unsigned n)
{
    return (n % 2) ? 3 * n + 1 : n / 2;
}

long unsigned sequence(long unsigned n)
{
    long unsigned length = 0;
    while (n != 1) {
        length++;
        n = next_hailstone(n);
        //printf("next n: %lu\n", n);
    }
    return length;
}

int main(int argc, char **argv)
{
    srandom(time(NULL));
    int myrank = 0, worldsize = 1;

    long unsigned starts[ worldsize ];
    for (int rank = 0; rank < worldsize; rank++)
        starts[rank] = random();
    /* Optionally limit the range of starting values with:
       starts[rank] = <my_range> * (double)random() / (double)RAND_MAX;
    */

    long unsigned lengths[ worldsize ];
    long unsigned sum = 0, sumsq = 0;
    for (int rank = 0; rank < worldsize; rank++) {
        lengths[rank] = sequence(starts[rank]);
        sum += lengths[rank];
        sumsq += lengths[rank] * lengths[rank];
        printf("sequence starting with %lu: length %lu\n",
               starts[rank], lengths[rank]);
    }

    double mean = (double)sum / (double)worldsize;
    double stdev = sqrt(worldsize * sumsq - sum * sum) / (double)worldsize;
    printf("%d samples: %.3lf mean, %.3lf s.d.\n", worldsize, mean, stdev);
    return 0;
}
 [12 pts]
Write a program that uses collective communication. The root process must read in a text file, distribute the text to the minion processes, then collect the letter counts and combine them.
The size of the file can be broadcast to all processes first, using MPI_Bcast(), then the file's contents can be distributed using MPI_Scatter(). Results can be collected using either MPI_Gather() or MPI_Reduce() — each approach has strengths and weaknesses. "Choose wisely."
[4 pts] Is your program strongly scalable, weakly scalable, or both? Explain your answer.
Here are some input files for testing:
  Jabberwocky.txt, 969 B
  DefenseCode_Unix_WildCards_Gone_Wild.txt, 21 KB
  AliceinWonderland.txt, 164 KB
  jargonfile2000.txt, 2.3 MB
  WebsterUnabridgeddictionary.Gutenberg.txt, 44 MB
Here is a serial version of this program:
// Character counter
// 2020-09-25
#include <stdio.h>
#include <stdlib.h>  // malloc()

void count_chars(unsigned *counts, char const *contents, long int fsize)
{
    for (char const *cp = contents; cp < (contents + fsize); cp++)
        counts[(unsigned char)*cp]++;  // cast via unsigned char: bytes > 127
                                       // must not produce a negative index
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <sourcefile>\n", argv[0]);
        return 1;
    }

    FILE *h = fopen(argv[1], "r");
    fseek(h, 0L, SEEK_END);
    long int fsize = ftell(h);
    printf("%s size: %ld\n", argv[1], fsize);

    char *contents = malloc(fsize * sizeof(char));
    rewind(h);
    fread(contents, 1, fsize, h);
    fclose(h);

    unsigned charcounts[256];
    for (int i = 0; i < 256; i++)
        charcounts[i] = 0;
    count_chars(charcounts, contents, fsize);

    for (int i = 0; i < 256; i++)
        if (charcounts[i] > 0)
            printf("0x%02x (%3d) :%7u  %c\n", i, i, charcounts[i], i);
    return 0;
}

This serial program uses a "Monte Carlo" simulation to estimate the value of pi. The ratio of a circle's area to the area of its enclosing square is pi/4, so pi is 4 times that ratio. The idea is illustrated on the right. The user starts the program by specifying how many "darts" to throw, i.e., how many samples to collect.
Outputs from the program are shown below. This approximation converges slowly, so very many samples are desirable. An animated demonstration is available by running this Python program: pidarts.py. (It will also demonstrate the value of efficiency....) The compiled C version is much faster (but doesn't do graphical output).
[12 pts] Parallelize this program to use OpenMPI. This is an "embarrassingly parallelizable" program and should be fairly easy. Also answer the following questions about your program:
[2 pts] What approach — send/receive, broadcast/reduce, scatter/gather, a combination — did you use for your program?
[2 pts] What process(es) use each of the "helper" functions init_PRNGs(), random_coordinate(), and significant_digits()?
[2 pts] How does the accuracy of the approximation depend on the number of terms used: linear dependence? logarithmic? exponential? Explain your choice.
[2 pts] How does the performance of the program depend on the number of processes used: linear dependence? logarithmic? exponential? Explain your choice.
Outputs from running pidarts
$>
$> time ./pidarts-serial 1e7
total darts: 10000000, in circle: 7852189
5 significant digits
approx.: 3.14088
 C : 3.14159265358979311600
 manual: 3.14159265358979323846

real    0m0.885s
user    0m0.883s
sys     0m0.001s
$>
$> time mpirun -np 1 ./pidarts-mpi 1e7
total darts: 10000001, in circle: 7854387
5 significant digits
approx.: 3.14175
 C : 3.14159265358979311600
 manual: 3.14159265358979323846

real    0m1.159s
user    0m1.011s
sys     0m0.105s
$>
$> time ./pidarts-serial 1e8
total darts: 100000000, in circle: 78544388
5 significant digits
approx.: 3.14178
 C : 3.14159265358979311600
 manual: 3.14159265358979323846

real    0m8.238s
user    0m8.233s
sys     0m0.004s
$>
$> time mpirun -np 1 ./pidarts-mpi 1e8
total darts: 100000000, in circle: 78540680
6 significant digits
approx.: 3.141627
 C : 3.14159265358979311600
 manual: 3.14159265358979323846

real    0m8.475s
user    0m8.336s
sys     0m0.098s
$>
$> time mpirun -np 4 ./pidarts-mpi 1e8
total darts: 100000000, in circle: 78543945
5 significant digits
approx.: 3.14176
 C : 3.14159265358979311600
 manual: 3.14159265358979323846

real    0m2.553s
user    0m9.062s
sys     0m0.196s
$>
Serial version of pidarts.c
/* ----------------------------------------------------------------
   Area-based approximation of pi
*/
#include <stdio.h>
#include <stdlib.h>  // strtoul(), random_r()
#include <time.h>    // time()
#include <math.h>    // sqrt(), #define M_PI 3.14159265358979323846

// long double only supports ~35 digits
#define PI35 3.1415926535897932384626433832795029L
#define PRNG_BUFSIZE 8   // used in random-number generation

// Generate a random value between -1 and +1:
double random_coordinate(struct random_data *rand_state)
{
    int rn;
    random_r(rand_state, &rn);
    return (-1.0 + 2.0 * (double)rn / (double)RAND_MAX);
}

//
// Calculate the number of pi's digits that are "correct" (and add 3 extra).
int significant_digits(long double pi)
{
    long double diff = fabsl(PI35 - pi);
    long double logdiff = -log10(diff);
    return logdiff + 2;
}

//
int main(int argc, char **argv)
{
    int worldsize = 1, myrank = 0;  // MPI-oriented variables, needed here too

    /* Use a fancy pseudo-random number generator that is "thread-safe", meaning
       it can be called by multiple processes and they all get independent values.
    */
    char rand_state_buf[PRNG_BUFSIZE];
    struct random_data rand_state;
    srandom(time(NULL) + myrank);   // different seed for each process
    initstate_r(random(), rand_state_buf, PRNG_BUFSIZE, &rand_state);

    long unsigned ndarts = (long unsigned)( (argc >= 2) ? atof(argv[1]) : 1e6 );
    long unsigned incircle = 0;
    for (long unsigned i = 0; i < ndarts; i++) {
        double x = random_coordinate(&rand_state);
        double y = random_coordinate(&rand_state);
        if (1.0 >= sqrt(x*x + y*y))
            incircle++;
    }
    printf("total darts: %lu, in circle: %lu\n", ndarts, incircle);

    long double pi = 4.0L * (long double)incircle / (long double)ndarts;
    int signif = significant_digits(pi);
    printf("%d significant digits\n", signif);
    printf("approx.: %.*Lf\n C : %.20Lf\n manual: %.20Lf\n",
           signif, pi, (long double)M_PI, (long double)PI35);
    return 0;
}
//