Name:

Write answers or programs to address each of these problems. Include comments in each program with your name, the question number, and the date.

Upload answers in one file, and each program separately, to the course website under "Exam 1".

1. A particular program has been parallelized, and the following times have been measured for it:

| # of processes | problem size | setup time | working time | cleanup time |
|---------------:|-------------:|-----------:|-------------:|-------------:|
| 1  | 1e6      | 76 ms  | 84 ms      | 49 ms |
| 4  | 1e6      | 92 ms  | 26 ms      | 56 ms |
| 12 | 1e6      | 97 ms  | 44 ms      | 54 ms |
| 1  | 100e6    | 70 ms  | 8000 ms    | 50 ms |
| 4  | 100e6    | 90 ms  | 2000 ms    | 55 ms |
| 12 | 100e6    | 100 ms | 884 ms     | 53 ms |
| 1  | 10,000e6 | 70 ms  | 825,000 ms | 50 ms |
| 4  | 10,000e6 | 90 ms  | 209,000 ms | 55 ms |
| 12 | 10,000e6 | 108 ms | 72,902 ms  | 49 ms |

[6 pts] Assume that setup time and cleanup time are roughly constant. Calculate the speedup and efficiency of the parallel program as the number of processors increases from 1 to 12, for a problem size of 100e6 (100 million). Make a table similar to Table 2.4 in the textbook.

[6 pts] Assume that setup time and cleanup time are roughly constant. Calculate the speedup and efficiency of the parallel program as the problem size increases from 1e6 (1 million) to 10,000e6 (10 billion). Make a table similar to Table 2.5 in the textbook.

[4 pts] Estimate the total running time for the program if 16 processors were available, and if 24 processors were available, for each of the three problem sizes shown.

[4 pts] Does this program appear to be strongly scalable, weakly scalable, or neither?

2. A research paper described an experiment with a seven-dimensional hypercube interconnect.

[2 pts] How many processors can be connected, using a seven-dimensional hypercube?

[2 pts] How many connecting wires must each switch in the interconnect support, and what is the bisection width?

[2 pts] Is a hypercube a direct or indirect interconnect?

[4 pts] How many switches would be required to connect the same number of processors in an Omega network, and what is the bisection width of this network?

[4 pts] How many switches would be required to connect the same number of processors in a crossbar network, and what is the bisection width of this network?

[4 pts] What is the maximum distance, in number of switches, between two processors connected by a seven-dimensional hypercube?

3. [6 pts]

Describe Flynn's taxonomy and how the acronym "SPMD" relates to it. Be specific. Give a (generic) example of a computer system that represents each element in the taxonomy.

4. [6 pts]

Is OpenMPI used with shared-memory systems, distributed-memory systems, or both? Give an example that demonstrates your answer.

5. [6 pts]

Describe how cache coherence can be maintained, in both a shared-memory system and a distributed-memory system.

6. [20 pts]

Write a program that uses MPI_Send and MPI_Recv to calculate the "hailstone sequence" (sometimes called the "Collatz-conjecture sequence") for a number of starting values. The sequence is defined by a recurrence relation:
    next n = n / 2       if n is even
    next n = 3 * n + 1   if n is odd

for any integer n greater than 0.

The Collatz conjecture is that this sequence always reaches the value 1, for any starting integer.

Your program's root process must choose random starting integers, one per process, and send them to the other processes. Each process uses its received integer as the starting point, counts the number of steps until the hailstone sequence reaches 1, then returns that as the sequence length.

The root process must receive the lengths from each process, and display a report of the chosen starting values and their sequence lengths, along with the mean length and its standard deviation. The calculations for mean length and for standard deviation are shown in the serial version of the program, below. They depend on summing up the sequence-lengths, and the squares of the sequence-lengths, as shown below.

#### Limit the range of starting values

For testing convenience, you can limit the range of starting values by using this statement in place of the one in the program below:

[4 pts extra credit] For extra credit, use MPI_ANY_SOURCE at the root to receive messages in any order from the minions.

An example of the parallel output, and a serial version of the program, are shown below. You need only parallelize the serial program. (HOWEVER, if you find it easier to rewrite the program yourself as a parallel program, you are free to do so.)

```
$> mpicc -Wall -o hailstone-mpi hailstone-mpi.c -lm
$>
$> # maximum start value is 10:
$> mpirun -np 4  ./hailstone-mpi
root/caterpillar reports start=3, length=7
1:  process 1 reports start=4, length=2
2:  process 3 reports start=8, length=3
3:  process 2 reports start=7, length=16

4 samples: mean 7.000, s.d. 5.523
$>
$> mpirun -np 4  ./hailstone-mpi
root/caterpillar reports start=672269183, length=430
1:  process 3 reports start=1060641648, length=87
2:  process 1 reports start=1047747167, length=237
3:  process 2 reports start=2098411702, length=313
4 samples: mean 266.750, s.d. 124.484
$>
```
7. [12 pts]

Write a program that uses collective communication. The root process must read in a text file, distribute the text to the minion processes, then collect the letter counts and combine them.

The size of the file can be broadcast to all processes first, using MPI_Bcast(), then the file's contents can be distributed using MPI_Scatter(). Results can be collected using either MPI_Gather() or MPI_Reduce() — each approach has strengths and weaknesses. "Choose wisely."

Here are some input files for testing:

Here is a serial version of this program:

8. This serial program uses a "Monte Carlo" simulation to estimate the value of pi. Pi equals 4 times the ratio of an inscribed circle's area to the area of its enclosing square. The idea is illustrated on the right. The user starts the program by specifying how many "darts" to throw, i.e. how many samples to collect.

Outputs from the program are shown below. This approximation converges slowly, so very many samples are desirable. An animated demonstration is available by running this Python program: pi-darts.py. (It will also demonstrate the value of efficiency....) The compiled C version is much faster (but doesn't do graphical output).

[12 pts] Parallelize this program to use OpenMPI. This is an "embarrassingly parallelizable" program and should be fairly easy. Also answer the following questions about your program:

[2 pts] What approach — send/receive, broadcast/reduce, scatter/gather, a combination — did you use for your program?

[2 pts] What process(es) use each of the "helper" functions init_PRNGs(), random_coordinate(), and significant_digits()?

[2 pts] How does the accuracy of the approximation depend on the number of samples (darts) used: linear dependence? logarithmic? exponential? Explain your choice.

[2 pts] How does the performance of the program depend on the number of processes used: linear dependence? logarithmic? exponential? Explain your choice.

#### Outputs from running pi-darts

```
$>
$> time ./pi-darts-serial 1e7
total darts: 10000000, in circle: 7852189
5 significant digits
approx.: 3.14088
C   : 3.14159265358979311600
manual: 3.14159265358979323846

real    0m0.885s
user    0m0.883s
sys     0m0.001s
$>
$> time mpirun -np 1 ./pi-darts-mpi 1e7
total darts: 10000001, in circle: 7854387
5 significant digits
approx.: 3.14175
C   : 3.14159265358979311600
manual: 3.14159265358979323846

real    0m1.159s
user    0m1.011s
sys     0m0.105s
$>
$> time ./pi-darts-serial 1e8
total darts: 100000000, in circle: 78544388
5 significant digits
approx.: 3.14178
C   : 3.14159265358979311600
manual: 3.14159265358979323846

real    0m8.238s
user    0m8.233s
sys     0m0.004s
$>
$> time mpirun -np 1 ./pi-darts-mpi 1e8
total darts: 100000000, in circle: 78540680
6 significant digits
approx.: 3.141627
C   : 3.14159265358979311600
manual: 3.14159265358979323846

real    0m8.475s
user    0m8.336s
sys     0m0.098s
$>
$> time mpirun -np 4 ./pi-darts-mpi 1e8
total darts: 100000000, in circle: 78543945
5 significant digits
approx.: 3.14176
C   : 3.14159265358979311600
manual: 3.14159265358979323846

real    0m2.553s
user    0m9.062s
sys     0m0.196s
$>
```