Name:
Answer Key
Write answers or programs to address each of these problems. Include comments in each program with your name, the question number, and the date.

Upload answers in one file, and each program separately, to the course website under "Exam 1".

  1. A particular program has been parallelized, and the following times have been measured for it:

    # of processes   problem size   setup time   working time   cleanup time
                 1            1e6        76 ms          84 ms          49 ms
                 4            1e6        92 ms          26 ms          56 ms
                12            1e6        97 ms          44 ms          54 ms
                 1          100e6        70 ms       8,000 ms          50 ms
                 4          100e6        90 ms       2,000 ms          55 ms
                12          100e6       100 ms         884 ms          53 ms
                 1       10,000e6        70 ms     825,000 ms          50 ms
                 4       10,000e6        90 ms     209,000 ms          55 ms
                12       10,000e6       108 ms      72,902 ms          49 ms

    [6 pts] Assume that setup time and cleanup time are roughly constant. Calculate the speedup and efficiency of the parallel program as the number of processors increases from 1 to 12, for a problem size of 100e6 (100 million). Make a table similar to Table 2.4 in the textbook.

    [6 pts] Assume that setup time and cleanup time are roughly constant. Calculate the speedup and efficiency of the parallel program as the problem size increases from 1e6 (1 million) to 10,000e6 (10 billion). Make a table similar to Table 2.5 in the textbook.

    [4 pts] Estimate the total running time for the program if 16 processors were available, and if 24 processors were available, for each of the three problem sizes shown.

    [4 pts] Does this program appear to be strongly scalable, weakly scalable, or neither?
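    As a quick arithmetic check (not a substitute for the required tables), the textbook definitions S = T_serial / T_parallel and E = S / p can be applied directly to the 100e6 rows of the table above. A minimal sketch:

    ```c
    #include <stdio.h>

    /* Total time per the table is setup + working + cleanup. */
    double total_time(double setup, double work, double cleanup)
    {
        return setup + work + cleanup;
    }

    /* Speedup S = T_serial / T_parallel */
    double speedup(double t_serial, double t_parallel)
    {
        return t_serial / t_parallel;
    }

    /* Efficiency E = S / p */
    double efficiency(double t_serial, double t_parallel, int p)
    {
        return speedup(t_serial, t_parallel) / p;
    }

    int main(void)
    {
        /* 100e6 rows of the table */
        double t1  = total_time(70, 8000, 50);   /* 8120 ms, 1 process    */
        double t4  = total_time(90, 2000, 55);   /* 2145 ms, 4 processes  */
        double t12 = total_time(100, 884, 53);   /* 1037 ms, 12 processes */

        printf("p=4 : S=%.2f E=%.2f\n", speedup(t1, t4),  efficiency(t1, t4, 4));
        printf("p=12: S=%.2f E=%.2f\n", speedup(t1, t12), efficiency(t1, t12, 12));
        return 0;
    }
    ```

    For p = 4 this gives S close to 3.79 and E close to 0.95; for p = 12, S close to 7.83 and E close to 0.65.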

  2. A research paper described an experiment with a seven-dimensional hypercube interconnect.

    [2 pts] How many processors can be connected, using a seven-dimensional hypercube?

    [2 pts] How many connecting wires must each switch in the interconnect support, and what is the bisection width?

    [2 pts] Is a hypercube a direct or indirect interconnect?

    [4 pts] How many switches would be required, to connect the same number of processors in an Omega network, and what is the bisection width of this network?

    [4 pts] How many switches would be required, to connect the same number of processors in a crossbar network, and what is the bisection width of this network?

    [4 pts] What is the maximum distance, in number of switches, between two processors connected by a seven-dimensional hypercube?
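    The counts above can be sanity-checked with the standard interconnect formulas (as assumed from the textbook's definitions; verify against your chapter 2 notes). A sketch:

    ```c
    #include <stdio.h>

    /* Assumed formulas:
     * d-dimensional hypercube: p = 2^d processors, d links to neighboring
     * switches (plus one to the local processor, depending on how the
     * question counts wires), bisection width p/2, diameter d.
     * Omega network on p nodes: (p/2) * log2(p) two-by-two switches.
     * Crossbar on p nodes: p * p switches.
     */
    unsigned hypercube_nodes(unsigned d)     { return 1u << d; }
    unsigned hypercube_bisection(unsigned d) { return 1u << (d - 1); }

    unsigned omega_switches(unsigned p)
    {
        unsigned stages = 0;
        for (unsigned n = p; n > 1; n >>= 1)   /* stages = log2(p) */
            stages++;
        return (p / 2) * stages;
    }

    int main(void)
    {
        unsigned d = 7, p = hypercube_nodes(d);
        printf("7-d hypercube: %u processors, bisection %u, diameter %u\n",
               p, hypercube_bisection(d), d);
        printf("Omega: %u switches; crossbar: %u switches\n",
               omega_switches(p), p * p);
        return 0;
    }
    ```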

  3. [6 pts]

    Describe Flynn's taxonomy and how the acronym "SPMD" relates to it. Be specific. Give a (generic) example of a computer system that represents each element in the taxonomy.

  4. [6 pts]

    Is OpenMPI used with shared-memory or distributed-memory systems or both? Give an example that demonstrates your answer.

  5. [6 pts]

    Describe how cache coherence can be maintained, in both a shared-memory system and a distributed-memory system.

  6. [20 pts]

    Write a program that uses MPI_Send and MPI_Recv to calculate the "hailstone sequence" (sometimes called the "Collatz-conjecture sequence") for a number of starting values. The sequence is defined by a recurrence relation:
         next n = n / 2      if n is even
         next n = 3 * n + 1  if n is odd
    for any integer n greater than 0.

    Collatz's conjecture is that this sequence always reaches the value 1, for any positive starting integer.

    Your program's root process must choose random starting integers, one per process, and send them to the other processes. Each process uses its received integer as the starting point, counts the number of steps until the hailstone sequence reaches 1, then returns that as the sequence length.

    The root process must receive the lengths from each process, and display a report of the chosen starting values and their sequence lengths, along with the mean length and its standard deviation. The calculations for mean length and for standard deviation are shown in the serial version of the program, below. They depend on summing up the sequence-lengths, and the squares of the sequence-lengths, as shown below.
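    The sum and sum-of-squares bookkeeping can be checked against the first sample run below (lengths 7, 2, 3, 16), which should give mean 7.000 and s.d. 5.523. A minimal, MPI-free sketch of just the statistics step:

    ```c
    #include <stdio.h>
    #include <math.h>

    /* Mean from a running sum of n sequence lengths. */
    double mean_of(unsigned long sum, int n) { return (double)sum / n; }

    /* Population standard deviation from the running sum and sum of
       squares, exactly as the serial program computes it:
       sqrt(n * sumsq - sum^2) / n  */
    double stdev_of(unsigned long sum, unsigned long sumsq, int n)
    {
        return sqrt((double)(n * sumsq - sum * sum)) / n;
    }

    int main(void)
    {
        /* Lengths from the first sample run: starts 3, 4, 8, 7 */
        unsigned long lengths[] = {7, 2, 3, 16};
        unsigned long sum = 0, sumsq = 0;
        for (int i = 0; i < 4; i++) {
            sum   += lengths[i];
            sumsq += lengths[i] * lengths[i];
        }
        printf("%d samples: mean %.3f, s.d. %.3f\n",
               4, mean_of(sum, 4), stdev_of(sum, sumsq, 4));
        return 0;
    }
    ```

    This prints "4 samples: mean 7.000, s.d. 5.523", matching the example output.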

    Limit the range of starting values

    For testing convenience, you can limit the range of starting values by using this statement in place of the one in the program below:

            starts[rank] = <my_range> * (double)random() / (double)RAND_MAX;
    

    [4 pts extra credit] For extra credit, use MPI_ANY_SOURCE at the root to receive messages in any order from the minions.

    An example of the parallel output, and a serial version of the program, are shown below. You need only parallelize the serial program. (HOWEVER, if you find it easier to rewrite the program yourself as a parallel program, you are free to do so.)

    $> mpicc -Wall -o hailstone-mpi hailstone-mpi.c -lm
    $>
    $> # maximum start value is 10:
    $> mpirun -np 4  ./hailstone-mpi
    root/caterpillar reports start=3, length=7
     1:  process 1 reports start=4, length=2
     2:  process 3 reports start=8, length=3
     3:  process 2 reports start=7, length=16
    
    4 samples: mean 7.000, s.d. 5.523
    $>
    $> mpirun -np 4  ./hailstone-mpi
    root/caterpillar reports start=672269183, length=430
     1:  process 3 reports start=1060641648, length=87
     2:  process 1 reports start=1047747167, length=237
     3:  process 2 reports start=2098411702, length=313
    4 samples: mean 266.750, s.d. 124.484
    $>
    
    /* the hailstone sequence / Collatz conjecture
    |   2020-09-25
    */
    #include <stdio.h>
    #include <stdlib.h>     // random()/srandom()
    #include <time.h>       // time()
    #include <math.h>       // sqrt()
    
    long unsigned next_hailstone(long unsigned n)
    {
        return (n % 2) ? 3 * n + 1 : n / 2;
    }
    
    long unsigned sequence(long unsigned n)
    {
        long unsigned length = 0;
        while (n != 1) {
            length++;
            n = next_hailstone(n);
            //printf("next n: %lu\n", n);
        }
        return length;
    }
    
    int main(int argc, char **argv)
    {
        srandom(time(NULL));
    
        int myrank = 0, worldsize = 1;
    
        long unsigned starts[ worldsize ];
        for (int rank = 0; rank < worldsize; rank++)
            starts[rank] = random();
            /* Optionally limit the range of starting values with:
            | starts[rank] = <my_range> * (double)random() / (double)RAND_MAX;
            */
    
        long unsigned lengths[ worldsize ];
        long unsigned sum = 0, sumsq = 0;
        for (int rank = 0; rank < worldsize; rank++) {
            lengths[rank] = sequence(starts[rank]);
            sum += lengths[rank];
            sumsq += lengths[rank] * lengths[rank];
    
            printf("sequence starting with %lu: length %lu\n",
                starts[rank], lengths[rank]);
        }
    
        double mean = (double)sum / (double)worldsize;
        double stdev = sqrt(worldsize * sumsq - sum * sum) / (double)worldsize;
    printf("%d samples: mean %.3lf, s.d. %.3lf\n", worldsize, mean, stdev);
    
        return 0;
    }
    
  7. [12 pts]

    Write a program that uses collective communication. The root process must read in a text file, distribute the text to the minion processes, then collect the letter counts and combine them.

    The size of the file can be broadcast to all processes first, using MPI_Bcast(), then the file's contents can be distributed using MPI_Scatter(). Results can be collected using either MPI_Gather() or MPI_Reduce() — each approach has strengths and weaknesses. "Choose wisely."
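    Whichever collective you choose, the combine step is an element-wise sum of the per-process count arrays; MPI_Reduce with MPI_SUM performs exactly this across processes in one call. In plain C (a sketch of the combine operation only, not the required MPI code):

    ```c
    #include <stdio.h>

    #define NCOUNTS 256

    /* Element-wise sum: the operation MPI_Reduce(..., MPI_SUM, ...)
       applies across all processes' count arrays. */
    void combine_counts(unsigned *total, const unsigned *partial, int n)
    {
        for (int i = 0; i < n; i++)
            total[i] += partial[i];
    }

    int main(void)
    {
        unsigned total[NCOUNTS] = {0};
        /* Pretend these are counts returned by two minion processes. */
        unsigned proc1[NCOUNTS] = {0}, proc2[NCOUNTS] = {0};
        proc1['a'] = 3; proc1['b'] = 1;
        proc2['a'] = 2;

        combine_counts(total, proc1, NCOUNTS);
        combine_counts(total, proc2, NCOUNTS);
        printf("'a': %u, 'b': %u\n", total['a'], total['b']);
        return 0;
    }
    ```

    MPI_Gather instead delivers every partial array to the root, which must then do this loop itself; MPI_Reduce does the summing inside the collective.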

    [4 pts] Is your program strongly scalable, weakly scalable, or both? Explain your answer.

    Here is a serial version of this program:

    // Character counter
    // 2020-09-25
    #include <stdio.h>
    #include <stdlib.h>     // malloc()
    
    void count_chars(unsigned *counts, char const *contents, long int fsize)
    {
        for (char const *cp = contents; cp < (contents + fsize); cp++)
        counts[(unsigned char)*cp]++;   // unsigned char: plain char may be signed
    }
    
    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <source-file>\n", argv[0]);
            return 1;
        }
    FILE *h = fopen(argv[1], "r");
    if (h == NULL) {
        perror(argv[1]);
        return 1;
    }
    fseek(h, 0L, SEEK_END);
    long int fsize = ftell(h);
    printf("%s size: %ld\n", argv[1], fsize);
    char *contents = malloc(fsize * sizeof(char));
    
        rewind(h);
        fread(contents, 1, fsize, h);
        fclose(h);
    
        unsigned charcounts[256];
        for (int i = 0; i < 256; i++)
            charcounts[i] = 0;
    
        count_chars(charcounts, contents, fsize);
    
        for (int i = 0; i < 256; i++)
            if (charcounts[i] > 0)
                printf("0x%02x (%3d) :%7u - %c\n", i, i, charcounts[i], i);
    
        return 0;
    }
    
  8. This serial program uses a "Monte Carlo" simulation to estimate the value of pi. Pi equals 4 times the ratio of a circle's area to the area of its enclosing square, so the fraction of randomly thrown "darts" that land inside the inscribed circle approximates pi/4. The user starts the program by specifying how many "darts" to throw, i.e. how many samples to collect.

    Outputs from the program are shown below. This approximation converges slowly, so very many samples are desirable. An animated demonstration is available by running this Python program: pi-darts.py. (It will also demonstrate the value of efficiency....) The compiled C version is much faster (but doesn't do graphical output).
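    The core of the estimator, stripped of MPI and of the program's random_r machinery, can be sketched as follows (this uses POSIX rand_r, a simpler thread-safe PRNG than the random_r setup in the serial program below, purely for a self-contained illustration):

    ```c
    #define _POSIX_C_SOURCE 200112L     /* for rand_r() */
    #include <stdio.h>
    #include <stdlib.h>

    /* Monte Carlo estimate: the fraction of uniformly random points in
       [-1,1]x[-1,1] that fall inside the unit circle approaches pi/4. */
    double estimate_pi(unsigned long ndarts, unsigned seed)
    {
        unsigned long incircle = 0;
        for (unsigned long i = 0; i < ndarts; i++) {
            double x = -1.0 + 2.0 * rand_r(&seed) / RAND_MAX;
            double y = -1.0 + 2.0 * rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)   /* no sqrt needed: r<=1 iff r^2<=1 */
                incircle++;
        }
        return 4.0 * incircle / ndarts;
    }

    int main(void)
    {
        printf("1e6 darts: pi ~ %.4f\n", estimate_pi(1000000, 12345));
        return 0;
    }
    ```

    The error of the estimate shrinks like 1/sqrt(N), which is why the transcripts below need ten times as many darts for roughly one more digit.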

    [12 pts] Parallelize this program to use OpenMPI. This is an "embarrassingly parallelizable" program and should be fairly easy. Also answer the following questions about your program:

    [2 pts] What approach — send/receive, broadcast/reduce, scatter/gather, a combination — did you use for your program?

    [2 pts] What process(es) use each of the "helper" functions init_PRNGs(), random_coordinate(), and significant_digits()?

    [2 pts] How does the accuracy of the approximation depend on the number of terms used: linear dependence? logarithmic? exponential? Explain your choice.

    [2 pts] How does the performance of the program depend on the number of processes used: linear dependence? logarithmic? exponential? Explain your choice.

    Outputs from running pi-darts

    $>
    $> time ./pi-darts-serial 1e7
    total darts: 10000000, in circle: 7852189
    5 significant digits
    approx.: 3.14088
       C   : 3.14159265358979311600
     manual: 3.14159265358979323846
    
    real    0m0.885s
    user    0m0.883s
    sys     0m0.001s
    $>
    $> time mpirun -np 1 ./pi-darts-mpi 1e7
    total darts: 10000001, in circle: 7854387
    5 significant digits
    approx.: 3.14175
       C   : 3.14159265358979311600
     manual: 3.14159265358979323846
    
    real    0m1.159s
    user    0m1.011s
    sys     0m0.105s
    $>
    $> time ./pi-darts-serial 1e8
    total darts: 100000000, in circle: 78544388
    5 significant digits
    approx.: 3.14178
       C   : 3.14159265358979311600
     manual: 3.14159265358979323846
    
    real    0m8.238s
    user    0m8.233s
    sys     0m0.004s
    $>
    $> time mpirun -np 1 ./pi-darts-mpi 1e8
    total darts: 100000000, in circle: 78540680
    6 significant digits
    approx.: 3.141627
       C   : 3.14159265358979311600
     manual: 3.14159265358979323846
    
    real    0m8.475s
    user    0m8.336s
    sys     0m0.098s
    $>
    $> time mpirun -np 4 ./pi-darts-mpi 1e8
    total darts: 100000000, in circle: 78543945
    5 significant digits
    approx.: 3.14176
       C   : 3.14159265358979311600
     manual: 3.14159265358979323846
    
    real    0m2.553s
    user    0m9.062s
    sys     0m0.196s
    $>
    

    Serial version of pi-darts.c

    /*
    | Area-based approximation of pi
    */
    #include <stdio.h>
    #include <stdlib.h> // strtoul(), random_r()
    #include <time.h>   // time()
    #include <math.h>   // sqrt(),    # define M_PI 3.14159265358979323846
    
    // long double only supports 35 digits
    #define PI35 3.1415926535897932384626433832795029L
    //           -.---|----|----|----|----|----|----|
    
    #define PRNG_BUFSIZE 8  // used in random-number generation
    
    // Generate a random value between -1 and +1:
    double random_coordinate(struct random_data *rand_state)
    {
        int rn;
        random_r(rand_state, &rn);
        return (-1.0 + 2.0 * (double)rn/(double)RAND_MAX);
    }
    //-----------------------------------------------------------------------
    
    // Estimate how many of pi's digits are "correct" (plus 2 extra).
    int significant_digits(long double pi)
    {
        long double diff = fabsl(PI35 - pi);
        long double logdiff = -log10(diff);
        return logdiff + 2;
    }
    //-----------------------------------------------------------------------
    
    
    int main(int argc, char **argv)
    {
        int worldsize = 1, myrank = 0;  // MPI-oriented variables, needed here too
    
        /*
        | Use a fancy pseudo-random number generator that is "thread-safe", meaning
        | it can be called by multiple processes and they all get independent values.
        */
        char rand_state_buf[PRNG_BUFSIZE];
    struct random_data rand_state = { 0 };  // glibc: must be zeroed before initstate_r()
        srandom(time(NULL) + myrank);   // different seed for each process
        initstate_r(random(), rand_state_buf, PRNG_BUFSIZE, &rand_state);
    
        long unsigned ndarts = (long unsigned)( (argc >= 2) ? atof(argv[1]) : 1e6 ) ;
    
        long unsigned incircle = 0;
        for (long unsigned i = 0; i < ndarts; i++) {
            double x = random_coordinate(&rand_state);
            double y = random_coordinate(&rand_state);
            if (1.0 >= sqrt(x*x + y*y))
                incircle++;
        }
        printf("total darts: %lu, in circle: %lu\n", ndarts, incircle);
    
        long double pi = 4.0L * (long double)incircle / (long double)ndarts;
    
        int signif = significant_digits(pi);
        printf("%d significant digits\n", signif);
        printf("approx.: %.*Lf\n   C   : %.20Lf\n manual: %.20Lf\n",
            signif, pi, (long double)M_PI, (long double)PI35);
    
        return 0;
    }
    //-----------------------------------------------------------------------