CompSci 215, Advanced Python

Web Search - B


This lab completes the program that searches for specific phrases within webpages.

Extending the solution

The search terms

As mentioned in the last lab, the desired search terms are in a text file. These terms are regular expressions, although most of them are very simple ones.

  1. Save these lines in a file named "searchterms.txt".
    taxes
    (S|s)enator
    shooting
    accident(s)?/w
    traffic
    
  1. command-line options


    Get the search terms

    The first version of main() created default values for "sourceDir" and "destDir", then checked for command-line options that would change those defaults. This is shown on the right.

    Now add another default value for a file that contains the search terms:

    Then add another if statement to look for a "-s" option and file name:
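    The two additions described above might look like the following minimal sketch. The lab's original main() is not reproduced here, so the helper name parse_options(), the variable name searchFile, and the two directory defaults are assumptions made for illustration. Only the "-i", "-o" and "-s" options and the "searchterms.txt" file name come from this lab.

```python
def parse_options(argv):
    """Return (sourceDir, destDir, searchFile) for a list of command-line arguments."""
    # Default values. The two directory names are placeholders, not the lab's values.
    sourceDir = "inputs/"
    destDir = "outputs/"
    searchFile = "searchterms.txt"      # the new default for the search-terms file

    # Each option must be followed by its value, so it cannot be the last argument.
    if "-i" in argv[:-1]:
        sourceDir = argv[argv.index("-i") + 1]
    if "-o" in argv[:-1]:
        destDir = argv[argv.index("-o") + 1]
    if "-s" in argv[:-1]:               # the new "-s" option
        searchFile = argv[argv.index("-s") + 1]
    return sourceDir, destDir, searchFile
```

    In a script, this would typically be called as parse_options(sys.argv[1:]) after importing sys.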


  2. The function get_searchterms() opens and reads the named file. It treats each line as a search term and builds a regular expression ("regex") that matches the term together with some surrounding context.

    The faculty member who wanted this requested up to 120 characters before the term (fewer if the term is near the start of the text) and up to 1500 characters after it. This context shows why the term was matched. Both values are parameters, so they are easy to adjust later.

    The function returns a dictionary containing all the regexes, keyed by the desired terms.

    Add this function to your program. It can go before or after the read_URL_and_save() function.

    "get_searchterms()" function

    As before, the import statement is placed in the function for convenience; feel free to move it to the script's beginning.
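    As a rough guide, a function matching this description could be sketched as follows. The parameter names before and after, the skipping of blank lines, and the decision to store compiled regexes are assumptions. The pattern layout and the printed lines follow the sample run shown at the end of this lab.

```python
def get_searchterms(filename, before=120, after=1500):
    """Read one search term per line and return a dictionary {term: compiled regex}."""
    import re   # placed inside the function for convenience

    regexes = {}
    with open(filename) as termfile:
        for line in termfile:
            term = line.strip()
            if not term:        # assumption: blank lines are ignored
                continue
            # Up to `before` characters, then the term, then up to `after` characters.
            pattern = ".{,%d}%s.{,%d}" % (before, term, after)
            print(term)
            print("     " + pattern)
            regexes[term] = re.compile(pattern)
    return regexes
```

    Because "." does not match a newline by default, the context in this sketch never extends past the line that contains the term. Passing re.DOTALL to re.compile() would let the context span lines.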

  3. "find_matches()" function


    Find matches

    The next step is to find the matches for these search terms within the text of each URL. Add another function that does this, taking the urlText and the dictionary of search-term regexes. It returns another dictionary, keyed by search term, whose values are lists of matches (each including its context).

    Locate this function somewhere appropriate — before the main() function, before or after get_searchterms().
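    One way to write such a function is sketched below. It assumes the dictionary values are compiled regexes. The demonstration values (demo_terms, the sample sentence, and the short 10-character context) are made up for illustration.

```python
import re

def find_matches(urlText, searchterms):
    """Return {term: [matched text with context, ...]} for one page's text."""
    matches = {}
    for term, regex in searchterms.items():
        # group(0) is the whole match: leading context, the term, trailing context.
        matches[term] = [m.group(0) for m in regex.finditer(urlText)]
    return matches

# Small demonstration with a short context so the result is easy to inspect.
demo_terms = {"(S|s)enator": re.compile(".{,10}(S|s)enator.{,10}")}
demo_matches = find_matches("The Senator spoke. Later a senator left.", demo_terms)
```

    The sketch uses finditer() rather than findall() because a term such as "(S|s)enator" contains a group, and findall() would then return only the group's text ("S" or "s") instead of the whole match.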

  4. Add this code to the end of the main() function. It calls find_matches() and then displays a simple report of the results.
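    The simple report can be sketched like this. In the lab the code sits directly at the end of main(). Here it is wrapped in a hypothetical helper, format_match_counts(), so it can be tried on its own. The line layout imitates the sample run shown at the end of this lab, and the example dictionary is invented.

```python
def format_match_counts(url, matches):
    """Return report lines: each term, then its match count for this URL."""
    lines = []
    for term in sorted(matches):
        lines.append(term)
        lines.append("    %d matches in %s" % (len(matches[term]), url))
    return lines

# Invented example: a dictionary shaped like the one find_matches() returns.
example = {"taxes": ["ctx 1", "ctx 2", "ctx 3"], "traffic": ["ctx 4"], "shooting": []}
for line in format_match_counts("http://thetimes-tribune.com/", example):
    print(line)
```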

     


  5. Report results

    Now add these lines to the main() function. The two open() statements open the two output files for writing. The two function calls at the end of main() produce a readable text file of all the matches and a spreadsheet (".csv") file summarizing the results.

  6. Add the writeReport() function, shown here. Locate it somewhere suitable within the file.

    "writeReport()" function
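    A minimal sketch of a report writer follows. The signature writeReport(reportFile, url, matches) and the text layout are assumptions, since the lab's own version is not reproduced here. The demonstration writes to an in-memory file. In main() the first argument would be the file opened for "report.txt".

```python
import io

def writeReport(reportFile, url, matches):
    """Write every match for one URL, grouped by search term, as plain text."""
    reportFile.write("URL: %s\n" % url)
    for term in sorted(matches):
        reportFile.write("  %s: %d matches\n" % (term, len(matches[term])))
        for context in matches[term]:
            reportFile.write("    %s\n" % context)
    reportFile.write("\n")      # blank line between URLs

# Demonstration with an in-memory file and invented data.
demo_report = io.StringIO()
writeReport(demo_report, "http://example.com/", {"taxes": ["raising taxes again"]})
```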

  7. Also add the writeCSV() function, shown here. Locate it somewhere suitable within the file.

    "writeCSV()" function
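    The summary writer could be sketched with the standard csv module. The signature writeCSV(csvFile, url, matches) and the choice of one row per URL and term (URL, term, number of matches) are assumptions. The lab's actual column layout may differ.

```python
import csv
import io

def writeCSV(csvFile, url, matches):
    """Write one summary row per search term: URL, term, number of matches."""
    writer = csv.writer(csvFile)
    for term in sorted(matches):
        writer.writerow([url, term, len(matches[term])])

# Demonstration with an in-memory file and invented data.
demo_csv = io.StringIO()
writeCSV(demo_csv, "http://example.com/", {"taxes": ["a", "b", "c"], "traffic": ["d"]})
```

    When writing to a real file, the csv documentation recommends opening it with newline="", for example open("summary.csv", "w", newline="").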


  8. Try the solution

    Finally, try out the solution by running the script. Besides the displayed output, you should find two new output files — "report.txt" and "summary.csv". You should see output similar to this (although different in detail):

    $ 
    $ ./websearch-B4.py  -i testinputs/ -o testoutputs/ -s searchterms.txt 
    taxes 
         .{,120}taxes.{,1500}
    (S|s)enator 
         .{,120}(S|s)enator.{,1500}
    shooting 
         .{,120}shooting.{,1500}
    accident(s)?/w 
         .{,120}accident(s)?/w.{,1500}
    traffic 
         .{,120}traffic.{,1500}
    local.txt
    http://thetimes-tribune.com/ encoded as utf_8
        length: 226217
    (S|s)enator
        0 matches in http://thetimes-tribune.com/
    accident(s)?/w
        0 matches in http://thetimes-tribune.com/
    shooting
        0 matches in http://thetimes-tribune.com/
    taxes
        3 matches in http://thetimes-tribune.com/
    traffic
        1 matches in http://thetimes-tribune.com/
    http://www.dailyitem.com/ encoded as utf_8
        length: 725759
        
    
    $ 
    $ ls -l report.txt  summary.csv 
    -rw-rw-r-- 1 bobmon bobmon 100589 Jan  5 17:23 report.txt
    -rw-rw-r-- 1 bobmon bobmon   2566 Jan  5 17:23 summary.csv
    $ 
    

    You can open the "report.txt" file in a text editor and inspect it. You can open "summary.csv" with a spreadsheet program — try it. You can also open it with a text editor.


Conclusion

Congratulations, you have completed a solution to the faculty member's original request. To recap: you read lists of URLs from input files, opened each URL and saved the webpage to an output file, searched each webpage for the search terms (which may be regular expressions), and wrote the results to report files.