2010 October 26 / jdavis@carleton.edu

Assignment: Profiling

Carleton College CS 201, Prof. Joshua R. Davis

Introduction

A profiler is a program that observes the execution of your program and reports timing data about that execution. For example, if your program is running more slowly than you'd like, a profiler can tell you exactly which function in your program is responsible for the biggest share of the running time, so that you can focus your optimization effort on that function. Profiling data are not perfect; the timing system doesn't have infinite precision, and profiling itself imposes an overhead cost, which can disturb the results. Nevertheless, profiling gives you valuable information about the efficiency of your program. And it's much easier than the hand-timing that we did in Assignment: Efficiency.

In this short assignment, you will use a built-in Python profiler to compare the performance of four different Dictionary implementations. You will not submit any Python files; instead you will submit a short report on paper. You may work with one partner of your choice. Your work is due Wednesday at the start of class.

Profile

You'll be comparing the following four Dictionary implementations, which you can find on the course web site or in your previous homework.

Let's just test the performance of __setitem__. Write a Python program that performs __setitem__ some large number of times on dictionaries of all four types. In order to keep the results comparable, use the same keys and values in all four dictionaries. Also, some of the dictionaries rely on hashing, while others don't; to keep the results comparable here too, your keys should probably all be integers, whose hash values (in a typical hash function) are just the integers themselves, so that hashing costs essentially nothing.
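If it helps you get started, here is a rough sketch of what such a program might look like. The module and class names are placeholders for whatever your four implementations are actually called, so you'll need to adjust the imports.

    # profiling.py
    # The module and class names below are placeholders; use whatever
    # your four Dictionary implementations are actually called.
    from listdictionary import ListDictionary
    from sortedlistdictionary import SortedListDictionary
    from treedictionary import TreeDictionary
    from hashdictionary import HashDictionary

    def fillDictionary(dictionary, n):
        # Performs __setitem__ n times. Integer keys keep hashing cheap and
        # ensure that all four dictionaries get exactly the same keys and values.
        for key in range(n):
            dictionary[key] = key

    n = 10000
    for dictionaryClass in [ListDictionary, SortedListDictionary,
                            TreeDictionary, HashDictionary]:
        fillDictionary(dictionaryClass(), n)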

Have you written your program? Let's say it's called profiling.py (although I don't care, because you're not handing it in). In the Terminal, instead of typing

    python profiling.py

to run your program, type

    python -m cProfile profiling.py

This activates the cProfile profiler that comes with Python. It prints out a table of timing information. The right-most column indicates which function or method is being described. The column most relevant to us is cumtime. This reports the total amount of time spent in that function or method, including time spent in inner function or method calls. That is, it tells you the total cost of using the function or method. What do the tottime and percall columns report? See Python Profilers for more information.
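By the way, cProfile can also save its raw data to a file with the -o option, and the standard pstats module can then sort and filter the table. Here is one way to pull out the ten most expensive rows by cumulative time (the file name profile.out is arbitrary):

    # First save the raw data:  python -m cProfile -o profile.out profiling.py
    import pstats

    stats = pstats.Stats('profile.out')
    stats.sort_stats('cumulative')   # sort rows by the cumtime column
    stats.print_stats(10)            # print only the ten most expensive rows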

In general, larger timing tests are better than smaller ones: random timing noise averages out, and fixed overhead becomes a smaller fraction of the total. Design your timing tests to be as large as is practical.

What happens if you try to run the same tests on Python's built-in dictionary?
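To find out, you might add a line like this to your program, where fillDictionary is the helper sketched above:

    # Run the same test on Python's built-in dict.
    fillDictionary(dict(), n)

Compare what cProfile reports for this test to what it reports for your own dictionaries.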

Write Your Report

I recommend that you type your report. You should describe your testing procedure briefly — that is, what exactly did you ask your dictionaries to do? You should include graphs of the timing data, or at least cleaned-up tables of the data. Use the graphs and/or data to compare the various dictionary implementations. Also, for each dictionary, compare its empirical performance (shown in the data) to its theoretical performance (deduced in class from knowing how the dictionary works). Are there any surprises, or are things pretty much as you'd expect?

An unusual feature of the hash table-based Dictionary is the time spent in resizing. In the profiling data, is the cumulative resizing time less than the cumulative __setitem__ time? How much of the latter is taken up by the former? Does this make sense?
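To make the question concrete, here is a rough sketch of how resizing typically enters into __setitem__ in a hash table that uses chaining; the load-factor threshold and other details are assumptions, and your implementation may differ.

    # A rough sketch of hash-table insertion with chaining; the load-factor
    # threshold of 2 and other details are assumptions, not your exact code.
    class HashDictionary(object):
        def __init__(self):
            self.buckets = [[] for _ in range(8)]
            self.size = 0

        def __setitem__(self, key, value):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for pair in bucket:
                if pair[0] == key:
                    pair[1] = value
                    return
            bucket.append([key, value])
            self.size += 1
            if self.size > 2 * len(self.buckets):
                self.resize()

        def resize(self):
            # Doubles the bucket count and re-inserts every pair. Because
            # resize is called from __setitem__, its time counts toward
            # __setitem__'s cumtime but not toward its tottime.
            oldBuckets = self.buckets
            self.buckets = [[] for _ in range(2 * len(oldBuckets))]
            self.size = 0
            for bucket in oldBuckets:
                for key, value in bucket:
                    self[key] = value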

In my profiling data there is a row labeled {execfile}. It has a large cumulative time, a small total time, and weird percall times. Is this the case in your data? What is this row about?

Submit Your Work

Submit your report to me on paper. It will be graded according to these criteria: