2010 October 26 / |j|d|a|v|i|s|@|c|a|r|l|e|t|o|n|.|e|d|u
Carleton College CS 201, Prof. Joshua R. Davis
A profiler is a program that observes the execution of your program and reports timing data about that execution. For example, if your program is running more slowly than you'd like, a profiler can tell you exactly which function in your program is responsible for the biggest share of the running time, so that you can focus your optimization effort on that function. Profiling data are not perfect; the timing system doesn't have infinite precision, and profiling itself imposes an overhead cost, which can disturb the results. Nevertheless, profiling gives you valuable information about the efficiency of your program. And it's much easier than the hand-timing that we did in Assignment: Efficiency.
In this short assignment, you will use a built-in Python profiler to compare the performance of four different Dictionary
implementations. You will not submit any Python files; instead you will submit a short report on paper. You may work with one partner of your choice. Your work is due Wednesday at the start of class.
You'll be comparing the following four Dictionary
implementations, which you can find on the course web site or in your previous homework.
dictionary_list.py
implements Dictionary
as a Python list of key-value pairs.
dictionary_bstn.py
implements Dictionary
atop an unbalanced binary search tree.
dictionary_avltn.py
implements Dictionary
atop an AVL-balanced binary search tree.
dictionary_table.py
implements Dictionary
as a hash table.
Let's just test the performance of __setitem__
. Write a Python program that performs __setitem__
some large number of times on dictionaries of all four types. In order to keep the results comparable, the same keys and values should be used in all dictionaries. Also, some of the dictionaries rely on hashing, while others don't; in order to keep the results comparable, maybe your keys should all be integers, which (in a typical hash) don't really get hashed.
Have you written your program? Let's say it's called profiling.py
(although I don't care, because you're not handing it in). In the Terminal, instead of typing
to run your program, typepython profiling.py
This activates thepython -m cProfile profiling.py
cProfile
profiler that comes with Python. It prints out a table of timing information. The right-most column indicates which function or method is being described. The column most relevant to us is cumtime
. This reports the total amount of time spent in that function or method, including time spent in inner function or method calls. That is, it tells you the total cost of using the function or method. What do the tottime
and percall
columns report? See Python Profilers for more information.
In general, larger timing tests are better than small ones, because the various sources of timing imprecision tend to balance each other out. Design your timing tests to be as large as is practical.
What happens if you try to run the same tests on Python's built-in dictionary?
I recommend that you type your report. You should describe your testing procedure briefly — that is, what exactly did you ask your dictionaries to do? You should include graphs of the timing data, or at least cleaned-up tables of the data. Use the graphs and/or data to compare the various dictionary implementations. Also, for each dictionary, compare its empirical performance (shown in the data) to its theoretical performance (deduced in class from knowing how the dictionary works). Are there any surprises, or are things pretty much as you'd expect?
An unusual feature of the hash table-based Dictionary
is the time spent in resizing. In the profiling data, is the cumulative resizing time less than the cumulative __setitem__
time? How much of the latter is taken up by the former? Does this make sense?
In my profiling data there is a row labeled {execfile}
. It has a large cumulative time, a small total time, and weird percall times. Is this the case in your data? What is this row about?
Submit your report to me on paper. It will be graded according to these criteria: