Wednesday, March 19, 2008

Program Testing For The Sake Of Learning

A number of years ago, I used to compete in the TopCoder on-line programming contests. As an educator, I was fascinated by the TopCoder environment, less by the contests themselves than by the so-called Practice Rooms. These were archives of problems from old contests, where you could submit solutions and run them against the test data used in the actual contest.

I was amazed by how hard many people would work in the Practice Rooms. I wished I could get my students to work half that hard! At first, I thought these people were motivated by the idea of doing better in the contest. And, indeed, that was probably what got most of them to come to the Practice Rooms in the first place. But then I noticed that many of the people working hard in the Practice Rooms competed only rarely, if at all. Something else was going on.

Eventually, it dawned on me that the secret lay in the ease of running system tests. As teachers, we know that feedback is crucial to learning, and that quick feedback is better than slow feedback. These programmers were getting feedback within a few seconds, while they were still “in the moment”. The experience of this tight, nearly instantaneous feedback loop was so powerful that it kept them coming back for more. The burning question, then, was: could I re-create this experience in the classroom?

Why test?

Of course, program testing has been part of CS education forever. However, there are several different reasons for testing, and which reason or reasons the teacher has in mind will strongly affect how testing plays out in any particular class.

Here are several common reasons for testing:

Improved software quality. Ironically, the number one reason for using testing in the real world is the least important reason to use it in the classroom. Most student programs, especially in early courses, will never be used for real. What matters is not that students get these programs exactly right, but what they learn in the process.

Easier grading. Grading programs can be hard, especially if you have a lot of students. Naturally, many CS educators seek to automate this process as much as possible. Many systems exist that will run student programs against a set of instructor tests and record the results. These results may be used as a guide for human grading, or they may entirely determine the student's grade. When testing is performed as part of the grading process, it is common for the actual test cases to be hidden from the students.

To learn about testing. Of course, testing is a vital part of real-world software development. Therefore, it is important to expose students to the testing process so they have some idea of how to do it. However, care must be taken here, especially in early courses. Hand-crafting good test cases is hard work. If students are forced to fight through a heavy-weight testing process on toy problems, including making up their own tests, the lesson they frequently take away is that testing is much more trouble than it's worth.

Testing as design. In recent years, test-first and test-driven development have begun to spread from agile programming into the classroom. Advocates view such testing as part of the design process. By writing tests before writing code, students are forced to consider what the behavior of the program should be, rather than diving in without a clear idea of what the program is supposed to do.
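To make the idea concrete, here is a tiny test-first sketch in Haskell. Everything in it is invented for illustration: the tests for a hypothetical median function are written first, and the function itself is deliberately left as a stub until the tests have pinned down the intended behavior.

    -- Test-first sketch: the tests specify the behavior of a function
    -- that does not exist yet. `median` is a made-up example.
    median :: [Int] -> Int
    median = error "not yet implemented"  -- filled in after the tests

    testMedian :: Bool
    testMedian = and
      [ median [3]       == 3
      , median [1,2,3]   == 2
      , median [5,1,3]   == 3  -- input order should not matter
      ]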

Another way

All of these reasons are important, but none of them quite capture what I was seeing in the TopCoder practice rooms. There, people were becoming addicted to quick, easy-to-run tests as a powerful learning experience. This suggests viewing program testing as a way to improve learning. Not learning about testing, but learning about the content being tested.

In an attempt to re-create the relevant parts of the TopCoder experience in the classroom, I adopted the following process. With most programming assignments, I distribute some testing code and test cases. Students then run these tests while they are developing their programs, and they turn in their test results along with their code.

Note that this is an extremely lightweight process. All I'm doing is adding maybe 20 lines of test-driver code to the code I distribute for each assignment, and I can usually cut-and-paste that code from the previous assignment, with only a few tweaks. I then add maybe 50-100 test cases, either hardwired into the code or in a separate text file. I usually generate perhaps 5 of the tests by hand, and the rest randomly. Usually this process adds about 20-30 minutes to the time it takes me to develop an assignment.
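For the curious, here is roughly what such a driver might look like in Haskell. Everything in it is invented for illustration: solve stands in for the function being assigned, expected for my known-good answers, and the inputs are five hand-picked cases plus 95 random ones generated from a fixed seed.

    import System.Random (mkStdGen, randoms)

    -- The function under test (a stub standing in for the real assignment).
    solve :: Int -> Int
    solve n = n * (n + 1) `div` 2

    -- The instructor's known-good answers.
    expected :: Int -> Int
    expected n = sum [1 .. n]

    -- A few hand-picked inputs, including the usual edge cases...
    handPicked :: [Int]
    handPicked = [0, 1, 2, 7, 1000]

    -- ...plus a batch of random inputs (fixed seed, so runs are repeatable).
    randomInputs :: [Int]
    randomInputs = take 95 (map (`mod` 1000) (randoms (mkStdGen 2008)))

    main :: IO ()
    main = do
      let inputs   = handPicked ++ randomInputs
          failures = [n | n <- inputs, solve n /= expected n]
      mapM_ (\n -> putStrLn ("FAILED on input " ++ show n)) failures
      putStrLn (show (length inputs - length failures) ++ " of "
                ++ show (length inputs) ++ " tests passed")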

I think CS teachers are often scared away by ideas like this because they imagine a fancy system with a server where students submit their programs, and the server runs the tests and records the results into a database. That all sounds like a lot of work, so teachers put it off until “next year”. In contrast, what I'm suggesting could probably be added without much hassle to an assignment that is going out tomorrow.

Similarly, teachers are sometimes scared away by the need to come up with the test suite, thinking it will take too much time to come up with a set of high-quality tests. But that's not what I'm doing. Experience has shown that large numbers of random tests work essentially as well as smaller numbers of hand-crafted tests, but at a fraction of the cost. (See, for example, the wonderful QuickCheck testing tool.) Besides, I'm not trying to guarantee that my test suite will catch all possible bugs, which is a hopeless task anyway. Instead, I'm providing a low-cost test suite that is good enough to gain most of the benefits I'm looking for.
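For a taste of the QuickCheck style, here is a minimal property test. The property itself (reversing a list twice gives back the original) is just a stock example, not one of my assignments; QuickCheck invents the random inputs for you.

    import Test.QuickCheck

    -- Reversing a list twice should give back the original list.
    prop_reverseTwice :: [Int] -> Bool
    prop_reverseTwice xs = reverse (reverse xs) == xs

    main :: IO ()
    main = quickCheck prop_reverseTwice  -- checks 100 random lists by default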

Benefits

And just what are those benefits? Well, since I started using this kind of testing in my homeworks, the general quality of submissions has gone up substantially. Partly, this is because of a drastic decrease in the number of submissions that are complete nonsense. When I did not require tests, students would write code that seemed correct to them at the time, and turn it in without testing it. Now, they can still turn in complete nonsense, but at least they'll know that they are doing so. Usually, however, when they see that they're failing every test, they'll go back and at least try to fix their code. In addition, many students who would have gotten the problem almost right without testing now get it completely right with testing.

Which leads me to the second benefit—there has been a noticeable improvement in the students' debugging skills. As one student put it, “[The testers] also taught us how to troubleshoot our problems. Without those test cases, many of us would have turned in products that were not even close to complete. This would have meant for worse grades and harder grading for the instructor. We also would not have been able to use and develop our troubleshooting skills. When we don't know that something is wrong, it is very hard to try to test for the failures.”

However, the main benefit is that the students are learning more. They are getting needed feedback while they are still “in the moment”, and can immediately apply that feedback, which is all great for learning. Contrast this, for example, to feedback that is received after an assignment has been turned in. Often such feedback comes days or weeks later. Students often don't even look at such feedback, but even if they do, they have often lost the context that would allow them to incorporate the feedback into their learning. Even with an automated system that gives feedback instantly, if students do not have an opportunity to resubmit, then they will usually not bother to apply the feedback, and so will miss out on a substantial portion of the benefit to their learning.

As another student put it, “The automated testers are much more than just a means of checking the block to ensure the program works. They function almost as an instructor themselves, correcting the student when he or she makes a mistake, and reaffirming the student's success when he or she succeeds. Late at night, several hours before the assignment is due, this pseudo-instructor is a welcome substitute.”

Grading

I have the students turn in the results of their testing, which makes grading significantly easier. Failed tests give me clues about where to look in their programs, while passed tests tell me that I can focus more on issues such as style or efficiency than on correctness.

But this is merely a fringe benefit. Note that I only use the test results to help guide my grading, not to determine grades automatically. I worry that when grading becomes the primary goal of testing, it can interfere with the learning that is my primary goal. For example, testing when used for grading often breaks the tight feedback loop where my students learn the most.

Also, when testing is used for grading, the instructor's test suite is often kept hidden from the students. In contrast, I can make my test suites public, which saves me all kinds of headaches. I'm not worried that a student might hardcode all the test cases into their code just to pass all the tests, because I'll see that when I look at their code. (In fact, this happens about once a year.) Students like having the test cases public because, if they don't understand something in the problem statement, they can often answer their own questions by looking at the test cases.

Admittedly, sometimes students take this too far. I occasionally catch them poring over the test cases as if reading tea leaves, instead of coming to ask me a 30-second question.

Crawl-Walk-Run

I am not advocating that this approach to testing should be used everywhere in the curriculum. Among other things, I agree that students should have the experience of creating their own test cases. The question is when.

I see my approach as being most useful in beginning programming courses, and in courses that are not nominally about programming. For example, it works particularly well in Algorithms.

In other programming courses, however, I believe that a crawl-walk-run approach is appropriate. The “crawl” step is about attitude rather than skills; it is to convince students that testing is valuable. I believe my approach does this, especially if you occasionally leave out the tests and explicitly ask students to compare the experiences of developing with or without tests.

The “walk” step might be to have students generate their own tests, but using instructor-supplied scaffolding. The “run” step might be to have students generate both the tests and the scaffolding. I admit, however, that I have not taught the courses in which those steps would be most appropriate.

8 comments:

rgrig said...

This term I'm giving the lectures for an undergrad course. It took me one week to develop the server (no database though, just plain files) and overall I feel that it saved a lot of my time.

The server is used for much of the grading (70%). I believe that some of your concerns do not apply.

...testing when used for grading often breaks the tight feedback loop where my students learn the most.

Students get to submit as often as they like before the deadline with no penalty.

...when testing is used for grading, the instructor's test suite is often kept hidden from the students.

Just like in TC, examples are public and tests are hidden. The examples have the benefit of making blunders easy to spot. The (hidden) tests have the advantage that they make students think hard about the problem.

Even if the grading is automatic I still look at what students submit and provide (public) feedback. (I noticed that most of them had used Math.pow(2,k) instead of (1<<k), so I explained what each expression does; I noticed that some appear to use the server as a compiler, so I explained how to compile locally and how to come up with your own tests. And so on.)

There are still problems:
1. Students ask few questions on the forums.
2. I believe that some got discouraged and don't even try to submit; all those that try do well.
3. I think I (still) have the tendency to pick problems that are too hard.

Chris Okasaki said...

rgrig,

That sounds pretty close in spirit. I assume that when a student submits something, they get feedback immediately? What form does that feedback take for the hidden tests? That is, do they get to see the input and expected output for the hidden tests that they missed?

A server is probably a better solution than what I'm doing. Among other things, a server lets you think about doing things like tracking the most common errors.

But I wanted to point out, to people who don't have the one week, or the know-how, to develop such a server, that they shouldn't let that stop them: you can get by just fine with a little bit of testing code manually added to each assignment.

Harald Korneliussen said...

When I was a student, I used to do that, too: Go to topcoder and practice in the practice rooms. I considered participating in the contests, but since they so rarely fitted in with my schedule, I just timed myself on some of the practice tasks instead.

But what was really fun was to go over other people's old solutions, and see if you could find errors in them that hadn't been spotted yet. I found a lot, and I think I learned a lot from it...

Perhaps it's too negatively competitive to reward students for breaking each other's programs with corner-case inputs, but it really is very enlightening. One learns to look more at what could go wrong in a program, instead of being blinded by what one wants it to do.

Chris Okasaki said...

Harald,

You're absolutely right. The ability to look at other people's solutions is an important part of the TC practice rooms that I have not tried to replicate. And not just one person's solution, but several, written in different styles, perhaps in different languages. And not being quite sure that each solution is correct, needing to look at each one critically.

All this can be great for learning, but it's obviously awkward when grades are involved.

Chris League said...

I'm not so familiar with TopCoder, and apart from some half-hearted work with QuickCheck, I'm not sure I have the full-blown unit-testing religion either. But this seems like a very useful application.

I've been toying with an idea to provide a similar kind of immediate, automatic feedback, and just today I posted a mockup of a tool (vaporware for the moment) to help beginners practice the rudiments of programming. (Flash+audio required to see my short presentation about the idea.)

I'd be thrilled to get some feedback on the idea from you, Chris, and the great community you're building here.

Chris Okasaki said...

Chris,

I recommend looking at Turing's Craft. They don't do exactly what you propose, but they may be close enough to be useful. I've never used their tool, but their demo looks pretty good.

rgrig said...

Chris,

The feedback is immediate. The students see their output (and the reference) only for examples. For tests they only get a score proportional to how many passed (and I tend to keep "1 point = 1 test"). Here's what a student says:

[...] i myself thought that i had a working program when i first submitted it only to have it pass 3 tests [out of 20], it took some trial and error before i was able to fix the errors and end up with a satisfactory grade. But it also helped me better understand how the program worked and what was being asked.

Ben said...

As a graduate of 3 of Dr. Okasaki's classes I can say that I've found the testing very useful. Something I don't believe he necessarily mentioned is that with the output from the tester readily at hand, students are far better informed when they come in for questions. This includes questions among their peers. Knowing the specific input that is crashing your program combined with the output allows the student to analyze their code and come up with possible reasons for bugs. If they aren't able to solve the problem on their own, they can at least speak intelligently about what the problem is, and are often more easily coached towards the correct solution.

However, as a student I also witnessed a downside to this approach, albeit one that only applies to students who are less interested in their learning than their grade. A common effect of this approach is for peers to communicate "quick fixes" based on which test cases are being failed. For example: "Hey Tom, it's Bill. I can't pass test case 12." "Bill, you forgot such-and-such base case in your recursion," or (worse yet) "Bill, just copy this into your code."
While learning-oriented students can use the test cases to coach each other, they are more often used to fix the problem without understanding the cause or learning to fix it on one's own.