09 April 2013

EAST meetup - talking metrics

Yesterday's EAST meetup was focused on test metrics, starting with us watching Cem Kaner's talk from CAST 2012 and then continuing with an open discussion on the webinar and metrics in general.

Cem's talk
Cem talked about the problem of setting up valid measurements for software testing, and the work he presented on threats to validity in measurements seemed very interesting (you can read more about that in the slides). He also talked about problems with qualitative measurements:

  • Credibility: why should I believe you?
  • Confirmability: if someone else analyzed this, would that person reach the same conclusion?
  • Representativeness: does this accurately represent what happens in the big picture?
  • Transferability: would this transfer to other, similar scenarios/organizations?

So far so good.

But Cem also, as I interpreted it, implied that all quantitative measurements we know of are crap, but that if a manager asks for a certain number you should provide it. In this case I agree that we testers in general need to improve our ability to present information but, as I will come back to, I strongly disagree with providing bad metrics, even after presenting the risks that come with them.

How do you measure testing in your company?
Everyone was asked: how do you measure testing in your company? The most common answer was happy/sad/neutral smilies in green/red/yellow, which was quite interesting since it relates closely to emotions (at least as it's presented) rather than "data".

The meaning of the faces varied though:

  • Express progress: start sad and end happy (= "done")
  • Express feelings so far: the first two weeks were really messy, so we use a sad face even if it looks more promising now
  • Express feelings going forward: good progress so far, but we just found an area that seems really messy, so sad face (an estimate)
In most cases some quantitative measures were used as input but weren't reported.

My personal favorite was an experiment Johan Åtting talked about, where a smiley representing the "mood" so far (item two in the list above) is placed on a scale representing perceived progress (see picture). It seemed like a straightforward and nice visual way to represent both progress and general gut feeling. The measurement of progress in this case was solely qualitative, if I understood correctly, but the approach would work just as well if you prefer to measure progress quantitatively.


There were also a couple of interesting stories, both from the great mind of Morgan. The first was an example of a bad measurement: managers basing their bonuses on the lead time for solved bug reports. This worked fine as long as developers had to prioritize among incoming reports, but when they improved their speed the measurement dropped like a stone (suddenly years-old, previously down-prioritized bugs were fixed) and managers got upset.

The second story was about a company where developers and "quality/production" (incl. testers) were separated. Testers in this case tested the components sent from developers in isolation before they went to production. However, when the components were assembled tons of problems arose, problems that customers reported back to the developers without quality/production knowing about it. This led to a situation where quality/production couldn't understand why managers were upset about bad sales; the product was fine in their view. The situation improved when Morgan started sending the number of bugs reported by customers (a bad, quantitative measurement) to the quality/production department.

An interesting twist came later, when they tried to replace the bad quantitative measurement with something more representative and met a lot of resistance, since management had learned to trust that number. I asked him if, in retrospect, he would have done anything differently, but we never got to an answer.

I shared a creative test leader's way of dealing with numbers. She reported certain information (like test case progress and bug count) but removed the actual numbers, so when presenting to upper management she simply had visual graphs to demonstrate what she was talking about. As far as I know this was well received by all parties. A rough sketch of the idea is shown below.
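Just to illustrate the idea (my own sketch, not how she actually produced her reports; the weekly data and the use of Python with matplotlib are assumptions purely for illustration): plot the trends but hide the numeric tick labels, so the shape of the curves carries the message instead of the absolute numbers.

    # Hypothetical sketch: show trends without exposing the actual numbers.
    # The weekly data below is made up purely for illustration.
    import matplotlib.pyplot as plt

    weeks = range(1, 9)
    tests_executed = [5, 12, 20, 26, 30, 41, 55, 60]
    bugs_found = [2, 6, 9, 9, 14, 15, 19, 20]

    fig, ax = plt.subplots()
    ax.plot(weeks, tests_executed, marker="o", label="Test case progress")
    ax.plot(weeks, bugs_found, marker="s", label="Bugs found")

    # The key trick: keep the shape of the curves, drop the actual numbers.
    ax.tick_params(labelleft=False, labelbottom=False)
    ax.set_xlabel("Time")
    ax.set_ylabel("Trend (no absolute numbers)")
    ax.legend()
    plt.show()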

Finally, an interesting comment from Per: "Often I find it more useful to say: give us x hours and I can report our perceived status of the product after that".

Epiphanies (or rather interesting insights)
During the discussions I had a bunch of interesting insights.
  • Measurements are not a solution to trust issues!
  • Instead of saying "we have terrible speech quality" or "the product quality is simply too bad", we could let the receiver listen to the speech quality or demo the aspects of the product we find bad. It's a very hands-on way to transfer the information we've found.
  • Ethics is a huge issue. If we want someone to believe something we can (almost) always find compelling numbers (or analysis for that matter).
  • Measuring progress or making estimates will not change the quality of the testing or the product's state after a set amount of time (just a reminder of what measuring doesn't achieve, and that it comes at a cost).
  • If a "bad measurement" can help us reach a better state in general (like in Morgan's example), is it really a bad measurement? (touching on ethics again).
  • When adding quantitative measurements to strengthen our qualitative assessments, are we digging our own grave? (risk of communicating: so it's not my analysis/gut feeling that matters, it's the numbers)
  • How a metric is presented is often more important than the metric itself.
  • In some of the cases brought up, the measurements weren't even missed once they were no longer reported.
  • Don't ask what reports or data someone wants, ask what questions they want you to answer.
  • Cem talked about the transfer problem for students, which makes it hard for them to understand how, for instance, a social science study can relate to computer science. I think the same problem occurs when we move testing results into a world steered by economics (numbers).
  • Even bad measurements might be useful for highlighting underlying problems. Once again, Morgan's examples somewhat show this, and Per talked about how an increase in static code analysis warnings was an indication that programmers might be too stressed (if I interpreted it correctly). In these cases it's important to state what the measurement is for and that it's just a potential indication.
Measurement tampering
We talked about how we easily fall into bad habits/behavior when quantitative measurements become "the way" to determine test status.

Bad behavior when measuring test case progress:
  • Testers saving easy tests so they have something to execute when the progress is questioned
  • Testers running all simple tests first to avoid being questioned early on
  • Testers not reporting the actual progress, to avoid putting too much pressure on themselves or to fake progress to calm people down.
  • Testers writing tons of small, simple tests that typically test functionality in isolation, which creates a lot of administrative overhead as well as a risk of not testing how components work together, all to ensure steady test case progress.
  • Test leaders/managers questioning testers when more test cases are added (it screws up "progress"); as a result, testers avoid broadening the test scope even when the scope is obviously flawed.
Bad behavior when measuring pass/fail ratio:
  • Ignoring bugs occurring during setup / tear down.
  • Slightly modifying a test case to make it pass (making it conform to the actual result rather than checking whether that behavior is correct).
  • Slightly modifying a test case to make it pass (removing a failing check or an "irrelevant part that fails").
Culture
We also talked about how the culture (not only related to test measurements) in various countries affects testing. One question was: why is CDT so popular in Sweden? Among the answers were low power distance (Geert Hofstede, cred to Johan Åtting), the Law of Jante, and that we're not measured that much in general (e.g. late grading in schools, no grading in numbers, etc.).

We also talked about drawbacks of these cultural behaviors (like how hard it can be to reach decisions when everyone should be involved/agree).

Finally, and mostly, we talked about how our view on testing and measuring sometimes collides with other cultures, with actual examples from India, Poland and the US. This discussion was a bit fragmented but feel free to grab me if you want to hear about it.

Summary
This was a great EAST meetup and I really feel sorry for the guys I know who like this topic but couldn't attend. Definitely a topic I hope (and think) we'll get back to!

Finally, a lovely quote from Magnus about a private investigator estimating the time needed to solve a crime:
"Let's see, there are 7 drops of blood, this will take 3 weeks to solve".

Good night! :)

2 comments:

  1. Great recap of your meet up! I wish I could have joined. Talking metrics gets my blood boiling so maybe it was good I wasn't there ;)
    I hope to join in June!

  2. Thanks Erik for a very good summary of what we discussed.
