
Is It a Good Idea to Establish a Common Baseline for Story Points?

In my previous post, I wrote about how to establish a common baseline for story points across relatively large teams (a few hundred developers). In this post I want to consider whether doing so is a good idea.

The need for a common baseline for story points usually arises from the reasonable desire to know how big the entire project is. To know that, we must know the size of the work to be done by each team. Unfortunately, along with this goal comes the ability to compare teams based on their velocities. Since many managers are constantly looking for ways to compare team and individual performance, it is not surprising that they begin to make such velocity comparisons. Almost all such comparisons are disruptive to the performance of the combined, overall group or department.

A chart such as the one that follows can show a lot of interesting information.

[Figure: Velocities before teams were told they would be compared]

However, this chart can be very dangerous because of how teams will assume the data is being interpreted. Shown a chart like this, a common team response will be to feel that they need to go faster than the other teams. Achieving this additional speed may come from working in a more focused manner (a good thing), but it may instead come from sacrificing quality, leaving important refactorings undone, or a variety of other not-so-good practices.

Some teams may respond to the pressure for their abstract measure of velocity to increase by gradually inflating the number of story points assigned to a story. This can happen in subtle and not particularly nefarious ways that can accumulate into large problems. Consider, for example, a team that is arguing over whether a particular story should be estimated at 5 or 8 points. If the team is under pressure (real or just perceived) to increase velocity, they will be more likely to assign the 8. The next story the team considers is slightly larger. They compare it to the newly assigned 8 and decide to give it a 13. Without pressure to improve velocity, this same team may have given the first item a 5 and the second (slightly larger still) item an 8. In this one scenario the team has inflated their points from 5+8=13 to 8+13=21, an increase of more than 50%. Story point inflation such as this tends to happen very quickly if it happens at all.
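The arithmetic in this scenario is easy to check. The sketch below is illustrative only: it assumes a Fibonacci-style scale (the post itself only uses 5, 8 and 13), and the names `SCALE`, `one_step_up` and `inflation_percent` are hypothetical. It models inflation as every estimate sliding up exactly one step on the scale.

```python
# Assumed Fibonacci-style estimating scale (illustrative; the post only uses 5, 8 and 13).
SCALE = [1, 2, 3, 5, 8, 13, 21]


def one_step_up(points):
    """Return the next larger value on SCALE, modelling a team that, under
    velocity pressure, settles every 'this size or the next?' debate upward."""
    i = SCALE.index(points)
    return SCALE[min(i + 1, len(SCALE) - 1)]


def inflation_percent(honest, inflated):
    """Percentage by which the inflated total exceeds the honest total."""
    return 100.0 * (sum(inflated) - sum(honest)) / sum(honest)


honest = [5, 8]                               # estimates without pressure
inflated = [one_step_up(p) for p in honest]   # [8, 13]
print(sum(honest), "->", sum(inflated))       # 13 -> 21
print(f"{inflation_percent(honest, inflated):.1f}%")  # 61.5%
```

A one-step bias on just two stories is enough to exceed the 50% figure, which is why inflation of this kind can appear so quickly.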

Consider what happened in the next few iterations for the four teams shown in the previous figure.

[Figure: Four teams and their velocities]

Not surprisingly, someone in the Project Management Office distributed the chart showing the similarities over the first three iterations. Two of the teams reacted by instantly inflating their story points. After seeing that, the yellow team followed suit. The green team is either extremely virtuous or they haven’t noticed the charts yet.

So, should you establish a common baseline? Yes, if there are advantages to doing so on your project. If you do, however, you need to go out of your way to create safety around that baseline for the teams. Stress that this isn’t being done as a way to compare teams and that you (and your bosses) know there are many factors that influence velocity, not just how good a team is.


25 Responses to “Is It a Good Idea to Establish a Common Baseline for Story Points?”

  1. For our past teams, it was valuable to have a common baseline for all story points on stories estimated at the release level, to compare cost/value. Beyond that, the teams estimated and tracked their iteration velocity separately. However, each major feature had its own PO, so it wasn’t as important to have a project-level iteration/point burndown for one PO.

    I agree with the damage that comparing team release burndowns can create. Does the “sum of all points” release burndown have less value?

  2. Mike Cohn says:

    Hi Clint–

    I think that a “sum of all points” release burndown chart is one of the primary reasons that managers, executives or product owners want the common baseline. And I agree with them that it presents a useful view of the project. However, that usefulness is tempered a bit if the teams (or team members) are not highly interchangeable.

    For example, suppose we’re writing anti-spyware software and have 200 points of user interface work to do and have 300 points of hard-core C++ work to do. The C++ programmers are highly specialized, are writing code that loads before Windows does, etc. The UI programmers are good, skilled programmers but don’t know much about Windows internals. It’s one thing to show that project as having 500 points left; it’s another to look at the burndown of those 500 points and think that you can shift people between the two teams. The release date of that project is determined by whichever team is burning through “their work” more slowly. So in this case I’d be more interested in team burndown charts rather than a “sum of all points” chart.
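    A minimal sketch of why the slower team sets the release date. Only the 200/300 point split comes from the example above; the velocities and the function names are hypothetical. Each team's remaining iterations are computed separately and the release waits for the maximum, while the pooled "sum of all points" view implicitly assumes anyone can pick up any work.

```python
import math


def iterations_needed(points_left, velocity):
    """Whole iterations a team needs to burn down points_left at its velocity."""
    return math.ceil(points_left / velocity)


def release_iterations(teams):
    """teams maps team name -> (points_left, velocity).
    When people cannot move between teams, the release is ready only when
    the slowest team finishes, so take the maximum."""
    per_team = {name: iterations_needed(p, v) for name, (p, v) in teams.items()}
    return max(per_team.values()), per_team


# The 200/300 point split is from the example; the velocities are made up.
teams = {"UI": (200, 40), "C++": (300, 30)}
overall, per_team = release_iterations(teams)
print(per_team)  # {'UI': 5, 'C++': 10}
print(overall)   # 10

# A pooled "sum of all points" view implicitly assumes anyone can do any work.
pooled = iterations_needed(200 + 300, 40 + 30)
print(pooled)    # 8
```

    With these made-up numbers the pooled view suggests 8 iterations, while the real constraint is the 10 iterations the C++ team needs.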

  3. Mike,

    I agree that if you have component teams (we called them functional teams, composed of people of the same discipline), then the team burndown charts would show that the C++ velocity is the source of concern for the release.

    However, if I were part of that project, I would be concerned about some other things:
    A) Can C++ points really be initially baselined against UI points?
    B) Are component teams really producing value every iteration? If widgets without code, or code without widgets, are delivered, can we measure true effort in points? Can the PO properly prioritize value between widget stories and C++ stories?

    Would this debate lead us to a better organization of mixed teams that each deliver releasable value every iteration? This might lead to better transparency. For example, the imbalance between UI artists and C++ programmers would show up in the first iteration in all or some of the teams.

    Sorry if I’m missing or straying from the point…do you think this is leading to a people over process discussion? ;)

  4. Tim Walker says:

    Great discussion. While noodling on this, I find myself a little on both sides of the fence, though leaning toward one side (I’ll probably land in a cow pattie either way).

    Thinking out loud, I think the goal of the multi-team velocity chart is to understand where the team is and when it will be done. Instead of using story points to measure multiple teams’ progress toward the goal (value), perhaps a chart that showed:

    Along the x-axis is the iteration number and along the y-axis is the “story estimation accuracy,” which is simply estimated velocity minus actual velocity for the sprint, for each team.
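    A minimal sketch of the data behind the chart Tim describes, assuming one (estimated, actual) velocity pair per team per iteration. The team names and numbers are invented for illustration, and `estimation_gap` is a hypothetical helper; plotting is left out.

```python
def estimation_gap(estimated, actual):
    """Tim's proposed y-value per iteration: estimated velocity minus actual
    velocity. Positive means the team delivered less than planned."""
    return [e - a for e, a in zip(estimated, actual)]


# Hypothetical plans vs. results for two teams over four iterations.
history = {
    "Team A": ([20, 22, 21, 23], [18, 22, 20, 24]),
    "Team B": ([30, 30, 28, 29], [25, 27, 28, 29]),
}
for team, (planned, delivered) in history.items():
    # x = iteration number (1-based), y = estimated - actual
    points = list(enumerate(estimation_gap(planned, delivered), start=1))
    print(team, points)
```

    Note that the y-value is a difference in each team's own points, so it can be compared across teams without a common baseline; the risk of teams learning to undercommit is discussed in the replies.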

  5. Mike Cohn says:

    Hi Clint–

    To a large measure I agree with you. However, I was picking a real example of a team with three skills (C++, Delphi, and testing) where the C++ and Delphi programmers are absolutely, positively not interchangeable. We’re talking about very, very, very specialized C++ skills and deep Windows kernel knowledge. The Delphi programmers would be smart enough but would need years of experience before they’d want to (or be allowed to) go anywhere near what the C++ programmers were doing.

    Can a team with very disparate skills like this arrive at a common baseline for story points? Yes, I think so. Certainly the C++ programmers could have done the Delphi work, so they had a feel for the effort involved there. Also, team members usually learn things about the relative effort of the work of others. On a game team, for example, I could never do the animation of a new character, but after a little time with the team I would likely be able to contribute to a discussion of “this character will take longer than that one.”

    But to sink much of this: The C++/Delphi/Test example is perhaps not a good one because, even though we could establish a common baseline, there wouldn’t be a lot of value in doing so. The value usually lies in being able to decide whether we should move people from this part of the project to that part. That wouldn’t have been practical in the case I used.

  6. Mike Cohn says:

    Hi Tim–

    I’ve seen teams graph whether they over or under-delivered in an iteration. A common result in those cases is for teams to feel judged (perhaps rightly so) if they commit and don’t deliver 100%. So teams learn to undercommit when they see charts such as that.

  7. Tim Walker says:

    I guess the thought of ‘gaming’ the system didn’t occur to me, as this would violate the principle of trust. The idea was to increase the accuracy of commitments, with historical accuracy being a predictor of future estimation accuracy. Rather than under- or overestimating, the team is encouraged simply to deliver what it committed to.

  8. Tom Smallwood says:

    I was about halfway through the second paragraph and saw where this was going: a temptation to game the system in order to look good.

    Certainly a risk when transparency through information radiators is employed. Besides re-emphasizing trust, are there other ways to keep point inflation in check?

  9. Mike Cohn says:

    Hi Tom–

    Here are two good ways to help avoid or at least reduce story point (or ideal day) inflation:

    • Be sure to triangulate your product backlog items (user stories). When estimating, the team (or coach / ScrumMaster) should ask a lot of questions of each other like, “So this is twice that one and a little smaller than this other,” and “That would mean this item is like that item and that it’s about the same as this one plus that one combined.”
    • Make sure that items and their estimates are visible to other teams. Post backlogs on walls. At iteration reviews, communicate the estimate on each item as it’s described. This doesn’t need to be more than a one-second mention of the size. But if a team knows that other teams will hear their sizes they’ll be more reluctant to let inflation occur. (“What?!? You called adding one new field to that report 100 points??”)
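    One way to make the triangulation in the first bullet explicit is to record the team's “this is about that one plus that one” statements and check them against the estimates. This is a minimal sketch, not a prescribed practice: the story names, the estimates, the 25% tolerance and `find_inconsistencies` are all hypothetical.

```python
def find_inconsistencies(estimates, claims, tolerance=0.25):
    """estimates: story -> story points.
    claims: (story, parts) pairs recording a triangulation statement such as
    "this is about the same as that one plus that one" (parts = [b, c]) or
    "this is twice that one" (parts = [b, b]).
    Returns claims whose estimates differ from the implied size by more than
    `tolerance` (a fraction of the implied size)."""
    flagged = []
    for story, parts in claims:
        implied = sum(estimates[p] for p in parts)
        actual = estimates[story]
        if abs(actual - implied) > tolerance * implied:
            flagged.append((story, parts, actual, implied))
    return flagged


# Hypothetical stories and estimates.
estimates = {"filter report": 3, "export CSV": 5, "search": 8, "bulk edit": 13}
claims = [
    ("search", ["filter report", "export CSV"]),  # 8 vs 3 + 5 = 8: consistent
    ("bulk edit", ["export CSV", "export CSV"]),  # 13 vs 2 x 5 = 10: flagged
]
print(find_inconsistencies(estimates, claims))
```

    A flagged claim does not say which estimate is wrong; it only prompts the team to revisit the comparison, much as the questions in the first bullet do.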

  10. Quinn Jones says:

    Hi All-

    As part of the Agile Champions Team at my place of work, I would like to share some direct knowledge of the kinds of issues that enforcing a common pointing methodology raises, and the impact it has, across multiple development teams.

    1. Egos. All the teams had what they assumed was the “best” way to point stories. So you end up stepping on egos and pissing off team leaders and the like, even if their pointing strategy is really bad, because you are moving their “cheese.”

    2. Points = Hours of Effort…the great Myth and the biggest enemy we were faced with and trying to eliminate. Almost EVERY team had created their own pointing method to help them break down stories using points to reflect hours of effort. Most teams were using linear pointing methods (1, 2, 3, 4, 5 or 100, 200, 300, 400 or 1000, 2000, 3000, 4000, 5000) and these methods caused a lot of pain and suffering.

    3. Helping spread the knowledge of what makes other teams successful. A common pointing system means that when you have an IS meeting, everyone is talking the same language. If every team is using a different method to point, then how do you apply the successes of one team’s pointing strategies to another, struggling team? How do you go from a team that points using 100, 200, 300, 400, and 500 to a team that uses the Fibonacci Series? It’s much easier to help teams if they are all on the same pointing strategy (preferably the Fibonacci Series).

    4. Why Fibonacci Series…it makes no sense? All the teams love a linear pointing method because they can easily correlate it in their minds to hours of effort, which should only be a distant afterthought of pointing. In order to drive home the fact that pointing should be based on RELATIVE EFFORT or DIFFICULTY between stories, we enforced the Fibonacci Series to help honor and maintain this concept. Plus, the level of effort for any software development task isn’t LINEAR per se but closer to exponential, which the Fibonacci Series models better.

    5. Good luck predicting and analyzing velocity and the like when teams size stories relatively using animals, fruit or vegetables. Yes, you drive home the importance of relative difficulty, BUT you lose your metrics. Leave it to software developers to come up with a story the size of a flounder and an epic that’s a Blue Whale.

    In conclusion, we don’t enforce many standards in our IS department, BUT enforcing story pointing using the Fibonacci Series has had a large POSITIVE impact across our development teams.
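    A minimal sketch of the contrast Quinn draws in point 4, assuming the plain Fibonacci sequence starting 1, 2 (many teams use a modified version) and a rule that raw guesses are snapped to the nearest allowed value; the function names are hypothetical. On the Fibonacci scale each value stays roughly 1.5 to 2 times the previous one, so estimators must choose a relative size, whereas on a linear scale the steps become proportionally tiny and invite hours-style precision.

```python
def fibonacci_scale(limit):
    """Distinct Fibonacci numbers from 1 up to `limit`: 1, 2, 3, 5, 8, 13, ..."""
    scale, a, b = [], 1, 2
    while a <= limit:
        scale.append(a)
        a, b = b, a + b
    return scale


def step_ratios(scale):
    """How much bigger each allowed value is than the one before it."""
    return [b / a for a, b in zip(scale, scale[1:])]


def snap(raw, scale):
    """Round a raw relative-size guess to the nearest allowed value;
    ties go to the smaller value."""
    return min(scale, key=lambda s: (abs(s - raw), s))


fib = fibonacci_scale(21)     # [1, 2, 3, 5, 8, 13, 21]
linear = [1, 2, 3, 4, 5]
print(step_ratios(fib))       # every step is 1.5x to 2x the previous one
print(step_ratios(linear))    # steps shrink toward 1.0x as values grow
print(snap(10, fib), snap(11, fib))  # 8 13
```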

  11. Mike Cohn says:

    Excellent points, Quinn. Thanks for sharing these.

  12. Thushara says:

    Hi Mike,

    I have Teams A, B, and C in my PO.

    Team A assigns points to the user stories in its project, and Teams B and C do the same for theirs. So how do you make this chart meaningful? The estimating isn’t done by the same person in the PO. For example, Team A might give Story X 3 points while Team B gives a similar story in its project 2 points. In the end it depends on how each project team estimates story points, and they differ. Am I missing something here? Could you please explain?

  13. Mike Cohn says:

    Please see the previous post at http://blog.mountaingoatsoftware.com/?p=43 which describes how to create a common baseline.

  14. Mathias Holmgren says:

    If multiple teams are working on the same backlog, items in that backlog will have to be estimated on some sort of standard story point scale. So one scale per backlog seems to be a good rule. If teams have very different skills, perhaps their work should be driven by different backlogs, each with its own SP scale.

    Your thoughts?

  15. Mike Cohn says:

    Mathias–
    Even if teams have different skills, I may want a common meaning for a story point. But you are right that it will be less likely. A common value is most useful when we are making decisions like “should I switch this team from working on such-and-such to helping the other team work on this-and-that?” In a question like that it’s helpful to know the impact that the switch would make, and a common unit provides that. If our teams are so different that this could never happen, then a common unit is not important.

  16. Chris Chan says:

    If the organization really needs to compare productivity and performance across multiple independent teams, then one approach could be to use function points, but only after the user story has shipped.

    That is, do not use function points as estimates of future work, but as measures of what was built and implemented. And to be really cautious not to disrupt the team’s approach (e.g., by encouraging corner-cutting), you could perform the function point count only after the project is completed.

  17. Mike Cohn says:

    Hi Chris–
    That is exactly how I use function points. I documented a case study of measuring productivity this way in User Stories Applied. Thanks for sharing your thoughts here.

  18. Daryl says:

    Hi Mike,

    There are some concerns about baselining user stories because of the different skill levels in a team. Would baselining affect the results of Planning Poker? And how do you reassure the team that this won’t lose the value of using user stories?

    Thanks,
    Daryl

  19. Mike Cohn says:

    Hi Daryl–
    You create the baseline so you can play Planning Poker. Without an established baseline you can’t. Story points can be used to allow people with different skill levels to estimate commonly.

  20. Juan says:

    Hi Mike, great blog!!
    Question: What happens when you are an outsourcing development company serving a huge number of different customers from different industries, requiring expertise in different technologies, management practices and Agile methodologies? And not only on the customer side: what happens inside the company, with all those teams with such different skills and capabilities?

    How do you establish a standard story point baseline for estimation in that scenario? It could be great for estimating effort on new proposals in this environment…

  21. Mike Cohn says:

    Hi Juan-
    I’m glad you enjoy the blog. Thanks.

    In a situation like you describe you’ll want to have a standard meaning for a story point. I don’t necessarily mean saying “one story point = one day of work” as teams try. But a meaning like “one story equals that logon story we did for client XYZ a year ago.” Then all teams can estimate relative to that. (That’s an oversimplification–you want multiple stories to make up a baseline, not just one.) For more on that topic see this blog post on establishing a common baseline for story points.

  22. Juan says:

    Hey Mike.

    I have already read that other post :-) (came to this one through the other).

    You stated, “(That’s an oversimplification–you want multiple stories to make up a baseline, not just one.)” What would be the criteria for choosing those stories? Would you start with the ones providing the top business value to the customer, for example? Take a logon: I would not expect that user story to have much business value to the customer compared to some other features an application might have… What is tough for me is imagining how that set of user stories can serve as a good baseline for new teams, in the sense that the stories you choose reflect something that makes sense as a baseline.

  23. Mike Cohn says:

    Juan–
    The blog post I referenced covers what to select as the stories to use as the baseline:

    Not each estimator needs to understand every item but most people should understand most items. The items being estimated do not need to be new items; some could be from a project finished recently that many estimators remember or worked on. Some items could be artificial; perhaps the team is asked to estimate, “a typical transaction activity report.” If that meant something to most estimators, it would be a good candidate item.

    The goal is not the perfect set of stories for one team but a reasonable set in which most people understand most of the stories.

  24. Shanthi Dev says:

    What is the harm in “one story point = one day of ideal dev work”?
    Why can’t a story point be simple enough that everyone understands it?

  25. Mike Cohn says:

    Hi Shanthi–
    There may not be much harm in that but there’s certainly no benefit. If you’re going to say one point = one ideal day, just call them ideal days and forgo the benefits of story points. I do sometimes start teams that way–when necessary–but I quickly try to get them out of that mode.
