Bugs happen…
Sunday, May 17th, 2009
As a software developer, one is sometimes tempted to ignore some marginal risks. Using a timestamp or a random number to generate unique IDs is something common in many implementations, even if it could create collisions in some rare cases. It’s sometimes hard to motivate yourself to create something that’s guaranteed to work in every case when the chances of a bogus use are so low.
I stumbled on a bug yesterday that reminded me that however low the chances are that a known bug could happen, it definitely will some day, and preferably in production and at an inconvenient moment.
When I started MKGI Chess Club, a web-based chess interface, I imported most of the game rules implementation from another open source chess project. I reviewed it and, in spite of a weird coding style, decided to integrate it. The site has been running for a few years now and has had almost 1 million moves played on it, mostly without any issue in this implementation. The other day, a player notified me that one of his games was stuck. I took some time to reproduce and analyze his problem and finally found what was wrong in the rules code.
This problem would only occur when:
- A player had more than two instances of a given piece. This can only happen through pawn promotion; how often have you actually had three queens in a real game?
- The second and third of these pieces had a reachable square in common. The bug would not occur if the first piece was one of the two!
- The user decided to move one of these pieces to the common square, when he probably had plenty of other moves available.
Here is the actual position of the player who found the bug. After he obtained his third queen, you can see that the second and third ones both have access to the square D5. His actual move was D8-D5. The server acknowledged the move, but displayed it weirdly on the board and did not make a move in return. The engine considered the move ambiguous because, internally, a move is made up of the piece type and the target square. There is a way to extend the move by adding the origin square when two pieces of the same type can reach the same square, but this did not work correctly when there were more than two pieces of that kind.
Here is the way the move was displayed to the user. You can clearly see on the board that the server did not understand the move correctly. Even though the pieces are in the right places, the green trail that should display the latest move shows G8-D5 instead of D8-D5, and the underlying GnuChess engine simply refused to reply to this move.
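To make the ambiguity concrete, here is a minimal Python sketch. It is not the code from the chess project I imported; the function names are made up for illustration, the first queen's square (A1) is an assumption since only D8 and G8 matter here, and blocking pieces are ignored. It shows why a move described only by piece type and target square is not enough, and how always matching on the origin square resolves it no matter how many queens are on the board.

```python
from typing import List, Optional, Tuple

Square = Tuple[int, int]  # (file, rank), both in 0..7


def parse(name: str) -> Square:
    """Convert a square name such as 'D5' into zero-based (file, rank)."""
    return (ord(name[0].lower()) - ord("a"), int(name[1]) - 1)


def queen_reaches(src: Square, dst: Square) -> bool:
    """True if a queen could go from src to dst on an otherwise empty board.

    Assumption for this sketch: pieces standing in the way are ignored.
    """
    df, dr = abs(src[0] - dst[0]), abs(src[1] - dst[1])
    return src != dst and (df == 0 or dr == 0 or df == dr)


def candidates(queens: List[str], target: str) -> List[str]:
    """All queens that could be meant by 'queen moves to target'."""
    return [q for q in queens if queen_reaches(parse(q), parse(target))]


def resolve(queens: List[str], target: str, origin: Optional[str] = None) -> str:
    """Pick the moving queen, using the origin square whenever it is given.

    Raises ValueError if the move is still ambiguous or impossible.
    """
    found = candidates(queens, target)
    if origin is not None:
        found = [q for q in found if q.lower() == origin.lower()]
    if len(found) != 1:
        raise ValueError(f"ambiguous or illegal move to {target}: {found}")
    return found[0]


# Three queens: the first on A1 (assumed), the second and third on D8 and G8.
queens = ["A1", "D8", "G8"]
print(candidates(queens, "D5"))      # both D8 and G8 can reach D5
print(resolve(queens, "D5", "D8"))   # the origin square settles it
```

Because the origin is compared against every candidate rather than only against a fixed pair, the same check works with two, three or nine queens.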
Even though I was not the original author of the code, it taught me a good lesson and reinforced my belief that software should always strictly anticipate these kinds of bad odds!
