We are living in very extraordinary times. We have encountered a lot of things that we would say any of these crazy things wouldn’t happen if somebody said us two years ago about the pandemic and its effect. It is not easy to spend one day without hearing something about it. Earlier today I saw the news about COVID vaccination saying that if not well-developed countries couldn’t get enough number for their citizens, the cost of medical treatments for patients those who couldn’t get vaccinated in time will outweigh the cost of vaccination itself. As a software engineer, I thought this analogy is suitable for a lot of situations in our industry, and I would like to talk a little about it.
When we are started to develop some kind of software things are not very complicated at the very start. Having great frameworks and boilerplate codes helps us to quickly start something and create a few working prototypes. But after some time things get complicated, even for a small change we have to think about the effect on the overall structure. A little further down the road, these little time-consuming dull problems increasing vastly, we have to think and act fast especially if there is a deadline pressure or being rushed by a product manager. At that point we have to create a feasible quick solution whether it is the best or not, it is just working now and everyone is pleased. If it is the best possible solution, we are one of the luckiest people on earth or we are very smart. If it is the other way around hence not the very best solution dust swept under the carpet eventually will be exposed and we have to deal with it and maybe requiring even more effort. I’m not inventing the wheel here, we have a term for this situation, technical debt. Vaccination analogy perfectly explains technical debt, cost of not doing proper solution today would possibly be much smaller as opposed to postponed one.
A real-world exemplary
After stating the obvious, I would expand the discussion by giving one real-world example but this time about our expertise. In my current sprint cycle, I have to develop an additional feature for our existing API which is responsible for processing several images with some kind of operations depending on the customer group. I knew that some of our API clients had encountered timeout problems while processing a bunch of images and I want to solve this issue since I was going to touch that endpoint already. But how am I going to decide on the proper timeout value? I looked for similar methods, some of them had very huge timeout values(60.000ms) and some of them had default values(10.000ms) and my problematic method had also a default timeout value. I decided to go with a rational way (since I’m an engineer), deciding proper timeout value according to prior execution times of the current API. Then, I opened up an in-house analytic tool that accumulates access logs and executed queries that will give me the execution time for the API. However, lacking index on the columns the query took a lot of time and even I got the warning from some system administrator that I should kill the query since it was taking too much time. After jumping back and forth the idea of fixing or leaving it the way it is now, I decided to not take any action in the end. Without having enough data it would not be logical to give any bigger timeout value since it could also cause customers to wait unnecessarily. I want to call attention to two different points here. First is the obvious one, eventually, one customer is going to have timeout issues with the API and we have to spend the effort to analyze the proper solution thus creating technical debt. Not solving the problem now will cost more in the future, on the customer side struggling with problem plus with the customer service agent creating appropriate Jira issue and ultimately developer to fix timeout problem. I should have spent more time, look for another logging tool and try to get numbers to solve the problem, right? I think there is another catch here.
Calculating real cost
I know that I started with stating the obvious one for the problem, but I think as almost with any problem there isn’t any silver-bullet solution to the problem and generally answer depends on the context. Let’s suppose I spent more time solving the problem, digging more into tools, and finally found the appropriate timeout value. But what if that specific endpoint is not used that much or the problem is a one-time network problem. Spending a lot of time for solving a non-critical problem is not logical either. So I think the preferred attitude should be calculating the seriousness of the problem and the impact that will create when solved, along with the required cost. I think we create a technical debt only if the problem is worth the effort and we must take it into account when deciding what to spend our precious time on.