Writing log records
The last post in this series introduced the problem briefly. You find that post here.
In this post I’ll talk about the changes made to make that writing of log records fast enough. There were 50 million records that was written. Each of them pretty much in its own transaction. Of course the commit activity caused problem, as did log buffer issues. Some of this could be somewhat remedied with configuration.
The big issue though was that the writes themselves took too much time and too often many session ended up in long contention chains. Yes, it would have been great to have the luxury of redesigning the whole logging situation from the ground up. But, as is often the case, the solution was built such that all systems connecting were implemented in such a way that redesigning was not an option. Fixing the performance of this had to be done without requiring code changes to the systems performing the logging. Oh, joy.
So what caused the problem then? For the inserts it was pretty straight forward. Too many transactions making an insert and a commit. This caused indexes to be hotspots where all processes wanted to write at the same spot. Hash-partitioning had been introduced and that had led to less contention but slower performance. As the partitions existed on different parts on the disks the write head had to be constantly moved and that caused slower service times.
What could we do to make a big improvement while not affecting the code? We’re not talking about just 10-20% of improvement on any area in this case, and even more important was to make the performance stable. That is, the most important thing was to ensure that there were no spikes where an insert suddenly to 20 times longer than usual. The contention chains that was occurring made performance spike such that the whole system became unusable.
The solution here turned out to be something so far from advanced technologies as questioning assumptions. The first time I asked “why do we have these indexes”, most people in the room thought I was just joking around. Eventually they realised that I was serious. After an amusing period of silence where I could see them thinking “Do we need to inform him that indexes are needed to enforce uniqueness and to support referential integrity?”, someone went ahead and did just that. OK, now we were on to a productive discussion, as of course that wasn’t what I meant. The follow up discussion about why we needed referential integrity and uniqueness for this set of data was very enlightening for everyone. To make a long story short, it was not needed at all. It was there because it had always been there and nobody had questioned the need before.
How come we didn’t need data to be unique? Well, this is log-data. That is it tells us what actions has been performed by the system. If some activity would be reported twice, it really wouldn’t be the end of the world. The possible problem that some activity isn’t logged cannot be handled with defining unique constraints. That is pure system design and nothing I could improve or worsen by removing some indexes.
Thus, the indexes was removed together with the foreign keys (referential integrity).
Sounds simple enough, but did it help? Did it ever! In one month after making the change, there has not been one report of one transaction that was anywhere close to take too long. This simple solution made the logging so fast that it is no longer a concern.
The next post in this series will discuss the solution for moving data to the history tables. This process took around 16 hours and it had to become at least three times as fast. As you’ll see, moving all these rows can be done much faster than so.