Algorithm Complexity … it’s still a thing, folks!

Recently, we had a situation where a weekly scheduled task in a client’s business pipeline was taking a bit too long to complete considering the amount of data that was being processed (around 20 minutes). Since the task was not time critical the running time was not really a problem for the business. Extending and testing the app however quickly immediately got unacceptably tedious and coffee-break inciting for the development team, so we decided to look into it and see if we can make things better. In this blog post I will walk you over our investigation and refactoring process, and hopefully give you a nice reminder of why working out code & algorithm complexity is a vital skill in our line of work.

Quickly enough, we found that our task’s spent most of its running time on building just one out of the 15 or so XML files the task was creating. This file export has a large, somewhat complex and deeply nested data model, and was being populated with data coming from roughly 20 tables in the database.

First things first

Our first step was to figure out how we are going to test any changes that we make don’t break the existing implementation. The code had been functionally tested and was already on production, so we needed to ensure that we won’t be introducing any regressions. Some unit tests and integration tests were already in place and would do the job, but for added measure we setup a test data set to compare the existing and refactored output side by side.

Surely, it’s not our fault

With our testing strategy in place we moved on to examining the implementation. Naturally, the first wrong conclusion we jumped to was that it was (of course!) not our code’s fault and it was the database queries and Oracle’s provider for .NET. Oh, the shock we felt when this wasn’t the case… (though we were able to shave off a few seconds by increasing the FetchSize on the OracleCommand to save a few trips to the database).

What’s the problem?

With the database ruled out, I started going through the logic proper. All the data sets were already loaded in memory but on a conceptual level, the code boiled down to something like:

foreach (var n_item in records_n)
    foreach (var m_item in records_m)
        //Build the model here, with numerous searches across all lists
       result.SomeChildListB = records_p
          .Where(x => x.Id == n_item.Id)
          .Select(p => p.Name)
       result.PropertyC = records_q
          .SingleOrDefault(x => x.Id == m_item.Id)?.Name;

        // More operations here, with r,s,t,u,v,w ...

Well, the complexity here is O(n*m*(p+q+r+s+t+u+v+w)). Given that each list contains thousands to tens of thousands of items, the execution time is bound to be high.

Ok, so how do we improve this?

What we did was to restructure the list using a Dictionary of Dictionaries, with the keys being the identifiers of the root and child tables (n and m), respectively. This allowed us to refactor the code with minimal intervention on the output building section, to reduce the possibility of errors. In essence, this boils down to:

foreach (var item_n in records_n)
   dictionary.Add(item_n.Id, new Dictionary<int, DtoModel>());

 foreach (var item_m in records_m)
    dictionary[item_m.ParentId_N].Add(item_m.Id, new DtoModel()
        //Set some properties here

foreach (var item_p in records_p)

//Rinse and repeat for the other lists

The time complexity of building our dictionary is O(n+m+… + w) – inserting an item into a dictionary approaches O(1) so the complexity of building the dictionary is the sum of all items added.

Our data model building code then becomes:

foreach (var item in dictionary.Keys)
   var child_items = dictionary[item];

   foreach (var child_item in child_items)
       //Build the model here, with numerous searches across all lists
       result.SomeChildListB = child_item.Value.P_List
          .Select(x => x.Name).ToList();

       // More operations here, with q,r,s,t,u,v,w ...

For building the model, we end up with a worst case of O(n*m) operations: all records in the q,r,s,t… collections have been inserted in the correct “buckets” and there’s no need to do any searches, so the complexity of operations with these collections remains O(1). The worst-case scenario would occur only when all child items correspond to a single parent, which is not the case in our data as the child items are evenly distributed. In effect, the average performance approaches O(n)! The final result is a running time of less than 5 seconds, as opposed to 15-20 minutes before the changes.

What are the key takeaways from this relatively simple scenario?

Working code first. The initial implementation did its job and focusing on performance during the initial implementation would have likely resulted in numerous bugs.

Calculating complexity is cool. Sure, the same refactoring can be done without knowing or thinking of algorithm complexity, just by stating “it’s X times faster” in the end. This means however that we lose perspective of the why and how– the understanding of how things work underneath. That’s also how we lose control of the code we write.

So maybe, just maybe sometimes we must roll up our sleeves and think of code complexity ourselves.

Further Reading:

This post was written by Petar Ivanov