Dan Byström’s Bwain

Blog without an interesting name


Posted by Dan Byström on December 29, 2013

This post contains a synopsis of an algorithm I used to cut down execution time of the following code:

foreach (var mline in mlines)
  foreach (var xline in xlines)
    if (xline.AIds.Contains(mline.AId) && xline.BIds.Contains(mline.BId) && xline.CIds.Contains(mline.CId))
      // do something useful...

Given that there are 100,000 “xlines” and an average of 4 items in the AIds/BIds/CIds arrays – the execution time drops from 2 hours and 54 minutes down to 47 seconds when matching 1,000,000 “mlines”.

Since I was about to write this down anyway and it contains an unusual data structure – dictionaries of hash sets – I thought that I could just as well publish it.

Given problem: match something called “m-lines”:

# A-Id B-Id C-Id Relevant data…
0 34 56 45
1 12 39 776
2 19 56 92
3 576 243 646

with something called “x-lines”:

# A-Ids B-Ids C-Ids Relevant data… Match with m-lines
0 12,34,59 18,39,56,89 45,78 0
1 12,19,59,117 39,56,243 78,646
2 19,34,35,576 18,56,89 45,92,776 0,2
3 34,35,117,576 56,243 45,78,646,776 0,3

The rule for a match is simple: an m-line matches an x-line if the A-Id of the m-line is present in the A-Ids of the x-line and the same for the B-Id in the B-Idsand the C-Id in the C-Ids.

This is pretty straightforward and as long as we have just a few hundred lines this will be just fine. But as the number of lines increases, this will become painfully slow. On may laptop, 1,000,000 m-lines combined with 100,000 c-lines takes almost 3 hours. And this is supposed to happen in a user interaction!

The solution I came up with is to first pre-processes  the c-lines and produces the following lookups:

A Dictionary B Dictionary C Dictionary
key int value HashSet<int> key int value HashSet<int> key int value HashSet<int>
12 0,1 18 0,1 45 0,2,3
19 1,2 39 0,1 78 0,1,3
34 0,2,3 56 0,1,2,3 92 2
35 2,3 89 0,2 646 1,3
59 0,1 243 1,3 776 2,3
117 1,3
576 2,3

The simple idea is that this gives all the indexes of x-lines containing a certain Id, for the A’s, B’s and C’s respectively. All the x-lines that matches a certain m-line is is then simply the intersection of these three hash sets. And this is a fast operation.

foreach (var mline in mlines)
  HashSet h1, h2, h3;
  if (!alookup.TryGetValue(mline.AId, out h1))
  if (!blookup.TryGetValue(mline.BId, out h2))
  if (!clookup.TryGetValue(mline.CId, out h3))
  putSmallestHashSetFirst(ref h1, ref h2, ref h3);
  foreach (var idx in h1.Where(_ => h2.Contains(_) && h3.Contains(_)))
    var xline = xlines[idx];
    //do something useful...

So there it is: a nice speed up of a factor 222!

It is not only the users waiting time we have saved here – we have also saved a lot of electricity by cutting down the computing time. Fast algorithms should indeed be classified as eco-friendly! 🙂

Therefore, I propose the following New Year’s resolutions for 2014:

  • try to eat less meat
  • try to travel by public transportation whenever possible
  • try to write more efficient algorithms

And then we have helped a little tiny bit to save the environment! Happy New Year!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: