Matchmaking
Posted by Dan Byström on December 29, 2013
This post contains a synopsis of an algorithm I used to cut down execution time of the following code:
foreach (var mline in mlines) foreach (var xline in xlines) if (xline.AIds.Contains(mline.AId) && xline.BIds.Contains(mline.BId) && xline.CIds.Contains(mline.CId)) { // do something useful... matchCount++; }
Given that there are 100,000 “xlines” and an average of 4 items in the AIds/BIds/CIds arrays – the execution time drops from 2 hours and 54 minutes down to 47 seconds when matching 1,000,000 “mlines”.
Since I was about to write this down anyway and it contains an unusual data structure – dictionaries of hash sets – I thought that I could just as well publish it.
Given problem: match something called “m-lines”:
# | A-Id | B-Id | C-Id | Relevant data… |
0 | 34 | 56 | 45 | … |
1 | 12 | 39 | 776 | … |
2 | 19 | 56 | 92 | … |
3 | 576 | 243 | 646 | … |
with something called “x-lines”:
# | A-Ids | B-Ids | C-Ids | Relevant data… | Match with m-lines | |
0 | 12,34,59 | 18,39,56,89 | 45,78 | … | 0 | |
1 | 12,19,59,117 | 39,56,243 | 78,646 | … | ||
2 | 19,34,35,576 | 18,56,89 | 45,92,776 | … | 0,2 | |
3 | 34,35,117,576 | 56,243 | 45,78,646,776 | … | 0,3 |
The rule for a match is simple: an m-line matches an x-line if the A-Id of the m-line is present in the A-Ids of the x-line and the same for the B-Id in the B-Idsand the C-Id in the C-Ids.
This is pretty straightforward and as long as we have just a few hundred lines this will be just fine. But as the number of lines increases, this will become painfully slow. On may laptop, 1,000,000 m-lines combined with 100,000 c-lines takes almost 3 hours. And this is supposed to happen in a user interaction!
The solution I came up with is to first pre-processes the c-lines and produces the following lookups:
A Dictionary | B Dictionary | C Dictionary | |||||
key int | value HashSet<int> | key int | value HashSet<int> | key int | value HashSet<int> | ||
12 | 0,1 | 18 | 0,1 | 45 | 0,2,3 | ||
19 | 1,2 | 39 | 0,1 | 78 | 0,1,3 | ||
34 | 0,2,3 | 56 | 0,1,2,3 | 92 | 2 | ||
35 | 2,3 | 89 | 0,2 | 646 | 1,3 | ||
59 | 0,1 | 243 | 1,3 | 776 | 2,3 | ||
117 | 1,3 | ||||||
576 | 2,3 |
The simple idea is that this gives all the indexes of x-lines containing a certain Id, for the A’s, B’s and C’s respectively. All the x-lines that matches a certain m-line is is then simply the intersection of these three hash sets. And this is a fast operation.
foreach (var mline in mlines) { HashSet h1, h2, h3; if (!alookup.TryGetValue(mline.AId, out h1)) continue; if (!blookup.TryGetValue(mline.BId, out h2)) continue; if (!clookup.TryGetValue(mline.CId, out h3)) continue; putSmallestHashSetFirst(ref h1, ref h2, ref h3); foreach (var idx in h1.Where(_ => h2.Contains(_) && h3.Contains(_))) { var xline = xlines[idx]; //do something useful... matchCount--; } }
So there it is: a nice speed up of a factor 222!
It is not only the users waiting time we have saved here – we have also saved a lot of electricity by cutting down the computing time. Fast algorithms should indeed be classified as eco-friendly! 🙂
Therefore, I propose the following New Year’s resolutions for 2014:
- try to eat less meat
- try to travel by public transportation whenever possible
- try to write more efficient algorithms
And then we have helped a little tiny bit to save the environment! Happy New Year!
Leave a Reply