Dan Byström’s Bwain

Blog without an interesting name

Improving performance…

Posted by Dan Byström on December 14, 2008

…of the fitness function in the EvoLisa project. In case you managed to miss it, EvoLisa is an already world famous project created by Roger Alsing earlier this week.

Continued from this post.

With just a few changes, we can actually make the original fitness function run 25 times faster. I’ll start by presenting the original code, followed directly by the improved version. After that I’ll discuss my reasoning behind each of the changes.

Original code:

	public static class FitnessCalculator
	{
		public static double GetDrawingFitness( DnaDrawing newDrawing, Color[,] sourceColors )
		{
			double error = 0;

			using ( var b = new Bitmap( Tools.MaxWidth, Tools.MaxHeight, PixelFormat.Format24bppRgb ) )
			using ( Graphics g = Graphics.FromImage( b ) )
			{
				Renderer.Render(newDrawing, g, 1);

				BitmapData bmd1 = b.LockBits(
					new Rectangle( 0, 0, Tools.MaxWidth, Tools.MaxHeight ),
					ImageLockMode.ReadOnly,
					PixelFormat.Format24bppRgb );

				for ( int y = 0 ; y < Tools.MaxHeight ; y++ )
				{
					for ( int x = 0 ; x < Tools.MaxWidth ; x++ )
					{
						Color c1 = GetPixel( bmd1, x, y );
						Color c2 = sourceColors[x, y];

						double pixelError = GetColorFitness( c1, c2 );
						error += pixelError;
					}
				}

				b.UnlockBits( bmd1 );
			}

			return error;
		}

		private static unsafe Color GetPixel( BitmapData bmd, int x, int y )
		{
			// 24bpp: three bytes per pixel, stored in memory as B, G, R
			byte* p = (byte*)bmd.Scan0 + y * bmd.Stride + 3 * x;
			return Color.FromArgb( p[2], p[1], p[0] );
		}

		private static double GetColorFitness( Color c1, Color c2 )
		{
			double r = c1.R - c2.R;
			double g = c1.G - c2.G;
			double b = c1.B - c2.B;

			return r * r + g * g + b * b;
		}

	}

Optimized code, 25 times as fast:

	// Byte order deliberately matches the in-memory layout of a 32bpp ARGB
	// GDI+ bitmap on a little-endian machine: blue first, alpha last.
	public struct Pixel
	{
		public byte B;
		public byte G;
		public byte R;
		public byte A;
	}

	public class NewFitnessCalculator : IDisposable
	{
		private Bitmap _bmp;
		private Graphics _g;

		public NewFitnessCalculator()
		{
			_bmp = new Bitmap( Tools.MaxWidth, Tools.MaxHeight );
			_g = Graphics.FromImage( _bmp );
		}

		public void Dispose()
		{
			_g.Dispose();
			_bmp.Dispose();
		}

		public double GetDrawingFitness( DnaDrawing newDrawing, Pixel[] sourcePixels )
		{
			double error = 0;

			Renderer.Render(newDrawing, _g, 1);

			BitmapData bd = _bmp.LockBits(
				new Rectangle( 0, 0, Tools.MaxWidth, Tools.MaxHeight ),
				ImageLockMode.ReadOnly,
				PixelFormat.Format32bppArgb );

			unchecked
			{
				unsafe
				{
					fixed ( Pixel* psourcePixels = sourcePixels )
					{
						Pixel* p1 = (Pixel*)bd.Scan0.ToPointer();
						Pixel* p2 = psourcePixels;
						for ( int i = sourcePixels.Length ; i > 0 ; i--, p1++, p2++ )
						{
							int r = p1->R - p2->R;
							int g = p1->G - p2->G;
							int b = p1->B - p2->B;
							error += r * r + g * g + b * b;
						}
					}
				}
			}
			_bmp.UnlockBits( bd );

			return error;
		}

	}

First of all we notice that each time the fitness function is called, a new bitmap is constructed, used and then destroyed. This is fine for a function that seldom gets called. But for a function that is repeatedly called, we’ll be far better off if we reuse the same Bitmap and Graphics objects over and over.

Therefore I have changed the class from being static into one that must be instantiated. Of course, that requires some minor changes to the consumer of this class, but in my opinion this is only for the better. Although convenient, static methods (and/or singletons) are very hostile to unit testing and mocking, so I’m trying to move away from them anyway.
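For illustration only, here is a rough sketch of how a consumer might hold on to a single calculator instance for an entire evolution run. The surrounding loop and the helper names (GetInitialDrawing, Mutate, maxGenerations) are made up for the example and are not part of EvoLisa; sourcePixels is assumed to be prepared beforehand (how to build it is shown further down).

	// Hypothetical consumer sketch (my illustration, not EvoLisa code): one
	// calculator instance is created up front and reused for every generation.
	using ( var calculator = new NewFitnessCalculator() )
	{
		DnaDrawing currentDrawing = GetInitialDrawing();   // placeholder
		double currentError = calculator.GetDrawingFitness( currentDrawing, sourcePixels );

		for ( int generation = 0 ; generation < maxGenerations ; generation++ )
		{
			DnaDrawing mutatedDrawing = Mutate( currentDrawing );   // placeholder
			double newError = calculator.GetDrawingFitness( mutatedDrawing, sourcePixels );

			if ( newError < currentError )   // keep the mutation only if it improves the fit
			{
				currentDrawing = mutatedDrawing;
				currentError = newError;
			}
		}
	}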

To my surprise, this first optimization attempt only buys us a few percent performance increase. Still, it’s a start, and now it will get better. Read on.
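If you want to measure this yourself, a minimal Stopwatch harness along these lines will do. It is only a sketch of my own; testDrawing, sourcePixels and the iteration count are placeholders, not code from the project.

	// Rough timing sketch: measures the average cost of one fitness evaluation.
	const int iterations = 100;

	using ( var calculator = new NewFitnessCalculator() )
	{
		var stopwatch = System.Diagnostics.Stopwatch.StartNew();

		for ( int i = 0 ; i < iterations ; i++ )
			calculator.GetDrawingFitness( testDrawing, sourcePixels );

		stopwatch.Stop();
		Console.WriteLine( "{0:0.00} ms per fitness evaluation",
			stopwatch.Elapsed.TotalMilliseconds / iterations );
	}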

So, once we’ve added a constructor to create the bitmap and graphics objects once and for all (as well as making the class disposable so that the two GDI+ objects can be disposed of) we move on to the real performance issues:

	for ( int y = 0 ; y < Tools.MaxHeight ; y++ )
	{
		for ( int x = 0 ; x < Tools.MaxWidth ; x++ )
		{
			Color c1 = GetPixel( bmd1, x, y );
			Color c2 = sourceColors[x, y];

			double pixelError = GetColorFitness( c1, c2 );
			error += pixelError;
		}
	}

This code looks pretty innocent, eh? It is not.

Even for a moderately sized bitmap, say 1,000 by 1,000 pixels, the code in the inner loop is executed 1,000,000 times. That’s a pretty big number. It means that each tiny little "error", performance-wise, is multiplied by 1,000,000, so every tiny thing will count in the end: even 100 nanoseconds of unnecessary work per pixel adds a tenth of a second to every single fitness evaluation.

For example, every method call costs time compared to having the method's code inlined within the loop. Above we find two method calls, GetPixel and GetColorFitness, whose code is far better off moved inside the loop. But as I will end up explaining, the worst performance hog here is really the innocent-looking line "Color c2 = sourceColors[x, y];". Anyway, off we go:

			unchecked
			{
				unsafe
				{
					for ( int y = 0 ; y < Tools.MaxHeight ; y++ )
					{
						for ( int x = 0 ; x < Tools.MaxWidth ; x++ )
						{
							byte* p = (byte*)bmd1.Scan0 + y * bmd1.Stride + 3 * x;
							Color c1 = Color.FromArgb( p[2], p[1], p[0] );
							Color c2 = sourceColors[x, y];

							int R = c1.R - c2.R;
							int G = c1.G - c2.G;
							int B = c1.B - c2.B;

							error += R * R + G * G + B * B;
						}
					}
				}
			}

The above changes, including changing the variables R, G & B from double into int, will buy us approximately a 30% speed increase. It ain't much compared to 25 times, but we're moving on. Then we can look at "Color c1" and notice that we can get rid of it completely by simply changing the inner code like so:

			byte* p = (byte*)bmd1.Scan0 + y * bmd1.Stride + 3 * x;
			Color c2 = sourceColors[x, y];

			int R = p[2] - c2.R;
			int G = p[1] - c2.G;
			int B = p[0] - c2.B;

			error += R * R + G * G + B * B;

Now we actually have code that executes TWICE as fast as our original code. Next we must turn our attention to the first two lines; the rest I don't think we can do much about.

Think about it. What we want to do is loop over each and every pixel in the image. Why then do we need to calculate the memory address for each pixel when we want to move to the next pixel? For each pixel we do completely unnecessary calculations. First "(byte*)bmd1.Scan0 + y * bmd1.Stride + 3 * x"; this contains four variables, two additions and two multiplications when really a single increment is all we need.

Then "sourceColors[x, y]". Fast enough and nothing we can improve here, right? No, no no, this is far WORSE! It looks completely harmless, but not only is a <strong>similar formula as the previous one</strong> taking place behind the scenes; for each pixel, the <em><strong>x and y parameters are being bounds checked</strong></em>, ensuring that we do not pass illegal values to the array-lookup!!!

So this innocent-looking expression will cause something like this to happen somewhere around a million times for each fitness calculation:


			// pseudo-code
			if ( x < sourceColors.GetLowerBound( 0 ) || y < sourceColors.GetLowerBound( 1 ) ||
				 x > sourceColors.GetUpperBound( 0 ) || y > sourceColors.GetUpperBound( 1 ) )
				throw new IndexOutOfRangeException( "Index was outside the bounds of the array." );
			Color c2 = *( &sourceColors + x * ( sourceColors.GetUpperBound( 1 ) + 1 ) + y );
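
Before going all the way to pointers, it is worth noting that merely flattening the two-dimensional Color[,] into a one-dimensional Color[] already removes the expensive rectangular-array lookup, although one bounds check per pixel still remains. A rough sketch of that halfway step (my illustration, not code from the project; it assumes the same unsafe/unchecked context as before and a sourceColorsFlat array filled row by row from the source image):

	for ( int y = 0 ; y < Tools.MaxHeight ; y++ )
	{
		int rowStart = y * Tools.MaxWidth;
		byte* p = (byte*)bmd1.Scan0 + y * bmd1.Stride;
		for ( int x = 0 ; x < Tools.MaxWidth ; x++, p += 3 )
		{
			// One-dimensional lookup: cheaper than the Color[,] indexer,
			// but still one bounds check and one addition per pixel.
			Color c2 = sourceColorsFlat[rowStart + x];

			int R = p[2] - c2.R;
			int G = p[1] - c2.G;
			int B = p[0] - c2.B;

			error += R * R + G * G + B * B;
		}
	}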

Now we’re in for a little heavier refactoring. Unfortunately the sourceColors matrix is laid out column-by-row instead of row-by-column, which would have been better, so in order to solve this issue I’ll even change it into a vector of type “Pixel” instead. This requires changes to the method signature and to the construction of the matrix/vector itself, of course, but once in place:

			unchecked
			{
				unsafe
				{
					fixed ( Pixel* psourceColors = sourceColors )
					{
						Pixel* pc = psourceColors;
						for ( int y = 0 ; y < Tools.MaxHeight ; y++ )
						{
							byte* p = (byte*)bmd1.Scan0 + y * bmd1.Stride;
							for ( int x = 0 ; x < Tools.MaxWidth ; x++, p += 3, pc++ )
							{
								int R = p[2] - pc->R;
								int G = p[1] - pc->G;
								int B = p[0] - pc->B;

								error += R * R + G * G + B * B;
							}
						}
					}
				}
			}

we’re actually in for a performance improvement of 15 times!!!

Yeah, that’s actually how bad (performance-wise) the innocent looking line “Color c2 = sourceColors[x, y];” was. Bet some of you didn’t know that!!! 🙂

In order to change sourceColors from a matrix of Color into a vector of Pixel (declared as in the second code window above) I did this:

		public static Pixel[] SetupSourceColorMatrix( Bitmap sourceImage )
		{
			if ( sourceImage == null )
				throw new NotSupportedException( "A source image of Bitmap format must be provided" );

			BitmapData bd = sourceImage.LockBits(
				new Rectangle( 0, 0, Tools.MaxWidth, Tools.MaxHeight ),
				ImageLockMode.ReadOnly,
				PixelFormat.Format32bppArgb );
			Pixel[] sourcePixels = new Pixel[Tools.MaxWidth * Tools.MaxHeight];
			unsafe
			{
				fixed ( Pixel* psourcePixels = sourcePixels )
				{
					Pixel* pSrc = (Pixel*)bd.Scan0.ToPointer();
					Pixel* pDst = psourcePixels;
					for ( int i = sourcePixels.Length ; i > 0 ; i-- )
						*( pDst++ ) = *( pSrc++ );
				}
			}
			sourceImage.UnlockBits( bd );

			return sourcePixels;
		}

Probably a little overkill… but what the heck… Now I guess many people who are familiar with LockBits and direct pixel manipulation will cry out HEY YOU CAN’T DO THAT! YOU MUST TAKE THE “STRIDE” INTO ACCOUNT WHEN YOU MOVE TO A NEW SCAN LINE.

Well, yes… and no. Not when I use the PixelFormat.Format32bppArgb! Go figure! 🙂

So our new changes mean that we process each ROW in the bitmap blindingly fast compared to the original version, and combined with our caching of the bitmap we have gained a performance boost of 20 times!

Now for my final version I have rendered the drawing in PixelFormat.Format32bppArgb, which is the default format for bitmaps in GDI+. In that format each pixel will be exactly four bytes in size, which in turn means that GDI+ places no “gap” between the last pixel in one row and the first pixel in the next row and so we are actually able to treat the whole image as a single vector, processing it all in one go.
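
If you want to make that assumption explicit, a cheap sanity check right after LockBits will do. This guard is my addition and not part of the original code:

	// With 32 bits per pixel every row already starts on a 4-byte boundary,
	// so the stride equals the width times four. Fail loudly if it ever doesn't:
	if ( bd.Stride != Tools.MaxWidth * 4 )
		throw new InvalidOperationException( "Unexpected stride: bitmap rows are padded." );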

To conclude: in C# we can use unsafe code and pointer arithmetic to access an array far faster than the normal indexer, because we short-circuit the bounds checking. If we iterate over several consecutive elements in the array our gain is even larger, because we just increment the pointer instead of recalculating the memory address of the cells of the array over and over.
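
As a generic illustration of that point (my sketch, nothing EvoLisa-specific): the same sum computed through the ordinary indexer and through a fixed pointer. Note that in this simplest Length-bounded loop the JIT can often remove the bounds check by itself; the big wins come with less regular access patterns, like the [x, y] lookup above.

	static long SumWithIndexer( byte[] data )
	{
		long sum = 0;
		for ( int i = 0 ; i < data.Length ; i++ )
			sum += data[i];		// indexer: bounds-checked array access
		return sum;
	}

	static unsafe long SumWithPointer( byte[] data )
	{
		long sum = 0;
		fixed ( byte* pData = data )	// pin the array so the GC can't move it
		{
			byte* p = pData;
			for ( int i = data.Length ; i > 0 ; i--, p++ )
				sum += *p;		// pointer: no bounds check, just an increment
		}
		return sum;
	}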

Normally we don’t want to bother with this because the gain of safe and managed code is bigger than the performance gain. But when it comes to image processing this perspective may not be as clear.

BTW, I wrote a post on a similar topic a few months ago: Soft Edged Images in GDI+.

EDIT: Continued here.

15 Responses to “Improving performance…”

  1. usernameguy said

    Again, great little bit of optimization. Very impressive. And yeah, that [] call should not be in an inner loop like that.

    Any chance we could have the binary? Or you could send a patch to Roger? I’d love to play with the 25x sped-up version. 🙂

  2. usernameguy said

    Heh, excellent. Thanks.

    And I understand. Still – I’m guessing this sucker spends, what, 80% or 90% of its time in its fitness function? So this 25x speed-up is probably a 10x overall?

    BTW have you considered a CUDA implementation? Your optimizations are well-suited for porting to a GPU. (If I had any skill at all in CUDA, I’d give it a shot myself.)

    • danbystrom said

      Mats Helander (Roger’s partner in crime on this) says that he estimates the overall performance gain at approximately four times. I haven’t measured it myself. And I’d LOVE to play around with GPUs… should I ever find the time…

  3. A.W said

    You can also get a nice speed boost by only checking every other pixel. The quality of the fitness is not really degraded.

  4. nemerle said

    Checking every other pixel is a special case of doing fitness evaluations on source image pyramid.
    Using n-level pyramid, and properly setting fitness steps:
    if ( fitness < pyramid_threshold[current_level] )
        current_level--;
    would make this even faster 🙂

  5. […] Filed under: Programming — danbystrom @ 13:57 Follow-up on Improving performance… I just tested that I can optimize this […]

  6. […] Dan Byström provided an optimization that resulted in a 25 times performance improvement for the fitness function, completely crazy stuff 🙂 […]

  7. This is exactly the kind of thing I want to read just before Christmas, so that I have something to tinker with in the dark once the kids and the wife have gone to bed. Really nice work, Dan!

  8. Joe said

    Can someone please upload the changed sourcecode?

  9. John said

    I used your improvement, and I have an improvement of my own. Using ints instead of doubles for pictures with under 10,922 pixels (using an if statement to call the correct method) will improve the speed by around 25%. You may be able to increase that maximum pixel number to 21,845 by using unsigned ints. Also, your assembly version is obviously made for 64-bit processors, since on my 32-bit machine it runs around 20% slower than your normal improvement. (Also, 64-bit machines could use a long instead of an int for a speed increase for all pictures, but 32-bit machines would suffer greatly from that.)

    • danbystrom said

      Yes, if you’re willing to have two different loops depending on picture size, then you can have one version with uint and one with double. Have you tried using generics to avoid code duplication? The assembler version is definitely not written for a 64-bit processor, on the contrary: if you do run your C# version on a 64-bit processor, then the JIT compiler will produce code optimized for your processor and so you may get better performance than my 32-bit code.

Leave a comment