Dan Byström’s Bwain

Blog without an interesting name

Optimizing away II

Posted by Dan Byström on December 22, 2008

Continued from Optimizing away. Ok, now I have worked up the courage.

Prepare yourself for a major disappointment. I really do not know how to tweak that C#-loop to run a nanosecond faster. But I can do the same calculation much faster. How? Just my old favorite party trick. It goes like this:

1. Add a new project to your solution

2. Chose Visual C++ / CLR / Class Library

3. Insert the following managed class:

	public ref class FastImageCompare
	{
	public:
		static double compare( void* p1, void* p2, int count )
		{
			return NativeCode::fastImageCompare( p1, p2, count );
		}
		static double compare( IntPtr p1, IntPtr p2, int count )
		{
			return NativeCode::fastImageCompare( p1.ToPointer(), p2.ToPointer(), count );
		}
	};

4. Insert the following function into an unmanaged class (which I happened to call NativeCode):

unsigned long long NativeCode::fastImageCompare( void* p1, void* p2, int count )
{
	int high32 = 0;

	_asm
	{
		push	ebx
		push	esi
		push	edi

		mov		esi, p1
		mov		edi, p2
		xor		eax, eax
again:
		dec		count
		js		done

		movzx	ebx, [esi]
		movzx	edx, [edi]
		sub		edx, ebx
		imul	edx, edx

		movzx	ebx, [esi+1]
		movzx	ecx, [edi+1]
		sub		ebx, ecx
		imul	ebx, ebx
		add		edx, ebx

		movzx	ebx, [esi+2]
		movzx	ecx, [edi+2]
		sub		ebx, ecx
		imul	ebx, ebx
		add		edx, ebx

		add		esi, 4
		add		edi, 4

		add		eax, edx
		jnc		again

		inc		high32
		jmp		again
done:
		mov		edx, high32

		pop		edi
		pop		esi
		pop		ebx
	}

}

Yeah. That’s it. Hand tuned assembly language within a .NET Assembly. UPDATE 2009-01-01: return type of the function changed from “unsigned long” to “unsigned long long”, see here.

I guess that’s almost cheating. And we will be locked inside the Intel platform. Most people won’t mind I guess, but other may have very strong feelings about it. If we really would like to exploit this kind of optimizations while still be portable (to Mono/Mac for example) one possibility would be to load the assembly with native code dynamically. If it fails we could fall back to an alternative version written in pure managed code.

(I know from experience that some people with lesser programming skills react to this with a “what? it must be a crappy compiler if you can write faster code by yourself”. Let me assure you that this is not the case. On the contrary: I’m amazed about the quality of the code emitted by the C# + .NET JIT compilers.)

13 Responses to “Optimizing away II”

  1. […] (RSS) « Besserwisser post on EvoLisa Optimizing Away II […]

  2. Alex said

    Any chance you’ll put up a binary so we mortals can reap the benefits of your faster code?

  3. […] Optimizing away II […]

  4. I tried to apply all of your optimizations (on an Intel box) and it works fine for the default image. However I am having trouble when I open my own image (a 300×300 jpg). In the end, I pulled out the NativeCode tweak, and it works fine now (for both). Strange…

  5. […] Optimizing away II […]

  6. […] Optimizing away II […]

  7. […] video was made using Roger’s source code modified with Dan Bystrom’s FitnessCalculator written in assembler. The program was changed to output a frame whenever the image improves by 5%. […]

  8. Yannickm said

    I am unsure how fast imul is nowadays, but wouldn’t using a lookup table be faster?

    int sqrtLookup[256];

    for( int i=0; i 0 ; i–, p1++, p2++ )
    {
    int r = p1->R – p2->R;
    int g = p1->G – p2->G;
    int b = p1->B – p2->B;
    error += sqrtLookup[r] + sqrtLookup[g] + sqrtLookup[b];
    }

  9. John said

    How about this MSIL code?

    ldloc.s error
    ldloc.s p1
    ldfld uint8 GenArt.Core.Classes.Pixel::R
    ldloc.s p2
    ldfld uint8 GenArt.Core.Classes.Pixel::R
    sub
    dup
    mul
    ldloc.s p1
    ldfld uint8 GenArt.Core.Classes.Pixel::G
    ldloc.s p2
    ldfld uint8 GenArt.Core.Classes.Pixel::G
    sub
    dup
    mul
    ldloc.s p1
    ldfld uint8 GenArt.Core.Classes.Pixel::B
    ldloc.s p2
    ldfld uint8 GenArt.Core.Classes.Pixel::B
    sub
    dup
    mul
    ldloc.s p1
    ldfld uint8 GenArt.Core.Classes.Pixel::A
    ldloc.s p2
    ldfld uint8 GenArt.Core.Classes.Pixel::A
    sub
    dup
    mul
    add
    add
    add
    add
    ldloc.s p1
    sizeof GenArt.Core.Classes.Pixel
    add
    stloc.s p1
    ldloc.s p2
    sizeof GenArt.Core.Classes.Pixel
    add
    stloc.s p2
    stloc.s error

Leave a comment