Optimizing away II

Posted by Dan Byström on December 22, 2008

Continued from Optimizing away. Ok, now I have worked up the courage.

Prepare yourself for a major disappointment. I really do not know how to tweak that C# loop to run a nanosecond faster. But I can do the same calculation much faster. How? With my old favorite party trick. It goes like this:

1. Add a new project to your solution

2. Choose Visual C++ / CLR / Class Library

3. Insert the following managed class:

	public ref class FastImageCompare
	{
	public:
		static double compare( void* p1, void* p2, int count )
		{
			return NativeCode::fastImageCompare( p1, p2, count );
		}
		static double compare( IntPtr p1, IntPtr p2, int count )
		{
			return NativeCode::fastImageCompare( p1.ToPointer(), p2.ToPointer(), count );
		}
	};

4. Insert the following function into an unmanaged class (which I happened to call NativeCode):

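unsigned long long NativeCode::fastImageCompare( void* p1, void* p2, int count )
{
	int high32 = 0;	// upper 32 bits of the running total

	_asm
	{
		push	ebx
		push	esi
		push	edi

		mov		esi, p1
		mov		edi, p2
		xor		eax, eax	// eax = lower 32 bits of the running total
again:
		dec		count
		js		done		// all pixels processed

		movzx	ebx, byte ptr [esi]	// first channel
		movzx	edx, byte ptr [edi]
		sub		edx, ebx
		imul	edx, edx

		movzx	ebx, byte ptr [esi+1]	// second channel
		movzx	ecx, byte ptr [edi+1]
		sub		ebx, ecx
		imul	ebx, ebx
		add		edx, ebx

		movzx	ebx, byte ptr [esi+2]	// third channel
		movzx	ecx, byte ptr [edi+2]
		sub		ebx, ecx
		imul	ebx, ebx
		add		edx, ebx

		add		esi, 4		// next 4-byte pixel; the fourth byte is ignored
		add		edi, 4

		add		eax, edx
		jnc		again

		inc		high32		// carry out of the lower 32 bits
		jmp		again
done:
		mov		edx, high32	// no explicit return: the result is left in edx:eax

		pop		edi
		pop		esi
		pop		ebx
	}

}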
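
For readers who do not read x86 assembly, the loop above can be sketched in portable C++. This is only a reference, not the code the post is about: the name `referenceImageCompare` is made up here, and the 4-byte pixel layout with an ignored fourth byte is inferred from the `add esi, 4` stride in the listing.

```cpp
#include <cstdint>

// Hypothetical portable equivalent of the assembly loop (the name is an assumption).
// Each pixel occupies 4 bytes; only the first three channels are compared,
// mirroring the [esi], [esi+1], [esi+2] loads and the 4-byte stride.
std::uint64_t referenceImageCompare(const void* p1, const void* p2, int count)
{
    const std::uint8_t* a = static_cast<const std::uint8_t*>(p1);
    const std::uint8_t* b = static_cast<const std::uint8_t*>(p2);
    std::uint64_t total = 0;  // 64-bit accumulator, like the edx:eax pair

    for (int i = 0; i < count; ++i, a += 4, b += 4)
    {
        for (int c = 0; c < 3; ++c)
        {
            // Difference of two bytes fits in an int; its square is at most 65025.
            int d = static_cast<int>(a[c]) - static_cast<int>(b[c]);
            total += static_cast<std::uint64_t>(d * d);
        }
    }
    return total;
}
```

A version like this is also handy as a correctness check: both functions should return the same number for the same pair of buffers.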

Yeah. That’s it. Hand tuned assembly language within a .NET Assembly. UPDATE 2009-01-01: return type of the function changed from “unsigned long” to “unsigned long long”, see here.
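
The update matters because `unsigned long` is only 32 bits wide in Visual C++, so the part of the total that the loop collects in `high32` was silently dropped. A rough sketch of the arithmetic, assuming a worst case where every compared channel differs by 255 (the helper names are made up for illustration): each pixel contributes at most 3 × 255² = 195,075, so a 32-bit total can overflow after roughly 22,000 pixels.

```cpp
#include <cstdint>

// Largest possible error for `pixels` pixels: all three compared
// channels differ by 255 (hypothetical helper for illustration).
std::uint64_t worstCaseError(std::uint64_t pixels)
{
    return pixels * 3u * 255u * 255u;
}

// What survives if the 64-bit total is squeezed into a 32-bit return value.
std::uint32_t truncatedTo32Bits(std::uint64_t total)
{
    return static_cast<std::uint32_t>(total);
}
```

For a 300×300 image, `worstCaseError(90000)` is 17,556,750,000, which does not fit in 32 bits.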

I guess that’s almost cheating. And we will be locked inside the Intel platform. Most people won’t mind I guess, but others may have very strong feelings about it. If we really would like to exploit this kind of optimization while still being portable (to Mono/Mac for example), one possibility would be to load the assembly with the native code dynamically. If that fails, we could fall back to an alternative version written in pure managed code.
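
The fallback idea could look roughly like the sketch below. It is written in C++ rather than C# only to match the listings above, and everything in it is an assumption: `tryLoadNativeCompare` stands in for whatever mechanism would actually load the native assembly, and here it simply reports failure so that the portable path gets exercised.

```cpp
#include <cstdint>

using CompareFn = std::uint64_t (*)(const void*, const void*, int);

// Portable fallback: sum of squared differences over 4-byte pixels,
// comparing only the first three bytes of each pixel.
std::uint64_t portableCompare(const void* p1, const void* p2, int count)
{
    const std::uint8_t* a = static_cast<const std::uint8_t*>(p1);
    const std::uint8_t* b = static_cast<const std::uint8_t*>(p2);
    std::uint64_t total = 0;
    for (int i = 0; i < count * 4; i += 4)
        for (int c = 0; c < 3; ++c)
        {
            int d = static_cast<int>(a[i + c]) - static_cast<int>(b[i + c]);
            total += static_cast<std::uint64_t>(d * d);
        }
    return total;
}

// Hypothetical loader. A real one would try to load the platform-specific
// code and return its entry point; this stub always fails.
CompareFn tryLoadNativeCompare()
{
    return nullptr;
}

// Pick the implementation once: native if available, portable otherwise.
CompareFn selectCompare()
{
    CompareFn native = tryLoadNativeCompare();
    return native != nullptr ? native : &portableCompare;
}
```

Selecting the function once at startup keeps the per-call cost down to a single indirect call, whichever implementation ends up being used.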

(I know from experience that some people with lesser programming skills react to this with a “what? it must be a crappy compiler if you can write faster code by yourself”. Let me assure you that this is not the case. On the contrary: I’m amazed by the quality of the code emitted by the C# + .NET JIT compilers.)

This entry was posted on December 22, 2008 at 12:17 and is filed under .NET, Programming.

13 Responses to “Optimizing away II”

Optimizing away « Dan Byström’s Bwain said

December 22, 2008 at 13:56
[…] (RSS) « Besserwisser post on EvoLisa Optimizing Away II […]

Alex said

December 29, 2008 at 19:52
Any chance you’ll put up a binary so we mortals can reap the benefits of your faster code?

- danbystrom said
  
  December 29, 2008 at 20:22
  Eh, sure I can… but there is really nothing more to it than I wrote… but just gimme ’till tomorrow…
  
Optimizing away II.2 « Dan Byström’s Bwain said

December 30, 2008 at 14:21
[…] Optimizing away II […]

Nate Huddleson said

December 31, 2008 at 00:00
I tried to apply all of your optimizations (on an Intel box) and it works fine for the default image. However I am having trouble when I open my own image (a 300×300 jpg). In the end, I pulled out the NativeCode tweak, and it works fine now (for both). Strange…

- danbystrom said
  
  December 31, 2008 at 10:14
  Since I haven’t even tried it on the default image I cannot decide if this is good or bad news… 🙂
  
  How ’bout you zipping the thing and mail it to me and I’ll take a peek?
  
- danbystrom said
  
  January 1, 2009 at 20:44
  Problem solved: https://danbystrom.se/2009/01/01/optimizing-away-ii3/
  The reason it worked for the default image I guess, is that it is rather dark and EvoLisa starts out with a black image. You probably used a much lighter image, which caused some significant bits to be truncated. Sorry for the inconvenience.
  
Optimizing away II.3 « Dan Byström’s Bwain said

January 1, 2009 at 20:30
[…] Optimizing away II […]

Brian Low » EvoLisa Video said

January 27, 2009 at 04:35
[…] video was made using Roger’s source code modified with Dan Bystrom’s FitnessCalculator written in assembler. The program was changed to output a frame whenever the image improves by 5%. […]

Yannickm said

February 24, 2009 at 16:31
I am unsure how fast imul is nowadays, but wouldn’t using a lookup table be faster?

int sqrtLookup[256];

for( int i=0; i 0 ; i--, p1++, p2++ )
{
int r = p1->R - p2->R;
int g = p1->G - p2->G;
int b = p1->B - p2->B;
error += sqrtLookup[r] + sqrtLookup[g] + sqrtLookup[b];
}

John said

May 31, 2009 at 18:54
How about this MSIL code?

ldloc.s error
ldloc.s p1
ldfld uint8 GenArt.Core.Classes.Pixel::R
ldloc.s p2
ldfld uint8 GenArt.Core.Classes.Pixel::R
sub
dup
mul
ldloc.s p1
ldfld uint8 GenArt.Core.Classes.Pixel::G
ldloc.s p2
ldfld uint8 GenArt.Core.Classes.Pixel::G
sub
dup
mul
ldloc.s p1
ldfld uint8 GenArt.Core.Classes.Pixel::B
ldloc.s p2
ldfld uint8 GenArt.Core.Classes.Pixel::B
sub
dup
mul
ldloc.s p1
ldfld uint8 GenArt.Core.Classes.Pixel::A
ldloc.s p2
ldfld uint8 GenArt.Core.Classes.Pixel::A
sub
dup
mul
add
add
add
add
ldloc.s p1
sizeof GenArt.Core.Classes.Pixel
add
stloc.s p1
ldloc.s p2
sizeof GenArt.Core.Classes.Pixel
add
stloc.s p2
stloc.s error

- John said
  
  May 31, 2009 at 18:56
  P.S. I’ll post the implementation of this code within a week or two.
  

Dan Byström’s Bwain

Blog without an interesting name
