Improving Drawing Performance using Multiple Threads

Posted by Matt

The application that I write is a 2D CAD-type application.  It draws many lines, circles, and Bézier curves on the screen.  It's not uncommon to need over 1,000,000 lines on the screen at any given time.  So I was looking for ways to increase the performance of the drawing.

There are some techniques, like caching results and dirty buffers, that save the screen contents to avoid re-drawing when it isn't necessary.  It's good to investigate those methods and implement them if they help.  However, they do not help with the initial drawing.

FYI, my test computer is my working laptop: AMD Turion 64 X2 1.8 GHz, 2 GB RAM, 1680x1050x32, Vista Business, 3.0 "Windows Experience Index".  I cannot say that my conditions are "typical" for the average user of my application, but it's the basis that I have to start from.  It is dual core, so I can accurately play with multiple threads.

Attempt 0: Baseline

I created a simple program and timed the results.  Calling CDC::MoveTo() and CDC::LineTo() 1,000,000 times to draw all the lines takes 15 seconds on my laptop.  My CPU usage never passes 50%, which on a dual-core machine means one core is busy while the other sits idle.  So during painting, half of my processor is going to waste.

Attempt 1: GDI+

I also tried using GDI+, but that was much slower.  I stopped the test because it had taken 5 seconds to draw 1,000 lines.  At that rate, 1,000,000 lines would take about 5,000 seconds, which is well over an hour.

Attempt 2: Multiple Threads

My 2nd attempt was to take the 1,000,000 lines, split them into 2 bunches of 500,000 lines, and have 2 threads drawing simultaneously to the CDC.

The result was 26 seconds to render the same lines as before.  My guess is that the CDC, MFC, Windows, or the device driver (somewhere along that pipeline) is serializing access with critical sections.  The overhead of blocking and task switching is reducing performance.

Attempt 3: Multiple Threads with Memory Buffers

My 3rd attempt was to take the same 2 bunches of lines and the same 2 threads, but render each bunch to an independent memory device context.  In this case, that means a memory CDC rendering to a CBitmap.  After the rendering is complete, I merge the 2 bitmaps and copy the result onto the screen.

The result was around 18 seconds.  That's better than 2 threads writing to the CDC directly (26 seconds), but still slower than the single-threaded baseline (15 seconds).  CDC or MFC must have some critical sections somewhere.  I timed the merging separately from the drawing.  I thought the merging would be an expensive operation, but it's actually rather cheap: around 50 ms on my laptop to merge the 2 bitmaps, which surprised me.
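The post does not say how the 2 bitmaps were merged (in MFC it would presumably be a blit through the CDC). As a rough, portable illustration of the idea only, here is a sketch that merges two plain pixel arrays instead of CBitmaps. The function name `mergeInto` and the "background color means empty" convention are assumptions made for this example, not something taken from the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: copy every non-background pixel of `overlay` onto `base`.
// Assumes both buffers have the same dimensions and were cleared to
// `background` before any drawing took place.
void mergeInto(std::vector<std::uint32_t>& base,
               const std::vector<std::uint32_t>& overlay,
               std::uint32_t background) {
    assert(base.size() == overlay.size());
    for (std::size_t i = 0; i < base.size(); ++i) {
        if (overlay[i] != background) {
            base[i] = overlay[i];  // the overlay drew something here
        }
    }
}
```

A merge like this is a single linear pass over the pixels (1680 x 1050 = 1,764,000 of them at my resolution), with no per-line work at all, which fits with it being cheap compared to the drawing itself.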

Attempt 4: Single Thread with non-CDC Memory Buffer

My 4th attempt was to create my own replacement for CDC for drawing lines.  So I created my own RGBA buffer and line drawing routine (not hard to do, and there are many on the internet).

After drawing, I could copy the memory buffer into a CBitmap and blit that onto the screen.
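For readers who haven't written one, here is a minimal sketch of what such a buffer and line routine could look like. It uses the well-known integer Bresenham algorithm; the original post does not say which line algorithm was used, so treat that choice, and the names `PixelBuffer` and `drawLine`, as assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for the custom buffer: one 32-bit RGBA value per
// pixel, stored row by row.
struct PixelBuffer {
    int width;
    int height;
    std::vector<std::uint32_t> pixels;

    PixelBuffer(int w, int h, std::uint32_t background = 0)
        : width(w), height(h),
          pixels(static_cast<std::size_t>(w) * static_cast<std::size_t>(h),
                 background) {}

    // Write a pixel, silently clipping anything outside the buffer.
    void set(int x, int y, std::uint32_t color) {
        if (x >= 0 && x < width && y >= 0 && y < height) {
            pixels[static_cast<std::size_t>(y) * width + x] = color;
        }
    }

    // Read a pixel; the coordinates must be inside the buffer.
    std::uint32_t at(int x, int y) const {
        assert(x >= 0 && x < width && y >= 0 && y < height);
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Integer Bresenham line from (x0, y0) to (x1, y1), both endpoints included.
void drawLine(PixelBuffer& buf, int x0, int y0, int x1, int y1,
              std::uint32_t color) {
    const int dx = std::abs(x1 - x0);
    const int dy = -std::abs(y1 - y0);
    const int sx = x0 < x1 ? 1 : -1;  // step direction in x
    const int sy = y0 < y1 ? 1 : -1;  // step direction in y
    int err = dx + dy;                // running error term
    for (;;) {
        buf.set(x0, y0, color);
        if (x0 == x1 && y0 == y1) break;
        const int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}
```

Clipping per pixel keeps the sketch short. A real renderer would more likely clip each line against the buffer rectangle once, before walking it.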

The result was 19 seconds, which is slower than the 15-second baseline.  I'm still going in the wrong direction.  CDC might have some video acceleration advantage on this laptop.

However...

Attempt 5: Multiple Threads with non-CDC Memory Buffers

I used the same 2 bunches of lines and the same 2 threads as in the multithreaded CDC tests, but with each thread drawing into its own non-CDC memory buffer.

Using this configuration, I actually got some pleasing results:  12 seconds.  Progress!

So using my own memory buffer, I managed to get from 19 seconds to 12 seconds by utilizing the 2nd core on my laptop.  However, in reality, I only went from 15 seconds (baseline) to 12 seconds.  Better, but not what I was hoping for.
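The overall shape of this approach can be sketched in portable C++ with `std::thread`. This is not the original program, which presumably used Windows or MFC threads. The names `Line`, `Buffer`, and `renderParallel`, the single draw color, and the use of 0 as the background value are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <thread>
#include <utility>
#include <vector>

struct Line { int x0, y0, x1, y1; };
using Buffer = std::vector<std::uint32_t>;  // width * height pixels, row by row

// Bresenham line with per-pixel clipping into a private buffer.
void drawLine(Buffer& buf, int w, int h, Line l, std::uint32_t color) {
    const int dx = std::abs(l.x1 - l.x0), dy = -std::abs(l.y1 - l.y0);
    const int sx = l.x0 < l.x1 ? 1 : -1, sy = l.y0 < l.y1 ? 1 : -1;
    int err = dx + dy, x = l.x0, y = l.y0;
    for (;;) {
        if (x >= 0 && x < w && y >= 0 && y < h)
            buf[static_cast<std::size_t>(y) * w + x] = color;
        if (x == l.x1 && y == l.y1) break;
        const int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x += sx; }
        if (e2 <= dx) { err += dx; y += sy; }
    }
}

// Render `lines` with `threadCount` workers, each owning one buffer, then
// merge the buffers into a single result. Background is assumed to be 0.
Buffer renderParallel(const std::vector<Line>& lines, int w, int h,
                      unsigned threadCount, std::uint32_t color) {
    assert(w > 0 && h > 0 && threadCount > 0);
    const std::size_t pixelCount = static_cast<std::size_t>(w) * h;
    std::vector<Buffer> buffers(threadCount, Buffer(pixelCount, 0));
    std::vector<std::thread> workers;
    // Contiguous chunks, rounded up so every line lands in some chunk.
    const std::size_t chunk = (lines.size() + threadCount - 1) / threadCount;

    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin =
            std::min(lines.size(), static_cast<std::size_t>(t) * chunk);
        const std::size_t end = std::min(lines.size(), begin + chunk);
        // Each worker writes only to buffers[t], so no locking is needed.
        workers.emplace_back([&lines, &buffers, t, begin, end, w, h, color] {
            for (std::size_t i = begin; i < end; ++i)
                drawLine(buffers[t], w, h, lines[i], color);
        });
    }
    for (std::thread& worker : workers) worker.join();

    // Merge: non-background pixels of later buffers overwrite earlier ones.
    Buffer result = std::move(buffers[0]);
    for (unsigned t = 1; t < threadCount; ++t)
        for (std::size_t i = 0; i < pixelCount; ++i)
            if (buffers[t][i] != 0) result[i] = buffers[t][i];
    return result;
}
```

Because the chunks are contiguous and later buffers are merged over earlier ones, a pixel ends up with the value written by the last line that touched it, the same as drawing everything in order on one thread (as long as no line is drawn in the background value itself). The price is one full-size buffer per thread.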

Attempt 6: Try Someone Else's Computer

Like any developer who has an algorithm that they're positive should work but just isn't performing as expected, my next logical step was to try it on someone else's computer.  Obviously it's the computer, not the developer!

Intel 2.8 GHz Dual Core:

Attempt    Time in seconds
0          12
1          gave up
2          skipped
3          17
4          10
5          5

These results please me a bit more.  Not only is the non-CDC memory buffer algorithm faster than CDC in single-threaded mode (10 seconds versus 12), but I've reduced the rendering time by almost 60% overall (from 12 seconds to 5).  These are results I can be happy with.

Going even further, I suspect that if I were to run this on a quad-core processor with 4 threads, the rendering time would be around 2.5 seconds, assuming the speedup keeps scaling linearly with the number of cores.  Unfortunately, I don't have such a computer at my disposal, so I'll have to do that test another day.
