This is the fifth installment of a multi-part series demonstrating multithreading techniques and performance characteristics in VB.Net. Catch up on the previous installments: Introduction to multithreading, The Application Skeleton, Single Threaded Performance, and SyncLock Performance.

While the last post used SyncLock to mark a “critical section” of code to keep shared data consistent across threads, we are now going to take a look at the Mutex class. A word of warning: as the test results show, Mutex is an extremely slow mechanism! You want to be very careful when using Mutex. Mutex wraps a Win32 kernel object, so acquiring it means a transition into kernel mode. According to MSDN Magazine (September, 2006), Mutex can take 9801.6 CPU cycles to acquire a lock on a CPU without contention. Compared to 112 CPU cycles for a Win32 CRITICAL_SECTION (the same class of lightweight, in-process lock as the Monitor that SyncLock uses), that is roughly 87 times as expensive, so it is easy to see why the decision to use Mutex must be carefully weighed.
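To get a rough feel for that gap on your own hardware, a minimal sketch along the following lines times uncontended acquisitions of each lock type on a single thread. This is not the benchmark used in this series; the module name, the Acquisitions constant, and the use of Stopwatch are assumptions made purely for illustration, and the absolute numbers will vary from machine to machine.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Threading

Module LockCostSketch
  Sub Main()
    ' Hypothetical iteration count, chosen only for illustration.
    Const Acquisitions As Integer = 1000000
    Dim gate As New Object()
    Dim counter As Integer = 0

    ' Time uncontended SyncLock (Monitor) acquisitions.
    Dim sw As Stopwatch = Stopwatch.StartNew()
    For i As Integer = 1 To Acquisitions
      SyncLock gate
        counter += 1
      End SyncLock
    Next
    sw.Stop()
    Console.WriteLine("SyncLock: {0} ms for {1} acquisitions", sw.ElapsedMilliseconds, Acquisitions)

    ' Time uncontended Mutex acquisitions on the same thread.
    counter = 0
    Using m As New Mutex()
      sw = Stopwatch.StartNew()
      For i As Integer = 1 To Acquisitions
        m.WaitOne()
        Try
          counter += 1
        Finally
          ' Release in Finally so the mutex is never left held.
          m.ReleaseMutex()
        End Try
      Next
      sw.Stop()
    End Using
    Console.WriteLine("Mutex:    {0} ms for {1} acquisitions", sw.ElapsedMilliseconds, Acquisitions)
  End Sub
End Module
```

Because only one thread ever touches either lock, the difference between the two timings reflects acquisition overhead rather than contention.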

Here is the code used for this test:

Public Sub MutexMultiThreadComputation(ByVal Iterations As Integer, Optional ByVal ThreadCount As Integer = 0)
  Dim twMutexLock As MutexThreadWorker
  Dim IntegerIterationCounter As Integer
  Dim iOriginalMaxThreads As Integer
  Dim iOriginalMinThreads As Integer
  Dim iOriginalMaxIOThreads As Integer
  Dim iOriginalMinIOThreads As Integer

  twMutexLock = New MutexThreadWorker

  ' Remember the current ThreadPool limits so they can be restored afterwards.
  Threading.ThreadPool.GetMaxThreads(iOriginalMaxThreads, iOriginalMaxIOThreads)
  Threading.ThreadPool.GetMinThreads(iOriginalMinThreads, iOriginalMinIOThreads)

  ' A ThreadCount of 0 leaves the ThreadPool to manage its own limits.
  If ThreadCount > 0 Then
    Threading.ThreadPool.SetMaxThreads(ThreadCount, ThreadCount)
    Threading.ThreadPool.SetMinThreads(ThreadCount, ThreadCount)
  End If

  ' Queue one work item per iteration; the counter is passed along as a Double.
  For IntegerIterationCounter = 1 To Iterations
    Threading.ThreadPool.QueueUserWorkItem(AddressOf twMutexLock.ThreadProc, CDbl(IntegerIterationCounter))
  Next

  ' Busy-wait until every queued work item has reported completion.
  While MutexThreadWorker.IntegerCompletedComputations < Iterations
  End While

  ' Restore the original ThreadPool limits.
  Threading.ThreadPool.SetMaxThreads(iOriginalMaxThreads, iOriginalMaxIOThreads)
  Threading.ThreadPool.SetMinThreads(iOriginalMinThreads, iOriginalMinIOThreads)

  twMutexLock = Nothing
  IntegerIterationCounter = Nothing
End Sub

And the MutexThreadWorker class:

Public Class MutexThreadWorker
  Public Shared MutexStorageLock As New Threading.Mutex
  Public Shared MutexCompletedComputationsLock As New Threading.Mutex
  Public Shared IntegerCompletedComputations As Integer = 0
  Private Shared DoubleStorage As Double

  Public Property Storage() As Double
    Get
      MutexStorageLock.WaitOne()
      Try
        Return DoubleStorage
      Finally
        ' Finally runs even after Return, so the mutex is always released.
        MutexStorageLock.ReleaseMutex()
      End Try
    End Get
    Set(ByVal value As Double)
      MutexStorageLock.WaitOne()
      DoubleStorage = value
      MutexStorageLock.ReleaseMutex()
    End Set
  End Property

  Public Property CompletedComputations() As Integer
    Get
      Return IntegerCompletedComputations
    End Get
    Set(ByVal value As Integer)
      IntegerCompletedComputations = value
    End Set
  End Property

  Public Sub ThreadProc(ByVal StateObject As Object)
    Dim ttuComputation As ThreadTestUtilities

    ttuComputation = New ThreadTestUtilities

    ' The Storage setter takes MutexStorageLock around the write.
    Storage = ttuComputation.Compute(CDbl(StateObject))

    ' Guard the shared completion counter with its own mutex.
    MutexCompletedComputationsLock.WaitOne()
    CompletedComputations += 1
    MutexCompletedComputationsLock.ReleaseMutex()

    ttuComputation = Nothing
  End Sub

  Public Sub New()
  End Sub
End Class
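One detail of the harness worth checking on your own systems: ThreadPool.SetMaxThreads and ThreadPool.SetMinThreads both return a Boolean, and the documentation for current .NET versions states that the maximum cannot be set below the number of processors on the machine. The following is a minimal sketch, not part of the original test code, of how that return value could be inspected; the helper name TryPinThreadPool and the requested value of 1 are hypothetical.

```vbnet
Imports System
Imports System.Threading

Module ThreadPoolLimitSketch
  ' Hypothetical helper: applies the same limit to worker and I/O threads,
  ' as the harness does, but reports whether the pool accepted both calls.
  Function TryPinThreadPool(ByVal threadCount As Integer) As Boolean
    Dim maxAccepted As Boolean = ThreadPool.SetMaxThreads(threadCount, threadCount)
    Dim minAccepted As Boolean = ThreadPool.SetMinThreads(threadCount, threadCount)
    Return maxAccepted AndAlso minAccepted
  End Function

  Sub Main()
    ' Save the current limits so they can be restored at the end.
    Dim originalMax, originalMaxIO, originalMin, originalMinIO As Integer
    ThreadPool.GetMaxThreads(originalMax, originalMaxIO)
    ThreadPool.GetMinThreads(originalMin, originalMinIO)

    Dim requested As Integer = 1 ' illustrative value only
    If Not TryPinThreadPool(requested) Then
      Console.WriteLine("ThreadPool did not accept a limit of {0} ({1} logical CPUs).", requested, Environment.ProcessorCount)
    End If

    ' Show the limits that are actually in force.
    Dim maxNow, maxIONow As Integer
    ThreadPool.GetMaxThreads(maxNow, maxIONow)
    Console.WriteLine("Max worker threads now: {0}", maxNow)

    ' Restore the original limits.
    ThreadPool.SetMinThreads(originalMin, originalMinIO)
    ThreadPool.SetMaxThreads(originalMax, originalMaxIO)
  End Sub
End Module
```

If the call is rejected, the pool keeps its previous limits, so a test labeled with a given thread count may in fact be running with the defaults.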

Here are the results of our tests. All tests are for 100,000 iterations, and the results are in milliseconds per test run. This is a significant departure from previous posts in this series, where tests were performed with 1,000,000 iterations.

TEST 1

This test allows the ThreadPool to manage the total number of minimum and maximum threads on its own:

| System | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average |
|---|---|---|---|---|---|---|
| System A | 953.125 | 546.875 | 625.000 | 562.500 | 656.250 | 668.750 |
| System B | 733.886 | 765.115 | 624.584 | 702.657 | 687.042 | 702.657 |
| System C | 671.862 | 796.859 | 749.985 | 718.736 | 765.610 | 740.610 |
| System D | 2972.759 | 2925.820 | 2941.466 | 2910.174 | 2957.112 | 2941.466 |
| Average of all systems | | | | | | 1263.371 |

TEST 2

In this test, we limit the maximum number of threads to one per logical processor:

| System | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average |
|---|---|---|---|---|---|---|
| System A | 578.125 | 562.500 | 640.625 | 500.000 | 515.625 | 559.375 |
| System B | 624.854 | 780.730 | 687.042 | 640.198 | 702.657 | 687.096 |
| System C | 687.486 | 703.111 | 718.736 | 781.235 | 703.111 | 718.736 |
| System D | 2894.528 | 2941.466 | 2861.143 | 2954.561 | 2985.826 | 2927.505 |
| Average of all systems | | | | | | 1223.178 |

TEST 3

This test uses only one thread:

| System | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average |
|---|---|---|---|---|---|---|
| System A | 578.125 | 562.500 | 609.375 | 562.500 | 640.625 | 590.625 |
| System B | 640.198 | 780.730 | 655.813 | 702.657 | 765.115 | 708.903 |
| System C | 796.859 | 749.985 | 781.235 | 765.610 | 734.360 | 765.610 |
| System D | 2892.031 | 3001.459 | 2876.398 | 3001.459 | 3985.826 | 3151.435 |
| Average of all systems | | | | | | 1304.143 |

TEST 4

This test uses two concurrent threads:

| System | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average |
|---|---|---|---|---|---|---|
| System A | 703.125 | 500.000 | 640.625 | 578.125 | 671.875 | 618.750 |
| System B | 733.886 | 749.500 | 671.427 | 718.271 | 640.198 | 702.656 |
| System C | 859.358 | 687.486 | 671.862 | 703.111 | 718.736 | 728.111 |
| System D | 2953.635 | 2906.752 | 2891.124 | 2906.752 | 2984.890 | 2928.631 |
| Average of all systems | | | | | | 1244.537 |

TEST 5

Here we show four concurrent threads:

| System | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average |
|---|---|---|---|---|---|---|
| System A | 562.500 | 609.375 | 531.250 | 515.625 | 546.875 | 553.125 |
| System B | 765.115 | 655.813 | 687.042 | 718.271 | 733.886 | 712.025 |
| System C | 781.235 | 749.985 | 828.109 | 718.736 | 874.983 | 790.610 |
| System D | 2954.561 | 2985.826 | 2923.296 | 2907.663 | 2938.928 | 2942.055 |
| Average of all systems | | | | | | 1249.454 |

TEST 6

This test uses eight concurrent threads:

| System | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average |
|---|---|---|---|---|---|---|
| System A | 640.625 | 562.500 | 609.375 | 625.000 | 546.875 | 596.875 |
| System B | 890.032 | 671.427 | 718.271 | 687.042 | 640.198 | 721.394 |
| System C | 703.111 | 812.484 | 734.360 | 765.610 | 796.859 | 762.485 |
| System D | 2985.826 | 2970.194 | 3000.518 | 2875.496 | 2969.263 | 2960.259 |
| Average of all systems | | | | | | 1260.253 |

TEST 7

Finally, this test runs 16 simultaneous threads:

| System | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average |
|---|---|---|---|---|---|---|
| System A | 609.375 | 562.500 | 546.875 | 625.000 | 578.125 | 584.375 |
| System B | 749.500 | 780.730 | 655.813 | 671.427 | 640.198 | 699.534 |
| System C | 828.109 | 718.736 | 749.985 | 796.859 | 703.111 | 759.360 |
| System D | 5438.439 | 5407.184 | 5329.045 | 5157.141 | 5235.279 | 5313.418 |
| Average of all systems | | | | | | 1839.172 |

System A: AMD Sempron 3200 (1 logical x64 CPU), 1 GB RAM
System B: AMD Athlon 3200+ (1 logical x64 CPU), 1 GB RAM
System C: Intel Pentium 4 2.8 GHz (1 logical x86 CPU), 1 GB RAM
System D: Two Intel Xeon 3.0 GHz (2 dual-core, Hyper-Threaded CPUs providing 8 logical x64 CPUs), 2 GB RAM

It is extremely important to understand the following information and disclaimers regarding these benchmark figures:

They are not to be taken as absolute numbers. They are taken on real-world systems with real-world OS installations, not clean benchmark systems. They are not to be used as any concrete measure of relative CPU performance; they simply illustrate the different relative performance characteristics of different multithreading techniques on different numbers of logical CPUs, in order to show how different processors can perform differently with different techniques.

Testing revealed some very unusual performance characteristics. Although a little bit of “spin up” in .Net code is always expected, due to JIT compilation, assembly loading, and whatnot, initial test runs sometimes took much longer than subsequent tests. Furthermore, tests with high numbers of iterations (such as 1,000,000) seemed to take quite some time to “spin down.” Pinpointing the cause would require a much more in-depth investigation. However, there does seem to be a “make or break” point for this code on each individual system, beyond which something goes very wrong outside of the code itself. It may quite well be the garbage collector; the code itself had finished running, but the application was still consuming massive amounts of RAM and CPU time long after it reported success.

There may be some tweaks that can be made to the code itself to assist the garbage collector, but altogether, it seems like Mutex is indeed an incredibly slow mechanism. The speed difference between the Xeon system and the other systems seems to indicate that context switching is also killing performance, since the Xeon system is running many more threads at once.
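One factor that may be worth ruling out when investigating further is the empty While loop in the test harness, which keeps the calling thread spinning, and therefore competing for CPU time, until the last work item finishes. The sketch below shows one possible alternative, a blocking wait on a ManualResetEvent. It is not the code used for these tests: the module, the Work routine, and its Math.Sqrt stand-in for the real computation are all hypothetical, and it counts completions with Interlocked rather than Mutex, so it would not measure the same thing.

```vbnet
Imports System
Imports System.Threading

Module CompletionWaitSketch
  ' Number of work items that have not finished yet.
  Private pending As Integer
  ' Signaled by whichever work item finishes last.
  Private allDone As New ManualResetEvent(False)

  Private Sub Work(ByVal state As Object)
    ' Stand-in for the real computation (hypothetical).
    Dim result As Double = Math.Sqrt(CDbl(state))

    ' The work item that brings the count to zero wakes the waiting thread.
    If Interlocked.Decrement(pending) = 0 Then
      allDone.Set()
    End If
  End Sub

  Sub Main()
    Const Iterations As Integer = 100000
    pending = Iterations

    For i As Integer = 1 To Iterations
      ThreadPool.QueueUserWorkItem(AddressOf Work, CDbl(i))
    Next

    ' Block without consuming CPU until every work item has completed.
    allDone.WaitOne()
    Console.WriteLine("All {0} work items completed.", Iterations)
  End Sub
End Module
```

The waiting thread sleeps inside WaitOne instead of polling the counter, which leaves the processor free for the worker threads.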

J.Ja