This is the fifth installment of a multi-part series demonstrating multithreading techniques and performance characteristics in VB.Net. Catch up on the previous installments: Introduction to multithreading, The Application Skeleton, Single Threaded Performance, and SyncLock Performance.
While the last post used SyncLock to mark a “critical section” of code to maintain concurrency, we are now going to take a look at the Mutex class. A word of warning: as the test results show, Mutex is an extremely slow synchronization mechanism! You want to be very careful when using Mutex. According to MSDN Magazine (September, 2006), acquiring a Mutex can take 9,801.6 CPU cycles even without contention. Compared to the 112 CPU cycles of a Win32 CRITICAL_SECTION (which SyncLock uses), it is easy to see why the decision to use Mutex must be carefully weighed.
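To see the difference in usage before diving into the test code, here is a small side-by-side sketch (the identifiers here are illustrative, not part of the test harness) guarding the same counter both ways. SyncLock compiles down to Monitor.Enter/Exit, which stays in user mode on the fast path; Mutex is a kernel object, so every acquisition pays for a kernel transition:

```
Public Class LockComparison
    ' SyncLock needs only a plain object to lock on
    Private Shared CounterLock As New Object
    ' Mutex is a full kernel synchronization object
    Private Shared CounterMutex As New Threading.Mutex
    Private Shared Counter As Integer = 0

    Public Shared Sub IncrementWithSyncLock()
        SyncLock CounterLock
            Counter += 1
        End SyncLock
    End Sub

    Public Shared Sub IncrementWithMutex()
        CounterMutex.WaitOne()
        Counter += 1
        CounterMutex.ReleaseMutex()
    End Sub
End Class
```

The code shapes are nearly identical, which is exactly why the enormous cost difference is easy to overlook.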
Here is the code used for this test:
Public Sub MutexMultiThreadComputation(ByVal Iterations As Integer, Optional ByVal ThreadCount As Integer = 0)
    Dim twMutexLock As MutexThreadWorker
    Dim IntegerIterationCounter As Integer
    Dim iOriginalMaxThreads As Integer
    Dim iOriginalMinThreads As Integer
    Dim iOriginalMaxIOThreads As Integer
    Dim iOriginalMinIOThreads As Integer

    twMutexLock = New MutexThreadWorker

    ' Save the current ThreadPool settings so they can be restored afterwards
    Threading.ThreadPool.GetMaxThreads(iOriginalMaxThreads, iOriginalMaxIOThreads)
    Threading.ThreadPool.GetMinThreads(iOriginalMinThreads, iOriginalMinIOThreads)

    If ThreadCount > 0 Then
        Threading.ThreadPool.SetMaxThreads(ThreadCount, ThreadCount)
        Threading.ThreadPool.SetMinThreads(ThreadCount, ThreadCount)
    End If

    For IntegerIterationCounter = 1 To Iterations
        Threading.ThreadPool.QueueUserWorkItem(AddressOf twMutexLock.ThreadProc, Double.Parse(IntegerIterationCounter))
    Next

    ' Busy-wait until every queued work item has reported completion
    While MutexThreadWorker.IntegerCompletedComputations < Iterations
    End While

    Threading.ThreadPool.SetMaxThreads(iOriginalMaxThreads, iOriginalMaxIOThreads)
    Threading.ThreadPool.SetMinThreads(iOriginalMinThreads, iOriginalMinIOThreads)

    twMutexLock = Nothing
    IntegerIterationCounter = Nothing
End Sub
And the MutexThreadWorker class:
Public Class MutexThreadWorker
    Public Shared MutexStorageLock As New Threading.Mutex
    Public Shared MutexCompletedComputationsLock As New Threading.Mutex
    Public Shared IntegerCompletedComputations As Integer = 0
    Private Shared DoubleStorage As Double

    Public Property Storage() As Double
        Get
            ' Copy the value into a local before releasing the Mutex;
            ' a Return statement before ReleaseMutex() would never release
            ' the Mutex, deadlocking every other thread.
            Dim dCurrentValue As Double
            MutexStorageLock.WaitOne()
            dCurrentValue = DoubleStorage
            MutexStorageLock.ReleaseMutex()
            Return dCurrentValue
        End Get
        Set(ByVal value As Double)
            MutexStorageLock.WaitOne()
            DoubleStorage = value
            MutexStorageLock.ReleaseMutex()
        End Set
    End Property

    Public Property CompletedComputations() As Integer
        Get
            Return IntegerCompletedComputations
        End Get
        Set(ByVal value As Integer)
            IntegerCompletedComputations = value
        End Set
    End Property

    Public Sub ThreadProc(ByVal StateObject As Object)
        Dim ttuComputation As ThreadTestUtilities
        ttuComputation = New ThreadTestUtilities
        Storage = ttuComputation.Compute(CDbl(StateObject))
        MutexCompletedComputationsLock.WaitOne()
        CompletedComputations += 1
        MutexCompletedComputationsLock.ReleaseMutex()
        ttuComputation = Nothing
    End Sub

    Public Sub New()
    End Sub
End Class
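One caution about this WaitOne/ReleaseMutex pattern in general: if the code between the two calls throws an exception, the Mutex is never released and every other waiting thread deadlocks. A more defensive version (a sketch, not part of the benchmarked code) wraps the release in a Finally block:

```
Public Shared Sub IncrementSafely()
    MutexCompletedComputationsLock.WaitOne()
    Try
        IntegerCompletedComputations += 1
    Finally
        ' Runs even if the guarded code throws, so the Mutex
        ' is never left held by a dead code path.
        MutexCompletedComputationsLock.ReleaseMutex()
    End Try
End Sub
```

The test code keeps the unguarded form because the guarded operations cannot throw, but in production code the Try/Finally form is the safer habit.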
Here are the results of our tests. All tests are for 100,000 iterations, and the results are in milliseconds per test run. This is a significant departure from previous posts in this series, where tests were performed with 1,000,000 iterations.
TEST 1
This test allows the ThreadPool to manage the total number of minimum and maximum threads on its own:
System | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Average |
System A | 953.125 | 546.875 | 625.000 | 562.500 | 656.250 | 668.750 |
System B | 733.886 | 765.115 | 624.584 | 702.657 | 687.042 | 702.657 |
System C | 671.862 | 796.859 | 749.985 | 718.736 | 765.610 | 740.610 |
System D | 2972.759 | 2925.820 | 2941.466 | 2910.174 | 2957.112 | 2941.466 |
Average | 1263.371 |
TEST 2
In this test, we limit the maximum number of threads to one per logical processor:
System | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Average |
System A | 578.125 | 562.500 | 640.625 | 500.000 | 515.625 | 559.375 |
System B | 624.854 | 780.730 | 687.042 | 640.198 | 702.657 | 687.096 |
System C | 687.486 | 703.111 | 718.736 | 781.235 | 703.111 | 718.736 |
System D | 2894.528 | 2941.466 | 2861.143 | 2954.561 | 2985.826 | 2927.505 |
Average | 1223.178 |
TEST 3
This test uses only one thread:
System | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Average |
System A | 578.125 | 562.500 | 609.375 | 562.500 | 640.625 | 590.625 |
System B | 640.198 | 780.730 | 655.813 | 702.657 | 765.115 | 708.903 |
System C | 796.859 | 749.985 | 781.235 | 765.610 | 734.360 | 765.610 |
System D | 2892.031 | 3001.459 | 2876.398 | 3001.459 | 3985.826 | 3151.435 |
Average | 1304.143 |
TEST 4
This test uses two concurrent threads:
System | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Average |
System A | 703.125 | 500.000 | 640.625 | 578.125 | 671.875 | 618.750 |
System B | 733.886 | 749.500 | 671.427 | 718.271 | 640.198 | 702.656 |
System C | 859.358 | 687.486 | 671.862 | 703.111 | 718.736 | 728.111 |
System D | 2953.635 | 2906.752 | 2891.124 | 2906.752 | 2984.890 | 2928.631 |
Average | 1244.537 |
TEST 5
Here we show four concurrent threads:
System | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Average |
System A | 562.500 | 609.375 | 531.250 | 515.625 | 546.875 | 553.125 |
System B | 765.115 | 655.813 | 687.042 | 718.271 | 733.886 | 712.025 |
System C | 781.235 | 749.985 | 828.109 | 718.736 | 874.983 | 790.610 |
System D | 2954.561 | 2985.826 | 2923.296 | 2907.663 | 2938.928 | 2942.055 |
Average | 1249.454 |
TEST 6
This test uses eight concurrent threads:
System | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Average |
System A | 640.625 | 562.500 | 609.375 | 625.000 | 546.875 | 596.875 |
System B | 890.032 | 671.427 | 718.271 | 687.042 | 640.198 | 721.394 |
System C | 703.111 | 812.484 | 734.360 | 765.610 | 796.859 | 762.485 |
System D | 2985.826 | 2970.194 | 3000.518 | 2875.496 | 2969.263 | 2960.259 |
Average | 1260.253 |
TEST 7
Finally, this test runs 16 simultaneous threads:
System | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Average |
System A | 609.375 | 562.500 | 546.875 | 625.000 | 578.125 | 584.375 |
System B | 749.500 | 780.730 | 655.813 | 671.427 | 640.198 | 699.534 |
System C | 828.109 | 718.736 | 749.985 | 796.859 | 703.111 | 759.360 |
System D | 5438.439 | 5407.184 | 5329.045 | 5157.141 | 5235.279 | 5313.418 |
Average | 1839.172 |
System A: AMD Sempron 3200 (1 logical x64 CPU), 1 GB RAM
System B: AMD Athlon 3200+ (1 logical x64 CPU), 1 GB RAM
System C: Intel Pentium 4 2.8 GHz (1 logical x86 CPU), 1 GB RAM
System D: Two Intel Xeon 3.0 GHz (2 dual core, HyperThreaded CPUs providing 8 logical x64 CPUs), 2 GB RAM
It is extremely important to understand the following information and disclaimers regarding these benchmark figures:
They are not to be taken as absolute numbers. They are taken on real-world systems with real-world OS installations, not clean benchmark systems. They are not to be used as any concrete measure of relative CPU performance; they simply illustrate the relative performance characteristics of different multithreading techniques on different numbers of logical CPUs, in order to show how different processors can perform differently with different techniques.
Testing revealed some very unusual performance characteristics. Although a little bit of “spin up” in .Net code is always expected, due to JIT compilation, the GAC, and so on, initial test runs sometimes took much longer than subsequent tests. Furthermore, tests with high iteration counts (such as 1,000,000) seemed to take quite some time to “spin down.” Pinpointing the cause would require much more in-depth investigation. However, there does seem to be a definite “make or break” point, varying from system to system, at which something goes very wrong outside of the code itself. It may well be the garbage collector; the code itself had finished running, but the application was still consuming massive amounts of RAM and CPU time long after it reported success. There may be tweaks that could be made to the code to assist the garbage collector, but altogether, it seems that Mutex is indeed an incredibly slow mechanism. The speed difference between the Xeon system and the other systems also suggests that context switching is killing performance, since that system is running many more threads at once.
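As a closing aside: when the only shared state being protected is a simple counter like IntegerCompletedComputations, the Mutex (and its kernel transition) can be avoided entirely with the Interlocked class, which performs the increment as a single atomic operation. This is a sketch of the substitution, not a variant that was benchmarked here:

```
Public Class InterlockedCounterSketch
    Public Shared IntegerCompletedComputations As Integer = 0

    Public Sub ThreadProc(ByVal StateObject As Object)
        ' Replaces the WaitOne / increment / ReleaseMutex sequence:
        ' the increment happens atomically, with no lock object at all,
        ' and is safe under any number of concurrent callers.
        Threading.Interlocked.Increment(IntegerCompletedComputations)
    End Sub
End Class
```

We will look at how techniques like this compare in later installments.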
J.Ja