Parallel Programming with C# – Data Parallelism

Data parallelism and task parallelism are two ways of doing multi-core parallel programming in C#. This post covers data parallelism; task parallelism follows in the next post, because both topics have too much material to fit in a single article. So let’s see how data parallelism works in C# and how we can use it for multi-core programming.

In data parallelism, the same task is executed on parallel threads, each working on a different subset of data that is in turn part of one collection. For example, suppose you have a collection and must perform a task on every element of it. Normally you would loop through the elements one by one and perform the task on each. Data parallelism lets you break the collection into separate sets and feed each set to a parallel thread that performs the task. On a multi-core machine this is usually faster than sequential execution, provided the work per element outweighs the cost of partitioning and merging.

One important point about data parallelism is that the framework takes care of breaking the collection into individual sets and of merging the results after parallel execution.

PLINQ and the Parallel class are the two most important components in .NET Framework 4.0 that support this kind of parallel execution.

As mentioned earlier, data parallelism breaks the collection into individual sets and later merges the results. PLINQ supports both steps completely, while the Parallel class partitions the data by default but leaves merging or collating the results to you.
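To make the difference concrete, here is a minimal sketch (not taken from the original post) that computes the same sum of squares both ways. The class and method names are made up for illustration. The PLINQ version lets the query merge the partial results, while the Parallel.For version uses the thread-local overload and merges the per-thread subtotals by hand.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// MergeDemo and its method names are hypothetical, chosen for this sketch.
public static class MergeDemo
{
    // PLINQ: partitioning and merging are both handled by the query.
    public static long SumOfSquaresPlinq(int[] data)
    {
        return data.AsParallel().Sum(x => (long)x * x);
    }

    // Parallel.For: the range is partitioned for us, but we combine
    // the per-thread subtotals ourselves in the last delegate.
    public static long SumOfSquaresParallelFor(int[] data)
    {
        long total = 0;
        Parallel.For(0, data.Length,
            () => 0L,                                            // initial subtotal for each thread
            (i, loopState, subtotal) => subtotal + (long)data[i] * data[i],
            subtotal => Interlocked.Add(ref total, subtotal));   // manual, thread-safe merge
        return total;
    }

    public static void Main()
    {
        int[] data = Enumerable.Range(1, 1000).ToArray();
        Console.WriteLine(SumOfSquaresPlinq(data));
        Console.WriteLine(SumOfSquaresParallelFor(data));
    }
}
```

Both methods should print the same value; the second one simply shows where the manual merge step lives.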

 Using PLINQ for Data Parallelism

The AsParallel() method of PLINQ is the simplest way to achieve data parallelism through a LINQ query.

List<int> dataSet = new List<int>(new int[4] { 1, 2, 3, 4 });
List<int> squaredData;

squaredData = dataSet.AsParallel().Select(data => data * data).ToList();

squaredData.ForEach(data => Console.WriteLine(data.ToString()));

 The input data set is partitioned and each set is assigned to a parallel thread for execution. The end result is collated and combined into one output.

A PLINQ query can only be used with in-memory .NET collection objects. It cannot be used with a LINQ query that is translated into SQL and executed on the database, because the .NET Framework cannot parallelize work that runs inside SQL Server. However, once that result set has been returned to the application in the form of a .NET collection class, PLINQ can be used on it.

Also, when we use the AsParallel method, the results are not guaranteed to come back in the same order as the input data, because the work is done on different threads and their relative sequence is not maintained. AsOrdered() can be used to make the PLINQ output preserve the order of the source.

squaredData = dataSet.AsParallel().AsOrdered().Select(data => data * data).ToList();

Similarly, AsUnordered() can be used to remove any ordering requirement from the result set, and it can also be applied partway through a PLINQ query.
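The effect of ordering is easiest to see side by side. The sketch below is my own illustration with hypothetical names: it squares the same input with and without AsOrdered(). The ordered result always matches the input order; the unordered one contains the same values but may come back in a different sequence from run to run.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// OrderingDemo is a hypothetical helper class for this sketch.
public static class OrderingDemo
{
    // Output order is guaranteed to follow the source order.
    public static List<int> SquaresOrdered(IEnumerable<int> source)
    {
        return source.AsParallel().AsOrdered().Select(x => x * x).ToList();
    }

    // Same values, but the order is whatever the scheduler produces.
    public static List<int> SquaresUnordered(IEnumerable<int> source)
    {
        return source.AsParallel().Select(x => x * x).ToList();
    }

    public static void Main()
    {
        List<int> data = Enumerable.Range(1, 20).ToList();
        Console.WriteLine(string.Join(",", SquaresOrdered(data)));
        Console.WriteLine(string.Join(",", SquaresUnordered(data)));
    }
}
```

Preserving order has a cost, so it is worth asking for it only when the consumer of the results actually depends on the sequence.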

Using PLINQ with Joins

It is important to call the AsParallel method on every input source when using joins in a query; otherwise the query fails at run time with an exception (a NotSupportedException).

The correct use of PLINQ with joins is as below:

List<int> leftDataSet = new List<int>(new int[4] { 1, 2, 3, 4 });
List<int> rightDataSet = new List<int>(new int[4] { 2, 4, 6, 8 });

var resultSet = from x in leftDataSet.AsParallel()
                join y in rightDataSet.AsParallel() on x equals y
                select x * y;

resultSet.ForAll(data => MessageBox.Show(data.ToString()));

Similarly, if a query has more than one input set, as with group join, union and other such operations, AsParallel must be used on all of the input sets.
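As a small sketch of that rule (my own example, with hypothetical names), the code below applies AsParallel() to both inputs of a union and of a group join, using the same two data sets as the join example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// MultiSourceDemo is a hypothetical helper class for this sketch.
public static class MultiSourceDemo
{
    // Union of two sources; both sides are ParallelQuery<int>.
    public static List<int> ParallelUnion(List<int> left, List<int> right)
    {
        return left.AsParallel()
                   .Union(right.AsParallel())
                   .OrderBy(x => x)          // sort so the output is predictable
                   .ToList();
    }

    // Group join: for each left value, count the matching right values.
    // Assumes the left values are distinct, since they become dictionary keys.
    public static Dictionary<int, int> MatchCounts(List<int> left, List<int> right)
    {
        return left.AsParallel()
                   .GroupJoin(right.AsParallel(),
                              x => x,
                              y => y,
                              (x, matches) => new { Key = x, Count = matches.Count() })
                   .ToDictionary(p => p.Key, p => p.Count);
    }

    public static void Main()
    {
        var left = new List<int> { 1, 2, 3, 4 };
        var right = new List<int> { 2, 4, 6, 8 };
        Console.WriteLine(string.Join(",", ParallelUnion(left, right)));
        foreach (var pair in MatchCounts(left, right).OrderBy(p => p.Key))
            Console.WriteLine(pair.Key + " -> " + pair.Value);
    }
}
```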

Degree of Parallelism with PLINQ

AsParallel().WithDegreeOfParallelism(X) sets the maximum number of tasks that a query runs concurrently. Use it when you think the default parallelism is not enough and each task executed on a parallel thread is time consuming. For example, if the number of available cores is x, the number of tasks is greater than x, and each task involves some waiting period, a higher degree of parallelism lets more tasks be in flight at once, so a waiting task does not hold up the work that is still queued.
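Here is a minimal sketch of that scenario. SlowSquare is a hypothetical stand-in for work that mostly waits (for example a web or disk call), simulated with Thread.Sleep; the degree of 8 is an arbitrary value chosen for illustration, not a recommendation.

```csharp
using System;
using System.Linq;
using System.Threading;

// DegreeDemo and SlowSquare are hypothetical names for this sketch.
public static class DegreeDemo
{
    // Simulates a task that spends most of its time waiting.
    public static int SlowSquare(int x)
    {
        Thread.Sleep(50);
        return x * x;
    }

    // Runs the query with at most 'degree' concurrently executing tasks.
    public static int[] SquareAll(int[] data, int degree)
    {
        return data.AsParallel()
                   .AsOrdered()
                   .WithDegreeOfParallelism(degree)
                   .Select(x => SlowSquare(x))
                   .ToArray();
    }

    public static void Main()
    {
        int[] data = Enumerable.Range(1, 16).ToArray();
        Console.WriteLine(string.Join(",", SquareAll(data, 8)));
    }
}
```

Note that the value is an upper bound: the query may use fewer tasks than requested.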

Exception Handling with PLINQ

A System.AggregateException is thrown if an exception occurs on any of the threads spawned by the PLINQ query, and the query stops executing on the other parallel threads as well.

We can catch this exception in our application and carry on with the normal flow of the application, but the drawback is that the query does not run to completion.

List<int> leftDataSet = new List<int>(new int[4] { 1, 2, 3, 4 });
List<int> rightDataSet = new List<int>(new int[4] { 2, 4, 6, 8 });

// "on 1 equals 1" pairs every x with every y
var resultSet = from x in leftDataSet.AsParallel()
                join y in rightDataSet.AsParallel() on 1 equals 1
                select x / (x - y); // DivideByZeroException is raised when x == y

try
{
    resultSet.ForAll(data => MessageBox.Show(data.ToString()));
}
catch (AggregateException e)
{
    foreach (var ex in e.InnerExceptions)
    {
        MessageBox.Show(ex.Message);
        if (ex is DivideByZeroException)
            MessageBox.Show("The data source is corrupt. Query stopped.");
    }
}

If you want to catch the exception for each element individually and still have the query run to completion, you can wrap the error-prone code in a delegate that handles its own exceptions and call that delegate from the PLINQ query.

List<int> leftDataSet = new List<int>(new int[4] { 1, 2, 3, 4 });
List<int> rightDataSet = new List<int>(new int[4] { 2, 4, 6, 8 });

// Handles the exception per element so that the query can run to completion
Func<int, int, int> GetData = (x, y) =>
{
    try
    {
        return x / (x - y);
    }
    catch (DivideByZeroException ex)
    {
        MessageBox.Show(ex.Message);
        return 0;
    }
};

var resultSet = from x in leftDataSet.AsParallel()
                join y in rightDataSet.AsParallel() on 1 equals 1
                select GetData(x, y);

try
{
    resultSet.ForAll(data => MessageBox.Show(data.ToString()));
}
catch (AggregateException e)
{
    foreach (var ex in e.InnerExceptions)
    {
        MessageBox.Show(ex.Message);
        if (ex is DivideByZeroException)
            MessageBox.Show("The data source is corrupt. Query stopped.");
    }
}

Using Parallel Class for Data Parallelism

The System.Threading.Tasks.Parallel class provides parallel implementations of the for and foreach loops, so you do not have to create separate threads, queue work items or take locks to distribute the work when executing a task in parallel over a collection of data. Iterations may run on parallel threads instead of sequentially, and, as in a PLINQ query, an exception on any one thread stops the loop from starting further iterations.
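The exception behaviour mirrors PLINQ: the loop surfaces failures as an AggregateException. The sketch below is an illustration of my own with hypothetical names; it divides by each element of an array inside Parallel.ForEach and counts the DivideByZeroException instances that come back.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// LoopExceptionDemo is a hypothetical helper class for this sketch.
public static class LoopExceptionDemo
{
    // Returns how many DivideByZeroExceptions the parallel loop reported.
    public static int CountFailures(int[] divisors)
    {
        try
        {
            Parallel.ForEach(divisors, d =>
            {
                int quotient = 100 / d;   // throws DivideByZeroException when d == 0
            });
            return 0;                     // the loop ran to completion
        }
        catch (AggregateException e)
        {
            // Exceptions thrown inside the loop body arrive wrapped here.
            return e.InnerExceptions.Count(ex => ex is DivideByZeroException);
        }
    }

    public static void Main()
    {
        Console.WriteLine(CountFailures(new[] { 1, 2, 0, 4 }));
        Console.WriteLine(CountFailures(new[] { 1, 2, 3, 4 }));
    }
}
```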

For, ForEach and Invoke are the three methods of the Parallel class that provide data parallelism functionality in the .NET Framework.

Parallel.For

Sequential Loop:

for (int i = 0; i < 100; i++)
    PerformSomeTask(i);

Parallel Implementation:

Parallel.For(0, 100, i => PerformSomeTask(i));

Parallel.ForEach

Sequential Loop:

foreach (var ex in e.InnerExceptions)
{
    MessageBox.Show(ex.Message);
}

Parallel Implementation:

Parallel.ForEach(e.InnerExceptions, ex => MessageBox.Show(ex.Message));

Breaking the loop in parallel execution:

ParallelLoopState.Break(): Used when we have to process values from a collection up to a certain index and need behaviour similar to break in normal sequential execution. Break() ensures that all iterations with an index lower than the one that called Break() are still completed.

Parallel.ForEach(e.InnerExceptions, (ex, loopState) =>
{
    MessageBox.Show(ex.Message);
    loopState.Break();
});

ParallelLoopState.Stop(): Used in scenarios where a search is being done in the loop. It requests the runtime to stop all iterations as soon as possible, without guaranteeing that the remaining iterations are run. So if Stop is used while retrieving 100 elements from a collection, it may or may not return all 100 elements. Once the Stop method is called, the IsStopped property of ParallelLoopState is set to true.
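A typical use of Stop() is a "find any match" search. The following is a minimal sketch under my own assumptions (the class and method names are hypothetical): as soon as one iteration finds the target it records the index and calls Stop(), and other iterations check IsStopped so they can return early.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// StopDemo is a hypothetical helper class for this sketch.
public static class StopDemo
{
    // Returns the index of some element equal to target, or -1 if there is none.
    // If several elements match, any one of their indexes may be returned.
    public static int FindAny(int[] data, int target)
    {
        int foundIndex = -1;
        Parallel.For(0, data.Length, (i, loopState) =>
        {
            if (loopState.IsStopped)
                return;                                  // another iteration already found a match

            if (data[i] == target)
            {
                Interlocked.Exchange(ref foundIndex, i); // publish the result safely
                loopState.Stop();                        // ask the remaining iterations not to run
            }
        });
        return foundIndex;
    }

    public static void Main()
    {
        int[] data = Enumerable.Range(0, 1000).ToArray();
        Console.WriteLine(FindAny(data, 637));
        Console.WriteLine(FindAny(data, 5000));
    }
}
```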

Parallel.Invoke

It allows you to invoke multiple delegates in parallel. Since the results of the Parallel class methods have to be collated by the user, you need to wait for all the delegates to finish and store their results in a collection. The best way to store these results is in a concurrent collection, as these provide a thread-safe way to collate your data.

The output of Parallel.Invoke varies from run to run, depending on how the parallel threads are scheduled. [My system is a single-core machine.]

Internally, the Invoke method works with tasks, but it is more intelligent in handling a large number of operations. Instead of creating a separate task for every delegate passed to it, it groups the delegates into batches and then executes them.
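As a rough sketch of collating Parallel.Invoke results (this is my own example with hypothetical names, not code from the original post), the three delegates below each compute a sum and add it to a ConcurrentBag. Parallel.Invoke returns only after all of them have finished, so the bag is complete when it is read.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// InvokeDemo is a hypothetical helper class for this sketch.
public static class InvokeDemo
{
    public static int[] RunAll()
    {
        // Thread-safe collection that gathers the result of each delegate.
        var results = new ConcurrentBag<int>();

        // Blocks until all three delegates have completed.
        Parallel.Invoke(
            () => results.Add(Enumerable.Range(1, 10).Sum()),
            () => results.Add(Enumerable.Range(1, 100).Sum()),
            () => results.Add(Enumerable.Range(1, 1000).Sum()));

        // The bag has no defined order, so sort before returning.
        return results.OrderBy(x => x).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", RunAll()));
    }
}
```

The order in which the delegates add to the bag depends on scheduling, which is why the sketch sorts the values before using them.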

These are the basic details of achieving data parallelism, which can get you started with parallel programming in C#. More advanced concepts are covered on MSDN and in other online resources. You can also let me know about any specific topic that you would like to discuss further.
