## Introduction

In this article, I will demonstrate how to run an elementary Monte Carlo experiment on Azure Batch. More specifically, we are going to estimate π.

☝🏻 Throughout this post, I assume a basic understanding of Azure Batch. For an introductory read, please refer to the official documentation.

There are three variants for calculating π in a probabilistic manner, these are:

1. The unit-circle/unit-square method
2. The integral way through the calculation of the average of functions
3. Buffon's needle problem

There are more efficient methods to calculate π than the one we will use here (e.g., Spigots algorithm). However, we are going to stick with the first one, as it is easy to follow and well-known.

If you would like to know more about the mathematical details, this blog post by Maike Elisa provides excellent explanations, which you can find here.

Estimating π with a Monte Carlo experiment can be solved intrinsically parallel, meaning worker tasks do not have to exchange information. Each instance can complete part of the work on its own. This approach allows to run the workload at a large scale and reduce code complexity.

This type of workload contrasts to so-called tightly coupled workloads, where tasks need to communicate with each other (usually via Message Passing Interface, MPI). Some example workloads are fluid dynamics, finite element analysis, and others. However, these are problems we are not going to deal with in this article.

Before we jump in and start with some math 🤓👨🏻‍🔬 here is the link to the repository. The exemplary code is written in .NET Core/C#.

## Mathematical background

To determine π, we are going to create a massive amount of random points on a plane that encloses a unit circle, where `(x, y|x ∈ [-1..1] ∧ y ∈ [-1..1])`.

Then we will test if they fall in the unit circle by using Pythagoras' theorem `x² + y² ≤ 1` and counting those hits.

As the circle's area can be calculated by `A = r²π` and the area of the square by `A = (2r)²`, we can put them into proportion and then deduce π as follows.

### How to parallel the problem?

Now that we would d like to solve the problem with Azure Batch, we need a way to break the problem apart into smaller pieces.

So let us cut the plane into stripes and let each worker node generate random points on its unit of work plane (the stripe). We also could have cut the plane into squares, but using stripes eases the splitting part.

In the following picture, we created working units for four nodes, where each will generate random points with `(x0, y0|x0 ∈ [-1..-0.5] ∧ y0 ∈ [-1..1])` and `(x1, y1|x1 ∈ [-0.5..0] ∧ y1 ∈ [-1..1])`, etc...

Then, we add the number of circle hits on each of the units of work planes and divide it by the total number of random points generated, resulting in a rough estimation of π.

### A more optimized approach to parallel the problem

The described approach to parallel the problem is a primitive one and not very efficient. By cutting the plane into stripes, we enclose huge areas known to be in the circle and therefore waste computation time.

Instead, we could divide the area into smaller squares. Then we would only generate random points on those sub-squares, which have an intersection with the circle (e.g., number 3).

For example, generating random points on sub-square number 1 does not make sense, as all would end up outside the circle. The same goes for sub-square number 2, where all points would end up in the circle. These sub-squares can be added as constants to the final summation.

If a sub-square requires computation (e.g., number 3), it could easily be tested by checking if at least one edge is in and another one outside the circle (test with `x² + y² ≤ 1` for each edge).

So, the given example would reduce the number of areas from 64 to 28, which is a time (and money) saver! However, that exercise is left for some other rainy day.

## The final solution

With the math being out of the way, we can now walk through the final application. This section is going to describe the components and only the most relevant parts of the workflow. These are:

• Performing the computation
• Monitoring and waiting for the tasks to finish
• Retrieving logs & results from tasks

Hopefully, the other parts of the source code are self-explanatory. Here is the link to the repository.

### Components

The batch solution is following the blueprint architecture provided by Microsoft. It is consists of two main components, the Cluster Head, which controls the operations, and the Workload Runner doing the actual computation on the nodes.

☝🏻 Please do not get confused! Microsoft usually calls the Cluster Head just client, application, or service.

• Runs on a workstation
• Splits input data into chunks and uploads them to the storage account
• Creates a job and defines the tasks
• Collects partial results and aggregates them

• Runs on pool nodes
• Stored as an application package (manually uploaded)
• Gets triggered by each task
• Performs computation
• Uploads output (partial result) to the storage account

For splitting the data, I have a method that takes the total number of random numbers and units (input data for tasks) to generate.

So when calling, e.g., `var units = Tools.GenerateUnits(100, 4)`, we will get four units, where each will generate 25 random points. Next, the units are serialized and uploaded via `UploadResourceFilesAsync` that finally calls `BlobContainerClient.UploadBlobAsync()`.

I couldn't find a way to upload resource files directly via the Batch SDK. Please drop me a mail if you know of a more native way!

And eventually, the input files end up in the configured blob container as depicted.

There are three options available to download resource files onto the nodes, these are:

I have chosen to go down the auto-storage route, as I would like to keep things as simple as possible.

Using the auto-storage container allows us to bypass configuring and creating a SAS URL to access the storage container. Instead, we only need to provide the name of the storage container, and we are good to go!

This is achieved by calling `ResourceFile.FromAutoStorageContainer()` at the time of task creation. Then, the resource file will end up in the task-specific working directory (e.g., `wd\input-0.json`) if nothing else was specified via the `filePath` parameter.

### Performing the computation

After the consumer has read and deserialized the JSON input file, it generates the random points on the specific stripe and counts the circle hits.

``````public static bool IsInCircle(double x, double y) => x * x + y * y <= 1.0;

public static Unit GenerateRandomPoints(Unit unit)
{
var stopwatch = new Stopwatch();
stopwatch.Start();

var random = new Random();
ulong circleHits = 0;

for (ulong i = 0; i < unit.NumRandomPoints; i++)
{
var x = random.NextDouble() * (unit.Area.UpperX - unit.Area.LowerX) + unit.Area.LowerX;
var y = random.NextDouble() * (unit.Area.UpperY - unit.Area.LowerY) + unit.Area.LowerY;

if (IsInCircle(x, y))
circleHits++;
}
stopwatch.Stop();

unit.ElapsedMilliseconds = stopwatch.ElapsedMilliseconds;
unit.CircleHits = circleHits;

return unit;
}``````

For uploading the output data, we can choose between a couple of options, which are:

I decided to use the second approach and go with the Batch File Conventions library, as it simplifies the process of storing task output and retrieving it.

Also, I wanted to view the task output & logs from within the Azure Portal, which is made possible due to the convention.

The result of the computation is uploaded by the task itself, like so:

As you might have noticed, the creation of the `TaskOutputStorage()` type requires a SAS URL, that I am creating on the Cluster Head and then pass down by an environment variable.

Then, the method `UploadTaskOutput()` is finally calling `TaskOutputStorage.SaveAsync()` which allows classifying the kind of output we are saving.

Depending on the value of `TaskOutputKind` the data ends up in one of the following paths.

### Monitoring and waiting for the tasks to finish

This can be done with the help of a `TaskStateMonitor`.

``````var batchClient = BatchClient.Create(...);

It provides two methods. First, there is `WaitAll()` which is blocking, and second the asynchronous counterpart `WhenAll()`.

These methods take our tasks as input and further let us configure the desired state we'd like to wait for. Also, we can optionally define the polling delay and maximum time to wait.

### Retrieving logs & results from tasks

Last but not least, we are going to read back the files by leveraging again a `TaskOutputStorage`.

## The mighty results

Here are some test runs I did. The results show the huge amount of iterations required to reach a decent precision.

``````IterationsTotal: 1'000'000
UnitsToGenerate: 4
Nodes: 4
vCPUs: 4
Total calculation time: 00h:00m:00s:055ms
Total run time:         00h:00m:11s:333ms
Estimated PI:           3.1420560
Math.Pi:                3.141592653589793``````

With `1'000'000` random points, we reached a precision of 2 decimal places.

``````IterationsTotal: 34'359'738'360
UnitsToGenerate: 4
Nodes: 4
vCPUs: 4
Total calculation time: 00h:33m:54s:936ms
Total run time:         00h:09m:48s:308ms
Estimated PI:           3.1415838323630354908208911052
Math.Pi:                3.141592653589793``````

With `34'359'738'360` random points, we reached a precision of 4 decimal places.

``````IterationsTotal: 137'438'953'440
UnitsToGenerate: 16
Nodes: 6
vCPUs: 6
Total calculation time: 02h:13m:13s:804ms
Total run time:         00h:31m:46s:174ms
Estimated PI:           3.1415946968520513495551541796
Math.Pi:                3.141592653589793``````

With `137'438'953'440` random points, we reached a precision of 5 decimal places.

``````IterationsTotal: 549'755'813'760
UnitsToGenerate: 16
Nodes: 2
vCPUs: 4
Total calculation time: 08h:59m:13s:381ms
Total run time:         02h:21m:47s:118ms
Estimated PI:           3.1415927164746315024035590472
Math.Pi:                3.141592653589793``````

With `549'755'813'760` random points, we reached a precision of 6 decimal places.

## Further thoughts & questions

Here are some unsorted questions, ideas & thoughts that came up during the work on this article. They might be incorporated in further posts.

### Lessons learned & thoughts

• The cluster head code currently doesn't allow re-attaching to the service and needs to be running during computation. However, this could easily be fixed by some slight modifications.
• Create an Alert on the Batch Service, to get notified when a (long-running) job has been completed.
• When using low-priority nodes, it's better to choose virtual machine sizes with fewer cores, as this will leave you with the most available vCPUs in the pool, in case a pre-empting is happening.
• When dealing with huge numbers, it's even more important to unit-test edge cases.
• Mocking Azure Batch type can be painful (needs a lot of wrapping), e.g. CloudJob, CloudPool, BatchClient, ...
• To progressively get premature results, asynchronous messaging (Service Bus, Storage Queues) and Azure Durable functions could be leverages.

### Questions

• How can the experiment prematurely be stopped in case a required precision has been reached?  When will the algorithm converge?
• How to choose the right node size (performance vs. cost)?
• How can we auto-resize a static pool, in case a low-priority node gets preempted?
• Coming up with an efficient thread & task (TPL) design on the worker node can be difficult. How much work should be left to the Azure Batch Scheduler?
• Could this experiment be run more efficiently on Azure Durable Functions and a fan-out/fan-in pattern?

## Conclusion

Wow, that was an exciting journey 😄. Writing this article took me (a lot) longer than I originally envisioned, it's easy to get carried away by details 🤪. However, the learnings I took were worth it.