Wednesday, August 2, 2017

Monte Carlo Simulation


The mathematician Stanislaw Ulam, while recovering from an illness, was playing game after game of Solitaire and trying to estimate the probability of winning. Because winning was such a rare event, he would have needed to play thousands of games, which wasn't practical. He started wondering whether he could instead simulate a large number of games on a computer and simply count the successful outcomes. The Monte Carlo method of computation was born.

I want to try using the Monte Carlo method to calculate the probability of two dice showing the same face after a roll.

First I am going to define a class with the variables and methods we need:

import random

class Dice():
    def __init__(self):
        self.sides = 6
        self.top_side = None
    def roll(self):
        # pick a random face between 1 and 6
        self.top_side = random.randint(1, self.sides)
    def __repr__(self):
        return 'Dice roll'

I am going to distinguish between two dice by calling them first_dice and second_dice:

first_dice = Dice()
second_dice = Dice()

The Rolls function counts the number of rolls in which both dice show the same face:

def Rolls(number):
    number_of_equal_sides = 0
    for i in range(number):
        first_dice.roll()
        second_dice.roll()
        if first_dice.top_side == second_dice.top_side:
            number_of_equal_sides = number_of_equal_sides + 1
    return number_of_equal_sides

I will roll the two dice a number of times and, for each case, estimate the probability by counting the rolls where the faces are the same and dividing by the total number of rolls.

number_of_rolls = [100, 1000, 10000, 100000, 1000000]
for number in number_of_rolls:
    print "Number of rolls:", number, ",", "Probability of two sides are the same:", Rolls(number)/float(number)

The larger the number of experiments, the more accurate the probability estimate becomes:

Number of rolls: 100 , Probability of two sides are the same: 0.17
Number of rolls: 1000 , Probability of two sides are the same: 0.147
Number of rolls: 10000 , Probability of two sides are the same: 0.1704
Number of rolls: 100000 , Probability of two sides are the same: 0.1676
Number of rolls: 1000000 , Probability of two sides are the same: 0.167328
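The improvement with the number of rolls can also be quantified. The standard error of a Monte Carlo estimate of a proportion p shrinks like sqrt(p*(1-p)/N), so ten times more rolls cuts the error by roughly a factor of 3. A quick sketch (not part of the original experiment):

```python
import math

p = 1.0 / 6.0  # true probability of equal faces
for n in (100, 1000, 10000, 100000, 1000000):
    # expected standard error of the Monte Carlo estimate with n rolls
    se = math.sqrt(p * (1 - p) / n)
    print(n, round(se, 5))
```

This matches the results above: with a million rolls the estimate is good to a few parts in ten thousand.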


Can we verify the answer analytically? If we roll two dice there are only 6*6 = 36 possible outcomes. Out of those 36 outcomes there are only 6 in which the faces are the same. Therefore the probability of two dice showing the same face is 6/36 = 0.16666.
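The same count can be confirmed by brute force, enumerating all 36 outcomes:

```python
from itertools import product

# all (first, second) face pairs for two six-sided dice
outcomes = list(product(range(1, 7), repeat=2))
equal = sum(1 for a, b in outcomes if a == b)
print(equal, len(outcomes), equal / float(len(outcomes)))  # 6 36 0.1666...
```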

Sunday, July 30, 2017

Why Python Numpy is fast


The dot product of two vectors a and b is the sum of the products of their coordinates. It can also be written as:

    a . b = a1*b1 + a2*b2 + ... + an*bn = a * bT

where bT is the transpose of the (row) vector b.

I am going to use the Python Numpy library to create two vectors a and b.

import numpy as np

a = np.random.randint(10, size=10000000)
b = np.random.randint(10, size=10000000)

Each vector has 10,000,000 elements and each element is a random integer between 0 and 9.

I am going to perform a dot product on these two vectors. I'll repeat each operation 10 times and then calculate the average execution time. First, I am going to use a for loop to iterate through the arrays.

import numpy as np
import time

def Dot_no_numpy(a, b):
    '''
    performs dot product with a for loop
    accepts two vectors on the input and returns the dot product
    '''
    total = 0
    for i in range(len(a)):
        total = total + a[i]*b[i]
    return total


time_measurements = np.array([])
for i in range(10):
    time_begin = time.time()
    Dot_no_numpy(a, b)
    time_end = time.time()
    delta_time = time_end - time_begin
    time_measurements = np.append(time_measurements, delta_time)
mean = np.mean(time_measurements)
print "average time", mean

Result:
average time 5.15


Now I am going to use the Numpy dot function to perform the dot product and check whether the result is different this time.

def Dot_with_numpy(a, b):
    '''
    performs dot product with the numpy dot function
    accepts two vectors on the input and returns the dot product
    '''
    return np.dot(a, b)

time_measurements = np.array([])
for i in range(10):
    time_begin = time.time()
    Dot_with_numpy(a, b)
    time_end = time.time()
    delta_time = time_end - time_begin
    time_measurements = np.append(time_measurements, delta_time)
mean = np.mean(time_measurements)
print "average time", mean

Result:
average time 0.00909998416901

By comparing the average times we can see that when Numpy performs the dot product it takes over 560 times less time than the version without Numpy. This is a huge difference.
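Both versions compute the same value, which is easy to check on a small example (a minimal sketch, not the original timing code):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

def dot_loop(a, b):
    # pure-Python loop version of the dot product
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

print(dot_loop(a, b), np.dot(a, b))  # both give 32
```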

Why is the solution using a for loop so slow? Each iteration of a pure-Python loop is executed by the interpreter, with every element boxed as a Python object, which adds a large per-element overhead. In addition, the original Python interpreter (CPython, implemented in C) has a GIL (Global Interpreter Lock) which prevents multiple threads from executing Python bytecode in parallel. In this situation the system behaves like a SISD (Single Instruction, Single Data) or uniprocessor machine: it executes a single instruction at a time.

Why is Numpy different? Numpy's inner loops run in compiled C code over unboxed arrays, and it uses BLAS (Basic Linear Algebra Subprograms) libraries for its linear algebra operations. BLAS in turn makes use of SIMD (Single Instruction, Multiple Data) instructions, which apply the same operation to multiple data elements at once. SIMD instructions can greatly increase performance when the same operation is performed on many data objects. Intel introduced SIMD instruction set support in its Pentium III processors in 1999 (the SSE extensions).
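The effect shows up even without touching BLAS directly: a single vectorized NumPy expression replaces an interpreted per-element loop. A minimal sketch:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
# one vectorized multiply-and-sum instead of four interpreted iterations
print(int(np.sum(a * b)))  # 300
```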



Saturday, July 29, 2017

Normal Distribution


Let's say we want to model collected data with a continuous function, using only a few parameters such as the mean and standard deviation. From the continuous function we can then reason about the distribution of the data and the probabilities of particular values. This is where probability density functions (PDFs) can be very useful.

One of the most common distributions is the Normal (or Gaussian) PDF. Many randomly observed natural phenomena follow this distribution when the number of observations is large.

I am going to use a Python programming language to build a normal distribution.

For my model I decided to collect CPU utilisation data. I am running Windows on my machine. The psutil library can be used to measure CPU utilisation at a given moment. As shown in the code below, I perform 50 measurements in total, with a 5 s delay between measurements.

import matplotlib.pyplot as plt
import psutil
import time

raw_data = []


for i in range(50):
    util = psutil.cpu_percent()
    time.sleep(5)
    raw_data.append(util)

plt.ylabel('CPU Utilisation, %')

plt.xlabel('Measurement sequence')
plt.plot(raw_data)

The resulting CPU utilisation data is plotted on the chart:

Now I am going to calculate and plot a normal distribution using built-in Python functions. First, I need to calculate the mean and standard deviation values:

import math
from scipy.stats import norm
   
mean, std = norm.fit(raw_data)
print "mean", mean
print "std", std

Result:
mean 10.69
std 3.83474901395

Second, I need to define the range of my random variable. Each value represents a certain CPU utilisation. I assume that all values lie within four standard deviations of the mean.

range_value = int(mean + 4*std)

X = [i for i in range(range_value)] 

And finally I pass the mean, std and X to the pdf function and display the chart.

p = norm.pdf(X, mean, std)
plt.ylabel('PDF(X)')

plt.xlabel('X')
plt.plot(X,p)

The mean, or expected value of the variable (10.69), is the centre of the pdf.

Python offers a very efficient and easy-to-use implementation of the normal distribution. However, I decided to implement it myself in order to understand how it works in detail.


First, I calculated mean and standard deviation.

The mean can be calculated as the sum of all CPU utilisation values divided by the number of items.


import numpy as np

# convert the list to a numpy array so we can use .sum() and .shape
raw_data = np.asarray(raw_data)
total = raw_data.sum()
number_of_el = raw_data.shape[0]
mean = total/number_of_el
print "mean_value", mean

Result:
mean_value 10.69


The standard deviation quantifies the variation of the data around the mean.


It can be calculated as the square root of the sum of squared differences between the mean and each CPU utilisation value, divided by the number of items.

std = math.sqrt(np.asarray([(x-mean)**2 for x in raw_data]).sum()/number_of_el)

print "std", std

Result:
std 3.83474901395
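The manual calculations can be cross-checked against NumPy's built-in np.mean and np.std (which also uses the population formula, ddof=0, by default). A small sketch with made-up sample data:

```python
import numpy as np

data = np.array([12.0, 9.5, 10.2, 11.0, 8.8])  # hypothetical utilisation samples
mean = data.sum() / data.shape[0]
std = np.sqrt(((data - mean) ** 2).sum() / data.shape[0])
# both should agree with the library functions
assert np.isclose(mean, np.mean(data))
assert np.isclose(std, np.std(data))
```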


The normal PDF is defined by the following formula:

    f(x) = 1/(std * sqrt(2*pi)) * exp(-(x - mean)^2 / (2*std^2))

And python implementation:



p = []
range_value = int(mean + 4*std)
for x in range(range_value):
    f_x = (1/(std*math.sqrt(2*math.pi)))*math.exp((-(x-mean)**2)/(2*std**2))
    p.append(f_x)
plt.ylabel('PDF(X)')
plt.xlabel('X')
plt.plot(p)


The obtained pdf looks exactly the same as in the previous example with the built-in Python implementation.
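One extra sanity check on the hand-written formula (not in the original post): a PDF must integrate to 1, which a crude numerical integration over ±6 standard deviations confirms:

```python
import math

def normal_pdf(x, mean, std):
    # f(x) = 1/(std*sqrt(2*pi)) * exp(-(x-mean)^2 / (2*std^2))
    return (1.0 / (std * math.sqrt(2 * math.pi))) * math.exp(-(x - mean) ** 2 / (2 * std ** 2))

mean, std = 10.69, 3.83474901395
step = 0.001
n = int(12 * std / step)
# left Riemann sum of the pdf over mean - 6*std .. mean + 6*std
area = sum(normal_pdf(mean - 6 * std + i * step, mean, std) * step for i in range(n))
print(round(area, 4))  # close to 1.0
```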