Saturday, July 29, 2017

Normal Distribution


Let's say we want to model collected data with a continuous function, using only a few parameters such as the mean and standard deviation. From that function we can then reason about how the data is distributed and how likely certain values are. This is where probability density functions (PDFs) are very useful.

One of the most common distributions is the Normal (or Gaussian) distribution. Many natural phenomena, when observed at random, approximately follow this distribution when the number of observations is large.

I am going to use the Python programming language to build a normal distribution.

For my model I decided to collect CPU utilisation data. I am running Windows on my machine. The "psutil" library can be used to measure CPU utilisation at a given time. As shown in the code below, I take 50 measurements in total, with a 5 s interval between measurements.

import matplotlib.pyplot as plt
import psutil

raw_data = []

# Take 50 measurements; each call blocks for 5 s and returns the
# average CPU utilisation (%) over that interval.
for i in range(50):
    util = psutil.cpu_percent(interval=5)
    raw_data.append(util)

plt.ylabel('CPU Utilisation, %')
plt.xlabel('Measurement sequence')
plt.plot(raw_data)
plt.show()

The resulting CPU utilisation data is plotted on the chart:

[Chart: CPU utilisation (%) against measurement sequence]

Now I am going to calculate and plot a normal distribution using SciPy's built-in functions. First, I need to calculate the mean and standard deviation:

import math
from scipy.stats import norm

# Fit a normal distribution to the measurements
mean, std = norm.fit(raw_data)
print("mean", mean)
print("std", std)

Result:
mean 10.69
std 3.83474901395
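
As a quick check of what norm.fit returns, the sketch below compares it with NumPy's mean and standard deviation on a small made-up sample. The values in `sample` are illustrative assumptions and are not the measurements from this post. For a normal distribution the fit is a maximum-likelihood estimate, so the standard deviation divides by N rather than N - 1.

```python
import numpy as np
from scipy.stats import norm

# Illustrative values only; not the CPU measurements from this post.
sample = [8.0, 12.5, 9.7, 15.2, 10.1, 7.4, 11.9, 13.3]

fit_mean, fit_std = norm.fit(sample)

# norm.fit uses maximum likelihood: the plain average and the
# population standard deviation (ddof=0, i.e. divide by N).
population_std = np.std(sample, ddof=0)
sample_std = np.std(sample, ddof=1)  # divides by N - 1 instead

print("fit mean", fit_mean, "numpy mean", np.mean(sample))
print("fit std", fit_std, "population std", population_std)
print("sample std (ddof=1)", sample_std)
```

With only a handful of values the two standard deviations differ noticeably; with 50 measurements the gap is much smaller.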

Second, I need to define the range of values of my random variable X, where each value represents a certain CPU utilisation. CPU utilisation cannot be negative, so I start at 0 and assume that all values of interest lie below four standard deviations above the mean.

range_value = int(mean + 4*std)

X = list(range(range_value))

Finally, I pass X, the mean and the standard deviation to the pdf function and display the chart.

p = norm.pdf(X, mean, std)
plt.ylabel('PDF(X)')
plt.xlabel('X')
plt.plot(X, p)
plt.show()

[Chart: PDF(X) against X]

The mean, or expected value, of the variable (10.69) is the centre of the PDF: the curve peaks there and is symmetric around it.
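
Once the mean and standard deviation are known, the fitted distribution can also be used to estimate probabilities. The following is a minimal sketch using the cdf and sf functions of scipy.stats.norm with the parameters reported above; the threshold of 15% is a hypothetical value chosen only for illustration.

```python
from scipy.stats import norm

# Parameters reported by norm.fit earlier in the post.
mean, std = 10.69, 3.83474901395

# P(X <= mean): half of the probability mass lies below the mean.
p_below_mean = norm.cdf(mean, mean, std)

# P(X > threshold): survival function, i.e. 1 - CDF.
threshold = 15.0  # hypothetical CPU utilisation level, in %
p_above_threshold = norm.sf(threshold, mean, std)

# P(mean - std <= X <= mean + std), about 68% for any normal distribution.
p_within_one_std = norm.cdf(mean + std, mean, std) - norm.cdf(mean - std, mean, std)

print("P(X <= mean)", p_below_mean)
print("P(X > 15)", p_above_threshold)
print("P(within one std)", p_within_one_std)
```

These numbers are only as good as the assumption that the CPU utilisation is normally distributed.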

SciPy offers an efficient and easy-to-use implementation of the normal distribution. However, I decided to implement it myself in order to understand how it works in detail.

First, I calculated the mean and standard deviation.

The mean is the sum of all CPU utilisation values divided by the number of values.


# raw_data is a plain Python list, so use the built-in sum() and len()
total = sum(raw_data)
number_of_el = len(raw_data)
mean = total / number_of_el
print("mean_value", mean)

Result:
mean_value 10.69


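The standard deviation quantifies the variation of the data around the mean.

It can be calculated as the square root of the average squared difference between each CPU utilisation value and the mean, that is, the sum of the squared differences divided by the number of items. Dividing by the number of items gives the population standard deviation, which is also what norm.fit returns.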

import numpy as np

# Population standard deviation: divide by N, not N - 1
std = math.sqrt(np.sum((np.asarray(raw_data) - mean) ** 2) / number_of_el)
print("std", std)

Result:
std 3.83474901395
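
Written as formulas, the two calculations above look as follows. The symbols are chosen here for illustration: x_i are the individual measurements, N is their number, μ is the mean and σ is the standard deviation.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
```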


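The normal PDF is defined by the following formula, where μ is the mean and σ is the standard deviation:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

And the Python implementation: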



range_value = int(mean + 4*std)
p = []  # PDF values for x = 0, 1, ..., range_value - 1
for x in range(range_value):
    f_x = (1/(std*math.sqrt(2*math.pi)))*math.exp((-(x-mean)**2)/(2*std**2))
    p.append(f_x)
plt.ylabel('PDF(X)')
plt.xlabel('X')
plt.plot(p)
plt.show()


The resulting PDF looks exactly the same as in the previous example with the built-in SciPy implementation:

[Chart: PDF(X) against X]
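
The match can also be checked numerically rather than by eye. The sketch below wraps the formula in a helper called `normal_pdf` (a hypothetical name introduced here) and compares its output with norm.pdf on the same integer grid, using the mean and standard deviation reported earlier.

```python
import math

import numpy as np
from scipy.stats import norm

# Parameters reported earlier in the post.
mean, std = 10.69, 3.83474901395


def normal_pdf(x, mean, std):
    """Normal PDF evaluated at x, written out from the formula."""
    return (1 / (std * math.sqrt(2 * math.pi))) * math.exp(
        -((x - mean) ** 2) / (2 * std ** 2)
    )


# Same grid as in the post: integers from 0 up to mean + 4*std.
X = list(range(int(mean + 4 * std)))
manual = [normal_pdf(x, mean, std) for x in X]
builtin = norm.pdf(X, mean, std)

# True when the two curves agree to within floating-point tolerance.
print(np.allclose(manual, builtin))
```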





