Programming and Technology: July 2017

Sunday, July 30, 2017

Why Python Numpy is fast

The dot product of two vectors a and be is the sum of the products of its coordinates. It can also be written as:

where b in power T is the transpose of the vector b.

I am going to use python Numpy library to create two vectors a and b.

a = np.random.randint(10, size=10000000)
b = np.random.randint(10, size=10000000)

Each vector has a size of 10,000,000 elements and each element is an integer between zero and 10.

I am going to perform a dot product operation on these two vectors. I'll repeat each operation 10 times and then calculate the average execution time. First, i am going to use a for loop to iterate through the arrays.

import numpy as np
import time

def Dot_no_numpy(a,b):
'''
performs dot product with for loop
accepts two vectors on the input and returns the dot product
'''
sum = 0;
for i in range(len(a)):
sum = sum + a[i]*b[i]
return sum

time_measurements = np.array([]) for i in range (10): time_begin = time.time() Dot_no_numpy(a,b) time_end = time.time() delta_time = time_end - time_begin time_measurements = np.append(time_measurements, delta_time) mean = np.mean(time_measurements) print "average time", mean

Result:
average time 5.15

Now I am going to use Numpy dot function to perform a dot product and check if the result is different this time.

def Dot_with_numpy(a,b):
'''
performs dot product with numpy dot function
accepts two vectors on the input and returs the dot product
'''

return np.dot(a,b)

time_measurements = np.array([]) for i in range (10): time_begin = time.time() Dot_with_numpy(a,b) time_end = time.time() delta_time = time_end - time_begin time_measurements = np.append(time_measurements, delta_time) mean = np.mean(time_measurements) print "average time", mean

Result:
average time 0.00909998416901

By comparing average times we can see that if Numpy is used to perform the dot product, the time it takes is almost 560 times less compared to the case without Numpy. This is a huge difference.

Why the solution utilising for loop is so slow? Original python interpreter (CPython implemented in C language) was written with a GIL (Global Interpreter Lock) which prevents executing multiple threads in parallel. In this situation a systems performs as a SISD (Single Instruction Single Data) or a uniprocessor machine. It executes a single instruction at a time.

Why Numpy is different? Numpy utilises BLAS (Basic Linear Algebra Subprograms) libraries for its linear algebra operations. BLAS in turn makes use of SIMD (Single Instruction Multiple Data) instructions which allows to fully utilise multicore machine. SIMD instructions can greatly increase performance if the same operation is performed on multiple data objects. Intel has introduced a SIMD instruction set support in his Pentium III processors (in year 1999).

Saturday, July 29, 2017

Normal Distribution

Let's say we want to model a collected data with a continuous function, by using only few parameters such as mean and standard deviation. From the continuous function we can then speculate about data distribution and probabilities certain data values have. This is where probability density functions can be very useful.

One of the most common distributions is Normal (or Gaussian) PDF. A lot of natural phenomena which observed randomly follows this distribution when number of observations is large.

I am going to use a Python programming language to build a normal distribution.

For my model I decided to obtain CPU utilisation data. I am running a Windows on my machine. "psutil" library can be used to measure CPU utilisation at a given time. As shown in the code below I perform 50 measurements in total, with 5s time delta between the measurements.

import matplotlib.pyplot as plt
import psutil
import time

raw_data = []

for i in range(50):
util = psutil.cpu_percent()
time.sleep(5)
raw_data.append(util)

plt.ylabel('CPU Utilisation, %')
plt.xlabel('Measurement sequence')
plt.plot(raw_data)

The resulted CPU utilisation data is plotted on the chart:

Now I am going to calculate and plot a Normal Distribution using built in python functions. First, i need to calculate mean and standard deviation values:

import math
from scipy.stats import norm

mean, std = norm.fit(raw_data)
print "mean", mean
print "std", std

Result:

mean 10.69
std 3.83474901395

Second, i need to define the range of my random variables. Each random variable represents a certain CPU utilisation value. I assume that all random variables are within four standard deviations range from the mean.

range_value = int(mean + 4*std)
X = [i for i in range(range_value)]

And finally pass the mean, std and x to the pdf function and display the chart.

p = norm.pdf(X, mean, std)
plt.ylabel('PDF(X)')

plt.xlabel('X')
plt.plot(X,p)

The mean, or the expected value of the variable (10.69), is the centroid of the pdf.

Python offers very efficient and easy to use implementation of normal distribution. However I decided to implement it myself in order to understand how it works in detail.

First, I calculated mean and standard deviation.
Mean can be calculated as a sum of all CPU utilisation values divided by the number of items.

sum = raw_data.sum()
number_of_el = raw_data.shape[0]
mean = sum/number_of_el
print "mean_value", mean

Result:

mean_value 10.69

A standard deviation quantifies variation of data around the mean.

It can be calculated as a square root of a sum of of squared differences between mean and a cpu utilisation value divided by the number of items.

std = math.sqrt(np.asarray([(x-mean)**2 for x in raw_data]).sum()/number_of_el)

print "std", std

Result:

std 3.83474901395

Normal PDF is defined by the following formula:

And python implementation:

range_value = int(mean + 4*std)
for x in range(range_value):
    f_x = (1/(std*math.sqrt(2*math.pi)))*math.exp((-(x-mean)**2)/(2*std**2))
    p.append(f_x)
plt.ylabel('PDF(X)')
plt.xlabel('X')
plt.plot(p)

The obtained pdf function looks exactly the same as in the previous example with built-in

 Python implementation:

Programming and Technology

Sunday, July 30, 2017

Why Python Numpy is fast

Saturday, July 29, 2017

Normal Distribution

Report Abuse

Labels