Thursday, August 31, 2023

Normalisation of data

Normalisation is a data preprocessing technique used in statistics to rescale numerical data measured in different units or on different scales to a common scale. This improves the performance and convergence of algorithms that are sensitive to the scale of their input variables, such as gradient descent and distance-based methods like k-nearest neighbours.

Here are some common methods:

1. Ratio of Mean: Normalisation by the ratio of mean is a technique used to scale data by dividing each data point by the mean of the dataset. This method ensures that the mean of the normalised data becomes 1. It's a simple approach and can be useful in situations where you want to emphasise the relative position of each data point with respect to the mean.

Formula: x'_i = x_i / mean(X), where mean(X) = (x_1 + x_2 + ... + x_n) / n

Example:

Suppose you have a dataset representing the monthly sales figures (in thousands of dollars) for a retail store over a year:

X=[15,20,25,18,30,22,28,16,32,26,19,23]

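Calculate the mean of the original dataset: mean(X) = 274 / 12 ≈ 22.83

Normalise each data point by dividing it by the mean: x'_i = x_i / 22.83

Calculated values for normalised data (rounded to three decimal places):

X' = [0.657, 0.876, 1.095, 0.788, 1.314, 0.964, 1.226, 0.701, 1.401, 1.139, 0.832, 1.007]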
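The calculation above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name normalise_by_mean is an illustrative choice, not part of any library.

```python
from statistics import mean


def normalise_by_mean(values):
    """Divide every value by the dataset mean, so the result has a mean of 1."""
    m = mean(values)
    if m == 0:
        # Dividing by a zero mean is undefined, so fail loudly.
        raise ValueError("mean is zero; ratio-of-mean normalisation is undefined")
    return [v / m for v in values]


# Monthly sales figures from the example (in thousands of dollars).
sales = [15, 20, 25, 18, 30, 22, 28, 16, 32, 26, 19, 23]
normalised = normalise_by_mean(sales)

print([round(v, 3) for v in normalised])
print(round(mean(normalised), 6))  # the mean of the normalised data is 1
```

Values above 1 mark months with above-average sales, and values below 1 mark months with below-average sales.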

2. Min-Max Scaling (Normalisation): This method scales the data to a specific range, typically between 0 and 1.

Formula: x' = (x - min(X)) / (max(X) - min(X))

Example: Let's say you have a dataset of house prices ranging from $50,000 to $1,000,000 and you want to normalise them to a range between 0 and 1. If a house costs $300,000, after normalisation: x' = (300,000 - 50,000) / (1,000,000 - 50,000) = 250,000 / 950,000 ≈ 0.263
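The house price example can be sketched in Python as follows. The function name min_max_scale and its new_min/new_max parameters are assumptions made for illustration; they let the same code target a range other than 0 to 1.

```python
def min_max_scale(x, x_min, x_max, new_min=0.0, new_max=1.0):
    """Map x from the range [x_min, x_max] to [new_min, new_max]."""
    if x_max == x_min:
        # A constant feature has no spread to scale.
        raise ValueError("x_max equals x_min; min-max scaling is undefined")
    return new_min + (x - x_min) * (new_max - new_min) / (x_max - x_min)


# House price example: prices range from $50,000 to $1,000,000.
scaled = min_max_scale(300_000, 50_000, 1_000_000)
print(round(scaled, 3))  # 0.263
```

The cheapest house maps to 0 and the most expensive to 1, so a single extreme price can compress all the other values into a narrow band.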

3. Z-Score (Standardisation): Z-Score normalisation transforms the data to have a mean of 0 and a standard deviation of 1.

Formula: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the dataset
Example: Suppose you have a dataset of exam scores with a mean of 75 and a standard deviation of 10. If a student's score is 85, after standardisation: z = (85 - 75) / 10 = 1, meaning the score is one standard deviation above the mean.
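A minimal Python sketch of standardisation, using only the standard library. The function names and the small scores list are hypothetical, and the dataset version assumes the population standard deviation (pstdev); use stdev instead if the sample standard deviation is wanted.

```python
from statistics import mean, pstdev


def z_score(x, mu, sigma):
    """Standardise a single value given a known mean and standard deviation."""
    return (x - mu) / sigma


def standardise(values):
    """Standardise a whole dataset using its own mean and population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


# Exam score example: mean 75, standard deviation 10, score 85.
z = z_score(85, 75, 10)
print(z)  # 1.0

# Hypothetical scores, only to show the dataset-level version.
scores = [60, 70, 80, 90]
print(standardise(scores))
```

After standardisation the values are expressed in units of standard deviations from the mean, which makes features with different units directly comparable.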
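
4. Robust Scaling: This method serves the same purpose as Min-Max scaling but is more suitable for data with outliers. Instead of the minimum and maximum, it centres the data on the median and scales it by the interquartile range (IQR), so extreme values have little influence on the result. Unlike Min-Max scaling, the output is not confined to a fixed range.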

Formula: x' = (x - median(X)) / IQR(X), where IQR(X) = Q3 - Q1
Example: Consider a dataset of employee salaries with a median salary of $60,000 and an interquartile range (IQR) of $20,000. If an employee earns $80,000, after robust scaling: x' = (80,000 - 60,000) / 20,000 = 1
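The salary example can be sketched in Python with the standard library. The function names and the salaries list are hypothetical. Quartile conventions differ between tools, so the choice of method="inclusive" in statistics.quantiles is an assumption, and other software may report a slightly different IQR for the same data.

```python
from statistics import median, quantiles


def robust_scale_value(x, med, iqr):
    """Scale one value given a known median and interquartile range."""
    return (x - med) / iqr


def robust_scale(values):
    """Centre a dataset on its median and scale it by its IQR."""
    med = median(values)
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined")
    return [(v - med) / iqr for v in values]


# Salary example: median $60,000, IQR $20,000, salary $80,000.
print(robust_scale_value(80_000, 60_000, 20_000))  # 1.0

# Hypothetical salaries with one extreme outlier.
salaries = [40_000, 50_000, 60_000, 70_000, 1_000_000]
scaled_salaries = robust_scale(salaries)
print(scaled_salaries)
```

In the hypothetical list the outlier ends up far outside the range of the other values, while the typical salaries keep a sensible spread around 0. This is the behaviour that makes the method robust.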

5. Unit Vector Scaling (Normalisation): This method scales each data vector (sample) so that its Euclidean norm (magnitude) is 1. It is commonly used in text classification and clustering.

Formula: x' = x / ||x||, where ||x|| = sqrt(x_1^2 + x_2^2 + ... + x_n^2)
Example: If you have a data vector [3, 4] and you want to normalise it to a unit vector, you calculate the magnitude as ||x|| = sqrt(3^2 + 4^2) = sqrt(25) = 5.
The unit vector is then x / ||x|| = [3/5, 4/5] = [0.6, 0.8].
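The same calculation as a short Python sketch. The function name to_unit_vector is an illustrative choice, and the zero-vector check is an added assumption, since a vector of all zeros has no direction to preserve.

```python
import math


def to_unit_vector(vec):
    """Scale a vector so that its Euclidean norm (magnitude) is 1."""
    norm = math.sqrt(sum(c * c for c in vec))
    if norm == 0:
        raise ValueError("cannot normalise the zero vector")
    return [c / norm for c in vec]


unit = to_unit_vector([3, 4])
print(unit)  # [0.6, 0.8]

# The scaled vector has magnitude 1 (up to floating-point rounding).
print(math.sqrt(sum(c * c for c in unit)))
```

Only the length of the vector changes. Its direction is preserved, which is why this scaling suits comparisons based on angle, such as cosine similarity between document vectors.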