Normalisation of data
Normalisation is a data preprocessing technique that rescales numerical data measured in different units or on different scales to a common scale. This can improve the performance and convergence of algorithms that are sensitive to the scale of their input variables, such as gradient-based optimisers and distance-based methods.
Here are some common methods:
1. Ratio of Mean: Normalisation by the ratio of mean scales data by dividing each data point by the mean of the dataset. Provided the original mean is non-zero, the mean of the normalised data becomes 1. It is a simple approach and can be useful when you want to emphasise the position of each data point relative to the mean.
Formula: \( x_i' = x_i / \bar{x} \), where \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \) is the mean of the dataset.
Example:
Suppose you have a dataset representing the monthly sales figures (in thousands of dollars) for a retail store over a year:
X=[15,20,25,18,30,22,28,16,32,26,19,23]
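Calculate the mean of the original dataset: \( \bar{x} = 274 / 12 \approx 22.83 \)
Normalise each data point by dividing it by the mean: \( x_i' = x_i / 22.83 \)
Calculated values for normalised data (rounded to three decimal places):
X'=[0.657,0.876,1.095,0.788,1.314,0.964,1.226,0.701,1.401,1.139,0.832,1.007]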
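The sales example can be reproduced with a short Python sketch. The function name `normalise_by_mean` is chosen here for illustration and is not from the original text; the sketch assumes the dataset has a non-zero mean.

```python
from statistics import mean


def normalise_by_mean(values):
    """Divide every value by the dataset mean (undefined for a zero mean)."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; ratio-of-mean normalisation is undefined")
    return [v / m for v in values]


# Monthly sales figures from the example (thousands of dollars).
sales = [15, 20, 25, 18, 30, 22, 28, 16, 32, 26, 19, 23]
normalised = normalise_by_mean(sales)

print([round(v, 3) for v in normalised])
print(round(mean(normalised), 6))  # 1.0: the normalised data has mean 1
```

Values above 1 correspond to months with above-average sales, and values below 1 to months with below-average sales.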
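2. Min-Max Scaling: Min-Max scaling linearly rescales the data to a fixed range, usually between 0 and 1, using the minimum and maximum values of the dataset.
Formula: \( x' = (x - x_{\min}) / (x_{\max} - x_{\min}) \)
Example: Let's say you have a dataset of house prices ($50,000 - $1,000,000) and you want to normalise them to a range between 0 and 1. If a house costs $300,000, after normalisation: \( x' = (300{,}000 - 50{,}000) / (1{,}000{,}000 - 50{,}000) = 250{,}000 / 950{,}000 \approx 0.263 \)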
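As a minimal sketch of the house-price example, the following assumes the usual min-max definition, (x - min) / (max - min), optionally mapped onto a target range. The function name `min_max_scale` and the `new_min` / `new_max` parameters are illustrative choices, not part of the original text.

```python
def min_max_scale(x, x_min, x_max, new_min=0.0, new_max=1.0):
    """Linearly map x from [x_min, x_max] onto [new_min, new_max]."""
    if x_max == x_min:
        raise ValueError("x_min and x_max must differ")
    return new_min + (x - x_min) * (new_max - new_min) / (x_max - x_min)


# House priced at $300,000 in a dataset ranging from $50,000 to $1,000,000.
scaled_price = min_max_scale(300_000, 50_000, 1_000_000)
print(round(scaled_price, 3))  # 0.263
```

Because the minimum and maximum define the mapping, a single extreme value can compress all the other points into a narrow part of the range.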
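3. Z-Score (Standardisation): Z-score normalisation transforms the data to have a mean of 0 and a standard deviation of 1.
Formula: \( z = (x - \mu) / \sigma \), where \( \mu \) is the mean and \( \sigma \) is the standard deviation of the dataset.
Example: Suppose you have a dataset of exam scores with a mean of 75 and a standard deviation of 10. If a student's score is 85, after standardisation: \( z = (85 - 75) / 10 = 1.0 \). The score lies one standard deviation above the mean.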
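The exam-score example can be checked with a small sketch, assuming the standard definition z = (x - mean) / standard deviation. The helper names `z_score` and `standardise` are illustrative, the three-score list is made-up data, and the use of the population standard deviation (`pstdev`) rather than the sample one is an assumption.

```python
from statistics import mean, pstdev


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return (x - mu) / sigma


def standardise(values):
    """Standardise a whole dataset using its own mean and population std."""
    mu, sigma = mean(values), pstdev(values)
    return [z_score(v, mu, sigma) for v in values]


# Score of 85 with mean 75 and standard deviation 10, as in the example.
print(z_score(85, 75, 10))  # 1.0

# Hypothetical scores, used only to show the resulting mean and spread.
standardised = standardise([65, 75, 85])
print(round(mean(standardised), 6), round(pstdev(standardised), 6))
```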
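4. Robust Scaling: This method is similar to Min-Max scaling but is more suitable for data with outliers. It centres the data on the median and scales it by the interquartile range (IQR), so extreme values have little influence on the scaling. Unlike Min-Max scaling, the result is not confined to a fixed range: typical values fall roughly between -1 and 1, while outliers can lie well outside it.
Formula: \( x' = (x - \text{median}) / \text{IQR} \), where \( \text{IQR} = Q_3 - Q_1 \).
Example: Consider a dataset of employee salaries with a median salary of $60,000 and an interquartile range (IQR) of $20,000. If an employee earns $80,000, after robust scaling: \( x' = (80{,}000 - 60{,}000) / 20{,}000 = 1.0 \)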
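A minimal sketch of the salary example follows, assuming the common definition of robust scaling, (x - median) / IQR. The names `robust_scale` and `fit_robust` are illustrative, the salary list is made-up data, and `statistics.quantiles` uses its default "exclusive" method, so other quartile conventions may give a slightly different IQR.

```python
from statistics import median, quantiles


def robust_scale(x, med, iqr):
    """Centre x on the median and scale by the interquartile range."""
    if iqr == 0:
        raise ValueError("interquartile range is zero")
    return (x - med) / iqr


def fit_robust(values):
    """Return (median, IQR) for a dataset; quartile method is an assumption."""
    q1, _, q3 = quantiles(values, n=4)
    return median(values), q3 - q1


# Salary of $80,000 with median $60,000 and IQR $20,000, as in the example.
print(robust_scale(80_000, 60_000, 20_000))  # 1.0

# Hypothetical salaries with one outlier; the median and IQR barely notice it.
salaries = [40_000, 50_000, 60_000, 70_000, 80_000, 500_000]
med, iqr = fit_robust(salaries)
print([round(robust_scale(s, med, iqr), 2) for s in salaries])
```

The outlier still receives a large scaled value, but it does not distort the scaling applied to the remaining salaries.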
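5. Unit Vector Scaling (Normalisation): This method scales each data point (a vector of feature values) to have a Euclidean norm (magnitude) of 1. It is commonly used in text classification and clustering.
Formula: \( x' = x / \lVert x \rVert_2 \), where \( \lVert x \rVert_2 = \sqrt{\sum_i x_i^2} \).
Example: If you have a data vector [3, 4] and you want to normalise it to a unit vector, you calculate the magnitude as \( \sqrt{3^2 + 4^2} = \sqrt{25} = 5 \).
The unit vector is then \( [3/5, 4/5] = [0.6, 0.8] \).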
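The [3, 4] example can be verified with a short sketch, assuming the Euclidean (L2) norm. The function name `to_unit_vector` is an illustrative choice.

```python
import math


def to_unit_vector(vec):
    """Scale a vector so that its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(c * c for c in vec))
    if norm == 0:
        raise ValueError("cannot normalise the zero vector")
    return [c / norm for c in vec]


unit = to_unit_vector([3, 4])
print(unit)  # [0.6, 0.8]
print(math.sqrt(sum(c * c for c in unit)))  # approximately 1.0
```

Only the direction of the vector is preserved, which is why this scaling suits comparisons based on the angle between vectors, such as cosine similarity.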