Thursday, August 31, 2023

Normalisation of data

Normalisation is a data preprocessing technique used in statistics to rescale numerical data measured in different units and on different scales to a common scale. This improves the performance and convergence of algorithms that are sensitive to the scale of their input variables, such as gradient descent and distance-based methods like k-nearest neighbours.

Here are some common methods:


1. Ratio of Mean: Normalisation by the ratio of mean is a technique used to scale data by dividing each data point by the mean of the dataset. This method ensures that the mean of the normalised data becomes 1. It's a simple approach and can be useful in situations where you want to emphasise the relative position of each data point with respect to the mean. 


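Formula:

$$x'_i = \frac{x_i}{\bar{x}}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where $\bar{x}$ is the mean of the dataset and must be non-zero.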


Example:

Suppose you have a dataset representing the monthly sales figures (in thousands of dollars) for a retail store over a year:

X=[15,20,25,18,30,22,28,16,32,26,19,23]

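Calculate the mean of the original dataset:

$$\bar{x} = \frac{15 + 20 + \dots + 23}{12} = \frac{274}{12} \approx 22.83$$

Normalise each data point by dividing it by the mean:

$$x'_i = \frac{x_i}{\bar{x}}, \qquad \text{e.g. } x'_1 = \frac{15}{22.83} \approx 0.657$$

Calculated values for normalised data (rounded to three decimal places):

X' ≈ [0.657, 0.876, 1.095, 0.788, 1.314, 0.964, 1.226, 0.701, 1.401, 1.139, 0.832, 1.007]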
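The sales example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `normalise_by_mean` is an assumption made for illustration, not part of any library.

```python
from statistics import mean


def normalise_by_mean(values):
    """Divide every value by the dataset mean (hypothetical helper)."""
    m = mean(values)
    if m == 0:
        # The ratio is undefined when the mean is zero.
        raise ValueError("mean is zero; ratio-of-mean normalisation is undefined")
    return [v / m for v in values]


# Monthly sales figures from the example, in thousands of dollars.
sales = [15, 20, 25, 18, 30, 22, 28, 16, 32, 26, 19, 23]
normalised = normalise_by_mean(sales)

print([round(v, 3) for v in normalised])
print(round(mean(normalised), 6))  # the normalised data has a mean of 1
```

Values above 1 are months with above-average sales, and values below 1 are months with below-average sales.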


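2. Min-Max Scaling (Normalisation): This method scales the data to a specific range, typically between 0 and 1.

Formula:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the smallest and largest values in the dataset.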


Example: Let's say you have a dataset of house prices ranging from $50,000 to $1,000,000 and you want to normalise them to a range between 0 and 1. If a house costs $300,000, after normalisation:

$$x' = \frac{300000 - 50000}{1000000 - 50000} = \frac{250000}{950000} \approx 0.263$$
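The house-price calculation can be sketched in Python as follows. The function name `min_max_scale` and its `new_min`/`new_max` parameters are assumptions for illustration; they allow a target range other than 0 to 1.

```python
def min_max_scale(x, x_min, x_max, new_min=0.0, new_max=1.0):
    """Map x from [x_min, x_max] onto [new_min, new_max] (hypothetical helper)."""
    if x_max == x_min:
        # A constant dataset has no spread to scale by.
        raise ValueError("x_max and x_min are equal; min-max scaling is undefined")
    scaled = (x - x_min) / (x_max - x_min)  # position within the original range, 0..1
    return new_min + scaled * (new_max - new_min)


# House costing $300,000 in a dataset ranging from $50,000 to $1,000,000.
price = min_max_scale(300_000, 50_000, 1_000_000)
print(round(price, 3))  # 0.263
```

Because the result depends on the minimum and maximum, a single extreme value can compress all other points into a narrow band.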


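3. Z-Score (Standardisation): Z-score normalisation transforms the data to have a mean of 0 and a standard deviation of 1.

Formula:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the dataset.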


Example: Suppose you have a dataset of exam scores with a mean of 75 and a standard deviation of 10. If a student's score is 85, after standardisation:

$$z = \frac{85 - 75}{10} = 1.0$$

The score lies one standard deviation above the mean.
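A minimal Python sketch of the exam-score calculation is shown here. The names `z_score` and `standardise` are assumptions for illustration; `standardise` uses the population standard deviation (`statistics.pstdev`), which is one of two common conventions, the other being the sample standard deviation.

```python
from statistics import mean, pstdev


def z_score(x, mu, sigma):
    """Standardise one value given a known mean and standard deviation."""
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-score is undefined")
    return (x - mu) / sigma


def standardise(values):
    """Standardise a whole dataset using its own mean and population std."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population, not sample, standard deviation
    return [z_score(v, mu, sigma) for v in values]


# Exam score of 85 with mean 75 and standard deviation 10.
print(z_score(85, 75, 10))  # 1.0
```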


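4. Robust Scaling: This method is more suitable than Min-Max scaling for data with outliers. It centres the data on the median and divides by the interquartile range (IQR), both of which are barely affected by extreme values. Unlike Min-Max scaling, the result is not confined to a fixed range: the middle half of the data spans an interval of width 1 around 0, while outliers remain far from 0.

Formula:

$$x' = \frac{x - \operatorname{median}(X)}{\operatorname{IQR}(X)}, \qquad \operatorname{IQR}(X) = Q_3 - Q_1$$

where $Q_1$ and $Q_3$ are the first and third quartiles.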


Example: Consider a dataset of employee salaries with a median salary of $60,000 and an interquartile range (IQR) of $20,000. If an employee earns $80,000, after robust scaling:

$$x' = \frac{80000 - 60000}{20000} = 1.0$$
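The salary calculation can be sketched in Python as below. The names `robust_scale` and `robust_scale_all` are assumptions for illustration. Quartile definitions vary between tools; this sketch relies on the default method of `statistics.quantiles` (Python 3.8+), so other software may give slightly different IQR values for the same data.

```python
from statistics import median, quantiles


def robust_scale(x, med, iqr):
    """Scale one value given a known median and interquartile range."""
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined")
    return (x - med) / iqr


def robust_scale_all(values):
    """Scale a whole dataset using its own median and IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points (default method)
    return [robust_scale(v, median(values), q3 - q1) for v in values]


# Salary of $80,000 with median $60,000 and IQR $20,000.
print(robust_scale(80_000, 60_000, 20_000))  # 1.0
```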


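5. Unit Vector Scaling (Normalisation): This method scales each data vector (one sample with several features) to have a Euclidean norm (magnitude) of 1. It is commonly used in text classification and clustering, where the direction of a vector matters more than its length.

Formula:

$$\mathbf{x}' = \frac{\mathbf{x}}{\lVert \mathbf{x} \rVert_2}, \qquad \lVert \mathbf{x} \rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$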


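Example: If you have a data vector [3, 4] and you want to normalise it to a unit vector, you calculate the magnitude as:

$$\lVert \mathbf{x} \rVert_2 = \sqrt{3^2 + 4^2} = \sqrt{25} = 5$$

And then the unit vector is:

$$\mathbf{x}' = \left[\tfrac{3}{5}, \tfrac{4}{5}\right] = [0.6, 0.8]$$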
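The same steps can be written as a short Python sketch. The function name `to_unit_vector` is an assumption for illustration; it uses only the standard `math` module.

```python
import math


def to_unit_vector(vector):
    """Divide each component by the vector's Euclidean norm (hypothetical helper)."""
    norm = math.sqrt(sum(c * c for c in vector))
    if norm == 0:
        # The zero vector has no direction, so it cannot be normalised.
        raise ValueError("zero vector cannot be scaled to unit length")
    return [c / norm for c in vector]


unit = to_unit_vector([3, 4])
print(unit)  # [0.6, 0.8]
```

The resulting vector keeps the direction of the original but has a length of 1.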
