Thursday, August 31, 2023

Normalisation of data


Normalisation is a data preprocessing technique used in statistics to bring numerical variables measured in different units and on different scales onto a common scale. This improves the performance and convergence of algorithms that are sensitive to the scale of their input variables.

Here are some common methods:


1. Ratio of Mean: Normalisation by the ratio of mean is a technique used to scale data by dividing each data point by the mean of the dataset. This method ensures that the mean of the normalised data becomes 1. It's a simple approach and can be useful in situations where you want to emphasise the relative position of each data point with respect to the mean. 


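Formula:

x' = x / mean(X)

where x is an original data point, mean(X) is the arithmetic mean of the dataset, and x' is the normalised value.

Example:

Suppose you have a dataset representing the monthly sales figures (in thousands of dollars) for a retail store over a year:

X = [15, 20, 25, 18, 30, 22, 28, 16, 32, 26, 19, 23]

Calculate the mean of the original dataset:

mean(X) = 274 / 12 = 22.833

Normalise each data point by dividing it by the mean, for example 15 / 22.833 = 0.657.

Calculated values for normalised data (rounded to three decimals):

X' = [0.657, 0.876, 1.095, 0.788, 1.314, 0.964, 1.226, 0.701, 1.401, 1.139, 0.832, 1.007]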
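As a rough sketch, the same calculation can be written in a few lines of Python using only the standard library. The function name `ratio_of_mean` is just an illustrative choice; the data are the monthly sales figures from the example.

```python
from statistics import mean

# Monthly sales figures (in thousands of dollars) from the example.
sales = [15, 20, 25, 18, 30, 22, 28, 16, 32, 26, 19, 23]

def ratio_of_mean(values):
    """Divide every value by the mean of the dataset."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; ratio-of-mean normalisation is undefined")
    return [v / m for v in values]

normalised = ratio_of_mean(sales)
print([round(v, 3) for v in normalised])
```

A quick sanity check is that the mean of the normalised values should come out as 1.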


2. Min-Max Scaling (Normalisation): This method scales the data to a specific range, typically between 0 and 1.

Formula:

x' = (x - min(X)) / (max(X) - min(X))

Example: Let's say you have a dataset of house prices ranging from $50,000 to $1,000,000 and you want to normalise them to a range between 0 and 1. If a house costs $300,000, after normalisation:

x' = (300,000 - 50,000) / (1,000,000 - 50,000) = 250,000 / 950,000 = 0.263
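A minimal Python sketch of min-max scaling is shown here. The three prices are only the minimum, the example house and the maximum from the text, and the `new_min` / `new_max` parameters are an assumed convenience for targeting ranges other than 0 to 1.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range is zero")
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

# Minimum price, the example house, and maximum price from the text.
prices = [50_000, 300_000, 1_000_000]
scaled = min_max_scale(prices)
print([round(s, 3) for s in scaled])
```

Because the minimum and maximum define the scale, a single extreme value can compress all other values into a narrow band.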


3. Z-Score (Standardisation): Z-Score normalisation transforms the data to have a mean of 0 and a standard deviation of 1.

Formula:

z = (x - mean) / standard deviation

Example: Suppose you have a dataset of exam scores with a mean of 75 and a standard deviation of 10. If a student's score is 85, after standardisation:

z = (85 - 75) / 10 = 1.0
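The sketch below shows the calculation in Python. The list `exam_scores` is hypothetical, and the choice of the population standard deviation (`pstdev`) rather than the sample version is an assumption; either is used in practice.

```python
from statistics import mean, pstdev

def z_score(x, mu, sigma):
    """Standardise a single value given a known mean and standard deviation."""
    return (x - mu) / sigma

def z_scores(values):
    """Standardise a dataset using its own mean and population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

# The exam-score example from the text: score 85, mean 75, standard deviation 10.
print(z_score(85, 75, 10))

# Hypothetical dataset, used only to show the whole-dataset version.
exam_scores = [60, 70, 75, 80, 90]
standardised = z_scores(exam_scores)
print([round(z, 3) for z in standardised])
```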


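4. Robust Scaling: This method is similar in spirit to Min-Max scaling but is more suitable for data with outliers. It centres the data on the median and scales it by the interquartile range (IQR), both of which are far less affected by extreme values than the minimum and maximum. The result is not confined to a fixed range, although the middle half of the data always falls between -1 and 1.

Formula:

x' = (x - median(X)) / IQR(X), where IQR(X) = Q3 - Q1

Example: Consider a dataset of employee salaries with a median salary of $60,000 and an interquartile range (IQR) of $20,000. If an employee earns $80,000, after robust scaling:

x' = (80,000 - 60,000) / 20,000 = 1.0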
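Here is a small Python sketch of robust scaling. The salary list is hypothetical (it includes one deliberate outlier), and the quartile convention (`method="inclusive"`) is an assumption; other conventions give slightly different quartiles.

```python
from statistics import median, quantiles

def robust_value(x, med, iqr):
    """Scale a single value given a known median and interquartile range."""
    return (x - med) / iqr

def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; robust scaling is undefined")
    return [(v - med) / iqr for v in values]

# The salary example from the text: $80,000 with median $60,000 and IQR $20,000.
print(robust_value(80_000, 60_000, 20_000))

# Hypothetical salaries with one extreme outlier.
salaries = [40_000, 50_000, 60_000, 70_000, 80_000, 500_000]
scaled_salaries = robust_scale(salaries)
print([round(s, 2) for s in scaled_salaries])
```

The outlier still maps to a large value, but it no longer distorts how the typical salaries are scaled.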


5. Unit Vector Scaling (Normalisation): This method scales each data point (treated as a vector) to have a Euclidean norm (magnitude) of 1. It is commonly used in text classification and clustering.

Formula:

x' = x / ||x||, where ||x|| = sqrt(x1^2 + x2^2 + ... + xn^2)

Example: If you have a data vector [3, 4] and you want to normalise it to a unit vector, you calculate the magnitude as:

||x|| = sqrt(3^2 + 4^2) = sqrt(25) = 5

And then the unit vector is:

x' = [3/5, 4/5] = [0.6, 0.8]
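A minimal Python sketch of the same calculation, using the [3, 4] vector from the example; the function name `unit_vector` is only illustrative.

```python
import math

def unit_vector(vec):
    """Divide each component by the vector's Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in vec))
    if norm == 0:
        raise ValueError("the zero vector has no direction and cannot be normalised")
    return [c / norm for c in vec]

u = unit_vector([3, 4])
print(u)  # [0.6, 0.8]
```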

Techniques of Composite Index: Some Examples

Thursday, August 31, 2023

Here are a few techniques of combining multiple variables to get a composite index:

1. Simple average: The simple average method takes the arithmetic mean of the variables in the dataset. It assumes that all variables are equally important and treats them equally. Because the variables are usually measured in different units, they must be normalised before they are averaged.

For example, if we are creating a composite index of social development, we might include variables such as education level, health status, and employment rate. We would first normalise each variable across the geographic units being compared (such as countries or cities). Then, for each unit, we would average its normalised values to obtain its composite index.

Suppose we have data for three variables: average income, average education level, and average life expectancy, for three cities: City A, City B, and City C.

City    Average Income    Average Education Level    Average Life Expectancy

A       $50,000           12 years                   75 years
B       $60,000           14 years                   80 years
C       $70,000           16 years                   85 years

Normalisation (Ratio of Mean): each value is divided by the mean of its column across the three cities ($60,000, 14 years and 80 years).

City    Average Income    Average Education Level    Average Life Expectancy

A       0.833             0.857                      0.938
B       1.000             1.000                      1.000
C       1.167             1.143                      1.063

Then, for each city, we average its three normalised values to obtain the composite index:

  • Composite Index for A = (0.833 + 0.857 + 0.938) / 3 = 0.876
  • Composite Index for B = (1.000 + 1.000 + 1.000) / 3 = 1.000
  • Composite Index for C = (1.167 + 1.143 + 1.063) / 3 = 1.124

The composite index of A (0.876) is the average of City A's three normalised variables; a value below 1 means that City A sits below the three-city average overall. This approach is simple and easy to interpret, but it assumes that all variables are equally important and may not reflect their true relative importance or the underlying structure of the data.
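The two steps (normalise each variable by its mean across cities, then average within each city) can be sketched in Python as follows. The dictionary layout and variable names are illustrative assumptions; the figures are those of the three-city table.

```python
from statistics import mean

# Income ($), education (years) and life expectancy (years) for each city.
cities = {
    "A": [50_000, 12, 75],
    "B": [60_000, 14, 80],
    "C": [70_000, 16, 85],
}

# Step 1: mean of each variable (column) across the cities.
column_means = [mean(column) for column in zip(*cities.values())]

# Step 2: divide each value by its column mean, then average within each city.
composite = {
    name: mean([value / m for value, m in zip(values, column_means)])
    for name, values in cities.items()
}

for name, index in composite.items():
    print(name, round(index, 3))
```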


2. Weighted average: The weighted average method assigns a weight to each variable based on its relative importance. The weights can be determined using expert opinion, statistical analysis, or a combination of both.

For example, if we are creating a composite index of economic development, we might assign a higher weight to GDP per capita than to the unemployment rate, reflecting a judgement that GDP per capita is the more important indicator of economic development. The resulting composite index is a weighted average of the variables, with the weights reflecting their relative importance.

For example, let's say we are creating a composite index of the quality of life in different cities, and we have data on four variables: crime rate, air quality, education, and healthcare. We might decide to assign different weights to each variable based on their perceived importance:

  • Crime rate: 20%
  • Air quality: 30%
  • Education: 25%
  • Healthcare: 25%

To calculate the composite index using the weighted average method, we multiply each variable by its weight and sum the products. Because the weights already sum to 1, no further division is needed:

  • Composite Index = (0.2 x Crime Rate) + (0.3 x Air Quality) + (0.25 x Education) + (0.25 x Healthcare)

For example, let's say we have the following data for three cities:

City    Crime Rate (%)    Air Quality Index (%)    Education Level (%)    Healthcare Index (%)

A       10                80                       12                     0.8
B       15                70                       16                     0.9
C       5                 90                       14                     0.7

Applying the formula with the weights above gives:

  • City A: (0.2 x 10) + (0.3 x 80) + (0.25 x 12) + (0.25 x 0.8) = 2 + 24 + 3 + 0.2 = 29.2
  • City B: (0.2 x 15) + (0.3 x 70) + (0.25 x 16) + (0.25 x 0.9) = 3 + 21 + 4 + 0.225 = 28.225
  • City C: (0.2 x 5) + (0.3 x 90) + (0.25 x 14) + (0.25 x 0.7) = 1 + 27 + 3.5 + 0.175 = 31.675

The resulting composite index values are 29.2 for City A, 28.225 for City B, and 31.675 for City C. These values reflect the relative importance of each variable, as determined by the assigned weights. Note that the raw values are used here only to illustrate the arithmetic: the variables are on very different scales, so air quality dominates the result, and a higher crime rate raises the index even though it is undesirable. In practice, the variables should be normalised first, and indicators for which a higher value is worse should be inverted.

The Weighted average method allows for a more sophisticated approach to creating composite indices, taking into account the relative importance of each variable. However, the choice of weights can be subjective, and different weightings can lead to different results. Therefore, it is important to carefully consider the weighting scheme used and to validate the results.
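As a rough illustration, the weighted sum can be computed in Python as below. The dictionary keys are made-up labels, and the raw (un-normalised) table values are used purely to show the arithmetic. Since the weights already sum to 1, the sketch multiplies and adds without any further division, and it refuses weights that do not sum to 1.

```python
weights = {"crime": 0.20, "air": 0.30, "education": 0.25, "healthcare": 0.25}

cities = {
    "A": {"crime": 10, "air": 80, "education": 12, "healthcare": 0.8},
    "B": {"crime": 15, "air": 70, "education": 16, "healthcare": 0.9},
    "C": {"crime": 5, "air": 90, "education": 14, "healthcare": 0.7},
}

def weighted_index(values, weights):
    """Sum of weight * value over all variables; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * values[name] for name in weights)

index = {city: weighted_index(values, weights) for city, values in cities.items()}
for city, value in index.items():
    print(city, round(value, 3))
```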


3. Principal component analysis: PCA is a statistical technique that involves reducing the dimensionality of a dataset by identifying patterns in the data and creating new variables, called principal components, that capture the most important information in the data. These principal components are linear combinations of the original variables, and are chosen to maximise the amount of variation in the data that they explain.

To use PCA for creating a composite index, we would first standardise the data by subtracting the mean and dividing by the standard deviation. This is necessary to ensure that all variables are on the same scale and have equal weighting in the analysis. We would then perform PCA on the standardised data to identify the principal components.

The first principal component will explain the most variance in the data, followed by the second, third, and so on. We can choose to retain only the first few principal components, depending on how much variance we want to capture. These principal components can then be used as the basis for the composite index.

To calculate the composite index using PCA, we would first calculate the scores for each observation on each principal component. These scores are the coordinates of each observation along each principal component. We would then weight each retained principal component by its share of the variance explained, and sum the weighted scores to obtain the composite index.

For example, let's say we are creating a composite index of economic development in different countries, and we have data on four variables: GDP per capita, unemployment rate, inflation rate, and trade openness. We might perform PCA on the standardised data to identify the principal components:

Variable              GDP per capita    Unemployment rate    Inflation rate    Trade openness

Mean                  5000              7.5                  2.5               0.4
Standard Deviation    1000              1.5                  1.0               0.1

The first principal component might be a linear combination of all four variables, with weights (loadings) of 0.5 for GDP per capita, -0.3 for unemployment rate, 0.4 for inflation rate, and 0.7 for trade openness, indicating that trade openness contributes most strongly to this component. The second principal component might load mainly on GDP per capita and unemployment rate, suggesting that these two variables move together along a separate dimension.

To calculate the composite index using the first two principal components, we would first calculate the scores for each country on each principal component:

Country    GDP per capita    Unemployment rate    Inflation rate    Trade openness    PC1 Score    PC2 Score

A          6000              6.0                  2.0               0.5               1.2          0.8
B          4000              8.0                  3.0               0.3               -1.1         -0.8
C          5500              7.5                  2.5               0.6               1.0          0.3

We would then weight each principal component by its variance, which can be obtained from the PCA analysis:

Principal Component    Variance

PC1                    2.5
PC2                    0.5

Dividing each variance by the total variance of the two retained components (2.5 + 0.5 = 3.0) gives weights of 2.5 / 3.0 = 0.833 for PC1 and 0.5 / 3.0 = 0.167 for PC2. Finally, we would sum the weighted scores to obtain the composite index:

  • Composite Index = (0.833 * PC1 Score) + (0.167 * PC2 Score)

For example, for country A, the composite index would be:

  • Composite Index for Country A = (0.833 * 1.2) + (0.167 * 0.8) = 1.13

This composite index reflects both the country's level of economic development captured by the first principal component and its level of GDP and unemployment captured by the second principal component.

Overall, the PCA method is useful when we have many variables that are highly correlated, making it difficult to disentangle their individual effects on the composite index. PCA allows us to identify the underlying structure of the data and capture the most important information in a small number of principal components. However, interpreting the principal components can be difficult, and the weights assigned to each variable in the composite index are not always intuitive.
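The PCA steps described above (standardise, extract components, score, weight by explained variance) can be sketched with NumPy as follows. The six rows of data are hypothetical, as is the choice to keep two components. Principal components are only defined up to sign, so the sketch flips each one so that its GDP loading is positive; that convention is an assumption made to keep the index direction stable.

```python
import numpy as np

# Hypothetical data. Columns: GDP per capita, unemployment rate,
# inflation rate, trade openness. Rows: countries.
X = np.array([
    [6000, 6.0, 2.0, 0.5],
    [4000, 8.0, 3.0, 0.3],
    [5500, 7.5, 2.5, 0.6],
    [5200, 7.0, 2.2, 0.4],
    [4300, 9.0, 3.5, 0.2],
    [5000, 7.5, 2.8, 0.4],
])

# 1. Standardise each variable (mean 0, sample standard deviation 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Eigen-decomposition of the correlation matrix, sorted by descending variance.
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Sign convention (assumption): make the GDP loading of each component positive.
eigvecs = eigvecs * np.where(eigvecs[0, :] < 0, -1.0, 1.0)

# 3. Scores of each country on the first k components.
k = 2
scores = Z @ eigvecs[:, :k]

# 4. Weight each retained component by its share of the retained variance.
weights = eigvals[:k] / eigvals[:k].sum()
composite = scores @ weights

print(np.round(weights, 3))
print(np.round(composite, 3))
```

Weighting by share of retained variance means the first component usually dominates the index, which is worth keeping in mind when interpreting the result.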


4. Factor analysis: Factor analysis is another statistical technique that can be used to create a composite index. It is similar to principal component analysis (PCA) in that it involves identifying underlying factors or dimensions in a set of variables. However, whereas PCA seeks components that account for the total variance of the variables, factor analysis models only the variance that the variables share (the common variance) and treats the remainder as specific to each variable.

To illustrate how factor analysis can be used to create a composite index, consider the following example. Suppose we want to create a composite index of financial stability for a group of banks, and we have data on several financial ratios, including capital adequacy ratio, liquidity ratio, asset quality ratio, and profitability ratio. We believe that these ratios are related to a latent construct of financial stability, but we do not know the exact relationship among them.

To identify the underlying factors or dimensions, we would first conduct a factor analysis on the financial ratios. The factor analysis would identify the minimum number of factors needed to explain the common variance among the variables.

For example, it might identify two factors: one related to capital and asset quality, and another related to liquidity and profitability.

Next, we would estimate factor scores for each bank on each factor. Factor scores represent the degree to which a bank exhibits the characteristics of each factor. We can then combine these factor scores, using a weight for each factor, to create the composite index of financial stability. For example, we might use the following formula:

  • Composite Index = 0.6 * Factor 1 Score + 0.4 * Factor 2 Score

where 0.6 and 0.4 are the weights assigned to each factor based on their importance in explaining financial stability.

The resulting composite index reflects the degree to which each bank exhibits the characteristics of each factor, with higher scores indicating greater financial stability.

One advantage of the factor analysis method is that it allows us to identify the underlying structure of the data and capture the most important information in a small number of factors. It also allows us to estimate factor scores for each observation, which can then be combined into the composite index. However, the method assumes that the variables are related to the underlying factors in a linear and additive way, which may not always be the case. It may also be sensitive to the choice of rotation method used to identify the factors. Therefore, it is important to carefully evaluate the assumptions and limitations of the method when using it to create a composite index.
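To make the procedure more concrete, here is a simplified NumPy sketch. It uses one particular and fairly basic variant: a single-pass principal-factor extraction with squared multiple correlations as communality estimates, no rotation, and regression-type factor scores. It is not the only way to run a factor analysis. The bank data are simulated from an assumed two-factor model, and weighting the factors by their share of common variance is likewise an assumption standing in for the 0.6 / 0.4 weights in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_banks, k = 200, 2

# Simulated (hypothetical) ratios driven by two latent factors.
# Rows of true_loadings: capital adequacy, liquidity, asset quality, profitability.
true_loadings = np.array([[0.8, 0.0], [0.0, 0.8], [0.7, 0.0], [0.0, 0.7]])
latent = rng.normal(size=(n_banks, k))
X = latent @ true_loadings.T + rng.normal(scale=0.5, size=(n_banks, 4))

# Standardise and compute the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)
R_inv = np.linalg.inv(R)

# Replace the diagonal with squared multiple correlations (communality estimates),
# so that only the variance shared among variables is analysed.
R_reduced = R.copy()
np.fill_diagonal(R_reduced, 1.0 - 1.0 / np.diag(R_inv))

# Extract the k largest factors from the reduced correlation matrix.
eigvals, eigvecs = np.linalg.eigh(R_reduced)
top = np.argsort(eigvals)[::-1][:k]
loadings = eigvecs[:, top] * np.sqrt(eigvals[top])

# Sign convention (assumption): make the loadings of each factor sum to a positive value.
loadings = loadings * np.where(loadings.sum(axis=0) < 0, -1.0, 1.0)

# Regression-type factor scores for each bank.
factor_scores = Z @ R_inv @ loadings

# Weight each factor by its share of the common variance it explains.
common_variance = (loadings ** 2).sum(axis=0)
weights = common_variance / common_variance.sum()
composite = factor_scores @ weights

print(np.round(loadings, 2))
print(np.round(weights, 3))
```

In practice a dedicated statistics package, usually with a rotation step, would be used instead of this hand-rolled version.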


5. Cluster Analysis: Cluster analysis is a statistical technique that can be used to create a composite index by grouping observations (e.g. countries, companies, individuals) into clusters based on their similarity across multiple variables. The resulting clusters can then be used to create a composite index that reflects the characteristics of each group.

To illustrate how cluster analysis can be used to create a composite index, consider the following example. Suppose we want to create a composite index of social development for a group of countries, and we have data on several social indicators, including life expectancy, literacy rate, poverty rate, and access to basic services. We believe that these indicators are related to a latent construct of social development, but we do not know the exact relationship among them.

To identify the underlying clusters of countries, we would first conduct a cluster analysis on the social indicators. The cluster analysis would group countries based on their similarity across the indicators, with countries in the same cluster having similar characteristics. For example, it might identify two clusters: one representing more developed countries with higher life expectancy, literacy rate, and access to basic services, and another representing less developed countries with higher poverty rates and lower levels of social indicators.

Next, we would assign weights to each indicator based on their importance in defining each cluster. For example, we might assign higher weights to indicators that are more strongly associated with social development in each cluster. We can then use these weights to create the composite index of social development for each cluster. For example, we might use the following formula:

  • Composite Index for Cluster 1 = 0.4 * Life Expectancy + 0.4 * Literacy Rate + 0.2 * Access to Basic Services
  • Composite Index for Cluster 2 = 0.6 * Poverty Rate + 0.4 * Access to Basic Services

where the coefficients (0.4, 0.4, and 0.2 for Cluster 1; 0.6 and 0.4 for Cluster 2) are the weights assigned to each indicator based on its importance in defining that cluster.

The resulting composite index reflects the characteristics of each cluster and provides a more nuanced picture of social development than a single index for all countries combined.

One advantage of the cluster analysis method is that it allows us to identify groups of observations with similar characteristics and create composite indices tailored to each group. It also allows us to take into account the relative importance of each variable in defining each group. However, the method is sensitive to the choice of distance metric and clustering algorithm used, which may affect the resulting clusters and composite indices. Therefore, it is important to carefully evaluate the assumptions and limitations of the method when using it to create a composite index.
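The following sketch illustrates the idea with a small hand-written k-means in NumPy. The six countries and their already-normalised indicator values are hypothetical. K-means is only one of many possible clustering algorithms, and the deterministic choice of starting centroids is an assumption made to keep the example reproducible. The per-cluster weights are the ones from the two formulas in the text.

```python
import numpy as np

# Hypothetical indicators, already scaled to 0-1. Columns: life expectancy,
# literacy rate, poverty rate, access to basic services.
X = np.array([
    [0.90, 0.95, 0.10, 0.92],
    [0.85, 0.90, 0.15, 0.88],
    [0.88, 0.97, 0.08, 0.95],
    [0.45, 0.50, 0.70, 0.40],
    [0.50, 0.55, 0.65, 0.35],
    [0.40, 0.45, 0.75, 0.30],
])

def k_means(data, k, n_iter=100):
    """Plain k-means with evenly spaced rows as starting centroids."""
    start = np.linspace(0, len(data) - 1, k).astype(int)
    centroids = data[start].copy()
    for _ in range(n_iter):
        # Distance from every observation to every centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        updated = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

labels, centroids = k_means(X, k=2)

# Cluster 1 in the text is the more developed group: identify it here as the
# cluster whose centroid has the higher life expectancy.
developed = int(centroids[:, 0].argmax())

# Indicator weights per cluster, taken from the two formulas in the text.
w_cluster1 = np.array([0.4, 0.4, 0.0, 0.2])
w_cluster2 = np.array([0.0, 0.0, 0.6, 0.4])

composite = np.where(labels == developed, X @ w_cluster1, X @ w_cluster2)
print(labels)
print(np.round(composite, 3))
```

Note that indices built with different weights are comparable within a cluster but not across clusters.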


6. Regression analysis: Regression analysis is another technique that can be used to create a composite index. It involves estimating a linear regression model that predicts an outcome variable (e.g., economic development) from a set of predictor variables (e.g., GDP per capita, education level, healthcare expenditure). The coefficients of the regression model are then used as weights to combine the predictor variables into a composite index.

The regression model can be estimated using a variety of techniques, such as ordinary least squares regression or ridge regression. The choice of technique will depend on the characteristics of the data and the goals of the analysis.

To illustrate how regression analysis can be used to create a composite index, consider the following example. We want to create a composite index of social welfare for different states in a country, and we have data on four variables: poverty rate, unemployment rate, education level, and healthcare expenditure. We would first estimate a linear regression model that predicts social welfare from these variables:

  • Social Welfare = b0 + b1 * Poverty Rate + b2 * Unemployment Rate + b3 * Education Level + b4 * Healthcare Expenditure + e

where b0 is the intercept, b1 to b4 are the regression coefficients for each predictor variable, and e is the error term.

The regression coefficients indicate the strength and direction of the relationship between each predictor variable and social welfare, holding all other variables constant. For example, a positive coefficient for education level would indicate that states with higher education levels tend to have higher social welfare levels, all else being equal.

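To use the regression coefficients as weights for the composite index, we would first normalise them so that they sum to 1, by dividing each coefficient by the sum of all four coefficients, S = b1 + b2 + b3 + b4:

  • Weighted Index = (b1/S) * Poverty Rate + (b2/S) * Unemployment Rate + (b3/S) * Education Level + (b4/S) * Healthcare Expenditure

This works cleanly when all coefficients have the same sign. When the signs are mixed, as is likely here because poverty and unemployment would be expected to lower social welfare, a common alternative is to divide by the sum of the absolute values of the coefficients, so that the signs are preserved and the divisor cannot be close to zero.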

The resulting composite index reflects the relative importance of each predictor variable in predicting social welfare, as determined by the regression model.

One advantage of the regression analysis method is that it allows us to estimate the contribution of each predictor variable to the composite index, and to control for the effects of other variables in the analysis. However, the method assumes that the relationship between the predictor variables and the outcome variable is linear and that there are no interactions among the variables. It may also be sensitive to outliers or influential observations in the data. Therefore, it is important to carefully evaluate the assumptions and limitations of the method when using it to create a composite index.
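A minimal NumPy sketch of the regression approach is given below. The data are simulated: the predictors, the "true" coefficients and the noise level are all hypothetical. Because the simulated coefficients have mixed signs, the sketch divides by the sum of their absolute values rather than by their plain sum; that is a deliberate substitution for illustration, not the formula given in the text.

```python
import numpy as np

rng = np.random.default_rng(42)
n_states = 50

# Hypothetical standardised predictors. Columns: poverty rate,
# unemployment rate, education level, healthcare expenditure.
X = rng.normal(size=(n_states, 4))

# Hypothetical outcome generated from assumed coefficients plus noise.
assumed_b = np.array([-0.5, -0.3, 0.6, 0.4])
social_welfare = 2.0 + X @ assumed_b + rng.normal(scale=0.1, size=n_states)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n_states), X])
coef, *_ = np.linalg.lstsq(design, social_welfare, rcond=None)
b0, b = coef[0], coef[1:]

# Normalise so the absolute weights sum to 1 (assumption: mixed-sign variant).
weights = b / np.abs(b).sum()

# Composite index for each state.
index = X @ weights

print(np.round(b, 3))
print(np.round(weights, 3))
```

With this normalisation, poverty and unemployment keep negative weights, so higher values of either lower the index.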