PySpark Variance Inflation Factor (VIF) – Understanding VIF and how it can help you improve your regression models.

VIF is critical for understanding multicollinearity in regression models. Let’s break the concept down into simple terms, explain how to calculate VIF, and discuss its practical uses.

What is Variance Inflation Factor (VIF)?

VIF is a measure that helps us understand the extent of multicollinearity in a multiple regression model. Multicollinearity occurs when two or more independent variables in the model are highly correlated with each other. This correlation can cause problems in interpreting the model, as it becomes difficult to determine the individual impact of each variable on the dependent variable.

VIF quantifies the severity of multicollinearity by measuring how much the variance of a regression coefficient is increased due to the presence of multicollinearity. In simpler terms, VIF tells us how much the uncertainty of a coefficient is inflated because of the correlation between independent variables.

How to Calculate VIF?

Calculating VIF involves the following steps:

1. Run a multiple regression model with all the independent variables.
2. For each independent variable, run a separate (auxiliary) regression that uses it as the dependent variable and the remaining independent variables as predictors.
3. Calculate the R-squared value for each of these auxiliary regressions.
4. Compute VIF for each independent variable using the following formula:

VIF = 1 / (1 – R-squared)

Here, R-squared is the value from that variable’s own auxiliary regression, not from the original model.
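To make the formula concrete, here is a minimal sketch in plain Python. The helper name `vif_from_r2` is hypothetical (it is not part of PySpark or any library); it simply evaluates the formula for a few R-squared values.

```python
def vif_from_r2(r_squared):
    """Return VIF = 1 / (1 - R-squared) for an R-squared in [0, 1)."""
    if not 0.0 <= r_squared < 1.0:
        # R-squared of exactly 1 means perfect collinearity: VIF is unbounded
        raise ValueError("R-squared must be in [0, 1) for a finite VIF")
    return 1.0 / (1.0 - r_squared)


# Evaluate the formula for a few illustrative R-squared values
for r2 in (0.0, 0.5, 0.8, 0.9):
    print(f"R-squared = {r2:.2f} -> VIF = {vif_from_r2(r2):.2f}")
```

An R-squared of 0.8 corresponds to a VIF of 5, and an R-squared of 0.9 corresponds to a VIF of 10, so VIF grows quickly as a variable becomes more predictable from the others.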

Let’s take a look at an example to better understand the process.

Imagine we have a dataset with three independent variables: X1, X2, and X3. We want to calculate the VIF for each variable.

1. Run a multiple regression with X1, X2, and X3 as independent variables and Y as the dependent variable.
2. Run three separate regression models:
   a. X1 as the dependent variable, with X2 and X3 as predictors.
   b. X2 as the dependent variable, with X1 and X3 as predictors.
   c. X3 as the dependent variable, with X1 and X2 as predictors.
3. Calculate the R-squared value for each of these separate regression models.
4. Compute the VIF for each independent variable using the formula above.
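The three-variable example can be sketched locally with NumPy before moving to PySpark. Everything here is an assumption for illustration: the data is synthetic (X3 is deliberately built as a noisy copy of X1), and the helper names `r_squared` and `vif_by_regression` are hypothetical.

```python
import numpy as np


def r_squared(y, X):
    """R-squared of an ordinary least squares fit of y on X, with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def vif_by_regression(X):
    """VIF for each column of X, regressing it on the remaining columns."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        vifs.append(1.0 / (1.0 - r_squared(X[:, j], others)))
    return vifs


# Synthetic data (assumption): X2 is independent, X3 is nearly a copy of X1
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

for name, vif in zip(["X1", "X2", "X3"], vif_by_regression(X)):
    print(name, round(vif, 2))
```

Because X1 and X3 carry almost the same information, both should show a large VIF, while X2 should stay close to 1.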

What is the Use of VIF?

The primary use of VIF is to identify and mitigate multicollinearity in regression models. High VIF values indicate that an independent variable is highly correlated with other independent variables in the model.

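As a rule of thumb, a VIF value above 10 suggests severe multicollinearity, while a value below 5 is generally considered acceptable. Values between 5 and 10 fall in a gray area and deserve a closer look. These cutoffs are conventions, not strict statistical tests.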
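The rule of thumb can be written as a small helper. This is only a sketch: the function name `describe_vif`, the label strings, and the choice to call values from 5 to 10 "moderate" are assumptions, since the rule itself only names the two ends of the range. The VIF values in the usage loop are made up for illustration.

```python
def describe_vif(vif, moderate=5.0, severe=10.0):
    """Map a VIF value to a rough label using the 5 / 10 rule of thumb."""
    if vif > severe:
        return "severe"
    if vif < moderate:
        return "acceptable"
    # Assumption: anything from 5 to 10 is treated as a warning zone
    return "moderate"


# Hypothetical VIF values, not taken from any real dataset
for name, vif in {"feature_a": 1.3, "feature_b": 7.2, "feature_c": 14.8}.items():
    print(f"{name}: VIF = {vif} -> {describe_vif(vif)}")
```

Keeping the cutoffs as parameters makes it easy to apply a stricter or looser convention for a given project.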

By calculating VIF, you can:

Identify the variables that contribute the most to multicollinearity.
Decide whether to remove, combine, or transform variables to reduce multicollinearity.
Improve the interpretability and reliability of your regression models.

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

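# findspark locates a local Spark installation; it is only needed
# when pyspark is not already importable in your environment
import findspark
findspark.init()

from pyspark import SparkFiles
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Create (or reuse) the SparkSession used throughout this tutorial
spark = SparkSession.builder.appName("VIF Calculation").getOrCreate()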

2. Preparing the Sample Data

To demonstrate the method of calculating VIF, we’ll use a sample dataset. First, let’s load the data into a DataFrame:

# Sample dataset: download the Iris CSV and make it available to Spark
url = "https://raw.githubusercontent.com/selva86/datasets/master/Iris.csv"
spark.sparkContext.addFile(url)

# Read the downloaded file, inferring numeric column types from the data
df = spark.read.csv(SparkFiles.get("Iris.csv"), header=True, inferSchema=True)
df.show(5)

# Numeric predictor columns to check for multicollinearity
columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

+---+-------------+------------+-------------+------------+-----------+
| Id|SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|    Species|
+---+-------------+------------+-------------+------------+-----------+
|  1|          5.1|         3.5|          1.4|         0.2|Iris-setosa|
|  2|          4.9|         3.0|          1.4|         0.2|Iris-setosa|
|  3|          4.7|         3.2|          1.3|         0.2|Iris-setosa|
|  4|          4.6|         3.1|          1.5|         0.2|Iris-setosa|
|  5|          5.0|         3.6|          1.4|         0.2|Iris-setosa|
+---+-------------+------------+-------------+------------+-----------+
only showing top 5 rows

3. VIF Function

Let’s create a PySpark function to calculate VIF for the defined variables and eliminate variables iteratively based on a given VIF threshold.

def calculate_vif(data, features, vif_threshold=5):
    """
    Calculate Variance Inflation Factor (VIF) for the defined variables
    and eliminate variables iteratively based on VIF threshold.

    :param data: A PySpark DataFrame containing the predictor variables
    :param features: A list of column names for the predictor variables
    :param vif_threshold: The VIF threshold for eliminating variables (default is 5)
    :return: 1. A list of remaining features after eliminating variables based on the VIF threshold
             2. A DataFrame of remaining features and their VIF Values
    """
    remaining_features = list(features)
    vif_values = []

    # At least two features are needed to regress one on the others
    while len(remaining_features) > 1:
        vif_values = []
        for feature in remaining_features:
            # Assemble all other features into a vector of predictors
            predictors = [name for name in remaining_features if name != feature]
            assembler = VectorAssembler(inputCols=predictors, outputCol="features")
            feature_data = assembler.transform(data)

            # Fit a linear regression with the current feature as the label
            lr = LinearRegression(featuresCol="features", labelCol=feature)
            r2 = lr.fit(feature_data).summary.r2

            # Calculate VIF; a perfectly collinear feature gets an infinite VIF
            vif = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
            vif_values.append((feature, float(vif)))

        # Find the feature with the highest VIF
        max_vif_feature, max_vif_value = max(vif_values, key=lambda item: item[1])

        # Stop once every VIF is within the threshold; otherwise drop the worst one
        if max_vif_value <= vif_threshold:
            break
        remaining_features.remove(max_vif_feature)

    # A lone feature has nothing to be collinear with, so its VIF is 1
    if len(remaining_features) < 2:
        vif_values = [(name, 1.0) for name in remaining_features]

    # Uses the SparkSession created in step 1
    vif_df = spark.createDataFrame(vif_values, "Variable string, VIF double")
    return remaining_features, vif_df
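The function above fits one regression per feature on every pass, which can mean many Spark jobs on wide datasets. As an alternative, the VIF of each variable also equals the corresponding diagonal entry of the inverse of the predictors’ correlation matrix, provided that matrix is invertible. The sketch below checks this with NumPy on synthetic data; the helper name `vif_from_correlation` and the data are assumptions. In PySpark, the correlation matrix could come from `pyspark.ml.stat.Correlation.corr` on an assembled vector column, though that wiring is not shown here.

```python
import numpy as np


def vif_from_correlation(X):
    """VIF per column as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic data (assumption): c is largely determined by a and b
rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + b + 0.5 * rng.normal(size=300)
X = np.column_stack([a, b, c])

vifs = vif_from_correlation(X)
for name, vif in zip(["a", "b", "c"], vifs):
    print(name, round(float(vif), 2))
```

This computes all VIFs from a single matrix, but it only reports them; the iterative elimination step would still need to recompute the matrix after each variable is dropped.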

4. How to use the calculate_vif function?

Before calling calculate_vif, make sure you understand its parameters and their data types: data is a PySpark DataFrame, features is a list of column names, and vif_threshold is a number (default 5).

The function returns two objects:
1. A Python list of remaining features after eliminating variables based on the VIF threshold
2. A PySpark DataFrame of remaining features and their VIF values

Let’s look at the example below to see how it runs:

remaining_features, vif_values = calculate_vif(df, columns, vif_threshold=5)

print("Remaining features after VIF elimination:", remaining_features)
vif_values.show()

Remaining features after VIF elimination: ['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm']
+-------------+------------------+
|     Variable|               VIF|
+-------------+------------------+
|SepalLengthCm| 3.414225339022868|
| SepalWidthCm|1.2945066660800224|
| PetalWidthCm|3.8646777121823304|
+-------------+------------------+
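To read these numbers, it can help to invert the formula: R-squared = 1 – 1 / VIF tells us how much of each variable is explained by the others, and the square root of VIF is the factor by which the coefficient’s standard error is inflated compared with uncorrelated predictors. The sketch below uses the VIF values copied from the output table above; the helper name `r2_from_vif` is hypothetical.

```python
# VIF values copied from the output table above
vif_table = {
    "SepalLengthCm": 3.414225339022868,
    "SepalWidthCm": 1.2945066660800224,
    "PetalWidthCm": 3.8646777121823304,
}


def r2_from_vif(vif):
    """Invert VIF = 1 / (1 - R-squared) to recover R-squared."""
    return 1.0 - 1.0 / vif


for name, vif in vif_table.items():
    print(f"{name}: R-squared = {r2_from_vif(vif):.3f}, sqrt(VIF) = {vif ** 0.5:.2f}")
```

All three remaining variables sit below the threshold of 5 used in the call, which is why the elimination loop stopped after removing PetalLengthCm.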

Conclusion

Variance Inflation Factor (VIF) is a vital tool for detecting multicollinearity in multiple regression models. By understanding VIF and how to calculate it, you can build more accurate and robust regression models, making your data analysis more reliable and insightful.
