Linear Regression with Python: A Simple Guide

What is Linear Regression?

Linear regression is a basic but powerful method used in statistics and machine learning to find the relationship between two variables. Imagine you want to understand how the number of hours you study affects your exam score. Linear regression helps you find a straight-line relationship between these two things.

In simple terms, it draws a straight line through your data points that best fits the pattern of the data. This line can then be used to predict or estimate one value based on another.

Why Learn Linear Regression?

It helps you measure how strongly one thing is associated with another (a fitted line alone does not prove cause and effect).
It’s a foundation for many more advanced data analysis and machine learning techniques.
It’s simple to implement with Python, a popular programming language.

Key Terms You Should Know

Variable: Something you can measure or change. For example, hours studied or exam scores.
Dependent Variable: The outcome you want to predict (e.g., exam score).
Independent Variable: The input you use to make the prediction (e.g., hours studied).
Regression Line: The best-fit straight line that shows the relationship.
Slope: How steep the line is; shows how much the dependent variable changes for each unit change in the independent variable.
Intercept: The point where the line crosses the Y-axis, showing the value of the dependent variable when the independent variable is zero.

How Does Linear Regression Work?

Imagine you plot your data points on a graph, with hours studied on the X-axis and exam scores on the Y-axis. Linear regression finds the line that minimizes the sum of the squared vertical distances (errors) between the line and the points. This approach is called ordinary least squares.
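To make the idea of "smallest error" concrete, here is a minimal sketch that scores two candidate lines by their sum of squared errors, the quantity that ordinary least squares minimizes. It uses the example data from later in this guide. The helper name sum_squared_errors and the guessed line (m = 5, b = 48) are illustrative assumptions, not part of any library.

```python
# Example data: hours studied and the matching exam scores.
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 75]


def sum_squared_errors(m, b):
    """Sum of squared vertical distances between the line y = m*x + b and the data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(hours, scores))


# A hypothetical guess at a line, chosen only for comparison.
guess_error = sum_squared_errors(5, 48)

# The least-squares line for this data (slope 6.5, intercept 43.5).
best_error = sum_squared_errors(6.5, 43.5)

print("Error of the guessed line:", guess_error)
print("Error of the least-squares line:", best_error)
```

The guessed line gives an error of 30, while the least-squares line gives 7.5. No other straight line can do better on this data.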

The mathematical formula of a simple linear regression line is:

Y = mX + b

Where:

Y is the predicted score (dependent variable)
X is the number of hours studied (independent variable)
m is the slope of the line
b is the intercept
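For a single independent variable, m and b can be computed directly from the data with the standard least-squares formulas. The sketch below works them out in plain Python, using the example data from later in this guide, so you can see what a library does for you. The variable names are illustrative choices.

```python
# Example data: hours studied (X) and exam scores (Y).
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 75]

mean_x = sum(hours) / len(hours)
mean_y = sum(scores) / len(scores)

# Slope: m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
denominator = sum((x - mean_x) ** 2 for x in hours)
m = numerator / denominator

# Intercept: b = mean_y - m * mean_x
b = mean_y - m * mean_x

print("m =", m)
print("b =", b)

# Plug X = 6 into Y = mX + b.
print("Y for X = 6:", m * 6 + b)
```

For this data the formulas give m = 6.5 and b = 43.5, so the line is Y = 6.5X + 43.5.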

How to Perform Linear Regression in Python

Step 1: Installing Python and Required Libraries

If you don’t have Python installed, you can download it from python.org. Python comes with a package manager called pip, which lets you install libraries: collections of code written by other people that help you with specific tasks.

We will use a library called scikit-learn, which makes performing linear regression easy.

Open your command prompt (Windows) or terminal (Mac/Linux) and type:

pip install scikit-learn

This installs the tools we need.
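To confirm the installation worked, you can run a short check like the one below. It only imports the libraries and prints their versions. NumPy is installed automatically as a dependency of scikit-learn. The version numbers you see will depend on your setup.

```python
# Quick check that the libraries can be imported.
import numpy
import sklearn

print("scikit-learn version:", sklearn.__version__)
print("NumPy version:", numpy.__version__)
```

If either import fails with ModuleNotFoundError, pip probably installed the library for a different Python installation than the one you are running.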

Step 2: Writing Your First Linear Regression Program

Open any text editor or an environment like Jupyter Notebook or VS Code, then write the following code step by step.

Step 3: Import Python Libraries

First, we import the libraries. Think of this as bringing tools to your workspace.

from sklearn.linear_model import LinearRegression  # To perform linear regression
import numpy as np  # For working with numbers and arrays

Step 4: Prepare Your Data

We need to create some example data: the number of hours studied and the corresponding exam scores.

# Hours studied (independent variable)
hours_studied = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)

# Exam scores (dependent variable)
exam_scores = np.array([50, 55, 65, 70, 75])

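np.array creates a NumPy array, a container of numbers similar to a Python list.
.reshape(-1, 1) turns the one-dimensional array into a single column with one row per observation. scikit-learn expects the inputs as a two-dimensional array of shape (samples, features), even when there is only one feature.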
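If the effect of reshape is hard to picture, this small sketch prints the shape before and after. The names flat and column are only for illustration.

```python
import numpy as np

# A one-dimensional array: five numbers in a row.
flat = np.array([1, 2, 3, 4, 5])
print("Before reshape:", flat.shape)

# -1 asks NumPy to work out the number of rows; 1 means one column.
column = flat.reshape(-1, 1)
print("After reshape:", column.shape)
print(column)
```

The shape changes from (5,) to (5, 1): five rows, one column. The numbers themselves stay the same.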

Step 5: Create the Linear Regression Model

Now, we create the model and teach it to find the best-fit line.

model = LinearRegression()
model.fit(hours_studied, exam_scores)

model.fit() tells scikit-learn to compute the slope and intercept of the line that best fits the data.
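Once the model is fitted, one common way to see how closely the line follows the data is the R² score, which scikit-learn exposes through the model's score() method. This check goes slightly beyond the steps in this guide, so treat it as an optional sketch. A value of 1.0 would mean the line passes through every point.

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Same example data as in Step 4.
hours_studied = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
exam_scores = np.array([50, 55, 65, 70, 75])

model = LinearRegression()
model.fit(hours_studied, exam_scores)

# score() returns R^2: the share of the variation in the scores
# that the fitted line accounts for.
r_squared = model.score(hours_studied, exam_scores)
print("R^2 on the training data:", r_squared)
```

For this data the score is roughly 0.98. Keep in mind that it is measured on the same five points the model was trained on, so it says little about how well the line would predict new students.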

Step 6: Get the Results

We want to know the slope and intercept — the parameters of our line.

print("Slope (m):", model.coef_[0])
print("Intercept (b):", model.intercept_)

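This prints the slope and intercept. For the example data, the slope is about 6.5 and the intercept about 43.5, so each extra hour of study adds roughly 6.5 points to the predicted score.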
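If you want to double-check these numbers without scikit-learn, NumPy's polyfit function fits a polynomial by least squares; with degree 1 it fits a straight line. This is a cross-check only, not how scikit-learn computes its result internally.

```python
import numpy as np

# Same example data, kept one-dimensional because polyfit expects 1-D x values.
hours = np.array([1, 2, 3, 4, 5])
scores = np.array([50, 55, 65, 70, 75])

# Degree 1 means a straight line; coefficients come back highest power first.
slope, intercept = np.polyfit(hours, scores, 1)

print("Slope (m):", slope)
print("Intercept (b):", intercept)
```

Both approaches should agree up to tiny floating-point differences.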

Step 7: Make Predictions

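Suppose you want to predict the exam score for someone who studied 6 hours. predict() expects the same two-dimensional shape used for training, which is why the value is written as [[6]]. With the example data, the result is about 82.5.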

predicted_score = model.predict([[6]])
print("Predicted exam score for 6 hours of study:", predicted_score[0])
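predict() can also handle several values at once. The sketch below predicts scores for 6, 7, and 8 hours and confirms that the results match Y = mX + b computed by hand from the fitted slope and intercept. The names new_hours and manual are illustrative.

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Same example data as in Step 4.
hours_studied = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
exam_scores = np.array([50, 55, 65, 70, 75])

model = LinearRegression()
model.fit(hours_studied, exam_scores)

# Several new inputs, shaped as a column just like the training data.
new_hours = np.array([6, 7, 8]).reshape(-1, 1)
predictions = model.predict(new_hours)

# The same predictions computed directly from Y = mX + b.
manual = model.coef_[0] * new_hours.ravel() + model.intercept_

for h, p in zip(new_hours.ravel(), predictions):
    print(f"{h} hours -> predicted score {p:.1f}")

print("Matches Y = mX + b:", np.allclose(predictions, manual))
```

All three inputs lie outside the 1 to 5 hour range the model was trained on. The line simply keeps going, so predictions far from the original data should be treated with caution.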

Full Python Code Together

from sklearn.linear_model import LinearRegression
import numpy as np

# Data
hours_studied = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
exam_scores = np.array([50, 55, 65, 70, 75])

# Create model and train
model = LinearRegression()
model.fit(hours_studied, exam_scores)

# Results
print("Slope (m):", model.coef_[0])
print("Intercept (b):", model.intercept_)

# Prediction
predicted_score = model.predict([[6]])
print("Predicted exam score for 6 hours of study:", predicted_score[0])

Linear regression is just the start. As you learn more, you can explore multiple variables, more complex models, and real datasets.
Practice running this code and changing the numbers to see how predictions change.
Python and scikit-learn make it easy to bring statistical concepts to life with just a few lines of code.