The Least Squares Method is a way to find the best-fitting line for a set of data. It might sound complicated, but the idea behind it is close to reasoning you already use in everyday life.
Example: Ice Cream Sales and Temperature
You run an ice cream shop. Every day, you record the highest temperature of the day and how many ice creams you sell.
| Highest Temperature (°C) | Ice Cream Sales (Units) |
| --- | --- |
| 20 | 10 |
| 25 | 15 |
| 30 | 20 |
| 35 | 25 |
Looking at this data, you notice that “ice cream sales tend to be higher when the temperature is warmer.” You want to represent the relationship between temperature and sales with a straight line!
The Difficulty of Drawing a Line
But real data points rarely fall exactly on a single straight line, and there are countless lines you could draw through or near them. Which one should you choose as the “best” line?
This is where the Least Squares Method comes in.
Minimize the Error (Residuals)!
The Least Squares Method finds the line for which the sum of the squared differences between each data point and the line is as small as possible. Each difference, measured as the actual value minus the value predicted by the line, is called a residual.
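Written in symbols, the idea looks like the sketch below. The notation is an assumption introduced here for illustration: a is the slope, b is the intercept, (x_i, y_i) is the i-th data point, r_i is its residual, and S is the sum of squared residuals.

```latex
r_i = y_i - (a x_i + b), \qquad S(a, b) = \sum_{i=1}^{n} r_i^2
```

The least squares line is the pair (a, b) that makes S(a, b) as small as possible.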
- Draw a Trial Line: First, pick a straight line with an arbitrary slope and intercept (for example, Sales = 0.5 * Temperature + 0).
- Calculate the Residuals: For each data point, subtract the sales predicted by the line from the actual sales.
  - Temperature 20°C: Actual Sales 10 units - Predicted Sales (0.5 * 20 + 0 = 10) = 0
  - Temperature 25°C: Actual Sales 15 units - Predicted Sales (0.5 * 25 + 0 = 12.5) = 2.5
  - Temperature 30°C: Actual Sales 20 units - Predicted Sales (0.5 * 30 + 0 = 15) = 5
  - Temperature 35°C: Actual Sales 25 units - Predicted Sales (0.5 * 35 + 0 = 17.5) = 7.5
- Calculate the Squared Residuals: Square each residual. Squaring treats positive and negative differences equally, so they cannot cancel each other out.
- Calculate the Sum of Squared Residuals: Add up all the squared residuals. The total is an indicator of how “bad” this line is.
  - 0 + 6.25 + 25 + 56.25 = 87.5
- Slightly Change the Line: Adjust the slope and intercept a little at a time, looking for the line that minimizes the sum of squared residuals.
  - For example, try a slope of 0.6 or an intercept of 2…
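The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the only way to do it; the function name sum_squared_residuals and the variable names are made up for this example.

```python
# Data from the ice cream table
temperatures = [20, 25, 30, 35]   # highest temperature (°C)
sales = [10, 15, 20, 25]          # ice cream sales (units)


def sum_squared_residuals(slope, intercept):
    """Return the sum of squared residuals for the line y = slope * x + intercept."""
    total = 0.0
    for x, y in zip(temperatures, sales):
        predicted = slope * x + intercept
        residual = y - predicted      # actual minus predicted
        total += residual ** 2        # squaring makes every term non-negative
    return total


# Trial line: Sales = 0.5 * Temperature + 0
trial_ssr = sum_squared_residuals(0.5, 0)
print(f"Trial line (slope 0.5, intercept 0): {trial_ssr}")  # 87.5, as in the hand calculation

# A slightly different line, to compare how "bad" it is
print(f"Steeper line (slope 0.6, intercept 0): {sum_squared_residuals(0.6, 0)}")
```

A smaller result means the line sits closer to the data, so comparing the two printed values shows which trial line is the better one.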
Calculation with the Least Squares Method
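When you actually calculate it for this data, the best-fitting line is Sales = 1 * Temperature - 10. In practice you do not need trial and error: a formula gives the slope and intercept that minimize the sum of squared residuals directly.
| Highest Temperature (°C) | Ice Cream Sales (Units) | Predicted Sales | Residual |
| --- | --- | --- | --- |
| 20 | 10 | 10 | 0 |
| 25 | 15 | 15 | 0 |
| 30 | 20 | 20 | 0 |
| 35 | 25 | 25 | 0 |
The sum of squared residuals for this line is 0, far below the 87.5 of the trial line. It cannot get any smaller, because these four sample points happen to lie exactly on a straight line. With real sales data, even the best line would leave some nonzero residuals.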
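The slope and intercept that minimize the sum of squared residuals can also be computed directly. The following is a minimal sketch using the standard closed-form formulas for a straight-line fit: the slope is the sum of products of deviations from the means divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. The variable names are illustrative only.

```python
# Data from the ice cream table
temperatures = [20, 25, 30, 35]   # x: highest temperature (°C)
sales = [10, 15, 20, 25]          # y: ice cream sales (units)

n = len(temperatures)
mean_x = sum(temperatures) / n
mean_y = sum(sales) / n

# Sum of products of deviations, and sum of squared deviations of x
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temperatures, sales))
s_xx = sum((x - mean_x) ** 2 for x in temperatures)

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(f"Slope: {slope}")
print(f"Intercept: {intercept}")
```

Running this on the four data points is a quick way to check the slope and intercept quoted for the best-fitting line.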
Summary
The Least Squares Method finds the best-fitting line by following these steps:
- Draw a trial line.
- Calculate the residuals.
- Square the residuals.
- Calculate the sum of squared residuals.
- Slightly change the line and find the line that minimizes the sum of squared residuals.
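To make the "slightly change the line" step concrete, here is a rough sketch that tries many candidate lines and keeps the one with the smallest sum of squared residuals. The search grid (slopes from 0.0 to 2.0 in steps of 0.1, intercepts from -20 to 20 in steps of 1) is an arbitrary choice for this example, and real tools compute the answer with a formula instead of searching.

```python
# Data from the ice cream table
temperatures = [20, 25, 30, 35]   # highest temperature (°C)
sales = [10, 15, 20, 25]          # ice cream sales (units)


def sum_squared_residuals(slope, intercept):
    """Sum of squared (actual - predicted) values for y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2
               for x, y in zip(temperatures, sales))


# Hypothetical search grid, chosen only for illustration
candidate_slopes = [i / 10 for i in range(21)]   # 0.0, 0.1, ..., 2.0
candidate_intercepts = list(range(-20, 21))      # -20, -19, ..., 20

best = None  # (ssr, slope, intercept) of the best line seen so far
for slope in candidate_slopes:
    for intercept in candidate_intercepts:
        ssr = sum_squared_residuals(slope, intercept)
        if best is None or ssr < best[0]:
            best = (ssr, slope, intercept)

best_ssr, best_slope, best_intercept = best
print(f"Best slope: {best_slope}, best intercept: {best_intercept}, SSR: {best_ssr}")
```

A grid like this can only find lines that happen to be on the grid, which is one reason the direct calculation is preferred in practice.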
Python Code: Example with Ice Cream Sales Data
import numpy as np
import matplotlib.pyplot as plt
# Data
x = np.array([20, 25, 30, 35]) # Highest temperature
y = np.array([10, 15, 20, 25]) # Ice cream sales
# Calculate the coefficients of the regression line using the Least Squares Method
# polyfit(x, y, 1) easily calculates this for us
# deg=1 means a linear equation (straight line)
coefficients = np.polyfit(x, y, 1)
a = coefficients[0] # Slope
b = coefficients[1] # Intercept
print(f"Slope: {a}")
print(f"Intercept: {b}")
# Create the regression line
y_predicted = a * x + b
# Plot the results
plt.scatter(x, y, label="Actual Data")
plt.plot(x, y_predicted, color='red', label=f"Regression Line: y = {a:.2f}x + {b:.2f}")
plt.xlabel("Highest Temperature (°C)")
plt.ylabel("Ice Cream Sales (Units)")
plt.title("Regression Analysis using the Least Squares Method")
plt.legend()
plt.grid(True)
plt.show()
# Calculate and check the residuals
residuals = y - y_predicted
print("\nResiduals:")
for i in range(len(x)):
    print(f"Temperature {x[i]}°C: {residuals[i]:.2f}")
# Calculate the sum of squared errors
sum_squared_errors = np.sum(residuals**2)
print(f"\nSum of Squared Errors: {sum_squared_errors:.2f}")