Cosine similarity measures how similar two vectors are by taking the cosine of the angle between them. It is commonly used to compare high-dimensional data, such as text represented as vectors.
Features:
- Independent of vector magnitude: It doesn’t consider the length of the vectors; it focuses on their direction. Vectors pointing in the same direction will have a high cosine similarity regardless of their lengths.
- Values range from -1 to 1:
  - 1: The vectors point in exactly the same direction (most similar).
  - 0: The vectors are orthogonal (unrelated).
  - -1: The vectors point in opposite directions (most dissimilar).
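The three landmark values above can be checked with a small sketch. It uses only the standard library; the helper name `cosine` and the vector `v` are illustrative assumptions, not part of the examples later in this article.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length sequences (plain-Python sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [1.0, 2.0]  # illustrative vector chosen for this sketch

same_direction = cosine(v, [10.0, 20.0])  # v scaled by 10: length changes, direction does not
orthogonal = cosine(v, [-2.0, 1.0])       # perpendicular to v
opposite = cosine(v, [-1.0, -2.0])        # v flipped

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three results should come out as 1, 0, and -1. Scaling `v` by 10 leaves the similarity unchanged, which is the magnitude independence described in the first bullet.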
Example:
Let’s say we have two users, A and B, whose movie ratings per genre are represented as vectors.
- User A: (5, 3, 0, 1) (Action: 5 stars, Comedy: 3 stars, Drama: 0 stars, Sci-Fi: 1 star)
- User B: (4, 2, 0, 0) (Action: 4 stars, Comedy: 2 stars, Drama: 0 stars, Sci-Fi: 0 stars)
In this case, users A and B have similar ratings for Action and Comedy, so the cosine similarity between them is high (about 0.98). On the other hand, a User C with ratings of (0, 0, 5, 4) gives Action and Comedy zero stars. The only genre that both A and C rate above zero is Sci-Fi, so their similarity is low (about 0.11). Users B and C share no genre with a nonzero rating, so their similarity is exactly 0.
Cosine Similarity Formula
Given two vectors a and b, cosine similarity is calculated as follows:
cos(θ) = (a · b) / (||a|| * ||b||)
- a · b: The dot product of vectors a and b.
- ||a||: The magnitude (norm) of vector a.
- ||b||: The magnitude (norm) of vector b.
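To make each term of the formula concrete, here is a step-by-step sketch using two small hypothetical vectors, (3, 4) and (4, 3), chosen only because their norms are whole numbers. The variable names are assumptions for illustration.

```python
import math

a = [3.0, 4.0]  # hypothetical vectors, picked so the arithmetic is easy to follow
b = [4.0, 3.0]

dot_ab = sum(x * y for x, y in zip(a, b))   # a · b   = 3*4 + 4*3 = 24
norm_a = math.sqrt(sum(x * x for x in a))   # ||a||   = sqrt(9 + 16) = 5
norm_b = math.sqrt(sum(y * y for y in b))   # ||b||   = sqrt(16 + 9) = 5
cos_theta = dot_ab / (norm_a * norm_b)      # cos(θ)  = 24 / 25 = 0.96

# Recover the angle itself, in degrees, from its cosine.
theta_degrees = math.degrees(math.acos(cos_theta))

print(dot_ab, norm_a, norm_b, cos_theta, theta_degrees)
```

A cosine of 0.96 corresponds to an angle of roughly 16 degrees, so the two vectors point in nearly the same direction.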
Python Code Example (Using NumPy)
import numpy as np

def cosine_similarity(a, b):
    """
    Calculates the cosine similarity between two vectors.

    Args:
        a: A 1D NumPy array or list.
        b: A 1D NumPy array or list.

    Returns:
        The cosine similarity (value between -1 and 1).
    """
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Example
user_a = [5, 3, 0, 1]
user_b = [4, 2, 0, 0]
user_c = [0, 0, 5, 4]
similarity_ab = cosine_similarity(user_a, user_b)
similarity_ac = cosine_similarity(user_a, user_c)
print(f"Cosine similarity between User A and B: {similarity_ab}")  # approximately 0.9827
print(f"Cosine similarity between User A and C: {similarity_ac}")  # approximately 0.1056
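One edge case the function above does not handle is a vector whose norm is zero, for example a user who has not rated anything. The denominator is then zero, and in NumPy the division typically yields `nan` with a runtime warning. A minimal guarded sketch follows; the name `safe_cosine_similarity`, the `eps` threshold, and the choice to return 0.0 are all assumptions made for illustration, and raising an error would be an equally reasonable convention.

```python
import numpy as np

def safe_cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity that returns 0.0 when either vector has (near-)zero norm.

    Returning 0.0 is a convention assumed for this sketch, not a standard.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom < eps:
        # Direction is undefined for a zero vector, so fall back to 0.0.
        return 0.0
    return float(np.dot(a, b) / denom)

no_ratings = [0, 0, 0, 0]  # hypothetical user with no ratings
zero_case = safe_cosine_similarity([5, 3, 0, 1], no_ratings)
normal_case = safe_cosine_similarity([1, 2], [2, 4])

print(zero_case)    # 0.0
print(normal_case)  # close to 1, since the vectors are parallel
```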
Using scikit-learn Library Example
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Example
user_a = [5, 3, 0, 1]
user_b = [4, 2, 0, 0]
user_c = [0, 0, 5, 4]
# cosine_similarity expects 2D input: one row per user, one column per genre
data = np.array([user_a, user_b, user_c])
similarity_matrix = cosine_similarity(data)
print(f"Cosine similarity matrix:\n{similarity_matrix}")
# Output (values rounded to four decimals):
# [[1.     0.9827 0.1056]
#  [0.9827 1.     0.    ]
#  [0.1056 0.     1.    ]]
print(f"Cosine similarity between User A and B: {similarity_matrix[0, 1]}")  # approximately 0.9827
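A pairwise similarity matrix like this is often used to find each user's nearest neighbour. The sketch below builds the matrix with NumPy alone (normalizing each row to unit length and taking a matrix product), then picks the most similar other user per row. The names `ratings`, `unit`, `sim`, and `nearest` are assumptions for illustration; this is one simple way to do it, not the only one.

```python
import numpy as np

ratings = np.array([
    [5, 3, 0, 1],  # User A
    [4, 2, 0, 0],  # User B
    [0, 0, 5, 4],  # User C
], dtype=float)
names = ["A", "B", "C"]

# Scale every row to unit length; the product of unit rows is then
# the full matrix of pairwise cosine similarities.
unit = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
sim = unit @ unit.T

# Mask the diagonal so that a user is never matched with themselves.
np.fill_diagonal(sim, -np.inf)
nearest = sim.argmax(axis=1)

for name, idx in zip(names, nearest):
    print(f"User {name} is most similar to User {names[idx]}")
# User A is most similar to User B
# User B is most similar to User A
# User C is most similar to User A
```

Note that User C's best match, User A, still has a low similarity. A nearest neighbour always exists, so in practice a minimum-similarity threshold may be worth applying before acting on it.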
Cosine similarity is a powerful tool for measuring the similarity of high-dimensional data. It’s widely used in various fields such as text analysis, image processing, and music recommendation. You can easily calculate it using NumPy or the scikit-learn library.