Cosine Similarity Explained (Basic Algorithms - Machine Learning)

Cosine similarity is a metric that measures how similar two vectors are by taking the cosine of the angle between them. It’s commonly used to compare high-dimensional data, such as text documents represented as vectors.

Features:

Independent of vector magnitude: It doesn’t consider the length of the vectors; it focuses on their direction. Vectors pointing in the same direction will have a high cosine similarity regardless of their lengths.
Values range from -1 to 1 (from 0 to 1 when all components are non-negative, as with star ratings):
- 1: The vectors point in exactly the same direction (most similar).
- 0: The vectors are orthogonal (unrelated).
- -1: The vectors point in opposite directions (most dissimilar).
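
The properties listed above can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the helper name `cosine` and the sample vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [1, 2, 3]
same_direction = cosine(v, [10, 20, 30])  # v scaled by 10: length changes, direction does not
orthogonal = cosine([1, 0], [0, 1])       # perpendicular vectors
opposite = cosine(v, [-1, -2, -3])        # v reversed

print(round(same_direction, 6))  # 1.0
print(round(orthogonal, 6))      # 0.0
print(round(opposite, 6))        # -1.0
```

The results are rounded before printing because floating-point arithmetic can leave the raw value a tiny fraction away from exactly 1 or -1.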

Example:

Let’s say we have two users, A and B, whose ratings for four movie genres are stored as vectors.

User A: (5, 3, 0, 1) (Action: 5 stars, Comedy: 3 stars, Drama: 0 stars, Sci-Fi: 1 star)
User B: (4, 2, 0, 0) (Action: 4 stars, Comedy: 2 stars, Drama: 0 stars, Sci-Fi: 0 stars)

In this case, users A and B rate Action and Comedy similarly, so the cosine similarity between them is high (about 0.98). On the other hand, a User C with ratings of (0, 0, 5, 4) prefers Drama and Sci-Fi. User C shares only a Sci-Fi rating with User A and has no rated genre in common with User B, so the similarity is low for A and C (about 0.11) and exactly 0 for B and C.

Cosine Similarity Formula

Given two vectors a and b, cosine similarity is calculated as follows:

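cos(θ) = (a · b) / (||a|| * ||b||)

a · b: The dot product of vectors a and b (the sum of the products of their corresponding components).
||a||: The magnitude (Euclidean norm) of vector a, i.e. the square root of the sum of its squared components.
||b||: The magnitude (Euclidean norm) of vector b.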
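
To make the formula concrete, here is the calculation for users A and B written out one term at a time. It is a sketch in plain Python (standard library only); the variable names are chosen for this illustration.

```python
import math

a = [5, 3, 0, 1]  # User A
b = [4, 2, 0, 0]  # User B

dot = sum(x * y for x, y in zip(a, b))     # 5*4 + 3*2 + 0*0 + 1*0 = 26
norm_a = math.sqrt(sum(x * x for x in a))  # sqrt(25 + 9 + 0 + 1) = sqrt(35)
norm_b = math.sqrt(sum(y * y for y in b))  # sqrt(16 + 4 + 0 + 0) = sqrt(20)
cos_theta = dot / (norm_a * norm_b)        # 26 / sqrt(700)

print(dot)                  # 26
print(round(cos_theta, 4))  # 0.9827
```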

Python Code Example (Using NumPy)

import numpy as np

def cosine_similarity(a, b):
  """
  Calculates the cosine similarity between two vectors.

  Args:
    a: A 1D NumPy array or list.
    b: A 1D NumPy array or list.

  Returns:
    The cosine similarity (value between -1 and 1; between 0 and 1 for non-negative vectors such as ratings).
  """
  a = np.array(a)
  b = np.array(b)
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example
user_a = [5, 3, 0, 1]
user_b = [4, 2, 0, 0]
user_c = [0, 0, 5, 4]

similarity_ab = cosine_similarity(user_a, user_b)
similarity_ac = cosine_similarity(user_a, user_c)

print(f"Cosine similarity between User A and B: {similarity_ab}")  # Output: approximately 0.9827
print(f"Cosine similarity between User A and C: {similarity_ac}")  # Output: approximately 0.1056
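
One edge case the formula leaves undefined is an all-zero vector, for example a user who has not rated anything yet: the norm is 0, so the division does not yield a usable score. Below is a minimal sketch of a guarded version in plain Python. The function name `safe_cosine_similarity` and the choice to return 0.0 are assumptions made for this example, not a standard convention.

```python
import math

def safe_cosine_similarity(a, b):
    """Cosine similarity that returns 0.0 if either vector has zero length.

    Returning 0.0 is an assumed convention for this sketch; mathematically
    the similarity is undefined when a vector has no direction.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

user_a = [5, 3, 0, 1]
new_user = [0, 0, 0, 0]  # hypothetical user with no ratings yet

print(safe_cosine_similarity(user_a, new_user))  # 0.0
```

The same check can be added to a NumPy version by testing both norms before dividing.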

Python Code Example (Using scikit-learn)

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Example
user_a = [5, 3, 0, 1]
user_b = [4, 2, 0, 0]
user_c = [0, 0, 5, 4]

# cosine_similarity expects a 2D array with one row per sample (here, one row per user)
data = np.array([user_a, user_b, user_c])

similarity_matrix = cosine_similarity(data)

print(f"Cosine similarity matrix:\n{similarity_matrix}")
# Approximate output (values rounded to 4 decimals):
# [[1.     0.9827 0.1056]
#  [0.9827 1.     0.    ]
#  [0.1056 0.     1.    ]]

print(f"Cosine similarity between User A and B: {similarity_matrix[0, 1]}")  # Output: approximately 0.9827
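
To see what the similarity matrix contains, the same pairwise table can be built by hand. This is a sketch in plain Python, not how scikit-learn computes it internally; the `cosine` helper and the `users` dictionary are names invented for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

users = {
    "A": [5, 3, 0, 1],
    "B": [4, 2, 0, 0],
    "C": [0, 0, 5, 4],
}
names = list(users)

# matrix[i][j] holds the similarity between user i and user j
matrix = [[cosine(users[r], users[c]) for c in names] for r in names]

for name, row in zip(names, matrix):
    print(name, [round(value, 4) for value in row])
# A [1.0, 0.9827, 0.1056]
# B [0.9827, 1.0, 0.0]
# C [0.1056, 0.0, 1.0]
```

The matrix is symmetric with 1s on the diagonal, because every vector has similarity 1 with itself and the similarity of A to B equals that of B to A.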

Cosine similarity is a simple, scale-independent way to measure the similarity of high-dimensional data. It’s widely used in fields such as text analysis, image processing, and music recommendation, and it is easy to calculate with NumPy or the scikit-learn library.