Cosine Similarity between text documents with example in Java
Cosine Similarity is an algorithm for comparing text documents without taking into account their sizes.
The algorithm is based on building a vector of word frequencies for each document. Each number in the vector represents how many times a given word occurs in the document. When I have, for example, two documents, I compare their vectors by calculating the cosine of the angle between them, the “cosine similarity”. Because word counts are never negative, this value is a number between 0 and 1, where 0 means the documents share no words and 1 means they have the same relative word frequencies, as identical documents do.
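To make the calculation concrete, here is a minimal sketch that computes the same quantity by hand with the JDK only: the dot product of the two word-count vectors divided by the product of their lengths. The class name CosineByHand and the sample sentences are made up for illustration, and this is not the library's implementation.

```java
import java.util.HashMap;
import java.util.Map;

class CosineByHand {

    // Count how many times each whitespace-separated word occurs.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a word-count vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // cos(A, B) = (A . B) / (|A| * |B|); returns 0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double lengths = norm(a) * norm(b);
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    public static void main(String[] args) {
        // Vectors: {red=2, apple=1, pear=1} and {red=1, apple=1}.
        Map<String, Integer> a = wordCounts("red apple red pear");
        Map<String, Integer> b = wordCounts("red apple");
        // 3 / (sqrt(6) * sqrt(2)) = sqrt(3) / 2, roughly 0.866.
        System.out.println(cosine(a, b));
    }
}
```

Note that repeating a document, for example "red apple" against "red apple red apple", only scales its vector, so the similarity stays at 1.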
In Java, this algorithm is implemented in Apache Commons Text library’s CosineSimilarity class. The usage of this class is as follows:
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.commons.text.similarity.CosineSimilarity;
/**
 *
 * @author sstanchev
 */
public class DBCosineSimilarity {

    static String textDelimiter = " ";
    static String documentA
            = "We have to choose some coffee. Which one is not important to me.";
    static String documentB
            = "We have to choose coffee by it's beans. Darker beans looks better to me.";

    public static void main(String[] args) {
        CosineSimilarity documentsSimilarity = new CosineSimilarity();

        // Build a word -> occurrence count map for each document.
        // Splitting on a single space keeps punctuation attached to words,
        // so "coffee." and "coffee" are counted as different words.
        Map<CharSequence, Integer> vectorA = Arrays.stream(documentA.split(textDelimiter))
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
        Map<CharSequence, Integer> vectorB = Arrays.stream(documentB.split(textDelimiter))
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));

        // Compare the two vectors and print the result with three decimals.
        Double docABCosSimilarity = documentsSimilarity.cosineSimilarity(vectorA, vectorB);
        System.out.printf("%4.3f\n", docABCosSimilarity);
    }
}
So what I am doing here is first creating an instance of the CosineSimilarity class. After that, the vectors for the two documents under comparison are created: vectorA for documentA and vectorB for documentB. Each vector is a map from a word to the number of times it occurs in the document. Finally, the cosine similarity is calculated and stored in the docABCosSimilarity variable.
A lot of experimentation can be done with this small piece of code, for example by changing how the documents are split into words, and it can reveal interesting results.
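One such experiment is to normalize the words before counting them. The following is only a sketch under my own assumptions: the class name NormalizedTokens and the splitting rule (lowercase everything, then split on anything that is not a letter, a digit or an apostrophe) are illustrative choices, not something the library prescribes. It uses the JDK only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class NormalizedTokens {

    // Lowercase the document, split on runs of characters that are not
    // letters, digits or apostrophes, and count the remaining words.
    static Map<CharSequence, Integer> wordCounts(String document) {
        Map<CharSequence, Integer> counts = new HashMap<>();
        String lower = document.toLowerCase(Locale.ROOT);
        for (String word : lower.split("[^\\p{L}\\p{Nd}']+")) {
            if (!word.isEmpty()) { // split can yield a leading empty string
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "coffee." and "Coffee" now land in the same bucket as "coffee".
        System.out.println(wordCounts("We have to choose some coffee. Coffee is important."));
    }
}
```

The returned map has the Map<CharSequence, Integer> type that the cosineSimilarity call in the example above takes, so it could be used in place of vectorA or vectorB to see how the result changes.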
To finish the technical part, I have to mention that for this code to compile, I have to add the following dependency to the project’s Maven pom.xml:
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.6</version>
</dependency>
A reasonable question follows: where could this be used? As a numerical metric, Cosine Similarity can be used in document classification and clustering algorithms. The advantage of this algorithm is that it does not take document size into account, so two documents of very different sizes can still be considered similar. It can also be used in recommendation systems, for example in publishing, where we could measure the similarity between two books by calculating the Cosine Similarity of their short descriptions.
In the domain of medical software, where OSI has a good amount of expertise, such an algorithm could be used in medical experience sharing systems, as stated here.
It could also be applied as a similarity measure for biomedical data, as shown here:
https://www.sciencedirect.com/science/article/pii/S1532046413000889
In chatbot development, a query entered by the user could be converted into a vectorized form. All the sentences in the corpus could also be represented by their corresponding vectorized forms, and the sentence with the highest cosine similarity to the user input could be selected as the response.
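The selection step described above can be sketched as follows. This is a minimal illustration using the JDK only, assuming plain word counts as the vectorized form. The class name BestMatchSketch, the method names and the three corpus sentences are hypothetical, and a real chatbot would likely need better text normalization and weighting.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class BestMatchSketch {

    // Lowercased word counts; anything that is not a letter or digit separates words.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a word-count vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Dot product divided by the product of the vector lengths.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double lengths = norm(a) * norm(b);
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    // Returns the corpus sentence most similar to the query, or null for an empty corpus.
    static String bestMatch(String query, List<String> corpus) {
        Map<String, Integer> queryVector = wordCounts(query);
        String best = null;
        double bestScore = -1.0;
        for (String sentence : corpus) {
            double score = cosine(queryVector, wordCounts(sentence));
            if (score > bestScore) {
                bestScore = score;
                best = sentence;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "Our office opens at nine",
                "Coffee beans are roasted weekly",
                "Parking is free for visitors");
        // Only the second sentence shares words with the query.
        System.out.println(bestMatch("When are the coffee beans roasted?", corpus));
    }
}
```

When several sentences tie, this sketch keeps the first one it encounters. Recomputing the corpus vectors for every query is wasteful, so a larger system would probably build them once and reuse them.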