Measuring text similarity is a common task in modern development. Whether you are working on search engine optimization, data deduplication, or content recommendation systems, knowing how to calculate text similarity in PHP is a useful skill. This tutorial walks through the implementation of three concise text similarity functions.
Text similarity measures how alike two pieces of text are. A high score means the texts are closely related; a low score indicates significant differences.
There are three common algorithms for calculating text similarity:
Edit Distance, also known as Levenshtein distance, calculates the minimum number of edits required to transform one string into another. Edit operations include inserting, deleting, and replacing characters.
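Cosine Similarity measures the similarity between two vectors. Each text is represented as a vector of term frequencies, and the cosine of the angle between the two vectors is the similarity score: 1 means the texts use the same words in the same proportions, 0 means they share no words.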
Jaccard Similarity measures the overlap between two sets (here, the sets of words in each text) as the size of their intersection divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|.
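To make the three definitions concrete, here is a small worked sketch on toy inputs. It uses PHP's built-in levenshtein() for the edit distance and computes the Jaccard and cosine scores by hand for two three-word sets. The sample words and variable names are illustrative assumptions, not part of the implementations in this tutorial.

```php
<?php
// Edit distance: "kitten" -> "sitting" needs 3 edits
// (k->s, e->i, append g). levenshtein() is a PHP built-in.
$distance = levenshtein("kitten", "sitting");

// Two toy word sets (illustrative only).
$a = ["php", "is", "fast"];
$b = ["php", "is", "popular"];

// Jaccard: |A ∩ B| / |A ∪ B| = 2 / 4
$intersection = count(array_intersect($a, $b));
$union = count(array_unique(array_merge($a, $b)));
$jaccard = $intersection / $union;

// Cosine: every word occurs exactly once, so the dot product equals the
// number of shared words (2) and each vector has length sqrt(3).
$cosine = $intersection / (sqrt(count($a)) * sqrt(count($b)));

printf("%d %.2f %.2f\n", $distance, $jaccard, $cosine); // prints "3 0.50 0.67"
```

Note that the edit distance is a count where lower means more similar, while the Jaccard and cosine scores fall between 0 and 1 where higher means more similar.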
Before we start, make sure PHP is installed in your development environment. The functions in this tutorial rely only on PHP's built-in string and array functions, so no additional packages are required. You can test locally or use an online editor. PHP 7.0 or later is recommended.
Next, we'll implement the three text similarity algorithms mentioned above, starting with Edit Distance:
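<?php
// Computes the Levenshtein (edit) distance between two strings.
// Note: strlen() and string offsets work on bytes, so a multibyte
// (e.g. UTF-8) character counts as more than one position.
function levenshteinDistance($str1, $str2) {
    $len1 = strlen($str1);
    $len2 = strlen($str2);
    $matrix = array();

    // Base cases: converting a prefix to or from the empty string.
    for ($i = 0; $i <= $len1; $i++) {
        $matrix[$i][0] = $i;
    }
    for ($j = 0; $j <= $len2; $j++) {
        $matrix[0][$j] = $j;
    }

    for ($i = 1; $i <= $len1; $i++) {
        for ($j = 1; $j <= $len2; $j++) {
            $cost = ($str1[$i - 1] === $str2[$j - 1]) ? 0 : 1;
            $matrix[$i][$j] = min(
                $matrix[$i - 1][$j] + 1,         // deletion
                $matrix[$i][$j - 1] + 1,         // insertion
                $matrix[$i - 1][$j - 1] + $cost  // substitution (or match)
            );
        }
    }

    return $matrix[$len1][$len2];
}
?>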
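A raw edit distance depends on string length, which makes it hard to compare against the other two scores. One common option is to scale it into the range 0 to 1. The sketch below assumes a helper named levenshteinSimilarity() (a hypothetical name, not part of the original tutorial) and uses PHP's built-in levenshtein() so that it runs on its own.

```php
<?php
// Hypothetical helper: maps an edit distance to a score in [0, 1],
// where 1.0 means the strings are identical.
function levenshteinSimilarity(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 1.0; // two empty strings are treated as identical
    }
    // levenshtein() is PHP's built-in, byte-based implementation.
    return 1.0 - levenshtein($a, $b) / $maxLen;
}

printf("%.3f\n", levenshteinSimilarity("kitten", "sitting")); // 1 - 3/7, prints "0.571"
```

The built-in is usually faster than a userland matrix, but in PHP versions before 8.0 it only accepts strings of up to 255 characters, which is one reason a hand-written version can still be useful for longer texts.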
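<?php
// Computes the cosine similarity of two texts from their word counts.
// Word matching is case-sensitive ("PHP" and "php" are different words).
function cosineSimilarity($str1, $str2) {
    // Term-frequency vectors: word => number of occurrences.
    $vector1 = array_count_values(str_word_count($str1, 1));
    $vector2 = array_count_values(str_word_count($str2, 1));

    // Dot product: only words present in both texts contribute.
    $intersection = array_intersect_key($vector1, $vector2);
    $numerator = 0;
    foreach ($intersection as $word => $count) {
        $numerator += $count * $vector2[$word];
    }

    // Product of the two vector magnitudes.
    $sum1 = array_sum(array_map(function($v) { return $v * $v; }, $vector1));
    $sum2 = array_sum(array_map(function($v) { return $v * $v; }, $vector2));
    $denominator = sqrt($sum1) * sqrt($sum2);

    // Guard against division by zero when a text contains no words.
    return ($denominator == 0) ? 0 : $numerator / $denominator;
}
?>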
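<?php
// Computes the Jaccard similarity of the two texts' word sets.
function jaccardSimilarity($str1, $str2) {
    // Split on whitespace and drop empty tokens, so that an empty
    // string yields an empty set rather than a set holding "".
    $words1 = array_unique(preg_split('/\s+/', $str1, -1, PREG_SPLIT_NO_EMPTY));
    $words2 = array_unique(preg_split('/\s+/', $str2, -1, PREG_SPLIT_NO_EMPTY));

    $intersection = count(array_intersect($words1, $words2));
    $union = count(array_unique(array_merge($words1, $words2)));

    // Guard against division by zero when both texts are empty.
    return ($union == 0) ? 0 : $intersection / $union;
}
?>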
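Because the Jaccard function above only splits on whitespace, tokens such as "language." and "language", or "PHP" and "php", count as different words. A possible remedy is to normalize the text before comparing. The sketch below assumes a helper named tokenize() (hypothetical, not part of the original tutorial) that lowercases the input and keeps only ASCII letters and digits, so hyphenated words are split into their parts.

```php
<?php
// Hypothetical helper: lowercases the text and splits it on any run of
// characters that is not an ASCII letter or digit.
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

$tokens1 = array_unique(tokenize("PHP is a language."));
$tokens2 = array_unique(tokenize("php is a popular language"));

$shared = count(array_intersect($tokens1, $tokens2));
$total = count(array_unique(array_merge($tokens1, $tokens2)));

echo $shared . "/" . $total . "\n"; // prints "4/5"
```

This version is ASCII-only by design; text in other scripts would need a Unicode-aware pattern and a multibyte lowercase function instead.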
Next, we'll demonstrate how to apply these functions. Suppose we have the following two text segments:
$text1 = "PHP is a widely-used open-source scripting language.";
$text2 = "PHP is a popular open-source scripting programming language.";
We can use these three functions to compute their similarities:
echo "Edit Distance: " . levenshteinDistance($text1, $text2) . " <br>";
echo "Cosine Similarity: " . cosineSimilarity($text1, $text2) . " <br>";
echo "Jaccard Similarity: " . jaccardSimilarity($text1, $text2) . " <br>";
This article showed how to write text similarity functions in PHP, covering edit distance, cosine similarity, and the Jaccard similarity coefficient. We hope it helps you apply text similarity in your own projects.
In the future, we can explore more text processing techniques and apply them to more complex projects. Thank you for reading, and we look forward to your feedback!