Levenshtein and hash: finding a hashing algorithm that preserves similarity (closer distance)

Posted by debugcn in Dev

c1377554

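I am looking for a hash-like algorithm that does not need to provide any security, but instead gives each string a fixed-length, distinct pattern, so that approximately similar strings can be matched using the Levenshtein distance or any other distance metric.

Suppose I have two strings, "hello/friend/my?" and "hello/friend/my". I compute the Levenshtein distance in Python both without and with hashing: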

>>> import Levenshtein as lev
>>> Str1 = "hello/friend/my?"
>>> Str2 = "hello/friend/my"
>>> Distance = lev.distance(Str1.lower(), Str2.lower())
>>> print(Distance)
1
>>> Ratio = lev.ratio(Str1.lower(), Str2.lower())
>>> print(Ratio)
0.967741935483871

>>> # str hashes are salted per process by default, so these values vary between runs
>>> Str1hash = hash(Str1)
>>> Str2hash = hash(Str2)
>>> Distance = lev.distance(str(Str1hash), str(Str2hash))
>>> print(Distance)
16
>>> Ratio = lev.ratio(str(Str1hash), str(Str2hash))
>>> print(Ratio)
0.41025641025641024

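As you can see, the values computed without hashing show a much closer distance (1), while the distance between the hashes is far too large (16).

I would like to find a hash-like function or algorithm that returns a closer distance and a higher ratio for similar strings. Any pointers?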
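
The large distance is expected rather than a quirk of this example: general-purpose hash functions are built to scatter similar inputs, so a one-character change alters the whole output. The sketch below illustrates this with SHA-256 from hashlib as a stand-in. That choice is an assumption made for reproducibility, since the built-in hash() used above gives different string hashes in each process by default. The helper name bit_difference is hypothetical.

```python
import hashlib

def bit_difference(a, b):
    """Count the differing bits between the SHA-256 digests of two strings."""
    da = int.from_bytes(hashlib.sha256(a.encode("utf-8")).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b.encode("utf-8")).digest(), "big")
    # XOR leaves a 1 wherever the two digests disagree.
    return bin(da ^ db).count("1")

# The two strings differ by a single trailing character.
diff_bits = bit_difference("hello/friend/my?", "hello/friend/my")
print(f"{diff_bits} of 256 digest bits differ")
```

For a well-mixed hash, roughly half of the digest bits should differ, which is why no distance metric applied to such hashes can recover the similarity of the inputs.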

c1377554

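The solution I was looking for is LSH: https://en.wikipedia.org/wiki/Locality-sensitive_hashing

It solves the problem I described. It is a technique used in information retrieval to find duplicate documents or web pages, so I can use it to compare my two strings and obtain a similarity index for them.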
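
As a rough illustration of the idea, here is a minimal sketch of MinHash, one common LSH family. It is not taken from the answer itself, and the function names, the 3-character shingle size, the 128 hash functions, and the use of hashlib.blake2b as the salted hash are all assumptions chosen for the example. Each string becomes a fixed-length signature, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the strings' shingle sets.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of overlapping k-character substrings of text."""
    text = text.lower()
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def salted_hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle, salted with an integer seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(text, num_hashes=128, k=3):
    """For each seed, keep the minimum salted hash over all shingles."""
    grams = shingles(text, k)
    return [min(salted_hash(seed, g) for g in grams) for seed in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions: an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)

sig1 = minhash_signature("hello/friend/my?")
sig2 = minhash_signature("hello/friend/my")
sig3 = minhash_signature("a completely different string")

print("exact Jaccard:", jaccard(shingles("hello/friend/my?"), shingles("hello/friend/my")))
print("estimate, similar strings:", estimated_similarity(sig1, sig2))
print("estimate, unrelated strings:", estimated_similarity(sig1, sig3))
```

Note that the signatures are compared position by position rather than with the Levenshtein distance. A full LSH index would typically go one step further and group signature positions into bands, so that candidate pairs can be found without comparing every pair of strings.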
