I want to filter irrelevant words out of every phrase in an incoming stream of tweets.
I can do this with an ArrayList, like so:
import java.util.ArrayList;
// Example Tweet
String tweetText = "Awful glad vaccine is coming at last! #COVID19";
// First convert tweet text to array of words
String text = tweetText
        .replaceAll("\\p{Punct}", "")
        .replaceAll("\\r|\\n", "")
        .toLowerCase();
String[] words = text.split(" ");
// We define an array of irrelevant words to be filtered out
String[] irrelevantWords = {"is", "at", "http", "https", "football"};
// first we create an extensible ArrayList to add filtered words to
ArrayList<String> filteredWords = new ArrayList<String>();
// we assume each word is relevant to begin with...
boolean relevant;
// ... and then we check by iterating over each word...
for (String w : words) {
    // ... assuming initially that it is relevant ...
    relevant = true;
    // ... and iterating over each irrelevant word ...
    for (String irrelevant : irrelevantWords) {
        // ... and if a word is the same as an irrelevant word
        if (w.equals(irrelevant)) {
            // ... we know that it is not relevant.
            relevant = false;
        }
    }
    // If, having compared the word to all the irrelevant words,
    // it is still found to be relevant, we add it to our ArrayList.
    if (relevant) {
        filteredWords.add(w);
    }
}
// NB: This is not the most efficient method of filtering words,
// but it is the simplest to understand and implement.
System.out.println(filteredWords);
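However, although this is easy for a Java beginner to understand and implement (it relies on nothing more than loop iteration, apart from having to import ArrayList), it is inefficient: every word is compared against every irrelevant word.
What is the best way to do this (either the simplest or a more efficient one)?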
Use a HashSet to store the irrelevant words.
Set<String> irrelevantWords = new HashSet<String>();
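Add the words to this set and use irrelevantWords.contains(word) to check whether a word is irrelevant.
A lookup in a HashSet is O(1) on average, versus O(n) for a list or array. Since you do the lookup inside a loop, this can improve performance considerably as the list of irrelevant words grows.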
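As a rough sketch of how the HashSet suggestion could be applied to the code in the question: the class name TweetFilter and the helper filterWords are made up for illustration, and the text normalisation is copied from the question unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical wrapper class; the name is not from the original post.
public class TweetFilter {

    // Same irrelevant words as in the question, held in a HashSet
    // so that each contains() call is O(1) on average.
    private static final Set<String> IRRELEVANT_WORDS =
            new HashSet<>(Arrays.asList("is", "at", "http", "https", "football"));

    // Hypothetical helper: normalises the tweet as the question does,
    // then keeps only the words that are not in the set.
    public static List<String> filterWords(String tweetText) {
        String text = tweetText
                .replaceAll("\\p{Punct}", "")
                .replaceAll("\\r|\\n", "")
                .toLowerCase();

        List<String> filteredWords = new ArrayList<>();
        for (String w : text.split(" ")) {
            // One set lookup replaces the inner loop over irrelevantWords.
            if (!IRRELEVANT_WORDS.contains(w)) {
                filteredWords.add(w);
            }
        }
        return filteredWords;
    }

    public static void main(String[] args) {
        // Prints [awful, glad, vaccine, coming, last, covid19]
        System.out.println(filterWords("Awful glad vaccine is coming at last! #COVID19"));
    }
}
```

With only five irrelevant words the difference is unlikely to be measurable; the set should start to pay off once the list of irrelevant words is long or the tweet stream is large.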