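I have the following code, which reads text files from a directory. I use a stopword list, and after removing the stopwords from the files, when I print the remaining words with their positions, an extra blank entry appears wherever a stopword used to be in the document.
For example, a file that reads:
Department of Computer Science // A document
After removing the stopword 'of' and looping through the document, I get the following output:
Department(0) (1) Computer(2) Science(3) // output
But the blank entry should not be there.
Here is the code: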
<?php
$directory = "archive/";
$dir = opendir($directory);
while (($file = readdir($dir)) !== false) {
    $filename = $directory . $file;
    $type = filetype($filename);
    if ($type == 'file') {
        $contents = file_get_contents($filename);
        $texts = preg_replace('/\s+/', ' ', $contents);
        $texts = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $texts);
        $text = explode(" ", $texts);
        $text = array_map('strtolower', $text);
        $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or", " ");
        $text = (array_diff($text,$stopwords));
        echo "<br><br>";
        $total_count = count($text);
        $b = -1;
        foreach ($text as $a=>$v)
        {
            $b++;
            echo $text[$b]. "(" .$b. ")" ." ";
        }
    }
}
closedir($dir);
?>
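I'm not 100% sure about the final output with the string positions, so I'm assuming you put them there for reference only. This test code, which uses a regex with preg_replace, seems to work well.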
<?php
header('Content-Type: text/plain; charset=utf-8');

// Set test content array.
$contents_array = array();
$contents_array[] = "Department of Computer Science // A document";
$contents_array[] = "Department of Economics // A document";

// Set the stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or");

// Set a regex based on the stopwords. The \b on both sides makes sure only
// whole words match; with a trailing \b alone, "a" would also strip the
// last letter of a word such as "data".
$regex = '/\b(?:' . implode('|', $stopwords) . ')\b/i';

foreach ($contents_array as $contents) {
    // Remove the stopwords.
    $contents = preg_replace($regex, '', $contents);
    // Clear out the extra whitespace; anything 2 spaces or more in a row.
    $contents = preg_replace('/\s{2,}/', ' ', $contents);
    // Echo contents.
    echo $contents . "\n";
}
The output is cleaned up and formatted like this:
Department Computer Science // document
Department Economics // document
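One detail worth checking in any stopword pattern is where the \b word boundaries sit. The sketch below is only an illustration: the helper names build_stopword_regex and strip_with, the shortened stopword list, and the sample string are all hypothetical and not part of the original code. It compares a pattern with a boundary only after each word against one anchored on both sides.

```php
<?php
// Hypothetical helper: builds a stopword regex anchored on both sides.
function build_stopword_regex(array $stopwords)
{
    // Quote each word in case the list ever contains regex metacharacters.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopwords);

    return '/\b(?:' . implode('|', $quoted) . ')\b/i';
}

// Hypothetical helper: removes matches, then collapses leftover whitespace.
function strip_with($regex, $text)
{
    $text = preg_replace($regex, '', $text);
    return trim(preg_replace('/\s{2,}/', ' ', $text));
}

// Shortened list, for illustration only.
$stopwords = array("a", "an", "of", "the");

// Boundary only after each word: "a\b" also matches the final "a" of "Data".
$trailing_only = '/(' . implode('\b|', $stopwords) . '\b)/i';
// Boundaries on both sides: only whole words match.
$both_sides = build_stopword_regex($stopwords);

$sample = "Data of the Panda";

echo strip_with($trailing_only, $sample) . "\n"; // Dat Pand
echo strip_with($both_sides, $sample) . "\n";    // Data Panda
```

The two test documents above do not contain a word that ends in a stopword, so both patterns give the same result there. The difference only shows up on text such as the sample string.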
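So, to integrate it into your code, you should do the following. Note how I moved $stopwords and $regex outside of the while loop, since it makes no sense to reset those values on every iteration of the while loop. Set them once outside the loop, and let the code inside the loop focus only on what is needed there: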
<?php
$directory = "archive/";
$dir = opendir($directory);

// Set the stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or");

// Set a regex based on the stopwords. The \b on both sides makes sure only
// whole words match, not the tail end of a longer word.
$regex = '/\b(?:' . implode('|', $stopwords) . ')\b/i';

while (($file = readdir($dir)) !== false) {
    $filename = $directory . $file;
    $type = filetype($filename);
    if ($type == 'file') {
        // Get the contents of the filename.
        $contents = file_get_contents($filename);
        // Remove the stopwords.
        $contents = preg_replace($regex, '', $contents);
        // Clear out the extra whitespace; anything 2 spaces or more in a row.
        $contents = preg_replace('/\s{2,}/', ' ', $contents);
        // Echo contents.
        echo $contents;
    }
}
closedir($dir);
?>
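If the word positions from the question are still wanted, the explode/array_diff approach can probably be kept with one change. array_diff() preserves the keys of the first array, so after the stopwords are removed the keys have gaps, and indexing with a separate counter hits a missing key. That would explain the blank entry. The sketch below is a minimal illustration under that assumption. The helper name tokens_without_stopwords and the shortened stopword list are hypothetical.

```php
<?php
// Hypothetical helper: returns the non-stopword tokens, reindexed from 0.
function tokens_without_stopwords($contents, array $stopwords)
{
    // Keep letters, digits, hyphens and whitespace only, then lowercase.
    $clean = strtolower(preg_replace('/[^A-Za-z0-9\-\s]/', '', $contents));
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty tokens.
    $tokens = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    // array_diff() preserves the original keys, so reindex with array_values().
    return array_values(array_diff($tokens, $stopwords));
}

// Shortened list, for illustration only.
$stopwords = array("a", "an", "of", "the");
$tokens = tokens_without_stopwords("Department of Computer Science", $stopwords);

// Use the array's own keys as positions instead of a separate counter.
foreach ($tokens as $position => $word) {
    echo $word . "(" . $position . ") ";
}
// Prints: department(0) computer(1) science(2)
```

If the positions should instead refer to the original document, so that "computer" stays at 2, the array_values() call can be dropped and the same foreach over keys and values will print the preserved keys.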