How to remove hidden characters in PHP

Posted by user3764140 on Dev


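I have the following code, which reads text files from a directory. I use a stopword list, and after removing the stopwords from each file, I print the remaining words with their positions. Extra blank entries then show up at the positions where the stopwords used to be.

For example, a file that reads

Department of Computer Science // file

After removing the stopword 'of' and looping through the document, the output is:

Department(0) (1) Computer(2) Science(3) // output

But the blank should not be there.

Here is the code: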

<?php
$directory = "archive/";
$dir = opendir($directory);
while (($file = readdir($dir)) !== false) {
  $filename = $directory . $file;
  $type = filetype($filename);
  if ($type == 'file') {
    $contents = file_get_contents($filename);
    $texts = preg_replace('/\s+/', ' ',  $contents);
    $texts = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $texts);
    $text = explode(" ", $texts);
    $text = array_map('strtolower', $text);
    $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or", " ");
    $text = (array_diff($text,$stopwords));
    echo "<br><br>";
    $total_count = count($text);
    $b = -1;
   foreach ($text as $a=>$v)
   {
     $b++;
     echo $text[$b]. "(" .$b. ")" ." ";
   } 
 } 
}
closedir($dir); 
?>

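Giacomo1968

I'm not really 100% sure about the final output with the string positions, but I assume you placed it there for reference only. This test code, which uses a regex with preg_replace, seems to work well.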

header('Content-Type: text/plain; charset=utf-8');

// Set test content array.
$contents_array = array();
$contents_array[] = "Department of Computer Science // A document";
$contents_array[] = "Department of Economics // A document";

// Set the stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or");

// Set a regex based on the stopwords; \b on both sides matches whole words only.
$regex = '/\b(' . implode('|', $stopwords) . ')\b/i';

foreach ($contents_array as $contents) {

  // Remove the stopwords.
  $contents = preg_replace($regex, '', $contents);

  // Clear out the extra whitespace; anything 2 spaces or more in a row.
  $contents = preg_replace('/\s{2,}/', ' ', $contents);

  // Echo contents.
  echo $contents . "\n";

}

The output is cleaned up and formatted like this:

Department Computer Science // document

Department Economics // document
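
A pattern built from a word list like this depends on where the \b word boundaries sit: a boundary on only one side would let a fragment of a longer word match (the final "a" in "Data", for example). The sketch below is one way to guard against that. It assumes two hypothetical helpers, build_stopword_regex and remove_stopwords, which are not part of the code above; the first anchors both sides of the alternation and escapes each word with preg_quote.

```php
<?php
// Hypothetical helper: builds one case-insensitive regex that matches
// any stopword only when it stands alone as a whole word.
function build_stopword_regex(array $stopwords): string
{
    // Escape each word so regex metacharacters are treated literally.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopwords);

    // \b on both sides keeps "a" from matching inside "Data".
    return '/\b(?:' . implode('|', $escaped) . ')\b/i';
}

// Hypothetical helper: removes stopwords, then collapses leftover whitespace.
function remove_stopwords(string $contents, string $regex): string
{
    $contents = preg_replace($regex, '', $contents);
    return trim(preg_replace('/\s+/', ' ', $contents));
}

$regex = build_stopword_regex(array("a", "an", "of", "the"));
echo remove_stopwords("Department of Data Science", $regex) . "\n";
// prints "Department Data Science"
```

The non-capturing group (?:...) is used only because the matched word is never read back; a plain capturing group would behave the same here.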

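So, to integrate this into your code, you could do the following. Note how I moved $stopwords and $regex outside the while loop, since it makes no sense to reset those values on every iteration. Set them once outside the loop, and let the loop body focus only on what has to happen for each file: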

<?php
$directory = "archive/";
$dir = opendir($directory);

// Set the stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or");

// Set a regex based on the stopwords; \b on both sides matches whole words only.
$regex = '/\b(' . implode('|', $stopwords) . ')\b/i';

while (($file = readdir($dir)) !== false) {
  $filename = $directory . $file;
  $type = filetype($filename);
  if ($type == 'file') {

    // Get the contents of the filename.
    $contents = file_get_contents($filename);

    // Remove the stopwords.
    $contents = preg_replace($regex, '', $contents);

    // Clear out the extra whitespace; anything 2 spaces or more in a row.
    $contents = preg_replace('/\s{2,}/', ' ', $contents);

    // Echo contents.
    echo $contents;

 } 
}
closedir($dir); 
?>
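
If the word positions from the question are still needed, the gaps most likely come from array_diff, which preserves the original array keys, so the removed words leave holes in the index sequence. Below is a minimal sketch, assuming the goal is consecutive positions after stopword removal. The helper name index_words is hypothetical and the stopword list is shortened for illustration; the result is re-indexed with array_values.

```php
<?php
// Hypothetical helper: splits text into lowercase words, drops stopwords,
// and returns the remaining words with consecutive keys starting at 0.
function index_words(string $contents, array $stopwords): array
{
    // Keep letters, digits, hyphens and whitespace only.
    $clean = preg_replace('/[^A-Za-z0-9\-\s]/', '', $contents);

    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY skips empty pieces.
    $words = preg_split('/\s+/', strtolower($clean), -1, PREG_SPLIT_NO_EMPTY);

    // array_diff keeps the original keys, so re-index with array_values.
    return array_values(array_diff($words, $stopwords));
}

$stopwords = array("a", "an", "of", "the");
foreach (index_words("Department of Computer Science", $stopwords) as $position => $word) {
    echo $word . "(" . $position . ") ";
}
echo "\n";
// prints "department(0) computer(1) science(2) "
```

Iterating with the key from foreach, rather than a separate counter, keeps the printed position tied to the element actually being printed.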