Using wget to extract all file names from an index

Posted by debugcn in Dev

Ian Pringle

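I work at a university and want to extract the file names of all the PDF catalogs we have and put them in a text file. The PDFs all sit in an intranet index. wget works fine with the intranet, and I know how to use it to download a large number of files from that index. However, I am auditing the catalogs and only need the file name of each one: not the actual PDF, just "UniOfState0708.pdf".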

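The PDFs are all in different directories, so the index of /catalog/ lists directories such as UniOfStateA/, UniOfStateB/, and so on, and each of those indexes contains PDFs. Those are the names I want to collect.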

Can wget do this? If so, how would I go about it?

小的

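The following solution only works for unformatted, standard directory indexes generated by apache2. You can fetch the index file with wget and parse it with grep and cut, for example: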

#this will download the directory listing index.html file for /folder/
wget the.server.ip.address/folder/   

#this will grep for the table of the files, remove the top line (parent folder) and cut out
#the necessary fields
grep '</a></td>' index.html | tail -n +2 | cut -d'>' -f7 | cut -d'<' -f1

Note that, as stated above, this only works when the directory listing is generated by an apache2 server with basic options, configured like this:

<Directory /var/www/html/folder>
 Options +Indexes 
 AllowOverride None
 Allow from all
</Directory>

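With this configuration, wget returns an index.html without any particular formatting, but the directory listing can of course also be customized with options: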

IndexOptions +option1 -option2 ...
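Because such options change the HTML layout, the field numbers passed to cut above may stop matching. A possible way around this is to read the href attributes of the links instead of counting fields. The following is only a minimal sketch using Python's standard-library html.parser; the two sample listings, the second PDF name, and the names IndexLinkParser and list_names are assumptions made up for illustration, not output from a real server.

```python
from html.parser import HTMLParser


class IndexLinkParser(HTMLParser):
    """Collect href values of <a> tags that look like entries of the listing."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip column-sort links ("?C=N;O=D") and parent-directory links
        # (absolute paths starting with "/" or relative "../").
        if not href or href.startswith(("?", "/", "../")):
            return
        self.names.append(href)


def list_names(html):
    """Return the entry names found in one directory-index page."""
    parser = IndexLinkParser()
    parser.feed(html)
    return parser.names


# Two hypothetical listings of the same folder: a plain list and a table layout.
PLAIN = """<ul>
<li><a href="/catalog/"> Parent Directory</a></li>
<li><a href="UniOfState0708.pdf"> UniOfState0708.pdf</a></li>
<li><a href="UniOfState0809.pdf"> UniOfState0809.pdf</a></li>
</ul>"""

TABLE = """<table>
<tr><th><a href="?C=N;O=D">Name</a></th></tr>
<tr><td><a href="/catalog/">Parent Directory</a></td></tr>
<tr><td><a href="UniOfState0708.pdf">UniOfState0708.pdf</a></td></tr>
<tr><td><a href="UniOfState0809.pdf">UniOfState0809.pdf</a></td></tr>
</table>"""

print(list_names(PLAIN))
print(list_names(TABLE))
```

Both calls print the same two names, which is the point of keying on href rather than on column position. A heavily customized listing could still need different skip rules.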

To give a more precise answer that fits your case (if your setup is a special one), we would need a sample index.html file.

Here is a Python version as well, using requests and BeautifulSoup:

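from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests


def get_listing():
    url = 'http://cdimage.debian.org/debian-cd/8.4.0-live/amd64/iso-hybrid/'
    for file_url in listFD(url):
        # Print only the last path component, i.e. the file name
        print(file_url.rstrip('/').rsplit('/', 1)[-1])


def listFD(url, ext=''):
    """Return the absolute URLs of all links on the index page whose href ends with ext."""
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    # urljoin avoids the doubled slash produced by url + '/' + href
    return [urljoin(url, node.get('href'))
            for node in soup.find_all('a')
            if node.get('href') and node.get('href').endswith(ext)]


def main():
    get_listing()


if __name__ == '__main__':
    main()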
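
The script above lists a single directory. For the layout in the question, where /catalog/ contains subdirectories such as UniOfStateA/ that each hold PDFs, the same idea could be extended to descend into every subdirectory and write the names to a text file. The following is a rough sketch under several assumptions: it uses only the standard library instead of requests and BeautifulSoup, it treats any href ending in "/" as a subdirectory, and the names HrefParser, fetch_page, collect_names, and write_names are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin
from urllib.request import urlopen


class HrefParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            self.hrefs.append(href)


def fetch_page(url):
    """Download one index page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_names(url, ext=".pdf", fetch=fetch_page, seen=None):
    """Return names of files ending in ext under url, descending into subdirectories."""
    seen = set() if seen is None else seen
    if url in seen:
        return []
    seen.add(url)
    parser = HrefParser()
    parser.feed(fetch(url))
    names = []
    for href in parser.hrefs:
        # Skip sort links, parent-directory links, and links to other sites.
        if href.startswith(("?", "/", "../")) or "://" in href:
            continue
        if href.endswith("/"):
            # Assumed to be a subdirectory index: recurse into it.
            names += collect_names(urljoin(url, href), ext, fetch, seen)
        elif href.lower().endswith(ext):
            names.append(unquote(href))
    return names


def write_names(url, path, ext=".pdf"):
    """Write one file name per line to a text file."""
    with open(path, "w", encoding="utf-8") as out:
        for name in collect_names(url, ext):
            out.write(name + "\n")
```

A call such as write_names("http://intranet.example/catalog/", "pdf_names.txt") would then produce the text file; the host name here is a placeholder for the real intranet address. Passing a different fetch function makes it possible to try the traversal on saved pages before pointing it at the server.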