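I have been working on this for a week and still cannot find a working solution. I am parsing HTML files that contain Polish letters encoded in UTF-8. After extracting the information I want to save it to a file or print it to the console, but none of the Polish characters display correctly.
I have tried everything I could find on Stack Overflow and other forums, but for some reason what works for others does not work for me. I used: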
use open qw(:std :utf8);
use HTML::TreeBuilder qw( );
use Object::Destroyer qw( );
#and many others;
Here is my Perl code:
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
use File::Find;
use Encode;
my $location="C:\\MyLocation";
open (MYFILE, '>>data.txt');
sub find_txt {
    my $F = $File::Find::name;
    if ($F =~ /index.html$/ ) {
        my $tr = HTML::TreeBuilder->new->parse_file('index.html');
        for my $div ($tr->look_down(_tag => 'h2', 'class' => 'featured')) {
            say $div->as_text;
            print (MYFILE $div->as_text);
        }
        for my $div ($tr->look_down(_tag => 'div', 'class' => 'post-content')) {
            for my $t ($div->look_down(_tag => 'p')) {
                say $t->as_text;
                print (MYFILE $t->as_text);
            }
        }
        for my $div ($tr->look_down(_tag => 'h4', 'class' => 'related-posts')) {
            for my $t ($div->look_down(_tag => 'a')) {
                say $t->as_text;
                print (MYFILE $t->as_text);
            }
        }
    }
}
find(\&find_txt, $location);
close (MYFILE);
Here is the HTML file that causes the problem:
<div class="post-content">
<p>(łac. abacus)</p>
<p>1. płyta będąca najwyższą częścią kolumny</p>
<p>2. w starożytności – deska do liczenia, pierwowzór liczydła</p>
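I am not sure whether your browser can display the Polish characters, but they are the Unicode code points (in hexadecimal) 104, 106, 118, 141, 143, D3, 15A, 179, 17B for the uppercase letters and 105, 107, 119, 142, 144, F3, 15B, 17A, 17C for the lowercase ones.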
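You can open the input file explicitly with the given charset and pass the filehandle to parse_file:
...
open(my $fh, '<:encoding(UTF-8)', 'index.html') or die "Cannot open index.html: $!"; # explicitly open the input for reading with the UTF-8 layer
...
my $tr = HTML::TreeBuilder->new->parse_file($fh);
...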
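To make the explicit-charset approach concrete, here is a minimal sketch that decodes on input and encodes on output. The temporary directory, the sample markup, and the variable names ($html, $in, $out, @paragraphs) are assumptions made for illustration; they are not taken from the question. HTML::TreeBuilder must be installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical sample file, written as UTF-8 so the sketch is self-contained.
my $dir  = tempdir(CLEANUP => 1);
my $html = File::Spec->catfile($dir, 'index.html');
open(my $sample, '>:encoding(UTF-8)', $html) or die "Cannot write $html: $!";
print {$sample} qq{<div class="post-content"><p>1. p\x{142}yta b\x{119}d\x{105}ca</p></div>};
close($sample);

# Input side: decode the UTF-8 bytes into Perl characters while reading.
open(my $in, '<:encoding(UTF-8)', $html) or die "Cannot read $html: $!";
my $tree = HTML::TreeBuilder->new;
$tree->parse_file($in);
close($in);

# Collect the text of every <p> element.
my @paragraphs = map { $_->as_text } $tree->look_down(_tag => 'p');

# Output side: encode the characters back to UTF-8 while writing.
my $out_path = File::Spec->catfile($dir, 'data.txt');
open(my $out, '>>:encoding(UTF-8)', $out_path) or die "Cannot write $out_path: $!";
print {$out} "$_\n" for @paragraphs;
close($out);

$tree->delete;
```

The point of the sketch is that both ends need a layer: without one on the output handle, Perl would typically emit "Wide character in print" warnings for these letters.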
Or use IO::HTML to detect the charset of the file automatically when opening it.
...
use IO::HTML; # exports html_file by default
...
my $tr = HTML::TreeBuilder->new->parse_file(html_file('index.html'));
....
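A slightly fuller sketch of the IO::HTML variant follows. It assumes the page declares its charset in a meta tag, which is one of the things html_file looks at; the temporary file, the sample markup, and the variable names are hypothetical. Both IO::HTML and HTML::TreeBuilder must be installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use IO::HTML;                  # exports html_file by default
use File::Temp qw(tempfile);

# Hypothetical sample page, stored as UTF-8 with a meta charset declaration.
my ($tmp, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
binmode($tmp, ':encoding(UTF-8)');
print {$tmp} qq{<html><head><meta charset="utf-8"></head>}
           . qq{<body><h2 class="featured">liczyd\x{142}o</h2></body></html>};
close($tmp);

# html_file returns a filehandle that already decodes the detected charset.
my $tree = HTML::TreeBuilder->new;
$tree->parse_file(html_file($path));

my ($h2) = $tree->look_down(_tag => 'h2', class => 'featured');
my $text = $h2->as_text;

# Encode on the way out so the console receives UTF-8 bytes.
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";

$tree->delete;
```

This may be the more convenient choice when the files under the search directory are not guaranteed to share one encoding.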
HTML::TreeBuilder
parse_file
....
When you pass a filename to "parse_file", HTML::Parser opens it in binary mode,
which means it's interpreted as Latin-1 (ISO-8859-1). If the file
is in another encoding, like UTF-8 or UTF-16, this will not do the right thing.
....
SEE ALSO
....
For opening a HTML file with automatic charset detection: IO::HTML.
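The effect described in that documentation excerpt can be reproduced with core Perl alone. This is only an illustration of the mechanism, not of what HTML::Parser does internally; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+0142 (Polish "l with stroke") is the two bytes C5 82.
my $bytes = "\xC5\x82";

# Left undecoded, Perl treats each byte as one Latin-1 character.
my $as_latin1 = $bytes;

# Decoded as UTF-8, the two bytes become a single character.
my $as_utf8 = decode('UTF-8', $bytes);

printf "undecoded: %d characters\n", length $as_latin1;
printf "decoded:   %d character, U+%04X\n", length $as_utf8, ord $as_utf8;
```

Reading the file in binary mode leaves every Polish letter in the two-character state, which is why the extracted text looks garbled no matter how the output handle is configured.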