Suppose I have a book in a text file, a public-domain book, so there are no restrictions on what can be done with it, for example H. G. Wells's The War of the Worlds (1898).
It begins like this:
CHAPTER ONE
THE EVE OF THE WAR
No one would have believed in the last years of the nineteenth century that this world was being watched keenly and closely by intelligences greater than man's and yet as mortal as his own…
I count the occurrences of each word with the following Perl one-liner:
perl -0777 -lape's/\s+/\n/g' worlds.txt | sort | uniq -c | sort -nr > occurenaces.txt
which produces a text file like this:
4395 the
2317 and
2282 of
1524 a
1204 I
1155 to
901 in
830 was
707 that
557 had
432 with
411 my
402 as
...
and plot it in a graph using:
gnuplot -e "set logscale y 2; set ytics 2; set grid; set title 'Occurenaces vs Word'; set xlabel 'Word Rank'; set ylabel 'Number of Occurenaces'; set terminal png size 800,600; set output 'occurenaces.png'; plot 'occurenaces.txt' with points pt 7 lc rgb 'red'; pause -1"
However, I've run into a problem: my script counts the same word more than once. For example:
4395 the
340 The
or, for example:
62 Martian
12 Martian,
4 Martian’s
3 Martian.
1 Martian!’
How can I avoid this problem?
This is a task I give in Learning Perl classes. In fact, I gave it to an undergraduate intern once because he had this assignment for a statistical mechanics class. Most people used some short text and counted manually, so I had him do Moby Dick, and then the KJV Bible. My additional instructions were to reveal the results for the Bible only after he'd blown away everyone with Moby Dick. Good times. Zipf takes a huge book to explain all this: Human Behavior and the Principle of Least Effort.
First, you probably don't want a one-liner for this. There's a bit that you need to do.
use strict;
use warnings;

my %Count;    # word => number of times it appears

LINE: while( <> ) {
    chomp;
    my @words = map normalize($_), split /\s+/;
    $Count{$_}++ for @words;
    }

sub normalize {
    my $s = lc shift;       # fold case so "The" and "the" count as one word
    $s =~ s/[^a-z]//g;      # strip punctuation; might reduce too much
    # ... whatever else you need ...
    return $s;
    }
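Saved as, say, count.pl (a name I'm making up for illustration), you'd run it as perl count.pl worlds.txt, and the counts accumulate in %Count.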
Once you have the hash, you can output it however you like. The keys are already unique; I suppose you could sort them, but for the plot it doesn't matter.
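For instance, here's a minimal sketch of one way to dump the hash, most frequent word first, in the same shape as the uniq -c | sort -nr output above:

for my $word ( sort { $Count{$b} <=> $Count{$a} } keys %Count ) {
    printf "%7d %s\n", $Count{$word}, $word;    # count, then word, like uniq -c
    }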
From there, you'll notice other odd things in the word list, down in the tail. You can ignore them since their counts are small, but if you want high fidelity, you get into maintaining a dictionary of special cases on the side.
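As a rough sketch of what that could look like (these entries are invented for illustration, and the curly apostrophes in the actual text would need handling too), normalize might consult an exceptions table before falling back to the blanket rule:

my %Special = (                      # hypothetical special cases; grow as you find oddities
    "martian's" => 'martian',        # fold the possessive into the base word
    "o'er"      => 'over',           # archaic contraction
    );

sub normalize {
    my $s = lc shift;
    return $Special{$s} if exists $Special{$s};   # special cases win
    $s =~ s/[^a-z]//g;
    return $s;
    }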