Suppose I have a book in a text file, a public-domain book, so there are no restrictions on what can be done with it, for example H. G. Wells's The War of the Worlds (1898).
It begins like this:
CHAPTER ONE
THE EVE OF THE WAR
No one would have believed in the last years of the nineteenth century that this world was being watched keenly and closely by intelligences greater than man's and yet as mortal as his own…
I count the occurrences of each word with the following Perl one-liner:
perl -0777 -lape's/\s+/\n/g' worlds.txt | sort | uniq -c | sort -nr > occurenaces.txt
which produces a text file like this:
4395 the
2317 and
2282 of
1524 a
1204 I
1155 to
901 in
830 was
707 that
557 had
432 with
411 my
402 as
...
and plot it in a graph using:
gnuplot -e "set logscale y 2; set ytics 2; set grid; set title 'Occurenaces vs Word'; set xlabel 'Word Rank'; set ylabel 'Number of Occurenaces'; set terminal png size 800,600; set output 'occurenaces.png'; plot 'occurenaces.txt' with points pt 7 lc rgb 'red'; pause -1"
However, I've run into a problem: my script counts the same word more than once. For example:
4395 the
340 The
or, for example:
62 Martian
12 Martian,
4 Martian’s
3 Martian.
1 Martian!’
How can I avoid this problem?
This is a task I give in Learning Perl classes. In fact, I gave it to an undergraduate intern once because he had this assignment for a statistical mechanics class. Most people used some short text and counted manually, so I had him do Moby Dick, and then the KJV Bible. My additional instructions were to reveal the results for the Bible only after he'd blown away everyone with Moby Dick. Good times. Zipf takes a huge book to explain all this: Human Behavior and the Principle of Least Effort.
First, you probably don't want a one-liner for this. There's a bit that you need to do.
use strict;
use warnings;

my %Count;    # word => number of times it appears

LINE: while( <> ) {
    chomp;
    my @words = map normalize($_), split /\s+/;
    $Count{$_}++ for @words;
    }

sub normalize {
    my $s = lc shift;       # fold case so "The" and "the" count as one word
    $s =~ s/[^a-z]//g;      # strip punctuation; might reduce too much
    # ... whatever else you need ...
    return $s;
    }
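Saved as, say, count.pl (a name I'm making up for illustration), you'd run it as perl count.pl worlds.txt, and the counts accumulate in %Count.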
Once you have the hash, you can output it however you like. The keys are already unique; I suppose you could sort them, but for the plot it doesn't matter.
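For instance, here's a minimal sketch of one way to dump the hash, most frequent word first, in the same shape as the uniq -c | sort -nr output above:

for my $word ( sort { $Count{$b} <=> $Count{$a} } keys %Count ) {
    printf "%7d %s\n", $Count{$word}, $word;    # count, then word, like uniq -c
    }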
From there, you'll notice other odd things in the word list, down in the tail. You can ignore them since their counts are small, but if you want high fidelity, you get into maintaining a dictionary of special cases on the side.
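As a rough sketch of what that could look like (these entries are invented for illustration, and the curly apostrophes in the actual text would need handling too), normalize might consult an exceptions table before falling back to the blanket rule:

my %Special = (                      # hypothetical special cases; grow as you find oddities
    "martian's" => 'martian',        # fold the possessive into the base word
    "o'er"      => 'over',           # archaic contraction
    );

sub normalize {
    my $s = lc shift;
    return $Special{$s} if exists $Special{$s};   # special cases win
    $s =~ s/[^a-z]//g;
    return $s;
    }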