First off, I'll say that I sort of understand UTF-8 encoding: it is basically, but not exactly, Unicode, and ASCII is a smaller character set within it. I also understand that if I have:
se_body = "> Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word tr <excess removed ...> JV"
print len(se_body) #will return the number of characters in the string, in my case '1500'
print sys.getsizeof(se_body) #will return the number of bytes, which will be 3050
My code makes use of a RESTful API that I do not control. The RESTful API's job is to parse a passed parameter for Bible references in the text, and it has an interesting quirk: it can only accept 2000 characters at a time. If more than 2000 characters are sent, my API call returns a 404. Again, I'm making use of somebody else's API, so please don't tell me to "fix the server side". I can't :)
My solution is to break the string into chunks of fewer than 2000 characters, scan each chunk, and then reassemble and tag as needed. I'd like to be kind to the service and pass as few chunks as possible, which means each chunk should be large.
My problem comes when I pass a string with Hebrew or Greek characters in it. (Yes, Bible answers often use Greek and Hebrew!) If I set the chunk size as low as 1000 characters, I can always pass it safely, but that seems so small. In most cases I should be able to chunk it larger.
My question is this: what is the most efficient way to break up UTF-8 into correctly sized chunks without resorting to excessive heroics?
Here is the code:
# -*- coding: utf-8 -*-
import requests
import json
biblia_apikey = '************'
refparser_url = "http://api.biblia.com/v1/bible/scan/?"
 Genesis">
se_body = "> Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word translated as \"rest\" in English, is actually the conjugated word from which we get the English word `Sabbath`, which actually means to \"cease doing\". > וַיִּשְׁבֹּת or by its root: > שָׁבַת Here's BlueletterBible's concordance entry: [Strong's H7673][1] It is actually the same root word that is conjugated to mean \"[to go on strike][2]\" in modern Hebrew. In Genesis it is used to refer to the fact that the creation process ceased, not that God \"rested\" in the sense of relieving exhaustion, as we would normally understand the term in English. The word \"rest\" in that sense is > נוּחַ Which can be found in Genesis 8:9, for example (and is also where we get Noah's name). More here: [Strong's H5117][3] Jesus' words are in reference to the fact that God is always at work, as the psalmist says in Psalm 54:4, He is the sustainer, something that implies a constant intervention (a \"work\" that does not cease). The institution of the Sabbath was not merely just so the Israelites would \"rest\" from their work but as with everything God institutes in the Bible, it had important theological significance (especially as can be gleaned from its prominence as one of the 10 commandments). The point of the Sabbath was to teach man that he should not think he is self-reliant (cf. instances such as Judges 7) and that instead they should rely upon God, but more specifically His mercy. The driving message throughout the Old Testament as well as the New (and this would be best extrapolated in c.se) is that man cannot, by his own efforts (\"works\") reach God's standard: > Ephesians 2:8 For by grace you have been saved through faith, and that not of yourselves; it is the gift of God, 9 not of works, lest anyone should boast. The Sabbath (and the penalty associated with breaking it) was a way for the Israelites to weekly remember this. See Hebrews 4 for a more in depth explanation of this concept. So there is no contradiction, since God never stopped \"working\", being constantly active in sustaining His creation, and as Jesus also taught, the Sabbath was instituted for man, to rest, but also, to \"stop doing\" and remember that he is not self-reliant, whether for food, or for salvation. Hope that helps. [1]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H7673&t=KJV [2]: http://www.morfix.co.il/%D7%A9%D7%91%D7%99%D7%AA%D7%94 [3]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?strongs=H5117&t=KJV"
se_body = se_body.decode('utf-8')
nchunk_start=0
nchunk_size=1500
found_refs = []
while nchunk_start < len(se_body):
    body_chunk = se_body[nchunk_start:nchunk_start+nchunk_size]
    if (len(body_chunk.strip())<4):
        break
    refparser_params = {'text': body_chunk, 'key': biblia_apikey }
    headers = {'content-type': 'text/plain; charset=utf-8', 'Accept-Encoding': 'gzip,deflate,sdch'}
    refparse = requests.get(refparser_url, params = refparser_params, headers=headers)
    if (refparse.status_code == 200):
        foundrefs = json.loads(refparse.text)
        for foundref in foundrefs['results']:
            foundref['textIndex'] += nchunk_start
            found_refs.append( foundref )
    else:
        print "Status Code {0}: Failed to retrieve valid parsing info at {1}".format(refparse.status_code, refparse.url)
        print " returned text is: =>{0}<=".format(refparse.text)
    nchunk_start += (nchunk_size-50)
    #Note: I'm purposely backing up, so that I don't accidentally split a reference across chunks

for ref in found_refs:
    print ref
    print se_body[ref['textIndex']:ref['textIndex']+ref['textLength']]
I know how to slice a string (body_chunk = se_body[start:stop]), but I'm not sure how to slice that same string according to the length of its UTF-8 encoding in bytes.
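For illustration, byte-aware slicing could look like the following sketch (Python 3 syntax; the helper name chunk_by_utf8_bytes is hypothetical, not part of any library): encode once, cut at the byte budget, and let decode(..., 'ignore') drop any partial trailing character so no multi-byte sequence is ever split.

```python
def chunk_by_utf8_bytes(text, max_bytes):
    """Yield substrings whose UTF-8 encoding is at most max_bytes bytes."""
    encoded = text.encode('utf-8')
    start = 0
    while start < len(encoded):
        # decode(..., 'ignore') silently drops a partial character at the cut
        piece = encoded[start:start + max_bytes].decode('utf-8', 'ignore')
        if not piece:
            raise ValueError('max_bytes is smaller than one character')
        yield piece
        # advance by the bytes actually consumed, not by max_bytes
        start += len(piece.encode('utf-8'))

# the Hebrew שָׁבַת is 6 codepoints / 12 UTF-8 bytes, plus ' rest' (5 bytes)
text = u'\u05e9\u05b8\u05c1\u05d1\u05b7\u05ea rest'
chunks = list(chunk_by_utf8_bytes(text, 10))
```

Note that this guards only the encoding boundary; like the nchunk_size-50 backup in the code above, extra care would still be needed to avoid splitting a reference (or a base letter from its vowel points) across chunks.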
Once this is done, I need to pull out the selected references (actually, I'll be adding SPAN tags). Here is what the output looks like right now, though:
{u'textLength': 11, u'textIndex': 5, u'passage': u'Genesis 2:2'}
Genesis 2:2
{u'textLength': 11, u'textIndex': 841, u'passage': u'Genesis 8:9'}
Genesis 8:9
There could be several relevant "sizes" here:

the size in memory, sys.getsizeof(); e.g.,

>>> import sys
>>> sys.getsizeof(b'a')
38
>>> sys.getsizeof(u'Α')
56

i.e., a bytestring b'a' that contains a single byte may require 38 bytes in memory. You shouldn't worry about this unless your local machine has memory problems;

the number of bytes in the text encoded as utf-8:
>>> unicode_text = u'Α' # greek letter
>>> bytestring = unicode_text.encode('utf-8')
>>> len(bytestring)
2
the number of Unicode codepoints in the text:
>>> unicode_text = u'Α' # greek letter
>>> len(unicode_text)
1
often, you might also be interested in the number of grapheme clusters ("visual characters") in the text:
>>> unicode_text = u'ё' # cyrillic letter
>>> len(unicode_text) # number of Unicode codepoints
2
>>> import regex # $ pip install regex
>>> chars = regex.findall(u'\\X', unicode_text)
>>> chars
[u'\u0435\u0308']
>>> len(chars) # number of "user-perceived characters"
1
If the API limit is defined by point 2 (the number of bytes in the utf-8-encoded bytestring), then you could use the answers from the question linked by @Martijn Pieters: "Truncating unicode so it fits a maximum size when encoded for wire transfer". The first answer should work:
truncated = unicode_text.encode('utf-8')[:2000].decode('utf-8', 'ignore')
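A quick check (Python 3 syntax) of what that one-liner does at the boundary: the Greek letters Α and Β are 2 UTF-8 bytes each, so a 3-byte budget holds one full letter plus one stray byte, and 'ignore' silently drops the partial character instead of raising UnicodeDecodeError.

```python
# b'\xce\x91\xce\x92' is u'ΑΒ' in UTF-8; cutting at 3 bytes splits Β in half
unicode_text = u'ΑΒ'
truncated = unicode_text.encode('utf-8')[:3].decode('utf-8', 'ignore')
```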
It is also possible that the length is limited by the url length:
>>> import urllib
>>> urllib.quote(u'\u0435\u0308'.encode('utf-8'))
'%D0%B5%CC%88'
To truncate it:
import re
import urllib
urlencoded = urllib.quote(unicode_text.encode('utf-8'))[:2000]
# remove `%` or `%X` at the end
urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded)
truncated = urllib.unquote(urlencoded).decode('utf-8', 'ignore')
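To see that trimming in action, here is the same idea in Python 3 syntax (where quote and unquote_to_bytes live in urllib.parse): cutting the percent-encoded string mid-escape leaves a dangling '%' or '%X', which the regex strips so the remainder decodes cleanly.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# u'\u0435\u0308' percent-encodes to '%D0%B5%CC%88' (12 chars);
# cutting at 10 chars leaves a dangling '%' from the last escape
urlencoded = quote(u'\u0435\u0308'.encode('utf-8'))[:10]
# remove the incomplete trailing escape before unquoting
urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded)
# 'ignore' then drops the byte that is only half a UTF-8 character
truncated = unquote_to_bytes(urlencoded).decode('utf-8', 'ignore')
```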
The issue with the url length might be worked around using the 'X-HTTP-Method-Override' http header, which (if the service supports it) allows converting a GET request into a POST request. Here is a code example that uses the Google Translate API.
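As a sketch of that workaround with the requests library (whether api.biblia.com actually honors X-HTTP-Method-Override is an assumption here; the key value is a placeholder): the text travels in the POST body, so the url itself stays short, while the header asks the server to process it as a GET.

```python
import requests

# prepare (but don't send) a POST carrying a long text in the body;
# the X-HTTP-Method-Override header asks the server to treat it as a GET
req = requests.Request(
    'POST',
    'http://api.biblia.com/v1/bible/scan/',
    data={'text': u'Genesis 2:2 ' + u'\u05e9' * 3000, 'key': 'demo-key'},
    headers={'X-HTTP-Method-Override': 'GET'},
).prepare()

# the url stays short; the thousands of characters are in req.body
print(req.method, len(req.url), len(req.body))
```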
If it is allowed in your case, you could compress the html text by decoding html references and by merging some Unicode codepoints using the NFC Unicode normalization form:
import unicodedata
from HTMLParser import HTMLParser
unicode_text = unicodedata.normalize('NFC', HTMLParser().unescape(unicode_text))
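For instance (a small Python 3 sketch, with html.unescape standing in for the Python 2 HTMLParser().unescape above): the html reference pair below decodes to е plus a combining diaeresis, two codepoints and four UTF-8 bytes, which NFC composes into the single precomposed letter ё, one codepoint and two bytes.

```python
import unicodedata
from html import unescape  # Python 3 replacement for HTMLParser().unescape

# decode the html character references first: 15 chars -> 2 codepoints
decoded = unescape(u'&#x435;&#x308;')
# NFC merges base letter + combining mark into one precomposed codepoint
composed = unicodedata.normalize('NFC', decoded)
```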