First off, I'll say that I sort of understand UTF-8 encoding: it is basically, but not exactly, Unicode, and ASCII is a smaller character set within it. I also understand that if I have:
se_body = "> Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word tr <excess removed ...> JV"
print len(se_body) #will return the number of characters in the string, in my case '1500'
print sys.getsizeof(se_body) #will return the number of bytes, which will be 3050
My code makes use of a RESTful API that I do not control. The RESTful API's job is to parse a passed parameter for Bible references in the text, and it has an interesting quirk: it can only accept 2000 characters at a time. If more than 2000 characters are sent, my API call returns a 404. Again, I'm making use of somebody else's API, so please don't tell me to "fix the server side". I can't :)
My solution is to break the string into chunks of fewer than 2000 characters, scan each chunk, and then reassemble and tag as needed. I'd like to be kind to the service and pass as few chunks as possible, which means each chunk should be large.
My problem comes when I pass a string with Hebrew or Greek characters in it. (Yes, Bible answers often use Greek and Hebrew!) If I set the chunk size as low as 1000 characters, I can always pass it safely, but that seems so small. In most cases I should be able to chunk it larger.
My question is this: what is the most efficient way to break up UTF-8 into correctly sized chunks without resorting to excessive heroics?
Here is the code:
# -*- coding: utf-8 -*-
import requests
import json
biblia_apikey = '************'
refparser_url = "http://api.biblia.com/v1/bible/scan/?"
 Genesis">
se_body = "> Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word translated as \"rest\" in English, is actually the conjugated word from which we get the English word `Sabbath`, which actually means to \"cease doing\". > וַיִּשְׁבֹּת or by its root: > שָׁבַת Here's BlueletterBible's concordance entry: [Strong's H7673][1] It is actually the same root word that is conjugated to mean \"[to go on strike][2]\" in modern Hebrew. In Genesis it is used to refer to the fact that the creation process ceased, not that God \"rested\" in the sense of relieving exhaustion, as we would normally understand the term in English. The word \"rest\" in that sense is > נוּחַ Which can be found in Genesis 8:9, for example (and is also where we get Noah's name). More here: [Strong's H5117][3] Jesus' words are in reference to the fact that God is always at work, as the psalmist says in Psalm 54:4, He is the sustainer, something that implies a constant intervention (a \"work\" that does not cease). The institution of the Sabbath was not merely just so the Israelites would \"rest\" from their work but as with everything God institutes in the Bible, it had important theological significance (especially as can be gleaned from its prominence as one of the 10 commandments). The point of the Sabbath was to teach man that he should not think he is self-reliant (cf. instances such as Judges 7) and that instead they should rely upon God, but more specifically His mercy. The driving message throughout the Old Testament as well as the New (and this would be best extrapolated in c.se) is that man cannot, by his own efforts (\"works\") reach God's standard: > Ephesians 2:8 For by grace you have been saved through faith, and that not of yourselves; it is the gift of God, 9 not of works, lest anyone should boast. The Sabbath (and the penalty associated with breaking it) was a way for the Israelites to weekly remember this. See Hebrews 4 for a more in depth explanation of this concept. So there is no contradiction, since God never stopped \"working\", being constantly active in sustaining His creation, and as Jesus also taught, the Sabbath was instituted for man, to rest, but also, to \"stop doing\" and remember that he is not self-reliant, whether for food, or for salvation. Hope that helps. [1]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H7673&t=KJV [2]: http://www.morfix.co.il/%D7%A9%D7%91%D7%99%D7%AA%D7%94 [3]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?strongs=H5117&t=KJV"
se_body = se_body.decode('utf-8')
nchunk_start=0
nchunk_size=1500
found_refs = []
while nchunk_start < len(se_body):
    body_chunk = se_body[nchunk_start:nchunk_start+nchunk_size]
    if (len(body_chunk.strip())<4):
        break
    refparser_params = {'text': body_chunk, 'key': biblia_apikey }
    headers = {'content-type': 'text/plain; charset=utf-8', 'Accept-Encoding': 'gzip,deflate,sdch'}
    refparse = requests.get(refparser_url, params = refparser_params, headers=headers)
    if (refparse.status_code == 200):
        foundrefs = json.loads(refparse.text)
        for foundref in foundrefs['results']:
            foundref['textIndex'] += nchunk_start
            found_refs.append( foundref )
    else:
        print "Status Code {0}: Failed to retrieve valid parsing info at {1}".format(refparse.status_code, refparse.url)
        print " returned text is: =>{0}<=".format(refparse.text)
    nchunk_start += (nchunk_size-50)
    #Note: I'm purposely backing up, so that I don't accidentally split a reference across chunks

for ref in found_refs:
    print ref
    print se_body[ref['textIndex']:ref['textIndex']+ref['textLength']]
I know how to slice a string (body_chunk = se_body[start:stop]), but I'm not sure how to slice that same string according to the length of its UTF-8 encoding in bytes.
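For illustration, byte-aware slicing could look like the following sketch (Python 3 syntax; the helper name chunk_by_utf8_bytes is hypothetical, not part of any library): encode once, cut at the byte budget, and let decode(..., 'ignore') drop any partial trailing character so no multi-byte sequence is ever split.

```python
def chunk_by_utf8_bytes(text, max_bytes):
    """Yield substrings whose UTF-8 encoding is at most max_bytes bytes."""
    encoded = text.encode('utf-8')
    start = 0
    while start < len(encoded):
        # decode(..., 'ignore') silently drops a partial character at the cut
        piece = encoded[start:start + max_bytes].decode('utf-8', 'ignore')
        if not piece:
            raise ValueError('max_bytes is smaller than one character')
        yield piece
        # advance by the bytes actually consumed, not by max_bytes
        start += len(piece.encode('utf-8'))

# the Hebrew שָׁבַת is 6 codepoints / 12 UTF-8 bytes, plus ' rest' (5 bytes)
text = u'\u05e9\u05b8\u05c1\u05d1\u05b7\u05ea rest'
chunks = list(chunk_by_utf8_bytes(text, 10))
```

Note that this guards only the encoding boundary; like the nchunk_size-50 backup in the code above, extra care would still be needed to avoid splitting a reference (or a base letter from its vowel points) across chunks.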
Once this is done, I need to pull out the selected references (actually, I'll be adding SPAN tags). Here is what the output looks like right now, though:
{u'textLength': 11, u'textIndex': 5, u'passage': u'Genesis 2:2'}
Genesis 2:2
{u'textLength': 11, u'textIndex': 841, u'passage': u'Genesis 8:9'}
Genesis 8:9
There could be several relevant "sizes" here:

the size in memory, sys.getsizeof(); e.g.,

>>> import sys
>>> sys.getsizeof(b'a')
38
>>> sys.getsizeof(u'Α')
56

i.e., a bytestring b'a' that contains a single byte may require 38 bytes in memory. You shouldn't worry about this unless your local machine has memory problems;

the number of bytes in the text encoded as utf-8:
>>> unicode_text = u'Α' # greek letter
>>> bytestring = unicode_text.encode('utf-8')
>>> len(bytestring)
2
the number of Unicode codepoints in the text:
>>> unicode_text = u'Α' # greek letter
>>> len(unicode_text)
1
often, you might also be interested in the number of grapheme clusters ("visual characters") in the text:
>>> unicode_text = u'ё' # cyrillic letter
>>> len(unicode_text) # number of Unicode codepoints
2
>>> import regex # $ pip install regex
>>> chars = regex.findall(u'\\X', unicode_text)
>>> chars
[u'\u0435\u0308']
>>> len(chars) # number of "user-perceived characters"
1
If the API limit is defined by point 2 (the number of bytes in the utf-8-encoded bytestring), then you could use the answers from the question linked by @Martijn Pieters: "Truncating unicode so it fits a maximum size when encoded for wire transfer". The first answer should work:
truncated = unicode_text.encode('utf-8')[:2000].decode('utf-8', 'ignore')
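A quick check (Python 3 syntax) of what that one-liner does at the boundary: the Greek letters Α and Β are 2 UTF-8 bytes each, so a 3-byte budget holds one full letter plus one stray byte, and 'ignore' silently drops the partial character instead of raising UnicodeDecodeError.

```python
# b'\xce\x91\xce\x92' is u'ΑΒ' in UTF-8; cutting at 3 bytes splits Β in half
unicode_text = u'ΑΒ'
truncated = unicode_text.encode('utf-8')[:3].decode('utf-8', 'ignore')
```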
It is also possible that the length is limited by the url length:
>>> import urllib
>>> urllib.quote(u'\u0435\u0308'.encode('utf-8'))
'%D0%B5%CC%88'
To truncate it:
import re
import urllib
urlencoded = urllib.quote(unicode_text.encode('utf-8'))[:2000]
# remove `%` or `%X` at the end
urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded)
truncated = urllib.unquote(urlencoded).decode('utf-8', 'ignore')
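To see that trimming in action, here is the same idea in Python 3 syntax (where quote and unquote_to_bytes live in urllib.parse): cutting the percent-encoded string mid-escape leaves a dangling '%' or '%X', which the regex strips so the remainder decodes cleanly.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# u'\u0435\u0308' percent-encodes to '%D0%B5%CC%88' (12 chars);
# cutting at 10 chars leaves a dangling '%' from the last escape
urlencoded = quote(u'\u0435\u0308'.encode('utf-8'))[:10]
# remove the incomplete trailing escape before unquoting
urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded)
# 'ignore' then drops the byte that is only half a UTF-8 character
truncated = unquote_to_bytes(urlencoded).decode('utf-8', 'ignore')
```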
The issue with the url length might be worked around using the 'X-HTTP-Method-Override' http header, which (if the service supports it) allows converting a GET request into a POST request. Here is a code example that uses the Google Translate API.
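As a sketch of that workaround with the requests library (whether api.biblia.com actually honors X-HTTP-Method-Override is an assumption here; the key value is a placeholder): the text travels in the POST body, so the url itself stays short, while the header asks the server to process it as a GET.

```python
import requests

# prepare (but don't send) a POST carrying a long text in the body;
# the X-HTTP-Method-Override header asks the server to treat it as a GET
req = requests.Request(
    'POST',
    'http://api.biblia.com/v1/bible/scan/',
    data={'text': u'Genesis 2:2 ' + u'\u05e9' * 3000, 'key': 'demo-key'},
    headers={'X-HTTP-Method-Override': 'GET'},
).prepare()

# the url stays short; the thousands of characters are in req.body
print(req.method, len(req.url), len(req.body))
```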
If it is allowed in your case, you could compress the html text by decoding html references and by merging some Unicode codepoints using the NFC Unicode normalization form:
import unicodedata
from HTMLParser import HTMLParser
unicode_text = unicodedata.normalize('NFC', HTMLParser().unescape(unicode_text))
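For instance (a small Python 3 sketch, with html.unescape standing in for the Python 2 HTMLParser().unescape above): the html reference pair below decodes to е plus a combining diaeresis, two codepoints and four UTF-8 bytes, which NFC composes into the single precomposed letter ё, one codepoint and two bytes.

```python
import unicodedata
from html import unescape  # Python 3 replacement for HTMLParser().unescape

# decode the html character references first: 15 chars -> 2 codepoints
decoded = unescape(u'&#x435;&#x308;')
# NFC merges base letter + combining mark into one precomposed codepoint
composed = unicodedata.normalize('NFC', decoded)
```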