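I received several thousand Excel files to process. When I open them, the data appears to be encoded in a way that I can read and process with Python.
However, the file names are garbled. I imported the file names into sqlite and then exported a list of them to CSV so I could try importing them into Excel with the correct encoding.
This is how they appear in the file system:
If I tell Excel to import as 28596: Arabic (ISO), this is how the names appear. I assume that maps to the iso8859_6 encoding in Python 3.5.
After importing, Excel itself can't display them properly. This is how they look, which I assume is a font issue.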
Anyhow, if I import these file names into Python, I can't encode/decode them without errors. If I set errors to ignore, then I don't see the file names.
Any idea how to encode these to a standard Unicode Arabic that will display properly alongside all of the other Arabic text I'm working with?
Here's one example of how it appears in the file explorer on Windows and Finder on MacOS.
½ñΘ Ω⌐αε δτßí ñáƒóƒ ƒΘª¼á ƒΘßá∩í Θ¼∞⌐ 4-2016.xlsx
Edit:
Here's what I've tried in code... I have the filenames in a sqlite database, so I fetch them from there. (By the way, I don't have a problem with 99.9% of the Arabic I'm dealing with -- just these file names.)
import dataset
db = dataset.connect("sqlite:///mydata.sqlite")
# Hit on one of the characters that appears in the garbled file names
res = db.query("SELECT * FROM files WHERE file_name LIKE '%Ω%'")
file_names = [r['file_name'] for r in res]
test = file_names[0]
print(test)
>> '½ñΘ Ω⌐αε δτßí ñáƒóƒ ƒΘª¼á ƒΘßá∩í Θ¼∞⌐ 4-2016.xlsx'
Trying a few things:
test.encode('iso8859_6')
That leads to an error.
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-10-9c734319c359> in <module>()
----> 1 test.encode('iso8859_6')
C:\ProgramData\Anaconda3\lib\encodings\iso8859_6.py in encode(self, input, errors)
10
11 def encode(self,input,errors='strict'):
---> 12 return codecs.charmap_encode(input,errors,encoding_table)
13
14 def decode(self,input,errors='strict'):
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
Try with the codecs library
import codecs
codecs.encode(test,encoding='iso8859_6')
Same error as above.
codecs.encode(test,encoding='iso8859_6',errors='ignore')
>> b' 4-2016.xlsx'
Another try:
codecs.encode(test,encoding='iso8859_6',errors='ignore').decode('utf-8')
>> ' 4-2016.xlsx'
Try the other way around to convert it to bytes and then to the iso format:
codecs.encode(test,encoding='utf-8',errors='ignore')
>> b'\xc2\xbd\xc3\xb1\xce\x98 \xce\xa9\xe2\x8c\x90\xce\xb1\xce\xb5 \xce\xb4\xcf\x84\xc3\x9f\xc3\xad \xc3\xb1\xc3\xa1\xc6\x92\xc3\xb3\xc6\x92 \xc6\x92\xce\x98\xc2\xaa\xc2\xbc\xc3\xa1 \xc6\x92\xce\x98\xc3\x9f\xc3\xa1\xe2\x88\xa9\xc3\xad \xce\x98\xc2\xbc\xe2\x88\x9e\xe2\x8c\x90 4-2016.xlsx'
Chaining with decode...
codecs.encode(test,encoding='utf-8',errors='ignore').decode('iso8859_6')
This error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-22-4a3c96284d09> in <module>()
----> 1 codecs.encode(test,encoding='utf-8',errors='ignore').decode('iso8859_6')
C:\ProgramData\Anaconda3\lib\encodings\iso8859_6.py in decode(self, input, errors)
13
14 def decode(self,input,errors='strict'):
---> 15 return codecs.charmap_decode(input,errors,decoding_table)
16
17 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeDecodeError: 'charmap' codec can't decode byte 0xbd in position 1: character maps to <undefined>
So... maybe this is the wrong encoding?
Honestly, I don't really know where to start, since I'm not very familiar with the various Arabic encodings.
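This one is tricky. Your sqlite database is handing you data that was decoded with the wrong codec: the bytes were decoded as code page 437 when they should have been decoded as code page 720. You can fix this by reversing the incorrect decoding (encode back to cp437) and then decoding the resulting bytes correctly (as cp720):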
filename = '½ñΘ Ω⌐αε δτßí ñáƒóƒ ƒΘª¼á ƒΘßá∩í Θ¼∞⌐ 4-2016.xlsx'
filename_fixed = filename.encode('cp437').decode('cp720')
print(filename_fixed) # prints "سجل مرضى نقطة جباتا الخشب الطبية لشهر 4-2016.xlsx"
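Since you mention that most of your Arabic text is already fine, you probably don't want to apply that round trip blindly to every row. Here is a minimal sketch of a guarded version. The helper name repair_name and the sample list are made up for illustration, and the "result contains Arabic letters" check is my own assumption about how to tell a successful repair from a false positive, not something your data guarantees.

```python
def repair_name(name, wrong="cp437", right="cp720"):
    """Undo a wrong-codec decode; return the name unchanged if the repair does not apply.

    Hypothetical helper: the default codecs come from the answer above.
    """
    try:
        # Recover the original bytes, then decode them with the intended codec.
        candidate = name.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The name contains characters that cp437 cannot represent
        # (for example, text that is already proper Arabic), so leave it alone.
        return name
    # Assumption: a genuine repair yields at least one letter in the Arabic block.
    has_arabic = any("\u0600" <= ch <= "\u06ff" for ch in candidate)
    return candidate if has_arabic else name


# Illustrative inputs: one garbled name, one plain ASCII name, one correct Arabic name.
samples = [
    "½ñΘ Ω⌐αε δτßí ñáƒóƒ ƒΘª¼á ƒΘßá∩í Θ¼∞⌐ 4-2016.xlsx",
    "report 4-2016.xlsx",
    "تقرير.xlsx",
]
fixed = [repair_name(s) for s in samples]
for before, after in zip(samples, fixed):
    print(repr(before), "->", repr(after))
```

Only the first sample should change; the other two come back untouched. If some of your file names legitimately contain non-ASCII Latin or Greek characters, check a few results by eye before writing anything back to the database.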