Comparing a Unicode input string with Unicode data read from a file

debugcn Published at Dev

Bishal Gautam

>>> string1 = " म नेपाली  हुँ"
>>> string1 = string1.split()
>>> string1[0]
'\xe0\xa4\xae'

import codecs

with codecs.open('nepaliwords.txt', 'r', 'utf-8') as f:
    for line in f:
        if string1[0] in line:
            print "matched string found in file"

Traceback (most recent call last):
  File "", line 3, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

The text file contains a large number of Nepali words in Unicode.

Am I doing something wrong when comparing the two Unicode strings?

How can I print the matched Unicode string?

Martijn Pieters

Your string1 is a byte string, encoded to UTF-8. It is not a Unicode string. But you used codecs.open() to have Python decode the file contents to unicode. Trying to then use your byte string with a containment test causes Python to implicitly decode the byte string to unicode to match types. This fails as the implicit decoding uses ASCII.
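To make the failure concrete, here is a minimal sketch that performs by hand the ASCII decode that Python 2 attempts implicitly during the containment test. The variable names (raw, error_message, decoded) are illustrative and not part of the original question; the snippet is written so that it should also run on Python 3, where the decode has to be explicit.

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes of the first word in the question ('\xe0\xa4\xae').
raw = u"म".encode("utf-8")

# Python 2 tries the equivalent of this ASCII decode behind the scenes
# when a byte string is compared with a unicode string.
try:
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the bytes were actually encoded in succeeds.
decoded = raw.decode("utf-8")
```

The message captured in error_message is the same "'ascii' codec can't decode byte 0xe0 in position 0" error shown in the traceback, while decoded is a proper Unicode string that can be compared with the decoded file contents.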

Decode string1 to unicode first:

string1 = " म नेपाली  हुँ"
string1 = string1.decode('utf8').split()[0]

or use a Unicode string literal instead:

string1 = u" म नेपाली  हुँ"
string1 = string1.split()[0]

Note the u at the start.
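As for printing the matched string, one possible approach is to collect the lines that contain the search term and print them afterwards. The sketch below assumes a small stand-in file written to a temporary directory instead of the real nepaliwords.txt; the helper name find_matching_lines and the sample words are hypothetical. It uses io.open rather than codecs.open so that it should behave the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os
import tempfile


def find_matching_lines(path, needle):
    """Return the lines of a UTF-8 text file that contain needle."""
    matches = []
    with io.open(path, "r", encoding="utf-8") as f:
        for line in f:
            # Both operands are Unicode strings, so no implicit decode occurs.
            if needle in line:
                matches.append(line.rstrip("\n"))
    return matches


# Hypothetical stand-in for nepaliwords.txt, one word per line.
sample_path = os.path.join(tempfile.mkdtemp(), "nepaliwords_sample.txt")
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write("म\nनेपाली\nमेरो\nकिताब\n")

# With unicode_literals, this literal is a Unicode string on Python 2 as well.
needle = " म नेपाली  हुँ".split()[0]
matches = find_matching_lines(sample_path, needle)
```

Looping over matches and printing each element then shows the matched lines, provided the terminal's encoding can represent Devanagari.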


Edited at 2021-07-11
