Mapping of character encodings to maximum bytes per character

Iguananaut

I'm looking for a table that maps a given character encoding to its (maximum, in the case of variable-length encodings) bytes per character. For fixed-width encodings this is easy enough, though I don't know what that width is for some of the more esoteric encodings. For UTF-8 and the like it would also be nice to determine the maximum bytes per character as a function of the highest codepoint in a string, but this is less pressing.
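(For UTF-8 in particular that last point can be computed directly, since the width thresholds are fixed by the encoding's definition; here is a minimal sketch, with a function name of my own choosing:

def utf8_max_bytes_per_char(s):
    # Worst-case bytes per character for s under UTF-8, determined by
    # the highest codepoint present.  max(..., default=0) handles the
    # empty string (Python 3.4+).
    highest = max(map(ord, s), default=0)
    if highest < 0x80:
        return 1
    if highest < 0x800:
        return 2
    if highest < 0x10000:
        return 3
    return 4

But I don't know of a general equivalent for arbitrary encodings.)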

For some background (which you can ignore if you're not familiar with Numpy): I'm working on a prototype for an ndarray subclass that can, with some transparency, represent arrays of encoded bytes (including plain ASCII) as arrays of unicode strings without actually converting the entire array to UCS4 at once. The idea is that the underlying dtype is still an S<N> dtype, where <N> is the (maximum) number of bytes per string in the array, but item lookups and string methods decode the strings on the fly using the correct encoding. A very rough prototype can be seen here, though eventually parts of this will likely be implemented in C. The most important thing for my use case is efficient use of memory; repeated decoding and re-encoding of strings is acceptable overhead.
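To make the idea concrete, here is a minimal sketch of such a subclass. This is not the prototype linked above; the class name and details are hypothetical, and it only handles scalar item lookups:

import numpy as np

class EncodedStringArray(np.ndarray):
    """Stores strings as encoded bytes (S<N> dtype), decodes on access."""

    def __new__(cls, strings, encoding='utf-8'):
        encoded = [s.encode(encoding) for s in strings]
        # Size the dtype to the longest encoded string, in bytes.
        itemsize = max(len(b) for b in encoded)
        obj = np.array(encoded, dtype='S%d' % itemsize).view(cls)
        obj.encoding = encoding
        return obj

    def __array_finalize__(self, obj):
        self.encoding = getattr(obj, 'encoding', 'utf-8')

    def __getitem__(self, index):
        item = super().__getitem__(index)
        if isinstance(item, bytes):  # np.bytes_ subclasses bytes
            return item.decode(self.encoding)
        return item

>>> a = EncodedStringArray(['abc', 'héllo'], encoding='utf-8')
>>> a.dtype
dtype('S6')
>>> a[1]
'héllo'

The memory layout stays S<N>; only the item actually accessed pays the decoding cost.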

Anyway, because the underlying dtype is in bytes, it does not tell users anything useful about the lengths of strings that can be written to a given encoded text array. So having such a map for arbitrary encodings would be very useful, for improving the user interface if nothing else.
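For instance, with the maximum bytes per character in hand, the number of characters guaranteed to fit in an S<N> item is just integer division (the function name here is mine, for illustration):

def chars_that_always_fit(itemsize, max_bytes_per_char):
    # e.g. an S12 UTF-8 array (max 4 bytes per character) can always
    # hold strings of up to 12 // 4 == 3 characters.
    return itemsize // max_bytes_per_char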

Note: I found an answer to basically the same question that is specific to Java here: How can I programmatically determine the maximum size in bytes of a character in a specific charset? However, I haven't been able to find any equivalent in Python, nor a useful database of information whereby I might implement my own.

dan04

The brute-force approach: iterate over every possible Unicode codepoint and track the greatest number of bytes any single character encodes to.

def max_bytes_per_char(encoding):
    max_bytes = 0
    # 0x110000 is one past the highest Unicode codepoint (U+10FFFF).
    for codepoint in range(0x110000):
        try:
            encoded = chr(codepoint).encode(encoding)
            max_bytes = max(max_bytes, len(encoded))
        except UnicodeError:
            # Skip codepoints the codec cannot encode (e.g. surrogates,
            # or characters outside the target charset).
            pass
    return max_bytes


>>> max_bytes_per_char('UTF-8')
4
