[pygtk] coding of characters (in general and greek in particular)

abel deuring adeuring at gmx.net
Sun Aug 20 21:37:52 WST 2006


Pascal DUCHATELLE wrote:
> I am trying to understand character coding and python (unicode...). Can
> someone tell me why in IDLE I have to enter the sequence u'\u03B1' to
> get an alpha greek sign displayed while in the texttest.py file that
> comes with the pyGTK demo package there are some 'cabalistic' characters
> (in the greek example section) that are displayed like greek symbols
> just right.

I could not find texttest.py anywhere, and I have never worked with
IDLE, so this is just a general remark:

The Python source code compiler needs to know which encoding is used
for the source code in order to properly build a Unicode string
object from an expression like u'umlauts äöü' or u'αβγδ' (the latter
should be the first four lower-case letters of the Greek alphabet; I
am not sure whether they will be represented properly when I send this
mail...). Without any further hint, the source code is, from the
perspective of the Python interpreter, just a sequence of bytes, where
the byte values <= 127 are well defined by the ASCII standard, but
where the meaning of values > 127 depends on the source encoding.
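To illustrate the point about byte values above 127, here is a minimal
sketch. It assumes Python 2.6 or later (it also runs on Python 3) and
the standard codecs iso8859-1 and iso8859-7; the variable names are
made up for this example. The single byte 0xE1 comes out as a
different character depending on the encoding it is read with, while
pure ASCII bytes come out the same everywhere.

```python
# Sketch: one byte value above 127, read under different encodings.
byte = b'\xe1'

as_latin1 = byte.decode('iso8859-1')   # LATIN SMALL LETTER A WITH ACUTE
as_greek = byte.decode('iso8859-7')    # GREEK SMALL LETTER ALPHA

# On its own, 0xE1 is not a complete UTF-8 sequence at all.
try:
    byte.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Bytes <= 127 decode identically under all of these encodings.
plain = b'alpha'
same_everywhere = (plain.decode('ascii') == plain.decode('iso8859-1')
                   == plain.decode('iso8859-7') == plain.decode('utf-8'))

print(same_everywhere, valid_utf8, as_latin1 == as_greek)
```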

If you have a source code file with only the line:

print u'umlauts äöü and and greek αβγδ'

the Python interpreter will print the warning

Non-ASCII character '\xc3' in file pytest.py on line 1, but no
encoding declared; see http://www.python.org/peps/pep-0263.html for
details

On the machine I'm writing this on (a quite standard Suse 9.3
installation), where most editors use or assume UTF-8 encoding, the
script gives this output:

umlauts Ã¤Ã¶Ã¼ and and greek Î±Î²Î³Î´

The umlauts and the Greek characters are each UTF-8-encoded in two
bytes by the editor, but the Python interpreter does not know this and
assumes by default, I believe, iso8859-1 encoding, so every byte is
turned into a character of its own.
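The effect can be reproduced without depending on any editor. This is
only a sketch under the assumption stated above, namely that the bytes
are read as iso8859-1; the names on_disk and misread are invented for
the example. The characters are written as \u escapes so that the
snippet itself is plain ASCII.

```python
# The seven non-ASCII characters of the test line, as \u escapes.
text = u'\u00e4\u00f6\u00fc\u03b1\u03b2\u03b3\u03b4'

# What a UTF-8 editor stores in the file: two bytes per character here.
on_disk = text.encode('utf-8')

# What comes out if those bytes are read as iso8859-1:
# one character per byte, so twice as many characters as intended.
misread = on_disk.decode('iso8859-1')

# Nothing is lost; reversing the wrong step restores the text.
recovered = misread.encode('iso8859-1').decode('utf-8')

print(len(text), len(on_disk), len(misread), recovered == text)
```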

PEP 263 describes how to fix this:

#-*- coding: utf-8 -*-
print u'umlauts äöü and and greek αβγδ'

This script gives the expected output (it was written with an editor
that writes a UTF-8-encoded file), because the "comment" in the first
line tells the interpreter explicitly how to convert string constants
into Unicode objects.

Another option would be to explicitly create a Python unicode object
from a "normal" Python string:

s = unicode('umlauts äöü and and greek αβγδ', 'utf-8')
print s

When this script is run, the Python interpreter also gives the
deprecation warning, but the text is printed correctly.
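A variant of the same idea, as a sketch: if the byte string is written
with \x escapes, the source file contains only ASCII and there is
nothing for the interpreter to warn about. The name text_type is a
helper invented for this example so that it runs on Python 2.6+ and on
Python 3, where unicode() no longer exists and str(bytes, encoding)
plays the same role.

```python
# UTF-8 bytes for Greek alpha and beta, spelled out as escapes.
raw = b'\xce\xb1\xce\xb2'

try:
    text_type = unicode   # Python 2
except NameError:
    text_type = str       # Python 3

# Explicit conversion from bytes to a Unicode object.
s = text_type(raw, 'utf-8')

# The decode() method of the byte string does the same thing.
same = (s == raw.decode('utf-8'))

# Naming the wrong encoding still "works", but gives one character per byte.
wrong = text_type(raw, 'iso8859-1')

print(len(s), len(wrong), same)
```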

Abel

