[pygtk] coding of characters (in general and greek in particular)
abel deuring
adeuring at gmx.net
Sun Aug 20 21:37:52 WST 2006
Pascal DUCHATELLE wrote:
> I am trying to understand character coding and python (unicode...). Can
> someone tell me why in IDLE I have to enter the sequence u'\u03B1' to
> get an alpha greek sign displayed while in the texttest.py file that
> comes with the pyGTK demo package there are some 'cabalistic' characters
> (in the greek example section) that are displayed like greek symbols
> just right.
I could not find texttest.py anywhere, and I have never worked with
IDLE, so this is just a general remark:
The Python source code compiler needs to know what encoding is used
for the source code in order to properly build a Unicode string
object from an expression like u'umlauts äöü' or u'αβγδ' (the latter
should be the first four lower case letters of the Greek alphabet; I
am not sure if they will be properly represented when I send this
mail...). Without any further hint, the source code is, from the
perspective of the Python interpreter, just a sequence of bytes, where
the byte values <= 127 are well defined by the ASCII standard, but
where the meaning of values > 127 depends on the source encoding.
That is also why u'\u03B1' always works: the escape sequence itself
consists of ASCII characters only.
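
To illustrate the point about byte values, here is a small sketch. It
is written in Python 3 syntax (an assumption on my side; the scripts
in this mail are Python 2), and the variable names are made up for the
example. It reads one and the same byte under two encodings:

```python
# One byte above 127, read under two different assumed encodings.
raw = b"\xe1"
as_latin1 = raw.decode("iso8859-1")   # LATIN SMALL LETTER A WITH ACUTE
as_greek = raw.decode("iso8859-7")    # GREEK SMALL LETTER ALPHA

# A byte <= 127 means the same thing under both encodings.
ascii_same = b"a".decode("iso8859-1") == b"a".decode("iso8859-7")

# The escape sequence u'\u03B1' is spelled with ASCII characters only,
# so its meaning does not depend on the source encoding.
escape_source = r"u'\u03B1'"
escape_is_ascii = all(ord(ch) < 128 for ch in escape_source)

# ascii() keeps the output printable on any terminal.
print(ascii(as_latin1), ascii(as_greek), ascii_same, escape_is_ascii)
```

The same byte 0xE1 comes out as two different characters, while the
plain ASCII byte does not change.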
If you have a source code file with only the line:
print u'umlauts äöü and and greek αβγδ'
the Python interpreter will print the warning
Non-ASCII character '\xc3' in file pytest.py on line 1, but no
encoding declared; see http://www.python.org/peps/pep-0263.html for
details
On the machine I'm writing this on (a quite standard Suse 9.3
installation), where most editors use/assume UTF8 encoding, the
script gives this output:
umlauts äöü and and greek αβγδ
The umlauts and the Greek characters are UTF8-encoded in two bytes
by the editor, but the Python interpreter does not know this, and
assumes by default, I believe, iso8859-1 encoding.
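
What such a misreading does to the text can be sketched as follows.
This is only an illustration in Python 3 syntax, and it assumes that
the bytes really are taken as iso8859-1, which, as said, I only
believe to be the default:

```python
# The text from the test script, written with escapes so that this
# sketch itself does not depend on any source encoding.
text = "umlauts \u00e4\u00f6\u00fc and and greek \u03b1\u03b2\u03b3\u03b4"

# What a UTF-8 editor stores on disk: two bytes per non-ASCII letter.
utf8_bytes = text.encode("utf-8")

# What comes out if those bytes are taken to be iso8859-1 instead:
# every byte becomes one character, so each letter turns into two.
misread = utf8_bytes.decode("iso8859-1")

print(len(text), len(utf8_bytes), len(misread))
print(ascii(misread))
```

No information is lost by the misreading: encoding the garbled string
back to iso8859-1 and decoding it as UTF-8 recovers the original text.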
PEP 263 describes how to fix this:
#-*- coding: utf-8 -*-
print u'umlauts äöü and and greek αβγδ'
This script gives the expected output, because it was written with
an editor that writes a UTF8-encoded file and the "comment" in the
first line tells the interpreter explicitly how to convert string
constants into Unicode objects.
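
How the declaration is picked up can be tried out with the standard
tokenize module. The sketch below uses Python 3 syntax; the helper
declared_encoding and the two byte strings standing in for files are
made up for the example:

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding the tokenizer detects for these source bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _first_lines = tokenize.detect_encoding(readline)
    return encoding


# Two hypothetical source files as raw bytes. Both contain a Greek
# alpha, stored differently: two bytes in UTF-8, one in iso8859-7.
utf8_file = b"#-*- coding: utf-8 -*-\ns = u'\xce\xb1'\n"
greek_file = b"#-*- coding: iso8859-7 -*-\ns = u'\xe1'\n"

# Decoding each file with its own declared encoding gives the same text.
utf8_text = utf8_file.decode(declared_encoding(utf8_file))
greek_text = greek_file.decode(declared_encoding(greek_file))

print(declared_encoding(utf8_file))
print(ascii(utf8_text.splitlines()[1]), ascii(greek_text.splitlines()[1]))
```

The declaration has to match what the editor actually wrote. If the
first file were declared as iso8859-7, its two bytes would be read as
two unrelated characters.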
Another option would be to explicitly create a Python unicode object
from a "normal" Python string:
s = unicode('umlauts äöü and and greek αβγδ', 'utf-8')
print s
When this script is run, the Python interpreter also gives the
deprecation warning, but the text is printed correctly.
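
For comparison, a sketch of the same explicit conversion in Python 3
syntax, where the "normal" string corresponds to a bytes object. This
is an assumption about how one would write it today, not part of the
script above:

```python
# The UTF-8 bytes of the test string, spelled out as escapes.
raw = (b"umlauts \xc3\xa4\xc3\xb6\xc3\xbc and and greek "
       b"\xce\xb1\xce\xb2\xce\xb3\xce\xb4")

# Explicit conversion: the encoding is named by the caller.
s = raw.decode("utf-8")
same = str(raw, "utf-8") == s   # two spellings of the same conversion

# Naming the wrong encoding can fail outright: a lone 0xE1 byte
# (alpha in iso8859-7) is not valid UTF-8.
try:
    b"\xe1".decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

print(ascii(s), same, wrong_guess_failed)
```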
Abel