Python Codec Registry Error


Defining Your Own Encoding

Since Python comes with a large number of standard codecs already, it

is unlikely that you will need to define your own. If you do, there

are several base classes in codecs to make the process easier.

The first step is to understand the nature of the transformation

described by the encoding. For example, an invertcaps encoding

converts uppercase letters to lowercase, and lowercase letters to

uppercase. Here is a simple definition of an encoding function that

performs this transformation on an input string:

    import string

    def invertcaps(text):
        """Return new string with the case of all letters switched.
        """
        return ''.join(c.upper() if c in string.ascii_lowercase
                       else c.lower() if c in string.ascii_uppercase
                       else c
                       for c in text)

    if __name__ == '__main__':
        print invertcaps('ABC.def')
        print invertcaps('abc.DEF')

In this case, the encoder and decoder are the same function (as with ROT-13).

python codecs_invertcaps.py

abc.DEF

ABC.def

Although it is easy to understand, this implementation is not

efficient, especially for very large text strings. Fortunately,

codecs includes some helper functions for creating character

map based codecs such as invertcaps. A character map encoding is

made up of two dictionaries. The encoding map converts character

values from the input string to byte values in the output and the

decoding map goes the other way. Create your decoding map first,

and then use make_encoding_map to convert it to an encoding

map. The C functions charmap_encode and

charmap_decode use the maps to convert their input data

efficiently.

    import codecs
    import string

    # Map every character to itself
    decoding_map = codecs.make_identity_dict(range(256))

    # Make a list of pairs of ordinal values for the lower and upper case letters
    pairs = zip([ord(c) for c in string.ascii_lowercase],
                [ord(c) for c in string.ascii_uppercase])

    # Modify the mapping to convert upper to lower and lower to upper.
    decoding_map.update(dict((upper, lower) for (lower, upper) in pairs))
    decoding_map.update(dict((lower, upper) for (lower, upper) in pairs))

    # Create a separate encoding map.
    encoding_map = codecs.make_encoding_map(decoding_map)

    if __name__ == '__main__':
        print codecs.charmap_encode('abc.DEF', 'strict', encoding_map)
        print codecs.charmap_decode('abc.DEF', 'strict', decoding_map)
        print encoding_map == decoding_map

Although the encoding and decoding maps for invertcaps are the same,

that may not always be the case. make_encoding_map detects

situations where more than one input character is encoded to the same

output byte and replaces the encoding value with None to mark the

encoding as undefined.
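The collision handling is easiest to see with a deliberately ambiguous map. The following is a minimal Python 3 sketch (the listings in this section use Python 2 syntax); the three-entry decoding map is hypothetical and exists only for illustration.

```python
import codecs

# Hypothetical decoding map: bytes 0x41 and 0x42 both decode to "a",
# so there is no single byte that "a" should encode to.
decoding_map = {0x41: ord("a"), 0x42: ord("a"), 0x43: ord("c")}

encoding_map = codecs.make_encoding_map(decoding_map)

# The ambiguous target is marked undefined with None ...
print(encoding_map[ord("a")])
# ... while the unambiguous one maps back to its byte value.
print(encoding_map[ord("c")])
```

Passing a character whose entry is None to charmap_encode is then handled by the selected error mode, just like a character that is missing from the map.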

python codecs_invertcaps_charmap.py

('ABC.def', 7)

(u'ABC.def', 7)

True

The character map encoder and decoder support all of the standard

error handling methods described earlier, so you do not need to do any

extra work to comply with that part of the API.

    # -*- coding: utf-8 -*-
    import codecs
    from codecs_invertcaps_charmap import encoding_map

    text = u'pi: π'

    for error in ['ignore', 'replace', 'strict']:
        try:
            encoded = codecs.charmap_encode(text, error, encoding_map)
        except UnicodeEncodeError, err:
            encoded = str(err)
        print '{:7}: {}'.format(error, encoded)

Because the Unicode code point for π is not in the encoding map,

the strict error handling mode raises an exception.

python codecs_invertcaps_error.py

ignore : ('PI: ', 5)

replace: ('PI: ?', 5)

strict : 'charmap' codec can't encode character u'\u03c0' in position

4: character maps to <undefined>
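For readers on Python 3, the same experiment can be sketched as follows. This is an illustrative adaptation, not the original listing: it rebuilds the invertcaps maps inline so that it is self-contained, and the helper name try_encode is made up for the example. In Python 3 charmap_encode returns bytes rather than a str.

```python
import codecs
import string

# Identity map for the 256 byte values, then swap the ASCII letter cases.
decoding_map = codecs.make_identity_dict(range(256))
for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
    decoding_map[ord(lower)] = ord(upper)
    decoding_map[ord(upper)] = ord(lower)
encoding_map = codecs.make_encoding_map(decoding_map)


def try_encode(text, errors):
    """Return the charmap_encode() result, or the error text on failure."""
    try:
        return codecs.charmap_encode(text, errors, encoding_map)
    except UnicodeEncodeError as err:
        return str(err)


# U+03C0 (GREEK SMALL LETTER PI) is outside the 256-entry map.
results = {errors: try_encode("pi: \u03c0", errors)
           for errors in ("ignore", "replace", "strict")}

for errors, result in results.items():
    print("{:7}: {}".format(errors, result))
```

The ignore mode drops the unmappable character, replace substitutes a question mark (which is itself looked up in the map), and strict raises UnicodeEncodeError.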

After the encoding and decoding maps are defined, you need to set

up a few additional classes and register the encoding.

register() adds a search function to the registry so that when a

user wants to use your encoding, codecs can locate it. The

search function must take a single string argument with the name of

the encoding, and return a CodecInfo object if it knows the

encoding, or None if it does not.

    import codecs
    import encodings

    def search1(encoding):
        print 'search1: Searching for:', encoding
        return None

    def search2(encoding):
        print 'search2: Searching for:', encoding
        return None

    codecs.register(search1)
    codecs.register(search2)

    utf8 = codecs.lookup('utf-8')
    print 'UTF-8:', utf8

    try:
        unknown = codecs.lookup('no-such-encoding')
    except LookupError, err:
        print 'ERROR:', err

You can register multiple search functions, and each will be called in

turn until one returns a CodecInfo or the list is exhausted.

The internal search function registered by codecs knows how to

load the standard codecs such as UTF-8 from encodings, so those

names will never be passed to your search function.

python codecs_register.py

UTF-8:

search1: Searching for: no-such-encoding

search2: Searching for: no-such-encoding

ERROR: unknown encoding: no-such-encoding
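A Python 3 sketch of the same lookup behaviour is shown below. The search function and the seen list are hypothetical names used only for this illustration. Note that Python normalizes the name before calling search functions (it is lowercased, and recent versions also convert hyphens and spaces to underscores), so the recorded name may not match the spelling passed to lookup().

```python
import codecs

seen = []  # names that were passed to our search function


def search(encoding):
    """Hypothetical search function: record the name, then give up."""
    seen.append(encoding)
    return None


codecs.register(search)

# A standard codec is resolved before our function is consulted.
utf8 = codecs.lookup("utf-8")
print("UTF-8:", utf8.name)

# An unknown name falls through every search function and fails.
try:
    codecs.lookup("no-such-encoding")
    lookup_failed = False
except LookupError as err:
    lookup_failed = True
    print("ERROR:", err)

print("search saw:", seen)
```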

The CodecInfo instance returned by the search function tells

codecs how to encode and decode using all of the different

mechanisms supported: stateless, incremental, and stream.

codecs includes base classes that make setting up a character

map encoding easy. This example puts all of the pieces together to

register a search function that returns a CodecInfo instance

configured for the invertcaps codec.

    import codecs

    from codecs_invertcaps_charmap import encoding_map, decoding_map

    # Stateless encoder/decoder

    class InvertCapsCodec(codecs.Codec):
        def encode(self, input, errors='strict'):
            return codecs.charmap_encode(input, errors, encoding_map)

        def decode(self, input, errors='strict'):
            return codecs.charmap_decode(input, errors, decoding_map)

    # Incremental forms

    class InvertCapsIncrementalEncoder(codecs.IncrementalEncoder):
        def encode(self, input, final=False):
            return codecs.charmap_encode(input, self.errors, encoding_map)[0]

    class InvertCapsIncrementalDecoder(codecs.IncrementalDecoder):
        def decode(self, input, final=False):
            return codecs.charmap_decode(input, self.errors, decoding_map)[0]

    # Stream reader and writer

    class InvertCapsStreamReader(InvertCapsCodec, codecs.StreamReader):
        pass

    class InvertCapsStreamWriter(InvertCapsCodec, codecs.StreamWriter):
        pass

    # Register the codec search function

    def find_invertcaps(encoding):
        """Return the codec for 'invertcaps'.
        """
        if encoding == 'invertcaps':
            return codecs.CodecInfo(
                name='invertcaps',
                encode=InvertCapsCodec().encode,
                decode=InvertCapsCodec().decode,
                incrementalencoder=InvertCapsIncrementalEncoder,
                incrementaldecoder=InvertCapsIncrementalDecoder,
                streamreader=InvertCapsStreamReader,
                streamwriter=InvertCapsStreamWriter,
                )
        return None

    codecs.register(find_invertcaps)

    if __name__ == '__main__':

        # Stateless encoder/decoder
        encoder = codecs.getencoder('invertcaps')
        text = 'abc.DEF'
        encoded_text, consumed = encoder(text)
        print 'Encoder converted "{}" to "{}", consuming {} characters'.format(
            text, encoded_text, consumed)

        # Stream writer
        import sys
        writer = codecs.getwriter('invertcaps')(sys.stdout)
        print 'StreamWriter for stdout: ',
        writer.write('abc.DEF')
        print

        # Incremental decoder
        decoder_factory = codecs.getincrementaldecoder('invertcaps')
        decoder = decoder_factory()
        decoded_text_parts = []
        for c in encoded_text:
            decoded_text_parts.append(decoder.decode(c, final=False))
        decoded_text_parts.append(decoder.decode('', final=True))
        decoded_text = ''.join(decoded_text_parts)
        print 'IncrementalDecoder converted "{}" to "{}"'.format(
            encoded_text, decoded_text)

The stateless encoder/decoder base class is Codec. Override

encode and decode with your implementation (in this

case, calling charmap_encode and charmap_decode

respectively). Each method must return a tuple containing the

transformed data and the number of the input bytes or characters

consumed. Conveniently, charmap_encode and

charmap_decode already return that information.

IncrementalEncoder and IncrementalDecoder serve as

base classes for the incremental interfaces. The encode and

decode methods of the incremental classes are defined in such

a way that they only return the actual transformed data. Any

information about buffering is maintained as internal state. The

invertcaps encoding does not need to buffer data (it uses a one-to-one

mapping). For encodings that produce a different amount of output

depending on the data being processed, such as compression algorithms,

BufferedIncrementalEncoder and

BufferedIncrementalDecoder are more appropriate base classes,

since they manage the unprocessed portion of the input for you.

StreamReader and StreamWriter need encode

and decode methods, too, and since they are expected to return

the same value as the version from Codec you can use multiple

inheritance for the implementation.

python codecs_invertcaps_register.py

Encoder converted "abc.DEF" to "ABC.def", consuming 7 characters

StreamWriter for stdout: ABC.def

IncrementalDecoder converted "ABC.def" to "abc.DEF"
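On Python 3 the same structure still applies, with the difference that encoding turns str into bytes and decoding turns bytes back into str. The following is a minimal sketch under that assumption; it rebuilds the maps inline instead of importing codecs_invertcaps_charmap so that it runs on its own.

```python
import codecs
import string

# Rebuild the invertcaps maps (identity plus swapped ASCII letter cases).
decoding_map = codecs.make_identity_dict(range(256))
for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
    decoding_map[ord(lower)] = ord(upper)
    decoding_map[ord(upper)] = ord(lower)
encoding_map = codecs.make_encoding_map(decoding_map)


class InvertCapsCodec(codecs.Codec):
    """Stateless encoder/decoder."""

    def encode(self, input, errors="strict"):
        return codecs.charmap_encode(input, errors, encoding_map)

    def decode(self, input, errors="strict"):
        return codecs.charmap_decode(input, errors, decoding_map)


class InvertCapsIncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        return codecs.charmap_encode(input, self.errors, encoding_map)[0]


class InvertCapsIncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        return codecs.charmap_decode(input, self.errors, decoding_map)[0]


class InvertCapsStreamReader(InvertCapsCodec, codecs.StreamReader):
    pass


class InvertCapsStreamWriter(InvertCapsCodec, codecs.StreamWriter):
    pass


def find_invertcaps(encoding):
    """Search function: return the CodecInfo for 'invertcaps', else None."""
    if encoding == "invertcaps":
        return codecs.CodecInfo(
            name="invertcaps",
            encode=InvertCapsCodec().encode,
            decode=InvertCapsCodec().decode,
            incrementalencoder=InvertCapsIncrementalEncoder,
            incrementaldecoder=InvertCapsIncrementalDecoder,
            streamreader=InvertCapsStreamReader,
            streamwriter=InvertCapsStreamWriter,
        )
    return None


codecs.register(find_invertcaps)

# Once registered, the codec works through the normal str/bytes methods.
encoded = "abc.DEF".encode("invertcaps")
decoded = encoded.decode("invertcaps")
print(encoded, decoded)

# Incremental decoding, one byte at a time.
decoder = codecs.getincrementaldecoder("invertcaps")()
parts = [decoder.decode(bytes([b])) for b in encoded]
parts.append(decoder.decode(b"", final=True))
incremental = "".join(parts)
print(incremental)
```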


The codecs module provides stream and file interfaces for transcoding data in your program. It is most commonly used to work with Unicode text, but other encodings are also available for other purposes.


Messages (124)

msg58487 - Author: Mark Summerfield (mark) Date: 2007-12-12

I am not sure if this is a Python bug or simply a limitation of cmd.exe.

I am using Windows XP Home.

I run cmd.exe with the /u option and I have set my console font to

"Lucida Console" (the only TrueType font offered), and I run chcp 65001

to set the utf8 code page.

When I run the following program:

    for x in range(32, 2000):
        print("{0:5X} {0:c}".format(x))

one blank line is output.

But if I do chcp 1252 the program prints up to 7F before hitting a

unicode encoding error.

This is different behaviour from Python 2.5.1 which with a suitably

modified print line after chcp 65001 prints up to 7F and then fails

with IOError: [Errno 0] Error.

msg58621 - Author: Mark Summerfield (mark) Date: 2007-12-14

I've looked into this a bit more, and from what I can see, code page

65001 just doesn't work, so it is a Windows problem, not a Python problem.

A possible solution might be to read/write UTF-16, which managed Windows

applications can do.

msg58651 - Author: Christian Heimes (christian.heimes) Date: 2007-12-15

We are aware of multiple Windows-related problems. We are planning to

rewrite parts of the Windows-specific API to use the widechar variants.

Maybe that will help.

msg87086 - Author: Antoine Pitrou (pitrou) Date: 2009-05-03

Yes, it is a Windows problem. There simply doesn't seem to be a true

Unicode codepage for command-line apps. Recommend closing.

msg88059 - Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) Date: 2009-05-19

Just in case it helps, this behaviour is on Win XP Pro, Python 2.5.1:

First, I added an alias for cp65001 to utf_8 in

Lib/encodings/aliases.py.

Then, I opened a command prompt with a bitmap font.

    c:\windows\system32>python
    Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print u'\N{EM DASH}'
    â

I switched the font to Lucida Console, and retried without exiting the

python interpreter, although the behaviour is the same when exiting and

entering again:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 13] Permission denied

Then I tried by pressing Alt+0233 for é, which is invalid in my normal

cp1253 codepage:

    >>> print u'née'

and the interpreter exits without any information. So it does for:

    >>> a = u'née'

Then I created a UTF-8 text file named 'test65001.py':

    # -*- coding: utf_8 -*-
    a = u'néeα'
    print a

and tried to run it directly from the command line:

    c:\windows\system32>python d:\src\PYTHON\test65001.py
    néeαTraceback (most recent call last):
      File "d:\src\PYTHON\test65001.py", line 4, in <module>
        print a
    IOError: [Errno 2] No such file or directory

You see? It printed all the characters before failing.

Also the following works:

    c:\windows\system32>echo heéε
    heéε

and

    c:\windows\system32>echo heéε >D:\src\PYTHON\dummy.txt

creates successfully a UTF-8 file without any UTF-8 BOM marks at the

beginning.

So it's possible that it is a python bug, or at least something can be

done about it.

msg88077 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) Date: 2009-05-19

An immediate thing to do is to declare cp65001 as an encoding:

    Index: Lib/encodings/aliases.py
    --- Lib/encodings/aliases.py    (revision 72757)
    +++ Lib/encodings/aliases.py    (working copy)
    @@ -511,6 +511,7 @@
         'utf8'               : 'utf_8',
         'utf8_ucs2'          : 'utf_8',
         'utf8_ucs4'          : 'utf_8',
    +    'cp65001'            : 'utf_8',

         # uu_codec codec
         'uu'                 : 'uu_codec',

This is not enough unfortunately, because the win32 API function

WriteFile returns the number of characters written, not the number of

utf8 bytes:

    >>> print("\u0124\u0102abc")
    ĤĂabc
    c
    [44420 refs]

Additionally, there is a bug in ReadFile, which returns an empty

string (and no error) when a non-ASCII character is entered, which is

the behavior of an EOF condition.

Maybe the solution is to use the win32 console API directly.

msg92854 - Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) Date: 2009-09-19

Another note:

if one creates a dummy Stream object having a softspace attribute and a

write method that writes using os.write, as in

1432462

to replace sys.stdout and sys.stderr, then writes occur correctly,

without issues. Pre-requisites:

chcp 65001, Lucida Console font and cp65001 as an alias for UTF-8 in

encodings/aliases.py

This is Python 2.5.4 on Windows.

msg94445 - Author: Glenn Linderman (v+python) Date: 2009-10-25

With Python 3.1.1, the following batch file seems to be necessary to use

UTF-8 successfully from an XP console:

    set PYTHONIOENCODING=UTF-8
    cmd /u /k chcp 65001
    set PYTHONIOENCODING=
    exit

The cmd line seems to be necessary because of Windows having

compatibility issues, but it seems that Python should notice the cp65001

and not need the PYTHONIOENCODING stuff.

msg94480 - Author: Mark Summerfield (mark) Date: 2009-10-26

Glenn Linderman's fix pretty well works for me on XP Home. I can print

every Unicode character up to and including U+D7FF (although most just

come out as rectangles, at least I don't get encoding errors).

It fails at U+D800 with message:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in

position 17: surrogates not allowed

I also tried U+D801 and got the same error.
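The error reported here does not depend on the console and can be reproduced on any platform. A minimal Python 3 sketch (the variable names are chosen for illustration only):

```python
lone_surrogate = "\ud800"

# Encoding a lone surrogate to UTF-8 is rejected by the codec itself.
try:
    lone_surrogate.encode("utf-8")
    error_reason = None
except UnicodeEncodeError as err:
    error_reason = err.reason
print(error_reason)

# U+D7FF, the last code point before the surrogate range, encodes normally.
before = "\ud7ff".encode("utf-8")
print(before)

# The "surrogatepass" error handler lets the raw code unit through.
passed = lone_surrogate.encode("utf-8", "surrogatepass")
print(passed)
```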

Nonetheless, this is much better than before.

msg94483 - Author: Marc-Andre Lemburg (lemburg) Date: 2009-10-26

Mark Summerfield wrote:

> Mark Summerfield added the comment:
>
> Glenn Linderman's fix pretty well works for me on XP Home. I can print
> every Unicode character up to and including U+D7FF (although most just
> come out as rectangles, at least I don't get encoding errors).
>
> It fails at U+D800 with message:
>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 17: surrogates not allowed
>
> I also tried U+D801 and got the same error.

That's normal and expected: D800 is the start of the surrogate

ranges, which are only allowed in pairs in UTF-8.

msg94496 - Author: Glenn Linderman (v+python) Date: 2009-10-26

The choice of the Lucida Console or the Consolas font cures most of the

rectangle problems. Those are just a limitation of the selected font

for the console window.

msg108173 - Author: Christoph Burgmer (christoph) Date: 2010-06-19

Will this bug be tackled for Python 2.7?

And is there a way to get hold of the access denied error?

Here are my steps to reproduce:

I started the console with "cmd /u /k chcp 65001"

_______________________________________________________________________

    Aktive Codepage: 65001.

    C:\Dokumente und Einstellungen\root>set PYTHONIOENCODING=UTF-8

    C:\Dokumente und Einstellungen\root>d:

    D:\>cd Python31

    D:\Python31>python
    Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) [MSC v.1500 32 bit (Intel)] on win32
    >>> print("\u573a")

I see a rectangle on screen but obviously c&p works.

msg108228 - Author: STINNER Victor (haypo) Date: 2010-06-20

> Maybe the solution is to use the win32 console API directly

Yes, it is the best solution because it avoids the horrible mbcs encoding.

About cp65001: it is not exactly the same encoding as utf-8 and so it cannot be used as an alias to utf-8: see issue #6058.

msg116801 - Author: Mark Lawrence (BreamoreBoy) Date: 2010-09-18

Brian/Tim what's your take on this?

msg120414 - Author: STINNER Victor (haypo) Date: 2010-11-04

I wrote a small function to call WriteConsoleOutputA() and WriteConsoleOutputW() in Python to do some tests. It works correctly, except if I change the code page using the chcp command. It looks like the problem is that the chcp command changes the console code page and the ANSI code page, but it should only change the ANSI code page (and not the console code page).

chcp command

The chcp command changes the console code page, but in practice, the console still expects the OEM code page (eg. cp850 on my french setup). Example:

    C:\> python.exe -c "import sys; print(sys.stdout.encoding)"
    cp850
    C:\> chcp 65001
    C:\> python.exe
    Fatal Python error: Py_Initialize: can't initialize sys standard streams
    LookupError: unknown encoding: cp65001
    C:\> SET PYTHONIOENCODING=utf-8
    >>> import sys
    >>> sys.stdout.write("\xe9\n")
    Ã
    2
    >>> sys.stdout.buffer.write("\xe9\n".encode("utf8"))
    3
    >>> sys.stdout.buffer.write("\xe9\n".encode("cp850"))
    é

os.device_encoding(1) uses GetConsoleOutputCP(), which gives 65001. It should maybe use GetOEMCP() instead? Or the chcp command should be fixed?

Setting the console code page looks to be a bad idea, because if I type "é" using my keyboard, a random character (eg. U+0002) is displayed instead.
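The two values compared in this message can be inspected side by side. The following Python 3 sketch uses a hypothetical helper, describe_stdout; os.device_encoding() returns None when the descriptor is not connected to a terminal, so the second value is often None when output is piped or captured.

```python
import os
import sys


def describe_stdout():
    """Return (stream encoding, device encoding) for standard output."""
    try:
        fd = sys.stdout.fileno()
    except (AttributeError, OSError, ValueError):
        fd = None  # stdout was replaced by an object without a descriptor
    device = os.device_encoding(fd) if fd is not None else None
    return getattr(sys.stdout, "encoding", None), device


stream_encoding, device_encoding = describe_stdout()
print("sys.stdout.encoding:", stream_encoding)
print("os.device_encoding: ", device_encoding)
```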

WriteConsoleOutputA and WriteConsoleOutputW

Without touching the code page

------------------------------

If the character can be rendered by the current font (eg. U+00E9): WriteConsoleOutputA() and WriteConsoleOutputW() work correctly.

If the character cannot be rendered by the current font, but there is a replacement character (eg. U+0141 replaced by U+0041): WriteConsoleOutputA() cannot be used (U+0141 cannot be encoded to the code page), WriteConsoleOutputW() writes U+0141 but the console contains U+0041 (I checked using ReadConsoleOutputW()) and U+0041 is displayed. It works like the mbcs encoding, the behaviour looks correct.

If the character cannot be rendered by the current font, but there is a replacement character (eg. U+042D): WriteConsoleOutputA() cannot be used (U+042D cannot be encoded to the code page), WriteConsoleOutputW() writes U+042D but U+003d (?) is displayed instead. The behaviour looks correct.

chcp 65001

----------

Using the "chcp 65001" command (+ "set PYTHONIOENCODING=utf-8" to avoid the fatal error), it becomes worse: the result depends on the font.

Using raster font:

- (ANSI) write "\xe9".encode("cp850") using WriteConsoleOutputA() displays U+00e9 (é), whereas the console output code page is cp65001 (I checked using GetConsoleOutputCP())

- (ANSI) write "\xe9".encode("utf-8") using WriteConsoleOutputA() displays à (mojibake)

- (UNICODE) write "\xe9" using WriteConsoleOutputW() displays a random character (U+0002, U+0008, U+0069, U+00b0, ...)

Using Lucida (TrueType font):

- (ANSI) write "\xe9".encode("cp850") using WriteConsoleOutputA() displays U+0000 (?)

- (UNICODE) write "\xe9" using WriteConsoleOutputW() works correctly (displays U+00e9), even with "\u0141", it works correctly (displays U+0141)

msg120415 - Author: STINNER Victor (haypo) Date: 2010-11-04

sys_write_stdtout.patch: Create sys.write_stdout() test function to call WriteConsoleOutputA() or WriteConsoleOutputW() depending on the input types (bytes or str).

msg120416 - Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) Date: 2010-11-04

If you want any kind of Unicode output in the console, the font must be an "official" MS console TTF ("official" as defined by the Windows version); I believe only Lucida Console and Consolas are the ones with all MS private settings turned on inside the font file.

msg120700 - Author: STINNER Victor (haypo) Date: 2010-11-08

I don't understand exactly the goal of this issue. Different people described various bugs of the Windows console, but I don't see any problem with Python here. It looks like it's just not possible to display Unicode correctly with the Windows console (the whole Unicode charset, not the current code page subset).

- 65001 code page: it's not the same encoding as utf-8 and so it cannot be set as an alias to utf-8 (see #6058): nothing to do, or maybe document that PYTHONIOENCODING=utf-8 is a workaround. But if you do that, you may get strange errors when writing to stdout or stderr like "IOError: [Errno 13] Permission denied" or "IOError: [Errno 2] No such file or directory".

- chcp command sets the console encoding, which is stupid because the console still expects text encoded to the previous code page: Windows (chcp command) bug. The chcp command should not be used; it doesn't solve any problem, it just makes the situation worse.

- use the console API instead of read()/write() to fix this issue: it doesn't work, the console is completely buggy (msg120414): Windows (console) bug.

- use Lucida Console font: avoids some issues. I don't think that the Python interpreter should configure the console (using SetCurrentConsoleFontEx()); it's not the role of Python.

To me, there is nothing to do, and so I close the bug.

If you would like to fix a particular Python bug, open a new specific issue. If you consider that I'm wrong, Python should fix this issue and you know how, please reopen it.

msg125823 - Author: David-Sarah Hopwood (davidsarah) Date: 2011-01-09

It is certainly possible to write Unicode to the console successfully using WriteConsoleW. This works regardless of the console code page, including 65001. The code here does so (it's for Python 2.x, but you'd be calling WriteConsoleW from C anyway).

WriteConsoleW has one bug that I know of, which is that it fails when writing more than 26608 characters at once. That's easy to work around by limiting the amount of data passed in a single call.
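The chunking workaround described above could look roughly like the following Python 3 sketch. It is not the code referred to in this message: the helper names, the chunk size of 10000 UTF-16 code units, and the lack of handling for partial writes are all assumptions made for the illustration, and it presumes that standard output really is a console. Only the chunking helper is portable; write_console() runs on Windows only.

```python
import sys

# Assumed chunk size in UTF-16 code units, chosen to stay well below the
# roughly 26608-character limit reported above.
MAX_UNITS = 10000


def utf16_units(text):
    """Count the UTF-16 code units needed for text."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def utf16_chunks(text, max_units=MAX_UNITS):
    """Yield pieces of text that each need at most max_units code units."""
    chunk, units = [], 0
    for ch in text:
        width = 2 if ord(ch) > 0xFFFF else 1
        if chunk and units + width > max_units:
            yield "".join(chunk)
            chunk, units = [], 0
        chunk.append(ch)
        units += width
    if chunk:
        yield "".join(chunk)


def write_console(text):
    """Write text to the console with WriteConsoleW (Windows only)."""
    if sys.platform != "win32":
        raise OSError("WriteConsoleW is only available on Windows")
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID,
    ]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    STD_OUTPUT_HANDLE = 0xFFFFFFF5  # (DWORD)-11
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    written = wintypes.DWORD(0)
    for chunk in utf16_chunks(text):
        ok = kernel32.WriteConsoleW(
            handle, chunk, utf16_units(chunk), ctypes.byref(written), None)
        if not ok:
            raise ctypes.WinError(ctypes.get_last_error())
```

Splitting on code points rather than code units means a surrogate pair is never cut in half between two calls.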

Fonts are not Python's problem, but encoding is. It doesn't make sense to fail to output the right characters just because some users might not have selected fonts that can display those characters. This bug should be reopened.

(For completeness, it is possible to display Unicode on the console using fonts other than Lucida Console and Consolas, but it requires a registry hack.)

msg125824 - Author: Glenn Linderman (v+python) Date: 2011-01-09

Interesting.

I was able to tweak David-Sarah's code to work with Python 3.x, mostly doing things that 2to3 would probably do: changing unicode() to str(), dropping u from u'...', etc.

I skipped the unmangling of command-line arguments, because it produced an error I didn't understand, about needing a buffer protocol. But I'll attach David-Sarah's code + tweaks + a test case showing output of the Cyrillic alphabet to a console with code page 437 (at least, on my Win7-64 box, that is what it is).

Nice work, David-Sarah. I'm quite sure this is not in a form usable inside Python 3, but it shows exactly what could be done inside Python 3 to make things work, and gives us a workaround if Python 3 is not fixed.

msg125826 - Author: David-Sarah Hopwood (davidsarah) Date: 2011-01-09

Glenn Linderman wrote:

> I skipped the unmangling of command-line arguments, because it produced an error I didn't understand, about needing a buffer protocol.

If I understand correctly, that part isn't needed on Python 3 because issue2128 is already fixed there.

msg125833 - Author: STINNER Victor (haypo) Date: 2011-01-09

> It is certainly possible to write Unicode to the console
> successfully using WriteConsoleW

Did you try with characters not encodable to the code page and with characters that cannot be rendered by the font?

See msg120414 for my tests with WriteConsoleOutputW.

msg125852 - Author: David-Sarah Hopwood (davidsarah) Date: 2011-01-09

haypo wrote:

> davidsarah wrote:
>> It is certainly possible to write Unicode to the console
>
> Did you try with characters not encodable to the code page and with characters that cannot be rendered by the font?

Yes, characters not encodable to the code page do work (as confirmed by Glenn Linderman, since code page 437 does not include Cyrillic).

Characters that cannot be rendered by the font print as missing-glyph boxes, as expected. They don't cause any other problem, and they can be cut-and-pasted to other Unicode-aware applications, showing up as the original characters.

> See msg120414 for my tests with WriteConsoleOutputW

Even if it handled encoding correctly, WriteConsoleOutputW would not be the right API to use in any case, because it prints to a rectangle of characters without scrolling. WriteConsoleW does scroll in the same way that printing to a console output stream normally would. Redirection to a non-console stream can be detected and handled differently, as the code in unicode2.py does.

msg125877 - Author: Glenn Linderman (v+python) Date: 2011-01-10

I would certainly be delighted if someone would reopen this issue, and figure out how to translate unicode2.py to Python internals so that Python's console I/O on Windows would support Unicode "out of the box".

Otherwise, I'll have to include the equivalent of unicode2.py in all my Python programs, because right now, I'm including instructions for the user to (1) choose Lucida or Consolas font if they can't figure out any other font that gets rid of the square boxes, (2) chcp 65001, (3) set PYTHONIOENCODING=UTF-8.

Having this capability inside Python (or my programs) will enable me to eliminate two-thirds of the geeky instructions for my users. But it seems like a very appropriate capability to have within Python, especially Python 3.x with its preference for and support of Unicode in so many other ways.

msg125889 - Author: David-Sarah Hopwood (davidsarah) Date: 2011-01-10

I'll have a look at the Py3k I/O internals and see what I can do.

(Reopening a bug appears to need Coordinator permissions.)

msg125890 - Author: Tim Golden (tim.golden) Date: 2011-01-10

Reopening as there seems to be some possibility of progress.

msg125898 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) Date: 2011-01-10

The script unicode2.py uses the console STD_OUTPUT_HANDLE iff sys.stdout.fileno()==1.

But is it always the case? What about pythonw.exe?

Also some applications may redirect fd 1: I'm sure that py.test does this, and IIRC Apache also redirects file descriptors.

msg125899 - Author: STINNER Victor (haypo) Date: 2011-01-10

amaury> The script unicode2.py uses the console STD_OUTPUT_HANDLE iff

amaury> sys.stdout.fileno()==1

Interesting article about the Windows console:

There is an example which has many tests to check that stdout is the Windows

console and not a pipe or something else.

msg125938 - Author: David-Sarah Hopwood (davidsarah) Date: 2011-01-10

> The script unicode2.py uses the console STD_OUTPUT_HANDLE iff sys.stdout.fileno()==1.

You may have missed "if not_a_console(hStdout): real_stdout = False".

not_a_console uses GetFileType and GetConsoleMode to check whether that handle is directed to something other than a console.

> But is it always the case?

The technique used here for detecting a console is almost the same as the code for IsConsoleRedirected or in WriteLineRight (I got it from that blog, can't remember exactly which page).

This code will give a false positive in the strange corner case that stdout/stderr is redirected to a console *input* handle. It might be better to use GetConsoleScreenBufferInfo instead of GetConsoleMode, as suggested by 3650507.

> What about pythonw.exe?

I just tested that, using pythonw run from cmd.exe with stdout redirected to a file; it works as intended. It also works for both console and non-console cases when the handles are inherited from a parent process.

Incidentally, what's the earliest supported Windows version for Py3k? I see that mentions Windows ME. I can fairly easily make it fall back to never using WriteConsoleW on Windows ME, if that's necessary.

msg125942 - Author: David-Sarah Hopwood (davidsarah) Date: 2011-01-10

Note: Michael Kaplan's code checks whether GetConsoleMode failed due to ERROR_INVALID_HANDLE. My code intentionally doesn't do that, because it is correct and conservative to fall back to the non-console behaviour when there is *any* error from GetConsoleMode. (It could also fail due to not having the GENERIC_READ right on the handle, for example.)

msg125947 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) Date: 2011-01-10

Even if python.exe starts normally, py.test for example uses os.dup2() to redirect the file descriptors 1 and 2 to temporary files. sys.stdout.fileno() is still 1, the STD_OUTPUT_HANDLE did not change, but normal print() now goes to a file; but the proposed script won't detect this and will write to the console.
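The kind of descriptor-level redirection described here can be sketched in a few lines of Python 3. This is an illustration only, not py.test's implementation; the helper name run_with_fd1_redirected is made up for the example.

```python
import os
import sys
import tempfile


def run_with_fd1_redirected(func):
    """Call func() while descriptor 1 points at a temporary file.

    Returns the bytes that were written to descriptor 1 in the meantime.
    """
    sys.stdout.flush()
    saved = os.dup(1)                 # keep the original destination
    with tempfile.TemporaryFile() as tmp:
        os.dup2(tmp.fileno(), 1)      # descriptor 1 now refers to the file
        try:
            func()
            sys.stdout.flush()
        finally:
            os.dup2(saved, 1)         # restore the original destination
            os.close(saved)
        tmp.seek(0)
        return tmp.read()


captured = run_with_fd1_redirected(lambda: os.write(1, b"written to fd 1\n"))
print(repr(captured))
```

Throughout the redirection the descriptor number stays 1, which is why a check based only on sys.stdout.fileno() cannot tell that the output no longer goes to a console.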

Somehow we should extract the file handle from the file descriptor, with a call to _get_osfhandle() for example.

msg125956 - Author: David-Sarah Hopwood (davidsarah) Date: 2011-01-10

> os.dup2

Good point, thanks.

It would work to change os.dup2 so that if its second argument is 0, 1, or 2, it calls _get_osfhandle to get the Windows handle for that fd, and then reruns the console-detection logic. That would even allow Unicode output to work after redirection to a different console.
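A rough Python 3 sketch of that idea, using ctypes, is shown below. It is not the code from unicode2.py: the function name fd_is_console is hypothetical, msvcrt.get_osfhandle() stands in for the C-level _get_osfhandle, and the GetFileType/GetConsoleMode test follows the description given earlier in this thread. On platforms other than Windows the function simply returns False.

```python
import sys

FILE_TYPE_CHAR = 0x0002
FILE_TYPE_REMOTE = 0x8000


def fd_is_console(fd):
    """Return True if fd refers to a Windows console, False otherwise."""
    if sys.platform != "win32":
        return False
    import ctypes
    import msvcrt
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetFileType.argtypes = [wintypes.HANDLE]
    kernel32.GetFileType.restype = wintypes.DWORD
    kernel32.GetConsoleMode.argtypes = [
        wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
    kernel32.GetConsoleMode.restype = wintypes.BOOL

    # Map the C runtime descriptor to the underlying Windows handle, so a
    # descriptor that was redirected with os.dup2() is examined correctly.
    try:
        handle = msvcrt.get_osfhandle(fd)
    except OSError:
        return False

    if (kernel32.GetFileType(handle) & ~FILE_TYPE_REMOTE) != FILE_TYPE_CHAR:
        return False
    mode = wintypes.DWORD()
    # Any failure of GetConsoleMode is treated as "not a console".
    return bool(kernel32.GetConsoleMode(handle, ctypes.byref(mode)))


print(fd_is_console(1))
```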

Programs that directly called the CRT dup2 or SetStdHandle would bypass this. Can we consider such programs to be broken? Methinks a documentation patch for os.dup2 would be sufficient, something like:

"When fd1 refers to the standard input, output, or error handles (0, 1 and 2 respectively), this function also ensures that state associated with Python's initial sys.{stdin,stdout,stderr} streams is correctly updated if needed. It should therefore be used in preference to calling the C library's dup2, or similar APIs such as SetStdHandle on Windows."

msg126286 - Author: Terry J. Reedy (terry.reedy) Date: 2011-01-14

says

Name: Win9x, WinME, NT4

Unsupported in: Python 2.6 warning in 2.5 installer

Code removed in: Python 2.6

Only xp now. email sent to webmaster

Even if the best fix only applies to win7, please include it.

msg126288 - Author: Brian Curtin (brian.curtin) Date: 2011-01-14

I think we even agreed to drop 2000, although the PEP hasn't been updated and I couldn't find the supposed email where this was said.

For implementing functionality that isn't supported on all Windows versions or architectures, you can look at PC/winreg.c for a few examples. DisableReflectionKey is a good example off the top of my head.

msg126303 - Author: STINNER Victor (haypo) Date: 2011-01-14

Here are some results of my test of unicode2.py. I'm testing py3k on Windows XP, OEM: cp850, ANSI: cp1252.

Raster fonts

------------

With a fresh console, unicode2.py displays . .. input() accepts characters encodable to the OEM code page.

If I set the code page to 65001 (chcp program + set PYTHONIOENCODING=utf-8; or SetConsoleCP() + SetConsoleOutputCP()), it displays weird characters. input() accepts ASCII characters, but non-ASCII characters (encodable to the console and OEM code pages) display weird characters (smileys, control characters?).

Lucida console

--------------

With my system code page (OEM: cp850), characters not encodable to the code pages are displayed correctly. I can type some non-ASCII characters (encodable to the code page). If I copy/paste characters not encodable to the code page, they are replaced by a similar glyph (e.g. Ł becomes L) or. . .

If I set the code page to 65001, all characters are still correctly displayed. But I cannot type non-ASCII characters anymore: input() fails with EOFError (I suppose that Python gets control characters).

Redirect output to a pipe

-------------------------

I patched unicode2.py to use sys.stdout.buffer instead of sys.stdout for the UnicodeOutput stream. I also patched UnicodeOutput to replace \n by \r\n.

It works correctly with any character. No UTF-8 BOM is written. But "Here 1" is written at the end. I suppose that sys.stdout should be flushed before the creation of UnicodeOutput.

But it always uses UTF-8. I don't know if UTF-8 is well supported by every application on Windows.

Without unicode2.py, only characters encodable to the OEM code page are supported, and \n is used as the end-of-line string.
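The redirected-output behaviour described above (always UTF-8, \n replaced by \r\n, bytes sent to the binary stream) can be sketched as follows. This is not the code from unicode2.py; the class name `Utf8PipeWriter` is hypothetical and the sketch assumes the caller never passes text that already contains \r\n.

```python
import io


class Utf8PipeWriter:
    """Hypothetical text writer for redirected output: UTF-8 with CRLF newlines."""

    def __init__(self, binary_stream):
        self._stream = binary_stream  # e.g. sys.stdout.buffer

    def write(self, text):
        # Translate newlines first, then encode; no BOM is ever written.
        data = text.replace("\n", "\r\n").encode("utf-8")
        self._stream.write(data)
        return len(text)

    def flush(self):
        self._stream.flush()
```

Flushing sys.stdout before installing such a writer avoids earlier buffered text appearing after later output.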

Let's try to summarize

----------------------

Tests:

d1 Display characters encodable to the console code page

t1 Type characters encodable to the console code page

d2 Display characters not encodable to any code page

t2 Type characters not encodable to any code page

I'm using Windows with OEM=cp850 and ANSI=cp1252. For test t2, I copy -Ł and paste it to the console (right click on the window title > Edit > Paste).

Raster fonts, console cp850:

d1 ok

t1 ok

d2 FAIL: -Ł is displayed. -L

t2 FAIL: -Ł is read as. -L

Raster fonts, console cp65001:

d1 FAIL: é is displayed as 2 strange glyphs

t1 FAIL: EOFError

d2 FAIL: only display unreadable glyphs

t2 FAIL: EOFError

Lucida console, console cp850:

d2 ok

Lucida console, console cp65001:

So, setting the console code page to 65001 doesn't solve any issue, but it breaks input (typing with the keyboard or pasting text).

With Raster fonts or Lucida Console, it's possible to display characters encodable to the code page. That is not new: it's already possible with Python 3. But displaying characters not encodable to the code page works with unicode2.py and Lucida Console, which is something new :-)

For the input, I suppose that we also need to use a Windows console function to support unencodable characters.

msg126304 - Author: STINNER Victor (haypo) - Date: 2011-01-14

, because right now, I'm including instructions for the user to

1) choose Lucida or Consolas font if they can't figure out any other font that gets rid of the square boxes

2) chcp 65001

3) set PYTHONIOENCODING=UTF-8

Why do you set the code page to 65001? In all my tests on Windows XP, it always breaks the standard input.

msg126308 - Author: Glenn Linderman (v+python) - Date: 2011-01-15

Victor said:

Why do you set the code page to 65001? In all my tests on Windows XP, it always breaks the standard input.

My response:

Because when I searched Windows for Unicode and/or UTF-8 stuff, I found 65001, and it seems like it might help, and it does a bit. And then I find PYTHONIOENCODING, and that helps some. And that got me something that works better enough than what I had before, so I quit searching.

You did a better job of analyzing and testing all the cases. I will have to go subtract the 65001 part and confirm your results; maybe it is useless now that other pieces of the puzzle are in place. Certainly with David-Sarah's code it seems not to be needed. Whether it was a necessary part of the previous workaround I am not sure, because I only tried a limited number of cases while looking for something that worked well enough, and I had neither the knowledge to find David-Sarah's solution nor a good enough testing methodology to try the pieces independently.

Thank you for your interest in this issue.

msg126319 - Author: sorin (sorin) - Date: 2011-01-15

Remember that cp65001 cannot be set on Windows. Also please read and contact the author, Michael Kaplan from Microsoft, if you have more questions. I'm sure he will be glad to help.

msg127782 - Author: David-Sarah Hopwood (davidsarah) - Date: 2011-02-03

Feedback from Julie Solon of Microsoft:

These console functions share a per-process heap that is 64K. There is some overhead, the heap can get fragmented, and calls from multiple threads all affect how much is available for this buffer.

I am working to update the documentation for this function (WriteConsoleW) and other affected functions with information along these lines, and will post it within the next week or two.

I replied thanking her and asking for clarification:

When you say that the heap can get fragmented, is this true only when there are concurrent calls to the console functions, or can it occur even with single-threaded use? I'm trying to determine whether acquiring a process-global lock while calling these functions would be sufficient to ensure that the available heap space will not be unexpectedly low. (This assumes that the functions are not used outside the lock by other libraries in the same process.)

ReadConsoleW seems also to be affected, incidentally.

I've asked for clarification about whether acquiring a process-global lock when using these functions

Julie

msg131657 - Author: STINNER Victor (haypo) - Date: 2011-03-21

I did some tests with WriteConsoleW():

- with raster fonts, U+00E9 is displayed as é, U+0141 as L and U+042D as. (good, works as expected)

- with the TrueType font Lucida, U+00E9 is displayed as é, U+0141 as Ł and U+042D as Э (perfect, all characters are rendered correctly)

Now I agree that WriteConsoleW() is the best solution to fix this issue.

My test code (added to Python/sysmodule.c):

---------

    static PyObject *
    sys_write_stdout(PyObject *self, PyObject *args)
    {
        PyObject *textobj;
        wchar_t *text;
        DWORD written, total;
        Py_ssize_t len, chunk;
        HANDLE console;
        BOOL ok;

        if (!PyArg_ParseTuple(args, "U:write_stdout", &textobj))
            return NULL;

        console = GetStdHandle(STD_OUTPUT_HANDLE);
        if (console == INVALID_HANDLE_VALUE) {
            PyErr_SetFromWindowsErr(GetLastError());
            return NULL;
        }

        text = PyUnicode_AS_UNICODE(textobj);
        len = PyUnicode_GET_SIZE(textobj);
        total = 0;
        while (len != 0) {
            if (len > 10000)
                /* WriteConsoleW is limited to 64 KB (32,768 UTF-16 units), but
                   this limit depends on the heap usage. Use a safe limit of
                   10,000 UTF-16 units. */
                chunk = 10000;
            else
                chunk = len;
            ok = WriteConsoleW(console, text, chunk, &written, NULL);
            if (!ok)
                break;
            /* Advance past the units that were actually written. */
            text += written;
            len -= written;
            total += written;
        }
        return PyLong_FromUnsignedLong(total);
    }

---------
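The same chunking idea can be sketched in pure Python. This is only an illustration, not part of the patch: `utf16_chunks` is a hypothetical helper, the 10,000-unit limit is taken from the comment in the C code, and the sketch additionally avoids splitting a surrogate pair across two chunks (it assumes `max_units` is at least 2).

```python
def utf16_chunks(text, max_units=10000):
    """Yield pieces of text, each at most max_units UTF-16 code units long."""
    chunk, units = [], 0
    for ch in text:
        # Characters outside the BMP need a surrogate pair (2 code units).
        n = 2 if ord(ch) > 0xFFFF else 1
        if units + n > max_units and chunk:
            yield "".join(chunk)
            chunk, units = [], 0
        chunk.append(ch)
        units += n
    if chunk:
        yield "".join(chunk)
```

Each yielded piece could then be passed to one console write call, so that no single call exceeds the size that the shared console heap is assumed to handle.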

The question is now how to integrate WriteConsoleW() into Python without breaking the API, for example:

- Should sys.stdout be a TextIOWrapper or not?

- Should sys.stdout.fileno() return 1 or raise an error?

- What about sys.stdout.buffer: should sys.stdout.buffer.write() call WriteConsoleA(), or should sys.stdout not have a buffer attribute? I think that many modules and programs now rely on sys.stdout.buffer to write bytes directly into stdout. There is at least python -m base64.

- Should we use ReadConsoleW() for stdin?

msg131854 - Author: David-Sarah Hopwood (davidsarah) - Date: 2011-03-23

For anyone wondering about the hold-up on this bug, I ended up switching to Ubuntu. Not to worry, I now have Python 3 building in XP under VirtualBox -- which is further than I ever got with my broken Vista install :-/ It seems to behave identically to native XP as far as this bug is concerned.

Victor STINNER wrote:

The question is now how to integrate WriteConsoleW into Python without breaking the API, for example:

It pretty much has to be a TextIOWrapper, for compatibility. Also it's easier to implement it that way, because the text stream object has to be able to fall back to using the buffer if the fd is redirected.

Return sys.stdout.buffer.fileno(), which is 1 unless redirected.

This is the Right Thing because in Windows, fds are an abstraction of the C runtime library, and the C runtime allows an fd to be associated with a console. In that case, from the application's point of view it is still writing to the same fd. In fact, we'd be implementing this by calling the WriteConsoleW win32 API directly in order to avoid bugs in the CRT's Unicode support, but that's an implementation detail.

- What about sys.stdout.buffer: should sys.stdout.buffer.write() call WriteConsoleA(), or should sys.stdout not have a buffer attribute?

I was thinking that sys.std{out,err}.buffer would still be set up exactly as they are now. Then if an app writes to that buffer, it will get interleaved with any writes via the text stream. The writes to the buffer go to the underlying fd, which probably ends up calling WriteFile at the win32 level.

I think that many modules and programs now rely on sys.stdout.buffer to write bytes directly into stdout. There is at least python -m base64.

That would just work. The only caveat would be that if you write a partial line to the buffer object (or if you set the buffer object to be fully buffered and write to it), and then write to the text stream, the buffer wouldn't be flushed before the text is written. I think that is fine as long as it is documented.

If an app sets the .buffer attribute of sys.std{out,err}, it would fall back to using that buffer in the same way as when the fd is redirected.

- Should we use ReadConsoleW() for stdin?

Yes. I'll probably start with a patch that just handles std{out,err}, though.

msg132060 - Author: David-Sarah Hopwood (davidsarah) - Date: 2011-03-25

I wrote:

The only caveat would be that if you write a partial line to the buffer object or if you set the buffer object to be fully buffered and write to it, and then write to the text stream, the buffer wouldn t be flushed before the text is written.

Actually it looks like that already happens (because the sys.std{out,err} TextIOWrappers are line-buffered separately to their underlying buffers), so it would not be an incompatibility:

    $ python3 -c "import sys; sys.stdout.write('foo'); sys.stdout.buffer.write(b'bar'); sys.stdout.write('baz\n')"
    barfoobaz

msg132061 - Author: David-Sarah Hopwood (davidsarah) - Date: 2011-03-25

I wrote:

barfoobaz

Hmm, the behaviour actually would differ here: the proposed implementation would print

foobaz

bar

(the "foobaz\n" is written by a call to WriteConsoleW and then the "bar" gets flushed to stdout when the process exits).
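The current ordering quoted above ("barfoobaz") can be reproduced without a console. A minimal sketch, assuming CPython's C implementation of the io module; the function name `interleave` is made up for the illustration, and an in-memory `io.BytesIO` stands in for the real stdout file.

```python
import io


def interleave(flush_between):
    """Mix text-layer and byte-layer writes; return the bytes that reach the file."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(io.BufferedWriter(raw), encoding="utf-8",
                            newline="\n", line_buffering=True)
    text.write("foo")            # stays pending inside the TextIOWrapper
    if flush_between:
        text.flush()
    text.buffer.write(b"bar")    # goes straight into the BufferedWriter
    if flush_between:
        text.buffer.flush()
    text.write("baz\n")          # the newline triggers a flush of both layers
    text.flush()
    return raw.getvalue()


print(interleave(flush_between=False))
print(interleave(flush_between=True))
```

Only the variant that flushes between the two kinds of write yields the order in which the calls were made, which is the sense in which a "correct" application is discussed in the following messages.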

But since the naive expectation is "foobarbaz\n" and you already have to flush after each call in order to get that, I think this change in behaviour would be unlikely to affect correct applications.

msg132062 - Author: Glenn Linderman (v+python) - Date: 2011-03-25

Presently, a correct application only needs to flush between a sequence of writes and a sequence of buffer.writes.

Don't assume the flush happens after every write, for a correct application.

msg132064 - Author: David-Sarah Hopwood (davidsarah) - Date: 2011-03-25

Glenn Linderman wrote:

Presently, a correct application only needs to flush between a sequence of writes and a sequence of buffer.writes.

Right. The new requirement would be that a correct app also needs to flush between a sequence of buffer.writes (that end in an incomplete line, or always if PYTHONUNBUFFERED or python -u is used), and a sequence of writes.

Don't assume the flush happens after every write, for a correct application.

It's rather hard to implement this without any change in behaviour. Or rather, it isn't hard if the TextIOWrapper were to flush its underlying buffer before each time it writes to the console, but I'd be concerned about the extra overhead of that call. I'd prefer not to do that unless the new requirement above leads to incompatibilities in practice.

msg132065 - Author: Glenn Linderman (v+python) - Date: 2011-03-25

Would it suffice if the new scheme internally flushed after every buffer.write? It wouldn't be needed after write, because the correct application would already do one there.

Am I off-base in supposing that the performance of buffer.write is expected to include a flush (because it isn't expected to be buffered)?

msg132067 - Author: STINNER Victor (haypo) - Date: 2011-03-25

On Friday 25 March 2011 at 0000, David-Sarah Hopwood wrote:

David-Sarah Hopwood added the comment:

I wrote:

    $ python3 -c "import sys; sys.stdout.write('foo'); sys.stdout.buffer.write(b'bar'); sys.stdout.write('baz\n')"
    barfoobaz

Hmm, the behaviour actually would differ here: the proposed

implementation would print

foobaz

bar

the foobaz n is written by a call to WriteConsoleW and then the

bar gets flushed to stdout when the process exits.

But since the naive expectation is foobarbaz n and you already have

to flush after each call in order to get that, I think this change in

behaviour would be unlikely to affect correct applications.

I would not call this naive. "foobaz\nbar" is really weird. I think that sys.stdout and sys.stdout.buffer will both have to flush after each write, or they may be desynchronized.

Some developers already think that adding sys.stdout.flush() after print("Processing.. ", end="") is too hard (#11633). So I cannot imagine how they would react if they had to do it explicitly after every print(), sys.stdout.write() and sys.stdout.buffer.write().

msg132184 - Author: David-Sarah Hopwood (davidsarah) - Date: 2011-03-25

First a minor correction:

The new requirement would be that a correct app also needs to flush between a sequence of buffer.writes that end in an incomplete line, or always if PYTHONUNBUFFERED or python -u is used, and a sequence of writes.

That should be "and only if PYTHONUNBUFFERED or python -u is not used".

I also said:

If an app sets the .buffer attribute of sys.std{out,err}, it would fall back to using that buffer in the same way as when the fd is redirected.

but the .buffer attribute is readonly, so this case can't occur.

Glenn Linderman wrote:

Would it suffice if the new scheme internally flushed after every buffer.write? It wouldn't be needed after write, because the correct application would already do one there.

Yes, that would be sufficient.

Am I off-base in supposing that the performance of buffer.write is expected to include a flush (because it isn't expected to be buffered)?

It is expected to be line-buffered. So an app might expect that printing characters one-at-a-time will have reasonable performance.

In any case, given that the buffer of the initial std{out,err} will always be a BufferedWriter object (since .buffer is readonly), it would be possible for the TextIOWriter to test a dirty flag in the BufferedWriter, in order to check efficiently whether the buffer needs flushing on each write. I've looked at the implementation complexity cost of this, and it doesn't seem too bad.
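A minimal sketch of the simpler variant of this idea: flush the byte buffer before every console write, without the dirty-flag optimisation. The class name `ConsoleTextWriter` and the `write_console` callable (a stand-in for a WriteConsoleW wrapper) are hypothetical; this is not code from any patch on this issue.

```python
import io


class ConsoleTextWriter(io.TextIOBase):
    """Sketch: text stream that flushes its byte buffer before each console write."""

    def __init__(self, buffer, write_console):
        self._buffer = buffer                # e.g. the BufferedWriter behind sys.stdout
        self._write_console = write_console  # hypothetical stand-in for WriteConsoleW

    def writable(self):
        return True

    def write(self, text):
        # Push out any bytes written via .buffer first, so that output
        # appears in the order in which the application produced it.
        self._buffer.flush()
        self._write_console(text)
        return len(text)
```

BufferedWriter.flush() returns quickly when nothing is pending, so the cost of the extra call mainly matters for character-at-a-time output.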

A similar issue arises for stdin: to maintain strict compatibility, every read from a TextIOWrapper attached to an input console would have to drain the buffer of its buffer object, in case the app has read from it. This is a bit tricky because the bytes drained from the buffer have to be converted to Unicode, so what happens if they end part-way through a multibyte character? Ugh, I'll have to think about that one.

Some developers already think that adding sys.stdout.flush() after print("Processing.. ", end="") is too hard (#11633).

IIUC, that bug is about the behaviour of print, and didn't suggest changing the fact that sys.stdout is line-buffered.

By the way, are these changes going to be in a major release? If I understand correctly, the layout of structs for standard library types not prefixed with _, such as "buffered" in bufferedio.c or "textio" in textio.c, can change with major releases but not with minor releases, correct?

msg132191 - Author: David-Sarah Hopwood (davidsarah) - Date: 2011-03-26

I wrote:

A similar issue arises for stdin: to maintain strict compatibility, every read from a TextIOWrapper attached to an input console would have to drain the buffer of its buffer object, in case the app has read from it. This is a bit tricky because the bytes drained from the buffer have to be converted to Unicode, so what happens if they end part-way through a multibyte character? Ugh, I'll have to think about that one.

It seems like there is no correct way for an app to read from both sys.stdin and sys.stdin.buffer, even without these console changes. It must choose one or the other.

msg132208 - Author: Glenn Linderman (v+python) - Date: 2011-03-26

David-Sarah said:

So if flush checks that bit, maybe TextIOWriter could just call buffer.flush, and it would be fast if clean and slow if dirty. Calling it at the beginning of a Text level write, that is, which would let the char-at-a-time calls to buffer.write be fast.

And I totally agree with msg132191.

msg132266 - Author: David-Sarah Hopwood (davidsarah) - Date: 2011-03-26

Glenn wrote:

So if flush checks that bit, maybe TextIOWriter could just call buffer.flush, and it would be fast if clean and slow if dirty.

Yes. I'll benchmark how much overhead is added by the calls to flush; there's no point in breaking the abstraction boundary of BufferedWriter if it doesn't give a significant performance benefit. (I suspect that it might not, because Windows is very slow at scrolling a console, which might make the cost of flushing insignificant in comparison.)

msg132268 - Author: Glenn Linderman (v+python) - Date: 2011-03-26

David-Sarah wrote:

Windows is very slow at scrolling a console, which might make the cost of flushing insignificant in comparison.

Just for the record, I noticed a huge speedup in Windows console scrolling when I switched from WinXP to Win7 (on a faster computer :)

How much is due to the XP to 7 switch and how much to the faster computer, I cannot say, but it seemed much more significant than other speedups in other software. The point? Benchmark it on Win7, not XP.

msg145898 - Author: STINNER Victor (haypo) - Date: 2011-10-19

I did more tests on the Windows console. I focused my tests on output.

To sum up, if we implement sys.stdout using WriteConsoleW() and sys.stdout.buffer.raw using WriteConsoleA():

- print() will not fail anymore on unencodable characters, because the string is no longer encoded to the console code page

- if you set the console font to a TrueType font, most characters will be displayed correctly

- you don't need to change the console code page to CP_UTF8 (65001) anymore if you just use print()

- you still need cp65001 if the output (stdout and/or stderr) is redirected or if you use sys.stdout.buffer or sys.stderr.buffer directly
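The "unencodable characters" failure in the first point can be illustrated with the three test characters used earlier in this thread (é, Ł, Э) and the cp850 code page. A small sketch; `encodable` is a hypothetical helper name.

```python
def encodable(text, codepage):
    """Return True if text can be encoded to the given code page."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


for ch in "\u00e9\u0141\u042d":
    # With a cp850 console, print() must encode each character like this
    # before writing bytes; characters missing from the code page raise.
    print("U+%04X" % ord(ch), encodable(ch, "cp850"))
```

Writing the string through WriteConsoleW() skips this encoding step entirely, which is why the code page stops mattering for console output.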

Other facts:

- locale.getpreferredencoding() returns the ANSI code page

- sys.stdin.encoding is the console encoding (GetConsoleCP())

- sys.stdout.encoding and sys.stderr.encoding are the console output code page (GetConsoleOutputCP())

- sys.stdout is not a TTY if the output is redirected, e.g. python script.py | more

- sys.stderr is not a TTY if the output is redirected, e.g. python script.py 2>&1 | more (this example redirects stdout and stderr, I don't know how to redirect only stderr)

- WriteConsoleW() is not affected by the console output code page (GetConsoleOutputCP)

- WriteConsoleA() is indirectly affected by the console output code page: if a string cannot be encoded to the console output code page (e.g. sys.stdout.encoding), you cannot call WriteConsoleA with the result

- If the console font is a raster font and the font doesn't contain a character, the console tries to find a similar glyph, or it falls back to the character .

- If the console font is a TrueType font, it is able to display most Unicode characters

msg145899 - Author: STINNER Victor (haypo) - Date: 2011-10-19

unicode3.py replaces sys.stdout, sys.stdout.buffer, sys.stderr and sys.stderr.buffer to use WriteConsoleW() and WriteConsoleA(). It also displays a lot of information about encodings and displays some characters (I wrote my tests for cp850, cp1252 and cp65001).

msg145963 - Author: STINNER Victor (haypo) - Date: 2011-10-19

win_console.patch: a more complete prototype

patch the site module to replace sys.stdout and sys.stderr by UnicodeConsole and BytesConsole classes which use WriteConsoleW() and WriteConsoleA()

UnicodeConsole inherits from io.TextIOBase and BytesConsole inherits from io.RawIOBase

Revert the workaround for the WriteConsoleA bug from io.FileIO

sys.stdout and/or sys.stderr are only replaced if they are not redirected.

msg145964 - Author: STINNER Victor (haypo) - Date: 2011-10-19

test_win_console.py: Small script to test win_console.patch. Write some characters into sys.stdout.buffer (WriteConsoleA) and sys.stdout (WriteConsoleW). The test is written for cp850, cp1252 and cp65001 code pages.

msg146471 - Author: STINNER Victor (haypo) - Date: 2011-10-26

I added a cp65001 codec to Python 3.3: see issue #13216.

msg148990 - Author: Matt Mackall (Matt.Mackall) - Date: 2011-12-07

The underlying cause of Python's write exceptions with cp65001 is:

The ANSI C write function as implemented by the Windows console returns the number of _characters_ written rather than the number of _bytes_, which Python reasonably interprets as a "short write error". It then consults errno, which gives the effectively random error message seen.

This can be bypassed by using os.write(sys.stdout.fileno(), utf8str), which will (a) succeed and (b) return a count no greater than len(utf8str).

With os.write and an appropriate font, the Windows console will correctly display a large number of characters.
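A minimal sketch of that os.write() workaround. The helper name `write_utf8` is hypothetical. It deliberately makes a single os.write() call and does not retry when the returned count is smaller than the byte length, because on a cp65001 console the count is described above as a character count, so retrying with the "remaining" bytes would repeat output.

```python
import os


def write_utf8(fd, text):
    """Encode text as UTF-8 and write it to fd with one os.write() call.

    Returns (count reported by os.write, number of bytes submitted).
    """
    data = text.encode("utf-8")
    reported = os.write(fd, data)
    return reported, len(data)
```

For example, `write_utf8(sys.stdout.fileno(), "p\u012bny\u012bn\n")` submits 9 bytes for 7 characters; any buffered sys.stdout output should be flushed first so that the two paths do not interleave.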

Possible workaround: clear errno before calling write, check for non-zero errno after. The vast majority of non-Python applications never check the return value of write, so don't encounter this problem.

msg157569 - Author: STINNER Victor (haypo) - Date: 2012-04-05

The issue #14227 has been marked as a duplicate of this issue. Copy of msg155149:

This is on Windows 7 SP1. Run "chcp 65001" then Python from a console. Note the extra characters when non-ASCII characters are in the string. At a guess it appears to be using the UTF-8 byte length of the internal representation instead of the character count.

    Python 3.3.0a1 (default, Mar 4 2012, :59) [MSC v.1500 32 bit (Intel)] on win32
    >>> print('hello')
    hello
    >>> print('p\u012bny\u012bn')
    pīnyīn
    n
    >>> print('\u012b' * 10)
    īīīīīīīīīī
    īīīī
    ī

msg160812 - Author: Glenn Linderman (v+python) - Date: 2012-05-16

Has something incompatible changed between 3.2.2 and 3.2.3 with respect to this bug?

I have a program that had an earlier version of the workaround (Michael's original, I think), and it worked fine. Then I upgraded from 3.2.2 to 3.2.3 (due to testing for issue 14811) and the old workaround started complaining about no attribute 'errors'.

So I grabbed unicode3.py, but it does the same thing:

AttributeError: 'UnicodeConsole' object has no attribute 'errors'

I have no clue how to fix this, other than going back to Python 3.2.2.

msg160813 - Author: Glenn Linderman (v+python) - Date: 2012-05-16

Oh, and is this issue going to be fixed for 3.3, so we don't have to use the workaround in the future?

msg160897 - Author: Terry J. Reedy (terry.reedy) - Date: 2012-05-16

Glenn, I do not know what you are using the interactive interpreter for, but for the Unicode BMP, the Idle shell generally works better. I only use Command Prompt for cross-checking behavior.

msg161151 - Author: Giampaolo Rodola (giampaolo.rodola) - Date: 2012-05-19

Not sure whether a solution has already been proposed because the issue is very long, but I just bumped into this on Windows and came up with this:

    from __future__ import print_function
    import sys

    def safe_print(s):
        try:
            print(s)
        except UnicodeEncodeError:
            # Fall back to the UTF-8 bytes, reinterpreted in the console encoding.
            if sys.version_info >= (3,):
                print(s.encode('utf8').decode(sys.stdout.encoding))
            else:
                print(s.encode('utf8'))

    safe_print(u"\N{EM DASH}")

Couldn't python do the same thing internally?

msg161153 - Author: David-Sarah Hopwood (davidsarah) - Date: 2012-05-19

Giampaolo: See msg120700 for why that won't work, and the subsequent comments for what will work instead (basically, using WriteConsoleW and a workaround for a Windows API bug). Also see the prototype win_console.patch from Victor Stinner: msg145963

msg161308 - Author: Glenn Linderman (v+python) - Date: 2012-05-21

I actually had to go back to 3.1.2 to get it to run; I guess I had never run with Unicode output after installing 3.2. So it isn't an incompatibility between 3.2.2 and 3.2.3, but more likely a change between 3.1 and 3.2 that invalidates this patch and workaround. At least it is easier to keep 3.1.x and 3.2.x on the same system.

Terry, applications for non-programmers that want to emit Unicode on the console, so the IDLE shell isn't appropriate.

msg161651 - Author: Glenn Linderman (v+python) - Date: 2012-05-26

A little more empirical info: the missing "errors" attribute doesn't show up except for input(). print() works fine.

msg164572 - Author: Glenn Linderman (v+python) - Date: 2012-07-03

For the win_console.patch, it seems like adding the line

    self.errors = 'strict'

inside UnicodeOutput.__init__ resolves the problem with input causing exceptions.

Not sure if the sys_write_stdout.patch has the same sort of problem. Sure hope this issue makes it into 3.3.

msg164578 - Author: Terry J. Reedy (terry.reedy) - Date: 2012-07-03

3.3b0, Win7, 64 bit. Original test script stops at

    File "C:\Programs\Python33\lib\encodings\cp437.py", line 19, in encode
      return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 6:

I am slightly puzzled because cp437 is an extended ascii codepage and there is a character for 0x80

https://en.wikipedia.org/wiki/Code_page_437

If I add .encode('latin1'), it does not print the pentagon for 0x7e, but does print x7e to xff.
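One way to read the error above: the codec is asked to encode the character U+0080 (a C1 control character), not the byte 0x80. Code page 437 does have a glyph at byte 0x80, but it is Ç (U+00C7); U+0080 itself has no slot in cp437. A small check using only the standard codecs:

```python
# Byte 0x80 in code page 437 decodes to LATIN CAPITAL LETTER C WITH CEDILLA.
glyph_at_0x80 = b"\x80".decode("cp437")
print(repr(glyph_at_0x80), "U+%04X" % ord(glyph_at_0x80))

# The character U+0080 is a different thing and cannot be encoded to cp437.
try:
    "\x80".encode("cp437")
    cp437_has_u0080 = True
except UnicodeEncodeError:
    cp437_has_u0080 = False
print(cp437_has_u0080)

# latin1 maps U+0000..U+00FF one-to-one onto bytes, so encoding succeeds there.
print("\x80".encode("latin1"))
```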

Someone wrote elsewhere that 3.3 could use cp65001. True?

msg164580 - Author: Glenn Linderman (v+python) - Date: 2012-07-03

My fix for this "errors" error might be similar to what is needed for issue 12967, although I don't know if my fix is really correct, just that it gets past the error, and 'strict' is the default for TextIOWrapper.

I'm not at all sure why there is now (since 3.2) an interaction between input on stdin and the particulars of the output class for stdout. But I'm not at all an expert in Python internals or Python IO.

I'm not sure whether or not you applied the patch to your b0; if not, that is what I'm running, too, but using the win_console.patch as supporting code. The original test script didn't use the supporting code.

If you did patch your b0 with unicode3.py, then you shouldn't need to do a chcp to write any Unicode characters; someone reported that doing a chcp caused problems, but I don't know how to apply the patch or build a Python with it, so I can't really test all the cases. Victor did add a cp65001 codec using a different issue; I'm not sure how that is relevant here, other than for the tests he wrote.

msg164618 - Author: Terry J. Reedy (terry.reedy) - Date: 2012-07-03

I was reporting stock, as distributed, 3.3b0.

Is unicode3.py something to run once or import in each app that wants unicode output? Either way, if it is possible to fix the console, why is it not distributed with the fix?

Terry, applications for non-programmers that want to emit Unicode on the console, so the IDLE shell isn't appropriate.

Someone just posted on python-list about a problem with that.

Hmm. Maybe IDLE should gain a batch-mode console window -- basically a stripped down version of the current shell -- a minimal auto-gui for apps.

msg164619 - Author: Glenn Linderman (v+python) - Date: 2012-07-03

Terry said:

Is unicode3.py something to run once or import in each app that wants unicode output?

I say:

The latter: import it.

Terry said:

Either way, if it is possible to fix the console, why is it not distributed with the fix?

Not sure what you are asking here. Yes, it is possible to fix the console, but this fix depends on the version-specific internals of the Python IO system, so unicode3.py works with Python 3.1, but not Python 3.2 or 3.3. I haven't tested to see if my patched unicode3.py still works on Python 3.1; I imagine it would, due to the nature of the fix just adding something that Python 3.1 probably would ignore.

So my opinion is the fix is better done inside Python than inside the application.

msg170899 - Author: Adam Bartoš (Drekin) - Date: 2012-09-21

Hello, I'm trying to handle Unicode input and output in the Windows console and found this issue. Will this be solved in 3.3 final? I tried to write a solution (file attached) based on the solution here: rewriting sys.stdin and sys.stdout so that they use ReadConsoleW and WriteConsoleW.

Output works well, but there are a few problems with input. First, the Python interactive interpreter actually doesn't use sys.stdin but the standard C stdin. It's implemented over a file pointer (PyRun_InteractiveLoopFlags, PyRun_InteractiveOneFlags in pythonrun). But still the interpreter uses sys.stdin.encoding (assigning sys.stdin something that doesn't have encoding None freezes the interpreter). Wouldn't it make more sense if it used sys.__stdin__.encoding?

However, input() (which uses stdin.readline) works as expected. There's a small problem with KeyboardInterrupt. Since signals are processed asynchronously, it's raised at a random place and it behaves weirdly. time.sleep(0.01) after the C call works well, but it's an ugly solution.

When code.interact() is used instead of the standard interpreter, it works as expected. Is there a way of changing the interpreter loop? Some hook which calls code.interact() at the right place? The patch can be applied in site or sitecustomize, but calling code.interact() there obviously doesn't work.
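A minimal sketch of a code.interact()-style REPL that takes its input from a Python-level stream rather than from the C stdin. The class name `StreamConsole` is hypothetical and this is not the attached solution; it only uses the documented `raw_input` override point of `code.InteractiveConsole`.

```python
import code
import io


class StreamConsole(code.InteractiveConsole):
    """InteractiveConsole that reads its input lines from a given text stream."""

    def __init__(self, stream, locals=None):
        super().__init__(locals=locals)
        self._stream = stream  # e.g. a replaced sys.stdin using ReadConsoleW

    def raw_input(self, prompt=""):
        self.write(prompt)             # prompts go to sys.stderr by default
        line = self._stream.readline()
        if not line:
            raise EOFError             # end of stream ends interact()
        return line.rstrip("\n")
```

Calling `StreamConsole(sys.stdin).interact()` from a script then runs a REPL whose input goes through whatever object sys.stdin currently is, which is the property the standard interactive loop lacks.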

Some other remarks:

- When sys.stdin or sys.stdout doesn't define encoding and errors, input() raises TypeError: bad argument type for built-in operation.

- input() raises KeyboardInterrupt on Ctrl-C in Python 3.2 but not in Python 3.3rc2.

msg170915 - Author: STINNER Victor (haypo) - Date: 2012-09-21

Will this issue be solved in 3.3 final?

No. It would be a huge change and RC2 was already released. No new features are accepted after version 3.3.0 beta 1:

I'm not really motivated to work on this issue, because it is really hard to get something working in all cases. Using ReadConsoleW/WriteConsoleW helps, but it doesn't solve all issues, as you said.

msg170999 - Author: Adam Bartoš (Drekin) - Date: 2012-09-22

I have finished a solution that works for me. It bypasses the standard Python interactive interpreter and uses its own REPL based on code.interact(). This REPL is activated by an ugly hack, since PYTHONSTARTUP doesn't apply when some file is run (python -i somefile.py). Why does it work like that? The startup script could find out whether a file is run or not. If anybody knows how to get rid of the time.sleep() used to wait for KeyboardInterrupt, or how to get rid of PromptHack, please let me know. The patch can be activated by win_unicode_console_2.enable(change_console=True, use_hack=True) in site or sitecustomize or usercustomize.

msg185135 - Author: Adam Bartoš (Drekin) - Date: 2013-03-24

Hello. I have made a small upgrade of the workaround.

win_unicode_console.enable_streams sets sys.stdin, stdout and stderr to custom filelike objects which use Windows functions ReadConcoleW and WriteConsoleW to handle unicode data properly. This can be done in sitecustomize.py to take effect automatically.

Since Python interactive console doesn t use sys.stdin for getting input still don t know reason for this, there is an alternative repl based on code.interact. win_unicode_console.IntertactiveConsole.enable sets it up. To set it up automatically, put the enabling code into a startup file and set PYTHONSTARTUP environment variable. This works for interactive session just running python with no script.

Since there is no hook to run InteractiveConsole.enable when a script is run interactively -i flag, that is after the script and before the interactive session, I have written a helper script i.py. It just runs given script and then enters an interactive mode using InteractiveConsole. Just put i.py into site-packages and run py -m i script.py arguments instead of py -i script.py arguments.

It's a shame that in the year 2013 one cannot simply run the Python console on Windows and enter Unicode characters. I'm not saying it's just Python's fault, but there is a workaround on the Python side.

msg197700 - Author: Adam Bartoš (Drekin) - Date: 2013-09-14

Hello again. I have rewritten the custom stdio objects and implemented them as raw io, reading and writing bytes in UTF-16-LE encoding. They are then wrapped in standard BufferedReader/BufferedWriter and TextIOWrapper objects. This approach also solves a bug where the wrong string length was given to WriteConsoleW when the string contained a supplementary character. Since we are waiting for the Ctrl-C signal to arrive, this implementation doesn't suffer from [...]. It seems to work when the main script is executed; however, it doesn't work in the Python interactive REPL, since the REPL doesn't use sys.stdin for input. However, it does use its encoding, which results in a mess when sys.stdin is changed to an object with a different encoding like UTF-16-LE. See [...].

msg197751 - Author: Glenn Linderman (v python) - Date: 2013-09-15

Hi Drekin. Thanks for your work in progressing this issue. There have been a variety of techniques proposed for this issue, but it sounds like yours has built on what the others learned, and is close to complete, together with issue 17620.
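The layering described in msg197700 above (a raw stream that handles UTF-16-LE bytes, wrapped in the standard buffered and text layers) can be sketched as follows. This is only an illustration, not the code attached to the issue: FakeConsoleRawWriter is a hypothetical stand-in that records what it receives instead of calling WriteConsoleW. It also shows why the string length matters: a supplementary character is one Python character but two UTF-16 code units.

```python
import io


class FakeConsoleRawWriter(io.RawIOBase):
    """Hypothetical raw stream that accepts UTF-16-LE bytes.

    A real implementation would hand the data to WriteConsoleW;
    this stand-in only records the decoded text and the unit count.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def write(self, b):
        data = bytes(b)
        # The console API counts UTF-16 code units, not Python characters.
        code_units = len(data) // 2
        self.chunks.append((data.decode("utf-16-le"), code_units))
        return len(data)


raw = FakeConsoleRawWriter()
# newline="\n" disables platform newline translation so the counts are stable.
stream = io.TextIOWrapper(io.BufferedWriter(raw), encoding="utf-16-le", newline="\n")

text = "a\U0001F600\n"  # one BMP character, one supplementary character, newline
stream.write(text)
stream.flush()

written_text = "".join(chunk for chunk, _ in raw.chunks)
written_units = sum(units for _, units in raw.chunks)
print(len(text), written_units)
```

The two printed numbers differ because the supplementary character is encoded as a surrogate pair; passing len(text) as the number of units to write would drop the end of the string.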

Is this in a form that can be used with Python 3.3 or 3.4 alpha? Can it be loaded externally from a script, or must it be compiled into Python, or both?

I've been using a variant of davidsarah's patch for 2 years now, but would like to take yours out for a spin. Is there a "Complete Idiot's Guide" to using your patch? :)

msg197752 - Author: Terry J. Reedy (terry.reedy) - Date: 2013-09-15

From reading the module,

import stream; stream.enable()

replaces sys.stdin/out/err with new classes.

msg197773 - Author: Adam Bartoš (Drekin) - Date: 2013-09-15

Glenn Linderman: Yes, I have built on what the others learned. As for your question, I made it and tested it in Python 3.3; it should also work in 3.4, and from what I've tried, it actually does. As Terry J. Reedy says, you can just load the module and enable the streams. I do this automatically on startup using sitecustomize. However, as I said, this currently messes up the interactive session because of [...]. I have made a workaround: a custom REPL built on the stdlib module code, and also a helper script which runs the main script and then runs the custom REPL (I couldn't find any standard hook which would run the custom REPL). I'm uploading the full code. I will delete it if this isn't an appropriate place.

Things like this could be fixed more easily if more of the core interpreter logic took place in the stdlib, e.g. the code for the interactive REPL. A few days ago I started a discussion on python-ideas: https://mail.python.org/pipermail/python-ideas/2013-August/023000.html.

msg221175 - Author: Nick Coghlan (ncoghlan) - Date: 2014-06-21

The fact that Unicode doesn't work at the command prompt makes it look like Unicode on Windows just plain doesn't work, even in Python 3. Steve, if you or a colleague could provide some insight on getting this to work properly, that would be greatly appreciated.

msg221178 - Author: Steve Dower (steve.dower) - Date: 2014-06-21

My understanding is that the best way to write Unicode to the console is through WriteConsoleW, which seems to be where this discussion ended up. The only apparent sticking point is that this would cause an ordering incompatibility with stdout.write(); stdout.buffer.write(); stdout.write().

Last I heard, the official advice was to use PowerShell. Clearly everyone's keen to jump on that. I'm not even sure it's an instant fix either: PS is a much better shell for file manipulation and certainly handles encoding better than type/echo/etc., but I think it will still go back to the OEM CP for executables.

One other point that came up was UTF-8 handling after redirecting output to a file. I don't see an issue there: UTF-8 is going to be one of the first guesses (with or without a BOM) for text that is not UTF-16, and apps that assume something else are no worse off than with any other codepage.

So I don't have any great answers, sorry. I'd love to see the defaults handle it properly, but opt-in scripts like Drekin's may be the best way to enable it broadly.

msg223403 - Author: Adam Bartoš (Drekin) - Date: 2014-07-18

I have made some updates in the streams code: better error handling (getting errno via GetLastError) and raising an exception when zero bytes are written for non-zero input. This prevents the infinite loop in BufferedWriter.flush when there is an odd number of bytes (WriteConsoleW accepts UTF-16-LE, so only an even number of bytes is written). It also prevents the same infinite loop when the buffer is too big to write at once (see [...]). A limit of 32767 bytes was added to the raw write.

msg223404 - Author: STINNER Victor (haypo) - Date: 2014-07-18

Drekin: Please don't send ZIP files to the bug tracker. It would be much better to have a project on GitHub, Mercurial or something else, to have the history of the source code. You may try to list all the people who contributed to this code.
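The write-size rules from msg223403 above (whole UTF-16 code units only, at most 32767 bytes per raw write, and an error instead of a silent zero-byte write) can be illustrated with a small helper. The function name and its exact behaviour are assumptions made for this sketch; the actual streams code may be organised differently.

```python
MAX_WRITE_BYTES = 32767  # per-call limit mentioned in msg223403


def writable_byte_count(n_bytes, limit=MAX_WRITE_BYTES):
    """Return how many bytes of a UTF-16-LE buffer to pass to one raw write.

    Hypothetical helper: caps the size at the limit and rounds down to an
    even number so that only whole UTF-16 code units are written.
    """
    n = min(n_bytes, limit)
    n -= n % 2  # drop a trailing odd byte
    if n_bytes > 0 and n == 0:
        # Returning 0 here would make a buffered flush retry forever.
        raise OSError("cannot write a partial UTF-16 code unit")
    return n


for size in (0, 10, 11, 100000):
    print(size, writable_byte_count(size))
```

Raising on a one-byte buffer is what breaks the infinite loop described in the message: the buffered layer keeps calling the raw write until everything is accepted, so a write that always reports zero bytes would never finish.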

You may also create a project on pypi.python.org to share your code. This bug tracker is not the best place for that.

When the code is considered mature (well tested, widely used), we can try to integrate it into Python.

msg223507 - Author: Adam Bartoš (Drekin) - Date: 2014-07-20

Victor Stinner: You are right. So I did it. Here are the links to GitHub and PyPI: https://github.com/Drekin/win-unicode-console, https://pypi.python.org/pypi/win_unicode_console.

I also tried to delete the files, but it seems that it is only possible to unlink a file from the issue, while the file itself remains. Is it possible to manage the files?

msg223509 - Author: Nick Coghlan (ncoghlan) - Date: 2014-07-20

Thanks Drekin - I'll point folks to your project as a good place to provide initial feedback, and if that seems promising we can look at potentially integrating the various fixes into Python 3.5.

msg223945 - Author: Mark Summerfield (mark) - Date: 2014-07-25

I used pip to install the win_unicode_console package on Windows 7 with Python 3.3.

It works but wouldn't freeze with cx_Freeze because there's no __init__.py file in the win_unicode_console directory.

msg223946 - Author: Nick Coghlan (ncoghlan) - Date: 2014-07-25

Hmm, I'm not sure if that would be a bug in cx_Freeze or CPython - I don't think we've tried freezing or zipimporting namespace packages. Either way, adding the __init__.py to win_unicode_console would likely be the quickest fix.

msg223947 - Author: STINNER Victor (haypo) - Date: 2014-07-25

Since there is now an external project fixing the support of the Windows console, I suggest closing this issue as wontfix. In a few months, if we get enough feedback on this project, we may reconsider integrating it into Python. What do you think?

https://pypi.python.org/pypi/win_unicode_console.

> I used pip to install the win_unicode_console package

Please don't use the Python bug tracker to report bugs in the package.

msg223948 - Author: Nick Coghlan (ncoghlan) - Date: 2014-07-25

The poor interaction with the Windows command line is still a bug in CPython - we could mark it closed/later but I don't see any value in doing so.

I see Drekin's win_unicode_console module as similar to my own contextlib2 - used to prove the concept, and perhaps iterate on some of the details, but the ultimate long-term solution is to fix CPython itself.

msg223949 - Author: STINNER Victor (haypo) - Date: 2014-07-25

> The poor interaction with the Windows command line is still a bug in CPython - we could mark it closed/later but I don't see any value in doing so.

I don't see any value in keeping the issue open, since nobody has worked on it in the last 7 years. I just want to make it clear that we will not fix this issue.

Well, in fact I spent a lot of hours trying to find a way to fix the issue, and my conclusion is that it's not possible to handle Unicode input and output correctly in a Windows console. Please read the whole issue for the details.

The win_unicode_console project may improve the Unicode support, but I'm convinced that it still has various issues because it is just not possible to handle all cases.

A workaround is to not use the Windows console, but to use IDLE or another shell (maybe try PowerShell). But PowerShell has at least one issue with code page 65001 (Microsoft UTF-8): see issue 21927.

msg223951 - Author: Nick Coghlan (ncoghlan) - Date: 2014-07-25

Based on Steve's last post, the main challenge is that the IO model assumes a bytes-based streaming API - it isn't really set up to cope with a UTF-16 buffering layer.

However, that's not substantially different from the situation when the standard streams are replaced with StringIO objects, and they don't have an underlying buffer object at all. That may be a suitable model for Windows console IO as well - present it to the user in a way that doesn't expose an underlying bytes-based API at all.

Now, it may not be feasible to implement this until we get the startup code cleaned up, but I'm not going to squash interest in improving the situation when it's one of the major culprits behind the "Unicode is even more broken in Python 3 than it is in Python 2" meme.

msg223952 - Author: Nick Coghlan (ncoghlan) - Date: 2014-07-25

Changing targets to Python 3.5, since this is almost certainly going to be too invasive for a maintenance release.

msg224019 - Author: Glenn Linderman (v python) - Date: 2014-07-26

This bug deserves to stay open with its high priority (for whatever good that has done these last seven years). I appreciate all the efforts put forth, and have been making heavy use of the workarounds in the patches, because when working with Unicode data in programs, even exception messages are not properly displayed; instead, they cause a secondary exception about not being able to display the data of the original exception on the console.

And writing Unicode data to the console as part of an interactive or command line program has to either be done with the hopes that the data only includes characters in the console, to avoid the failures, or with lots of special encoding calls and character substitutions for code points not in the console repertoire. Remember that the console is supposed to be human readable, not encoded numerically as ascii would do.

ascii is sort of OK for exception messages, but since that doesn't happen by default, the initial message to the console with Unicode data often doesn't appear, and an extra repetition after a failed message and a rework of the message parameters is required, which impedes productivity.

msg224086 - Author: Adam Bartoš (Drekin) - Date: 2014-07-26

I have deleted all my old files and added only my current implementation of the stream objects, as the only part relevant to this issue.

Mark Summerfield: I have added __init__.py to the new version of win_unicode_console. If there is any problem, you can start an issue on project GitHub site or contact me.

Victor Stinner, Nick Coghlan: What's wrong with looking at Windows wide strings as UTF-16-LE encoded bytes and building the raw stream objects around this?

msg224095 - Author: Nick Coghlan (ncoghlan) - Date: 2014-07-27

Drekin, you're right, that's a much better way to go, I just didn't think it through :)

msg224596 - Author: Mark Lawrence (BreamoreBoy) - Date: 2014-08-02

To ensure that we're all talking about the same thing, is everybody using the /u (Unicode output) option or /a (ANSI), which I'm assuming is the default when running cmd?

msg224605 - Author: Glenn Linderman (v python) - Date: 2014-08-03

Mark, the /U and /A switches to CMD only affect (as the help messages say) the output of internal CMD commands. So they would only affect interoperability when internal command output is piped to a Python program. The biggest issue in this bug, however, is the output of Python programs not being properly displayed by the console window (often thought of or described as the CMD shell window).

While my biggest concerns have been with output, I suppose input can be an issue also, and running the output of echo, or other internal commands, into Python could be an issue as well. I have pasted a variety of data beyond ASCII into Python programs, but I'm not sure I've gone beyond ANSI or beyond the Unicode BMP. Obviously, once output is working properly, input should also be tested and fixed, although I think output is more critical.

With the impetus of your question I just took some text supplied in another context that has a bunch of characters from different repertoires, including non-BMP, and tried to paste it into the console window. Here is the text:

こんにちは世界 - fine on Linux, all boxes on Windows (all boxes in Chrome on Linux too)

مرحبا العالم. - fine on Linux and Windows

안녕하세요, 세계. - fine on Linux, just boxes and punctuation on Windows

likewise in Chrome

Привет, мир. - fine on Linux and Windows

Αυτή είναι μια δοκιμή - fine on both, but Google Translate has a problem with this. It returned "Hello, world." as the Greek for "Hello, world." so I tried again with "This is a test."

, . - not actually a language, but this is astral

In the console window, which I have configured using the Consolas font, the glyphs for the non-ASCII characters in the first two and last lines were boxes (likely Consolas doesn't support those characters). I had written a Python equivalent of echo, including some workarounds originally posted in this issue, and got exactly the same output as input, with no errors produced. So it is a bit difficult to test characters outside the repertoire of whatever font is configured for the console window. Perhaps someone that has Chinese or Korean fonts configured for their console window could report on further testing of the above or similar strings.

msg224690 - Author: Adam Bartoš (Drekin) - Date: 2014-08-04

I think that boxes are OK, it's just a missing font. Without an active workaround there is just a UnicodeEncodeError with cp852 for me. There is a problem with astral characters: I'm getting each box twice. It is possible that the Windows console doesn't handle astral characters at all, i.e. it doesn't interpret surrogate pairs.

msg227329 - Author: Stefan Champailler (wiz21) - Date: 2014-09-23

I don't know if this is 100% related, but here I go. Here's a session in a Windows console (cmd.exe):

Microsoft Windows Version 6.1.7601

Copyright c 2009 Microsoft Corporation. All rights reserved.

C: Users stc chcp 65001

Active code page: 65001

C: Users stc PORT-STCA2 opt python3 python

Python 3.4.1 v3.4.1:c0e311e010fc, May 18 2014, :22 MSC v.1600 32 bit Intel on win32

print

C: Users stc

So basically, the Python interpreter just quits without any message. Windows doesn't complain about Python crashing, though.

Best regards,

Stefan

msg227330 - Author: Stefan Champailler (wiz21) - Date: 2014-09-23

In my previous comment, I've shown:

print

which is not valid Python 3.4.1 (I don't know why the interpreter didn't complain, though). So I tested again with the missing parentheses added:

C: PORT-STCA2 pl-PRIVATE horse chcp 65001

C: PORT-STCA2 pl-PRIVATE horse python

C: PORT-STCA2 pl-PRIVATE horse echo PROCESSOR_IDENTIFIER

Intel64 Family 6 Model 42 Stepping 7, GenuineIntel

Exactly the same behaviour.

msg227332 - Author: Nick Coghlan (ncoghlan) - Date: 2014-09-23

Drekin, it would be good to be able to incorporate some of your improvements for Python 3.5. Before we could do that, we'd need you to review and agree to the PSF Contributor Agreement at https://www.python.org/psf/contrib/contrib-form/

The underlying licensing situation for CPython is a little messy (albeit in a way that doesn't impact users or redistributors), so we use the contributor agreement to ensure we continue to have the right to distribute Python under its current license without making the history any messier, and to preserve the option of switching to a simpler standard license at some point in the future if it ever becomes feasible to do so.

msg227333 - Author: Adam Bartoš (Drekin) - Date: 2014-09-23

Stefan Champailler:

The crash you see is maybe not a crash at all. First, it has nothing to do with printing; the problem is the reading of your input line. That explains why Python exited even before printing the traceback of the SyntaxError. If you try to read input using sys.stdin.buffer.raw.read(100) and type Unicode characters, it returns just empty bytes, b''. So maybe the Python REPL then thinks the input has just ended and so exits the interpreter as it normally would.
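The point about an empty read being taken as end of input can be reproduced without a console. The sketch below is an illustration only: it swaps sys.stdin for in-memory streams (the helper name read_line_or_eof is made up for this example) and calls the built-in input(), which reports an empty read as EOFError.

```python
import io
import sys


def read_line_or_eof(raw_bytes):
    """Call input() with sys.stdin temporarily replaced by a stream over raw_bytes."""
    saved = sys.stdin
    sys.stdin = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8")
    try:
        return input()
    except EOFError:
        return None  # an empty read is reported as end of input
    finally:
        sys.stdin = saved


# A normal line containing the euro sign (written as an escape).
line = read_line_or_eof("print('\u20ac')\n".encode("utf-8"))
# A stream whose read yields b'' straight away, like the console read above.
eof = read_line_or_eof(b"")

print(ascii(line))
print(eof)
```

An interactive loop that treats EOF as "the user closed the session" will therefore exit quietly in the second case, which matches the behaviour reported in the session above.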

Why are you using chcp 65001? As far as I know, it doesn't give you the ability to use Unicode in the console. It somehow helps with printing, but there are some issues. print("\N{euro sign}") prints the right character, but it prints an additional blank line. sys.stdout.write("\N{euro sign}") and sys.stdout.buffer.write("\N{euro sign}".encode("cp65001")) do the same, but sys.stdout.buffer.raw.write("\N{euro sign}".encode("cp65001")) works as expected.

If you want to enter and display Unicode in Python on the Windows console, try my package win_unicode_console, which tries to solve the issues. See https://pypi.python.org/pypi/win_unicode_console.

msg227337 - Author: Adam Bartoš (Drekin) - Date: 2014-09-23

Nick Coghlan: OK, done.

msg227338 - Author: Nick Coghlan (ncoghlan) - Date: 2014-09-23

Drekin: thanks. That should get processed by the PSF Secretary before too long, and the mark indicating you have signed it will appear by your name.

msg227347 - Author: Stefan Champailler (wiz21) - Date: 2014-09-23

Dear Drekin,

> The crash you see is maybe not a crash at all. First it has nothing
> to do with printing, the problem is reading of your input line.

I guessed that, but thanks for pointing it out.

> So maybe Python REPL then thinks the input just ended and so standardly exits the interpreter.

Yes. I showed that because the line of code seemed perfectly valid and innocuous (I moved to Python 3 because I need good Unicode/encoding support). The answer from the REPL is, to me, very surprising. I would have expected a badly displayed character at least and a syntax error at worst. I consider myself quite aware of Unicode issues, but without any output from the REPL, I'd have a very hard time figuring out what went wrong, hence my bug report.

So even though this might not qualify as the worst bug in Python, I'd say it is actually quite misleading. But see no complaint here, I'm very happy with Python in general. It's just that I thought I had to tell the dev team.

> Why are you using chcp 65001.

I thought it'd help me with printing Unicode. I tried CP437, but the problem is that the euro sign is not there, and I do need the euro sign :-). But I'll readily admit I didn't read all the stuff about encoding issues on the Windows console before trying.
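Whether a given code page can represent the euro sign can be checked with Python's own codecs, independently of the console. This is a small sketch; the helper name can_encode and the list of code pages (taken from ones mentioned in this thread, plus UTF-8) are choices made for the example.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the named codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


EURO = "\N{EURO SIGN}"

# Code pages mentioned in this thread, plus UTF-8 for comparison.
for name in ("cp437", "cp852", "cp1250", "cp1252", "utf-8"):
    print(name, can_encode(EURO, name))
```

A character missing from the console's code page is exactly what produces UnicodeEncodeError when printing with the default sys.stdout.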

> try my package win_unicode_console, which tries to solve the issues.

I'll certainly do that.

Thank you for your answer

Stefan

msg227354 - Author: Mark Hammond (mhammond) - Date: 2014-09-23

> The crash you see is maybe not a crash at all.

I'd call it a crash - the REPL shouldn't exit. But it's not necessarily part of this bug.

msg227373 - Author: Terry J. Reedy (terry.reedy) - Date: 2014-09-23

Stefan, the Idle Shell handles the BMP subset of Unicode quite well.

It is superior to the Windows console in other ways too. For instance, cut and paste work normally as for other Windows windows.

cp65001 is known to be buggy and essentially useless. Check the results in any search engine.

msg227374 - Author: Adam Bartoš (Drekin) - Date: 2014-09-23

The Idle shell handles Unicode characters well, but one cannot enter them using dead-key combinations. See [...].

Author: Stefan Champailler (wiz21) - Date: 2014-09-24

Thank you all for your quick and good answers. This level of responsiveness is truly amazing.

I've played a bit with IPython and it works just fine. I can type the euro sign directly with AltGr+E (so I didn't enter a Unicode code). So the bug is basically solved for me. But the python REPL behaviour still looks strange to me. So here's a successful IPython session:

Active code page: 65001

C: PORT-STCA2 pl-PRIVATE horse ipython

Python 3.4.1 v3.4.1:c0e311e010fc, May 18 2014, :22 MSC v.1600 32 bit Intel

Type "copyright", "credits" or "license" for more information.

IPython 2.2.0 -- An enhanced Interactive Python.

?         -> Introduction and overview of IPython's features.

%quickref -> Quick reference.

help      -> Python's own help system.

object?   -> Details about 'object', use 'object??' for extra details.

In 1 : print

In [2]:

msg227450 - Author: Nick Coghlan (ncoghlan) - Date: 2014-09-24

Aye, IPython has the advantage of running in a fully initialised browser, with the backend in a fully initialised Python environment.

CPython's setting up the standard streams for the default REPL at a much lower level, and there are quite a few problems with the way we're currently doing it.

I think Drekin's pointed the way towards substantially improving the situation for 3.5, though.

msg228191 - Author: stijn (stijn) - Date: 2014-10-02

New here, but I think this is the correct issue to get info about this Unicode problem. On the Windows console:

chcp

Active code page: 437

type utf.txt

ƒ Ç é

chcp 65001

Привет

python --version

Python 3.5.0a0

cat utf.py

f = open('utf.txt')

l = f.readline()

print(l)

print(len(l))

python utf.py

ÐŸÑ Ð Ð ÐµÑ

еÑ

13

cat utf_explicit.py

import codecs

f = codecs.open('utf.txt', encoding='utf-8', mode='r')

python utf_explicit.py

ет

7

I partly read through the page but these things are a bit above my head. Could anyone explain

- how to figure out what codec is used by files returned by open()?

- is there a way to change it globally to utf-8?

- the last case is almost correct: it has the correct number of characters, but the print still does something wrong. I got this working by using the stream patch, but got another example on which it is not correct, see below. Any way around this?

type utf2.txt

aαbβcγdδ

cat utf2.py

import codecs

import streams

streams.enable()

f = codecs.open('utf2.txt', encoding='utf-8', mode='r')

print(f.read(1))

print(f.read(2))

print(f.read(4))

python utf2.py

a

α

bβc

γdδ

msg228208 - Author: Adam Bartoš (Drekin) - Date: 2014-10-02

stijn: You are mixing two issues here. One is reading text from a file. There is no problem with it. You just call open(path, encoding=the_encoding_of_the_file). Since the encoding depends on the file, you should provide that information yourself.
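The advice to pass the file's encoding to open() can be tried out with a temporary file. This is a minimal sketch: the file name utf.txt mirrors the one in the question, cp1252 is used only as an example of a wrong guess, and the text is written with escapes so the script itself stays ASCII.

```python
import os
import tempfile

text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # the Russian greeting from the question

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "utf.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Correct: state the encoding the file was written in.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

    # Wrong guess: every UTF-8 byte becomes a separate cp1252 character.
    with open(path, encoding="cp1252") as f:
        misread = f.read()

    # The codec a text file object uses is available as its .encoding attribute.
    with open(path, encoding="utf-8") as f:
        used_codec = f.encoding

print(len(decoded), len(misread), used_codec)
```

The doubled length in the misread case is the same effect as the mojibake output shown in the question: each two-byte UTF-8 sequence is displayed as two unrelated characters.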

Another issue is interactively entering and displaying Unicode characters in the Python REPL in the Windows console. That's what this issue is about. The streams code you use is outdated; for a recent version see https://pypi.python.org/pypi/win_unicode_console and https://github.com/Drekin/win-unicode-console. It's an installable package which tries to solve the issue. The readme also contains a summary of the issue. Try the package and let me know if there is any problem.

msg228210 - Author: stijn (stijn) - Date: 2014-10-02

Drekin: you're right for both input and output. Using encoding with plain open works just fine, and using the latest win-unicode-console does give correct output for the second example as well. Thanks.

msg233347 - Author: Glenn Linderman (v python) - Date: 2015-01-03

Just to note that another side effect of this bug is that stepping through code where the source contains non-ASCII characters results in pdb producing an error when trying to print the source lines. This makes stepping through such source code impossible.

I mention it because it hasn't been mentioned before, and debuggers are mysterious and low-level enough that solutions that might work for normal code may not solve working with the debugger.

msg233350 - Author: Adam Bartoš (Drekin) - Date: 2015-01-03

I tried the following code:

import pdb

pdb.set_trace()

print(1 + 2)

print("αβγ")

When run in vanilla Python it indeed ends with UnicodeEncodeError as soon as it hits the line with non-ASCII characters. However, the solution via the win_unicode_console package seems to work correctly. There is just an issue when you keep calling next even after the main program has ended. It ends with a RuntimeError after a few iterations. I didn't know that pdb can continue debugging after the main program has ended.

msg233916 - Author: Dainis Jonitis (Jonitis) - Date: 2015-01-13

Drekin's module at https://github.com/Drekin/win-unicode-console is great, but there is a small issue with it when running within the debugger in Visual Studio (Python Tools for Visual Studio 2.1 installed). The debugger already wraps stdout and stderr inside the visualstudio_py_debugger._DebuggerOutput wrapper, which does not have the fileno method that win-unicode-console's stream.py check_stream expects. I've created a potential fix for it at https://github.com/Drekin/win-unicode-console/pull/4/commits that checks whether the object has old_out and uses it to get to fileno. There might be much more robust ways to check for wrappers. I just wanted to make you aware, in case this code is used as the basis for Python 3.5.

msg233937 - Author: Steve Dower (steve.dower) - Date: 2015-01-13

It sounds like the script should better handle the case where someone has already changed stdout. We wrap the streams in PTVS so we can forward the output into the IDE, where Unicode will display properly anyway.
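One defensive way to cope with stream objects that lack a working fileno() is sketched below. This is not the check_stream code from win-unicode-console nor the proposed pull request; console_fileno is a hypothetical helper that simply reports "no descriptor" instead of raising, so a caller can leave an already-wrapped stream alone.

```python
import io


def console_fileno(stream):
    """Return stream.fileno(), or None if the stream has no usable descriptor.

    Hypothetical helper: wrappers installed by debuggers or IDEs may lack
    fileno() entirely, and in-memory streams raise when it is called.
    """
    fileno = getattr(stream, "fileno", None)
    if fileno is None:
        return None
    try:
        return fileno()
    except (OSError, ValueError):
        # io.UnsupportedOperation derives from both OSError and ValueError.
        return None


print(console_fileno(io.StringIO()))  # in-memory stream
print(console_fileno(object()))       # object with no fileno attribute
```

A caller could then replace a stream only when a descriptor is found and it refers to a console, and otherwise keep whatever wrapper is already installed.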

Our wrapper missing fileno is a bug on our side, but finding the original one will break output forwarding.

msg234019 - Author: Adam Bartoš (Drekin) - Date: 2015-01-14

Note that win-unicode-console replaces the stdio streams rather than wrapping them. So the desired state would be Unicode stream objects wrapped by PTVS. There would be no problem if the win-unicode-console stream replacement occurred before PTVS wraps them, which should be the case when Unicode streams for Windows are handled by Python 3.5 itself. Is there any way to run custom Python code like sitecustomize before PTVS wraps the stdio streams?

msg234020 - Author: Dainis Jonitis (Jonitis) - Date: 2015-01-14

Presumably Unicode streams would also fix file redirects. Currently, if you want to redirect stdout output to a file, it throws. For example, in PowerShell:

C:\Python34\python.exe .\test.py | out-file -Encoding utf8 -FilePath test.txt

msg234096 - Author: Adam Bartoš (Drekin) - Date: 2015-01-15

File redirection has nothing to do with win-unicode-console and this issue. When stdout is redirected, it is not a tty, so win-unicode-console doesn't replace the stream object, which is the right thing to do. You got UnicodeEncodeError because Python creates sys.stdout with an encoding based on your locale. In my case it is cp1250, which cannot encode the whole of Unicode. You can control the encoding used by setting the PYTHONIOENCODING environment variable. For example, if you have a script producer.py, which prints some Unicode characters, and consumer.py, which just does print(input()), then "py producer.py | py consumer.py" shows that redirection works when PYTHONIOENCODING is set, e.g. to utf-8.

msg234371 - Author: Mark Hammond (mhammond) - Date: 2015-01-20

> File redirection has nothing to do with win-unicode-console

Thank you, that comment is spot on - there are multiple issues being conflated here. This bug is purely about the tty/console behaviour.

msg242884 - Author: Nick Coghlan (ncoghlan) - Date: 2015-05-11

It sounds like fixing this properly requires fixing issue 17620 first so the interactive interpreter actually uses sys.stdin, so I've flagged that as a dependency.

msg254405 - Author: dead1ne - Date: 2015-11-09

I've tried addressing the output problem by subclassing TextIOWrapper to use the Windows functions GetConsoleOutputCP and WideCharToMultiByte.

I've tested this as well as I can without figuring out how to install a better font for the Windows console. It appears to work on both Python 3.4 and 2.7, although there may be an issue with 2.7 and CJK Extension B and higher code points.

Hopefully this is useful in finally resolving the issue. Also, I think some maintenance patch for 2.7 is in order, as currently it fails utterly if you set the console to 65001, since it doesn't recognize it. I had to wrap all print statements in try/except so it wouldn't fail before testing the wrapper.

msg254407 - Author: Adam Bartoš (Drekin) - Date: 2015-11-09

dead1ne: Hello, I'm maintaining a package that tries to solve this issue: https://github.com/Drekin/win-unicode-console. There are actually many related problems.