Ticket #11 (closed defect: fixed)

Opened 8 years ago

Last modified 8 years ago

Unicode support

Reported by: edemaine@… Owned by: xi
Priority: normal Component: pyyaml
Severity: normal Keywords:


I would like to bring up two issues with Unicode support in PyYAML's emitter. First, it emits a type annotation of !!python/unicode whenever emitting a unicode string that can be encoded in ASCII:

>>> print yaml.dump(u'Fran\xe7ais')
"Fran\xE7ais"

>>> print yaml.dump(u'hello')
!!python/unicode 'hello'

I assume this is to force the value to be a unicode string when read back in. However, it makes for rather ugly files. In my case, and I imagine many others, I really don't care whether a string is stored as a 'str' or as a 'unicode' object in Python. And in YAML, the native string type is Unicode anyway. So it seems strange to have this distinction at the level of the YAML file. On the other hand, I understand the desire to have yaml.load(yaml.dump(x)) == x. Perhaps this should be another configuration option? (Of course, I could just convert my ASCII-encodable unicode objects to str objects...)

The second issue is that the emitter escapes non-ASCII characters even when every character is printable (according to 'c-printable' in the YAML spec) and the output encoding (UTF-8) supports such characters. I don't find this as elegant as it could be. Instead of the "Fran\xE7ais" output above, I would have hoped for the UTF-8-encoded byte string Fran\xc3\xa7ais\n.
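To make the two byte-level forms contrasted here concrete, the following is a minimal sketch in plain Python 3 (an assumption; the ticket itself uses Python 2). It does not call PyYAML. It only shows what "escaped ASCII" and "raw UTF-8" look like for the same string, and the variable names are illustrative.

```python
# Sketch only: stdlib string encoding, not PyYAML's emitter.
text = 'Fran\xe7ais'  # the string from the ticket (U+00E7 is c-cedilla)

# Raw UTF-8 bytes, the form the reporter hoped to see in the output.
utf8_bytes = text.encode('utf-8')

# ASCII-only bytes with the non-ASCII character escaped. Python's
# 'backslashreplace' handler writes lowercase hex (\xe7), whereas the
# ticket shows PyYAML writing uppercase (\xE7).
escaped = text.encode('ascii', 'backslashreplace')

print(utf8_bytes)
print(escaped)
```

The escaped form survives any ASCII-compatible channel; the UTF-8 form is shorter to read but requires the consumer to decode it as UTF-8.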

I guess this is as stylistic an issue as the previous one. It makes me wonder again whether there should be a Style object that can specify various emitting options, instead of many keyword arguments...


Change History

comment:1 Changed 8 years ago by xi

  • Status changed from new to closed
  • Resolution set to fixed

You are right about me wanting type(yaml.load(yaml.dump(x))) to be equal to type(x). Still it can be easily overridden. The easiest way is to use safe_dump:

>>> print yaml.safe_dump(u'hello')

safe_dump is "safe" because it produces only standard YAML tags; no !!python/something tags are emitted. If you still want to use dump, you may change the unicode representer:

>>> yaml.add_representer(unicode, lambda dumper, value: dumper.represent_scalar(u'tag:yaml.org,2002:str', value))
>>> print yaml.dump(u'hello')

You might need to change the str representer too, but the corresponding code will be longer. Check SafeRepresenter.represent_str.

The second issue is already addressed, try:

>>> print yaml.dump(u'Fran\xe7ais', allow_unicode=True)

The default is to escape non-ASCII characters because they will produce garbage in non-utf8 terminals.

The latter issue is stylistic, but the former is definitely not a stylistic issue. Different tags imply that the corresponding scalar nodes are different while the scalar style does not affect equality of nodes. You may be right about some kind of a Style object, but I need more use cases before introducing it.

I'm closing the ticket, but feel free to reopen it if you feel your issues are not completely solved.

comment:2 Changed 8 years ago by edemaine@…

Wow, that was a fast response. I didn't realize that's what the Safe line of dumpers did; thanks. And I obviously didn't know about the allow_unicode option; it's exactly what I wanted. Thanks so much!

The only thing more I could hope for is documentation of all these features (other than reading through the code). Is this in process? Can I help?

comment:3 Changed 8 years ago by xi

Well, I'm writing the docs now, check PyYAMLDocumentation. But it's just a rough draft.

As I'm not a native speaker, writing English prose is a PITA for me and the result is mediocre, so any help will be greatly appreciated. If you find a mistake or an unclear expression, feel free to fix it. Well, I would be glad if someone wrote the docs for me, but it's not going to happen. :)

Anyway, you don't need to check it now since I'm modifying it. But if you are willing to review it later, I would really appreciate it.

comment:4 Changed 8 years ago by edemaine@…

I have been reading that documentation, and it seems well written. (But I'm also happy to review it--send me email when you would like me to.) It just doesn't yet describe all of the features (particularly all the options), which I can understand :-). Some documentation about the design of the system would be helpful too, in particular, which classes do what, so it's clearer how to extend/modify.

P.S. allow_unicode is working great.

comment:5 Changed 8 years ago by edemaine@…

  • Status changed from closed to reopened
  • Resolution fixed deleted

I found a bug with allow_unicode = True:

>>> yaml.load(yaml.dump(u'\udd00'))
>>> yaml.load(yaml.dump(u'\udd00',allow_unicode=True))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/", line 59, in load
    loader = Loader(stream)
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/", line 34, in __init__
    Reader.__init__(self, stream)
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/", line 114, in __init__
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/", line 167, in determine_encoding
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/", line 201, in update
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/", line 176, in check_printable
    'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #xdd00: special characters are not allowed
  in "<string>", position 0

I believe the offending lines are 962-964 of (Emitter.write_double_quoted):

            if ch is None or ch in u'"\\\x85\u2028\u2029\uFEFF' \
                    or not (u'\x20' <= ch <= u'\x7E'
                            or (self.allow_unicode and ch > u'\x7F')):

Compare this with line 169 of (Reader.NON_PRINTABLE):

    NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')

The latter is consistent with 'c-printable' in the YAML spec (except that it doesn't include #x10000-#x10FFFF--no support for 32-bit?). The former only seems to support 8-bit unicode properly...
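The mismatch between the two quoted snippets can be reproduced without PyYAML. The sketch below is written for Python 3 (an assumption; the ticket uses Python 2). The helper names `emitter_escapes` and `reader_rejects` are hypothetical: they only mirror the quoted condition and regex, and the `ch is None` end-of-input case is omitted.

```python
import re

# Reader-side pattern as quoted in the ticket: any match is rejected on load.
NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')


def emitter_escapes(ch, allow_unicode):
    """Hypothetical helper: True if the quoted emitter test would escape ch."""
    return ch in '"\\\x85\u2028\u2029\uFEFF' or not (
        '\x20' <= ch <= '\x7E' or (allow_unicode and ch > '\x7F'))


def reader_rejects(ch):
    """Hypothetical helper: True if the quoted reader regex rejects ch."""
    return NON_PRINTABLE.search(ch) is not None


ch = '\udd00'  # lone surrogate from the failing example

# With allow_unicode=True the emitter test lets the character through
# unescaped, yet the reader pattern refuses it: the round trip breaks.
mismatch = (not emitter_escapes(ch, True)) and reader_rejects(ch)
print(mismatch)
```

Without allow_unicode the same character is escaped by the emitter test, which matches the ticket's observation that only the allow_unicode=True call fails.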

comment:6 Changed 8 years ago by xi

  • Status changed from reopened to closed
  • Resolution set to fixed

Thanks, fixed in [153].

Python does not support 32-bit Unicode values.

