Ticket #138 (new enhancement)

Opened 5 years ago

Last modified 5 years ago

Patch to read/write invalid UTF-8

Reported by: spitzak@… Owned by: xi
Priority: normal Component: libyaml
Severity: blocker Keywords:
Cc:

Description

I would like to losslessly store arbitrary byte strings in fields that are *LIKELY* to be text (and that should therefore stay visible and editable as text). This is impossible unless invalid UTF-8 is allowed. Obvious examples are URLs, Unix filenames, and strings that are not actually UTF-8 but are stored in fields expected to hold UTF-8.

The attached patch encodes each byte of an invalid portion of UTF-8 as a new \XNN sequence (capital 'X'), so the output file is legal UTF-8 and can also be written in UTF-16 form. It also stops emitting \xNN (it writes \u00NN instead) so that \xNN can be reserved for this purpose in the future. The reader is modified to accept \XNN and also to accept raw invalid UTF-8 strings from a UTF-8 encoded input file.
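To illustrate the \XNN idea described above, here is a minimal Python sketch. It is not the patch itself (which modifies libyaml's C code); the function names `escape_invalid_utf8` and `unescape_invalid_utf8` are hypothetical, and doubling literal backslashes is an assumption made here only so that the round trip is unambiguous.

```python
def escape_invalid_utf8(raw: bytes) -> str:
    """Return legal text in which every undecodable byte becomes \\XNN."""
    # surrogateescape maps each undecodable byte 0xNN to the lone
    # surrogate U+DC00 + 0xNN, so invalid bytes are easy to spot.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\X%02X" % (cp - 0xDC00))  # one escape per bad byte
        elif ch == "\\":
            out.append("\\\\")  # assumption: literal backslashes are doubled
        else:
            out.append(ch)
    return "".join(out)


def unescape_invalid_utf8(text: str) -> bytes:
    """Inverse of escape_invalid_utf8 for text that function produced."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text.startswith("\\X", i) and i + 4 <= len(text):
            out.append(int(text[i + 2:i + 4], 16))  # restore the raw byte
            i += 4
        elif text.startswith("\\\\", i):
            out.append(0x5C)  # a doubled backslash is one literal backslash
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)
```

Because the escaped form contains only valid code points, it can be written out as UTF-8 or UTF-16 and still be converted back to the original bytes.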

This patch also makes it read/write invalid UTF-16, which can easily occur in Windows filenames and other APIs that use 16-bit words for strings. This has not been tested much, as I am not using it, but it was a simple change: I just removed the validity tests.

It also reads/writes invalid UTF-8 in tags, by printing all the bytes with %NN notation. This matches how invalid UTF-8 in URLs is handled.
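The %NN notation for tags is the same percent-encoding used in URLs, so a rough Python sketch can lean on the standard library. The names `encode_tag`, `decode_tag`, and `TAG_SAFE` are hypothetical, and the set of characters left unescaped is an assumption. Note also that `quote_from_bytes` escapes every non-ASCII byte, valid UTF-8 or not, which may differ from what the patch does with valid multi-byte characters.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Assumption: punctuation that commonly appears in tags is left unescaped.
TAG_SAFE = "!:/,"


def encode_tag(raw: bytes) -> str:
    """Write a tag as ASCII, turning every non-ASCII byte into %NN."""
    return quote_from_bytes(raw, safe=TAG_SAFE)


def decode_tag(text: str) -> bytes:
    """Recover the original bytes, including bytes that are not valid UTF-8."""
    return unquote_to_bytes(text)
```

For example, `encode_tag(b"tag:example.com,2000:caf\xe9")` yields `tag:example.com,2000:caf%E9`, and `decode_tag` returns the original bytes.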

The patch also simplifies the code considerably by advancing one byte at a time wherever it knows the character is a single byte, or knows that the pattern it is testing against cannot match when pointing at the middle of a multi-byte UTF-8 sequence. In most cases you do not need to know the width of the characters to process UTF-8.
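The byte-at-a-time point relies on a property of UTF-8: every byte of a multi-byte sequence is 0x80 or above, so a comparison against an ASCII byte can never succeed in the middle of a character. A small Python sketch of that idea follows; `find_ascii_byte` is a hypothetical helper written for illustration, not a libyaml function.

```python
def find_ascii_byte(buf: bytes, target: int, start: int = 0) -> int:
    """Return the index of the first byte equal to target, or -1."""
    if target >= 0x80:
        raise ValueError("target must be an ASCII byte")
    pos = start
    while pos < len(buf):
        if buf[pos] == target:
            return pos
        # Advance a single byte. Lead and continuation bytes are all
        # >= 0x80, so they cannot equal an ASCII target, and invalid
        # bytes are skipped the same way without any width lookup.
        pos += 1
    return -1


# A key made of two-byte characters, followed by a stray invalid byte.
buf = "ключ: значение".encode("utf-8") + b"\xff"
colon = find_ascii_byte(buf, ord(":"))
key = buf[:colon]
```

The scan lands exactly on the `:` separator without decoding anything, and the trailing invalid byte does not affect it.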

Attachments

patch Download (36.3 KB) - added by spitzak@… 5 years ago.
Patch to enable invalid UTF-8 and UTF-16 in scalars and tags
patch.2 Download (37.8 KB) - added by spitzak@… 5 years ago.
New patch that fixes handling of %nn in tags
new.patch Download (37.8 KB) - added by spitzak@… 5 years ago.
Same patch but renamed so it displays correctly

Change History

Changed 5 years ago by spitzak@…

Patch to enable invalid UTF-8 and UTF-16 in scalars and tags

comment:1 Changed 5 years ago by spitzak@…

  • Component changed from pyyaml to libyaml
  • Severity changed from normal to blocker

Changing component to libyaml. Also setting this to "blocker", as I cannot use YAML without this. This does not mean it has to be added; without it I will simply use my own file format (probably very similar to YAML). I do feel this would be a very good addition to the standard, and that the lack of it must be a blocking problem for many other projects that would like to use YAML.

Changed 5 years ago by spitzak@…

New patch that fixes handling of %nn in tags

Changed 5 years ago by spitzak@…

Same patch but renamed so it displays correctly
