Ticket #138 (new enhancement)
Patch to read/write invalid UTF-8
| Reported by: | spitzak@… | Owned by: | xi |
|---|---|---|---|
| Priority: | normal | Component: | libyaml |
| Severity: | blocker | Keywords: | |
| Cc: | | | |
Description
I would like to losslessly store arbitrary byte strings in fields that are *likely* to be text (and that we would therefore like to keep visible and editable as text). This is impossible unless invalid UTF-8 is allowed. Obvious examples are URLs, Unix filenames, and strings stored in fields expected to be UTF-8 that are not actually UTF-8.
The following patch encodes each byte of an invalid portion of UTF-8 as a new \XNN escape (capital 'X'), so the output file is legal UTF-8 and can also be written in UTF-16 form. It also stops emitting \xNN (it writes \u00NN instead), so that \xNN remains free to take over this role in the future. The reader is modified to accept \XNN and also to accept raw invalid UTF-8 from a UTF-8 encoded input file.
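To make the \XNN scheme concrete, here is a minimal Python sketch of the idea. It is not the patch itself (which is C code in libyaml). The function names `escape_invalid_utf8` and `unescape_to_bytes` are hypothetical, and doubling literal backslashes is an assumption made here so the round trip stays unambiguous.

```python
import re

def escape_invalid_utf8(raw: bytes) -> str:
    """Return valid Unicode text in which each undecodable byte appears as \\XNN."""
    # surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "\\":
            out.append("\\\\")  # assumption: double literal backslashes
        elif 0xDC80 <= code <= 0xDCFF:
            out.append("\\X%02X" % (code - 0xDC00))  # one escape per invalid byte
        else:
            out.append(ch)
    return "".join(out)

# Matches either a doubled backslash or a \XNN escape.
_ESCAPE = re.compile(r"\\(\\|X([0-9A-Fa-f]{2}))")

def unescape_to_bytes(text: str) -> bytes:
    """Inverse of escape_invalid_utf8: rebuild the original byte string."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        out += text[pos:m.start()].encode("utf-8")
        out += b"\\" if m.group(2) is None else bytes([int(m.group(2), 16)])
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)
```

With this sketch, a Latin-1 `b"caf\xe9"` is written as `caf\XE9`, while correctly encoded text passes through unchanged, so the file stays editable as text.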
This patch also makes it read/write invalid UTF-16, which can easily occur in Windows filenames and other APIs that use 16-bit words for strings. This has not been tested much, as I am not using it, but it was a simple fix: just remove the validity tests.
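As an illustration of what "invalid UTF-16" means here, the following Python sketch builds a hypothetical name containing an unpaired high surrogate and shows that it only round-trips when the validity check is skipped. It uses Python's standard `surrogatepass` error handler as a stand-in for removing the tests; the patch itself does this in C.

```python
# Hypothetical Windows-style name: 'A', an unpaired high surrogate, 'B'.
units = [0x0041, 0xD800, 0x0042]
raw = b"".join(u.to_bytes(2, "little") for u in units)

def is_strict_utf16(data: bytes) -> bool:
    """True if data decodes as well-formed little-endian UTF-16."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

# "surrogatepass" lets the unpaired code unit through instead of rejecting it.
text = raw.decode("utf-16-le", errors="surrogatepass")
restored = text.encode("utf-16-le", errors="surrogatepass")
```

A strict decoder rejects `raw`, whereas the lenient path returns the same six bytes it was given.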
It also reads/writes invalid UTF-8 in tags, by printing each invalid byte in %NN notation. This matches how invalid UTF-8 in URLs is handled.
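The tag handling can be sketched the same way. In this hedged Python example, `escape_tag_bytes` is a hypothetical helper that percent-encodes only the invalid bytes (plus a literal `%`, an assumption needed to keep the encoding reversible). Decoding uses the standard library's `urllib.parse.unquote_to_bytes`.

```python
from urllib.parse import unquote_to_bytes

def escape_tag_bytes(raw: bytes) -> str:
    """Write each undecodable byte of a tag as %NN, leaving valid text alone."""
    # Undecodable byte 0xNN becomes the lone surrogate U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "%":
            out.append("%25")  # assumption: escape '%' so decoding is unambiguous
        elif 0xDC80 <= code <= 0xDCFF:
            out.append("%%%02X" % (code - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)

tag = escape_tag_bytes(b"!caf\xe9")   # Latin-1 byte, invalid as UTF-8
original = unquote_to_bytes(tag)      # recovers the raw bytes
```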
The patch also simplifies the code considerably by advancing one byte at a time wherever it knows the character is a single byte, or knows that the pattern being tested cannot match in the middle of a multi-byte UTF-8 sequence. In most cases you do not need to know the width of the characters to process UTF-8.
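The reason byte-wise stepping is safe is that every byte of a multi-byte UTF-8 sequence has its high bit set, so a comparison against an ASCII value can never succeed in the middle of a character. A small Python sketch of that property follows; `find_ascii` is a hypothetical helper, not a function from libyaml.

```python
def find_ascii(buf: bytes, target: int, start: int = 0) -> int:
    """Index of the first byte equal to `target` (an ASCII code), or -1."""
    assert target < 0x80, "only ASCII targets are safe to match byte-wise"
    i = start
    while i < len(buf):
        # Lead and continuation bytes are all >= 0x80, so an ASCII target
        # cannot match inside a multi-byte character (or an invalid byte).
        if buf[i] == target:
            return i
        i += 1  # advance one byte; the character width is never needed
    return -1

# Multi-byte key, ASCII separator, and a trailing invalid byte.
buf = "ключ: значение".encode("utf-8") + b"\xff"
idx = find_ascii(buf, ord(":"))
key = buf[:idx].decode("utf-8")
```

The scan finds the separator without decoding anything, and it is unaffected by the invalid byte at the end of the buffer.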
Attachments
Change History
comment:1 Changed 4 years ago by spitzak@…
- Component changed from pyyaml to libyaml
- Severity changed from normal to blocker
Changing component to libyaml. Also setting this to "blocker", as I cannot use YAML without this. This does not mean it has to be added; without it I will simply use my own file format (probably very similar to YAML). I do feel this would be a very good addition to the standard, and that its absence must be a blocking problem for many other projects that would like to use YAML.


Patch to enable invalid UTF-8 and UTF-16 in scalars and tags