Ticket #138 (new enhancement)
Patch to read/write invalid UTF-8
Reported by: spitzak@… | Owned by: xi
We would like to losslessly store arbitrary byte strings in file fields that are *likely* to be text (and that we therefore want to remain visible and editable as text). This is impossible unless invalid UTF-8 is allowed. Obvious examples are URLs, Unix filenames, and strings that are not actually UTF-8 but are stored in fields expected to be UTF-8.
The following patch encodes each byte of an invalid UTF-8 portion as a new \XNN sequence (capital 'X'), so the output file is legal UTF-8 and can also be written in UTF-16 form. It also stops emitting \xNN (writing \u00NN instead) so that the \xNN escape can be reserved for this purpose in the future. The reader is modified to accept \XNN and also to accept raw invalid UTF-8 from a UTF-8 encoded input file.
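The escaping scheme described above can be sketched in Python. This is not the actual patch; it is an illustrative round trip, and the helper names (`escape_invalid_utf8`, `unescape_to_bytes`) are hypothetical. It leans on Python's `surrogateescape` error handler, which maps each undecodable byte 0xNN to the lone surrogate U+DCNN:

```python
import re

def escape_invalid_utf8(data: bytes) -> str:
    # Decode, smuggling invalid bytes through as surrogates U+DC80-U+DCFF,
    # then rewrite each smuggled byte as a \XNN escape (one per raw byte).
    text = data.decode('utf-8', errors='surrogateescape')
    return ''.join(
        '\\X%02X' % (ord(c) - 0xDC00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )

def unescape_to_bytes(text: str) -> bytes:
    # Reverse direction: turn each \XNN escape back into its raw byte.
    # (A real reader would also have to handle a literal backslash that
    # happens to be followed by 'X'; this sketch ignores that case.)
    parts = re.split(r'\\X([0-9A-Fa-f]{2})', text)
    out = bytearray()
    for i, part in enumerate(parts):
        if i % 2:
            out.append(int(part, 16))
        else:
            out += part.encode('utf-8')
    return bytes(out)
```

The escaped result is ordinary valid UTF-8, so it can be re-encoded as UTF-16 for output without any further special handling.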
This patch also makes it read/write invalid UTF-16, which can easily occur in Windows filenames and in other APIs that use 16-bit words for strings. This has not been tested much, as I am not using it, but it was a simple fix: just remove the validity tests.
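The usual form of invalid UTF-16 is an unpaired surrogate, which strict codecs reject. As a rough illustration (not the patch itself), Python exposes the same "skip the validity test" idea through its `surrogatepass` error handler:

```python
# A lone high surrogate is legal in a Windows filename but invalid as
# strict UTF-16/UTF-8. Round-tripping it requires relaxing validity
# checks, e.g. via Python's 'surrogatepass' error handler.
name = '\ud800x'                                  # unpaired surrogate + 'x'
raw = name.encode('utf-8', 'surrogatepass')       # strict 'utf-8' would raise
assert raw == b'\xed\xa0\x80x'
assert raw.decode('utf-8', 'surrogatepass') == name
```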
It also reads/writes invalid UTF-8 in tags by printing all the bytes in %NN notation. This matches how invalid UTF-8 in URLs is handled.
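The %NN convention is exactly URL percent-encoding, so a rough sketch of the tag treatment (the tag value here is made up for illustration) can reuse the standard library:

```python
from urllib.parse import quote, unquote_to_bytes

# Invalid or non-ASCII bytes in a tag are printed as %NN, the same
# notation URLs use; decoding recovers the original bytes exactly.
tag_bytes = b'tag:example\xff'            # hypothetical tag with a stray byte
encoded = quote(tag_bytes, safe=':')      # keep ':' literal, escape the rest
assert encoded == 'tag:example%FF'
assert unquote_to_bytes(encoded) == tag_bytes
```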
The patch also simplifies the code considerably by advancing one byte at a time wherever the character is known to be a single byte, or wherever the pattern being tested cannot match in the middle of a multi-byte UTF-8 sequence. In most cases you do not need to know the width of the characters to process UTF-8.
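The reason byte-at-a-time scanning is safe: every byte of a multi-byte UTF-8 character has its high bit set (lead bytes 0xC2-0xF4, continuation bytes 0x80-0xBF), so a search for an ASCII delimiter can never match inside a character. A minimal sketch, with a hypothetical helper name:

```python
def find_ascii_delim(data: bytes, delim: bytes) -> int:
    # Scan byte by byte for an ASCII delimiter. No character-width
    # bookkeeping is needed: an ASCII byte (< 0x80) never occurs inside
    # a multi-byte UTF-8 sequence, so a hit is always a real delimiter.
    assert len(delim) == 1 and delim[0] < 0x80
    for i, b in enumerate(data):
        if b == delim[0]:
            return i
    return -1

text = 'héllo:wörld'.encode('utf-8')
assert find_ascii_delim(text, b':') == text.index(b':')
```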