Subject: Inconsistent error message returned on invalid UTF-8 character
This arose from an issue I had with the W3C validator, and I was referred here because the error messages are generated by Encode. Please see two test cases here:

For the first example, the validator sees the bytes \xED\xA0\x80 and complains "utf8 "\xD800" does not map to Unicode". For the second, the validator sees the bytes \xC1\xAA and complains "utf8 "\xC1" does not map to Unicode".

The error messages are inconsistent. The first complains about the hypothetical code point \xD800 that the bytes would otherwise map to, while the second complains about the actual byte in the data that wasn't valid. After discussion on the email@example.com list we decided that the first error message is at fault. Follow conversation here:

The error message "utf8 "\xD800" does not map to Unicode" is output when the byte sequence \xED\xA0\x80 is encountered. This makes the source of the error hard to find, because \xD800 doesn't appear in the document, except in the sense that those bytes would represent \xD800 if it were allowed in UTF-8. The error message should report the actual invalid bytes encountered, rather than something like \xD800.

Tested on: W3C Markup Validator 0.8.2. I am sorry that I don't know which Perl and Encode version it's running.
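To illustrate where \xD800 comes from, here is a rough Python sketch (not Encode's code, and not how the validator is implemented). It applies the UTF-8 bit layout to each sample with no validity checks, which is presumably the kind of calculation that yields the value shown in the first message. The helper names `naive_code_point`, `escape_bytes`, and `first_rejected_offset` are made up for this example.

```python
def naive_code_point(data: bytes) -> int:
    """Apply the UTF-8 bit layout to one sequence, skipping all validity checks."""
    lead = data[0]
    if lead < 0x80:                # single byte, plain ASCII
        return lead
    if lead >> 5 == 0b110:         # 110xxxxx: one continuation byte follows
        code_point, count = lead & 0x1F, 1
    elif lead >> 4 == 0b1110:      # 1110xxxx: two continuation bytes follow
        code_point, count = lead & 0x0F, 2
    elif lead >> 3 == 0b11110:     # 11110xxx: three continuation bytes follow
        code_point, count = lead & 0x07, 3
    else:
        raise ValueError("not a UTF-8 lead byte")
    for byte in data[1:1 + count]:
        # Each continuation byte contributes its low six bits.
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point


def escape_bytes(data: bytes) -> str:
    """Render raw bytes in the \\xHH form used in this report."""
    return "".join(f"\\x{byte:02X}" for byte in data)


def first_rejected_offset(data: bytes) -> int:
    """Offset at which Python's strict UTF-8 decoder gives up, or -1 if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return -1


for sample in (b"\xed\xa0\x80", b"\xc1\xaa"):
    print(escape_bytes(sample),
          hex(naive_code_point(sample)),
          first_rejected_offset(sample))
```

The first sample works out to 0xd800, a surrogate, which is not allowed in UTF-8. The second works out to 0x6a, an overlong encoding of "j", so \xC1 can never be a valid lead byte. In both cases the strict decoder stops at offset 0, so a message quoting the bytes at that offset (as `escape_bytes` would render them) points at something that can actually be searched for in the document.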