Talk:Tech/News/2014/24

"Non-English" metadata in Ogg files

A comment recently added says: "Re-word ogg line. Transcoded is technically inaccurate - All files that I've seen have correctly been encoded utf-8. Issue is that we had some incorrect coding in the db. The reason the ar example works is because I specificly purged that specific file.)"

This is incorrect. The DB did not contain incorrectly encoded data; the Ogg reader was parsing Ogg files incorrectly: it correctly assumed that the file's metadata was encoded in UTF-8 when processing it, but it decoded that input from UTF-8 to the internal encoding of the extension, which then output it as ISO-8859-1. This worked only if the metadata in the Ogg file was written in English (that is, in plain ASCII, whose bytes are identical in both encodings). It caused garbled output for anything else.
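To illustrate why only English text survived, here is a minimal Python sketch of the general effect: bytes that are valid UTF-8 get interpreted as ISO-8859-1 somewhere along the way. This is not the extension's actual code; the function name `misread_as_latin1` and the sample strings are made up for the demonstration.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


# ASCII-only text has the same bytes in both encodings, so it is unaffected.
print(misread_as_latin1("Title"))   # Title

# Each non-ASCII character becomes two or more unrelated Latin-1 characters.
print(misread_as_latin1("Café"))    # CafÃ©
```

The same happens to every non-ASCII script (Arabic, Cyrillic, and so on), which is why the bug was invisible for English metadata.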

The code of the extension was fixed by no longer assuming ISO-8859-1 output, an assumption that caused garbled data on output even when the input was correct. This was absolutely not a problem in the database or in the cache (to be purged) but in the code of the Ogg file parser of the Mediahandler extension, and it also caused problems in MediaViewer, which used the same Ogg file parser.

The problem in the code has been fixed; maybe the description page needs to be purged (but it is evicted very quickly anyway from the LRU list of the cache, after less than 1 week). But there do remain some Ogg files whose internal metadata is not encoded in UTF-8, notably many that were produced with tools using the local encoding of the OS or of the user environment where the file was encoded, or edited to add this metadata. So there are indeed Ogg files whose metadata is internally encoded in various Windows code pages (windows-1252 is frequent) or in some ISO-8859 variants (when the encoder was in a device running some embedded version of Linux, or when metadata was added by some online services).

The initial sentence suggested dropping these non-English characters, and the restored sentence is even worse, as it states that this metadata should use English only, which is wrong. Checking that the metadata uses the correct encoding is enough. But for now the Ogg file parser used in the MediaWiki extension never checks that this metadata is valid UTF-8 before using it. It also does not check whether that data contains malicious characters: because it does not make sure they are treated literally, text read from metadata could cause HTML tags or JavaScript to be generated on output.
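The kind of check meant here can be sketched in a few lines of Python. This is only an illustration under my own assumptions (the helper name `is_valid_utf8` is hypothetical, and the real parser is written in PHP): a strict decode rejects byte strings that are not well-formed UTF-8, such as metadata written in windows-1252.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw metadata bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_title = "Café".encode("utf-8")            # b'Caf\xc3\xa9'
cp1252_title = "Café".encode("windows-1252")   # b'Caf\xe9'

print(is_valid_utf8(utf8_title))    # True
print(is_valid_utf8(cp1252_title))  # False
```

Note that such a check can only reject malformed input. Pure ASCII metadata passes whatever encoding the tool used, and no English-only restriction is needed.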

So all remaining issues are in the Ogg files themselves, even if other corrections or checks should still be added to the Ogg file parser used by the MediaWiki extensions.

There are exactly the same issues with the metadata tags of MP3 files, or of other media files (like SVG images) that we currently accept. If we accept these files with their metadata, we must make sure that it will be processed as literal text and never be interpreted as HTML. This means that the extensions *must* escape the extracted metadata. And there should be some file validator that will detect malicious files. The escaping must first make sure that the internal encoding is valid UTF-8 (otherwise it should replace the invalid sequences). Then it should HTML-encode some ASCII characters such as the ampersand and the less-than sign, and replace some C0 and C1 controls with whitespace or drop most of them. verdy_p (talk) 00:27, 10 June 2014 (UTC)
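The escaping steps described above could look roughly like the following Python sketch. It is not the extension's code: the name `sanitize_metadata` is hypothetical, and it makes two simplifying assumptions, namely that every C0 and C1 control (including tab and newline) is replaced by a space rather than dropped, and that quotes and the greater-than sign are escaped in addition to the ampersand and less-than sign.

```python
import html
import re

# C0 controls (U+0000-U+001F) and C1 controls (U+0080-U+009F).
_CONTROLS = re.compile(r"[\x00-\x1f\x80-\x9f]")


def sanitize_metadata(raw: bytes) -> str:
    """Turn raw metadata bytes into text that is safe to place in HTML."""
    # 1. Force valid UTF-8: invalid sequences become U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # 2. Replace control characters with a space (assumption: none are kept).
    text = _CONTROLS.sub(" ", text)
    # 3. HTML-encode &, <, > and both quote characters.
    return html.escape(text, quote=True)


print(sanitize_metadata(b"<script>alert(1)</script>"))
# &lt;script&gt;alert(1)&lt;/script&gt;
print(sanitize_metadata("Tom & Jerry".encode("utf-8")))
# Tom &amp; Jerry
```

The order matters: decoding comes first so that the later steps work on characters rather than on raw bytes, and escaping comes last so that nothing reintroduces markup afterwards.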