Internationalization

Internationalization — using extended character sets with a libiptcdata application

Supporting Internationalization with libiptcdata

The IPTC IIM standard supports storing data with nearly any character set. According to the standard, the data of Record 1 should be in plain ASCII, but data of the following records should follow the character set established by dataset 1:90, the "character set" dataset. This dataset contains control functions according to the ISO 2022 standard, which allow for switching between different character sets. However, there are several problems with this approach:

  • Nearly all existing IPTC-aware applications ignore this part of the standard. They usually force all characters to ASCII or use the Latin-1 character set, without identifying it in dataset 1:90.

  • The ISO 2022 standard is very complicated and lacks a free reference implementation. In addition, the standard is rarely used since Unicode provides a superior alternative.
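As a concrete illustration of what dataset 1:90 holds, the ISO 2022 control function that designates UTF-8 is the three-byte sequence ESC % G (0x1B 0x25 0x47). The sketch below checks a raw 1:90 value for exactly that sequence. The helper name charset_value_is_utf8() is hypothetical and is not part of libiptcdata; it also assumes the value contains the UTF-8 designation and nothing else.

```c
#include <stddef.h>
#include <string.h>

/* ISO 2022 control function that designates UTF-8: ESC % G. */
static const unsigned char UTF8_DESIGNATION[] = { 0x1B, 0x25, 0x47 };

/*
 * Hypothetical helper (not a libiptcdata function).
 * Returns 1 if the raw value of dataset 1:90 is exactly ESC % G,
 * and 0 for any other value, including an empty one.
 */
int charset_value_is_utf8(const unsigned char *value, size_t len)
{
    return len == sizeof UTF8_DESIGNATION
        && memcmp(value, UTF8_DESIGNATION, sizeof UTF8_DESIGNATION) == 0;
}
```

Any other value would have to be interpreted according to the full ISO 2022 rules, which is where the problems listed above begin.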

libiptcdata does not implement the complete ISO 2022 standard (in fact, it implements almost none of it), but can still be used successfully with multiple character sets. Here's how:

  1. When IPTC data is added to an image file for the first time, always store it in the UTF-8 character set. It is the responsibility of the application to make sure that iptc_dataset_set_data() is always called with UTF-8 encoded data; this generally happens automatically with modern toolkits such as GTK+. The application should also call iptc_data_set_encoding_utf8(), which sets the value of dataset 1:90 to indicate that UTF-8 is being used.

  2. When reading or modifying IPTC data saved by another application, first use the iptc_data_get_encoding() function to find out what encoding the data has been stored in. If it's not UTF-8, it may be hard to identify the character set, since the ISO 2022 standard is generally not followed. Often, a good guess is ISO-8859-1. However, if new data is added, it is probably wise to start using the UTF-8 encoding.
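If an application follows the ISO-8859-1 guess from step 2 and then wants to move to UTF-8, the existing bytes need to be converted before they are stored again. The following is a minimal sketch of such a conversion, not something libiptcdata provides: latin1_to_utf8() is a hypothetical helper, and it assumes the input really is ISO-8859-1. The result could then be passed to iptc_dataset_set_data(), followed by a call to iptc_data_set_encoding_utf8() as described in step 1.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a libiptcdata function).
 * Converts in_len bytes of ISO-8859-1 text to a newly allocated,
 * NUL-terminated UTF-8 string. The byte count (excluding the NUL)
 * is stored in *out_len when out_len is not NULL.
 * Returns NULL if memory cannot be allocated; the caller frees the result.
 */
char *latin1_to_utf8(const unsigned char *in, size_t in_len, size_t *out_len)
{
    /* Each Latin-1 byte becomes at most two UTF-8 bytes. */
    char *out = malloc(in_len * 2 + 1);
    size_t n = 0;
    size_t i;

    if (out == NULL)
        return NULL;

    for (i = 0; i < in_len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII range is identical in both encodings. */
            out[n++] = (char)c;
        } else {
            /* U+0080..U+00FF encode as a two-byte UTF-8 sequence. */
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';

    if (out_len != NULL)
        *out_len = n;
    return out;
}
```

Because the ISO-8859-1 assumption is only a guess, an application may prefer to convert only data that fails to validate as UTF-8, or to ask the user before rewriting existing datasets.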