InterBase Collation Kits
by David Brookstone Schnepper
Introduction
This paper describes how to create a custom collation or character set driver for InterBase. An example using InterBase 6.0 is included.
This document uses international characters for illustration. It is encoded in the Windows 1252 character set (also known as Latin-1 or Western European).
More information is also available from Brookstone Systems.
Contents
- makefile.bc: Makefile for Borland C
- makefile.so: Makefile for Solaris
- makefile.hp: Makefile for HP (not completed)
- ld2.c: Example linkage function between engine and international module
- cs_example.c: Character Set example
- cv_example.c: Text conversion example
- lc_example.c: Collation example
- cs_latin1.h: Mapping tables for Character Set example
- country_codes.h: Defines numeric codes for countries
- charsets.h: InterBase numeric codes for character sets
- tx865_lat1.h: Translation tables between DOS865 and ISO Latin-1 character sets
- intlobj.h: Interface definition between InterBase Engine and International modules.
- cs_example.sql: SQL script defining & testing example character set & collation.
- cs_util.sql: Utility procedures for creating character sets and collations.
The files listed above are available for download as a 30.5kb .zip archive.
History
Prior to InterBase v3.3, InterBase defined CHARACTER(n) data as simple 8-bit bytes. It was up to the application programmer to know what character set was used to store the data, and how to convert between different character sets. Only binary collation order was supported for CHARACTER(n) data.
In June 1992, InterBase v3.3 was released, which included the first support for multiple character sets in a database and alternate collations for text. Only single-byte character sets and European-style collations were supported at that time. Version 3.3 also supported the ability to extend the International functionality by adding additional character set or collation modules.
Separately, a version 3.2J (Japanese) was released in 1992. The 3.2J version supported Japanese data stored in EUC and SJIS formats. It only supported codepoint order collation. There was no ability to extend the functionality of the international parts of the product.
In 1993, InterBase 4.0 was released. InterBase 4.0 built on the foundation established by v3.3 and v3.2J and combined both single-byte and multi-byte character set support into a single product.
Unlike the 3.3 architecture, InterBase 4.0 used the then-standard SQL-92 Internationalization architecture.
This document describes how to create drivers in the InterBase Internationalization architecture. Examples are provided for InterBase 6.0. These examples should also work for any InterBase 4.x or 5.x version – as the architecture has not changed since originally implemented for InterBase 4.0.
Many of the International collations supported in InterBase were originally designed for Borland’s Paradox and/or dBASE product lines. The collations were re-implemented in InterBase and validated against the original product. An important design decision was that InterBase collations must match the corresponding dBASE/Paradox collation PRECISELY, to make them compatible. InterBase duplicated known bugs in dBASE/Paradox collations to preserve compatibility. InterBase defines several collations that are “the same, but different”. For instance, there are German collations for dBASE, for Paradox, and an InterBase specific collation “DE_de”. The general rule is dBASE & Paradox collations perform identically to dBASE & Paradox while InterBase specific collations follow the national standard for the language, if one exists.
An InterBase v3.3 Collation Kit document was written in 1993, which documents creating an InterBase 3.3 collation driver. This document covers InterBase 4.x through 6.x – using the architecture developed for V4.x.
SQL Internationalization Definitions
Internationalization is full of ambiguous terminology: the terms are the same but the definitions all differ. Here are the definitions from the SQL approach to Internationalization. These definitions are painfully precise in the SQL 92 standard – this document takes some liberties with that precision in the interest of clarity. (See the SQL 92 standard for the original definitions.) The SQL definitions are different from the “intuitive” definitions many computer programmers use.
Character Repertoire
A character repertoire is a set of characters used for a particular culture. For instance, English uses A, B, C … Z. French uses A, Á, À, B, C, Ç, D … Z.
Character Set
A character set is a collection of characters that includes at least one character repertoire. For instance, ISO Latin 1 is a character set that includes the English repertoire and the French repertoire.
Form of Use
A form of use is the mapping of a character set to a particular computer based representation. It includes the numerical value assigned to a particular character, and the way to encode the value into bits for computer use. Examples of encoding include single byte, multi-byte sequence, 16-bit value, etc.
Collation
A collation is a method of ordering sequences of characters in a culturally acceptable fashion.
Character
A character is a particular member of a character repertoire. Characters can change “form of use” (e.g.: encoding) but never meaning.
InterBase Internationalization Definitions
This section defines the terminology used in InterBase’s internationalization implementation.
Narrow Character – NCHAR
A narrow character set represents text as a sequence of bytes, each byte representing a unique character.
typedef unsigned char NCHAR;
ISO Latin1, DOS437, and similar character sets are represented as NCHAR.
Multi-byte Character – MBCHAR
A multi-byte character set represents text as a sequence of bytes, with some byte-values indicating that one or more of the following bytes determine the next character.
typedef unsigned char MBCHAR;
SJIS, EUC-J, BIG-5, and similar character sets are represented as MBCHAR.
Wide Character – WCHAR
A method of representing a character set as a sequence of integers, in which the integer is larger than an 8-bit byte. Typically, a platform specific WCHAR implementation matches the integer size for the platform.
InterBase defines WCHAR as an unsigned short – a 16-bit value.
typedef unsigned short WCHAR;
InterBase does not fully support character sets in WCHAR format. It does not swap bytes between platforms of different integer architectures (e.g.: little-endian and big-endian platforms). This limitation can create problems when the client and server have different integer architectures.
InterBase uses WCHAR as an internal processing format – InterBase will typically convert a MBCHAR string to WCHAR for text matching functions.
UNICODE, in WCHAR format, is used internally for charset conversion functions.
Character Set
An InterBase character set consists of a SQL character set and a SQL Form of Use. For instance, the InterBase character Set ISO-8859-1 is precisely called: “The set of characters defined by ISO Standard 8859-1, encoded by the values defined in ISO standard 8859-1, having each value represented by a single 8-bit byte”.
InterBase character sets are generally defined by national standards or industry standards. Examples: the International Standards Organization defines ISO-8859-1, and Microsoft defines Windows1252.
InterBase is precise in following defined standards. Windows 1252 is a superset of ISO-8859-1 (commonly known as Latin-1). In InterBase, WIN1252 and LATIN_1 are two separate character sets because Windows 1252 defines characters in positions that Latin-1 specifically states are “not a character”.
Text Type
An InterBase text type could also be termed a “locale”. It consists of all culturally significant information about processing text. This includes: character set, representation (form of use), collation method, comparison method, and uppercasing method.
Typical “locale” information also includes such methods as conversion of numeric values to string values, time representation, and date representation. These methods and formats are NOT included in the InterBase Text Type.
Transliteration
A transliteration occurs when characters in one InterBase character set are converted to another character set. Note that this process occurs when changing from one SQL “form of use” to another. InterBase terms the conversion from Unicode in the multi-byte FSS format to Unicode in WCHAR format a transliteration. The conversion from DOS437 to ISO-8859-1 is also a transliteration.
InterBase transliterations are “precise.” InterBase will report a transliteration error if the character in the input set does not have an exact representation in the output set. None of the existing set of InterBase transliterations will transliterate a “missing character” to another character.
This restriction provides the ability to perform “round-trip conversions”. A conversion is “round-trip” if it converts from character set A to character set B, then back to A, producing the original sequence of characters in the original representation.
InterBase transliterations preserve “character fidelity”.
Custom transliterations can perform alternate conversions for “missing characters.” For instance, a custom transliteration could convert the character <ae> to the string “ae” when the character <ae> does not exist in the destination character set.
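The round-trip property can be checked mechanically. The sketch below assumes a pair of hypothetical 256-entry tables mapping between two single-byte character sets; a real kit derives such tables from the standards, as tx865_lat1.h does for DOS865 and ISO Latin-1:

```c
/* Hypothetical forward and reverse mapping tables between two
 * single-byte character sets A and B. */
static unsigned char a_to_b[256];
static unsigned char b_to_a[256];

/* Fill both tables with the identity mapping (illustration only). */
static void init_identity(void)
{
    for (int c = 0; c < 256; c++) {
        a_to_b[c] = (unsigned char) c;
        b_to_a[c] = (unsigned char) c;
    }
}

/* Returns 1 if every character of charset A survives A -> B -> A,
 * i.e. the transliteration pair preserves character fidelity. */
static int round_trip_ok(void)
{
    for (int c = 0; c < 256; c++)
        if (b_to_a[a_to_b[c]] != c)
            return 0;
    return 1;
}
```

A table pair that maps a “missing character” to some substitute will fail this check, which is why the standard InterBase transliterations report an error instead.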
Complexity of Collation
InterBase provides collations of varying complexity. The most sophisticated collations provide full dictionary sorting appropriate to the language being compared. Many InterBase collations provide a 4-weighted collation, where characters are compared by:
- Primary shape (A vs. B vs. C)
- Accents (A vs. Á vs. Ä)
- Case (A vs. a)
- Punctuation weight (punctuation is ignored in the first three comparisons).
In a 4-weight collation, Primary differences take precedence over Accent differences, which take precedence over Case differences, which take precedence over Punctuation differences.
Most languages use left-to-right comparisons for all 4 weightings. French as used in France requires right-to-left for Accent distinctions and left-to-right for all other distinctions.
The ordering of Primary Shapes, Accents, Case, and punctuation is culturally specific:
Collation | Primary Shapes | Accents | Case | Punctuation |
---|---|---|---|---|
English | Digits 0 1 2 … 9 before letters A B C … Z | | Uppercase before lowercase | comma dash period |
Danish | Letters A B C … Z Å … before digits 0 1 2 … 9 | grave acute circumflex | lowercase before uppercase | period dash comma |
French | Digits 0 1 2 … 9 before letters A B C Ç … Z | circumflex acute grave | | |
Note
The examples used in this section are for illustration purposes only, and not meant to be definitive of any collation sequence.
The “Primary Shape” of a character is culturally specific:
Collation | Character | Treatment |
---|---|---|
English | å | Same primary shape as a |
Danish | å | Unique primary shape, sort after Z |
English | ç | Same primary shape as c |
French | ç | Unique primary shape, sort between c and d |
Danish | Æ | Unique primary shape, sort after Z |
German | Æ | Sort as if written “AE” |
Each InterBase collation is specific to a character set. All characters in the character set must be sorted. If a character is not used in a particular culture it can be collated to any location, but it must still be collated, as it is valid input. Icelandic Thorn (Þ) and eth (Ð) are not used in English – and the EN_UK driver sorts them near T and D.
Some collations expand one character into a sequence of characters for sorting:
Collation | Character | Expansion | Special Rule |
---|---|---|---|
German | ß | ss | Sort after “ss” as an “accent difference” |
Dutch | ÿ | ij | Sort before “ij” as an “accent difference” |
Other collations treat a sequence of characters as a single collation element:
Collation | Sequence | Special Rule |
---|---|---|
Spanish | CH | Primary shape between C and D |
Spanish | LL | Primary shape between L and M |
Danish | AA | Primary shape as Å, but treat as an accented difference after Å |
The complexities of culturally correct sorting are beyond the scope of this document. Designing a collation order requires skill and insight; find a good text book on the subject and be sure you understand the issues before you begin.
InterBase Internationalization Classes
In Java object-oriented terminology, every character set supported by InterBase is represented by a class that implements the InterBase character set interface.
Every InterBase text type is represented by a class that implements the texttype interface.
Every transliteration is represented by a class that implements the conversion interface.
InterBase does not implement international support in an object-oriented language. It uses an object-oriented design implemented in standard C.
The methods in each internationalization interface are not cleanly divided: several methods defined in texttype properly belong in charset (mb_to_wc, for example).
Charset
A class implementing the InterBase charset interface must provide an internal name for the character set and:
- The value for the SPACE character
- The length of the SPACE character, in bytes
- Max & min bytes per character
- A method to convert from the character set to UNICODE in WCHAR
- A method to convert from UNICODE in WCHAR to the character set
Charsets are defined by the structure charset in intlobj.h. The structures defined in intlobj.h are explained starting in Texttype objects.
Texttype
A class implementing the InterBase texttype interface must provide an internal name for the texttype and functions that:
- Calculate the key length for a given string length
- Calculate the collation key for a given string
- Compare two strings
- Convert a string to uppercase
- Convert a string to lowercase
- Convert a string to WCHAR format
- Fetch the next character in a string
Texttypes are defined by the structure texttype in intlobj.h.
Conversion
A class implementing the InterBase conversion interface must provide an internal name for the conversion and a function to convert FROM a charset TO another charset.
Conversions are defined by the structure csconvert in intlobj.h
Data Types for International Object Identifiers
Character sets and text types are identified by numeric values. Once a numeric value is established for a character set or text type, the value will never be re-designated by InterBase. Preserving those numbers protects backup/restore and compatibility between servers.
Some character set and collation identifiers are reserved for user implementations. If you are creating a text type for your own application, you should use values in the range of 250 to 255 for character set identifiers and 250 to 254 for collation identifiers. If you are creating a text type that you expect to share or sell, you should use a value below 250 for each identifier, and register your identifiers with InterBase support.
Character Set Identifiers
Character set identifiers range from 0 to 255. The range from 250 to 255 is reserved for user implementations. All numbers below 250 must be registered with InterBase support to prevent conflicts with other vendors of InterBase international objects.
typedef unsigned char CHARSETID;
Charset ID 127 is reserved for internal InterBase use.
Texttype identifiers
A texttype is identified by a 16-bit value. The most-significant byte is the collation identifier (COLLATIONID). The least significant byte is the character set identifier (CHARSETID).
Collation IDs range from 0 to 254. Collation identifiers from 250 to 254 are reserved for user-defined collations. Value 255 is not supported. If you use a collation identifier below 250, you must register it with InterBase support to avoid conflicts with other vendors of InterBase international objects. By convention Collation ID 0 is the default collation for a given character set. The default collation for a character set is binary collation, where characters are ordered by the binary value specified by their form of use.
typedef unsigned char COLLATIONID;
typedef unsigned short TEXTTYPEID;
#define GET_CHARSETID(t) ((t) & 0xFF)
#define GET_COLLATIONID(t) (((t) >> 8) & 0xFF)
Collation identifiers are unique within a character set. Texttype ID’s are unique across the database.
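For illustration, here is how the two macros decompose a texttype ID, together with a hypothetical helper (`make_texttypeid` is not part of intlobj.h) that packs one:

```c
typedef unsigned char  CHARSETID;
typedef unsigned char  COLLATIONID;
typedef unsigned short TEXTTYPEID;

#define GET_CHARSETID(t)   ((t) & 0xFF)
#define GET_COLLATIONID(t) (((t) >> 8) & 0xFF)

/* Hypothetical helper: pack a collation ID (most-significant byte)
 * and a charset ID (least-significant byte) into a texttype ID. */
static TEXTTYPEID make_texttypeid(COLLATIONID coll, CHARSETID cs)
{
    return (TEXTTYPEID) ((coll << 8) | cs);
}
```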
Linking Numeric Identifiers and Symbolic Names
InterBase searches internal system metadata tables for the symbolic names to attach to charset and texttype objects.
Symbolic Names for Character Set ID’s
The following query gives the correspondence between charset ID’s and charset names.
SELECT RDB$CHARACTER_SET_ID, RDB$CHARACTER_SET_NAME
FROM RDB$CHARACTER_SETS;
This name is the “official” name for a character set. However, naming conventions for character sets vary. InterBase character sets have aliases to conform to the naming conventions on different platforms.
SELECT RDB$TYPE, RDB$TYPE_NAME
FROM RDB$TYPES
WHERE RDB$FIELD_NAME = 'RDB$CHARACTER_SET_NAME';
RDB$TYPES is the internal table used to define symbolic constants for various internal values.
Symbolic Names for Collation ID’s
Collation names are also defined in a system table. Retrieving them requires a more complicated query:
SELECT RDB$COLLATION_NAME, RDB$COLLATION_ID,
RDB$CHARACTER_SET_ID,
RDB$COLLATION_ID * 256 + RDB$CHARACTER_SET_ID
AS TEXTTYPEID
FROM RDB$COLLATIONS;
RDB$COLLATIONS stores the collation name, the collation identifier, and the character set identifier for each supported collation. The TEXTTYPEID is calculated from the COLLATION_ID and the CHARACTER_SET_ID.
COLLATION_NAME must be unique in a database. Each texttype has a unique collation name. Aliases are not supported for collation names.
Naming Conventions
Collation names often follow the convention XX_yy.
- XX is the two-letter Language code, per ISO Standard 639 (ftp://ftp.std.com/obi/Standards/ISO/ISO_639).
- yy is the two-letter Country code, per ISO Standard 3166 (http://www.niso.org/3166.html).
Thus, DE_DE is the collation name for German as used in Germany. FR_FR is French as used in France, and FR_CA is French as used in Canada.
InterBase collations that emulate Paradox or dBASE collations have names like “PDOXxxxx”, “PXWxxx” or “DBxxxx”. The names should match the names used in Paradox and dBASE.
A special collation exists for each character set. This collation has the same name as the character set and implements binary collation for the character set. Binary collation sorts the character set by the numeric codes used to represent the characters.
Character set names follow the standard that defines them (for example: WIN1252, DOS437). Character set alias names support the differences in these names between platforms (for example: WIN1252, WIN_1252).
Searching for an Object
InterBase searches for an international implementation object when it loads its metadata. It searches these three locations in this order:
- The engine itself
- The international library – gdsintl
- The supplemental international library – gdsintl2
Objects for character sets ASCII, NONE, OCTETS, and UNICODE_FSS are built into the engine. Without these character set definitions the engine cannot load a database. The default collations for these character sets are also implemented within the engine.
All other InterBase international object implementations are in the gdsintl library.
The gdsintl2 library is available for user-defined international objects.
GDSINTL Libraries
gdsintl and gdsintl2 are platform-dependent shared-object libraries. The filename suffix is platform-dependent.
Platform | Suffix |
---|---|
Windows | DLL |
Solaris | SO |
HP | SL |
The implementation of the library is platform-dependent. Libraries must be recompiled when moved to a new platform.
Some platforms do not support shared-object libraries. Novell NLM is an example. For those platforms, the entire gdsintl library is linked with the engine. User-defined international objects cannot be added to InterBase on platforms that do not support shared-object libraries.
The InterBase installation procedure places gdsintl in the intl, bin or lib directory in the InterBase root directory. gdsintl2 must be stored in the same directory as gdsintl. The location of the gdsintl library depends on the platform and InterBase version.
Platform | Version | Location |
---|---|---|
Windows | 4.x & 5.x | bin |
All others | 4.x & 5.x | lib |
All | 6.x | intl |
Single Entry Point
Each international library has a single entry point: LD_lookup for gdsintl and LD2_lookup for gdsintl2. Aside from the difference in entry point names, the libraries are functionally identical.
Historical note: Early versions of IBM’s AIX platform supported only one entry point in shared-object libraries. That limitation is the reason for limiting the lookup function to the single entry point for InterBase international libraries.
Historical note: The Apollo platform required that identifiers be unique across all shared-object libraries loaded by a program. That limitation is the reason for requiring the different entry point names in the two libraries.
The LD2_lookup Function
The lookup function for the gdsintl2 library is passed the following:
- Type of object to lookup
- Numerical identifier to lookup
It returns:
- 1 if the object was not found
- 0 if the object was found; in that case *fun is set to a pointer to the function used to initialize the object
USHORT LD2_lookup (
USHORT objtype,
FPTR_SHORT *fun,
SSHORT parm1,
SSHORT parm2)
Objtype is one of type_texttype, type_charset, or type_csconvert. Any other input is an error.
Objtype | Parm1 | Parm2 |
---|---|---|
Type_texttype | TEXTTYPEID | 0 |
Type_charset | CHARSETID | 0 |
Type_csconvert | CHARSETID (from) | CHARSETID (to) |
Any other value | ERROR | ERROR |
An example LD2_lookup function is provided in ld2.c
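A minimal sketch of the dispatch logic follows, assuming a library that implements one user texttype (collation 250 in charset 250) and one user charset (ID 250). The initialization function names and the numeric values of the type_* constants are placeholders; use the real definitions from intlobj.h, and see ld2.c for the actual example:

```c
typedef unsigned short USHORT;
typedef short SSHORT;
typedef USHORT (*FPTR_SHORT)();

/* Placeholder object-type codes; the real values come from intlobj.h. */
#define type_texttype  1
#define type_charset   2
#define type_csconvert 3

/* Hypothetical initialization functions; the real ones fill in
 * texttype/charset structures, as in lc_example.c and cs_example.c. */
static USHORT LC_example_init() { return 0; }
static USHORT CS_example_init() { return 0; }

/* The single entry point: map (objtype, parm1, parm2) to an
 * initialization function.  Returns 0 on success, 1 if the object
 * is not implemented in this library. */
USHORT LD2_lookup(USHORT objtype, FPTR_SHORT *fun,
                  SSHORT parm1, SSHORT parm2)
{
    (void) parm2;   /* used only for type_csconvert (TO charset) */
    switch (objtype) {
    case type_texttype:
        if ((USHORT) parm1 == 250 * 256 + 250) {   /* user texttype ID */
            *fun = LC_example_init;
            return 0;
        }
        break;
    case type_charset:
        if (parm1 == 250) {                        /* user charset ID */
            *fun = CS_example_init;
            return 0;
        }
        break;
    case type_csconvert:
        /* parm1 = FROM charset ID, parm2 = TO charset ID */
        break;
    }
    return 1;                                      /* not found */
}
```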
Initialization Functions
When an international implementation is found, the call to LD2_lookup returns a pointer to the class initialization function. The engine calls this initialization function with a pointer to a 0-initialized data structure, which the function will fill. This data structure defines the object to the engine.
The initialization is part of database startup, not server startup! If your server supports more than one database, the initialization function will be called for every database that shares a set of international objects. Even in a single-database server, the initialization function will be called each time the engine shuts down and reloads a database that was temporarily inactive.
The initialized object is not released until the database is unloaded. The engine releases the object as a normal part of database shutdown.
Here is a sample initialization function for a texttype:
unsigned short LC_win1252_example(
TEXTTYPE cache,
SSHORT parm1,
SSHORT dummy)
{
cache->texttype_version = IB_LANGDRV_VERSION;
/* … Other initialization code … */
}
Each charset, each texttype, and each conversion supported in a gdsintl library must have an initialization function.
Texttype Objects
Texttype objects implement an InterBase texttype. They provide collation and other text manipulation functions. This is the definition of texttype in intlobj.h:
typedef struct texttype {
struct blk texttype_blk;
USHORT texttype_version; /* version ID of object */
USHORT texttype_flags; /* miscellaneous flags */
TTYPE_ID texttype_type; /* Interpretation ID */
ASCII *texttype_name;
CHARSET_ID texttype_character_set; /* ID of base character set */
SSHORT texttype_country; /* ID of base country values */
BYTE texttype_bytes_per_char; /* max bytes per character */
ULONG texttype_license_mask; /* required bits for license */
/* MUST BE ALIGNED */
FPTR_SHORT texttype_fn_init;
FPTR_SHORT texttype_fn_key_length;
FPTR_SHORT texttype_fn_string_to_key;
FPTR_SHORT texttype_fn_compare;
FPTR_SHORT texttype_fn_to_upper; /* convert one ch to uppercase */
FPTR_SHORT texttype_fn_to_lower; /* One ch to lowercase */
FPTR_SHORT texttype_fn_str_to_upper; /* Convert string to uppercase */
FPTR_SHORT texttype_fn_to_wc; /* convert string to wc */
FPTR_SHORT texttype_fn_contains; /* s1 contains s2? */
FPTR_SHORT texttype_fn_like; /* s1 like s2? */
FPTR_SHORT texttype_fn_matches; /* s1 matches s2 */
FPTR_SHORT texttype_fn_sleuth_check; /* s1 sleuth s2 */
FPTR_SHORT texttype_fn_sleuth_merge; /* aux function for sleuth */
FPTR_SHORT texttype_fn_mbtowc; /* get next character */
BYTE *texttype_collation_table;
BYTE *texttype_toupper_table;
BYTE *texttype_tolower_table;
BYTE *texttype_expand_table;
BYTE *texttype_compress_table;
BYTE *texttype_misc; /* Used by some drivers */
ULONG *texttype_unused [4]; /* spare space */
} *TEXTTYPE;
Many of the fields in this structure are for use only by the engine, some are for use by the object itself, and some form the interface between the InterBase engine and the texttype implementation.
Field Name | Description |
---|---|
texttype_blk | Used for the engine’s memory management – do NOT touch! |
texttype_version | Set to IB_LANGDRV_VERSION |
texttype_flags | For engine use |
texttype_name | Pointer to ASCII internal name for texttype |
texttype_character_set | CHARSET_ID of the character set this object is implemented for. |
texttype_country | Country ID of locale (not required) |
texttype_bytes_per_char | Max bytes per char (duplicate information from charset object) |
texttype_license_mask | No longer used. |
texttype_fn_init | Pointer to function used to initialize the data structure |
texttype_fn_key_length | Calculates key length for an input string length. |
texttype_fn_string_to_key | Converts a string to a collation key |
texttype_fn_compare | Compares two strings of unlimited length. |
texttype_fn_to_lower | Converts to lowercase |
texttype_fn_to_upper, texttype_fn_str_to_upper | Converts to uppercase |
texttype_fn_to_wc, texttype_fn_mbtowc | Converts string to WCHAR (belongs with charset object!) |
texttype_fn_contains, texttype_fn_like, texttype_fn_matches, texttype_fn_sleuth_check, texttype_fn_sleuth_merge | The engine will initialize these pointers with functions used to implement LIKE, CONTAINS, and the GDML matching language. Do not set them in the initialization function. |
texttype_collation_table, texttype_toupper_table, texttype_tolower_table, texttype_expand_table, texttype_compress_table, texttype_misc, texttype_unused | These fields provide pointers to data tables that the implementation uses to provide its functionality. The engine never refers to them directly; they are for the object’s use. |
Functions to Perform Operations on Text
The first parameter of all functions implemented within the texttype is a pointer to the data structure initialized by the initialization function. This pointer provides access to the data tables: texttype_collation_table, texttype_toupper_table, etc. Functions can be shared among different texttypes implemented in the same gdsintl2 shared library. The result of the function is determined by the data tables, which are specific to a texttype.
The comments in lc_example.c provide more detail on each function.
Collation Keys
A collation key maps a string with a non-default collation to a set of bytes that can be compared with unsigned-byte comparison preserving the culturally correct collation order. InterBase stores collation keys in indexes. Index keys are always compared with unsigned byte comparisons, starting with the high order byte and continuing until a difference or the end of the key is found. When creating an index key, InterBase transforms values that do not produce correct ordering using unsigned byte comparisons to values that do.
The mapping from a string to a collation key can be irreversible. The original string is never recreated from the collation key.
When writing a texttype_fn_string_to_key function, the resulting key must consist of unsigned bytes.
In a binary collation, the collation key is identical to the string.
Collation | Input | Collation Key |
---|---|---|
Binary | ÁbC | ÁbC |
Case Insensitive binary | ÁbC | ÁBC |
Accent Insensitive binary | ÁbC | AbC |
Case and Accent Insensitive binary | ÁbC | ABC |
English Book Title | The Practice | Practice, The |
Telephone Book (USA) | McDonald | MacDonald |
Telephone Book (USA) | 1 Way | ONE WAY |
Lexical Ordering (German) | ÁbCê | ABCE' ^AbCe |
Lexical Ordering (French) | ÁbCê | ABCE^ 'AbCe |
Lexical Ordering (German) | Cheß | CHESS 1chess |
Lexical Ordering (Spanish) | Cheß | XHEß Xheß where X represents a special value between C and D. |
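As an illustration of key creation, here is a sketch of a string-to-key function for a hypothetical case-insensitive binary collation. The real texttype_fn_string_to_key also receives the texttype object and a key-type argument, which are omitted here:

```c
#include <ctype.h>

typedef unsigned char BYTE;
typedef unsigned short USHORT;

/* Sketch: the collation key is the input string uppercased, with
 * trailing spaces dropped (SQL requires trailing spaces to be
 * ignored) and truncated to the maximum key length. */
static USHORT string_to_key(USHORT srclen, const BYTE *src,
                            USHORT dstlen, BYTE *dst)
{
    USHORT n = srclen;

    while (n > 0 && src[n - 1] == ' ')   /* ignore trailing spaces */
        n--;
    if (n > dstlen)                      /* keys have a fixed max length */
        n = dstlen;
    for (USHORT i = 0; i < n; i++)
        dst[i] = (BYTE) toupper(src[i]);
    return n;                            /* key length in bytes */
}
```

Note that this mapping is irreversible in the sense described above: “ÁbC” and “ÁBC” produce the same key, and the original string cannot be recovered from it.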
Collation Keys versus Comparison
A texttype must provide functions for string comparison and creating collation keys. String comparison is used for equality and inequality Boolean evaluations. Collation keys are used in indexes. Comparing strings and comparing collation keys must produce identical results.
The two functions have a significant difference. Index keys have a fixed maximum length, currently 254 bytes. String comparison must be able to handle strings of unlimited length.
Ignore Trailing Spaces
SQL requires that trailing spaces be ignored when comparing two strings of different lengths. It is important that both the key creation function and the compare function ignore trailing spaces on strings.
Note
The space character is determined by the character set.
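A comparison function for a plain binary collation might handle unequal lengths by treating the shorter string as if padded with spaces. This is a sketch only: a real driver compares collation weights rather than raw bytes, and the pad character depends on the character set:

```c
typedef unsigned char BYTE;
typedef unsigned short USHORT;

/* Sketch of a texttype_fn_compare for a binary collation.  SQL
 * requires trailing spaces to be ignored, so the shorter string
 * is compared as if padded with spaces.  Returns <0, 0 or >0. */
static int compare_ignore_trailing_spaces(USHORT l1, const BYTE *s1,
                                          USHORT l2, const BYTE *s2)
{
    USHORT max = (l1 > l2) ? l1 : l2;
    for (USHORT i = 0; i < max; i++) {
        BYTE c1 = (i < l1) ? s1[i] : ' ';   /* pad the shorter string */
        BYTE c2 = (i < l2) ? s2[i] : ' ';
        if (c1 != c2)
            return (int) c1 - (int) c2;
    }
    return 0;
}
```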
Monocase operations
A texttype must contain functions to implement the SQL UPPER(x) and LOWER(x) functions. The result of UPPER & LOWER is culturally specific, and the rules are determined by the collation sequence defined with the column.
The to_upper and to_lower functions can be implemented using data tables or directly in code.
InterBase does not currently support the SQL LOWER(x) function; however, you must still implement it in each defined texttype.
Form of Use Changes
A texttype contains functions for changing the “form of use” of a character string from NCHAR or MBCHAR to WCHAR. These functions, strictly speaking, belong with the charset object. A design decision in the Internationalization architecture placed them with texttype.
Text Matching Functions
Entries exist in texttype for custom implementation of SQL LIKE, CONTAINS, and GDML MATCHING functions. The InterBase engine will fill these entries with internal function pointers after the texttype initialization function is called. Your driver should not initialize these function pointers.
Debugging a Collation Driver
Debug a collation driver as a standalone program before building it into a gdsintl2 driver. While the driver is standalone, it can print the collation keys it creates so that you can check them for correctness.
After installing the driver, sort and index several tables, compare indexed and non-indexed strings and perform UPPER() operations to ensure the driver is working properly.
The example test file is cs_example.sql.
Example
The example texttype driver lc_example.c implements a simple version of English Title Sort Order. When titles are sorted, articles such as “The”, “An”, or “A” at the beginning of the title are ignored. Thus “The Wind and the Lion” sorts as if it were written “Wind and the Lion”.
In this driver, the Collation Key created is simply the input string, minus any articles at the beginning of the string. This collation does not change the default ordering of accented characters or upper and lower case characters.
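The article-stripping idea can be sketched in a few lines. `title_sort_start` is a hypothetical helper for illustration only; lc_example.c contains the full driver:

```c
#include <string.h>

/* Sketch of the title-sort idea: if the string begins with an
 * article ("The ", "An ", "A "), the collation key starts after
 * it.  Case handling and the full texttype interface are omitted. */
static const char *title_sort_start(const char *s)
{
    static const char *articles[] = { "The ", "An ", "A ", 0 };
    for (int i = 0; articles[i]; i++) {
        size_t n = strlen(articles[i]);
        if (strncmp(s, articles[i], n) == 0)
            return s + n;     /* key begins after the article */
    }
    return s;                 /* no leading article */
}
```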
Conversion Objects
InterBase uses conversion objects to transform one character set to another. A conversion object is defined by the data structure csconvert in intlobj.h
typedef struct csconvert {
struct blk csconvert_blk;
USHORT csconvert_version;
USHORT csconvert_flags;
SSHORT csconvert_id;
ASCII *csconvert_name;
CHARSET_ID csconvert_from;
CHARSET_ID csconvert_to;
FPTR_SHORT csconvert_convert;
BYTE *csconvert_datatable;
BYTE *csconvert_misc;
ULONG *csconvert_unused [2];
} *CSCONVERT;
Field Name | Definition |
---|---|
csconvert_blk | Used by the engine’s memory management. Do not modify. |
csconvert_flags | Engine use – do not modify |
csconvert_id | No longer used |
csconvert_name | Internal name – can be ignored. |
csconvert_from | CharsetID being converted FROM |
csconvert_to | CharsetID being converted TO |
csconvert_convert | Function to perform the conversion |
csconvert_datatable, csconvert_misc, csconvert_unused | Available for object implementation to use for pointer storage, etc. Not directly accessed by the engine. |
Conversion API
The conversion object has a single method, the function that converts from one character set to another.
static USHORT conversion_function (
CSCONVERT obj,
BYTE *dest_ptr,
USHORT dest_len,
BYTE *src_ptr,
USHORT src_len,
SSHORT *err_code,
USHORT *err_position);
obj | The conversion object |
---|---|
dest_ptr | Location for the converted string, or NULL when an estimate of output length is requested. |
dest_len | Length of the dest_ptr buffer, in bytes |
src_ptr | The input string |
src_len | Length of input string, in bytes. |
err_code | See below |
err_position | See below |
Returns: The length of the converted string, in bytes.
If dest_ptr is NULL, then the routine returns a maximum estimate of the space required for the output string if an input string of src_len bytes is converted.
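A minimal table-driven conversion function following this calling convention might look like the sketch below. The types are plain C stand-ins for the engine typedefs, and the 256-entry map (with 0 marking an unmappable character) is an assumption for illustration; real drivers receive their table through csconvert_datatable:

```c
#include <stddef.h>

/* Illustrative error codes; real values come from intlobj.h. */
#define CS_CONVERT_ERROR     (-1)
#define CS_TRUNCATION_ERROR  (-2)

/* Sketch of a single-byte, table-driven conversion function.
   map[256] translates each input byte to an output byte; a zero
   entry (for a non-zero input) marks a character with no mapping. */
static unsigned short convert_8bit(const unsigned char map[256],
                                   unsigned char *dest_ptr,
                                   unsigned short dest_len,
                                   const unsigned char *src_ptr,
                                   unsigned short src_len,
                                   short *err_code,
                                   unsigned short *err_position)
{
    unsigned short i;

    *err_code = 0;
    *err_position = 0;

    /* NULL dest_ptr: report the worst-case output length only. */
    if (dest_ptr == NULL)
        return src_len;

    for (i = 0; i < src_len; i++) {
        if (i >= dest_len) {
            *err_code = CS_TRUNCATION_ERROR;
            *err_position = i;
            return i;
        }
        if (map[src_ptr[i]] == 0 && src_ptr[i] != 0) {
            *err_code = CS_CONVERT_ERROR;
            *err_position = i;
            return i;
        }
        dest_ptr[i] = map[src_ptr[i]];
    }
    return src_len;
}
```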
Error Returns from Conversion Function
The conversion function can return errors to the engine. Generally the engine will report these to the user as a “transliteration error”. The error code is returned via *err_code. The position in the input string where the error was detected is returned via *err_position.
Result in err_code | Meaning |
---|---|
0 | No Error |
CS_CONVERT_ERROR | Input character does not exist in output character set. |
CS_BAD_INPUT | Input string was not properly formed per rules of the input character set. |
CS_TRUNCATION_ERROR | Not enough space available in the output buffer. |
Addressed by FROM charsetid & TO charsetid
Unlike character sets and collations, conversions have no symbolic names. A conversion is identified by the charsetID it is converting from and the charsetID it is converting to. An InterBase aficionado will recognize the similarity to blob filters.
If no direct conversion object has been created to convert from charset A to charset B, the engine converts charset A to UNICODE and then UNICODE to charset B.
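The two-hop fallback can be sketched as two conversion calls through an intermediate buffer. The function-pointer type mirrors the conversion API shown above, but all names here are illustrative assumptions, not engine internals:

```c
#include <stddef.h>

/* Function-pointer type mirroring the conversion calling convention. */
typedef unsigned short (*convert_fn)(void *obj,
                                     unsigned char *dest_ptr,
                                     unsigned short dest_len,
                                     const unsigned char *src_ptr,
                                     unsigned short src_len,
                                     short *err_code,
                                     unsigned short *err_position);

/* Trivial identity converter, used here only for illustration.
   A real converter would report CS_TRUNCATION_ERROR on a short
   destination buffer rather than truncating silently. */
static unsigned short copy_convert(void *obj,
                                   unsigned char *dest_ptr,
                                   unsigned short dest_len,
                                   const unsigned char *src_ptr,
                                   unsigned short src_len,
                                   short *err_code,
                                   unsigned short *err_position)
{
    unsigned short i;
    (void) obj;
    *err_code = 0;
    *err_position = 0;
    if (dest_ptr == NULL)
        return src_len;
    for (i = 0; i < src_len && i < dest_len; i++)
        dest_ptr[i] = src_ptr[i];
    return i;
}

/* Sketch of the engine's fallback: charset A -> Unicode into a
   scratch buffer, then Unicode -> charset B into the destination. */
static unsigned short convert_via_unicode(
    void *a_obj, convert_fn a_to_unicode,
    void *b_obj, convert_fn unicode_to_b,
    unsigned char *dest, unsigned short dest_len,
    const unsigned char *src, unsigned short src_len,
    unsigned char *tmp, unsigned short tmp_len,
    short *err_code, unsigned short *err_position)
{
    unsigned short mid_len;

    mid_len = a_to_unicode(a_obj, tmp, tmp_len, src, src_len,
                           err_code, err_position);
    if (*err_code != 0)
        return 0;

    return unicode_to_b(b_obj, dest, dest_len, tmp, mid_len,
                        err_code, err_position);
}
```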
Example
In the sample code provided, cv_example.c is an example of a conversion object. Conversion functions are generally table driven. The example contains several generic conversion functions; each performs a specific conversion when given the data table for that conversion.
For conversion tables, see the mapping files at www.unicode.org
Charset object
InterBase uses a charset object to represent information for a character set. Much of the information inside a charset object is for engine use only.
Charsets are defined by the structure charset in intlobj.h.
typedef struct charset {
struct blk charset_blk;
USHORT charset_version;
USHORT charset_flags;
CHARSET_ID charset_id;
ASCII *charset_name;
BYTE charset_min_bytes_per_char;
BYTE charset_max_bytes_per_char;
BYTE charset_space_length;
BYTE *charset_space_character;
/* Must be aligned */
FPTR_SHORT charset_well_formed;
struct csconvert charset_to_unicode;
struct csconvert charset_from_unicode;
VEC charset_converters;
VEC charset_collations;
ULONG *charset_unused [2];
} *CHARSET;
Field | Definition |
---|---|
charset_blk | Used by engine’s memory manager. Do not modify. |
charset_version | Version ID of InterBase international interface. |
charset_id | The CHARSETID for this character set |
charset_name | Internal name for charset |
charset_min_bytes_per_char | Minimal number of bytes per character. |
charset_max_bytes_per_char | Maximum bytes per character. InterBase will allocate this number of bytes for each character in a SQL CHAR(n) definition. |
charset_space_length | Length, in bytes, of the space character |
charset_space_character | Pointer to the definition of the space character |
charset_well_formed | Not used. This was intended to be a pointer to a function that would validate that a string was well formed by the rules of a character set. |
charset_to_unicode | A conversion object that transforms the character set TO Unicode (WCHAR). |
charset_from_unicode | A conversion object that transforms Unicode INTO the character set. |
charset_converters | For engine use only. (The Engine keeps a list of converters to/from this character set here). |
charset_collations | For engine use only. (The Engine keeps a list of collations defined for this character set here). |
charset_unused | May be used by object implementation. |
Two Conversion Objects
Every charset must contain two conversion definitions: a conversion from Unicode to the charset and a conversion to Unicode from the charset. All InterBase character sets must be able to convert themselves to and from Unicode.
When the engine cannot find a direct conversion between two character sets, it uses the converters here. Unicode is the intermediate conversion step.
One Texttype Object
The SQL standard requires that all character sets have a default collation. By convention in InterBase, the default collation for a character set has the same name as the character set. It performs a collation in binary, or codepoint order.
The character set default texttype object is loaded when the character set is loaded. It must be defined separately in the gdsintl2 library.
The SPACE character
The SQL standard requires that all character sets define a “space character”. This character is used to pad strings assigned to CHAR(n) columns to the length of the column. Trailing space characters are ignored in all string comparison operations.
The overwhelming majority of modern character sets use 0x20 (or 0x0020) for the space character. InterBase character sets follow that rule, except for character set OCTETS. OCTETS represents binary byte data rather than character data, and uses 0x00 as its space character.
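Padding a CHAR(n) buffer with the charset's space character might be sketched as below. The helper name is hypothetical; the loop copies the space definition repeatedly so it also works for a multi-byte space character:

```c
#include <string.h>

/* Hypothetical helper: pad a CHAR(n) buffer out to total_len bytes
   using the character set's space character (charset_space_character,
   charset_space_length). Works for multi-byte space characters. */
static void pad_with_space(unsigned char *buf,
                           unsigned short used_len,
                           unsigned short total_len,
                           const unsigned char *space,
                           unsigned short space_len)
{
    while (used_len + space_len <= total_len) {
        memcpy(buf + used_len, space, space_len);
        used_len += space_len;
    }
}
```

For OCTETS, the same call would be made with a one-byte space of 0x00.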
Example
The example character set in cs_example.c is identical to the InterBase provided ISO_LATIN_1 character set.
The critical structures in a character set definition are the tables that map the character set to and from Unicode. InterBase’s mapping tables were derived from the tables provided by Unicode, Inc. at their website - www.unicode.org
Building gdsintl2
The build process for gdsintl2 is platform specific. In general terms, gdsintl2 must be a shared library with minimal dependencies on other libraries and a single entrypoint. Building a gdsintl2 library is very similar to building an InterBase UDF library.
The platform specific makefiles provided in the collation kit files are examples of building gdsintl2.
Special Notes
Backup/Restore
InterBase backs up and restores user-defined character sets and collations, and the data in columns that use them, without problems.
If you restore a backup file that includes user-defined international objects on a server other than the one on which it was backed up, you must ensure that the character sets and collations referenced in the backup file exist on the destination server. The CHARSETIDs and COLLATIONIDs must be identical on the two servers.
Testing a GDSINTL2
When you test a new GDSINTL2, you must be aware that the engine does not release a shared library until the engine itself is unloaded. Each time you update your gdsintl2, you must shut down the server, install the new gdsintl2, and reload the server.
Platform Provided Internationalization functions
Many platforms provide internationalization libraries that perform the functions needed in gdsintl2. For instance, JAVA provides the Collator class; Windows provides CompareString and similar functions.
InterBase recommends against using a platform provided Internationalization library as it may:
- be platform specific. Moving the database to another platform will require re-implementation of the gdsintl2 library and may result in a different behavior if the two platform implementations are not precisely identical.
- be OS version specific. Updating your OS may install a new version of the library – which may perform differently. Because key values are stored in InterBase indices, your database will not operate correctly until all indices are rebuilt.
- limit input string size. InterBase requires that comparison of Collation Keys and String Comparison return identical results – String Comparison would not be able to use the Collation Key function if it has a limit on input string size.
- not be thread safe. Many platforms use the process-wide locale setting for the international library. The texttype implementation would then have to set the process locale before each text operation. This may result in other, unintended, differences in InterBase behavior.
- have poor performance. An OS collation key function that loads data tables from disk each time will greatly reduce database performance.
- require a library not loaded by the InterBase server. If the library isn’t already loaded by the InterBase server, your gdsintl2 library will not be able to access the OS-provided functions.
Signed and Unsigned values
A CHARSETID and TEXTTYPEID are unsigned values. However, the LD_lookup interface between the engine and the gdsintl library is defined as a signed short. You must use a C cast to refer to the value as an unsigned short inside of your gdsintl2 module.
The error given when the engine is unable to locate a texttype is “Implementation of text subtype %d not located.” – the TEXTTYPEID / CHARSETID value is printed as a signed value instead of unsigned.
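The cast itself is trivial; a sketch (function name assumed):

```c
/* Sketch: recover the unsigned CHARSETID/TEXTTYPEID value from the
   signed short used by the LD_lookup interface. The conversion is
   well-defined in C: it wraps modulo 2^16 for a 16-bit short. */
static unsigned short texttype_id(short ld_parameter)
{
    return (unsigned short) ld_parameter;
}
```

With 16-bit shorts, a value that arrives as -1 is really 65535; small ids (0 to 255) pass through unchanged.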
InterBase Source Files
The files provided in the InterBase collation kit were derived from InterBase source files. InterBase source files are best viewed with tab-stops set to 4.
InterBase was written in Kernighan & Ritchie C and gradually moved toward ANSI C. Some inconsistencies remain; for example, some function definitions have prototypes and some do not.
Building an International Object, Step-by-Step
This section is a cookbook for creating a new InterBase texttype.
First, build and install a gdsintl2 shared library from the examples that accompany this document.
- Install the example files.
- Copy the platform specific makefile.platform to makefile.
- Make the example gdsintl2 module (“make gdsintl2”)
- Install the example gdsintl2 module with “make install”
- Test the module using cs_example.sql
Now that you are comfortable with the mechanics of building and using a new stored library, begin working on your character set.
Define a collation object that implements binary collation.
Create your character set mapping tables. Obtain a mapping table to and from Unicode from www.unicode.org or another reliable source.
Define a symbol for your charsetid in cs_example.h
Modify cs_example in cs_example.c to use your mapping tables instead of the example ones.
Modify the initialization function cs_example in cs_example.c setting the parameters to reflect your character set.
Pick character set and collation identifiers. If you intend to share or sell your texttypes, you should pick a value below 250 for the character set and collation identifiers and register them with InterBase support. If your new texttypes are for use only in your controlled environment, pick a value between 250 and 255 for the character set and between 250 and 254 for the collation. Those values do not require registration.
Modify the linkage function in ld2.c to reflect your picked parameters.
Test the character set using a standalone test method.
Make and install the character set as a gdsintl2 driver.
Add the character set definition procedures defined in cs_util.sql to a newly created test database for the character set.
Declare the character set to the database.
execute procedure create_character_set ('<NEW_CHARSET>', <your id>);
Declare alias names for the character set.
execute procedure create_character_set_alias ('new name', /* for */ 'existing name');
Test the newly created character set. Create a column with all characters from the character set. Map the column to a Unicode column and back again. Sort the column, index it.
Modify the collation function in lc_example.c to support your collation.
Rebuild and re-install gdsintl2. You will need to restart the InterBase server for it to load the new gdsintl2 module (as the prior version is already loaded).
Declare the collation to the database.
execute procedure create_collation ('collation name', <collation id>, 'charset name');
Test the collation.
Appendix
Existing Character Sets
SELECT RDB$CHARACTER_SET_NAME, RDB$CHARACTER_SET_ID
FROM RDB$CHARACTER_SETS
ORDER BY RDB$CHARACTER_SET_ID;