InterBase Collation Kits

by David Brookstone Schnepper

Introduction

This paper describes how to create a custom collation or character set driver for InterBase. An example using InterBase 6.0 is included.

This document uses international characters for illustration. The encoding for this document is Character set Windows 1252 (a.k.a Latin 1, or Western European).

More information also available at Brookstone Systems

Contents

  • makefile.bc: Makefile for Borland C
  • makefile.so: Makefile for Solaris
  • makefile.hp: Makefile for HP (not completed)
  • ld2.c: Example linkage function between engine and international module
  • cs_example.c: Character Set example
  • cv_example.c: Text conversion example
  • lc_example.c: Collation example
  • cs_latin1.h: Mapping tables for Character Set example
  • country_codes.h: Defines numeric codes for countries
  • charsets.h: InterBase numeric codes for character sets
  • tx865_lat1.h: Translation tables between DOS865 and ISO Latin-1 character sets
  • intlobj.h: Interface definition between InterBase Engine and International modules.
  • cs_example.sql: SQL script defining & testing example character set & collation.
  • cs_util.sql: Utility procedures for creating character sets and collations.

To download the above as a (30.5kb) (.zip) file

History

Prior to InterBase v3.3, InterBase defined CHARACTER(n) data as simple 8-bit bytes. It was up to the application programmer to know what character set was used to store the data, and how to convert between different character sets. Only binary collation order was supported for CHARACTER(n) data.

In June 1992, InterBase v3.3 was released, which included the first support for multiple character sets in a database and alternate collations for text. Only single-byte character sets and European-style collations were supported at that time. Version 3.3 also supported the ability to extend the International functionality by adding additional character set or collation modules.

Separately, a version 3.2J (Japanese) was released in 1992. The 3.2J version supported Japanese data stored in EUC and SJIS formats. It only supported codepoint order collation. There was no ability to extend the functionality of the international parts of the product.

In 1993, InterBase 4.0 was released. InterBase 4.0 built on the foundation established by v3.3 and v3.2J and combined both single-byte and multi-byte character set support into a single product.

Unlike the 3.3 architecture, InterBase 4.0 used the then-standard SQL-92 Internationalization architecture.

This document describes how to create drivers in the InterBase Internationalization architecture. Examples are provided for InterBase 6.0. These examples should also work for any InterBase 4.x or 5.x version – as the architecture has not changed since originally implemented for InterBase 4.0.

Many of the International collations supported in InterBase were originally designed for Borland’s Paradox and/or dBASE product lines. The collations were re-implemented in InterBase and validated against the original product. An important design decision was that InterBase collations must match the corresponding dBASE/Paradox collation PRECISELY, to make them compatible. InterBase duplicated known bugs in dBASE/Paradox collations to preserve compatibility. InterBase defines several collations that are “the same, but different”. For instance, there are German collations for dBASE, for Paradox, and an InterBase specific collation “DE_de”. The general rule is dBASE & Paradox collations perform identically to dBASE & Paradox while InterBase specific collations follow the national standard for the language, if one exists.

An InterBase v3.3 Collation Kit document was written in 1993, which documents creating an InterBase 3.3 collation driver. This document covers InterBase 4.x through 6.x – using the architecture developed for V4.x.

SQL Internationalization Definitions

Internationalization is full of ambiguous terminology: the statements are the same but all the definitions are different. Here are the definitions from the SQL approach to Internationalization. These definitions are painfully precise in the SQL 92 definition – this documents takes some liberties with that precision in the interest of clarity. (See the SQL 92 standard for the original definitions). The SQL definitions are different from the “intuitive” definitions many computer programmers use.

Character Repertoire

A character repertoire is a set of characters used for a particular culture. For instance, English uses A, B, C … Z. French uses A, Á, À, B, C, Ç, D … Z.

Character Set

A character set is a collection of characters that includes at least one character repertoire. For instance, ISO Latin 1 is a character set that includes the English repertoire and the French repertoire.

Form of Use

A form of use is the mapping of a character set to a particular computer based representation. It includes the numerical value assigned to a particular character, and the way to encode the value into bits for computer use. Examples of encoding include single byte, multi-byte sequence, 16-bit value, etc.

Collation

A collation is a method of ordering sequences of characters in a culturally acceptable fashion.

Character

A character is a particular member of a character repertoire. Characters can change “form of use” (e.g.: encoding) but never meaning.

InterBase Internationalization Definitions

This section defines the terminology used in InterBase’s internationalization implementation.

Narrow Character – NCHAR

A narrow character set represents text as a sequence of bytes, each byte representing a unique character.

typedef unsigned char NCHAR;

ISO Latin1, DOS437, and similar character sets are represented as NCHAR.

Multi-byte Character – MBCHAR

A multi-byte character set represents text as a sequence of bytes, with some byte-values indicating that one or more of the following bytes determine the next character.

typedef unsigned char MBCHAR;

SJIS, EUC-J, BIG-5, and similar character sets are represented as MBCHAR.

Wide Character – WCHAR

A method of representing a character set as a sequence of integers, in which the integer is larger than an 8-bit byte. Typically, a platform specific WCHAR implementation matches the integer size for the platform.

InterBase defines WCHAR as an unsigned short – a 16-bit value.

typedef unsigned short WCHAR;

InterBase does not fully support character sets in WCHAR format. It does not swap bytes between platforms of different integer architectures (e.g.: little-end-ian and big-end-ian platforms). This limitation can create problems when the client and server have different integer architectures.

InterBase uses WCHAR as an internal processing format – InterBase will typically convert a MBCHAR string to WCHAR for text matching functions.

UNICODE, in WCHAR format, is used internally for charset conversion functions.

Character Set

An InterBase character set consists of a SQL character set and a SQL Form of Use. For instance, the InterBase character Set ISO-8859-1 is precisely called: “The set of characters defined by ISO Standard 8859-1, encoded by the values defined in ISO standard 8859-1, having each value represented by a single 8-bit byte”.

InterBase character sets are generally defined by national standards or industry standards. Examples: the International Standards Organization defines ISO-8859-1, and Microsoft defines Windows1252.

InterBase is precise in following defined standards. Windows 1252 is a superset of ISO-8859-1 (commonly known as Latin-1). In InterBase, WIN1252 and LATIN_1 are two separate character sets because Windows 1252 defines characters in positions that Latin-1 specifically states are “not a character”.

Text Type

An InterBase text type could also be termed a “locale”. It consists of all culturally significant information about processing text. This includes: character set, representation (form of use), collation method, comparison method, and uppercasing method.

Typical “locale” information also includes such methods as conversion of numeric values to string values, time representation, and date representation. These methods and formats are NOT included in the InterBase Text Type.

Transliteration

A transliteration occurs when characters in one InterBase character set are converted to another character set. Note that this process occurs when changing from one SQL “form of use” to another. InterBase terms the conversion from Unicode in the multi-byte FSS format to Unicode in WCHAR format a transliteration. The conversion from DOS437 to ISO-8859-1 is also a transliteration.

InterBase transliterations are “precise.” InterBase will report a transliteration error if the character in the input set does not have an exact representation in the output set. None of the existing set of InterBase transliterations will transliterate a “missing character” to another character.

This restriction provides the ability to perform “round-trip conversions”. A conversion is “round-trip” if it converts from character set A to character set B, then back to A, producing the original sequence of characters in the original representation.

InterBase transliterations preserve “character fidelity”.

Custom transliterations can perform alternate conversions for “missing characters.” For instance, a custom transliteration could convert the character <ae> to the string “ae” when the character <ae> does not exist in the destination character set.

Complexity of Collation

InterBase provides collations of varying complexity. The most sophisticated collations provide full dictionary sorting appropriate to the language being compared. Many InterBase collations provide a 4-weighted collation, where characters are compared by:

  • Primary shape (A vs. B vs. C)
  • Accents (A vs. Á vs. Ä)
  • Case (A vs. a)
  • Punctuation weight (punctuation is ignored in the first three comparison).

In a 4-weight collation Primary differences take precedence over Accent differences, which take precedence over Case differences, which take precedence over punctuation differences.

Most languages use left-to-right comparisons for all 4 weightings. French as used in France requires right-to-left for Accent distinctions and left-to-right for all other distinctions.

The ordering of Primary Shapes, Accents, Case, and punctuation is culturally specific:

Primary Shapes, Accents, Case, and punctuations
Collation Primary Shapes Accents Case Punctuation
English Digits 0 1 2 … 9 … before letters A B C … Z   Uppercase before lowercase comma dash period
Danish Letters A B C … Z Å … before digits 0 1 2 … 9 grave acute circumflex lowercase before uppercase period dash comma
French 0 1 2 … 9 … A B C Ç … Z circumflex acute grave    

Note

The examples used in this section are for illustration purposes only, and not meant to be definitive of any collation sequence.

The “Primary Shape” of a character is culturally specific:

Primary Shape
Collation Character Treatment
English å Same primary shape as a
Danish å Unique primary shape, sort after Z
English ç Same primary shape as c
French ç Unique primary shape, sort between c and d
Danish Æ Unique primary shape, sort after Z
German Æ Sort as if written “AE”

Each InterBase collation is specific to a character set. All characters in the character set must be sorted. If a character is not used in a particular culture it can be collated to any location, but it must still be collated, as it is valid input. Icelandic Thorn (Þ) and eth (Ð) are not used in English – and the EN_UK driver sorts them near T and D.

Some languages expand characters to an equivalent format
Collation Character Expansion Special Rule
German ß ss Sort after “ss” as an “accent difference”
Dutch ÿ ij Sort before “ij” as an “accent difference”
Some languages contract sequences of characters
Collation Sequence Special Rule
Spanish CH Primary shape between C and D
Spanish LL Primary shape between L and M
Danish AA Primary shape as Å, but treat as an accented difference after Å

The complexities of culturally correct sorting are beyond the scope of this document. Designing a collation order requires skill and insight; find a good text book on the subject and be sure you understand the issues before you begin.

InterBase Internationalization Classes

In JAVA object-oriented terminology; every character set supported by InterBase is represented by a class that implements the InterBase character set Interface.

Every InterBase text type is represented by a class that implements the texttype interface.

Every transliteration is represented by a class that implements the conversion interface.

InterBase does not implement international support in an object-oriented language. It uses an object-oriented design implemented in standard C.

The methods in each internationalization interface are not very well divided. In particular, many methods defined in texttype belong in charset (mb_to_wc in particular).

Charset

A class implementing the InterBase charset interface must provide an internal name for the character set and:

  • The value for the SPACE character
  • The length of the SPACE character, in bytes
  • Max & min bytes per character
  • A method to convert from the character set to UNICODE in WCHAR
  • A method to convert from UNICODE in WCHAR to the character set

Charsets are defined by the structure charset in intlobj.h. The structures defined in intlobj.h are explained starting in Texttype objects.

Texttype

A class implementing the InterBase texttype interface must provide an internal name for the texttype and functions that:

  • Calculate the key length for a given string length
  • Calculate the collation key for a given string
  • Compare two strings
  • Convert a string to uppercase
  • Convert a string to lowercase
  • Convert a string to WCHAR format
  • Fetch the next character in a string

Texttypes are defined by the structure texttype in intlobj.h.

Conversion

A class implementing the InterBase conversion interface must provide an internal name for the conversion and a function to convert FROM a charset TO another charset.

Conversions are defined by the structure csconvert in intlobj.h

Data Types for International Object Identifiers

Character sets and text types are identified by numeric values. Once a numeric value is established for a character set or text type, the value will never be re-designated by InterBase. Preserving those numbers protects backup/restore and compatibility between servers.

Some character set and collation identifiers are reserved for user implementations. If you are creating a text type for your own application, you should use values in the range of 250 to 255 for character set identifiers and 250 to 249 for collation identifiers. If you are creating a text type that you expect to share or sell, you should use a value below 250 for each identifier, and register your identifiers with InterBase support.

Character Set Identifiers

Character set identifiers range from 0 to 255. The range from 250 to 255 is reserved for user implementations. All numbers below 250 must be registered with InterBase support to prevent conflicts with other vendors of InterBase international objects.

typedef unsigned char   CHARSETID;

CHARSET ID 127 is reserved for internal InterBase use.

Texttype identifiers

A texttype is identified by a 16-bit value. The most-significant byte is the collation identifier (COLLATIONID). The least significant byte is the character set identifier (CHARSETID).

Collation IDs range from 0 to 254. Collation identifiers from 250 to 254 are reserved for user-defined collations. Value 255 is not supported. If you use a collation identifier below 250, you must register it with InterBase support to avoid conflicts with other vendors of InterBase international objects. By convention Collation ID 0 is the default collation for a given character set. The default collation for a character set is binary collation, where characters are ordered by the binary value specified by their form of use.

typedef unsigned char   COLLATIONID;
typedef unsigned short  TEXTTYPEID;

#define GET_CHARSETID(t)     ((t) & 0xFF)
#define GET_COLLATIONID(t)   (((t) >> 8) & 0xFF)

Collation identifiers are unique within a character set. Texttype ID’s are unique across the database.

Linking Numeric Identifiers and Symbolic Names

InterBase searches internal system metadata tables for the symbolic names to attach to charset and texttype objects. Symbolic Names for Character Set ID’s

The following query gives the correspondence between charset ID’s and charset names.

SELECT RDB$CHARACTER_SET_ID, RDB$CHARACTER_SET_NAME
FROM RDB$CHARACTER_SETS;

This name is the “official” name for a character set. However, naming conventions for character sets vary. InterBase character sets have aliases to conform to the naming conventions on different platforms.

SELECT RDB$TYPE, RDB$TYPE_NAME
FROM RDB$TYPES
WHERE RDB$FIELD_NAME = RDB$CHARACTER_SET_NAME;

RDB$TYPES in the internal table used to define symbolic constants for various internal values.

Symbolic Names for Collation ID’s

Collation names are also defined in a system table. Retrieving them requires a more complicated query

SELECT RDB$COLLATION_NAME, RDB$COLLATION_ID,
RDB$CHARACTER_SET_ID,
RDB$COLLATION_ID * 256 + RDB$CHARACTER_SET_ID
AS TEXTTYPEID
FROM RDB$COLLATIONS;

RDB$COLLATIONS stores the collation name, the collation identifier, and the character set identifier for each supported collation. The TEXTTYPEID is calculated from the COLLATION_ID and the CHARACTER_SET_ID.

COLLATION_NAME must be unique in a database. Each texttype has a unique collation name. Aliases are not supported for collation names.

Naming Conventions

Collation names often follow the convention XX_yy.

Thus, DE_DE is the collation name for German as used in Germany. FR_FR is French as used in France, and FR_CA is French as used in Canada.

InterBase collations that emulate Paradox or dBASE collations have names like “PDOXxxxx”, “PXWxxx” or “DBxxxx”. The names should match the names used in Paradox and dBASE.

A special collation exists for each character set. This collation has the same name as the character set and implements binary collation for the character set. Binary collation sorts the character set by the numeric codes used to represent the characters.

Character set names follow the standard that defines them (for example: WIN1252, DOS437). Character set alias names support the differences in these names between platforms (for example: WIN1252, WIN_1252).

Searching for an Object

InterBase searches for an international implementation object when it loads its metadata. It searches these three locations in this order:

  1. The engine itself
  2. The international library – gdsintl
  3. The supplemental international library – gdsintl2

Objects for character sets ASCII, NONE, OCTETS, and UNICODE_FSS are built into the engine. Without these character set definitions the engine cannot load a database. The default collations for these character sets are also implemented within the engine.

All other InterBase international object implementations are in the gdsintl library.

The gdsintl2 library is available for user-defined international objects.

GDSINTL Libraries

Gdsintl and gdsintl2 are platform-dependant shared-object libraries. The suffix for the filename is platform dependant.

Shared-library suffixes
Platform Suffix
Windows DLL
Solaris SO
HP SL

The implementation of the library is platform-dependant. Libraries must be recompiled when moved to a new platform.

Some platforms do not support shared-object libraries. Novell NLM is an example. For those platforms, the entire gdsintl library is linked with the engine. User-defined international objects cannot be added to InterBase on platforms that do not support shared-object libraries.

The InterBase installation procedure places gdsintl in the intl, bin or lib directory in the InterBase root directory. gdsintl2 must be stored in the same directory as gdsintl. The location of the gdsintl library is dependant on the platform and InterBase version.

Shared-library locations
Platform Version Location
Windows 4.x & 5.x bin
All others 4.x & 5.x lib
All 6.x intl

Single Entry Point

Each international library has a single entry point, which is LD_lookup for gdsintl, and LD_lookup2 for gdsintl. Aside from the difference in entry point names, the libraries are functionally identical.

Historical note: Early versions of IBM’s AIX platform supported only one entry point in shared-object libraries. That limitation is the reason for limiting the lookup function to the single entry point for InterBase international libraries.

Historical note: The Apollo platform required that identifiers be unique across all shared-object libraries loaded by a program. That limitation is the reason for requiring the different entry point names in the two libraries.

The LD2_lookup Function

The lookup function for the gdsintl2 library is passed the following:

  • Type of object to lookup
  • Numerical identifier to lookup

It returns

  • 1 if object was not found
  • 0 if object was found and a pointer to a function used to initialize the object
USHORT LD2_lookup (
USHORT        objtype,
FPTR_SHORT    *fun,
SSHORT        parm1,
SSHORT        parm2)

Objtype is one of type_texttype, type_charset, or type_csconvert. Any other input is an error.

Shared-library locations
Objtype Parm1 Parm2
Type_texttype TEXTTYPEID 0
Type_charset CHARSETID 0
Type_csconvert CHARSETID (from) CHARSETID (to)
Any other value ERROR ERROR

An example LD2_lookup function is provided in ld2.c

Initialization Functions

When an international implementation is found, the call to LD2_lookup returns a pointer to the class initialization function. The engine calls this initialization function with a pointer to a 0-initialized data structure, which the function will fill. This data structure defines the object to the engine.

The initialization is part of database startup, not server startup! If your server supports more than one database, the initialization function will be called for every database that shares a set of international objects. Even in a single-database server, the initialization function will be called each time the engine shuts down and reloads a database that was temporarily inactive.

The initialized object is not released until the database is unloaded. The engine releases the object as a normal part of database shutdown.

Here is a sample initialization function for a texttype:

unsigned short LC_win1252_example(
TEXTTYPE cache,
SSHORT  parm1,
SSHORT  dummy)
{
cache->texttype_version = IB_LANGDRV_VERSION;
/* … Other initialization code … */
}

Each charset, each texttype, and each conversion supported in a gdsintl library must have an initialization function.

Texttype Objects

Texttype objects implement an InterBase texttype. They provide collation and other text manipulation functions. This is the definition of texttype in intlobj.h:

typedef struct texttype {
struct blk       texttype_blk;
USHORT           texttype_version;          /* version ID of object */
USHORT           texttype_flags;            /* miscellaneous flags */
TTYPE_ID         texttype_type;             /* Interpretation ID */
ASCII           *texttype_name;
CHARSET_ID       texttype_character_set;    /* ID of base character set */
SSHORT           texttype_country;          /* ID of base country values */
BYTE             texttype_bytes_per_char;   /* max bytes per character */
ULONG            texttype_license_mask;     /* required bits for license */

/* MUST BE ALIGNED */
FPTR_SHORT       texttype_fn_init;
FPTR_SHORT       texttype_fn_key_length;
FPTR_SHORT       texttype_fn_string_to_key;
FPTR_short       texttype_fn_compare;
FPTR_SHORT       texttype_fn_to_upper;      /* convert one ch to uppercase */
FPTR_SHORT       texttype_fn_to_lower;      /* One ch to lowercase */
FPTR_short       texttype_fn_str_to_upper;  /* Convert string to uppercase */
FPTR_SHORT       texttype_fn_to_wc;         /* convert string to wc */
FPTR_SHORT       texttype_fn_contains;      /* s1 contains s2? */
FPTR_SHORT       texttype_fn_like;          /* s1 like s2? */
FPTR_SHORT       texttype_fn_matches;       /* s1 matches s2 */
FPTR_SHORT       texttype_fn_sleuth_check;  /* s1 sleuth s2 */
FPTR_SHORT       texttype_fn_sleuth_merge;  /* aux function for sleuth */
FPTR_short       texttype_fn_mbtowc;        /* get next character */
BYTE            *texttype_collation_table;
BYTE            *texttype_toupper_table;
BYTE            *texttype_tolower_table;
BYTE            *texttype_expand_table;
BYTE            *texttype_compress_table;
BYTE            *texttype_misc;             /* Used by some drivers */
ULONG           *texttype_unused [4];       /* spare space */
} *TEXTTYPE;

Many of the fields in this structure are for use only by the engine, some are for use by the object itself, and some form the interface between the InterBase engine and the texttype implementation.

texttype structure fields
Field Name Description
texttype_blk Used for the engine’s memory management – do NOT touch!
texttype_version Set to IB_LANGDRV_VERSION
texttype_flags For engine use
texttype_name Pointer to ASCII internal name for texttype
texttype_character_set CHARSET_ID of the character set this object is implemented for.
texttype_country Country ID of locale (not required)
texttype_bytes_per_char Max bytes per char (duplicate information from charset object)
texttype_license_mask No longer used.
texttype_fn_init Pointer to function used to initialize the data structure
texttype_fn_key_length Calculates key length for a input string length.
texttype_fn_string_to_key Converts a string to a collation key
texttype_fn_compare Compares two strings of unlimited length.
texttype_fn_to_lower Converts to lowercase
texttype_fn_to_upper, texttype_fn_str_to_upper Converts to uppercase
texttype_fn_to_wc, texttype_fn_mbtowc Converts string to WCHAR (belongs with charset object!)
texttype_fn_contains, texttype_fn_like, texttype_fn_matches, texttype_fn_sleuth_check, texttype_fn_sleuth_merge The engine will initialize these pointers with functions used to implement LIKE, CONTAINS, and the GDML matching language. Do not set them in the initialization function.
texttype_collation_table, texttype_toupper_table, texttype_tolower_table, texttype_expand_table, texttype_compress_table, texttype_misc, texttype_unused These fields provide pointers to data tables that the implementation uses to provide its functionality. The engine never refers to them directly; they are for the object’s use.

Functions to Perform Operations on Text

The first parameter of all functions implemented within the texttype is a pointer to the data structure initialized by the initialization function. This pointer provides access to the data tables: texttype_collation_table, texttype_toupper_table, etc. Functions can be shared among different texttypes implemented in the same gdsintl2 shared library. The result of the function is determined by the data tables, which are specific to a texttype.

The comments in lc_example.c provide more detail on each function.

Collation Keys

A collation key maps a string with a non-default collation to a set of bytes that can be compared with unsigned-byte comparison preserving the culturally correct collation order. InterBase stores collation keys in indexes. Index keys are always compared with unsigned byte comparisons, starting with the high order byte and continuing until a difference or the end of the key is found. When creating an index key, InterBase transforms values that do not produce correct ordering using unsigned byte comparisons to values that do.

The mapping from a string to a collation key can be irreversible. The original string is never recreated from the collation key.

When writing a textype_fn_string_to_key function, the resulting key must consist of unsigned bytes.

In a binary collation, the collation key is identical to the string.

Examples of collation keys created using different collations
Collation Input Collation Key
Binary ÁbC ÁbC
Case Insensitive binary ÁbC ÁBC
Accent Insensitive binary ÁbC AbC
Case and Accent Insensitive binary ÁbC ABC
English Book Title The Practice Practice, The
Telephone Book (USA) McDonald MacDonald
Telephone Book (USA) 1 Way ONE WAY
Lexical Ordering (German) ÁbCê ABCE’ ^AbCe
Lexical Ordering (French) ÁbCê ABCE^ ‘AbCe
Lexical Ordering (German) Cheß CHESS 1chess
Lexical Ordering (Spanish) Cheß XHEß Xheß where X represents a special value between C and D.

Collation Keys versus Comparison

A texttype must provide functions for string comparison and creating collation keys. String comparison is used for equality and inequality Boolean evaluations. Collation keys are used in indexes. Comparing strings and comparing collation keys must produce identical results.

The two functions have a significant difference. Index keys have a fixed maximum length, currently 254 bytes. String comparison must be able to handle strings of unlimited length.

Ignore Trailing Spaces

SQL requires that trailing spaces be ignored when comparing two strings of different lengths. It is important that both the key creation function and the compare function ignore trailing spaces on strings.

Note

The space character is determined by the character set.

Monocase operations

A texttype must contain functions to implement the SQL UPPER(x) and LOWER(x) functions. The result of UPPER & LOWER is culturally specific, and the rules are determined by the collation sequence defined with the column.

The to_upper and to_lower functions can be implemented using data tables or directly in code.

InterBase does not currently support the SQL LOWER(x) function, however it is necessary to implement it in each defined texttype.

Form of Use Changes

A texttype contains functions for changing the “form of use” of a character string from NCHAR or MBCHAR to WCHAR. These functions, strictly speaking, belong with the charset object. A design decision in the Internationalization architecture placed them with texttype.

Text Matching Functions

Entries exist in texttype for custom implementation of SQL LIKE, CONTAINS, and GDML MATCHING functions. The InterBase engine will fill these entries with internal function pointers after the texttype initialization function is called. Your driver should not initialize these function pointers.

Debugging a Collation Driver

Debugging a collation driver should be done before making it into a GdsIntl2 driver. While the collation driver is a standalone entity, it can print the collation keys it creates to check for correctness.

After installing the driver, sort and index several tables, compare indexed and non-indexed strings and perform UPPER() operations to ensure the driver is working properly.

The example test file is cs_example.sql.

Example

The example texttype driver lc_example.c implements a simple version of English Title Sort Order. When titles are sorted, articles such as “The”, “An”, or “A” at the beginning of the title are ignored. Thus “The Wind and the Lion” sorts as if it were written “Wind and the Lion”.

In this driver, the Collation Key created is simply the input string, minus any articles at the beginning of the string. This collation does not change the default ordering of accented characters or upper and lower case characters.

Conversion Objects

InterBase uses conversion objects to transform one character set to another. A conversion object is defined by the data structure csconvert in intlobj.h

typedef struct csconvert {
struct blk    csconvert_blk;
USHORT        csconvert_version;
USHORT        csconvert_flags;
SSHORT        csconvert_id;
ASCII        *csconvert_name;
CHARSET_ID    csconvert_from;
CHARSET_ID    csconvert_to;
FPTR_SHORT    csconvert_convert;
BYTE         *csconvert_datatable;
BYTE         *csconvert_misc;
ULONG        *csconvert_unused [2];
} *CSCONVERT;
csconvert struct fields
Field Name Definition
csconvert_blk Used by the engine’s memory management. Do not modify.
csconvert_flags Engine use – do not modify
csconvert_id No longer used
csconvert_name Internal name – can be ignored.
csconvert_from CharsetID being converted FROM
csconvert_to CharsetID being converted TO
csconvert_convert Function to perform the conversion
csconvert_datatable, csconvert_misc, csconvert_unused Available for object implementation to use for pointer storage, etc. Not directly accessed by the engine.

Conversion API

The conversion object has a single method, the function that converts from one character set to another.

static USHORT     conversion_function (
CSCONVERT   obj,
BYTE       *dest_ptr,
USHORT      dest_len,
BYTE       *src_ptr,
USHORT      src_len,
SSHORT     *err_code,
USHORT     *err_position);
Input parameters
obj The conversion object
dest_ptr Location for the converted string, or NULL when an estimate of output length is requested.
dest_len Length of the dest_ptr buffer, in bytes
str_ptr The input string
src_len Length of input string, in bytes.
err_code See below
err_position See below

Returns: The length of the converted string, in bytes.

If dest_ptr is NULL, then the routine returns a maximum estimate of the space required for the output string if an input string of src_len bytes is converted.

Error Returns from Conversion Function

The conversion function can return errors to the engine. Generally the engine will report these to the user as a “transliteration error”. The error code is returned via *err_code. The position in the input string where the error was detected is returned via *err_position.

Conversion function error codes
Result in err_code Meaning
0 No Error
CS_CONVERT_ERROR Input character does not exist in output character set.
CS_BAD_INPUT Input string was not properly formed per rules of the input character set.
CS_TRUNCATION_ERROR Not enough space available in the output buffer.

Addressed by FROM charsetid & TO charsetid

Unlike character sets and collations, conversions have no symbolic names. A conversion is identified by the charsetID is it converting from and the charsetID it is converting to. An InterBase aficionado will recognize the similarity to blob filters.

If no direct conversion object has been created to convert from charset A to charset B, the engine converts charset A to UNICODE and then UNICODE to charset B.

Example

In the sample code provided, cv_example.c is an example of a conversion object. Conversion functions are generally table driven. . The example contains several different functions. The functions are code, and produce a specific conversion when the the data table for that conversion is given.

For conversion tables, check files at www.unicode.org

Charset object

InterBase uses a charset object to represent information for a character set. Much of the information inside a charset object is for engine use only.

charsets are defined by the structure charset in intlobj.h

typedef struct charset {
struct blk        charset_blk;
USHORT            charset_version;
USHORT            charset_flags;
CHARSET_ID        charset_id;
ASCII            *charset_name;
BYTE              charset_min_bytes_per_char;
BYTE              charset_max_bytes_per_char;
BYTE              charset_space_length;
BYTE             *charset_space_character;

/* Must be aligned */

FPTR_SHORT        charset_well_formed;
struct csconvert  charset_to_unicode;
struct csconvert  charset_from_unicode;

VEC               charset_converters;
VEC               charset_collations;
ULONG            *charset_unused [2];
} *CHARSET;
charset struct fields
Field Definition
charset_blk Used by engine’s memory manager. Do not modify.
charset_version Version ID of InterBase international interface.
charset_id The CHARSETID for this character set
charset_name Internal name for charset
charset_min_bytes_per_char Minimal number of bytes per character.
charset_max_bytes_per_char Maximum bytes per character. InterBase will allocate this number of bytes for each character in a SQL CHAR(n) definition.
charset_space_length Length, in bytes, of the space character
charset_space_character Pointer to the definition of the space character
charset_well_formed Not used. This was intended to be a pointer to a function that would validate that a string was well formed by the rules of a character set.
charset_to_Unicode A conversion object that transforms the character set TO Unicode (WCHAR).
charset_from_Unicode A conversion object that transforms UNICODE into the character set.
charset_converters For engine use only. (The Engine keeps a list of converters to/from this character set here).
charset_collations For engine use only. (The Engine keeps a list of collations defined for this character set here).
charset_unused May be used by object implementation.

Two Conversion Objects

Every charset must contain two conversion definitions: a conversion from Unicode to the charset and a conversion to Unicode from the charset. All InterBase character sets must be able to convert themselves to and from Unicode.

When the engine cannot find a direct conversion between two character sets, it uses the converters here. Unicode is the intermediate conversion step.

One Texttype Object

The SQL standard requires that all character sets have a default collation. By convention in InterBase, the default collation for a character set has the same name as the character set. It performs a collation in binary, or codepoint order.

The character set default texttype object is loaded when the character set is loaded. It must be defined separately in the gdsintl2 library.

The SPACE character

The SQL standard requires that all character sets define a “space character”. This character is used to pad strings assigned to CHAR(n) columns to the length of the column. Trailing space characters are ignored in all string comparison operations.

The overwhelming majority of modern character sets use 0x20 or 0x0020 for the space character. InterBase character sets follow that rule, except for character set OCTETS. OCTETS instantiates binary byte data, rather than character data, and uses 0x00 for the space character.

Example

The example character set in cs_example.c is identical to the InterBase provided ISO_LATIN_1 character set.

The critical structures in a character set definition are the tables that map the character set to and from Unicode. InterBase’s mapping tables were derived from the tables provided by Unicode, Inc. at their website - www.unicode.org

Building gdsintl2

The build process for gdsintl2 is platform specific. In general terms, gdsintl2 must be a shared library with minimal requirements for other libraries and a single entrypoints. Building a gdsintl2 library is very similar to building an InterBase UDF library.

The platform specific makefiles provided in the collation kit files are examples of building gdsintl2.

Special Notes

Backup/Restore

InterBase will backup and restore user defined character sets and collations, and data in columns defined on them without problems.

If you restore a backup file with that includes user defined international objects on a server other than the one on which it was backed up, you must ensure that the character sets and collations referenced in the backup file exist on the destination server. The CHARSETIDs and COLLATIONIDs must be identical on the two servers.

Testing a GDSINTL2

When you test a new GDSINTL2, you must be aware that the engine does not release a shared library until the engine itself is unloaded. Each time you update your gdsintl2, you must shut down the server, install the new gdsintl2, and reload the server.

Additional Shared Libraries

The gdsintl2 shared-image can not require any shared libraries that are not already loaded by InterBase. In particular, a gdsintl2 shared-library built with Borland C++Builder will require VCL libraries that InterBase does not load. You can build the library using C++Builder command line tools; just don't build the DLL within the visual environment.

If the gdsintl2 shared library references shared libraries that were not loaded during server startup, the engine will fail to load the gdsintl2 shared-library. The error message for this is cryptic: “Implementation of text subtype %d not located.” Because the gdsintl2 library is optional, the engine does not consider its absence an error. The error occurs when the engine attempts to interpret data stored in a format defined in gdsintl2.

Platform Provided Internationalization functions

Many platforms provide internationalization libraries that perform the functions needed in gdsintl2. For instance, JAVA provides the Collator class; Windows provides CompareString and similar functions.

InterBase recommends against using a platform provided Internationalization library as it may:

  • be platform specific. Moving the database to another platform will require re-implementation of the gdsintl2 library and may result in a different behavior if the two platform implementations are not precisely identical.
  • be OS version specific. Updating your OS may install a new version of the library – which may perform differently. As the key values are stored in InterBase indices your database will not operate correctly until all indices are rebuilt.
  • limit input string size. InterBase requires that comparison of Collation Keys and String Comparison return identical results – String Comparison would not be able to use the Collation Key function if it has a limit on input string size.
  • not be thread safe. Many platforms use the process-wide locale setting for the international library. The texttype implementation would then have to set the process locale before each text operation. This may result in other, unintended, differences in InterBase behavior.
  • have poor performance. An OS collation key function that loads data tables from disk each time will greatly reduce database performance.
  • require a library not loaded by the InterBase server If the library isn’t already loaded by the InterBase server, your gdsintl2 library will not be able to access the OS provided functions.

Signed and Unsigned values

A CHARSETID and TEXTTYPEID are unsigned values. However, the LD_lookup interface between the engine and the gdsintl library is defined as a signed short. You must use a C cast to refer to the value as an unsigned short inside of your gdsintl2 module.

The error given when the engine is unable to locate a texttype is “Implementation of text subtype %d not located.” – the TEXTTYPEID / CHARSETID value is printed as a signed value instead of unsigned.

InterBase Source Files

The files provided in the InterBase collation kit were derived from InterBase source files. InterBase source files are best viewed with tab-stops set to 4.

InterBase was written in Kernighan &Richie C, and gradually moved toward ANSI C. Some inconsistencies remain; for example, some function definitions have prototypes, some do not.

Building an International Object, Step-by-Step

This section is a cookbook for creating a new InterBase texttype.

First, build and install a gdsintl2 shared library from the examples that accompany this document.

  1. Install the example files.
  2. Copy the platform specific makefile.platform to makefile.
  3. Make the example gdsintl2 module (“make gdsintl2”)
  4. Install the example gdsintl2 module with “make install”
  5. Test the module using cs_example.sql

Now that you are comfortable with the mechanics of building and using a new stored library, begin working on your character set.

  1. Define a collation object that implements binary collation.

  2. Create your character set mapping tables. Obtain a mapping table to and from Unicode from Unicode.com or another reliable source.

  3. Define a symbol for your charsetid in cs_example.h

  4. Modify cs_example in cs_example.c to use your mapping tables instead of the example ones.

  5. Modify the initialization function cs_example in cs_example.c setting the parameters to reflect your character set.

  6. Pick character set and collation identifiers. If you intend to share or sell your texttypes, you should pick a value below 250 for the character set and collation identifiers and register them with InterBase support. If your new texttypes are for use only in your controlled environment, pick a value between 250 and 255 for the character set and between 250 and 254 for the collation. Those values do not require registration.

  7. Modify the linkage function in ld2.c to reflect your picked parameters.

  8. Test the character set using a standalone test method.

  9. Make and install the character set as a gdsintl2 driver.

  10. Add the character set definition procedures defined in cs_util.sql to a newly created test database for the character set.

  11. Declare the character set to the database.

    execute procedure create_character_set (
    <NEW_CHARSET>,
    <your id>);
    
  12. Declare alias names for the character set.

    execute procedure create_character_set_alias (
    new name, /* for */ existing name);
    
  13. Test the newly created character set. Create a column with all characters from the character set. Map the column to a Unicode column and back again. Sort the column, index it.

  14. Modify the collation function in lc_example.c to support your collation.

  15. Rebuild and re-install gdsintl2. You will need to restart the InterBase server for it to load the new gdsintl2 module (as the prior version is already loaded).

  16. Declare the collation to the database.

    execute procedure create_collation (
    collation name, <collation id>, charset name);
    
  17. Test the collation.

Appendix

Existing Character Sets

SELECT RDB$CHARACTER_SET_NAME, RDB$CHARACTER_SET_ID
FROM RDB$CHARACTER_SETS
ORDER BY RDB$CHARACTER_SET_ID;