WIN_CZ: Case Insensitive Czech Language Collation for the WIN1250 Character Set

by Ivan Prenosil

For implementation details, I recommend looking at the file "intl\collations\win_cz.h" (any other collation would also do, but this one uses more symbolic constants than the others, so it is easier to understand).

There are several tables. The purpose of these two is evident:

static const BYTE ToUpperConversionTbl[UPPERCASE_LEN] = {
static const BYTE ToLowerConversionTbl[LOWERCASE_LEN] = {

When creating a new collation, you can simply "borrow" them from one of the existing collations.

These two tables handle letter pairs: either one character that sorts as two, like the German sharp s, which sorts as "ss":

static const ExpandChar ExpansionTbl[NUM_EXPAND_CHARS + 1] = {

or two characters that sort as one, like the Czech "ch", which sorts as a single letter between "h" and "i":

static const CompressPair CompressTbl[NUM_COMPRESS_CHARS + 1] = {

Again, if you need this functionality, just copy and paste the tables from another collation, and then set the proper flags in the NoCaseOrderTbl table.
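To make the two ideas concrete, here is a minimal sketch of expansion and compression at the primary level only. It does not use the real ExpandChar or CompressPair structures; the function names and the weights are invented for illustration, and only lowercase ASCII letters plus the WIN1250 sharp s are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Byte value of the German sharp s in WIN1250.
const unsigned char SHARP_S = 0xDF;

// Hypothetical primary weight: even numbers leave a gap after each letter,
// so a compressed pair can be slotted in between two letters.
// Anything that is not a lowercase ASCII letter gets weight 0.
int weightOf(unsigned char c) {
    return (c >= 'a' && c <= 'z') ? 2 * (c - 'a') + 2 : 0;
}

// Turns a string into a sequence of primary weights.
std::vector<int> primaryWeights(const std::string& text) {
    std::vector<int> out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (c == 'c' && i + 1 < text.size() && text[i + 1] == 'h') {
            // Compression: the pair "ch" yields one weight between "h" and "i".
            out.push_back(weightOf('h') + 1);
            ++i;  // skip the 'h' that was consumed
        } else if (c == SHARP_S) {
            // Expansion: one character yields the weights of two letters.
            out.push_back(weightOf('s'));
            out.push_back(weightOf('s'));
        } else {
            out.push_back(weightOf(c));
        }
    }
    return out;
}
```

With these toy weights, "chata" falls between "hrad" and "idea", and a word spelled with the sharp s gets the same primary weights as the same word spelled with "ss".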

The essential piece is the table:

static const SortOrderTblEntry NoCaseOrderTbl[NOCASESORT_LEN] = {

It is used to generate the three-level sort key.

The first column represents the "basic weight", i.e. it ignores accents and upper/lower case. The second column distinguishes different accent marks, and the third column distinguishes upper-case from lower-case letters. The fourth and fifth columns are flags that indicate use of the ExpandChar ExpansionTbl and CompressPair CompressTbl tables.
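One common way to turn three columns into a three-level key is to emit all primary weights first, then all secondary weights, then all tertiary weights. The sketch below follows that scheme. It is an assumption about the general technique, not a copy of the Firebird key layout; the Weights struct, the lookup() function, the separator value and all weights are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical three-column entry; the real SortOrderTblEntry also carries
// the two expansion/compression flags, which are omitted here.
struct Weights {
    int primary;
    int secondary;
    int tertiary;
};

// Toy lookup covering a few WIN1250 bytes; every weight here is invented.
Weights lookup(unsigned char c) {
    switch (c) {
        case 'A':  return {1, 1, 1};
        case 'a':  return {1, 1, 2};
        case 0xC1: return {1, 2, 1};  // A with acute
        case 0xE1: return {1, 2, 2};  // a with acute
        case 'B':  return {2, 0, 1};
        case 'b':  return {2, 0, 2};
        default:   return {0, 0, 0};
    }
}

// Builds the key: primary weights, separator, secondary weights,
// separator, tertiary weights. Keys compare lexicographically.
std::vector<int> sortKey(const std::string& text) {
    std::vector<int> primary, secondary, tertiary;
    for (char ch : text) {
        const Weights w = lookup(static_cast<unsigned char>(ch));
        primary.push_back(w.primary);
        secondary.push_back(w.secondary);
        tertiary.push_back(w.tertiary);
    }
    const int LEVEL_SEPARATOR = -1;  // lower than every weight in the toy table
    std::vector<int> key = primary;
    key.push_back(LEVEL_SEPARATOR);
    key.insert(key.end(), secondary.begin(), secondary.end());
    key.push_back(LEVEL_SEPARATOR);
    key.insert(key.end(), tertiary.begin(), tertiary.end());
    return key;
}
```

Because the primary weights come first, an accent or case difference can only decide the order when the two strings are identical at the primary level.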

Example:

In Czech we have several national letters that behave like "independent" letters, i.e. they have their own position in the sort order (e.g. "s" with caron), and we have other accented letters whose accent counts only as a secondary difference (e.g. "a" with acute, or the Slovak letter "r" with acute). The relevant rows and their neighbours look like this (slightly reordered so that the differences are easier to see):

{FIRST_PRIMARY + 83, NULL_SECONDARY, CAPITAL_LETTER, 0, 0}, /* 81 Q */
{FIRST_PRIMARY + 84, FIRST_SECONDARY + 0, CAPITAL_LETTER, 0, 0}, /* 82 R */
{FIRST_PRIMARY + 84, ACUTE, CAPITAL_LETTER, 0, 0}, /* 192 R with acute */
{FIRST_PRIMARY + 86, FIRST_SECONDARY + 0, CAPITAL_LETTER, 0, 0}, /* 83 S */
{FIRST_PRIMARY + 87, NULL_SECONDARY, CAPITAL_LETTER, 0, 0}, /* 138 S with caron */
{FIRST_PRIMARY + 88, FIRST_SECONDARY + 0, CAPITAL_LETTER, 0, 0}, /* 84 T */
{FIRST_PRIMARY + 83, NULL_SECONDARY, SMALL_LETTER, 0, 0}, /* 113 q */
{FIRST_PRIMARY + 84, FIRST_SECONDARY + 0, SMALL_LETTER, 0, 0}, /* 114 r */
{FIRST_PRIMARY + 84, ACUTE, SMALL_LETTER, 0, 0}, /* 224 r with acute */
{FIRST_PRIMARY + 86, FIRST_SECONDARY + 0, SMALL_LETTER, 0, 0}, /* 115 s */
{FIRST_PRIMARY + 87, NULL_SECONDARY, SMALL_LETTER, 0, 0}, /* 154 s with caron */
{FIRST_PRIMARY + 88, FIRST_SECONDARY + 0, SMALL_LETTER, 0, 0}, /* 116 t */

You can see that:

  • letters that differ only by case have identical primary and secondary weights, like uppercase and lowercase "Q":

    {FIRST_PRIMARY + 83, NULL_SECONDARY, CAPITAL_LETTER, 0, 0}, /* 81 Q */
    {FIRST_PRIMARY + 83, NULL_SECONDARY, SMALL_LETTER, 0, 0}, /* 113 q */
    
  • "S" has a different primary weight than "S with caron"

    (FIRST_PRIMARY + 86 vs. FIRST_PRIMARY + 87),

    so "S with caron" always sorts between "S" and "T"

  • "R" and "R with acute" have the same primary weight (FIRST_PRIMARY + 84); they differ only at the secondary level

    {FIRST_PRIMARY + 84, FIRST_SECONDARY + 0, CAPITAL_LETTER, 0, 0}, /* 82 R */
    {FIRST_PRIMARY + 84, FIRST_SECONDARY + 2, CAPITAL_LETTER, 0, 0}, /* 192 R with acute */
    

    or, using the symbolic constant (const int ACUTE = FIRST_SECONDARY + 2;)

    {FIRST_PRIMARY + 84, ACUTE, CAPITAL_LETTER, 0, 0}, /* 192 R with acute */
    
  • Letters that have secondary variants use the constants FIRST_SECONDARY + <num>; characters without accented variants use NULL_SECONDARY (there is no accented "Q" or "space")

  • Letters with uppercase and lowercase variants use these constants in the third column

    const int CAPITAL_LETTER = FIRST_TERTIARY + 0;
    const int SMALL_LETTER = FIRST_TERTIARY + 1;
    

    characters that do not distinguish upper/lower (like numbers) use NULL_TERTIARY in the third column.
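The observations above can be checked mechanically. The sketch below copies the first three columns of some of the quoted rows and reports the level at which two characters first differ. The numeric values given to FIRST_PRIMARY, NULL_SECONDARY, FIRST_SECONDARY and FIRST_TERTIARY are assumptions made so that the example compiles; the real values are defined in the Firebird sources. The Row struct and the firstDifferingLevel() function are hypothetical helpers.

```cpp
#include <cassert>

// Assumed numeric values, for illustration only.
const int FIRST_PRIMARY = 1;
const int NULL_SECONDARY = 0;
const int FIRST_SECONDARY = 1;
const int FIRST_TERTIARY = 1;

// Symbolic constants as quoted in the text.
const int ACUTE = FIRST_SECONDARY + 2;
const int CAPITAL_LETTER = FIRST_TERTIARY + 0;
const int SMALL_LETTER = FIRST_TERTIARY + 1;

// First three columns of a table row (the two flag columns are omitted).
struct Row {
    int primary;
    int secondary;
    int tertiary;
};

// Rows copied from the excerpt above.
const Row Q_UPPER       = {FIRST_PRIMARY + 83, NULL_SECONDARY,      CAPITAL_LETTER};
const Row Q_LOWER       = {FIRST_PRIMARY + 83, NULL_SECONDARY,      SMALL_LETTER};
const Row R_UPPER       = {FIRST_PRIMARY + 84, FIRST_SECONDARY + 0, CAPITAL_LETTER};
const Row R_ACUTE_UPPER = {FIRST_PRIMARY + 84, ACUTE,               CAPITAL_LETTER};
const Row S_UPPER       = {FIRST_PRIMARY + 86, FIRST_SECONDARY + 0, CAPITAL_LETTER};
const Row S_CARON_UPPER = {FIRST_PRIMARY + 87, NULL_SECONDARY,      CAPITAL_LETTER};
const Row T_UPPER       = {FIRST_PRIMARY + 88, FIRST_SECONDARY + 0, CAPITAL_LETTER};

// Returns 1, 2 or 3 for the first level at which the rows differ,
// or 0 if all three weights are equal.
int firstDifferingLevel(const Row& a, const Row& b) {
    if (a.primary != b.primary) return 1;
    if (a.secondary != b.secondary) return 2;
    if (a.tertiary != b.tertiary) return 3;
    return 0;
}
```

For example, firstDifferingLevel(Q_UPPER, Q_LOWER) reports the third level, R against R with acute reports the second, and S against S with caron reports the first.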

It is also necessary to adjust some constants at the beginning of the file, such as:

const int NUM_EXPAND_CHARS = 1; <<< number of values in ExpansionTbl table
const int NUM_COMPRESS_CHARS = 4; <<< number of values in CompressTbl table
const int MAX_NCO_PRIMARY = 153; <<< maximal value in the first column, see
    {FIRST_PRIMARY + 153, NULL_SECONDARY, NULL_TERTIARY, 0, 0} /* 255 */
const int MAX_NCO_SECONDARY = 6; <<< maximal value in the second column, see
    const int OGONEK = FIRST_SECONDARY + 6;
    {FIRST_PRIMARY + 65, OGONEK, CAPITAL_LETTER, 0, 0}, /* 165 A */
const int MAX_NCO_TERTIARY = 1; <<< maximal value in the third column, see
    const int CAPITAL_LETTER = FIRST_TERTIARY + 0;
    const int SMALL_LETTER = FIRST_TERTIARY + 1;
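Since the MAX_NCO_* constants must agree with the table, it can help to compute the maxima instead of counting by hand. The following is only a sketch of such a check, not part of Firebird: the Row and Maxima structs and columnMaxima() are hypothetical, and the rows are assumed to hold offsets from FIRST_PRIMARY, FIRST_SECONDARY and FIRST_TERTIARY, with the NULL_* entries written as 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// One table row expressed as offsets from the FIRST_* constants.
struct Row {
    int primary;
    int secondary;
    int tertiary;
};

// Largest offset found in each of the three columns.
struct Maxima {
    int primary;
    int secondary;
    int tertiary;
};

// Scans the table once and records the maximum of every column.
Maxima columnMaxima(const Row* table, std::size_t length) {
    Maxima m = {0, 0, 0};
    for (std::size_t i = 0; i < length; ++i) {
        m.primary = std::max(m.primary, table[i].primary);
        m.secondary = std::max(m.secondary, table[i].secondary);
        m.tertiary = std::max(m.tertiary, table[i].tertiary);
    }
    return m;
}

// A few rows based on the text: Q, R with acute, byte 255, A with ogonek.
const Row SAMPLE[] = {
    {83, 0, 0},
    {84, 2, 0},
    {153, 0, 0},
    {65, 6, 0},
    {83, 0, 1},
};
const std::size_t SAMPLE_LEN = sizeof(SAMPLE) / sizeof(SAMPLE[0]);
```

The three values returned for the full table would then be compared against MAX_NCO_PRIMARY, MAX_NCO_SECONDARY and MAX_NCO_TERTIARY.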

All the above is basically how collations work in Firebird V1.5. In Firebird V2.x there are more possibilities.

The table RDB$COLLATIONS has new columns:

RDB$COLLATION_ATTRIBUTES
RDB$SPECIFIC_ATTRIBUTES
RDB$BASE_COLLATION_NAME

In order to use this new functionality, the collation must be defined as FAMILY3; see:

"intl\lc_iso8859_1.cpp"

The column RDB$COLLATION_ATTRIBUTES can hold these flags:

1 - PAD SPACE,          <<< use always; it means 'A' = 'A '
2 - CASE-INSENSITIVE,   <<< it means the third-level key will be ignored
4 - ACCENT-INSENSITIVE  <<< it means the second-level key will be ignored

As you can see, the new collation WIN_CZ has:

RDB$COLLATION_ATTRIBUTES = 3

i.e. it is accent sensitive but case insensitive (3 = 1 + 2, PAD SPACE plus CASE-INSENSITIVE). You may wonder why there are any values in the third column of NoCaseOrderTbl. The reason is that FAMILY3 collations can serve as a base for deriving other collations, so I can create a "new" case-sensitive collation from the "old" case-insensitive WIN_CZ using the procedure sp_register_collation() from the "C:\Program Files\Firebird\Firebird_2_0\misc\intl.sql" script.
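The flag arithmetic can be sketched as a few bit tests. The numeric values come from the list of flags above, and the TEXTTYPE_ATTR_* names are the ones used in the COLLATION(...) entry for intlnames.h; the three helper functions are hypothetical and are not part of the Firebird API.

```cpp
#include <cassert>

// Flag values as listed for RDB$COLLATION_ATTRIBUTES.
const int TEXTTYPE_ATTR_PAD_SPACE = 1;
const int TEXTTYPE_ATTR_CASE_INSENSITIVE = 2;
const int TEXTTYPE_ATTR_ACCENT_INSENSITIVE = 4;

// True if trailing spaces are ignored, i.e. 'A' = 'A '.
bool padsSpaces(int attributes) {
    return (attributes & TEXTTYPE_ATTR_PAD_SPACE) != 0;
}

// The second-level (accent) key is used unless the collation
// is accent insensitive.
bool usesSecondaryLevel(int attributes) {
    return (attributes & TEXTTYPE_ATTR_ACCENT_INSENSITIVE) == 0;
}

// The third-level (case) key is used unless the collation
// is case insensitive.
bool usesTertiaryLevel(int attributes) {
    return (attributes & TEXTTYPE_ATTR_CASE_INSENSITIVE) == 0;
}
```

For the value 3 this gives padding on, secondary level used, tertiary level ignored; for 7 (all three flags) both the secondary and the tertiary level are ignored.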

And here are files that had to be changed in order to incorporate the new collation into the standard Firebird build (these are older notes, the actual values may differ slightly):

"firebird2\src\intl\ld.cpp"

EXTERN_texttype(WIN1250_c8_init);

"firebird2\src\intl\lc_iso8859_1.cpp"

TEXTTYPE_ENTRY(WIN1250_c8_init)
{
static const ASCII POSIX[] = "WIN1250_NOACC.WIN1250";
#include "../intl/collations/win1250noacc.h"
return FAMILY3(cache, CC_CZECH, LDRV_TIEBREAK, NoCaseOrderTbl, ToUpperConversionTbl, ToLowerConversionTbl, CompressTbl, ExpansionTbl, POSIX, attributes, specific_attributes, specific_attributes_length);
}

"firebird2\src\jrd\intlnames.h"

COLLATION("WIN1250_NOACC", CC_CZECH, CS_WIN1250, 8, WIN1250_c8_init, TEXTTYPE_ATTR_PAD_SPACE | TEXTTYPE_ATTR_CASE_INSENSITIVE | TEXTTYPE_ATTR_ACCENT_INSENSITIVE)

"firebird2\builds\install\misc\fbintl.conf"

collation WIN1250_NOACC

I hope that these instructions are clear enough.