Working with Unicode
FairCom DB API provides support for Unicode. This support includes:
- Unicode UTF-16 field types
- UTF-8 compliant C/C++ API
- Indexing on Unicode field data
- ICU library support
Unicode UTF-16
UCS-2 and UTF-16 are alternative names for a 16-bit Unicode Transformation Format, a character encoding form that provides a way to represent a series of abstract characters from Unicode and ISO/IEC 10646 as a series of 16-bit words suitable for storage or transmission via data networks. UTF-16 is officially defined in Annex Q of ISO/IEC 10646-1. It is also described in "The Unicode Standard" version 3.0 and higher, as well as in the IETF's RFC 2781.
UTF-16 represents a character assigned within the lower 65,536 code points of Unicode or ISO/IEC 10646 as a single code value equal to the character's code point: 0 for 0, or hexadecimal FFFD for FFFD, for example.
UTF-16 represents a character above hexadecimal FFFF as a surrogate pair of code values from the range D800-DFFF. For example, the character at code point hexadecimal 10000 becomes the code value sequence D800 DC00, and the character at hexadecimal 10FFFD, the upper limit of Unicode, becomes the code value sequence DBFF DFFD. Unicode and ISO/IEC 10646 do not assign characters to any of the code points in the D800-DFFF range, so an individual code value from a surrogate pair does not ever represent a character.
These code values are then serialized as 16-bit words, one word per code value. Because the endian-ness of these words varies according to the computer architecture, UTF-16 specifies three encoding schemes: UTF-16, UTF-16LE, and UTF-16BE.
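The surrogate-pair arithmetic described above can be sketched in C. This is an illustration only, not part of the FairCom DB API; the function name is hypothetical:

```c
#include <stdint.h>

/* Encode one Unicode code point (<= 0x10FFFF) as UTF-16 code values.
   Returns the number of 16-bit code values written (1 or 2). */
static int encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0xFFFF) {            /* BMP character: a single code value */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                 /* remaining 20 bits split across a pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

With this sketch, code point hexadecimal 10000 yields the pair D800 DC00 and 10FFFD yields DBFF DFFD, matching the examples above.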
UTF-16 is the native internal representation of text in the NT/2000/XP versions of Windows and in the Java and .NET bytecode environments, as well as in Mac OS X’s Cocoa and Core Foundation frameworks.
Unicode UTF-8
UTF-8 is the byte-oriented encoding form of Unicode. The UTF-8 encoding is defined in ISO 10646-1:2000 Annex D and also described in RFC 3629 as well as section 3.9 of the Unicode 4.0 standard.
UTF-8 has the following properties:
- UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
- All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
- The first byte of a multi-byte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multi-byte sequence are in the range 0x80 to 0xBF. This allows easy re-synchronization and makes the encoding stateless and robust against missing bytes.
- All possible 2^31 UCS codes can be encoded.
- UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are at most three bytes long.
- The sorting order of big endian UCS-4 byte strings is preserved.
- The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
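A minimal encoder makes these properties concrete. This is an illustration only, not part of the FairCom DB API; the function name is hypothetical:

```c
#include <stdint.h>

/* Encode one code point (<= 0x10FFFF) as UTF-8; returns the byte count.
   ASCII passes through unchanged; all other bytes have the high bit set. */
static int encode_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                               /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                              /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                            /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));   /* 11110xxx + 3 continuations */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, U+00E7 (ç) encodes to the two bytes C3 A7 that appear in the canção example later in this section.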
FairCom DB API C API UTF-8 Compliance
The support routines were changed to enable FairCom DB API to accept Unicode UTF-8 strings for names of objects such as databases, tables, fields, indexes, etc., and for path names used by the API.
For example, a table with name canção (a "song" in Portuguese) can be created with the following code.
Example
TEXT tableName[9] = {0x63, 0x61, 0x6e, 0xc3, 0xa7, 0xc3, 0xa3, 0x6f, 0x00};
if ((Retval = ctdbCreateTable(hTable, tableName, CTCREATE_NORMAL)) != CTDBRET_OK)
{
printf("ctdbCreateTable failed with error %d\n", Retval);
}
The bytes assigned to the variable tableName are the UTF-8 representation of the Portuguese word canção.
If the original strings are encoded using UTF-16, you can use the FairCom DB API function, ctdb_u16TOu8(), to convert a UTF-16 string to a UTF-8 encoding. A UTF-8 string can also be converted back to UTF-16 by calling function ctdb_u8TOu16().
Example
CTDBRET OpenTable(CTHANDLE hTable, pWCHAR tableName)
{
CTDBRET Retval;
TEXT buffer[512];
if ((Retval = (CTDBRET)ctdb_u16TOu8(tableName, buffer, sizeof(buffer))) == CTDBRET_OK)
{
if ((Retval = ctdbOpenTable(hTable, buffer, CTOPEN_NORMAL)) != CTDBRET_OK)
printf("ctdbOpenTable failed with error %d\n", Retval);
}
else
printf("ctdb_u16TOu8 failed with error %d\n", Retval);
return Retval;
}
Activating FairCom DB API Unicode support
Unicode support is currently available for any client when connecting to the c-tree Server for Windows or Mac OS X and for FairCom DB Standalone libraries under Windows or Mac OS X. For client operation, ensure you install the c-tree Server for Windows or Mac OS X WITH Unicode support.
When building the FairCom DB libraries, execute mtmake with the "u" flag
mtmake u
to prepare the library for Unicode support. Standalone and client builds need the ICU libraries from the ICU web site, as described in the next chapter.
FairCom DB API C and C++ Unicode support is activated by defining macro ctdbUNICODE. ctdbUNICODE is activated automatically when ctUNICODE is selected with the mtmake build utility:
---- FairCom c-tree Plus UniCode Support ----
This version of c-tree Plus provides support for UNICODE field types.
Do you want to support UniCode field types? (Y)es (N)o (D)efaults- [N]: y
ICU - International Components for Unicode
Unicode is the single, universal character set for text which enables the interchange, processing, storage and display of text in many languages.
The International Components for Unicode (ICU) are a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.
Complete details on the ICU can be found on the ICU Unicode website.
ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software. To the extent required by the licenses accompanying the ICU libraries, the terms of such license will apply in lieu of the terms of any agreement with FairCom with respect to the open source software including, without limitation, any provisions governing access to source code, modification or reverse engineering. FairCom makes no representation, warranty or other commitment of any kind regarding such open source software, offers no technical support for such open source software and shall, to the maximum extent permitted by law, have no liability associated with its use.
Unicode Support
Simply storing Unicode data has always been possible with FairCom DB API, provided the application treated the data as binary and performed any necessary translations. In that case, however, using the stored Unicode data as a segment of an index was difficult, since FairCom DB API had no way of knowing how the underlying binary data was encoded: UTF-8, UTF-16, ASCII, etc.
FairCom DB Unicode UTF-16 Field Types
Storing Unicode data requires DODA entries for each field. The individual wide-characters used in UTF-16 are not platform independent with respect to byte ordering. They are treated the same as short integers: on LOW_HIGH platforms, the lower order byte comes before the higher order byte. With the DODA entries in place, the Server and clients manage byte-order translation automatically.
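For illustration only, the byte-order translation amounts to a 16-bit byte swap per code unit; the Server and clients perform this automatically, and the helper below is hypothetical:

```c
#include <stdint.h>

/* Swap the two bytes of a UTF-16 code unit, as needed when moving data
   between LOW_HIGH (little-endian) and HIGH_LOW (big-endian) platforms. */
static uint16_t swap16(uint16_t u)
{
    return (uint16_t)((u << 8) | (u >> 8));
}
```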
FairCom DB API has four Unicode UTF-16 field types:
| UTF-16 Field Type | Description |
|---|---|
| CT_FUNICODE | A fixed-length field containing a UTF-16 encoded, null terminated string. This Unicode field type is similar to CT_FSTRING field type. |
| CT_F2UNICODE | A fixed-length field that begins with a 2-byte (16-bit) integer specifying the number of bytes in the following UTF-16 encoded string. This Unicode field type is similar to the CT_F2STRING field type. |
| CT_UNICODE | A variable-length field containing a UTF-16 encoded, null terminated string. This Unicode field type is similar to CT_STRING field type. |
| CT_2UNICODE | A variable-length field that begins with a 2-byte (16-bit) integer specifying the number of bytes in the UTF-16 encoded string. This Unicode field type is similar to the CT_2STRING field type. |
The length fields at the beginning of the CT_F2UNICODE and CT_2UNICODE field types, and the length in the DODA entry for the CT_FUNICODE and CT_F2UNICODE field types, are specified in bytes. Specifying a field length in bytes is consistent with all other FairCom DB API field types, but is inconsistent with the system-level routines that ordinarily use a number of characters, not a number of bytes, to describe the length of UTF-16 strings.
Storing a UTF-16 string longer than 64Kbytes requires a CT_UNICODE field. To store a UTF-16 string greater than 64Kbytes with a length prefix, convert the string to UTF-8 and store it in a CT_4STRING field, as discussed below. If this UTF-8 converted field is to be part of a key segment, then "Extended Key Segment" information must also be added to this key segment.
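Because these lengths count bytes while UTF-16 text is usually measured in characters, a small conversion helper can prevent sizing mistakes. The helper below is hypothetical and assumes 2-byte UTF-16 code units plus a 16-bit null terminator for CT_FUNICODE fields:

```c
/* Bytes required by a fixed-length CT_FUNICODE field that must hold up to
   nchars UTF-16 code units plus a 16-bit null terminator. Note that a
   character above hexadecimal FFFF occupies two code units. */
static int funicode_field_bytes(int nchars)
{
    return (nchars + 1) * 2;
}
```

For example, a 40-byte CT_FUNICODE field holds at most 19 code units plus the terminator.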
Creating Tables with Unicode Field types
Creating tables with Unicode field types is done by adding or inserting a new field with a field type set to one of the following Unicode field types: CT_FUNICODE, CT_F2UNICODE, CT_UNICODE or CT_2UNICODE.
FairCom DB API C API Example
ctdbAddField(hTable, "f1", CT_FUNICODE, 40);
ctdbAddField(hTable, "f2", CT_INT4, 4);
if ((eRet = ctdbCreateTable(hTable, "table", CTCREATE_NORMAL)) != CTDBRET_OK)
{
printf("ctdbCreateTable failed with error %d", eRet);
}
Reading UTF-16 Field Data
The ctdbGetFieldAsUTF16() function enables applications using the FairCom DB API C API to read data from Unicode field types.
CTDBRET ctdbGetFieldAsUTF16(CTHANDLE Handle, NINT FieldNbr, pWCHAR pValue, VRLEN size);
ctdbGetFieldAsUTF16() retrieves the field data as a Unicode UTF-16 string. If the underlying field type is not one of the Unicode field types, the data is converted to a UTF-16 string. Handle is a record handle, FieldNbr is the number of the field, pValue is a pointer to a wide (UTF-16) string buffer, and size indicates the size in bytes of the string area. ctdbGetFieldAsUTF16() returns CTDBRET_OK on success.
Reading UTF-16 C Example
CTDBRET CheckData(CTHANDLE hRecord, pTEXT str, NINT val)
{
CTDBRET eRet;
WCHAR WStr[32];
TEXT s[64];
CTSIGNED t;
if ((eRet = ctdbGetFieldAsUTF16(hRecord, 0, WStr, sizeof(WStr))) != CTDBRET_OK)
{
printf("ctdbGetFieldAsUTF16 failed with error %d", eRet);
goto Exit;
}
if ((eRet = (CTDBRET)ctdb_u16TOu8(WStr, s, sizeof(s))) != CTDBRET_OK)
{
printf("ctdb_u16TOu8 failed with error %d", eRet);
goto Exit;
}
if (strcmp(s, str) != 0)
{
printf("UNICODE field contents not the same as written");
eRet = CTDBRET_DIFFERENT;
goto Exit;
}
if ((eRet = ctdbGetFieldAsSigned(hRecord, 1, &t)) != CTDBRET_OK)
{
printf("ctdbGetFieldAsSigned failed with error %d", eRet);
goto Exit;
}
if ((NINT)t != val)
{
printf("integer field contents not the same as written");
eRet = CTDBRET_DIFFERENT;
goto Exit;
}
if ((eRet = ctdbNextRecord(hRecord)) != CTDBRET_OK)
{
if (eRet == INOT_ERR)
eRet = CTDBRET_OK;
else
printf("ctdbNextRecord failed with error %d", eRet);
}
Exit:
return eRet;
}
Writing UTF-16 Field Data
A new function ctdbSetFieldAsUTF16() has been added to FairCom DB API C API to enable applications to write data to Unicode field types.
CTDBRET ctdbSetFieldAsUTF16(CTHANDLE Handle, NINT FieldNbr, pWCHAR pValue);
ctdbSetFieldAsUTF16() puts a Unicode UTF-16 string in a Unicode field. If the underlying field type is not one of the Unicode field types, the UTF-16 string is converted to the appropriate type before the data is stored in the field. Handle is a record handle, FieldNbr is the field number and pValue is a pointer to a wide (UTF-16) string buffer. ctdbSetFieldAsUTF16() returns CTDBRET_OK on success.
Two overloaded methods named SetFieldAsUTF16() have been added to the CTRecord class to enable applications to write data to Unicode field types.
void CTRecord::SetFieldAsUTF16(NINT FieldNumber, pWCHAR value);
SetFieldAsUTF16() puts a Unicode UTF-16 string in a Unicode field. If the underlying field type is not one of the Unicode field types, the UTF-16 string is converted to the appropriate type before the data is stored in the field. FieldNumber is a number representing the field number and value is the wide (UTF-16) string buffer.
void CTRecord::SetFieldAsUTF16(const CTString& FieldName, pWCHAR value);
SetFieldAsUTF16() puts a Unicode UTF-16 string in a Unicode field. If the underlying field type is not one of the Unicode field types, the UTF-16 string is converted to the appropriate type before the data is stored in the field. FieldName is the field name and value is the wide (UTF-16) string buffer.
Writing UTF-16 Data C Example
CTDBRET AddData(CTHANDLE hRecord, pTEXT str, NINT val)
{
CTDBRET eRet;
WCHAR WStr[32];
if ((eRet = ctdbClearRecord(hRecord)) != CTDBRET_OK)
{
printf("ctdbClearRecord failed with error %d", eRet);
goto Exit;
}
if ((eRet = (CTDBRET)ctdb_u8TOu16(str, WStr, sizeof(WStr))) != CTDBRET_OK)
{
printf("ctdb_u8TOu16 failed with error %d", eRet);
goto Exit;
}
if ((eRet = ctdbSetFieldAsUTF16(hRecord, 0, WStr)) != CTDBRET_OK)
{
printf("ctdbSetFieldAsUTF16 failed with error %d", eRet);
goto Exit;
}
if ((eRet = ctdbSetFieldAsSigned(hRecord, 1, (CTSIGNED)val)) != CTDBRET_OK)
{
printf("ctdbSetFieldAsSigned failed with error %d", eRet);
goto Exit;
}
if ((eRet = ctdbWriteRecord(hRecord)) != CTDBRET_OK)
{
printf("ctdbWriteRecord failed with error %d", eRet);
goto Exit;
}
Exit:
return eRet;
}
Creating Key Segments based on Unicode Fields
Unicode key segments provide a challenge for two reasons:
- Unlike all other key segments previously implemented, the number of bytes stored in the key and the number of bytes of source data used to construct the key are not the same.
- The derivation of the binary sort key (segment) stored in the index from the source data is not a simple transformation.
To accommodate both of these challenges, c-tree Plus incorporated "extended key segments." The concept of an extended key segment can be applied to virtually any non-standard key segment. Our first implementation is for Unicode keys.
Because of the complexity of the Unicode collation algorithm, and because of the breadth of language and country support envisaged by Unicode, FairCom has chosen to implement Unicode key segments using the International Components for Unicode (ICU) open-source development project. The ICU implementation of Unicode support is available on a wide variety of platforms, but not every platform. The ICU web site can be accessed at:
IBM International Components for Unicode (ICU)
How to Specify a Unicode Key Segment
An ordinary FairCom DB API key segment is defined by a field handle and mode. FairCom DB API also allows the specification of a key segment using an offset, length, and mode.
In the following example, since the segment mode is CTSEG_SCHSEG, and if hField is a handle of a field whose type is one of the Unicode field types CT_FUNICODE, CT_F2UNICODE, CT_UNICODE or CT_2UNICODE, then FairCom DB API will understand this is a Unicode key segment.
Specifying a Unicode Key Segment C Example
hField = ctdbAddField(hTable, "customer", CT_FUNICODE, 40);
if (ctdbAddSegment(hIndex, hField, CTSEG_SCHSEG) == NULL)
printf("ctdbAddSegment failed with error %d\n", ctdbGetError(hIndex));
Specifying a Unicode Key Segment with CTSEG_UNCSEG
If a key segment is a Unicode segment, but the segment mode is not one of the CTSEG_SCHSEG modes or the segment field is not one of the Unicode field types, the segment mode must also specify the Unicode segment modifier CTSEG_UNCSEG. For example, assume the CT_FSTRING field contains a UTF-8 string:
Notice an Extended Key Segment definition must be created for the segment. Please refer to section Extended Key Segment Definition below. If no extended key segment definition is provided at the time of the table creation, FairCom DB API will create an extended key segment with default values. Please refer to the Default Extended Key Segment Definition section below.
CTSEG_UNCSEG C API Example
hField = ctdbAddField(hTable, "customer", CT_FSTRING, 40);
if ((ctdbAddSegment(hIndex, hField, CTSEG_SCHSEG | CTSEG_UNCSEG)) == NULL)
printf("ctdbAddSegment failed with error %d\n", ctdbGetError(hIndex));
ICU Collation Option Overview
The collation options can be grouped as follows: locale default control, collation strength, normalization, and special attributes. Locale default control affects the degree to which a default locale must be related to the requested locale. Collation strength determines how case, accents and other character modifiers affect the ordering of sort keys. Normalization affects how alternative variations of the "same" character (including its accents and other modifiers) are compared. The special attributes affect particular properties of the collation, which further modify the strength and normalization options. For example, a special attribute can be used to force lower case characters to be first or last in the collation.
If no locale default control option is made part of kseg_comp, there is no restriction on how close to the requested locale the effective locale must be. For example, if you request collation for the German language ("de"), you are likely to get a locale based on the system default (e.g., "en_US" in the United States). This is not a problem since it has been determined that the default rules work for the German language.
If ctKSEG_COMPU_SYSDEFAULT_NOTOK is used, then a request to use locale "xx_YY_Variant" will succeed as long as collation rules for "xx" are available. If ctKSEG_COMPU_FALLBACK_NOTOK is used, then rules for the particular locale with its optional country and variant modifiers must be available. Falling back from "xx_YY" to "xx" is not satisfactory. In the case of the "de" locale noted above, the segment definition would cause an error in the call to PutXtdKeySegmentDef() if either of the "NOTOK" default restrictions are part of the definition.
At most one of the following collation strength options can be included in kseg_comp:
- ctKSEG_COMPU_S_PRIMARY
- ctKSEG_COMPU_S_SECONDARY
- ctKSEG_COMPU_S_TERTIARY
- ctKSEG_COMPU_S_QUATERNARY
- ctKSEG_COMPU_S_IDENTICAL
- ctKSEG_COMPU_S_DEFAULT
At most, one of the following normalization options can be included in kseg_comp:
- ctKSEG_COMPU_N_NONE
- ctKSEG_COMPU_N_CAN_DECMP
- ctKSEG_COMPU_N_CMP_DECMP
- ctKSEG_COMPU_N_CAN_DECMP_CMP
- ctKSEG_COMPU_N_CMP_DECMP_CAN
- ctKSEG_COMPU_N_DEFAULT
One or more of the following special attributes can be included in kseg_comp. After each c-tree symbolic constant is the equivalent ICU attribute/value pair.
| c-tree Symbolic Constant | ICU Attribute value pair |
|---|---|
| ctKSEG_COMPU_A_FRENCH_ON | (UCOL_FRENCH_COLLATION,UCOL_ON) |
| ctKSEG_COMPU_A_FRENCH_OFF | (UCOL_FRENCH_COLLATION,UCOL_OFF) |
| ctKSEG_COMPU_A_CASE_ON | (UCOL_CASE_LEVEL,UCOL_ON) |
| ctKSEG_COMPU_A_CASE_OFF | (UCOL_CASE_LEVEL,UCOL_OFF) |
| ctKSEG_COMPU_A_DECOMP_ON | (UCOL_DECOMPOSITION_MODE,UCOL_ON) |
| ctKSEG_COMPU_A_DECOMP_OFF | (UCOL_DECOMPOSITION_MODE,UCOL_OFF) |
| ctKSEG_COMPU_A_SHIFTED | (UCOL_ALTERNATE_HANDLING, UCOL_SHIFTED) |
| ctKSEG_COMPU_A_NONIGNR | (UCOL_ALTERNATE_HANDLING, UCOL_NON_IGNORABLE) |
| ctKSEG_COMPU_A_LOWER | (UCOL_CASE_FIRST,UCOL_LOWER_FIRST) |
| ctKSEG_COMPU_A_UPPER | (UCOL_CASE_FIRST,UCOL_UPPER_FIRST) |
| ctKSEG_COMPU_A_HANGUL | (UCOL_NORMALIZATION_MODE, UCOL_ON_WITHOUT_HANGUL) |
It is permissible to set kseg_comp to zero. A zero kseg_comp implies no restrictions on locale defaults, default collation strength, default normalization, and no special attributes.
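As a sketch, the groups above combine into a single bit map by OR-ing at most one strength option, at most one normalization option, and any special attributes or locale restrictions. This fragment assumes the ctKSEG constants from ctport.h are in scope:

```c
/* Sketch: accents significant, default normalization, lower case first,
   and exact locale rules required (no fallback). */
LONG kseg_comp = ctKSEG_COMPU_S_SECONDARY
               | ctKSEG_COMPU_N_DEFAULT
               | ctKSEG_COMPU_A_LOWER
               | ctKSEG_COMPU_FALLBACK_NOTOK;
```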
For a complete treatment of all of these options, please refer to the ICU website and the Unicode Consortium’s website and publications.
Storing UTF-8 Data
Since a UTF-8 encoded string consists of ordinary ASCII characters (with code values between 0 and 127) and multi-byte characters (which have the highest-order bit set in each byte), such strings can be stored normally in any of FairCom DB API's string or binary field types such as CT_STRING, CT_FSTRING, CT_4STRING, etc. It is up to the application to decipher the field data.
On the other hand, if a field holding a UTF-8 encoded string is part of a key segment, then you need to define an "Extended Key Segment" for that segment to allow FairCom DB API to apply the appropriate translations to the field data when building the key data. Please refer to the "Extended Key Segment Definition" section below for more details.
Converting from Unicode UTF-16 to UTF-8
FairCom DB provides conversion routines between UTF‑8 and UTF‑16. The input strings are assumed to be terminated by a NULL character. All output buffer sizes are specified in bytes. The conversion routines return CTDBRET_OK (0) on success, error VBSZ_ERR (153) if the output buffer is too small, or error BMOD_ERR (446) if there is a problem with the input string.
ctdb_u8TOu16() converts an ASCII or UTF-8 encoded string to a UTF-16 Unicode string:
NINT ctdb_u8TOu16(pTEXT u8str, pWCHAR u16str, NINT u16size);
FairCom DB API C API Example
WCHAR buffer[256];
switch (ctdb_u8TOu16("tablename", buffer, sizeof(buffer)))
{
case CTDBRET_OK:
{
printf("UTF-8 to UTF-16 conversion ok\n");
break;
}
case VBSZ_ERR:
{
printf("Conversion buffer is too small\n");
break;
}
case BMOD_ERR:
{
printf("Problem occurred during conversion\n");
break;
}
default:
{
printf("Unknown error code\n");
break;
}
}
ctdb_u16TOu8() converts a UTF-16 encoded string to a UTF-8 string:
NINT ctdb_u16TOu8(pWCHAR u16str, pTEXT u8str, NINT u8size);
FairCom DB API C API Example
TEXT buffer[512];
switch (ctdb_u16TOu8(tableName, buffer, sizeof(buffer)))
{
case CTDBRET_OK:
{
printf("UTF-16 to UTF-8 conversion ok\n");
break;
}
case VBSZ_ERR:
{
printf("Conversion buffer is too small\n");
break;
}
case BMOD_ERR:
{
printf("Problem occurred during conversion\n");
break;
}
default:
{
printf("Unknown error code\n");
break;
}
}
Extended Key Segment Definition
The implementation of extended key segments in FairCom DB API allows a single extended key segment definition to be used by more than one actual key segment. Extended key segment definitions may be set for all segments of a table, all segments of an index or for each particular key segment.
If a key segment mode includes the extended key segment modifier CTSEG_UNCSEG, or the segment mode is CTSEG_SCHSEG and the field type is one of the Unicode types, then the particular extended key segment definition to use for the segment is determined according to the following hierarchy. Use the definition specified for:
- The segment
- The index associated with the segment
- The data file associated with the index
Once an extended key segment definition has been specified at a particular level (for a particular type of segment), an attempt to specify another definition at the same level results in an error. This is in part because of the "first use" strategy noted above, and because one should not change a definition if key values already exist.
FairCom DB API C API Functions
The following functions are available to implement table-wide extended key segment definitions:
CTDBRET ctdbSetTableKSeg(CTHANDLE Handle, pctKSEGDEF pKSeg);
ctdbSetTableKSeg() establishes a table-wide extended key segment definition. Handle must be a table handle and pKSeg is a pointer to an extended key segment definition structure with the extended key definition. ctdbSetTableKSeg() returns CTDBRET_OK on success.
CTDBRET ctdbGetTableKSeg(CTHANDLE Handle, pctKSEGDEF pKSeg);
ctdbGetTableKSeg() retrieves the current table-wide extended key segment definition. Handle must be a table handle and pKSeg is a pointer to an extended key segment definition structure which will receive the definition. If no extended key segment definition is available, ctdbGetTableKSeg() returns CTDBRET_NOTFOUND. On success ctdbGetTableKSeg() returns CTDBRET_OK.
The following two functions were added to implement index-wide extended key segment definitions:
CTDBRET ctdbSetIndexKSeg(CTHANDLE Handle, pctKSEGDEF pKSeg);
ctdbSetIndexKSeg() establishes an index-wide extended key segment definition. Handle must be an index handle and pKSeg is a pointer to an extended key segment definition structure with the extended key definition. ctdbSetIndexKSeg() returns CTDBRET_OK on success.
CTDBRET ctdbGetIndexKSeg(CTHANDLE Handle, pctKSEGDEF pKSeg);
ctdbGetIndexKSeg() retrieves the current index-wide extended key segment definition. Handle must be an index handle and pKSeg is a pointer to an extended key segment definition structure which will receive the definition. If no extended key segment definition is available, ctdbGetIndexKSeg() returns CTDBRET_NOTFOUND. On success ctdbGetIndexKSeg() returns CTDBRET_OK.
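For example, an index-wide definition could be established as follows. This is a sketch assuming hIndex is a valid index handle; the structure members and constants are described in the "Extended Key Segment Structure" section:

```c
ctKSEGDEF ksgdef;
CTDBRET eRet;
ksgdef.kseg_ssiz = ctKSEG_SSIZ_COMPUTED; /* source size computed from the field */
ksgdef.kseg_type = ctKSEG_TYPE_UNICODE;  /* ICU Unicode segment */
ksgdef.kseg_styp = ctKSEG_STYP_UTF16;    /* UTF-16 source data */
ksgdef.kseg_comp = 0;                    /* defaults, no special attributes */
strcpy(ksgdef.kseg_desc, "de");          /* German collation rules */
if ((eRet = ctdbSetIndexKSeg(hIndex, &ksgdef)) != CTDBRET_OK)
printf("ctdbSetIndexKSeg failed with error %d\n", eRet);
```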
The following three functions were added to implement an extended key segment definition for a specific key segment:
CTDBRET ctdbSetSegmentKSeg(CTHANDLE Handle, pctKSEGDEF pKSeg);
ctdbSetSegmentKSeg() establishes a segment's extended key segment definition. Handle must be a segment handle and pKSeg is a pointer to an extended key segment definition structure with the extended key definition. ctdbSetSegmentKSeg() returns CTDBRET_OK on success.
CTDBRET ctdbGetSegmentKSeg(CTHANDLE Handle, pctKSEGDEF pKSeg);
ctdbGetSegmentKSeg() retrieves the current segment extended key segment definition. Handle must be a segment handle and pKSeg is a pointer to an extended key segment definition structure which will receive the definition. If no extended key segment definition is available, ctdbGetSegmentKSeg() returns CTDBRET_NOTFOUND. On success ctdbGetSegmentKSeg() returns CTDBRET_OK.
CTDBRET ctdbSetKSegDefaults(pctKSEGDEF pKSeg);
ctdbSetKSegDefaults() sets the system-wide default values for the extended key segment definition. pKSeg is a pointer to an extended key segment definition structure which will receive the definition.
The default values are:
kseg_ssiz = ctKSEG_SSIZ_COMPUTED;
kseg_type = ctKSEG_TYPE_UNICODE;
kseg_styp = ctKSEG_STYP_UTF16;
kseg_comp = ctKSEG_COMPU_S_DEFAULT | ctKSEG_COMPU_N_NONE;
kseg_desc = "en_US";
Extended Key Segment Structure
Extended key segments are specified by filling the fields of the ctKSEGDEF structure:
#define ctKSEGDLEN 32 /* length of desc string */
typedef struct keysegdef {
LONG kseg_stat; /* status (internal use) */
LONG kseg_vrsn; /* version info */
LONG kseg_ssiz; /* source size */
LONG kseg_type; /* segment type */
LONG kseg_styp; /* source type */
LONG kseg_comp; /* comparison options */
LONG kseg_rsv1; /* future use */
LONG kseg_rsv2; /* future use */
TEXT kseg_desc[ctKSEGDLEN]; /* text specification eg, locale string */
} ctKSEGDEF, ctMEM* pctKSEGDEF;
The FairCom DB module ctport.h contains defines for all of the constants, beginning with ctKSEG, used to create an extended key segment definition. As extended key segments are currently implemented, the kseg_stat and kseg_vrsn members are filled in as needed by the extended key segment implementation itself. The kseg_ssiz member specifies the number of bytes of source data used to derive the actual key segment. In addition to using a specific numeric value for the source size, kseg_ssiz may also be assigned either of two values discussed in the following two sections.
ctKSEG_SSIZ_COMPUTED
The information about the underlying data field is used to compute how much source data is available. For fields without length specifiers (such as CT_STRING or CT_UNICODE), an appropriate version of strlen() is used to determine data availability. This can be inefficient if the field may hold very long strings, however, since it is likely that only a small portion of the variable-length field will actually contribute to the key segment. An alternative is to specify a fixed source size. If the variable data is shorter than this size, it is still handled correctly.
ctKSEG_SSIZ_PROVIDED
The call to create the key segment will provide the particular length of source data available.
For an ICU Unicode definition, the remaining structure members are specified as follows:
kseg_type
Must be set to ctKSEG_TYPE_UNICODE.
kseg_styp
Specify the type of source data as ctKSEG_STYP_PROVIDED.
ctKSEG_STYP_PROVIDED means that the type of source data will be determined at run-time during key value construction. (Key value construction involves assembling the key value from its component segments and/or performing transformations to generate a binary sort key.) In this case, if the data type is one of the conventional c-tree string types (e.g., CT_STRING), the source data type is UTF-8; if a Unicode string type is found (e.g., CT_UNICODE), then the source data type is UTF-16. However, if the underlying data type does not fall into either of these categories, the data is treated as UTF-16 and used as is.
kseg_desc
Contains the ICU locale formed as an ordinary, null-terminated ASCII string. The format specified by ICU is "xx", "xx_YY", or "xx_YY_Variant" where "xx" is the language as specified by ISO-639 (e.g., "fr" for French); "YY" is a country as specified by ISO-3166 (e.g., "fr_CA" for French language in Canada); and the "Variant" portion represents system-dependent options. Note: When ICU uses a locale to access collation rules, it attempts to get rules for the closest match to the locale specified in kseg_desc. By default, there is no restriction on how close the match of locales must be to be acceptable. You can restrict the use of alternative locales by including either ctKSEG_COMPU_FALLBACK_NOTOK or ctKSEG_COMPU_SYSDEFAULT_NOTOK as part of the bit map comprising kseg_comp discussed below. After a successful call to PutXtdKeySegmentDef(), the GetXtdKeySegmentDef() function can be used to determine the actual ICU locale used during collation.
kseg_comp
This member of the structure permits the full range of ICU collation options to be specified through a bit map.
Example of extended key segment structure:
ctKSEGDEF ksgdef;
ksgdef.kseg_ssiz = 12; /* 12 bytes for the source */
ksgdef.kseg_type = ctKSEG_TYPE_UNICODE; /* ICU Unicode */
ksgdef.kseg_styp = ctKSEG_STYP_UTF16; /* UTF16 source data */
ksgdef.kseg_comp = ctKSEG_COMPU_A_LOWER; /* lower case sorts first */
strcpy(ksgdef.kseg_desc,"fr_CA"); /* French in Canada */