Unicode

FairCom DB Standalone provides support for Unicode. This support includes:

  • Storing Unicode data
  • Indexing Unicode data
  • Recognizing Unicode file names

This support is in addition to our ordinary string support, not an either/or approach.

Additionally, FairCom DB Standalone provides extended key segment support, which will allow support for other types of complex key segments in the future.

When building your FairCom DB Standalone library, you are prompted whether to include Unicode support, similar to the following:

       ---- FairCom c-tree Plus UniCode Support ----

This version of c-tree Plus provides support for UNICODE field types.

Do you want to support UniCode field types? (Y)es (N)o (D)efaults- [N]:

Simply choose “Y”es to include this support.

Learn More

FairCom DB Standalone handles Unicode exactly the same as it is handled by FairCom DB. You can learn more in the Unicode chapter of the FairCom ISAM for C Developer Guide.

  • Note: There are a number of locations within the FairCom documentation where additional information may be included for use with the FairCom DB Server. This information can be ignored if you are using the FairCom DB Standalone models.

References to "Mirrored File Names" in the Unicode chapters are not available in the FairCom DB Standalone models.

Also note references to SQL support in this book require either the use of the optional FairCom DB Standalone SQL Service package, or use of the FairCom DB Server engine.

 

Unicode Concepts

Unicode is an effort to standardize the representation of all languages in computer format. Early standards, like ASCII, encoded only the letters used in English. Efforts to internationalize started by extending ASCII to include characters used in other Western languages, such as umlauts and accents, but were limited to the 256 values that fit in 1 byte. Unicode incorporates the characters of all the major government standards for ideographic characters from Japan, Korea, China, and Taiwan, and more.

Though Unicode is thought of as a wide-character encoding with 16 bits per character, Unicode standards include 8-bit multi-byte encoding (UTF‑8), 16-bit wide character encoding (UTF‑16), and 32-bit wide character encoding (UTF‑32). FairCom DB supports both UTF‑8 and UTF‑16.

Well-defined conversion routines permit unambiguous translation among UTF‑8, UTF‑16, and UTF‑32.

A Unicode string is terminated by a null character: a single zero byte for UTF‑8, and 2 and 4 zero bytes for UTF‑16 and UTF‑32, respectively.

Note: UTF‑16 does not encode all characters with a single 16-bit code unit. Characters outside the Basic Multilingual Plane (code points above U+FFFF) are encoded as a sequence of two 16-bit code units, known as a surrogate pair.
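The note above can be made concrete with a small sketch of UTF-16 encoding. The function name utf16_encode is an illustrative assumption and is not part of the FairCom DB API; it only shows why some characters need two 16-bit code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (not a FairCom routine): encode one Unicode code
 * point as UTF-16. Returns the number of 16-bit code units written
 * (1 or 2), or 0 if cp is not a valid Unicode scalar value. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                               /* out of range or lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* one code unit is enough */
        return 1;
    }
    cp -= 0x10000;                              /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate: low 10 bits */
    return 2;
}
```

A consequence for storage is that the number of code units in a UTF-16 string can exceed the number of characters it represents.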

 

Unicode default charset for SQL CHAR and VARCHAR changed from US-ASCII to ISO-8859-1

Prior to V11, the FairCom DB Unicode implementation mandated US-ASCII (7-bit) chars in [VAR]CHAR fields, which may have been too strict for customers who use an 8-bit charset. For this reason, the default charset has been changed to ISO-8859-1 (aka Latin1). According to ICU:

ISO-8859-1 is relatively unproblematic — if its limited character repertoire is sufficient — because it is converted trivially (1:1) to Unicode, avoiding conversion table problems for its small set of characters. (By contrast, proper conversion from US-ASCII requires a check for illegal byte values 0x80..0xff, which is an unnecessary complication for modern systems with 8-bit bytes. ISO-8859-1 is nearly as ubiquitous for modern systems as US-ASCII was for 7-bit systems.)

The modification should introduce no backward compatibility problems as US-ASCII is a subset of ISO-8859-1.
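The difference between the two charsets can be sketched in a few lines. The helper names below are assumptions made for illustration and do not come from FairCom DB or ICU; they only show that ISO-8859-1 converts byte for byte while US-ASCII needs a validity check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* US-ASCII input is valid only if every byte is below 0x80. */
static int is_us_ascii(const unsigned char *s, size_t n)
{
    size_t i;
    assert(s != NULL);
    for (i = 0; i < n; i++)
        if (s[i] >= 0x80)
            return 0;
    return 1;
}

/* ISO-8859-1 maps 1:1 onto U+0000..U+00FF, so each byte value is
 * already its Unicode code point: no table and no validity check. */
static void latin1_to_utf16(const unsigned char *src, size_t n, uint16_t *dst)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Because every US-ASCII byte is below 0x80, the same 1:1 conversion gives identical results for data that was valid under the old default.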

 

Preparation

Unicode support is currently available for any client when connecting to FairCom DB for Windows or Mac OS X and for FairCom DB Standalone libraries under Windows or Mac OS X. For client operation, ensure you install FairCom DB for Windows or Mac OS X with Unicode support.

To include Unicode support when building the FairCom DB libraries, execute mtmake with the ‘u’ flag:

>mtmake u

This option prepares the library for Unicode support. To enable Unicode support, all builds of FairCom DB require the ICU libraries from the ICU web site, as described in a later section, Unicode Libraries Required for FairCom DB.

 

Storing Unicode Data

Simply storing Unicode data has always been possible with FairCom DB, provided the application treated the data as binary and performed any necessary translations. Index support is described in Unicode Key Segments.

 

Storing UTF-16 Data

Storing Unicode data requires DODA entries for each field. The individual wide-characters used in UTF-16 are not platform independent with respect to byte ordering. They are treated the same as short integers: on LOW_HIGH platforms, the lower order byte comes before the higher order byte. Recall in a client/server environment with the DODA entries in place, the Server and clients manage byte-order translation.
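As a rough illustration of the byte-ordering point, the following sketch writes a single UTF-16 code unit in both byte orders. The function names are hypothetical and are not FairCom DB routines; in a client/server environment with a DODA in place, this translation is handled for you.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when the host stores the low-order byte first (LOW_HIGH). */
static int host_is_low_high(void)
{
    uint16_t probe = 1;
    unsigned char bytes[2];
    memcpy(bytes, &probe, sizeof(bytes));
    return bytes[0] == 1;
}

/* Write a code unit with the lower-order byte first. */
static void put_u16_low_high(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}

/* Write a code unit with the higher-order byte first. */
static void put_u16_high_low(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}
```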

FairCom DB has four Unicode (UTF-16) field types:

  • CT_FUNICODE - A fixed length field containing a UTF-16 encoded, null terminated string.
  • CT_UNICODE - A variable-length field containing a UTF-16 encoded, null terminated string.
  • CT_F2UNICODE - A fixed length field that begins with a 2-byte integer specifying the number of bytes in the following UTF-16 encoded string.
  • CT_2UNICODE - A variable-length field that begins with a 2-byte integer specifying the number of bytes in the following UTF-16 encoded string.

Note: The length fields at the beginning of CT_F2UNICODE and CT_2UNICODE are specified in bytes. Specifying a field length in bytes is consistent with all other FairCom DB field types, but it is inconsistent with system level routines that ordinarily use number of characters, not number of bytes, to describe the length of wide-character strings.
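The distinction between bytes and characters can be sketched as follows. The helpers are illustrative assumptions, not FairCom DB functions, and they exclude the terminator from the count; whether the stored length prefix includes the terminator is not stated here, so treat that detail as unspecified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit code units before the terminating zero code unit.
 * This is the figure wide-character system routines typically report. */
static size_t utf16_units(const uint16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* The same string measured in bytes, the unit used by the length
 * prefix of CT_F2UNICODE and CT_2UNICODE fields. */
static size_t utf16_bytes(const uint16_t *s)
{
    return utf16_units(s) * sizeof(uint16_t);
}
```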

Storing a UTF-16 string longer than 64KB requires a CT_UNICODE field. To store a string greater than 64KB with a length prefix, convert the string to UTF-8 and store it in a CT_4STRING field, as discussed below.

 

Storing UTF-8 Data

Since a UTF-8 encoded string consists of ordinary ASCII characters (with code values between 0 and 127) and multi-byte characters (which have the highest-order bit set in each byte), it can be stored normally in a FairCom DB record when a DODA is not present. It is simply up to the application to decipher the record, as with any other data type.
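The high-bit rule described above is enough to tell ASCII bytes from multi-byte sequences. The following is a minimal sketch with hypothetical helper names (not FairCom DB routines), and it assumes the input is already well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if no byte has its highest-order bit set (pure ASCII). */
static int utf8_is_pure_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if ((unsigned char)*s & 0x80)
            return 0;
    return 1;
}

/* Counts code points by skipping continuation bytes (bit pattern
 * 10xxxxxx). Assumes well-formed UTF-8; performs no validation. */
static size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```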

With a DODA present, store UTF-8 encoded strings in any FairCom DB standard string type, such as CT_STRING, CT_4STRING, etc. Since FairCom DB only interprets the contents of a field when the field is part of a key value, storing a UTF-8 string in an “ordinary” FairCom DB string-type field works, provided one of the following is true:

  • Indexing is not required.
  • There is a mechanism to permit FairCom DB’s key assembly routine to properly interpret the string field. FairCom DB support for extended key segment capability deals with this situation.

FairCom DB provides conversion routines between UTF-8 and UTF-16. The input strings are assumed to be terminated by a NULL character. All output buffer sizes are specified in bytes. The conversion routines return NO_ERROR (0) on success, VBSZ_ERR (153) if the output buffer is too small, or BMOD_ERR (446) if there is a problem with the input string.

  • ctu8TOu16() converts an ASCII or UTF-8 encoded string to a UTF-16 Unicode string.

   NINT ctu8TOu16(pTEXT u8str,pWCHAR u16str,VRLEN u16byt)

  • ctu16TOu8() converts in the other direction.

   NINT ctu16TOu8(pWCHAR u16str,pTEXT u8str,VRLEN u8byt)
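Because the output buffer size is given in bytes, it helps to size the UTF-16 buffer before converting. The sketch below is an assumption-laden illustration: utf16_bytes_needed is a hypothetical helper, not a FairCom DB function. It relies on the fact that a UTF-8 sequence of n bytes never produces more than n UTF-16 code units.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Worst-case size, in bytes, of a UTF-16 buffer able to hold the
 * conversion of a null-terminated UTF-8 string: at most one 16-bit
 * code unit per input byte, plus one code unit for the terminator. */
static size_t utf16_bytes_needed(const char *u8str)
{
    assert(u8str != NULL);
    return (strlen(u8str) + 1) * 2;
}

/* Hypothetical usage with the routine documented above (not compiled
 * here because it needs the FairCom headers):
 *
 *     size_t size = utf16_bytes_needed(u8str);
 *     pWCHAR buf  = malloc(size);
 *     NINT   rc   = ctu8TOu16(u8str, buf, (VRLEN)size);
 *     -- rc is NO_ERROR, VBSZ_ERR, or BMOD_ERR as described above
 */
```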

Contact FairCom if you require routines to handle UTF-32 conversion.

 

Unicode Key Segments

Unicode key segments provide a challenge for two reasons:

  1. Unlike all other key segments previously implemented, the number of bytes stored in the key and the number of bytes of source data used to construct the key are not the same.
  2. The derivation of the binary sort key (segment) stored in the index from the source data is not a simple transformation.

To accommodate both of these challenges, FairCom DB incorporated an “extended key segments” feature. The concept of an extended key segment can be applied to virtually any non-standard key segment. Unicode is one such implementation of extended key segments.

 

Unicode Libraries Required for FairCom DB

Because of the complexity of the Unicode collation algorithm, and because of the incredible breadth of language and country support envisaged by Unicode, FairCom has chosen to implement Unicode key segments using the International Components for Unicode (ICU) open-source development project. The ICU implementation of Unicode support is available on a wide variety of platforms, but not every platform.

As this library is maintained by a third-party, FairCom does not include it with a FairCom DB distribution. A FairCom DB application developer who wishes to include Unicode support will need to download and install the libraries in accordance with the ICU license stipulations.

ICU Libraries

Unicode is the single, universal character set for text which enables the interchange, processing, storage and display of text in many languages.

The International Components for Unicode (ICU) are a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

Complete details on the ICU can be found on the ICU Unicode website.

ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software. To the extent required by the licenses accompanying the ICU libraries, the terms of such license will apply in lieu of the terms of any agreement with FairCom with respect to the open source software including, without limitation, any provisions governing access to source code, modification or reverse engineering. FairCom makes no representation, warranty or other commitment of any kind regarding such open source software, offers no technical support for such open source software and shall, to the maximum extent permitted by law, have no liability associated with its use.

 

How to Specify an ICU Unicode Key Segment

An ordinary FairCom DB key segment is defined by a triplet: offset, length, and mode. In IFIL parlance, this is an ISEG structure. An extended key segment also uses this “standard” specification, but with two adjustments:

  • The length specified in the triplet is the number of bytes that this segment will occupy in the key value stored in the index rather than the number of bytes of source data that will be used to generate the key segment. The source length will be part of the extended key segment definition discussed below.
  • The key segment mode will contain a modifier that indicates the particular type of extended key segment. The only extended key segment at this time is UNCSEG, an ICU Unicode segment. An example of an ICU Unicode ISEG is:

   ISEG isegunc = {8,24,REGSEG | UNCSEG};

This ISEG specifies:

  • The source data begins at an offset of 8 bytes from the start of the record.
  • The key segment will be 24 bytes in length.
  • The segment type will be an ICU Unicode segment.

However, this ISEG definition does not specify the underlying data type (UTF-8 or UTF-16 for a Unicode segment), nor does it specify how many bytes of source data to use to construct the segment. The extended key segment definition specifies this additional information. Note that REGSEG implies no standard transformation, but the UNCSEG modifier specifies the particular type of extended segment.

In addition to an explicit definition of segment types such as REGSEG or INTSEG, FairCom DB supports VARSEG and SCHSEG. VARSEG and SCHSEG use the same triplet, but the contents are interpreted somewhat differently.

  • A VARSEG is a segment based on a field that falls in the variable-length region of the file and therefore cannot be located by a simple offset value. The offset is interpreted as the number of fields in the variable-length region to skip over. A zero implies the first variable-length field is used.
    To use an extended segment definition with VARSEG, simply modify the key segment mode as before:

   VARSEG | UNCSEG

  • A SCHSEG is a segment whose type is based on the data record field definitions stored in the DODA. When the mode is SCHSEG, the offset value is interpreted as a zero based index into the DODA. A value of zero implies using the first field definition to determine the type of key segment. You must use SCHSEG | UNCSEG segment mode if the offset value maps to an underlying data field that is one of the UTF-16 Unicode types (CT_UNICODE, CT_2UNICODE, CT_FUNICODE, CT_F2UNICODE).

If the underlying data is stored in a regular string field (e.g., CT_STRING), and the data is UTF-8 encoded, UNCSEG must also be combined with SCHSEG as FairCom DB is unable to automatically identify the UTF-8 data encoding:

   SCHSEG | UNCSEG

 

Extended Key Segment Definition

This section describes how to define an extended key segment.

The FairCom DB implementation of extended key segments allows a single extended key segment definition to be used by more than one actual key segment. For example, an application may make one call to PutXtdKeySegmentDef() (discussed below) that applies to all of the extended segments used in the application. Therefore, some of the parameters specified in the definition optionally permit their particular values to be determined at run-time for each key segment.

Specify an extended key segment definition using the ctKSEGDEF structure presented in ctport.h:

#define ctKSEGDLEN      32          /* length of desc string                */

typedef struct keysegdef {
     LONG   kseg_stat;              /* status (internal use)                */
     LONG   kseg_vrsn;              /* version info                         */
     LONG   kseg_ssiz;              /* source size                          */
     LONG   kseg_type;              /* segment type                         */
     LONG   kseg_styp;              /* source type                          */
     LONG   kseg_comp;              /* comparison options                   */
     LONG   kseg_rsv1;              /* future use                           */
     LONG   kseg_rsv2;              /* future use                           */
     TEXT   kseg_desc[ctKSEGDLEN];  /* text specification eg, locale string */
     } ctKSEGDEF;

The FairCom DB module ctport.h contains defines for all of the constants, beginning with ctKSEG, used to create an extended key segment definition. As extended key segments are currently implemented, the kseg_stat and the kseg_vrsn members are filled in as needed by the extended key segment implementation itself. The kseg_ssiz member specifies the number of bytes of source data to use to derive the actual key segment. In addition to using a specific numeric value for the source size, kseg_ssiz may also be assigned either of these two “values”:

  • ctKSEG_SSIZ_COMPUTED

The information about the underlying data field will be used to compute how much source data is available.

For fields without length specifiers (such as CT_STRING or CT_UNICODE) an appropriate version of strlen() will be used to determine data availability. However, this could be very inefficient if the field may hold very long strings since it is likely that only a small portion of the variable-length field will actually contribute to the key segment. An alternative is to specify a fixed source size. If the variable data has less than this size, it will still be handled correctly.

  • ctKSEG_SSIZ_PROVIDED

The call to create the key segment will provide the particular length of source data available. This option is typically used when an explicit call is made to TransformXtdSegment() (discussed below).
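The trade-off between a computed and a fixed source size can be illustrated with a bounded scan. This is only a sketch of the idea: source_bytes is a hypothetical helper and does not describe how FairCom DB measures fields internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of source bytes that would feed a key segment when a fixed
 * source size (ssiz) is used. The scan stops at the terminator, at
 * ssiz, or at the field capacity, whichever comes first, so a very
 * long string is never walked to its end. */
static size_t source_bytes(const char *field, size_t field_cap, size_t ssiz)
{
    size_t limit;
    const char *nul;

    assert(field != NULL);
    limit = ssiz < field_cap ? ssiz : field_cap;
    nul = memchr(field, '\0', limit);
    return nul != NULL ? (size_t)(nul - field) : limit;
}
```

A short value yields fewer bytes than the fixed size, which matches the statement above that data shorter than the source size is still handled correctly.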

For an ICU Unicode definition, the remaining structure members are specified as follows:

  • kseg_type

Set to ctKSEG_TYPE_UNICODE.

  • kseg_styp

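Specify the type of source data as one of the following:

  • ctKSEG_STYP_UTF8
  • ctKSEG_STYP_UTF16
  • ctKSEG_STYP_PROVIDED

With ctKSEG_STYP_PROVIDED, the source type is determined at run time rather than fixed in the definition. For the type to be taken from the underlying field:

  • A schema (DODA) exists for the associated data file
  • The key segment mode is either VARSEG or SCHSEG.

  • kseg_desc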

Contains the ICU locale formed as an ordinary, null terminated ASCII string. The format specified by ICU is “xx”, “xx_YY”, or “xx_YY_Variant” where “xx” is the language as specified by ISO-639 (for example, “fr” for French); “YY” is a country as specified by ISO-3166 (e.g., “fr_CA” for French language in Canada); and the “Variant” portion represents system dependent options.

Note: When ICU uses a locale to access collation rules, it attempts to get rules for the closest match to the locale specified in kseg_desc. By default, there is no restriction on how close the match of locales must be to be acceptable. You can restrict the use of alternative locales by including either ctKSEG_COMPU_FALLBACK_NOTOK or ctKSEG_COMPU_SYSDEFAULT_NOTOK as part of the bitmap comprising kseg_comp discussed below. After a successful call to PutXtdKeySegmentDef(), the GetXtdKeySegmentDef() function can be used to determine the actual ICU locale used during collation.

  • kseg_comp

This member of the structure permits the full range of ICU collation options to be specified through a bitmap. The details of these options are beyond the scope of this documentation. However the symbolic constants used to form the bitmap are presented in “ICU Collation Option Overview”.
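To make the kseg_desc locale format described above more concrete, here is a minimal sketch that splits a locale string into its parts. LocaleParts and split_locale are hypothetical names used for illustration; ICU performs its own parsing, and this code does not validate the codes against ISO-639 or ISO-3166.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char language[9];   /* "xx", for example "fr"       */
    char country[9];    /* "YY", for example "CA"       */
    char variant[33];   /* system dependent, may be ""  */
} LocaleParts;

/* Splits "xx", "xx_YY", or "xx_YY_Variant". Returns the number of
 * parts found (1 to 3), or 0 if the language is missing or a part is
 * too long for its buffer. */
static int split_locale(const char *locale, LocaleParts *out)
{
    const char *first, *second;
    size_t len;

    assert(locale != NULL && out != NULL);
    memset(out, 0, sizeof(*out));           /* guarantees termination */

    first = strchr(locale, '_');
    len = first ? (size_t)(first - locale) : strlen(locale);
    if (len == 0 || len >= sizeof(out->language))
        return 0;
    memcpy(out->language, locale, len);
    if (first == NULL)
        return 1;

    second = strchr(first + 1, '_');
    len = second ? (size_t)(second - first - 1) : strlen(first + 1);
    if (len >= sizeof(out->country))
        return 0;
    memcpy(out->country, first + 1, len);
    if (second == NULL)
        return 2;

    len = strlen(second + 1);
    if (len >= sizeof(out->variant))
        return 0;
    memcpy(out->variant, second + 1, len);
    return 3;
}
```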

 

ICU Collation Option Overview

The collation options can be grouped as follows: locale default control, collation strength, normalization, and special attributes. Locale default control affects the degree to which a default locale must be related to the requested locale. Collation strength determines how case, accents and other character modifiers affect the ordering of sort keys. Normalization affects how alternative variations of the “same” character (including its accents and other modifiers) are compared. The special attributes affect particular properties of the collation, which further modify the strength and normalization options. For example, a special attribute can be used to force lower case characters first or last in the collation.

If no locale default control option is made part of kseg_comp, there is no restriction on how close to the requested locale the effective locale must be. For example, if you request collation for the German language (“de”), you are likely to get a locale based on the system default (e.g., “en_US” in the United States). This is not a problem since it has been determined that the default rules work for the German language.

If ctKSEG_COMPU_SYSDEFAULT_NOTOK is used, then a request to use locale “xx_YY_Variant” will succeed as long as collation rules for “xx” are available. If ctKSEG_COMPU_FALLBACK_NOTOK is used, then rules for the particular locale with its optional country and variant modifiers must be available. Falling back from “xx_YY” to “xx” is not satisfactory. In the case of the “de” locale noted above, the segment definition would cause an error in the call to PutXtdKeySegmentDef() if either of the “NOTOK” default restrictions is part of the definition.
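The three levels of locale default control can be modeled as a small acceptance test. This is a simplified sketch of the rules as described above, not ICU's or FairCom DB's actual logic; the FallbackPolicy enumerators are hypothetical stand-ins for the absence or presence of the two "NOTOK" options.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    ANY_LOCALE_OK,      /* no locale default control option given        */
    SYSDEFAULT_NOTOK,   /* models ctKSEG_COMPU_SYSDEFAULT_NOTOK           */
    FALLBACK_NOTOK      /* models ctKSEG_COMPU_FALLBACK_NOTOK             */
} FallbackPolicy;

/* Length of the language part ("xx") of a locale string. */
static size_t language_len(const char *locale)
{
    const char *u = strchr(locale, '_');
    return u ? (size_t)(u - locale) : strlen(locale);
}

/* Returns 1 if the effective locale is acceptable for the request. */
static int locale_acceptable(const char *requested, const char *effective,
                             FallbackPolicy policy)
{
    size_t n;

    assert(requested != NULL && effective != NULL);
    switch (policy) {
    case FALLBACK_NOTOK:                 /* exact locale required */
        return strcmp(requested, effective) == 0;
    case SYSDEFAULT_NOTOK:               /* same language required */
        n = language_len(requested);
        return n == language_len(effective) &&
               strncmp(requested, effective, n) == 0;
    default:                             /* anything goes */
        return 1;
    }
}
```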

At most one of the following collation strength options can be included in kseg_comp:

ctKSEG_COMPU_S_PRIMARY

ctKSEG_COMPU_S_SECONDARY

ctKSEG_COMPU_S_TERTIARY

ctKSEG_COMPU_S_QUATERNARY

ctKSEG_COMPU_S_IDENTICAL

ctKSEG_COMPU_S_DEFAULT

At most one of the following normalization options can be included in kseg_comp:

ctKSEG_COMPU_N_NONE

ctKSEG_COMPU_N_CAN_DECMP

ctKSEG_COMPU_N_CMP_DECMP

ctKSEG_COMPU_N_CAN_DECMP_CMP

ctKSEG_COMPU_N_CMP_DECMP_CAN

ctKSEG_COMPU_N_DEFAULT

One or more of the following special attributes can be included in kseg_comp. After each one of our symbolic constants is the equivalent ICU-attribute, attribute-value pair.

ctKSEG_COMPU_A_FRENCH_ON (UCOL_FRENCH_COLLATION,UCOL_ON)
ctKSEG_COMPU_A_FRENCH_OFF (UCOL_FRENCH_COLLATION,UCOL_OFF)
ctKSEG_COMPU_A_CASE_ON (UCOL_CASE_LEVEL,UCOL_ON)
ctKSEG_COMPU_A_CASE_OFF (UCOL_CASE_LEVEL,UCOL_OFF)
ctKSEG_COMPU_A_DECOMP_ON (UCOL_DECOMPOSITION_MODE, UCOL_ON)
ctKSEG_COMPU_A_DECOMP_OFF (UCOL_DECOMPOSITION_MODE, UCOL_OFF)
ctKSEG_COMPU_A_SHIFTED (UCOL_ALTERNATE_HANDLING, UCOL_SHIFTED)
ctKSEG_COMPU_A_NONIGNR (UCOL_ALTERNATE_HANDLING, UCOL_NON_IGNORABLE)
ctKSEG_COMPU_A_LOWER (UCOL_CASE_FIRST, UCOL_LOWER_FIRST)
ctKSEG_COMPU_A_UPPER (UCOL_CASE_FIRST, UCOL_UPPER_FIRST)
ctKSEG_COMPU_A_HANGUL (UCOL_NORMALIZATION_MODE, UCOL_ON_WITHOUT_HANGUL)

It is permissible to set kseg_comp to zero. A zero kseg_comp implies no restrictions on locale defaults, default collation strength, default normalization, and no special attributes.
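The "at most one" rule for strength and normalization options can be checked with simple bit arithmetic. In the sketch below every bit value and mask is invented for illustration; the real ctKSEG_COMPU_* values are defined in ctport.h and may be laid out differently. The error value 701 mirrors CSEG_ERR as listed under Error Codes.

```c
#include <assert.h>

/* Hypothetical bit layout, NOT the values from ctport.h. */
enum {
    OPT_S_PRIMARY   = 0x0001,
    OPT_S_SECONDARY = 0x0002,
    OPT_S_TERTIARY  = 0x0004,
    OPT_N_NONE      = 0x0010,
    OPT_N_CAN_DECMP = 0x0020,
    OPT_A_LOWER     = 0x0100,
    OPT_A_CASE_ON   = 0x0200
};

#define STRENGTH_MASK       0x000FUL
#define NORMALIZATION_MASK  0x00F0UL
#define SKETCH_CSEG_ERR     701

/* True when more than one bit is set in v. */
static int more_than_one_bit(unsigned long v)
{
    return (v & (v - 1)) != 0;
}

/* Returns 0 if the bitmap is consistent, or the negated error code if
 * mutually exclusive options are combined. Special attributes are not
 * restricted here, since more than one may be combined. */
static int check_comp_options(unsigned long comp)
{
    if (more_than_one_bit(comp & STRENGTH_MASK) ||
        more_than_one_bit(comp & NORMALIZATION_MASK))
        return -SKETCH_CSEG_ERR;
    return 0;
}
```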

For a complete treatment of all of these options, please refer to the ICU web site and the Unicode Consortium’s web site and publications.

 

Extended Key Segment Definition Example

This example demonstrates a complete and simple ctKSEGDEF structure. The actual use of this structure is demonstrated in the API example below.

ctKSEGDEF ksgdef;

ksgdef.kseg_ssiz = 12;                     /* 12 bytes for the source  */
ksgdef.kseg_type = ctKSEG_TYPE_UNICODE;    /* ICU Unicode              */
ksgdef.kseg_styp = ctKSEG_STYP_UTF16;      /* UTF16 source data        */
ksgdef.kseg_comp = ctKSEG_COMPU_A_LOWER;   /* lower case sorts first   */
strcpy(ksgdef.kseg_desc,"fr_CA");          /* French in Canada         */

 

Extended Key Segment Default Hierarchy

If a key segment mode (the last member of the ISEG structure) includes a modifier for an extended key segment definition (e.g., REGSEG | UNCSEG), then the particular extended key segment definition to use for this segment is determined according to the following hierarchy. Use the definition specified for:

  1. The segment
  2. The index associated with the segment
  3. The host index file containing the associated index
  4. The data file associated with the index
  5. The application default
  6. If applicable, the server default

The PutXtdKeySegmentDef() routine can specify extended key segment definitions for each of these six levels. To make this hierarchy more concrete, consider this example. Data file customer.dat has an index file named customer.idx. The index file customer.idx contains 3 indexes: a customer number index, a customer name index, and a customer status index. The customer name index consists of one Unicode segment. The definition used for that segment is the first one found in the following order:

  1. The definition specified for this particular segment.
  2. The definition specified for the customer name index.
  3. The definition specified for the host index file (customer.idx).
  4. The definition specified for the associated data file (customer.dat).
  5. The definition specified for the entire application.
  6. If the application is on a server, the definition specified for the server.

If none of these has been specified, a USEG_ERR (707) occurs: there is no extended key segment definition to use.
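The lookup order described above amounts to a first-match search across the six levels. The following sketch models it with an array of handles; the level names, NO_DEFINITION, and resolve_definition are hypothetical and do not reflect FairCom DB's internal data structures. Only the error value 707 (USEG_ERR) comes from the text.

```c
#include <assert.h>
#include <stddef.h>

enum {
    LVL_SEGMENT,          /* 1. the segment                     */
    LVL_INDEX,            /* 2. the index                       */
    LVL_HOST_INDEX_FILE,  /* 3. the host index file             */
    LVL_DATA_FILE,        /* 4. the data file                   */
    LVL_APPLICATION,      /* 5. the application default         */
    LVL_SERVER,           /* 6. the server default              */
    LVL_COUNT
};

#define NO_DEFINITION   (-1)
#define SKETCH_USEG_ERR 707

/* Returns the handle of the first level that has a definition and
 * reports that level through level_used (if not NULL). Returns the
 * negated error code when no level supplies a definition. */
static int resolve_definition(const int handles[LVL_COUNT], int *level_used)
{
    int lvl;

    assert(handles != NULL);
    for (lvl = 0; lvl < LVL_COUNT; lvl++) {
        if (handles[lvl] != NO_DEFINITION) {
            if (level_used != NULL)
                *level_used = lvl;
            return handles[lvl];
        }
    }
    return -SKETCH_USEG_ERR;
}
```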

Except when an extended key segment definition has been specified for a particular key segment, the choice of which extended definition to use (as specified in the above hierarchy) is not made until the first use of the key segment. By “first use” we mean either a reference to a key segment, say in a call to AddRecord() or AddVRecord(), after a file has just been created; or upon opening a file that contains extended key segment references that were not used after file creation and subsequent file closure. Upon this first use, if a default from one of levels 2 through 5 is used, then the particular definition is stored in the host index file so that the definition can travel with the physical file. (This automatic storage will not occur if the host index file is opened read only or has DISABLERES in its file mode.)

This hierarchy has been implemented to simplify the use of extended key segment definitions. One can easily imagine a Unicode dependent application that will only process Unicode key segments for one language, although the language may change from one site to another. By using a single call to PutXtdKeySegmentDef() at the application level, the details of the Unicode segment including the locale can be specified at program startup. All extended key segments can then default to this application definition. And since the definition will be added to the host index files, the index files created this way become self-sufficient.

It is important to note that except for a segment specific extended definition, there can be more than one extended key segment definition for each of the remaining five levels (2 through 6). But there can only be one extended key segment definition at each level for a particular type of segment. For example, there can be at most one ICU Unicode extended key segment definition at each level. (At this time there is no other type of extended key segment definition, but this is likely to change in time.)

Once an extended key segment definition has been specified at a particular level (for a particular type of segment), an attempt to specify another definition at the same level results in an error. This is in part because of the “first use” strategy noted above, and because one should not change a definition if key values already exist.

 

Extended Key Segment API

Three routines operate directly on extended key segments:

  • GetXtdKeySegmentDef()
  • PutXtdKeySegmentDef()
  • TransformXtdSegment()

All three functions return a negative value upon error, where the absolute value of the return value is the error code.
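The return convention can be unpacked with a few lines of code. XtdResult and decode_xtd_return are hypothetical names used only for this sketch; the only behavior taken from the text is that a negative return is the negated error code.

```c
#include <assert.h>

typedef struct {
    int ok;      /* 1 on success, 0 on error                         */
    int value;   /* handle or byte count on success, otherwise 0     */
    int error;   /* positive error code on failure, otherwise 0      */
} XtdResult;

/* Splits a raw return value into success and error parts. */
static XtdResult decode_xtd_return(int rc)
{
    XtdResult r;

    if (rc < 0) {
        r.ok = 0;
        r.value = 0;
        r.error = -rc;   /* absolute value is the error code */
    } else {
        r.ok = 1;
        r.value = rc;
        r.error = 0;
    }
    return r;
}
```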

The pkdef parameter points to a definition to be created in a call to PutXtdKeySegmentDef(). In a call to GetXtdKeySegmentDef(), the structure pointed to by pkdef gets filled-in except that the kseg_type member of the structure should be set on input to the type of segment to be retrieved. For example, to retrieve an ICU Unicode definition, set the kseg_type member to ctKSEG_TYPE_UNICODE.

Notes:

If GetXtdKeySegmentDef() is called with ctKSEGhandle for the filno parameter and the handle value is passed in via the segno parameter (the ctKSEGhandle row of the table under GetXtdKeySegmentDef below), then the kseg_type member of the structure is ignored on input since the handle uniquely identifies the particular definition. On output, the kseg_type member will be set to the type of segment.

One of the important reasons to call GetXtdKeySegmentDef() is to examine the actual locale used for the ICU collation routines. Most applications will not have a reason to call TransformXtdSegment() unless the application needs to create a Unicode binary sort key outside of the normal ISAM processing.

 

GetXtdKeySegmentDef

GetXtdKeySegmentDef() retrieves (that is, fills in the elements of a ctKSEGDEF structure for) the requested extended key segment definition. If successful, the return value is the handle associated with the definition.

Declaration

NINT GetXtdKeySegmentDef(NINT filno, NINT segno, pctKSEGDEF pkdef);

Description

filno          segno          Interpretation
ctKSEGserver   ignored        Retrieve server default definition
ctKSEGapplic   ignored        Retrieve application default definition
ctKSEGhandle   handle         Retrieve definition associated with handle
datno          ignored        Retrieve data file level definition
keyno          ctKSEGindex    Retrieve index file level definition
keyno          0, 1, 2, ...   Retrieve particular segment definition

 

PutXtdKeySegmentDef

PutXtdKeySegmentDef() defines an extended key segment for a Server, an application, a data file, an index file, or a particular index segment. Returns a handle for the definition if successful.

Declaration

NINT PutXtdKeySegmentDef(NINT filno, NINT segno, pctKSEGDEF pkdef);

Description

filno          segno          Interpretation
ctKSEGserver   ignored        Create server default definition
ctKSEGapplic   ignored        Create application default definition
datno          ignored        Create data file level definition
keyno          ctKSEGindex    Create index file level definition
keyno          0, 1, 2, ...   Create specific segment definition

 

TransformXtdSegment

TransformXtdSegment() creates a binary sort key (segment) using an extended key segment definition. If successful, it returns the number of bytes used for the binary sort key.

Declaration

NINT TransformXtdSegment(NINT seghnd, pVOID src, NINT srclen,
NINT srctyp, pVOID dest, NINT destlen);

Description

Parameter   Description of use
seghnd      Handle returned by PutXtdKeySegmentDef() or GetXtdKeySegmentDef().
src         Pointer to data used to construct segment.
srclen      Size in bytes of the region pointed to by src. However, srclen is ignored unless kseg_ssiz was set to ctKSEG_SSIZ_PROVIDED.
srctyp      Should be set to one of the c-tree Plus field types (e.g., CT_STRING or CT_UNICODE). However, srctyp is ignored unless kseg_styp was set to ctKSEG_STYP_PROVIDED.
dest        Pointer to region in which binary sort key is constructed.
destlen     Size in bytes of the region pointed to by dest.

 

API Example

FairCom DB Unicode API Example

#include "ctreep.h"


typedef struct datrec {    /* offset: description   */

    LONG      delflg;      /*  0:     delete flag   */

    LONG      sernum;      /*  4:     serial number */

    TEXT      idnum[8];    /*  8:     ID number     */

    TEXT      utf8str[64]; /* 16:     UTF8 string   */

    TEXT      zipcode[10]; /* 80:     zipcode       */

    TEXT      code[6];     /* 90:     codes         */

} DATREC;

ISEG iseg[5] =

{    /* offset, length, mode      */

    {80,10,REGSEG},          /* 0 */

    {16, 8,REGSEG | UNCSEG}, /* 1 */

    {16,24,REGSEG | UNCSEG}, /* 2 */

    {80, 5,REGSEG},          /* 3 */

    { 4, 4,SRLSEG}           /* 4 */

};


IIDX iidx[3] =

{

    {22,0,1,0,0,2,iseg + 0},

    {33,0,1,0,0,2,iseg + 2},

    { 4,0,0,0,0,1,iseg + 4}

};


IFIL ifil = {"datrec",-1,96,0,SHARED | TRNLOG,3,0, SHARED | TRNLOG,iidx};


main(int argc,char **argv)

{

  ctKSEGDEF sd;

  DATREC    dr;

  NINT      rc = 0, hnd;

  FILNO filno;


    if ((rc = INTISAM(100,16,16)))

        exit(rc);


    sd.kseg_ssiz = 12;                  /* use up to 12 byte */

    sd.kseg_type = ctKSEG_TYPE_UNICODE; /* ICU Unicode       */

    sd.kseg_styp = ctKSEG_STYP_UTF8;    /* UTF8 source data  */

    sd.kseg_comp = ctKSEG_COMPU_S_TERTIARY |

                   ctKSEG_COMPU_N_DEFAULT;

    strcpy(sd.kseg_desc,"ar");          /* arabic            */


    if ((hnd = PutXtdKeySegmentDef(ctKSEGapplic,0,&sd)) < 0) {

           /* hnd holds the negative of the error code value */

        CLISAM();

        exit(-hnd);

    }

/*

** else hnd holds the handle associated with the application-

** wide default for ICU Unicode extended key segments. Two of

** the segments specified in the ISEG array require ICU Unicode

** extended key segment definitions. If no other definitions

** are available upon the first use of the indexes containing

** the UNCSEG modifiers, then the application default

** definition will be used.

*/

    if ((rc = CREIFIL(&ifil))) {

        CLISAM();

        exit(rc);

    }

/*

** else the data and indexes have been successfully created. No

** extended key segment information has been added to either

** the data file or the index file at this point. If a call to

** PutXtdKeySegmentDef is now made that explicitly references the data

** file or index file (using ifil.tfilno or ifil.tfilno +

** keyno), then a special resource would be added upon

** successful completion of the PutXtdKeySegmentDef call.

*/


    memset(&dr,0,sizeof(dr));


/*

** dr.delflg is now set to zero, and dr.sernum (now zero) will

** be filled-in by a call to ADDREC

*/


    strcpy(dr.idnum,"1234567");

    strcpy(dr.zipcode,"99999");

    strcpy(dr.code,"YNM");


/*

** the ordinary ASCII fields are now set

*/


    getUnicodeUTF8string(dr.utf8str,64);


/*

** this external routine will fill-in the multi-byte UTF8

** encoded Unicode string up to 64 bytes

*/


    if (!TRANBEG(ctTRNLOG | ctENABLE) ||

        (rc = ADDREC(ifil.tfilno,&dr)) ||

        (rc = TRANEND(ctFREE))) {

        /* could not successfully add data record */

        if (!rc)

            rc = uerr_cod;  /* TRANBEG failed: save its error before CLISAM() */

        CLISAM();

        exit(rc);

    }


/*

** else successfully added data record. This first use of the

** UNCSEGs will have caused the index file to be updated not

** only with the key values, but the application default
** extended key segment definition also will have been added to
** the index.

*/


    CLISAM();

    exit(0);

}

 

Error Codes

The three functions all return negative values if an error occurs. Also, an ISAM operation might fail if a problem arises attempting to use an extended key segment. A problem with an encoded file name might also produce an error. The following table describes the error codes related to extended key segments and Unicode file names.

Value  Symbolic Constant  Interpretation
694    NUNC_ERR           Executable does not support ICU Unicode, but a UNCSEG modifier has been encountered.
700    OSEG_ERR           Could not process key segment definition. This could happen during a PutXtdKeySegmentDef() call or when a file is opened that includes an extended key segment definition. Ordinarily, the CTSTATUS.FCS file will contain additional information about the problem.
701    CSEG_ERR           Could not process the kseg_comp options. This could occur if two or more mutually exclusive options are combined.
702    ASEG_ERR           An error occurred when attempting to process one of the special attribute options.
703    HSEG_ERR           Invalid key segment handle in a call to TransformXtdSegment(), or in a call to GetXtdKeySegmentDef() with the ctKSEGhandle option, in which case the segno parameter must be set to a valid extended key segment handle.
704    SSEG_ERR           No source type provided when kseg_styp has been set to ctKSEG_STYP_PROVIDED. If this error occurs, it is likely to occur during the first use of the extended key segment (for example, with AddRecord(), AddVRecord(), or OpenIFile()).
705    DSEG_ERR           An extended key segment definition already exists at the level implied by the PutXtdKeySegmentDef() call.
706    NSEG_ERR           Zero bytes of binary sort key were generated. Possibly an all NULL source.
707    USEG_ERR           There is no extended key segment definition to use.
708    MBSP_ERR           Multibyte/Unicode file names are not supported.
709    MBNM_ERR           A badly formed multibyte/Unicode file name has been encountered.
710    MBFM_ERR           A multibyte/Unicode variant is not supported (e.g., UTF32).

The following table contains explanations of existing error codes used in this context.

Value  Symbolic Constant  Interpretation
62     LERR_ERR           PutXtdKeySegmentDef() called for a data or index file requires the file to be opened exclusively. (Remember, a just created file is in exclusive mode, regardless of the specified file mode, until it is closed and re-opened.)
437    DADR_ERR           NULL pkdef argument in PutXtdKeySegmentDef() or GetXtdKeySegmentDef().
445    SDAT_ERR           No source data to create key segment.
446    BMOD_ERR           Improper filno or segno values in calls to PutXtdKeySegmentDef() or GetXtdKeySegmentDef(). TransformXtdSegment() causes a BMOD_ERR if the handle references an extended key segment definition not supported by the executable.
589    LADM_ERR           Member of non-ADMIN group called PutXtdKeySegmentDef() for a Server default (i.e., ctKSEGserver).
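
When logging a failure, it can help to translate a return value back into the symbolic constant from the two tables above. The following is only a sketch: xtd_err_name() and xtd_err_code() are hypothetical helpers, not part of the FairCom API, and the table they search is simply copied from this section. xtd_err_name() accepts either a positive error code or the negative value returned by one of the extended key segment functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Codes and symbolic names copied from the two tables above. */
struct xtd_err { int code; const char *name; };

static const struct xtd_err xtd_errs[] = {
    { 62, "LERR_ERR"}, {437, "DADR_ERR"}, {445, "SDAT_ERR"},
    {446, "BMOD_ERR"}, {589, "LADM_ERR"}, {694, "NUNC_ERR"},
    {700, "OSEG_ERR"}, {701, "CSEG_ERR"}, {702, "ASEG_ERR"},
    {703, "HSEG_ERR"}, {704, "SSEG_ERR"}, {705, "DSEG_ERR"},
    {706, "NSEG_ERR"}, {707, "USEG_ERR"}, {708, "MBSP_ERR"},
    {709, "MBNM_ERR"}, {710, "MBFM_ERR"}
};

#define XTD_ERR_COUNT (sizeof(xtd_errs) / sizeof(xtd_errs[0]))

/*
** Hypothetical helper: return the symbolic name for rc, where rc is
** either a positive error code or its negative (as returned through a
** handle such as hnd in the example program). Returns NULL if the
** code is not listed in this section.
*/
const char *xtd_err_name(int rc)
{
    size_t i;
    int code = (rc < 0) ? -rc : rc;

    for (i = 0; i < XTD_ERR_COUNT; i++)
        if (xtd_errs[i].code == code)
            return xtd_errs[i].name;
    return NULL;
}

/* Hypothetical helper: reverse lookup, returns 0 for an unknown name. */
int xtd_err_code(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < XTD_ERR_COUNT; i++)
        if (strcmp(xtd_errs[i].name, name) == 0)
            return xtd_errs[i].code;
    return 0;
}
```

For instance, if PutXtdKeySegmentDef() returned -705, xtd_err_name(-705) would yield "DSEG_ERR".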

 

Server Configuration Keywords for Unicode Segment Default

The FairCom Server configuration file (or settings file or command line) can specify a server default for each type of extended segment definition supported. Each such default definition starts with an XTDKSEG_SEG_TYPE entry. For an ICU Unicode default, the entries begin with:

XTDKSEG_SEG_TYPE UNICODE_ICU

This initial entry is then followed by the specifics of the default such as locale and other collation options. For example, the following configuration entries would define a default for the server with the same characteristics as the application default used in the above API example.

XTDKSEG_SEG_TYPE         UNICODE_ICU

ICU_LOCALE               "ar"

XTDKSEG_SRC_SIZE         12

XTDKSEG_SRC_TYPE         UTF8

ICU_OPTION               STRENGTH_TERTIARY

ICU_OPTION               NORM_DEFAULT

The complete list of keywords and arguments is shown in the following table.

Configuration Keyword        Arguments
XTDKSEG_SEG_TYPE             UNICODE_ICU
XTDKSEG_SRC_TYPE             PROVIDED
                             UTF-8
                             UTF-16
XTDKSEG_SRC_SIZE             PROVIDED
                             COMPUTED
                             <numeric value>
XTDKSEG_FAILED_DEFAULT_OK    YES [server can still begin if server default encounters an error]
                             NO [server cannot continue on error, which is the default behavior]
ICU_LOCALE                   <locale string in ICU form: xx_YY_Variant> *
ICU_OPTION                   STRENGTH_PRIMARY
                             STRENGTH_SECONDARY
                             STRENGTH_TERTIARY
                             STRENGTH_QUATERNARY
                             STRENGTH_IDENTICAL
                             STRENGTH_DEFAULT
                             NORM_NONE
                             NORM_CAN_DECMP
                             NORM_CMP_DECMP
                             NORM_CAN_DECMP_CMP
                             NORM_CMP_DECMP_CAN
                             NORM_DEFAULT
                             LOCALE_SYSDEFAULT_NOTOK
                             LOCALE_FALLBACK_NOTOK
                             ATTR_FRENCH_ON
                             ATTR_FRENCH_OFF
                             ATTR_CASE_ON
                             ATTR_CASE_OFF
                             ATTR_DECOMP_ON
                             ATTR_DECOMP_OFF
                             ATTR_SHIFTED
                             ATTR_NONIGNR
                             ATTR_LOWER
                             ATTR_UPPER
                             ATTR_HANGUL

* Where “xx” is the language as specified by ISO-639 (e.g., “fr” for French); “YY” is a country as specified by ISO-3166 (e.g., “fr_CA” for the French language in Canada); and the “Variant” portion represents system-dependent options.
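
To make the xx_YY_Variant form concrete, the sketch below splits a locale string at its first two underscores. It is an illustration only: split_icu_locale() and struct icu_locale_parts are hypothetical names, not part of FairCom DB or ICU, and no validation against the ISO code lists is attempted.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical holder for the three parts of an ICU-form locale string. */
struct icu_locale_parts {
    char language[16];   /* "xx": ISO-639 language code   */
    char country[16];    /* "YY": ISO-3166 country code   */
    char variant[32];    /* system-dependent variant text */
};

/* Copy at most dstsize-1 bytes of src[0..len) into dst and terminate it. */
static void copy_part(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/*
** Split "xx", "xx_YY" or "xx_YY_Variant" at the first two underscores.
** Parts that are absent are returned as empty strings.
*/
void split_icu_locale(const char *locale, struct icu_locale_parts *out)
{
    const char *u1, *u2;

    assert(locale != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    u1 = strchr(locale, '_');
    if (u1 == NULL) {
        copy_part(out->language, sizeof(out->language),
                  locale, strlen(locale));
        return;
    }
    copy_part(out->language, sizeof(out->language),
              locale, (size_t)(u1 - locale));

    u2 = strchr(u1 + 1, '_');
    if (u2 == NULL) {
        copy_part(out->country, sizeof(out->country),
                  u1 + 1, strlen(u1 + 1));
        return;
    }
    copy_part(out->country, sizeof(out->country),
              u1 + 1, (size_t)(u2 - (u1 + 1)));
    copy_part(out->variant, sizeof(out->variant),
              u2 + 1, strlen(u2 + 1));
}
```

With this sketch, “fr_CA” yields the language “fr” and the country “CA” with an empty variant, while “ar” (the locale used in the API example) yields only a language.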

For all but ICU_OPTION, only one of the listed arguments can be specified for each keyword. For instance, it does not make sense to have both of these entries in the configuration file for one extended key segment default definition:

XTDKSEG_SRC_TYPE       UTF16

XTDKSEG_SRC_TYPE       UTF8

There may be many ICU_OPTION entries in a configuration file. Some combinations of entries do not make sense, and the behavior is not guaranteed if they are combined. For instance, using both of these entries is inappropriate:

ICU_OPTION             ATTR_LOWER

ICU_OPTION             ATTR_UPPER
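
The two rules above can be checked before a configuration file is deployed. The following sketch assumes the keywords and ICU_OPTION arguments of one default definition have already been read into arrays of strings; both function names are hypothetical, and the server's own parser may behave differently. Only the ATTR_LOWER/ATTR_UPPER pair named above is tested, because this section does not list the other inappropriate combinations.

```c
#include <assert.h>
#include <string.h>

/*
** keywords[] holds the keyword column of ONE default definition.
** Return the index of the first entry that repeats an earlier keyword
** other than ICU_OPTION, or -1 if there is no such entry.
*/
int first_repeated_keyword(const char *const keywords[], int count)
{
    int i, j;

    assert(keywords != NULL || count == 0);
    for (i = 1; i < count; i++) {
        if (strcmp(keywords[i], "ICU_OPTION") == 0)
            continue;           /* may legitimately appear many times */
        for (j = 0; j < i; j++)
            if (strcmp(keywords[i], keywords[j]) == 0)
                return i;
    }
    return -1;
}

/*
** options[] holds the arguments of the ICU_OPTION entries.
** Return 1 if both ATTR_LOWER and ATTR_UPPER are present, else 0.
*/
int has_lower_upper_conflict(const char *const options[], int count)
{
    int i, lower = 0, upper = 0;

    assert(options != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (strcmp(options[i], "ATTR_LOWER") == 0)
            lower = 1;
        else if (strcmp(options[i], "ATTR_UPPER") == 0)
            upper = 1;
    }
    return lower && upper;
}
```

Applied to the example definition shown earlier in this section, first_repeated_keyword() returns -1; applied to a definition containing two XTDKSEG_SRC_TYPE entries, it returns the index of the second one.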

 

Unicode File Names

Virtually everywhere an ordinary ASCII file name is used with FairCom DB, a Unicode file name can be used. However, to use Unicode file names, the underlying OS platform must support file creation with Unicode file names. Also, FairCom DB requires that the file name have a special 8-byte prefix that informs FairCom DB about special file name encoding. Currently, FairCom DB accepts the use of UTF-8 and UTF-16 encoded Unicode file names. The ctMULTIBYTEname define enables this capability.

The following function stores a proper prefix at dp of type FnType.

NINT ctMBprefix(pTEXT dp,NINT FnType);

dp is a pTEXT because the name may be encoded as a byte stream or a wide character array, and ctMBprefix does not assume that dp is aligned when used with UTF-16. FnType may be ctFnTypeUTF8 or ctFnTypeUTF16. ctFnPrefixSIZE holds the size of the prefix in bytes. ctMBprefix returns NO_ERROR (0) unless the FnType parameter is bad, in which case BMOD_ERR (446) is returned.

In the following example, getUnicodeUTF8string() is assumed to be a routine which fills in a UTF-8 encoded string up to a maximum length. getUnicodeUTF16string() performs in the same manner with 16-bit wide characters.

  FILNO     datno8, datno16;

  WCHAR     utf16name[256];

  TEXT      utf8name[512];

 

ctMBprefix(utf8name,ctFnTypeUTF8);

getUnicodeUTF8string(utf8name + ctFnPrefixSIZE, 512 - ctFnPrefixSIZE);

datno8 = OPNRFIL(-1,utf8name,SHARED);

 

ctMBprefix((pTEXT) utf16name,ctFnTypeUTF16);

getUnicodeUTF16string(utf16name + ctFnPrefixSIZE / 2,

                      256 - ctFnPrefixSIZE / 2);

datno16 = OPNRFIL(-1,(pTEXT) utf16name,SHARED);
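
The buffer arithmetic in the example above can be made explicit. The sketch below uses PREFIX_BYTES as a stand-in for ctFnPrefixSIZE, set to 8 because the text describes an 8-byte prefix, and assumes a 16-bit wide character type; the function names are hypothetical. It shows why the UTF-16 case divides the prefix size by 2: the prefix is measured in bytes, but the name buffer is indexed in 16-bit units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ctFnPrefixSIZE; the text describes an 8-byte prefix. */
#define PREFIX_BYTES 8

/* Bytes left for a UTF-8 name (terminator included) after the prefix. */
size_t utf8_name_capacity(size_t buf_bytes)
{
    assert(buf_bytes >= PREFIX_BYTES);
    return buf_bytes - PREFIX_BYTES;
}

/* Offset, in 16-bit units, at which a UTF-16 name begins. */
size_t utf16_name_offset(void)
{
    return PREFIX_BYTES / sizeof(uint16_t);
}

/* 16-bit units left for a UTF-16 name (terminator included). */
size_t utf16_name_capacity(size_t buf_units)
{
    assert(buf_units >= utf16_name_offset());
    return buf_units - utf16_name_offset();
}
```

Under these assumptions, the 512-byte utf8name buffer leaves 504 bytes for the name, and the 256-element utf16name buffer leaves 252 wide characters starting at element 4.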

Note: There is no provision for wide character encoded file name extents when using IFIL routines, such as CreateIFILXtd(), that permit the default file name extents to be overridden. It is possible to use UTF-8 encoded extents that will be properly merged with the root file names. However, there are only 8 bytes reserved for the extents, which effectively means at most 3 Unicode characters. Of course, it is possible to fully specify data and index names complete with extents if the dataextn and indxextn parameters to CreateIFILXtd() point to blank strings (i.e., “ ”), and if the aidxnam index name pointers (in the IIDX structures) specify the index file names complete with their extents.

 

Mirrored File Names

The prefix marking a Unicode encoded file name is only applied to the beginning of the mirrored file name string, NOT after the name separator ‘|’ (vertical bar).
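
As an illustration of this layout for a UTF-8 name, the sketch below assembles a mirrored name with the prefix at the very start and nothing extra after the ‘|’ separator. build_mirrored_utf8_name() is a hypothetical helper; the prefix bytes are assumed to have been produced elsewhere (for example by ctMBprefix()), and PREFIX_BYTES stands in for ctFnPrefixSIZE.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for ctFnPrefixSIZE; the text describes an 8-byte prefix. */
#define PREFIX_BYTES 8

/*
** Assemble <prefix><primary>|<mirror> in dst.
** prefix points at PREFIX_BYTES bytes obtained elsewhere; primary and
** mirror are UTF-8 names that carry no prefix of their own.
** Returns 0 on success, or -1 if dst is too small.
*/
int build_mirrored_utf8_name(char *dst, size_t dstsize,
                             const char *prefix,
                             const char *primary,
                             const char *mirror)
{
    size_t plen, mlen, need;

    assert(dst != NULL && prefix != NULL);
    assert(primary != NULL && mirror != NULL);

    plen = strlen(primary);
    mlen = strlen(mirror);
    need = PREFIX_BYTES + plen + 1 + mlen + 1;  /* '|' and terminator */
    if (need > dstsize)
        return -1;

    memcpy(dst, prefix, PREFIX_BYTES);          /* prefix appears once */
    memcpy(dst + PREFIX_BYTES, primary, plen);
    dst[PREFIX_BYTES + plen] = '|';             /* no prefix after it  */
    memcpy(dst + PREFIX_BYTES + plen + 1, mirror, mlen + 1);
    return 0;
}
```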