
25. Multilingual Support

NOTE: There is a great deal of overlapping and redundant information in this chapter. Ben wrote introductions to Mule issues a number of times, each time not realizing that he had already written another introduction previously. Hopefully, in time these will all be integrated.

NOTE: The information at the top of the source file `text.c' is more complete than the following, and there is also a list of all other places to look for text/I18N-related info. Also look in `text.h' for info about the DFC and Eistring APIs.

Recall that there are two primary ways that text is represented in XEmacs. The buffer representation sees the text as a series of bytes (Ibytes), with a variable number of bytes used per character. The character representation sees the text as a series of integers (Ichars), one per character. The character representation is a cleaner representation from a theoretical standpoint, and is thus used in many cases when lots of manipulations on a string need to be done. However, the buffer representation is the standard representation used in both Lisp strings and buffers, and because of this, it is the "default" representation that text comes in. The reason for using this representation is that it's compact and is compatible with ASCII.

25.1 Introduction to Multilingual Issues #1  
25.2 Introduction to Multilingual Issues #2  
25.3 Introduction to Multilingual Issues #3  
25.4 Introduction to Multilingual Issues #4  
25.5 Character Sets  
25.6 Encodings  
25.7 Internal Mule Encodings  
25.8 Byte/Character Types; Buffer Positions; Other Typedefs  
25.9 Internal Text APIs  
25.10 Coding for Mule  
25.11 CCL  
25.12 Microsoft Windows-Related Multilingual Issues  
25.13 Modules for Internationalization  
25.14 The Great Mule Merge of March 2002  



25.1 Introduction to Multilingual Issues #1

There is an introduction to these issues in the Lisp Reference manual. See section `Internationalization Terminology' in XEmacs Lisp Reference Manual. Other documentation that may be of interest to internals programmers covers ISO-2022 (see section `ISO 2022' in XEmacs Lisp Reference Manual) and CCL (see section `CCL' in XEmacs Lisp Reference Manual).



25.2 Introduction to Multilingual Issues #2

Introduction

This document covers a number of design issues, problems and proposals with regard to XEmacs MULE. First we present some definitions and some aspects of the design that have been agreed upon. Then we present some issues and problems that need to be addressed, and then I include a proposal of mine to address some of these issues. When there are other proposals, for example from Olivier, these will be appended to the end of this document.

Definitions and Design Basics

First, text is defined to be a series of characters which together define an utterance or partial utterance in some language. Generally, this language is a human language, but it may also be a computer language if the computer language uses a representation close enough to that of human languages for it to also make sense to call its representation text. Text is opposed to binary, which is a sequence of bytes, representing machine-readable but not human-readable data. A byte is merely a number within a predefined range, which nowadays is nearly always zero to 255. A character is a unit of text. What makes one character different from another is not always clear-cut. It is generally related to the appearance of the character, although perhaps not any possible appearance of that character, but some sort of ideal appearance that is assigned to a character. Whether two characters that look very similar are actually the same depends on various factors, including political ones, as well as on whether the characters are used to mean similar sorts of things or behave similarly in similar contexts. In any case, it is not always clearly defined whether two characters are actually the same or not. In practice, however, this is more or less agreed upon.

A character set is just that, a set of one or more characters. The set is unique in that there will not be more than one instance of the same character in a character set, and logically is unordered, although an order is often imposed or suggested for the characters in the character set. We can also define an order on a character set, which is a way of assigning a unique number, or possibly a pair of numbers, or a triplet of numbers, or even a set of four or more numbers to each character in the character set. The combination of a character set and an order on it results in an ordered character set. In an ordered character set, there is an upper limit and a lower limit on the possible values that a character, or that any number within the set of numbers assigned to a character, can take. However, the lower limit does not have to start at zero or one, or anywhere else in particular, nor does the upper limit have to end anywhere particular, and there may be gaps within these ranges such that particular numbers or sets of numbers do not have a corresponding character, even though they are within the upper and lower limits. For example, ASCII defines a very standard ordered character set. It is normally defined to be 94 characters in the range 33 through 126 inclusive on both ends, with every possible character within this range being actually present in the character set.

Sometimes the ASCII character set is extended to include what are called non-printing characters. Non-printing characters are characters which instead of really being displayed in a more or less rectangular block, like all other characters, instead indicate certain functions typically related to either control of the display upon which the characters are being displayed, or have some effect on a communications channel that may be currently open and transmitting characters, or may change the meaning of future characters as they are being decoded, or some other similar function. You might say that non-printing characters are somewhat of a hack because they are a special exception to the standard concept of a character as being a printed glyph that has some direct correspondence in the non-computer world.

With non-printing characters in mind, the 94-character ordered character set called ASCII is often extended into a 96-character ordered character set, also often called ASCII, which includes in addition to the 94 characters already mentioned, two non-printing characters, one called space and assigned the number 32, just below the bottom of the previous range, and another called delete or rubout, which is given number 127 just above the end of the previous range. Thus to reiterate, the result is a 96-character ordered character set, whose characters take the values from 32 to 127 inclusive. Sometimes ASCII is further extended to contain 32 more non-printing characters, which are given the numbers zero through 31 so that the result is a 128-character ordered character set with characters numbered zero through 127, and with many non-printing characters. Another way to look at this, and the way that is normally taken by XEmacs MULE, is that the characters that would be in the range zero through 31 in the most extended definition of ASCII instead form their own ordered character set, which is called control zero, and consists of 32 characters in the range zero through 31. A similar ordered character set called control one is also created, and it contains 32 more non-printing characters in the range 128 through 159. Note that none of these three ordered character sets overlaps in any of the numbers assigned to their characters, so they can all be used at once. Note further that the same character can occur in more than one character set. This was shown above, for example, in two different ordered character sets we defined, one of which we could have called ASCII, and the other ASCII-extended, to show that it had been extended by two non-printing characters. Most of the characters in these two character sets are shared and present in both of them.

Note that there is no restriction on the size of the character set, or on the numbers that are assigned to characters in an ordered character set. It is often extremely useful to represent a sequence of characters as a sequence of bytes, where a byte as defined above is a number in the range zero to 255. An encoding does precisely this. It is simply a mapping from a sequence of characters, possibly augmented with information indicating the character set that each of these characters belongs to, to a sequence of bytes which represents that sequence of characters and no other -- which is to say the mapping is reversible.

A coding system is a set of rules for encoding a sequence of characters augmented with character set information into a sequence of bytes, and later performing the reverse operation. It is frequently possible to group coding systems into classes or types based on common features. Typically, for example, a particular coding system class may contain a base coding system which specifies some of the rules, but leaves the rest unspecified. Individual members of the coding system class are formed by starting with the base coding system, and augmenting it with additional rules to produce a particular coding system, what you might think of as a sort of variation within a theme.

XEmacs Specific Definitions

First of all, in XEmacs, the concept of character is a little different from the general definition given above. For one thing, the character set that a character belongs to may or may not be an inherent part of the character itself. In other words, the same character occurring in two different character sets may appear in XEmacs as two different characters. This is generally the case now, but we are attempting to move in the other direction. Different proposals may have different ideas about exactly the extent to which this change will be carried out. The general trend, though, is to represent all information about a character other than the character itself, using text properties attached to the character. That way two instances of the same character will look the same to lisp code that merely retrieves the character, and does not also look at the text properties of that character. Everyone involved is in agreement in doing it this way with all Latin characters, and in fact for all characters other than Chinese, Japanese, and Korean ideographs. For those, there may be a difference of opinion.

A second difference between the general definition of character and the XEmacs usage of character is that each character is assigned a unique number that distinguishes it from all other characters in the world, or at the very least, from all other characters currently existing anywhere inside the current XEmacs invocation. (If there is a case where the weaker statement applies, but not the stronger statement, it would possibly be with composite characters and any other such characters that are created on the sly.)

This unique number is called the character representation of the character, and its particular details are a matter of debate. There is a current standard in use, but it is undoubtedly going to change. What has definitely been agreed upon is that it will be an integer, more specifically a positive integer, represented with at most 31 bits on a 32-bit architecture, and possibly up to 63 bits on a 64-bit architecture, with the proviso that any characters whose representation would fit on a 64-bit architecture, but not on a 32-bit architecture, would be used only for composite characters, and others that would satisfy the weak uniqueness property mentioned above, but not the strong uniqueness property.

At this point, it is useful to talk about the different representations that a sequence of characters can take. The simplest representation is simply as a sequence of characters, and this is called the Lisp representation of text, because it is the representation that Lisp programs see. Other representations include the external representation, which refers to any encoding of the sequence of characters, using the definition of encoding mentioned above. Typically, text in the external representation is used outside of XEmacs, for example in files, e-mail messages, web sites, and the like. Another representation for a sequence of characters is what I will call the byte representation, and it represents the way that XEmacs internally represents text in a buffer, or in a string. Potentially, the representation could be different between a buffer and a string, and then the terms buffer byte representation and string byte representation would be used, but in practice I don't think this will occur. It will be possible, of course, for buffers and strings, or particular buffers and particular strings, to contain different sub-representations of a single representation. For example, Olivier's 1-2-4 proposal allows for three sub-representations of his internal byte representation, allowing for characters that are 1, 2, and 4 bytes wide respectively. A particular string may be in one sub-representation, and a particular buffer in another sub-representation, but overall both are following the same byte representation. I do not use the term internal representation here, as many people have, because it is potentially ambiguous.

Another representation is called the array of characters representation. This is a representation on the C-level in which the sequence of text is represented, not using the byte representation, but by using an array of characters, each represented using the character representation. This sort of representation is often used by redisplay because it is more convenient to work with than any of the other internal representations.

The term binary representation may also be heard. Binary representation is used to represent binary data. When binary data is represented in the lisp representation, an equivalence is simply set up between bytes zero through 255, and characters zero through 255. These characters come from four character sets, which are from bottom to top, control zero, ASCII, control one, and Latin 1. Together, they comprise 256 characters, and are a good mapping for the 256 possible bytes in a binary representation. Binary representation could also be used to refer to an external representation of the binary data, which is a simple direct byte-to-byte representation. No internal representation should ever be referred to as a binary representation because of ambiguity.

The terms character set and coding system were defined generally, above. In XEmacs, the equivalent concepts exist, although character set has been shortened to charset, and in fact represents specifically an ordered character set. For each possible charset, and for each possible coding system, there is an associated object in XEmacs. These objects will be of type charset and coding system, respectively. Charsets and coding systems are divided into classes, or types, the normal term under XEmacs, and all possible charsets and coding systems that may be defined must be in one of these types. If you need to create a charset or coding system that is not one of these types, you will have to modify the C code to support this new type. Some of the existing or soon-to-be-created types are, or will be, generic enough so that this shouldn't be an issue. Note also that the byte encoding for text and the character coding of a character are closely related. You might say that ideally each is the simplest equivalent of the other given the general constraints on each representation.
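The equivalence between binary values and the four character sets just described can be sketched as a simple range test. The enum and function names here are hypothetical and exist only for illustration; the ranges themselves are the ones given in the text (control zero, the 96-character form of ASCII, control one, and the upper half of Latin 1).

```c
/* Hypothetical names for the four charsets that together cover 0..255. */
enum byte_charset
{
  CS_CONTROL_0,   /* 0x00..0x1F: 32 non-printing characters */
  CS_ASCII,       /* 0x20..0x7F: the 96-character form of ASCII */
  CS_CONTROL_1,   /* 0x80..0x9F: 32 more non-printing characters */
  CS_LATIN_1      /* 0xA0..0xFF: the 96 upper Latin 1 characters */
};

/* Return the charset that the character standing for binary value `b'
   comes from.  Illustrative only; not an XEmacs function. */
static enum byte_charset
charset_of_byte (unsigned char b)
{
  if (b < 0x20)
    return CS_CONTROL_0;
  if (b < 0x80)
    return CS_ASCII;
  if (b < 0xA0)
    return CS_CONTROL_1;
  return CS_LATIN_1;
}
```

Because the four ranges are disjoint and together cover every value from zero through 255, each byte of binary data maps to exactly one character.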

To be specific, in the current MULE representation,

  1. Characters encode both the character itself and the character set that it comes from. These character sets are always assumed to be representable as an ordered character set of size 96 or of size 96 by 96, or the trivially-related sizes 94 and 94 by 94. The only allowable exceptions are the control zero and control one character sets, which are of size 32. Character sets which do not naturally have a compatible ordering such as this are shoehorned into an ordered character set, or possibly two ordered character sets of a compatible size.
  2. The variable width byte representation was deliberately chosen to allow scanning text forwards and backwards efficiently. This necessitated dividing the possible bytes into three ranges which we shall call A, B, and C. Range A is used exclusively for single-byte characters, which is to say characters that are represented using only one byte. Multi-byte characters are always represented by using one byte from Range B, followed by one or more bytes from Range C. What this means is that bytes that begin a character are unequivocally distinguished from bytes that do not begin a character, and therefore there is never a problem scanning backwards and finding the beginning of a character. Note that UTF-8 adopts an approach that is very similar in spirit in that it uses separate ranges for the first byte of a multi-byte sequence and for the following bytes in a multi-byte sequence.
  3. Given the fact that all ordered character sets allowed were essentially 96 characters per dimension, it made perfect sense to make Range C comprise 96 bytes. With a little more tweaking, the currently-standard MULE byte representation was created, and was drafted from this.
  4. The MULE byte representation defined four basic representations for characters, which would take up from one to four bytes, respectively. The MULE character representation thus had the following constraints:
    1. Character numbers zero through 255 should represent the characters that binary values zero through 255 would be mapped onto. (Note: this was not the case in Kenichi Handa's version of this representation, but I changed it.)
    2. The four sub-classes of representation in the MULE byte representation should correspond to four contiguous non-overlapping ranges of characters.
    3. The algorithmic conversion between the single character represented in the byte representation and in the character representation should be as easy as possible.
    4. Given the previous constraints, the character representation should be as compact as possible, which is to say it should use the least number of bits possible.

So you see that the entire structure of the byte and character representations stemmed from a very small number of basic choices, which were

  1. the choice to encode character set information in a character
  2. the choice to assume that all character sets would have an order imposed upon them with 96 characters per one or two dimensions. (This is less arbitrary than it seems--it follows ISO-2022)
  3. the choice to use a variable width byte representation.

What this means is that you cannot really separate the byte representation, the character representation, and the assumptions made about characters and whether they represent character sets from each other. All of these are closely intertwined, and for purposes of simplicity, they should be designed together. If you change one representation without changing another, you are in essence creating a completely new design with its own attendant problems. Since your new design is likely to be quite complex and not very coherent with regard to the translation between the character and byte representations, you are likely to run into problems.
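To see why the three byte ranges described above make backward scanning cheap, consider the following minimal sketch. It assumes one concrete assignment of the ranges: Range A is 0x00 through 0x7F, Range B is 0x80 through 0x9F, and Range C is the 96 bytes 0xA0 through 0xFF. Those boundaries and both function names are assumptions made for illustration; consult `text.c' and `text.h' for the actual definitions.

```c
#include <stddef.h>

/* Assumed byte ranges (illustrative only):
     Range A: 0x00..0x7F  single-byte characters
     Range B: 0x80..0x9F  first byte of a multi-byte character
     Range C: 0xA0..0xFF  non-first bytes (96 values)  */

/* Non-zero if `b' can only be the first byte of a character,
   i.e. it lies in Range A or Range B. */
static int
byte_starts_char (unsigned char b)
{
  return b < 0xA0;
}

/* Given any byte offset `pos' inside `text', return the offset of the
   first byte of the character that contains it.  No state is needed:
   we simply step back over Range C bytes. */
static size_t
char_start (const unsigned char *text, size_t pos)
{
  while (pos > 0 && !byte_starts_char (text[pos]))
    pos--;
  return pos;
}
```

With the byte sequence 'a', 0x81, 0xC1, 0x92, 0xB0, 0xA1, 'b', for example, offsets 3, 4 and 5 all resolve to offset 3, the start of the three-byte character. An encoding without this property would have to rescan from the start of the text to answer the same question.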



25.3 Introduction to Multilingual Issues #3

In XEmacs, Mule is a code word for the support for input handling and display of multi-lingual text. This section provides an overview of how this support impacts the C and Lisp code in XEmacs. It is important for anyone who works on the C or the Lisp code, especially on the C code, to be aware of these issues, even if they don't work directly on code that implements multi-lingual features, because there are various general procedures that need to be followed in order to write Mule-compliant code. (The specifics of these procedures are documented elsewhere in this manual.)

There are four primary aspects of Mule support:

  1. internal handling and representation of multi-lingual text.
  2. conversion between the internal representation of text and the various external representations in which multi-lingual text is encoded, such as Unicode representations (including mostly fixed width encodings such as UCS-2/UTF-16 and UCS-4 and variable width ASCII conformant encodings, such as UTF-7 and UTF-8); the various ISO2022 representations, which typically use escape sequences to switch between different character sets (such as Compound Text, used under X Windows; JIS, used specifically for encoding Japanese; and EUC, a non-modal encoding used for Japanese, Korean, and certain other languages); Microsoft's multi-byte encodings (such as Shift-JIS); various simple encodings for particular 8-bit character sets (such as Latin-1 and Latin-2, and encodings (such as koi8 and Alternativny) for Cyrillic); and others. This conversion needs to happen both for text in files and text sent to or retrieved from system API calls. It even needs to happen for external binary data because the internal representation does not represent binary data simply as a sequence of bytes as it is represented externally.
  3. Proper display of multi-lingual characters.
  4. Input of multi-lingual text using the keyboard.

These four aspects are for the most part independent of each other.

Characters, Character Sets, and Encodings

A character (which is, BTW, a surprisingly complex concept) is, in a written representation of text, the most basic written unit that has a meaning of its own. It's comparable to a phoneme when analyzing words in spoken speech (for example, the sound of `t' in English, which in fact has different pronunciations in different words -- aspirated in `time', unaspirated in `stop', unreleased or even pronounced as a glottal stop in `button', etc. -- but logically is a single concept). Like a phoneme, a character is an abstract concept defined by its meaning. The character `lowercase f', for example, can always be used to represent the first letter in the word `fill', regardless of whether it's drawn upright or italic, whether the `fi' combination is drawn as a single ligature, whether there are serifs on the bottom of the vertical stroke, etc. (These different appearances of a single character are often called graphs or glyphs.) Our concern when representing text is on representing the abstract characters, and not on their exact appearance.

A character set (or charset), as we define it, is a set of characters, each with an associated number (or set of numbers -- see below), called a code point. It's important to understand that a character is not defined by any number attached to it, but by its meaning. For example, ASCII and EBCDIC are two charsets containing exactly the same characters (lowercase and uppercase letters, numbers 0 through 9, particular punctuation marks) but with different numberings. The `comma' character in ASCII and EBCDIC, for instance, is the same character despite having a different numbering. Conversely, when comparing ASCII and JIS-Roman, which look the same except that the latter has a yen sign substituted for the backslash, we would say that the backslash and yen sign are not the same characters, despite having the same number (92) and despite the fact that all other characters are present in both charsets, with the same numbering. ASCII and JIS-Roman, then, do not have exactly the same characters in them (ASCII has a backslash character but no yen-sign character, and vice-versa for JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII and JIS-Roman are closer.
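The distinction between a character and its code point can be made concrete with a small lookup. This is only a sketch: the enum and function names are hypothetical, and the table covers just the characters discussed above (comma is 0x2C in ASCII and JIS-Roman and 0x6B in EBCDIC; 0x5C, decimal 92, is backslash in ASCII and the yen sign in JIS-Roman).

```c
/* Hypothetical identifiers for abstract characters and charsets. */
enum abstract_char { CH_UNKNOWN, CH_COMMA, CH_BACKSLASH, CH_YEN_SIGN };
enum charset_id { CHARSET_ASCII, CHARSET_EBCDIC, CHARSET_JIS_ROMAN };

/* Return the abstract character that `code_point' denotes in charset
   `cs'.  Only the handful of characters from the text are listed;
   everything else yields CH_UNKNOWN. */
static enum abstract_char
char_at (enum charset_id cs, unsigned int code_point)
{
  switch (cs)
    {
    case CHARSET_ASCII:
      if (code_point == 0x2C) return CH_COMMA;
      if (code_point == 0x5C) return CH_BACKSLASH;
      break;
    case CHARSET_EBCDIC:
      if (code_point == 0x6B) return CH_COMMA;
      break;
    case CHARSET_JIS_ROMAN:
      if (code_point == 0x2C) return CH_COMMA;
      if (code_point == 0x5C) return CH_YEN_SIGN;
      break;
    }
  return CH_UNKNOWN;
}
```

The same character (comma) is reached through different numbers, and the same number (0x5C) reaches different characters, which is why a code point is meaningful only together with its charset.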

It's also important to distinguish between charsets and encodings. For a simple charset like ASCII, there is only one encoding normally used -- each character is represented by a single byte, with the same value as its code point. For more complicated charsets, however, things are not so obvious. Unicode version 2, for example, is a large charset with thousands of characters, each indexed by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew letter "aleph". One obvious encoding uses two bytes per character (actually two encodings, depending on which of the two possible byte orderings is chosen). This encoding is convenient for internal processing of Unicode text; however, it's incompatible with ASCII, so a different encoding, e.g. UTF-8, is usually used for external text, for example files or e-mail. UTF-8 represents Unicode characters with one to three bytes (often extended to six bytes to handle characters with up to 31-bit indices). Unicode characters 00 to 7F (identical with ASCII) are directly represented with one byte, and other characters with two or more bytes, each in the range 80 to FF.
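As a worked illustration of the two kinds of encoding, the sketch below encodes a single 16-bit code point both ways. The function names are made up for this example, input is assumed to be at most 0xFFFF, and surrogate values are not treated specially.

```c
#include <stddef.h>

/* Encode a code point in the range 0..0xFFFF as UTF-8 into `out'
   (which must have room for 3 bytes).  Returns the number of bytes
   written, from 1 to 3. */
static size_t
utf8_encode_16 (unsigned int cp, unsigned char *out)
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;                     /* 0xxxxxxx */
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));     /* 110xxxxx */
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));   /* 10xxxxxx */
      return 2;
    }
  out[0] = (unsigned char) (0xE0 | (cp >> 12));          /* 1110xxxx */
  out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
  out[2] = (unsigned char) (0x80 | (cp & 0x3F));         /* 10xxxxxx */
  return 3;
}

/* Encode the same code point as two bytes, most significant first.
   The other byte ordering simply swaps out[0] and out[1]. */
static void
ucs2_encode_be (unsigned int cp, unsigned char *out)
{
  out[0] = (unsigned char) (cp >> 8);
  out[1] = (unsigned char) (cp & 0xFF);
}
```

For the Hebrew letter aleph (0x05D0), the two-byte encoding yields 05 D0 and UTF-8 yields D7 90. For the letter `A' (0x41), UTF-8 yields the single byte 41, exactly as in ASCII, whereas the two-byte encoding yields 00 41, whose zero byte is what makes it incompatible with ASCII-oriented code.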

In general, a single encoding may be able to represent more than one charset.

Internal Representation of Text

In an ASCII or single-European-character-set world, life is very simple. There are 256 characters, and each character is represented using the numbers 0 through 255, which fit into a single byte. With a few exceptions (such as case-changing operations or syntax classes like 'whitespace'), "text" is simply an array of indices into a font. You can get different languages simply by choosing fonts with different 8-bit character sets (ISO-8859-1, -2, special-symbol fonts, etc.), and everything will "just work" as long as anyone else receiving your text uses a compatible font.

In the multi-lingual world, however, it is much more complicated. There are a great number of different characters which are organized in a complex fashion into various character sets. The representation to use is not obvious because there are issues of size versus speed to consider. In fact, there are in general two kinds of representations to work with: one that represents a single character using an integer (possibly a byte), and the other representing a single character as a sequence of bytes. The former representation is normally called fixed width, and the other variable width. Both representations represent exactly the same characters, and the conversion from one representation to the other is governed by a specific formula (rather than by table lookup) but it may not be simple. Most C code need not, and in fact should not, know the specifics of exactly how the representations work. In fact, the code must not make assumptions about the representations. This means in particular that it must use the proper macros for retrieving the character at a particular memory location, determining how many characters are present in a particular stretch of text, and incrementing a pointer to a particular character to point to the following character, and so on. It must not assume that one character is stored using one byte, or even using any particular number of bytes. It must not assume that the number of characters in a stretch of text bears any particular relation to a number of bytes in that stretch. It must not assume that the character at a particular memory location can be retrieved simply by dereferencing the memory location, even if a character is known to be ASCII or is being compared with an ASCII character, etc. Careful coding is required to be Mule clean. The biggest work of adding Mule support, in fact, is converting all of the existing code to be Mule clean.
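The following sketch shows the style of code this implies: all knowledge of the byte layout is confined to one small helper, and everything else goes through it. The helpers are stand-ins written for this example, not the actual XEmacs macros, and they assume a layout in which bytes 0xA0 through 0xFF never begin a character.

```c
#include <stddef.h>

/* Stand-in for a pointer-increment macro: return a pointer to the
   character following the one at `p', never going past `end'.
   Assumes bytes 0xA0..0xFF are always non-first bytes. */
static const unsigned char *
text_next (const unsigned char *p, const unsigned char *end)
{
  p++;
  while (p < end && *p >= 0xA0)
    p++;
  return p;
}

/* Number of characters in the byte range [start, end).  Note that
   this is computed by walking the text, not by `end - start'. */
static size_t
text_char_count (const unsigned char *start, const unsigned char *end)
{
  size_t n = 0;
  const unsigned char *p;

  for (p = start; p < end; p = text_next (p, end))
    n++;
  return n;
}
```

A seven-byte stretch such as 'a', 0x81, 0xC1, 0x92, 0xB0, 0xA1, 'b' contains four characters under these assumptions, so code that used `p++' or a byte length in place of these helpers would be wrong as soon as non-ASCII text appeared.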

Lisp code is mostly unaffected by these concerns. Text in strings and buffers appears simply as a sequence of characters regardless of whether Mule support is present. The biggest difference with older versions of Emacs, as well as current versions of GNU Emacs, is that integers and characters are no longer equivalent, but are separate Lisp Object types.

Conversion Between Internal and External Representations

All text needs to be converted to an external representation before being sent to a function or file, and all text retrieved from a function or file needs to be converted to the internal representation. This conversion needs to happen as close to the source or destination of the text as possible. No operations should ever be performed on text encoded in an external representation other than simple copying, because no assumptions can reliably be made about the format of this text. You cannot assume, for example, that the end of text is terminated by a null byte. (For example, if the text is Unicode, it will have many null bytes in it.) You cannot find the next "slash" character by searching through the bytes until you find a byte that looks like a "slash" character, because it might actually be the second byte of a Kanji character. Furthermore, all text in the internal representation must be converted, even if it is known to be completely ASCII, because the external representation may not be ASCII compatible (for example, if it is Unicode).
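The danger of byte-wise searching in external text can be demonstrated with Shift-JIS, where the byte 0x5C (the ASCII backslash) may legitimately occur as the second byte of a two-byte character. The sketch below contrasts a naive byte search with one that understands Shift-JIS lead bytes (0x81 through 0x9F and 0xE0 through 0xFC). The function names are invented for this example, and the point is not that such scanners should be written, but that even this simple task requires knowing the encoding.

```c
#include <stddef.h>

/* Naive search: offset of the first byte equal to `c', or `len'. */
static size_t
find_byte_naive (const unsigned char *s, size_t len, unsigned char c)
{
  size_t i;

  for (i = 0; i < len; i++)
    if (s[i] == c)
      return i;
  return len;
}

/* Non-zero if `b' is the first byte of a two-byte Shift-JIS character. */
static int
sjis_lead_byte_p (unsigned char b)
{
  return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Encoding-aware search: offset of the first single-byte character
   equal to `c', or `len'.  Second bytes of two-byte characters are
   skipped rather than examined. */
static size_t
find_char_sjis (const unsigned char *s, size_t len, unsigned char c)
{
  size_t i = 0;

  while (i < len)
    {
      if (sjis_lead_byte_p (s[i]))
        i += 2;               /* skip both bytes of the character */
      else if (s[i] == c)
        return i;
      else
        i++;
    }
  return len;
}
```

Given the bytes 0x95 0x5C 0x5C 0x61 (a Kanji character encoded as 0x95 0x5C, then a real backslash, then `a'), the naive search reports a backslash at offset 1, in the middle of the Kanji, while the encoding-aware search correctly reports offset 2.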

The place where C code needs to be the most careful is when calling external API functions. It is easy to forget that all text passed to or retrieved from these functions needs to be converted. This includes text in structures passed to or retrieved from these functions and all text that is passed to a callback function that is called by the system.

Macros are provided to perform conversions to or from external text. These macros are called TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT respectively. These macros accept input in various forms, for example, Lisp strings, buffers, lstreams, raw data, and can return data in multiple formats, including both malloc()ed and alloca()ed data. The use of alloca()ed data here is particularly important because, in general, the returned data will not be used after making the API call, and as a result, using alloca()ed data provides a very cheap and easy to use method of allocation.

These macros take a coding system argument which indicates the nature of the external encoding. A coding system is an object that encapsulates the structures of a particular external encoding and the methods required to convert to and from this encoding. A facility exists to create coding system aliases, which in essence gives a single coding system two different names. It is effectively used in XEmacs to provide a layer of abstraction on top of the actual coding systems. For example, the coding system alias "file-name" points to whichever coding system is currently used for encoding and decoding file names as passed to or retrieved from system calls. In general, the actual encoding will differ from system to system, and will also depend on the particular locale that the user is in. The use of the file-name alias effectively hides that implementation detail behind an abstract interface layer, which provides a unified set of coding systems that are consistent across all operating environments.
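The alias mechanism amounts to a level of indirection between the name a caller uses and the coding system that actually does the work. The sketch below models it as a small name table. Everything in it is hypothetical: the struct, the function, and in particular the alias targets, which in a real session are determined at startup and are not fixed as shown here.

```c
#include <stddef.h>
#include <string.h>

/* One alias entry: `alias' is another name for `target'. */
struct codesys_alias
{
  const char *alias;
  const char *target;
};

/* Illustrative table only; the targets are invented for this sketch. */
static const struct codesys_alias alias_table[] = {
  { "file-name", "native" },
  { "terminal",  "native" },
  { "native",    "iso-8859-1" },
};

/* Follow aliases starting from `name' until a name that is not an
   alias is reached, and return it.  Returns NULL if more than 8 hops
   are needed, which guards against a cycle in the table. */
static const char *
resolve_coding_system (const char *name)
{
  size_t i, hops;

  for (hops = 0; hops < 8; hops++)
    {
      const char *next = NULL;

      for (i = 0; i < sizeof alias_table / sizeof alias_table[0]; i++)
        if (strcmp (alias_table[i].alias, name) == 0)
          next = alias_table[i].target;
      if (next == NULL)
        return name;          /* not an alias: an actual coding system */
      name = next;
    }
  return NULL;
}
```

Code that asks for "file-name" never needs to change when the table does; only the mapping is adjusted, which is what gives the user control over the actual encoding.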

The choice of which coding system to use in a particular conversion macro requires some thought. In general, you should choose a lower-level actual coding system when the very design of the APIs you are working with calls for that particular coding system. In all other cases, you should find the least general abstract coding system (i.e. coding system alias) that applies to your specific situation. Only use the most general coding systems, such as native, when there is simply nothing else that is more appropriate. By doing things this way, you allow the user more control over how the encoding actually works, because the user is free to map the abstracted coding system names onto different actual coding systems.

Some common coding systems are:

ctext
Compound Text, which is the standard encoding under X Windows, which is used for clipboard data and possibly other data. (ctext is a coding system of type ISO2022.)

mswindows-unicode
this is used for representing text passed to MS Windows API calls with arguments that need to be in Unicode format. (mswindows-unicode is a coding system of type UTF-16.)

mswindows-multi-byte
this is used for representing text passed to MS Windows API calls with arguments that need to be in multi-byte format. Note that there are very few if any examples of such calls.

mswindows-tstr
this is used for representing text passed to any MS Windows API calls that declare their argument as LPTSTR, or LPCTSTR. This is the vast majority of system calls and automatically translates either to mswindows-unicode or mswindows-multi-byte, depending on the presence or absence of the UNICODE preprocessor constant. (If we compile XEmacs with this preprocessor constant, then all API calls use Unicode for all text passed to or received from these API calls.)

terminal
used for text sent to or read from a text terminal in the absence of a more specific coding system (calls to window-system specific APIs should use the appropriate window-specific coding system if it makes sense to do so.) Like others here, this is a coding system alias.

file-name
used when specifying the names of files in the absence of a more specific encoding, such as mswindows-tstr. This is a coding system alias -- what it's an alias of is determined at startup.

native
the most general coding system for specifying text passed to system calls. This generally translates to whatever coding system is specified by the current locale. This should only be used when none of the coding systems mentioned above are appropriate. This is a coding system alias -- what it's an alias of is determined at startup.

Proper Display of Multilingual Text

There are two things required to get this working correctly. One is selecting the correct font, and the other is encoding the text according to the encoding used for that specific font, or the window-system specific text display API. Generally each separate character set has a different font associated with it, which is specified by name, and each font has an associated encoding into which the characters must be translated. (This is the case on X Windows, at least; on Windows there is a more general mechanism.) Both the specific font for a charset and the encoding of that font are system dependent. Currently there is a way of specifying these two properties under X Windows (using the registry and ccl properties of a character set) but not for other window systems. A more general system needs to be implemented to allow these characteristics to be specified for all window systems.

Another issue is making sure that the necessary fonts for displaying various character sets are installed on the system. Currently, XEmacs provides, on its web site, X Windows fonts for a number of different character sets that can be installed by users. This isn't done yet for Windows, but it should be.

Inputting of Multilingual Text

This is a rather complicated issue because there are many paradigms defined for inputting multi-lingual text, some of which are specific to particular languages, and any particular language may have many different paradigms defined for inputting its text. These paradigms are encoded in input methods and there is a standard API for defining an input method in XEmacs called LEIM, or Library of Emacs Input Methods. Some of these input methods are written entirely in Elisp, and thus are system-independent, while others require the aid either of an external process, or of C level support that ties into a particular system-specific input method API, for example, XIM under X Windows, or the active keyboard layout and IME support under Windows. Currently, there is no support for any system-specific input methods under Microsoft Windows, although this will change.



25.4 Introduction to Multilingual Issues #4

The rest of the sections in this chapter consist of yet another introduction to multilingual issues, duplicating the information in the previous sections.



25.5 Character Sets

A character set (or charset) is an ordered set of characters. A particular character in a charset is indexed using one or more position codes, which are non-negative integers. The number of position codes needed to identify a particular character in a charset is called the dimension of the charset. In XEmacs/Mule, all charsets have dimension 1 or 2, and the size of all charsets (except for a few special cases) is either 94, 96, 94 by 94, or 96 by 96. The range of position codes used to index characters from any of these types of character sets is as follows:

 
Charset type            Position code 1         Position code 2
------------------------------------------------------------
94                      33 - 126                N/A
96                      32 - 127                N/A
94x94                   33 - 126                33 - 126
96x96                   32 - 127                32 - 127

Note that in the above cases position codes do not start at an expected value such as 0 or 1. The reason for this will become clear later.

For example, Latin-1 is a 96-character charset, and JISX0208 (the Japanese national character set) is a 94x94-character charset.

[Note that, although the ranges above define the valid position codes for a charset, some of the slots in a particular charset may in fact be empty. This is the case for JISX0208, for example, where (e.g.) all the slots whose first position code is in the range 118 - 126 are empty.]
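As a minimal sketch of the ranges in the table above, position codes can be checked as follows. The enum and function names are hypothetical and are not taken from the XEmacs sources; the few special-case charsets are not covered.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only. */
enum charset_type { CS_94, CS_96, CS_94x94, CS_96x96 };

/* Number of position codes needed to index one character. */
int
charset_dimension (enum charset_type type)
{
  return (type == CS_94 || type == CS_96) ? 1 : 2;
}

/* True if a single position code is in range for the charset type:
   33 - 126 for the 94-based types, 32 - 127 for the 96-based types. */
bool
position_code_ok (enum charset_type type, int pc)
{
  if (type == CS_94 || type == CS_94x94)
    return pc >= 33 && pc <= 126;
  return pc >= 32 && pc <= 127;
}

/* Validate all position codes of a character.  PC2 is ignored for
   charsets of dimension 1. */
bool
char_position_ok (enum charset_type type, int pc1, int pc2)
{
  if (!position_code_ok (type, pc1))
    return false;
  return charset_dimension (type) == 1 || position_code_ok (type, pc2);
}
```

Note that this only checks that a slot is addressable; as remarked above, an addressable slot may still be empty in a particular charset.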

There are three charsets that do not follow the above rules. All of them have one dimension, and have ranges of position codes as follows:

 
Charset name            Position code 1
------------------------------------
ASCII                   0 - 127
Control-1               0 - 31
Composite               0 - some large number

(The upper bound of the position code for composite characters has not yet been determined, but it will probably be at least 16,383).

ASCII is the union of two subsidiary character sets: Printing-ASCII (the printing ASCII character set, consisting of position codes 33 - 126, like for a standard 94-character charset) and Control-ASCII (the non-printing characters that would appear in a binary file with codes 0 - 32 and 127).

Control-1 contains the non-printing characters that would appear in a binary file with codes 128 - 159.

Composite contains characters that are generated by overstriking one or more characters from other charsets.

Note that some characters in ASCII, and all characters in Control-1, are control (non-printing) characters. These have no printed representation but instead control some other function of the printing (e.g. TAB, code 9, moves the current character position to the next tab stop). All other characters in all charsets are graphic (printing) characters.

When a binary file is read in, the bytes in the file are assigned to character sets as follows:

 
Bytes           Character set           Range
--------------------------------------------------
0 - 127         ASCII                   0 - 127
128 - 159       Control-1               0 - 31
160 - 255       Latin-1                 32 - 127

This is a bit ad-hoc but gets the job done.
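The assignment in the table above can be sketched as a small function. The type and function names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>

/* Hypothetical names for illustration only. */
enum binary_charset { CHARSET_ASCII, CHARSET_CONTROL_1, CHARSET_LATIN_1 };

struct charset_position
{
  enum binary_charset charset;
  int position_code;
};

/* Map one byte of a binary file to a charset and position code,
   following the table above. */
struct charset_position
classify_binary_byte (unsigned char byte)
{
  struct charset_position result;

  if (byte <= 127)
    {
      result.charset = CHARSET_ASCII;
      result.position_code = byte;          /* 0 - 127 */
    }
  else if (byte <= 159)
    {
      result.charset = CHARSET_CONTROL_1;
      result.position_code = byte - 128;    /* 0 - 31 */
    }
  else
    {
      result.charset = CHARSET_LATIN_1;
      result.position_code = byte - 128;    /* 32 - 127 */
    }
  return result;
}
```

Both non-ASCII branches subtract 128; they differ only in which charset receives the resulting position code.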



25.6 Encodings

An encoding is a way of numerically representing characters from one or more character sets. If an encoding only encompasses one character set, then the position codes for the characters in that character set could be used directly. This is not possible, however, if more than one character set is to be used in the encoding.

For example, the conversion detailed above between bytes in a binary file and characters is effectively an encoding that encompasses the three character sets ASCII, Control-1, and Latin-1 in a stream of 8-bit bytes.

Thus, an encoding can be viewed as a way of encoding characters from a specified group of character sets using a stream of bytes, each of which contains a fixed number of bits (but not necessarily 8, as in the common usage of "byte").

Here are descriptions of a couple of common encodings:

25.6.1 Japanese EUC (Extended Unix Code)  
25.6.2 JIS7  



25.6.1 Japanese EUC (Extended Unix Code)

This encompasses the character sets Printing-ASCII, Katakana-JISX0201 (half-width katakana, the right half of JISX0201), Japanese-JISX0208, and Japanese-JISX0212.

Note that Printing-ASCII and Katakana-JISX0201 are 94-character charsets, while Japanese-JISX0208 and Japanese-JISX0212 are 94x94-character charsets.

The encoding is as follows:

 
Character set            Representation (PC=position-code)
-------------            --------------
Printing-ASCII           PC1
Katakana-JISX0201        0x8E       | PC1 + 0x80
Japanese-JISX0208        PC1 + 0x80 | PC2 + 0x80
Japanese-JISX0212        0x8F       | PC1 + 0x80 | PC2 + 0x80

Note that there are other versions of EUC for other Asian languages. EUC in general is characterized by

  1. row-column encoding,
  2. big-endian (row-first) ordering, and
  3. ASCII compatibility in variable width forms.
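A minimal sketch of an encoder for the first three rows of the Japanese EUC table (Printing-ASCII, Katakana-JISX0201 and Japanese-JISX0208) might look like this. The names are hypothetical; Japanese-JISX0212 is deliberately left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only. */
enum euc_charset
{
  EUC_PRINTING_ASCII,
  EUC_KATAKANA_JISX0201,
  EUC_JAPANESE_JISX0208
};

/* Encode one character into OUT and return the number of bytes
   written (1 or 2).  PC2 is ignored for the 94-character charsets. */
size_t
euc_jp_encode (enum euc_charset cs, int pc1, int pc2, unsigned char out[2])
{
  assert (pc1 >= 33 && pc1 <= 126);
  switch (cs)
    {
    case EUC_PRINTING_ASCII:
      out[0] = (unsigned char) pc1;
      return 1;
    case EUC_KATAKANA_JISX0201:
      out[0] = 0x8E;                          /* prefix byte */
      out[1] = (unsigned char) (pc1 + 0x80);
      return 2;
    case EUC_JAPANESE_JISX0208:
      assert (pc2 >= 33 && pc2 <= 126);
      out[0] = (unsigned char) (pc1 + 0x80);  /* row first */
      out[1] = (unsigned char) (pc2 + 0x80);
      return 2;
    }
  return 0;   /* not reached for valid charset values */
}
```

For example, the JISX0208 character with position codes 0x30, 0x21 comes out as the two bytes 0xB0 0xA1, while an ASCII 'A' stays the single byte 0x41.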



25.6.2 JIS7

This encompasses the character sets Printing-ASCII, Latin-JISX0201 (the left half of JISX0201; this character set is very similar to Printing-ASCII and is a 94-character charset), Japanese-JISX0208, and Katakana-JISX0201. It uses 7-bit bytes.

Unlike EUC, this is a modal encoding, which means that there are multiple states that the encoding can be in, which affect how the bytes are to be interpreted. Special sequences of bytes (called escape sequences) are used to change states.

The encoding is as follows:

 
Character set              Representation (PC=position-code)
-------------              --------------
Printing-ASCII             PC1
Latin-JISX0201             PC1
Katakana-JISX0201          PC1
Japanese-JISX0208          PC1 | PC2


Escape sequence   ASCII equivalent   Meaning
---------------   ----------------   -------
0x1B 0x28 0x4A    ESC ( J            invoke Latin-JISX0201
0x1B 0x28 0x49    ESC ( I            invoke Katakana-JISX0201
0x1B 0x24 0x42    ESC $ B            invoke Japanese-JISX0208
0x1B 0x28 0x42    ESC ( B            invoke Printing-ASCII

Initially, Printing-ASCII is invoked.
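Because the encoding is modal, an encoder has to remember which charset is currently invoked and emit an escape sequence only when that changes. The following is a sketch of that idea using the escape sequences from the table above; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration only.  The enum order matches
   the rows of the escape table below. */
enum jis7_charset
{
  JIS7_PRINTING_ASCII,
  JIS7_LATIN_JISX0201,
  JIS7_KATAKANA_JISX0201,
  JIS7_JAPANESE_JISX0208
};

struct jis7_encoder
{
  enum jis7_charset invoked;   /* current state of the encoding */
};

void
jis7_init (struct jis7_encoder *enc)
{
  enc->invoked = JIS7_PRINTING_ASCII;   /* the initial state */
}

/* Encode one character into OUT, preceded by an escape sequence if
   the charset changes.  Returns the number of bytes written, which
   is at most 5.  PC2 is ignored for the 94-character charsets. */
size_t
jis7_encode (struct jis7_encoder *enc, enum jis7_charset cs,
             int pc1, int pc2, unsigned char *out)
{
  static const unsigned char escapes[4][3] = {
    { 0x1B, 0x28, 0x42 },   /* ESC ( B   Printing-ASCII */
    { 0x1B, 0x28, 0x4A },   /* ESC ( J   Latin-JISX0201 */
    { 0x1B, 0x28, 0x49 },   /* ESC ( I   Katakana-JISX0201 */
    { 0x1B, 0x24, 0x42 },   /* ESC $ B   Japanese-JISX0208 */
  };
  size_t n = 0;

  assert (pc1 >= 33 && pc1 <= 126);
  if (enc->invoked != cs)
    {
      memcpy (out, escapes[cs], 3);
      n = 3;
      enc->invoked = cs;
    }
  out[n++] = (unsigned char) pc1;
  if (cs == JIS7_JAPANESE_JISX0208)
    out[n++] = (unsigned char) pc2;
  return n;
}
```

Consecutive characters from the same charset cost no extra bytes; only a change of charset pays the three-byte escape sequence.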



25.7 Internal Mule Encodings

In XEmacs/Mule, each character set is assigned a unique number, called a leading byte. This is used in the encodings of a character. Leading bytes are in the range 0x80 - 0xFF (except for ASCII, which has a leading byte of 0), although some leading bytes are reserved.

Charsets whose leading byte is in the range 0x80 - 0x9F are called official and are used for built-in charsets. Other charsets are called private and have leading bytes in the range 0xA0 - 0xFF; these are user-defined charsets.

More specifically:

 
Character set                Leading byte
-------------                ------------
ASCII                        0 (0x7F in arrays indexed by leading byte)
Composite                    0x8D
Dimension-1 Official         0x80 - 0x8C/0x8D
                               (0x8E is free)
Control                      0x8F
Dimension-2 Official         0x90 - 0x99
                               (0x9A - 0x9D are free)
Dimension-1 Private Marker   0x9E
Dimension-2 Private Marker   0x9F
Dimension-1 Private          0xA0 - 0xEF
Dimension-2 Private          0xF0 - 0xFF
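Read as code, the table above amounts to a range check on the leading byte. This is only a sketch: the enum and function are hypothetical, and 0x8D is treated here as the composite leading byte, following the "Composite" row.

```c
#include <assert.h>

/* Hypothetical names for illustration only. */
enum leading_byte_class
{
  LB_ASCII,
  LB_OFFICIAL_DIM1,
  LB_COMPOSITE,
  LB_CONTROL,
  LB_OFFICIAL_DIM2,
  LB_PRIVATE_MARKER_DIM1,
  LB_PRIVATE_MARKER_DIM2,
  LB_PRIVATE_DIM1,
  LB_PRIVATE_DIM2,
  LB_UNASSIGNED            /* free, or not a leading byte at all */
};

/* Classify a leading byte according to the table above. */
enum leading_byte_class
classify_leading_byte (unsigned char lb)
{
  if (lb == 0x00) return LB_ASCII;
  if (lb <  0x80) return LB_UNASSIGNED;
  if (lb <= 0x8C) return LB_OFFICIAL_DIM1;
  if (lb == 0x8D) return LB_COMPOSITE;
  if (lb == 0x8E) return LB_UNASSIGNED;          /* free */
  if (lb == 0x8F) return LB_CONTROL;
  if (lb <= 0x99) return LB_OFFICIAL_DIM2;
  if (lb <= 0x9D) return LB_UNASSIGNED;          /* free */
  if (lb == 0x9E) return LB_PRIVATE_MARKER_DIM1;
  if (lb == 0x9F) return LB_PRIVATE_MARKER_DIM2;
  if (lb <= 0xEF) return LB_PRIVATE_DIM1;
  return LB_PRIVATE_DIM2;                        /* 0xF0 - 0xFF */
}
```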

There are two internal encodings for characters in XEmacs/Mule. One is called string encoding and is an 8-bit encoding that is used for representing characters in a buffer or string. It uses 1 to 4 bytes per character. The other is called character encoding and is a 21-bit encoding that is used for representing characters individually in a variable.

(In the following descriptions, we'll ignore composite characters for the moment. We also give a general (structural) overview first, followed later by the exact details.)

25.7.1 Internal String Encoding  
25.7.2 Internal Character Encoding  



25.7.1 Internal String Encoding

ASCII characters are encoded using their position code directly. Other characters are encoded using their leading byte followed by their position code(s) with the high bit set. Characters in private character sets have their leading byte prefixed with a leading byte prefix, which is either 0x9E or 0x9F. (No character sets are ever assigned these leading bytes.) Specifically:

 
Character set           Encoding (PC=position-code, LB=leading-byte)
-------------           --------
ASCII                   PC1  |
Control-1               LB   |  PC1 + 0xA0 |
Dimension-1 official    LB   |  PC1 + 0x80 |
Dimension-1 private     0x9E |  LB         | PC1 + 0x80 |
Dimension-2 official    LB   |  PC1 + 0x80 | PC2 + 0x80 |
Dimension-2 private     0x9F |  LB         | PC1 + 0x80 | PC2 + 0x80

The basic characteristic of this encoding is that the first byte of all characters is in the range 0x00 - 0x9F, and the second and following bytes of all characters are in the range 0xA0 - 0xFF. This means that it is impossible to get out of sync, or more specifically:

  1. Given any byte position, the beginning of the character it is within can be determined in constant time.
  2. Given any byte position at the beginning of a character, the beginning of the next character can be determined in constant time.
  3. Given any byte position at the beginning of a character, the beginning of the previous character can be determined in constant time.
  4. Textual searches can simply treat encoded strings as if they were encoded in a one-byte-per-character fashion rather than the actual multi-byte encoding.

None of the pre-Unicode standard non-modal encodings meet all of these conditions. For example, EUC satisfies only (2) and (3), while Shift-JIS and Big5 (not yet described) satisfy only (2). (All non-modal encodings must satisfy (2), in order to be unambiguous.) UTF-8, however, meets all four, and we are considering moving to it as an internal encoding.
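The first three properties of the internal string encoding follow directly from the split between first bytes (0x00 - 0x9F) and following bytes (0xA0 - 0xFF). The sketch below shows one way to exploit this; the function names are hypothetical and are not the macros XEmacs actually provides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only. */

/* True if B can only be the first byte of a character. */
int
is_first_byte (unsigned char b)
{
  return b < 0xA0;
}

/* Property 1: start of the character that contains byte POS.  A
   character is at most 4 bytes long, so the loop runs at most 3
   times on valid text. */
size_t
char_start (const unsigned char *text, size_t pos)
{
  while (pos > 0 && !is_first_byte (text[pos]))
    pos--;
  return pos;
}

/* Property 2: start of the character after the one starting at POS,
   or LEN if POS starts the last character. */
size_t
next_char_start (const unsigned char *text, size_t len, size_t pos)
{
  assert (pos < len && is_first_byte (text[pos]));
  do
    pos++;
  while (pos < len && !is_first_byte (text[pos]));
  return pos;
}

/* Property 3: start of the character before the one starting at POS. */
size_t
prev_char_start (const unsigned char *text, size_t pos)
{
  assert (pos > 0);
  return char_start (text, pos - 1);
}
```

None of these functions needs to know which charset a character belongs to, or to scan from the beginning of the text; that is what makes them constant-time.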



25.7.2 Internal Character Encoding

One 21-bit word represents a single character. The word is separated into three fields:

 
Bit number:     20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
                <------------------> <------------------> <------------------>
Field:                    1                    2                    3

Note that each field holds 7 bits.

 
Character set           Field 1         Field 2         Field 3
-------------           -------         -------         -------
ASCII                      0               0              PC1
   range:                                                   (00 - 7F)
Control-1                  0               1              PC1
   range:                                                   (00 - 1F)
Dimension-1 official       0            LB - 0x7F         PC1
   range:                                    (01 - 0D)      (20 - 7F)
Dimension-1 private        0            LB - 0x80         PC1
   range:                                    (20 - 6F)      (20 - 7F)
Dimension-2 official    LB - 0x8F         PC1             PC2
   range:                    (01 - 0A)       (20 - 7F)      (20 - 7F)
Dimension-2 private     LB - 0xE1         PC1             PC2
   range:                    (0F - 1E)       (20 - 7F)      (20 - 7F)
Composite                 0x1F             ?               ?

Note also that character codes 0 - 255 are the same as the "binary encoding" described above.
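As a sketch of the field layout, the following packs and unpacks the three 7-bit fields, and builds character codes for the two official charset rows of the table. All function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical names for illustration only. */

/* Pack three 7-bit fields into one 21-bit character code. */
int
make_char_code (int field1, int field2, int field3)
{
  assert (field1 >= 0 && field1 <= 0x7F);
  assert (field2 >= 0 && field2 <= 0x7F);
  assert (field3 >= 0 && field3 <= 0x7F);
  return (field1 << 14) | (field2 << 7) | field3;
}

int char_code_field1 (int code) { return (code >> 14) & 0x7F; }
int char_code_field2 (int code) { return (code >> 7) & 0x7F; }
int char_code_field3 (int code) { return code & 0x7F; }

/* Dimension-1 official charset: field 2 holds LB - 0x7F. */
int
official_dim1_char_code (int leading_byte, int pc1)
{
  return make_char_code (0, leading_byte - 0x7F, pc1);
}

/* Dimension-2 official charset: field 1 holds LB - 0x8F. */
int
official_dim2_char_code (int leading_byte, int pc1, int pc2)
{
  return make_char_code (leading_byte - 0x8F, pc1, pc2);
}
```

For instance, make_char_code (0, 0, 0x41) is 65 and make_char_code (0, 1, 0x1F) is 159, which agrees with the observation that codes 0 - 255 match the binary encoding.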

Most of the code in XEmacs knows nothing of the representation of a character other than that values 0 - 255 represent ASCII, Control 1, and Latin 1.

WARNING WARNING WARNING: The Boyer-Moore code in `search.c', and the code in search_buffer() that determines whether that code can be used, knows that "field 3" in a character always corresponds to the last byte in the textual representation of the character. (This is important because the Boyer-Moore algorithm works by looking at the last byte of the search string and &&#### finish this.



25.8 Byte/Character Types; Buffer Positions; Other Typedefs

25.8.1 Byte Types  
25.8.2 Different Ways of Seeing Internal Text  
25.8.3 Buffer Positions  
25.8.4 Other Typedefs  
25.8.5 Usage of the Various Representations  
25.8.6 Working With the Various Representations  



25.8.1 Byte Types

Stuff pointed to by a char * or unsigned char * will nearly always be one of the following types:

Types (b), (c), (f) and (h) are defined as char, while the others are unsigned char. This is for maximum safety (signed characters are dangerous to work with) while maintaining as much compatibility with external APIs and string constants as possible.

We also provide versions of the above types defined with different underlying C types, for API compatibility. These use the following prefixes:

 
C = plain char, when the base type is unsigned
U = unsigned
S = signed

(Formerly I had a comment saying that type (e) "should be replaced with void *". However, there are in fact many places where an unsigned char * might be used -- e.g. for ease in pointer computation, since void * doesn't allow this, and for compatibility with external APIs.)

Note that these typedefs are purely for documentation purposes; from the C code's perspective, they are exactly equivalent to char *, unsigned char *, etc., so you can freely use them with library functions declared as such.

Using these more specific types rather than the general ones helps avoid the confusions that occur when the semantics of a char * or unsigned char * argument being studied are unclear. Furthermore, by requiring that ALL uses of char be replaced with some other type as part of the Mule-ization process, we can use a search for char as a way of finding code that has not been properly Mule-ized yet.



25.8.2 Different Ways of Seeing Internal Text

There are various ways of representing internal text. The two primary ways are as an "array" of individual characters and as a "stream" of bytes. In the ASCII world, where there are at most 256 characters (one per byte value), things are easy because each character fits into a byte. In general, however, this is not true -- see the above discussion of characters vs. encodings.

In some cases, it's also important to distinguish between a stream representation as a series of bytes and as a series of textual units. This is particularly important wrt Unicode. The UTF-16 representation (sometimes referred to, rather sloppily, as simply the "Unicode" format) represents text as a series of 16-bit units. Mostly, each unit corresponds to a single character, but not necessarily, as characters outside of the range 0-65535 (the BMP or "Basic Multilingual Plane" of Unicode) require two 16-bit units, through the mechanism of "surrogates". When a series of 16-bit units is serialized into a byte stream, there are at least two possible representations, little-endian and big-endian, and which one is used may depend on the native format of 16-bit integers in the CPU of the machine that XEmacs is running on. (Similarly, UTF-32 is logically a representation with 32-bit textual units.)

Specifically:

Thus, we can imagine three levels in the representation of textual data:

 
series of characters -> series of textual units -> series of bytes
       [Ichar]                 [Itext]                 [Ibyte]
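The three levels can be made concrete with UTF-16, where the distinction between textual units and bytes is most visible. This sketch uses standard UTF-16 rules and plain C integer types rather than the XEmacs typedefs; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only. */

/* Character -> textual units: encode one Unicode code point as one
   or two 16-bit units.  Returns the number of units written. */
size_t
utf16_units (uint32_t cp, uint16_t out[2])
{
  assert (cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;                   /* inside the BMP */
      return 1;
    }
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
  return 2;
}

/* Textual units -> bytes: serialize N units into 2 * N bytes in the
   requested byte order. */
void
utf16_serialize (const uint16_t *units, size_t n, int big_endian,
                 unsigned char *out)
{
  size_t i;

  for (i = 0; i < n; i++)
    {
      unsigned char hi = (unsigned char) (units[i] >> 8);
      unsigned char lo = (unsigned char) (units[i] & 0xFF);

      out[2 * i] = big_endian ? hi : lo;
      out[2 * i + 1] = big_endian ? lo : hi;
    }
}
```

One character outside the BMP thus becomes two textual units and four bytes, and the same units give two different byte streams depending on the byte order chosen.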

XEmacs has three corresponding typedefs:

Internal text in stream format can be simultaneously viewed as either Itext * or Ibyte *. The Ibyte * representation is convenient for copying data from one place to another, because such routines usually expect byte counts. However, Itext * is much better for actually working with the data.

From a text-unit perspective, units 0 through 127 will always be ASCII compatible, and data in Lisp strings (and other textual data generated as a whole, e.g. from external conversion) will be followed by a null-unit terminator. From an Ibyte * perspective, however, the encoding is only ASCII-compatible if it uses 1-byte units.

Similarly to the different text representations, three integral count types exist -- Charcount, Textcount and Bytecount.

NOTE: Despite the presence of the terminator, internal text itself can have nulls in it! (Null text units, not just the null bytes present in any UTF-16 encoding.) The terminator is present because in many cases internal text is passed to routines that will ultimately pass the text to library functions that cannot handle embedded nulls, e.g. functions manipulating filenames, and it is a real hassle to have to pass the length around constantly. But this can lead to sloppy coding! We need to be careful about watching for nulls in places that are important, e.g. manipulating string objects or passing data to/from the clipboard.
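The danger described in the note is easy to demonstrate. In this sketch, struct sized_text and has_embedded_null() are hypothetical names; the point is only that the explicit length, not the terminator, is authoritative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration: text carried with an explicit length.
   data[len] is the terminator and is not counted in len. */
struct sized_text
{
  const unsigned char *data;
  size_t len;
};

/* True if the text contains a null before its terminator, in which
   case functions such as strlen() will see only part of it. */
int
has_embedded_null (struct sized_text text)
{
  return memchr (text.data, 0, text.len) != NULL;
}
```

For the five bytes 'a', 'b', 0, 'c', 'd' followed by a terminator, has_embedded_null() is true and strlen() would report a length of only 2.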

Ibyte
The data in a buffer or string is logically made up of Ibyte objects, where an Ibyte takes up the same amount of space as a char. (It is declared differently, though, to catch invalid usages.) Strings stored using Ibytes are said to be in "internal format". The important characteristics of internal format are

This leads to a number of desirable properties:

Itext

#### Document me.

Ichar
This typedef represents a single Emacs character, which can be ASCII, ISO-8859, or some extended character, as would typically be used for Kanji. Note that the representation of a character as an Ichar is not the same as the representation of that same character in a string; thus, you cannot do the standard C trick of passing a pointer to a character to a function that expects a string.

An Ichar takes up 21 bits of representation and (for code compatibility and such) is compatible with an int. This representation is visible on the Lisp level. The important characteristics of the Ichar representation are

This means that Ichar values are upwardly compatible with the standard 8-bit representation of ASCII/ISO-8859-1.

Extbyte
Strings that go in or out of Emacs are in "external format", typedef'ed as an array of char or a char *. There is more than one external format (JIS, EUC, etc.). Many of them (e.g. JIS) are modal encodings, which is to say that the meaning of particular bytes is not fixed but depends on what "mode" the string is currently in (e.g. bytes in the range 0 - 0x7f might be interpreted as ASCII, or as Hiragana, or as 2-byte Kanji, depending on the current mode). The mode starts out in ASCII/ISO-8859-1 and is switched using escape sequences -- for example, in the JIS encoding, 'ESC $ B' switches to a mode where pairs of bytes in the range 0 - 0x7f are interpreted as Kanji characters.

External-formatted data is generally desirable for passing data between programs because it is upwardly compatible with standard ASCII/ISO-8859-1 strings and may require less space than internal encodings such as the one described above. In addition, some encodings (e.g. JIS) keep all characters (except the ESC used to switch modes) in the printing ASCII range 0x20 - 0x7e, which results in a much higher probability that the data will avoid being garbled in transmission. Externally-formatted data is generally not very convenient to work with, however, and for this reason is usually converted to internal format before any work is done on the string.

NOTE: filenames need to be in external format so that ISO-8859-1 characters come out correctly.



25.8.3 Buffer Positions

There are three possible ways to specify positions in a buffer. All of these are one-based: the beginning of the buffer is position or index 1, and 0 is not a valid position.

As a "buffer position" (typedef Charbpos):

This is an index specifying an offset in characters from the beginning of the buffer. Note that buffer positions are logically between characters, not on a character. The difference between two buffer positions specifies the number of characters between those positions. Buffer positions are the only kind of position externally visible to the user.

As a "byte index" (typedef Bytebpos):

This is an index over the bytes used to represent the characters in the buffer. If there is no Mule support, this is identical to a buffer position, because each character is represented using one byte. However, with Mule support, many characters require two or more bytes for their representation, and so a byte index may be greater than the corresponding buffer position.

As a "memory index" (typedef Membpos):

This is the byte index adjusted for the gap. For positions before the gap, this is identical to the byte index. For positions after the gap, this is the byte index plus the gap size. There are two possible memory indices for the gap position; the memory index at the beginning of the gap should always be used, except in code that deals with manipulating the gap, where both indices may be seen. The address of the character "at" (i.e. following) a particular position can be obtained from the formula

buffer_start_address + memory_index(position) - 1

except in the case of characters at the gap position.
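The relation between byte indices, memory indices and addresses can be sketched as follows. The struct and its field names are hypothetical and much simpler than the real buffer structure; indices are one-based, as described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical gap buffer for illustration only. */
struct gap_buffer
{
  unsigned char *start;   /* address of the first byte of storage */
  size_t gap_start;       /* one-based byte index of the gap position */
  size_t gap_size;        /* number of unused bytes in the gap */
};

/* Byte index -> memory index.  At the gap position itself, the
   memory index at the beginning of the gap is returned. */
size_t
byte_to_memory_index (const struct gap_buffer *buf, size_t byte_index)
{
  assert (byte_index >= 1);
  return byte_index > buf->gap_start
    ? byte_index + buf->gap_size
    : byte_index;
}

/* Address of the byte "at" (i.e. following) a byte index.  The gap
   position is the special case: the following byte is stored on the
   far side of the gap. */
const unsigned char *
byte_address (const struct gap_buffer *buf, size_t byte_index)
{
  size_t mem = byte_to_memory_index (buf, byte_index);

  if (byte_index == buf->gap_start)
    mem += buf->gap_size;
  return buf->start + mem - 1;
}
```

With storage holding "ab", a gap of three bytes, and then "cd", the gap position is byte index 3; byte indices 1 to 4 then address 'a', 'b', 'c' and 'd' respectively.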



25.8.4 Other Typedefs

Charcount
This typedef represents a count of characters, such as a character offset into a string or the number of characters between two positions in a buffer. The difference between two Charbpos's is a Charcount, and character positions in a string are represented using a Charcount.

Textcount
#### Document me.

Bytecount
Similar to a Charcount but represents a count of bytes. The difference between two Bytebpos's is a Bytecount.
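Converting between the two kinds of count requires looking at the text itself. The following sketch assumes the internal string encoding described earlier in this chapter, in which the first byte of every character is below 0xA0 and all following bytes are 0xA0 or above. It uses size_t instead of the real typedefs, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only. */

/* Number of characters in the first BYTES bytes of TEXT: every
   character contributes exactly one byte below 0xA0. */
size_t
bytecount_to_charcount (const unsigned char *text, size_t bytes)
{
  size_t i, chars = 0;

  for (i = 0; i < bytes; i++)
    if (text[i] < 0xA0)
      chars++;
  return chars;
}

/* Number of bytes occupied by the first CHARS characters of TEXT,
   which is BYTES bytes long in total. */
size_t
charcount_to_bytecount (const unsigned char *text, size_t bytes,
                        size_t chars)
{
  size_t i = 0;

  while (chars > 0 && i < bytes)
    {
      i++;                                    /* the first byte */
      while (i < bytes && text[i] >= 0xA0)    /* following bytes */
        i++;
      chars--;
    }
  return i;
}
```

Both functions are linear in the length of the text, which is one reason why code that needs to be fast prefers to stay with byte counts.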



25.8.5 Usage of the Various Representations

Memory indices are used in low-level functions in insdel.c and for extent endpoints and marker positions. The reason for this is that this way, the extents and markers don't need to be updated for most insertions, which merely shrink the gap and don't move any characters around in memory.

(The beginning-of-gap memory index simplifies insertions w.r.t. markers, because text usually gets inserted after markers. For extents, it is merely for consistency, because text can get inserted either before or after an extent's endpoint depending on the open/closedness of the endpoint.)

Byte indices are used in other code that needs to be fast, such as the searching, redisplay, and extent-manipulation code.

Buffer positions are used in all other code. This is because this representation is easiest to work with (especially since Lisp code always uses buffer positions), necessitates the fewest changes to existing code, and is the safest (e.g. if the text gets shifted underneath a buffer position, it will still point to a character; if text is shifted under a byte index, it might point to the middle of a character, which would be bad).

Similarly, Charcounts are used in all code that deals with strings except for code that needs to be fast, which uses Bytecounts.

Strings are always passed around internally using internal format. Conversions to and from external format are performed at the time that the data goes in or out of Emacs.



25.8.6 Working With the Various Representations

We write things this way because it's very important that MAX_BYTEBPOS_GAP_SIZE_3 is a multiple of 3. (As it happens, 65535 is a multiple of 3, but this may not always be the case. #### unfinished



25.9 Internal Text APIs

NOTE: The most current documentation for these APIs is in `text.h'. In case of error, assume that file is correct and this one wrong.

25.9.1 Basic internal-format APIs  
25.9.2 The DFC API  
25.9.3 The Eistring API  



25.9.1 Basic internal-format APIs

These are simple functions and macros to convert between text representation and characters, move forward and back in text, etc.

#### Finish the rest of this.

Use the following functions/macros on contiguous text in any of the internal formats. Those that take a format arg work on all internal formats; the others work only on the default (variable-width under Mule) format. If the text you're operating on is known to come from a buffer, use the buffer-level functions in buffer.h, which automatically know the correct format and handle the gap.

Some terminology:

"itext" appearing in the macros means "internal-format text" -- type Ibyte *. Operations on such pointers themselves, rather than on the text being pointed to, have "itextptr" instead of "itext" in the macro name. "ichar" in the macro names means an Ichar -- the representation of a character as a single integer rather than a series of bytes, as part of "itext". Many of the macros below are for converting between the two representations of characters.

Note also that we try to consistently distinguish between an "Ichar" and a Lisp character. Stuff working with Lisp characters often just says "char", so we consistently use "Ichar" when that's what we're working with.



25.9.2 The DFC API

This is for conversion between internal and external text. Note that there is also the "new DFC" API, which returns a pointer to the converted text (in alloca space), rather than storing it into a variable.

The macros below are used for converting data between different formats. Generally, the data is textual, and the formats are related to internationalization (e.g. converting between internal-format text and UTF-8) -- but the mechanism is general, and could be used for anything, e.g. decoding gzipped data.

In general, conversion involves a source of data, a sink, the existing format of the source data, and the desired format of the sink. The macros below, however, always require that either the source or sink is internal-format text. Therefore, in practice the conversions below involve source, sink, an external format (specified by a coding system), and the direction of conversion (internal->external or vice-versa).

Sources and sinks can be raw data (sized or unsized -- when unsized, input data is assumed to be null-terminated [double null-terminated for Unicode-format data], and on output the length is not stored anywhere), Lisp strings, Lisp buffers, lstreams, and opaque data objects. When the output is raw data, the result can be allocated either with alloca() or malloc(). (There is currently no provision for writing into a fixed buffer. If you want this, use alloca() output and then copy the data -- but be careful with the size! Unless you are very sure of the encoding being used, upper bounds for the size are not in general computable.) The obvious restrictions on source and sink types apply (e.g. Lisp strings are a source and sink only for internal data).

All raw data output will contain an extra null byte (two bytes for Unicode -- currently, in fact, all output data, whether internal or external, is double-null-terminated, but you can't count on this; see below). This means that enough space is allocated to contain the extra nulls; however, these nulls are not reflected in the returned output size.
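The convention of allocating room for terminating nulls without counting them in the size can be sketched as follows. copy_with_terminators() is a hypothetical helper written for this illustration; it is not part of the DFC API and performs no conversion at all.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper for illustration only.  Copy LEN bytes into
   malloc()ed storage followed by two null bytes.  *OUT_LEN receives
   the size of the data, which does not include the nulls.  Returns
   NULL if allocation fails.  The caller frees the result. */
unsigned char *
copy_with_terminators (const unsigned char *src, size_t len,
                       size_t *out_len)
{
  unsigned char *dst = malloc (len + 2);

  assert (src != NULL && out_len != NULL);
  if (dst == NULL)
    return NULL;
  memcpy (dst, src, len);
  dst[len] = 0;        /* the allocation has room for both nulls ... */
  dst[len + 1] = 0;
  *out_len = len;      /* ... but the reported size leaves them out */
  return dst;
}
```

A caller can therefore pass the result to functions expecting null-terminated data, while still using the returned size for data that may contain embedded nulls.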

The most basic macros are TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT. These can be used to convert between any kinds of sources or sinks. However, 99% of conversions involve raw data or Lisp strings as both source and sink, and usually data is output as alloca() rather than malloc(). For this reason, convenience macros are defined for many types of conversions involving raw data and/or Lisp strings, especially when the output is an alloca()ed string. (When the destination is a Lisp_String, there are other functions that should be used instead -- build_ext_string() and make_ext_string(), for example.) The convenience macros are of two types -- the older kind that store the result into a specified variable, and the newer kind that return the result. The newer kind of macros don't exist when the output is sized data, because that would have two return values. NOTE: All convenience macros are ultimately defined in terms of TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT. Thus, any comments below about the workings of these macros also apply to all convenience macros.

 
TO_EXTERNAL_FORMAT (source_type, source, sink_type, sink, codesys)
TO_INTERNAL_FORMAT (source_type, source, sink_type, sink, codesys)

Typical use is

 
   TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name);

which means that the contents of the lisp string str are written to a malloc'ed memory area which will be pointed to by ptr, after the function returns. The conversion will be done using the file-name coding system (which will be controlled by the user indirectly by setting or binding the variable file-name-coding-system).

Some sources and sinks require two C variables to specify. We use some preprocessor magic to allow different source and sink types, and even different numbers of arguments to specify different types of sources and sinks.

So we can have a call that looks like

 
   TO_INTERNAL_FORMAT (DATA, (ptr, len),
                       MALLOC, (ptr, len),
                       coding_system);

The parenthesized argument pairs are required to make the preprocessor magic work.
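The pair-unpacking idea can be sketched in isolation as follows. This is only an illustration of the general preprocessor technique, not the actual dfc_ implementation; every name in it (`demo_source`, `DEMO_CAR`, `DEMO_CDR`, `DEMO_SOURCE_*`, `DEMO_SET_SOURCE`, `demo_data_len`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record describing a source; not an XEmacs type. */
struct demo_source
{
  const char *ptr;
  size_t len;
};

/* Applied to a parenthesized pair:
   DEMO_CAR (a, b) => (a), DEMO_CDR (a, b) => (b). */
#define DEMO_CAR(x, y) (x)
#define DEMO_CDR(x, y) (y)

/* One unpacker per source type.  With val == (p, n), the text
   "DEMO_CAR val" becomes "DEMO_CAR (p, n)", a two-argument macro call.
   This is why pair arguments must be parenthesized by the caller. */
#define DEMO_SOURCE_DATA(val, out) \
  ((out).ptr = DEMO_CAR val, (out).len = DEMO_CDR val)
#define DEMO_SOURCE_C_STRING(val, out) \
  ((out).ptr = (val), (out).len = strlen (val))

/* Token pasting selects the unpacker from the type keyword. */
#define DEMO_SET_SOURCE(type, val, out) DEMO_SOURCE_##type (val, out)

/* Example consumer: unpack a (pointer, length) pair and report its size. */
size_t
demo_data_len (const char *buf, size_t n)
{
  struct demo_source src;
  DEMO_SET_SOURCE (DATA, (buf, n), src);
  assert (src.ptr == buf);
  return src.len;
}
```

The same `DEMO_SET_SOURCE` macro accepts either one plain argument or one parenthesized pair, depending on the type keyword, which is the effect described above.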

NOTE: GC is inhibited during the entire operation of these macros. This is because frequently the data to be converted comes from strings but gets passed in as just DATA, and GC may move around the string data. If we didn't inhibit GC, there'd have to be a lot of messy recoding, alloca-copying of strings and other annoying stuff.

The source or sink can be specified in one of these ways:

 
DATA,   (ptr, len),    // input data is a fixed buffer of size len
ALLOCA, (ptr, len),    // output data is in an ALLOCA()ed buffer of size len
MALLOC, (ptr, len),    // output data is in a malloc()ed buffer of size len
C_STRING_ALLOCA, ptr,  // equivalent to ALLOCA (ptr, len_ignored) on output
C_STRING_MALLOC, ptr,  // equivalent to MALLOC (ptr, len_ignored) on output
C_STRING,     ptr,     // equivalent to DATA, (ptr, strlen/wcslen (ptr))
                       // on input (the Unicode version is used when correct)
LISP_STRING,  string,  // input or output is a Lisp_Object of type string
LISP_BUFFER,  buffer,  // output is written to (point) in lisp buffer
LISP_LSTREAM, lstream, // input or output is a Lisp_Object of type lstream
LISP_OPAQUE,  object,  // input or output is a Lisp_Object of type opaque

When specifying the sink, use lvalues, since the macro will assign to them, except when the sink is an lstream or a lisp buffer.

For the sink types ALLOCA and C_STRING_ALLOCA, the resulting text is stored in a stack-allocated buffer, which is automatically freed on returning from the function. However, the sink types MALLOC and C_STRING_MALLOC return xmalloc()ed memory. The caller is responsible for freeing this memory using xfree().

The macros accept the kinds of sources and sinks appropriate for internal and external data representation. See the type_checking_assert macros below for the actual allowed types.

Since some sources and sinks use one argument (a Lisp_Object) to specify them, while others take a (pointer, length) pair, we use some C preprocessor trickery to allow pair arguments to be specified by parenthesizing them, as in the examples above.

Anything prefixed by dfc_ (`data format conversion') is private. They are only used to implement these macros.

[[Using C_STRING* is appropriate for using with external APIs that take null-terminated strings. For internal data, we should try to be '\0'-clean - i.e. allow arbitrary data to contain embedded '\0'.

Sometime in the future we might allow output to C_STRING_ALLOCA or C_STRING_MALLOC _only_ with TO_EXTERNAL_FORMAT(), not TO_INTERNAL_FORMAT().]]

The above comments are not true. Frequently (most of the time, in fact), external strings come as zero-terminated entities, where the zero-termination is the only way to find out the length. Even in cases where you can get the length, most of the time the system will still use the null to signal the end of the string, and there will still be no way to either send in or receive a string with embedded nulls. In such situations, it's pointless to track the length because null bytes can never be in the string. We have a lot of operations that make it easy to operate on zero-terminated strings, and forcing the user to deal with the length everywhere would only make the code uglier and more complicated, for no gain. --ben

There is no problem using the same lvalue for source and sink.

Also, when pointers are required, the code (currently at least) is lax and allows any pointer types, either in the source or the sink. This makes it possible, e.g., to deal with internal format data held in char *'s or external format data held in WCHAR * (i.e. Unicode).

Finally, whenever storage allocation is called for, extra space is allocated for a terminating zero, and such a zero is stored in the appropriate place, regardless of whether the source data was specified using a length or was specified as zero-terminated. This allows you to freely pass the resulting data, no matter how obtained, to a routine that expects zero termination (modulo, of course, that any embedded zeros in the resulting text will cause truncation). In fact, currently two terminating zeros are allocated and stored after the data result. This is to allow for the possibility of storing a Unicode value on output, which needs the two zeros. Currently, however, the two zeros are stored regardless of whether the conversion is internal or external and regardless of whether the external coding system is in fact Unicode. This behavior may change in the future, and you cannot rely on this -- the most you can rely on is that sink data in Unicode format will have two terminating nulls, which combine to form one Unicode null character.
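The termination rule can be illustrated with a small stand-alone sketch. It is not the code used by the conversion macros; `demo_copy_terminated` is a hypothetical helper that only mimics the behavior described above: two zero bytes are stored after the data, and the reported length excludes them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy LEN bytes of DATA into a malloc()ed block and
   append two zero bytes.  *LEN_OUT (if non-NULL) receives LEN; the
   terminators are not counted.  The caller frees the result. */
char *
demo_copy_terminated (const char *data, size_t len, size_t *len_out)
{
  char *copy = malloc (len + 2);
  assert (copy != NULL);
  if (len > 0)
    memcpy (copy, data, len);
  copy[len] = '\0';
  copy[len + 1] = '\0';
  if (len_out)
    *len_out = len;
  return copy;
}
```

A routine that expects zero termination sees only the bytes up to the first embedded zero, even though the reported length covers all of the data.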

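NOTE: You might ask, why are these not written as functions that RETURN the converted string, since that would allow them to be used much more conveniently, without having to constantly declare temporary variables? The answer is that in fact I originally did write the routines that way, but that required either

(a) calling alloca() inside of a function call, or
(b) using expressions separated by commas and a temporary variable, or
(c) using the GCC extension ({ ... }).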

Turned out that all of the above had bugs, all caused by GCC (hence the comments about "those GCC wankers" and "ream gcc up the ass"). As for (a), some versions of GCC (especially on Intel platforms) had buggy implementations of alloca() that couldn't handle being called inside of a function call -- they just decremented the stack right in the middle of pushing args. Oops, crash with stack trashing, very bad. (b) was an attempt to fix (a), and that led to further GCC crashes, esp. when you had two such calls in a single subexpression, because GCC couldn't be counted upon to follow even a minimally reasonable order of execution. True, you can't count on one argument being evaluated before another, but GCC would actually interleave them so that the temp var got stomped on by one while the other was accessing it. So I tried (c), which was problematic because that GCC extension has more bugs in it than a termite's nest.

So reluctantly I converted to the current way. Now, that was a while ago (c. 1994), and it appears that the bug involving alloca in function calls has long since been fixed. More recently, I defined the new-dfc routines down below, which DO allow exactly such convenience of returning your args rather than storing them in temp variables, and I also wrote a configure check to see whether alloca() causes crashes inside of function calls, and if so use the portable alloca() implementation in alloca.c. If you define TEST_NEW_DFC, the old routines get written in terms of the new ones, and I've had a beta put out with this on and this appears to cause no problems -- so we should consider switching, and feel no compunctions about writing further such function-like alloca() routines in lieu of statement-like ones. --ben



25.9.3 The Eistring API

(This API is currently under-used) When doing simple things with internal text, the basic internal-format APIs are enough. But to do things like delete or replace a substring, concatenate various strings, etc. is difficult to do cleanly because of the allocation issues. The Eistring API is designed to deal with this, and provides a clean way of modifying and building up internal text. (Note that the former lack of this API has meant that some code uses Lisp strings to do similar manipulations, resulting in excess garbage and increased garbage collection.)

NOTE: The Eistring API is (or should be) Mule-correct even without an ASCII-compatible internal representation.

 
#### NOTE: This is a work in progress.  Neither the API nor especially
the implementation is finished.

NOTE: An Eistring is a structure that makes it easy to work with
internally-formatted strings of data.  It provides operations similar
in feel to the standard strcpy(), strcat(), strlen(), etc., but

(a) it is Mule-correct
(b) it does dynamic allocation so you never have to worry about size
    restrictions
(c) it comes in an ALLOCA() variety (all allocation is stack-local,
    so there is no need to explicitly clean up) as well as a malloc()
    variety
(d) it knows its own length, so it does not suffer from standard null
    byte brain-damage -- but it null-terminates the data anyway, so
    it can be passed to standard routines
(e) it provides a much more powerful set of operations and knows about
    all the standard places where string data might reside: Lisp_Objects,
    other Eistrings, Ibyte * data with or without an explicit length,
    ASCII strings, Ichars, etc.
(f) it provides easy operations to convert to/from externally-formatted
    data, and is easier to use than the standard TO_INTERNAL_FORMAT
    and TO_EXTERNAL_FORMAT macros. (An Eistring can store both the internal
    and external version of its data, but the external version is only
    initialized or changed when you call eito_external().)

The idea is to make it as easy to write Mule-correct string manipulation
code as it is to write normal string manipulation code.  We also make
the API sufficiently general that it can handle multiple internal data
formats (e.g. some fixed-width optimizing formats and a default variable
width format) and allows for ANY data format we might choose in the
future for the default format, including UCS2. (In other words, we can't
assume that the internal format is ASCII-compatible and we can't assume
it doesn't have embedded null bytes.  We do assume, however, that any
chosen format will have the concept of null-termination.) All of this is
hidden from the user.
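Properties (b) and (d) above can be illustrated with a toy type. This is a sketch only, assuming a plain byte string on the heap; `demo_str`, `demo_init`, `demo_cat_raw` and `demo_free` are hypothetical names, and the real Eistring additionally handles Mule text, stack allocation and external data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy type, not the real Eistring structure. */
typedef struct
{
  unsigned char *data;   /* zero-terminated once allocated */
  size_t len;            /* bytes in use, not counting the terminator */
  size_t cap;            /* bytes allocated */
} demo_str;

void
demo_init (demo_str *s)
{
  s->data = NULL;
  s->len = 0;
  s->cap = 0;
}

/* Append LEN bytes (which may include zero bytes), growing as needed. */
void
demo_cat_raw (demo_str *s, const void *bytes, size_t len)
{
  size_t needed = s->len + len + 1;   /* +1 for the terminator */
  if (needed > s->cap)
    {
      size_t newcap = s->cap ? s->cap : 16;
      while (newcap < needed)
        newcap *= 2;
      s->data = realloc (s->data, newcap);
      assert (s->data != NULL);
      s->cap = newcap;
    }
  if (len > 0)
    memcpy (s->data + s->len, bytes, len);
  s->len += len;
  s->data[s->len] = '\0';
}

/* Release the storage and return to the empty state. */
void
demo_free (demo_str *s)
{
  free (s->data);
  demo_init (s);
}
```

Because the length is stored explicitly, embedded zero bytes survive concatenation, while the trailing zero still lets the data be handed to routines that expect termination.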

#### It is really too bad that we don't have a real object-oriented
language, or at least a language with polymorphism!


 ********************************************** 
 *                 Declaration                * 
 ********************************************** 

To declare an Eistring, either put one of the following in the local
variable section:

DECLARE_EISTRING (name);
     Declare a new Eistring and initialize it to the empty string.  This
     is a standard local variable declaration and can go anywhere in the
     variable declaration section.  NAME itself is declared as an
     Eistring *, and its storage declared on the stack.

DECLARE_EISTRING_MALLOC (name);
     Declare and initialize a new Eistring, which uses malloc()ed
     instead of ALLOCA()ed data.  This is a standard local variable
     declaration and can go anywhere in the variable declaration
     section.  Once you initialize the Eistring, you will have to free
     it using eifree() to avoid memory leaks.  You will need to use this
     form if you are passing an Eistring to any function that modifies
     it (otherwise, the modified data may be in stack space and get
     overwritten when the function returns).

or use

Eistring ei;
void eiinit (Eistring *ei);
void eiinit_malloc (Eistring *einame);
     If you need to put an Eistring elsewhere than in a local variable
     declaration (e.g. in a structure), declare it as shown and then
     call one of the init macros.

Also note:

void eifree (Eistring *ei);
     If you declared an Eistring to use malloc() to hold its data,
     or converted it to the heap using eito_malloc(), then this
     releases any data in it and afterwards resets the Eistring
     using eiinit_malloc().  Otherwise, it just resets the Eistring
     using eiinit().


 ********************************************** 
 *                 Conventions                * 
 ********************************************** 

 - The names of the functions have been chosen, where possible, to
   match the names of str*() functions in the standard C API.
 - 


 ********************************************** 
 *               Initialization               * 
 ********************************************** 

void eireset (Eistring *eistr);
     Initialize the Eistring to the empty string.

void eicpy_* (Eistring *eistr, ...);
     Initialize the Eistring from somewhere:

void eicpy_ei (Eistring *eistr, Eistring *eistr2);
     ... from another Eistring.
void eicpy_lstr (Eistring *eistr, Lisp_Object lisp_string);
     ... from a Lisp_Object string.
void eicpy_ch (Eistring *eistr, Ichar ch);
     ... from an Ichar (this can be a conventional C character).

void eicpy_lstr_off (Eistring *eistr, Lisp_Object lisp_string,
                     Bytecount off, Charcount charoff,
                     Bytecount len, Charcount charlen);
     ... from a section of a Lisp_Object string.
void eicpy_lbuf (Eistring *eistr, Lisp_Object lisp_buf,
     	    Bytecount off, Charcount charoff,
     	    Bytecount len, Charcount charlen);
     ... from a section of a Lisp_Object buffer.
void eicpy_raw (Eistring *eistr, const Ibyte *data, Bytecount len);
     ... from raw internal-format data in the default internal format.
void eicpy_rawz (Eistring *eistr, const Ibyte *data);
     ... from raw internal-format data in the default internal format
     that is "null-terminated" (the meaning of this depends on the nature
     of the default internal format).
void eicpy_raw_fmt (Eistring *eistr, const Ibyte *data, Bytecount len,
                    Internal_Format intfmt, Lisp_Object object);
     ... from raw internal-format data in the specified format.
void eicpy_rawz_fmt (Eistring *eistr, const Ibyte *data,
                     Internal_Format intfmt, Lisp_Object object);
     ... from raw internal-format data in the specified format that is
     "null-terminated" (the meaning of this depends on the nature of
     the specific format).
void eicpy_c (Eistring *eistr, const Ascbyte *c_string);
     ... from an ASCII null-terminated string.  Non-ASCII characters in
     the string are ILLEGAL (read abort() with error-checking defined).
void eicpy_c_len (Eistring *eistr, const Ascbyte *c_string, len);
     ... from an ASCII string, with length specified.  Non-ASCII characters
     in the string are ILLEGAL (read abort() with error-checking defined).
void eicpy_ext (Eistring *eistr, const Extbyte *extdata,
                Lisp_Object codesys);
     ... from external null-terminated data, with coding system specified.
void eicpy_ext_len (Eistring *eistr, const Extbyte *extdata,
                    Bytecount extlen, Lisp_Object codesys);
     ... from external data, with length and coding system specified.
void eicpy_lstream (Eistring *eistr, Lisp_Object lstream);
     ... from an lstream; reads data till eof.  Data must be in default
     internal format; otherwise, interpose a decoding lstream.


 ********************************************** 
 *    Getting the data out of the Eistring    * 
 ********************************************** 

Ibyte *eidata (Eistring *eistr);
     Return a pointer to the raw data in an Eistring.  This is NOT
     a copy.

Lisp_Object eimake_string (Eistring *eistr);
     Make a Lisp string out of the Eistring.

Lisp_Object eimake_string_off (Eistring *eistr,
                               Bytecount off, Charcount charoff,
     			  Bytecount len, Charcount charlen);
     Make a Lisp string out of a section of the Eistring.

void eicpyout_alloca (Eistring *eistr, LVALUE: Ibyte *ptr_out,
                      LVALUE: Bytecount len_out);
     Make an ALLOCA() copy of the data in the Eistring, using the
     default internal format.  Due to the nature of ALLOCA(), this
     must be a macro, with all lvalues passed in as parameters.
     (More specifically, not all compilers correctly handle using
     ALLOCA() as the argument to a function call -- GCC on x86
     didn't use to, for example.) A pointer to the ALLOCA()ed data
     is stored in PTR_OUT, and the length of the data (not including
     the terminating zero) is stored in LEN_OUT.

void eicpyout_alloca_fmt (Eistring *eistr, LVALUE: Ibyte *ptr_out,
                          LVALUE: Bytecount len_out,
                          Internal_Format intfmt, Lisp_Object object);
     Like eicpyout_alloca(), but converts to the specified internal
     format. (No formats other than FORMAT_DEFAULT are currently
     implemented, and you get an assertion failure if you try.)

Ibyte *eicpyout_malloc (Eistring *eistr, Bytecount *intlen_out);
     Make a malloc() copy of the data in the Eistring, using the
     default internal format.  This is a real function.  No lvalues
     passed in.  Returns the new data, and stores the length (not
     including the terminating zero) using INTLEN_OUT, unless it's
     a NULL pointer.

Ibyte *eicpyout_malloc_fmt (Eistring *eistr, Internal_Format intfmt,
                              Bytecount *intlen_out, Lisp_Object object);
     Like eicpyout_malloc(), but converts to the specified internal
     format. (No formats other than FORMAT_DEFAULT are currently
     implemented, and you get an assertion failure if you try.)


 ********************************************** 
 *             Moving to the heap             * 
 ********************************************** 

void eito_malloc (Eistring *eistr);
     Move this Eistring to the heap.  Its data will be stored in a
     malloc()ed block rather than the stack.  Subsequent changes to
     this Eistring will realloc() the block as necessary.  Use this
     when you want the Eistring to remain in scope past the end of
     this function call.  You will have to manually free the data
     in the Eistring using eifree().

void eito_alloca (Eistring *eistr);
     Move this Eistring back to the stack, if it was moved to the
     heap with eito_malloc().  This will automatically free any
     heap-allocated data.



 ********************************************** 
 *            Retrieving the length           * 
 ********************************************** 

Bytecount eilen (Eistring *eistr);
     Return the length of the internal data, in bytes.  See also
     eiextlen(), below.
Charcount eicharlen (Eistring *eistr);
     Return the length of the internal data, in characters.


 ********************************************** 
 *           Working with positions           * 
 ********************************************** 

Bytecount eicharpos_to_bytepos (Eistring *eistr, Charcount charpos);
     Convert a char offset to a byte offset.
Charcount eibytepos_to_charpos (Eistring *eistr, Bytecount bytepos);
     Convert a byte offset to a char offset.
Bytecount eiincpos (Eistring *eistr, Bytecount bytepos);
     Increment the given position by one character.
Bytecount eiincpos_n (Eistring *eistr, Bytecount bytepos, Charcount n);
     Increment the given position by N characters.
Bytecount eidecpos (Eistring *eistr, Bytecount bytepos);
     Decrement the given position by one character.
Bytecount eidecpos_n (Eistring *eistr, Bytecount bytepos, Charcount n);
     Decrement the given position by N characters.
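The relationship between byte positions and character positions can be sketched with a stand-alone example. It assumes UTF-8 as a stand-in variable-width encoding, which is NOT the XEmacs internal format, and all `demo_` names are hypothetical; the point is only to show why stepping by one character may move the byte position by more than one.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: UTF-8 stands in for a variable-width internal format.
   A byte starts a character unless it is a continuation byte (10xxxxxx). */
int
demo_is_lead (unsigned char byte)
{
  return (byte & 0xC0) != 0x80;
}

/* Advance BYTEPOS by one character (similar in spirit to eiincpos()). */
size_t
demo_incpos (const unsigned char *data, size_t len, size_t bytepos)
{
  assert (bytepos < len);
  bytepos++;
  while (bytepos < len && !demo_is_lead (data[bytepos]))
    bytepos++;
  return bytepos;
}

/* Move BYTEPOS back by one character (similar in spirit to eidecpos()). */
size_t
demo_decpos (const unsigned char *data, size_t bytepos)
{
  assert (bytepos > 0);
  bytepos--;
  while (bytepos > 0 && !demo_is_lead (data[bytepos]))
    bytepos--;
  return bytepos;
}

/* Convert a character offset to a byte offset by stepping forward. */
size_t
demo_charpos_to_bytepos (const unsigned char *data, size_t len,
                         size_t charpos)
{
  size_t bytepos = 0;
  while (charpos-- > 0)
    bytepos = demo_incpos (data, len, bytepos);
  return bytepos;
}

/* Convert a byte offset to a character offset by counting lead bytes. */
size_t
demo_bytepos_to_charpos (const unsigned char *data, size_t bytepos)
{
  size_t i, count = 0;
  for (i = 0; i < bytepos; i++)
    if (demo_is_lead (data[i]))
      count++;
  return count;
}
```

Both conversions are linear in the distance covered, which is one reason the API lets callers work in byte positions directly.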


 ********************************************** 
 *    Getting the character at a position     * 
 ********************************************** 

Ichar eigetch (Eistring *eistr, Bytecount bytepos);
     Return the character at a particular byte offset.
Ichar eigetch_char (Eistring *eistr, Charcount charpos);
     Return the character at a particular character offset.


 ********************************************** 
 *    Setting the character at a position     * 
 ********************************************** 

Ichar eisetch (Eistring *eistr, Bytecount bytepos, Ichar chr);
     Set the character at a particular byte offset.
Ichar eisetch_char (Eistring *eistr, Charcount charpos, Ichar chr);
     Set the character at a particular character offset.


 ********************************************** 
 *               Concatenation                * 
 ********************************************** 

void eicat_* (Eistring *eistr, ...);
     Concatenate onto the end of the Eistring, with data coming from the
     same places as above:

void eicat_ei (Eistring *eistr, Eistring *eistr2);
     ... from another Eistring.
void eicat_c (Eistring *eistr, Ascbyte *c_string);
     ... from an ASCII null-terminated string.  Non-ASCII characters in
     the string are ILLEGAL (read abort() with error-checking defined).
void eicat_raw (ei, const Ibyte *data, Bytecount len);
     ... from raw internal-format data in the default internal format.
void eicat_rawz (ei, const Ibyte *data);
     ... from raw internal-format data in the default internal format
     that is "null-terminated" (the meaning of this depends on the nature
     of the default internal format).
void eicat_lstr (ei, Lisp_Object lisp_string);
     ... from a Lisp_Object string.
void eicat_ch (ei, Ichar ch);
     ... from an Ichar.

(All except the first variety are convenience functions.
In the general case, create another Eistring from the source.)


 ********************************************** 
 *                Replacement                 * 
 ********************************************** 

void eisub_* (Eistring *eistr, Bytecount off, Charcount charoff,
     			  Bytecount len, Charcount charlen, ...);
     Replace a section of the Eistring, specifically:

void eisub_ei (Eistring *eistr, Bytecount off, Charcount charoff,
     	  Bytecount len, Charcount charlen, Eistring *eistr2);
     ... with another Eistring.
void eisub_c (Eistring *eistr, Bytecount off, Charcount charoff,
     	 Bytecount len, Charcount charlen, Ascbyte *c_string);
     ... with an ASCII null-terminated string.  Non-ASCII characters in
     the string are ILLEGAL (read abort() with error-checking defined).
void eisub_ch (Eistring *eistr, Bytecount off, Charcount charoff,
     	  Bytecount len, Charcount charlen, Ichar ch);
     ... with an Ichar.

void eidel (Eistring *eistr, Bytecount off, Charcount charoff,
            Bytecount len, Charcount charlen);
     Delete a section of the Eistring.
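The allocation work that replacement and deletion hide from the caller can be sketched on a plain byte buffer. This is an illustration under simplifying assumptions, not the Eistring implementation: `demo_replace` is a hypothetical helper that works only in byte offsets (the real functions take both byte and character offsets) and assumes a malloc()ed, zero-terminated buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: in the malloc()ed, zero-terminated buffer BUF
   holding *LEN bytes, replace the DELLEN bytes at OFF with the INSLEN
   bytes at INS (which must not point into BUF).  Returns the possibly
   moved buffer and updates *LEN.  With INSLEN == 0 this is a deletion. */
char *
demo_replace (char *buf, size_t *len, size_t off, size_t dellen,
              const char *ins, size_t inslen)
{
  size_t oldlen = *len;
  size_t newlen;
  size_t tail;

  assert (off <= oldlen && dellen <= oldlen - off);
  newlen = oldlen - dellen + inslen;
  tail = oldlen - off - dellen;   /* bytes after the replaced span */

  if (newlen > oldlen)
    {
      buf = realloc (buf, newlen + 1);
      assert (buf != NULL);
    }
  /* Shift the tail (memmove handles overlap), then drop in the new data. */
  memmove (buf + off + inslen, buf + off + dellen, tail);
  if (inslen > 0)
    memcpy (buf + off, ins, inslen);
  buf[newlen] = '\0';
  *len = newlen;
  return buf;
}
```

Doing this by hand at every call site is the kind of bookkeeping that makes substring edits on raw internal text hard to write cleanly.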


 ********************************************** 
 *      Converting to an external format      * 
 ********************************************** 

void eito_external (Eistring *eistr, Lisp_Object codesys);
     Convert the Eistring to an external format and store the result
     in the string.  NOTE: Further changes to the Eistring will NOT
     change the external data stored in the string.  You will have to
     call eito_external() again in such a case if you want the external
     data.

Extbyte *eiextdata (Eistring *eistr);
     Return a pointer to the external data stored in the Eistring as
     a result of a prior call to eito_external().

Bytecount eiextlen (Eistring *eistr);
     Return the length in bytes of the external data stored in the
     Eistring as a result of a prior call to eito_external().


 ********************************************** 
 * Searching in the Eistring for a character  * 
 ********************************************** 

Bytecount eichr (Eistring *eistr, Ichar chr);
Charcount eichr_char (Eistring *eistr, Ichar chr);
Bytecount eichr_off (Eistring *eistr, Ichar chr, Bytecount off,
     		Charcount charoff);
Charcount eichr_off_char (Eistring *eistr, Ichar chr, Bytecount off,
     		     Charcount charoff);
Bytecount eirchr (Eistring *eistr, Ichar chr);
Charcount eirchr_char (Eistring *eistr, Ichar chr);
Bytecount eirchr_off (Eistring *eistr, Ichar chr, Bytecount off,
     		 Charcount charoff);
Charcount eirchr_off_char (Eistring *eistr, Ichar chr, Bytecount off,
     		      Charcount charoff);


 ********************************************** 
 *   Searching in the Eistring for a string   * 
 ********************************************** 

Bytecount eistr_ei (Eistring *eistr, Eistring *eistr2);
Charcount eistr_ei_char (Eistring *eistr, Eistring *eistr2);
Bytecount eistr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off,
     		   Charcount charoff);
Charcount eistr_ei_off_char (Eistring *eistr, Eistring *eistr2,
     			Bytecount off, Charcount charoff);
Bytecount eirstr_ei (Eistring *eistr, Eistring *eistr2);
Charcount eirstr_ei_char (Eistring *eistr, Eistring *eistr2);
Bytecount eirstr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off,
     		    Charcount charoff);
Charcount eirstr_ei_off_char (Eistring *eistr, Eistring *eistr2,
     			 Bytecount off, Charcount charoff);

Bytecount eistr_c (Eistring *eistr, Ascbyte *c_string);
Charcount eistr_c_char (Eistring *eistr, Ascbyte *c_string);
Bytecount eistr_c_off (Eistring *eistr, Ascbyte *c_string, Bytecount off,
     		   Charcount charoff);
Charcount eistr_c_off_char (Eistring *eistr, Ascbyte *c_string,
     		       Bytecount off, Charcount charoff);
Bytecount eirstr_c (Eistring *eistr, Ascbyte *c_string);
Charcount eirstr_c_char (Eistring *eistr, Ascbyte *c_string);
Bytecount eirstr_c_off (Eistring *eistr, Ascbyte *c_string,
     		   Bytecount off, Charcount charoff);
Charcount eirstr_c_off_char (Eistring *eistr, Ascbyte *c_string,
     			Bytecount off, Charcount charoff);


 ********************************************** 
 *                 Comparison                 * 
 ********************************************** 

int eicmp_* (Eistring *eistr, ...);
int eicmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
                 Bytecount len, Charcount charlen, ...);
int eicasecmp_* (Eistring *eistr, ...);
int eicasecmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
                     Bytecount len, Charcount charlen, ...);
int eicasecmp_i18n_* (Eistring *eistr, ...);
int eicasecmp_i18n_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
                          Bytecount len, Charcount charlen, ...);

     Compare the Eistring with the other data.  Return value same as
     from strcm