Collation in FieldWorks

ICU tailoring rules are documented at https://unicode-org.github.io/icu/.

ICU provides a useful web site for testing collation rules at
https://demo.icu-project.org/icu-bin/locexp.

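Each rule starts with an ampersand followed by an anchor point. The rest of the rule specifies how characters are collated relative to the anchor point: < introduces a primary (base letter) difference, << a secondary (typically diacritic) difference, and <<< a tertiary (typically case) difference. You do not need to start a new line for each rule, but doing so makes the rules more readable.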

Examples

Note: In an ICU rule, every non-alphanumeric ASCII character is reserved as a syntax character. To control the collation of any of these characters, you must enclose it in apostrophes. A single apostrophe is represented as two apostrophes. Here are some examples of alphanumeric and punctuation characters, with or without the \u syntax.

a                 letter a
\u0061            letter a
3                 digit 3
ng                digraph ng
'ng'              digraph ng (quotes are optional for alphanumeric characters)
\u006e\u0067      digraph ng
'-'               hyphen
' '               space
'\u0020'          space
''                apostrophe
\u0027\u0027      apostrophe

To control the collation of an apostrophe, you would thus add two apostrophes (not a double quote). To sort t' after t, you would use the rule:

&t<t''
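The quoting convention described above can be sketched as a small Python helper. This is only an illustration of the convention, not part of FieldWorks or ICU; the names quote_for_rule and after_rule are hypothetical, and the helper assumes that a run of consecutive reserved characters can share one pair of apostrophes.

```python
import string

# ASCII letters and digits never need quoting in a rule.
ALNUM = set(string.ascii_letters + string.digits)


def quote_for_rule(text: str) -> str:
    """Return text written as a literal for an ICU rule (hypothetical helper).

    Non-alphanumeric ASCII characters are enclosed in apostrophes, and an
    apostrophe itself is written as two apostrophes.
    """
    out = []
    in_quote = False
    for ch in text:
        if ch == "'":
            # A doubled apostrophe is a literal apostrophe, inside or
            # outside a quoted run.
            out.append("''")
        elif ord(ch) < 128 and ch not in ALNUM:
            if not in_quote:
                out.append("'")
                in_quote = True
            out.append(ch)
        else:
            if in_quote:
                out.append("'")
                in_quote = False
            out.append(ch)
    if in_quote:
        out.append("'")
    return "".join(out)


def after_rule(anchor: str, *items: str) -> str:
    """Build '&anchor<item<item...' with every piece quoted as needed."""
    return "&" + quote_for_rule(anchor) + "".join(
        "<" + quote_for_rule(item) for item in items
    )
```

For example, after_rule("t", "t'") produces &t<t'', the rule shown above, and quote_for_rule("-") produces '-'.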

The following rules would be one way to handle IPA sorting:

&d<d͡ʒ

&e<ɛ<f<ɸ

&i<ɨ

&k<k''

&n<ŋ

&p<p''<r<ɾ

&s<ʃ<ʂ

&t<t''<t͡s<t͡s''<t͡ʃ<t͡ʃ''<ʈ͡ʂ<ʈ͡ʂ''

&z<ʒ<ʐ<ʔ

Suppose you want to ignore an apostrophe after m and n, but you want ng to sort after n, and ng' to sort after ng. The following rules allow for this.

The = syntax states that the right side is identical to the left side.

&m=m''

&M=M''

&n=n''

&N=N''

&n<ng<<<Ng<<<NG<ng''<<<Ng''<<<NG''
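To see what it means for ng to sort after n as a unit, the following Python sketch models only the primary level of a rule such as &n<ng. It is a deliberately simplified stand-in, not the ICU algorithm: it handles lowercase ASCII letters only, ignores case and diacritics, and the names build_alphabet and sort_key are hypothetical.

```python
import string


def build_alphabet(rules):
    """Model primary-level rules of the form '&anchor<item<item'.

    rules is a list of (anchor, [items]) pairs. Each item is placed
    immediately after the previous one, starting from the anchor.
    """
    alphabet = list(string.ascii_lowercase)
    for anchor, items in rules:
        prev = anchor
        for item in items:
            if item in alphabet:
                alphabet.remove(item)
            alphabet.insert(alphabet.index(prev) + 1, item)
            prev = item
    return alphabet


def sort_key(word, alphabet):
    """Split word into sorting units (longest match first) and rank them."""
    rank = {unit: i for i, unit in enumerate(alphabet)}
    longest = max(len(unit) for unit in alphabet)
    key = []
    i = 0
    while i < len(word):
        for size in range(longest, 0, -1):
            unit = word[i:i + size]
            if unit in rank:
                key.append(rank[unit])
                i += len(unit)
                break
        else:
            # Unknown characters sort after the whole alphabet.
            key.append(len(alphabet) + ord(word[i]))
            i += 1
    return key


alphabet = build_alphabet([("n", ["ng"])])
words = sorted(["oa", "nga", "nza", "na"], key=lambda w: sort_key(w, alphabet))
```

In this model nga sorts after nza, because ng is treated as a single letter that follows n, rather than as n followed by g.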

Suppose you want to ignore U+02BC MODIFIER LETTER APOSTROPHE in sorting. There are two ways to handle this. The following rule does not totally ignore the apostrophe; it treats it as a secondary-level difference, so that it is ignored unless the words are otherwise identical. In that case it always sorts after other diacritics.

&\u030E<<\u02BC

This would result in the following order: ba, bad, bäd, baʼd, bʼad, bade, bat, bät, baʼt, bʼat, bate.

The second approach is to totally ignore 02BC.

&[last tertiary ignorable] = \u02BC

This would result in the following order: ba, baʼd, bad, bʼad, bäd, bade, baʼt, bat, bʼat, bät, bate. Since baʼd, bad, and bʼad all have identical sort keys, their relative order is arbitrary. The same is true of baʼt, bat, and bʼat.
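The effect of totally ignoring U+02BC can be imitated in Python with a key function that drops the character before comparing. This is a rough model under stated assumptions, not ICU: simple_key is a hypothetical name, only base letters and combining marks are compared, and case is not considered.

```python
import unicodedata

# Characters treated as completely ignorable in this model.
IGNORED = {"\u02bc"}  # MODIFIER LETTER APOSTROPHE


def simple_key(word):
    """Return (base letters, combining marks), skipping ignored characters."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = [ch for ch in decomposed if ch not in IGNORED]
    base = "".join(ch for ch in kept if not unicodedata.combining(ch))
    marks = "".join(ch for ch in kept if unicodedata.combining(ch))
    return (base, marks)


# These three words get the same key, so a sort cannot tell them apart.
same = simple_key("ba\u02bcd") == simple_key("bad") == simple_key("b\u02bcad")

ordered = sorted(["bade", "b\u00e4d", "bad", "ba"], key=simple_key)
```

Because Python's sort is stable, words with equal keys stay in their input order here. A collator gives no such guarantee unless it adds a further tie-breaking level.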

If you need to ignore more than one character, use = to separate the list of characters. The following rule would ignore an apostrophe, a question mark, a hyphen, a space, and the ng digraph:

&[last tertiary ignorable] = '' = '?' = '-' = ' ' = ng

This could also be represented as:

&[last tertiary ignorable] = \u0027\u0027 = '\u003f' = '\u002d' = '\u0020' = \u006e\u0067
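If you prefer the \u form, the escapes can be generated rather than looked up by hand. The Python helper below is a minimal sketch; to_u_escapes is a hypothetical name, and the eight-digit \U form used for characters above U+FFFF is an assumption about what your rule editor accepts. The helper does not add apostrophes, so escaped punctuation still needs to be quoted as in the rule above.

```python
def to_u_escapes(text: str) -> str:
    """Write each character of text as a \\uXXXX escape (hypothetical helper)."""
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0xFFFF:
            parts.append("\\u%04x" % code_point)
        else:
            # Assumption: characters outside the BMP use the \UXXXXXXXX form.
            parts.append("\\U%08x" % code_point)
    return "".join(parts)
```

For example, to_u_escapes("ng") gives \u006e\u0067, matching the last item in the rule above.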

If you simply want to ignore all punctuation as well as white space, you can use the following rule:

[alternate shifted]

Tip

&[last tertiary ignorable] = '('

Related Topics

Pathway multigraphs

Treat punctuation as word-forming characters

Writing Systems overview

Writing System Properties, Sorting tab