ICU tailoring rules are at https://unicode-org.github.io/icu/.
ICU provides a useful web site for testing collation rules at https://demo.icu-project.org/icu-bin/locexp.
Each rule starts with an ampersand followed by an anchor point. The rest of the rule specifies how characters are collated relative to the anchor point. Rules do not need to start on a new line, but placing each rule on its own line makes them more readable.
Here is a simple example of a rule using a primary level (single left wedge):
&c<k
This rule states that “k” comes immediately after “c” (e.g., cat, kite, dog).
Note: This does not handle uppercase.
Digraphs can be handled as well:
&n<ng
This rule states that the “ng” digraph occurs after “n” (e.g., nang, nung, ngang).
Note: This does not handle uppercase.
The Unicode default collation sequence ignores diacritics unless the rest of the word is identical. In that case, words are sorted based on the diacritic (e.g., bad, bád, bàd, bâd, båd, bäd, bãd).
Two left wedges are used for secondary level collation, which only comes into effect if the primary levels are identical. The secondary level is typically used for diacritics. You can change the way diacritics are sorted with the following rule:
&a<<à<<á<<â<<å<<ä<<ã
This example changes the default collation of diacritics so that grave sorts before acute (e.g., bad, bàd, bád, bâd, båd, bäd, bãd).
Note: In addition to inserting the actual character in a rule, you can also give its code point using the \u syntax. The following rules are equivalent:
&a<<à
&\u0061<<\u00e0
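If a rule is easier to type with the actual characters but needs to be stored or shared in the \u form, the conversion can be scripted. The following is a minimal sketch, not part of ICU or FieldWorks: `escape_non_ascii` is a hypothetical helper name, it leaves ASCII characters (including the rule syntax) untouched, and its use of the eight-digit `\U` form for characters outside the Basic Multilingual Plane is an assumption, since the text above only shows the four-digit `\u` form.

```python
def escape_non_ascii(rule: str) -> str:
    """Return the rule with every non-ASCII character written as an escape.

    Hypothetical helper for illustration only.
    """
    parts = []
    for ch in rule:
        cp = ord(ch)
        if cp < 0x80:
            # ASCII letters, digits and rule syntax are kept as typed.
            parts.append(ch)
        elif cp <= 0xFFFF:
            # Four hex digits, as in the examples in this document.
            parts.append(f"\\u{cp:04x}")
        else:
            # Assumption: eight hex digits for characters beyond U+FFFF.
            parts.append(f"\\U{cp:08x}")
    return "".join(parts)


# "\u00e0" is a with grave, written as an escape so the source stays ASCII.
escaped = escape_non_ascii("&a<<\u00e0")
print(escaped)
```

Unlike the second rule above, this sketch does not escape the plain letter a, because ASCII letters need no escaping.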
Three left wedges are used for tertiary level collation, which is typically used for case. Tertiary level sorting only affects strings that are identical through the secondary level.
&n<ng<<<Ng<<<NG
&c<k<<<K
The first rule moves “ng” (regardless of case) to follow “n” (e.g., nang, Nang, NANG, nung, ngang, Ngang, NGANG). The second moves “k” (regardless of case) to follow “c” (e.g., cat, kite, Kite, dog).
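The way the three levels interact can be pictured with a toy model. The sketch below is not ICU and does not use real ICU weights: the weight table and the names `TOY_WEIGHTS` and `toy_sort_key` are invented for illustration. It only shows the principle that all primary weights are compared first, secondary weights (diacritics) are consulted only when the primaries tie, and tertiary weights (case) only when the secondaries also tie.

```python
import unicodedata

# Hypothetical (primary, secondary, tertiary) weights, invented for this sketch.
TOY_WEIGHTS = {
    "a": (1, 0, 0),
    "A": (1, 0, 1),
    "\u00e0": (1, 1, 0),  # a with grave: same primary as a, higher secondary
    "\u00e1": (1, 2, 0),  # a with acute: sorts after grave at the secondary level
    "b": (2, 0, 0),
    "B": (2, 0, 1),
    "d": (3, 0, 0),
}


def toy_sort_key(word):
    """Build a key that compares all primaries, then secondaries, then tertiaries."""
    # Normalize so that each accented letter is a single precomposed character.
    weights = [TOY_WEIGHTS[ch] for ch in unicodedata.normalize("NFC", word)]
    primary = tuple(w[0] for w in weights)
    secondary = tuple(w[1] for w in weights)
    tertiary = tuple(w[2] for w in weights)
    return (primary, secondary, tertiary)


words = ["b\u00e1d", "Bad", "b\u00e0b", "bad", "b\u00e0d", "bab"]
ordered = sorted(words, key=toy_sort_key)
print(ordered)
```

In this toy model the result is bab, bàb, bad, Bad, bàd, bád: the grave in bàb does not move it past bad because the primary level already decides that pair, while case and diacritics only separate the words whose base letters are identical.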
To sort “á” at a primary level after all other “a's”, use this rule:
&a<á<<<Á
This sort order gives ade, ãde, apple, Azure, áde, Áde.
If you need to sort a character before another one instead of after it (e.g., āb, Āb, aa, Aa), you can do it in two ways:
&[before 1]a<ā<<<Ā
In this case the right-hand side goes before the anchor “a” instead of after it. The digit 1 indicates that this is a primary level difference.
&9<ā<<<Ā
The other way is to use an anchor point that sorts before the desired letter. Since 9 normally sorts before “a”, a rule stating that ā immediately follows 9 places ā before “a”.
To sort phonetic script in “p pʰ b ɸ β m ʍ w” order, use either of the following identical rules:
&p<pʰ<b<ɸ<β<m<ʍ<w
&p<\u0070\u02b0<b<\u0278<\u03b2<m<\u028d<w
Note: This approach can be used to turn a Shoebox sort sequence (that does not have case distinctions) into a rule. Shoebox has a list of characters, one per line, in the desired order. Put an “&” in front of the first character and change each new line into “<”. This rule can be pasted into the FieldWorks sort tab. Use only UTF-8 characters, not ANSI.
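The conversion described in this note can be scripted. The following is a minimal sketch under stated assumptions: `shoebox_to_rule` is a hypothetical helper name, the input is assumed to be the Shoebox sort sequence already decoded as Unicode text with one item per line, blank lines are skipped, and no quoting of reserved characters is attempted.

```python
def shoebox_to_rule(sort_sequence: str) -> str:
    """Turn a one-item-per-line sort sequence into a single primary-level rule.

    Hypothetical helper for illustration only.
    """
    # Strip surrounding whitespace and drop empty lines.
    items = [line.strip() for line in sort_sequence.splitlines()]
    items = [item for item in items if item]
    if not items:
        raise ValueError("sort sequence is empty")
    # "&" in front of the first item, "<" in place of each line break.
    return "&" + "<".join(items)


# The phonetic order used earlier, with non-ASCII letters written as escapes.
sequence = "p\np\u02b0\nb\n\u0278\n\u03b2\nm\n\u028d\nw\n"
rule = shoebox_to_rule(sequence)
print(rule)
```

The printed rule is the same as the phonetic example above and can be checked in the ICU demo before pasting it into FieldWorks.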
To sort uppercase and lowercase in “c C b B a A” order, use these two rules:
&c<b<<<B
&b<a<<<A
This can also be combined into a single rule:
&c<b<<<B<a<<<A
Note: In an ICU rule, any non-alphanumeric ASCII character is reserved as a syntax character. If you need to control collation of any of these characters, you must enclose them in apostrophes. A single apostrophe is represented as two apostrophes. Here are some examples of alphanumeric and punctuation characters with or without the \u syntax.
a | letter a
\u0061 | letter a
3 | digit 3
ng | digraph ng
'ng' | digraph ng (quotes are optional for alphanumeric characters)
\u006e\u0067 | digraph ng
'-' | hyphen
' ' | space
'\u0020' | space
'' | apostrophe
\u0027\u0027 | apostrophe
To control the collation of an apostrophe, you would thus add two apostrophes (not a double quote). To sort t' after t, you would use this rule:
&t<t''
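The quoting convention described above can be sketched as a small helper. This is only an illustration: `quote_rule_text` is a hypothetical name, it treats every non-alphanumeric ASCII character as reserved (as the note above states), and grouping adjacent reserved characters into one quoted run is an assumption about how ICU reads quoted text, so check the output in the ICU demo or in FieldWorks before relying on it.

```python
def quote_rule_text(text: str) -> str:
    """Quote reserved ASCII characters so the text can be used inside a rule.

    Hypothetical helper for illustration only.
    """

    def is_reserved(ch: str) -> bool:
        # Assumption from the note above: any non-alphanumeric ASCII character.
        return ch.isascii() and not ch.isalnum()

    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if not is_reserved(ch):
            # Letters, digits and non-ASCII characters are copied as they are.
            out.append(ch)
            i += 1
        elif ch == "'":
            # A single apostrophe is written as two apostrophes.
            out.append("''")
            i += 1
        else:
            # Wrap a whole run of reserved characters in one pair of apostrophes,
            # doubling any apostrophe that occurs inside the run.
            j = i
            while j < len(text) and is_reserved(text[j]):
                j += 1
            out.append("'" + text[i:j].replace("'", "''") + "'")
            i = j
    return "".join(out)


quoted = quote_rule_text("t'")
print("&t<" + quoted)
```

With this sketch, `quote_rule_text("t'")` gives `t''` and `quote_rule_text("-")` gives `'-'`, matching the examples above.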
The following rules would be one way to handle IPA sorting:
&d<d͡ʒ
&e<ɛ<f<ɸ
&i<ɨ
&k<k''
&n<ŋ
&p<p''<r<ɾ
&s<ʃ<ʂ
&t<t''<t͡s<t͡s''<t͡ʃ<t͡ʃ''<ʈ͡ʂ<ʈ͡ʂ''
&z<ʒ<ʐ<ʔ
Suppose you want to ignore an apostrophe after m and n, but you want ng to sort after n, and ng' to sort after ng. The following rules allow for this.
The = syntax states that the right side is identical to the left side.
&m=m''
&M=M''
&n=n''
&N=N''
&n<ng<<<Ng<<<NG<ng''<<<Ng''<<<NG''
Suppose you want to ignore U+02BC MODIFIER LETTER APOSTROPHE in sorting. There are two ways you could handle this. The following rule doesn’t totally ignore the apostrophe, but it treats it as a secondary level difference so that it is ignored unless words are otherwise identical. In this case it always comes after other diacritics.
&\u030E<<\u02BC
This would result in the following order: ba, bad, bäd, baʼd, bʼad, bade, bat, bät, baʼt, bʼat, bate.
The second approach is to totally ignore U+02BC.
&[last tertiary ignorable] = \u02BC
This would result in the following order: ba, baʼd, bad, bʼad, bäd, bade, baʼt, bat, bʼat, bät, bate. Since baʼd, bad, and bʼad all have identical sort keys, their relative order is arbitrary.
If you need to ignore more than one character, use = to separate the list of characters. The following rule would ignore an apostrophe, a question mark, a hyphen, a space, and the ng digraph:
&[last tertiary ignorable] = '' = '?' = '-' = ' ' = ng
This could also be represented as:
&[last tertiary ignorable] = \u0027\u0027 = '\u003f' = '\u002d' = '\u0020' = \u006e\u0067
If you simply want to ignore all punctuation as well as white space, you can use the following rule:
[alternate shifted]
To remove a parenthesis "(" that appears as a letter heading in Dictionary or Reversal Indexes, use a rule like this:
&[last tertiary ignorable] = '('
For more information refer to the document entitled ICU and writing systems at https://software.sil.org/fieldworks/support/technical-documents/.