Enterprise Software

Oracle Tip: Create accented characters with COMPOSE and UNISTR

Oracle9i introduced the COMPOSE function, which allows you to take a sequence of Unicode characters and normalize the text. Find out how you can enter the Unicode sequence in pure ASCII text with the help of the UNISTR function.

This article originally appeared in the Oracle e-newsletter. Click here to subscribe automatically.

Many languages, including English, use accented characters. Since these characters aren't part of the ASCII character set, it can be difficult to write code that uses these characters without looking up Unicode values or using a Unicode editor and converting to a known character set.

Oracle9i introduced the COMPOSE function, which allows you to take a sequence of Unicode characters and normalize the text. This means that it will take a letter such as 'a' (Unicode character 0097) and a combining mark, such as grave accent (Unicode character 0300), and create a single character that is the combination of the two marks. COMPOSE uses special combining marks, which are a part of the Unicode standard, rather than the ASCII equivalent punctuation marks. The result should be Unicode character 00E0 (little latin letter 'a' with a grave accent).

The most common combining characters in ANSI are:

  • U+0300: grave accent ( ` )
  • U+0301: acute accent ( ' )
  • U+0302: circumflex accent(^)
  • U+0303: tilde (~)
  • U+0308: umlaut

It may be difficult for you to enter Unicode characters 0097 and 0300 at a keyboard without special software or keyboard drivers. Therefore, one way to enter the Unicode sequence in pure ASCII text is with the help of the UNISTR function. This function takes a string of ASCII characters and creates a sequence of Unicode characters in the national character set (which are usually installed as either 16-bit Unicode or UTF-8 character set). It uses hex escape sequences to map any non-ASCII characters in a way that's similar to Java.

To enter the sequence for a, followed by grave accent combining character, you can use UNISTR('a\0300') instead of trying to enter the character directly in the code. This function will work correctly under any character set and any database that has a Unicode-based national character set. You can put more than the combining characters inside the functions—you can include a mix of ASCII and Unicode escapes in the UNISTR function. For example, you could write:

select COMPOSE(UNISTR('Unless you are nai\0308ve, meet me at the cafe\0301 with
your re\0301sume\0301.')) from dual;

When you combine the output of this function with COMPOSE, you can generate a Unicode character without looking up each individual value. For example:

select 'it is true' if compose(unistr('a\0300')) = unistr('\00e0');

The COMPOSE function returns an NVARCHAR2 string, which is usually Unicode-based. When using these characters locally, the database will attempt to map the Unicode characters to your local character set when it does an implicit TO_CHAR on the results. Not all characters can be mapped, and there are some character combinations that don't work in COMPOSE because the Unicode consortium hasn't defined them at the level used by the Oracle database.

To run a quick check of how characters look in a particular environment, you can run a script similar to the following to see how the combined characters will be mapped on output. You may need to investigate NLS_LANG settings to make sure these characters return correctly:

create or replace type hexrange_tbl as table of varchar2(4);
show errors;

create or replace function hexrange(n1 varchar2,n2 varchar2)
    return hexrange_tbl pipelined
    for i in to_number(n1,'000X') .. to_number(n2,'000X') loop
        pipe row(to_char(i,'FM000X'));
    end loop;
end hexrange;
show errors;

select column_value composer,
    compose(unistr('a\'||column_value)) a,
    compose(unistr('c\'||column_value)) c,
    compose(unistr('e\'||column_value)) e,
    compose(unistr('i\'||column_value)) i,
    compose(unistr('n\'||column_value)) n,
    compose(unistr('o\'||column_value)) o,
    compose(unistr('r\'||column_value)) r,
    compose(unistr('s\'||column_value)) s,
    compose(unistr('u\'||column_value)) u,
    compose(unistr('y\'||column_value)) y
 from table(hexrange('0300','0327')) x;

For fun, here's a short PL/SQL script that uses COMPOSE and UNISTR to create an effect which many SMS users, hackers, and spammers around the world use to make readable English text that is difficult to scan because it uses a random sequence of accented versions of characters. I use DBMS_RANDOM to randomly select a combining character that can be used with different letters and lets SQL combine and convert back to produce ANSI/Latin-1 output. This script is hard-coded against the ENAME column of the EMP table.

set serveroutput on;
    — these combinations work under ANSI, at least
    a_comb nvarchar2(50) := unistr('\0300\0301\0302\0303\0308\030A');
    c_comb nvarchar2(50) := unistr('\0327');
    e_comb nvarchar2(50) := unistr('\0300\0301\0302\0308');
    i_comb nvarchar2(50) := unistr('\0300\0301\0308');
    n_comb nvarchar2(50) := unistr('\0303');
    o_comb nvarchar2(50) := unistr('\0300\0301\0302\0303\0308');
    u_comb nvarchar2(50) := unistr('\0300\0301\0302\0308');
    y_comb nvarchar2(50) := unistr('\0301\0308');
    l_idx integer;
    l_ename nvarchar2(50);
    ch nchar;
    l_junk varchar2(50);
    for row in (select ename from emp) loop
        l_ename := row.ename;
        l_junk := null;
        for i in 1..length(l_ename) loop
            ch := substr(l_ename,i,1);
            case lower(ch)
            when 'a' then
                l_junk := l_junk || compose(ch || substr(a_comb,
                    mod(abs(dbms_random.random),length(a_comb)) + 1,1));
            when 'c' then
                l_junk := l_junk || compose(ch || substr(c_comb,
                    mod(abs(dbms_random.random),length(c_comb)) + 1,1));
            when 'e' then
                l_junk := l_junk || compose(ch || substr(e_comb,
                    mod(abs(dbms_random.random),length(e_comb)) + 1,1));
            when 'i' then
                l_junk := l_junk || compose(ch || substr(i_comb,
                    mod(abs(dbms_random.random),length(i_comb)) + 1,1));
            when 'n' then
                l_junk := l_junk || compose(ch || substr(n_comb,
                    mod(abs(dbms_random.random),length(n_comb)) + 1,1));
            when 'o' then
                l_junk := l_junk || compose(ch || substr(o_comb,
                    mod(abs(dbms_random.random),length(o_comb)) + 1,1));
            when 'u' then
                l_junk := l_junk || compose(ch || substr(u_comb,
                    mod(abs(dbms_random.random),length(u_comb)) + 1,1));
            when 'y' then
                l_junk := l_junk || compose(ch || substr(y_comb,
                    mod(abs(dbms_random.random),length(y_comb)) + 1,1));
                l_junk := l_junk || ch;
            end case;
        end loop;
    end loop;
show errors;

Scott Stephens worked for Oracle for more than 13 years in technical support, e-commerce, marketing, and software development. For more of his Oracle tips, visit our Oracle Dev Tips Library.

Editor's Picks

Free Newsletters, In your Inbox