Developer

Oracle Tip: Solving directed graph problems with SQL, part 2

Scott Stephens explains how to solve a state machine problem with SQL Server.

Part one discussed using SQL to solve directed graph problems. Another kind of directed graph is a state machine. In a state machine, each node is called a state and represents the current status of a system, object, or variable. There are vectors between each state that represent a change in state caused by an external event.

For example, Windows event-based architecture or JavaScript events are designed around the idea that a GUI contains a set of objects that respond to various user-input or system-generated events.

Even a programming language compiler is a state machine. A compiler recognizes tokens by shifting between different states until a token is "recognized." A popular utility, lex, was often used to take a set of tokens, or reserved words, and convert it and optimize it into a state machine in C. You can duplicate this same process in SQL, with the advantage that the generated code can be in any language.

First, create a table of tokens:

create table tokens
(
    id$         integer constraint tokens_pk primary key,
    token       varchar2(20)
);

A state machine needs to store a set of states and a set of transitions between states. For this example, a single input character will represent each transition:

create table states
(
    id$         integer constraint states_pk primary key,
    accept#     integer,
    code        varchar2(20),
    constraint  states_accept_fk foreign key (accept#) references tokens(id$),
    constraint  states_accept_uq unique (accept#)
);

create table trans
(
    state$#     integer not null,
    input$      char not null,
    next#       integer,
    constraint  trans_pk primary key (state$#,input$),
    constraint  trans_state_fk foreign key (state$#) references states(id$),
    constraint  trans_next_fk foreign key (next#) references states(id$)
);

The problem is to build up these tables of states and transitions based on the set of tokens. As in the previous directed graph problem, we'll start with some initial node (in this case, the initial state) and try to find all the routes between different states. Each route is the set of characters recognized so far. For example:

state 0: initial state, empty string
state 1: 'A' state
state 2: 'AA' state
state 3: 'AB' state
state 4: 'ABC' state

You can populate the transition table by associating the letter (or regular expression in a more realistic program) with the state we're currently in and the state at which we'll arrive. Let's populate and optimize the table with this code:

Declare
    l_state pls_integer := 0;
    l_len pls_integer := 1;
begin
    — populate states
    insert into states (id$) values (l_state);
    loop
        insert into states(id$,code)
            select l_state+rownum,code
              from
              (
                  select distinct substr(token,1,l_len) code
                    from tokens
                   where length(token) >= l_len
              );
        exit when sql%rowcount = 0;
        l_state := l_state + sql%rowcount;
        l_len := l_len + 1;
    end loop;
    — populate transitions
    insert into trans (state$#,input$,next#)
        select s1.id$,substr(s2.code,-1,1),s2.id$
          from states s1,states s2
         where (s1.code = substr(s2.code,1,length(s2.code)-1))
            or (s1.id$ = 0 and length(s2.code) = 1);
    — map accept states to tokens
    update states
       set accept# = (select id$ from tokens where token = states.code);
end;
/
show errors;

The first statement inserts the initial state. Then, for cutting up the tokens into different lengths, we can enumerate all the states. Next, we populate the transition table with the character required to get from one state to the next. Finally, we attach the token id to the final state to indicate that this state represents a complete token. For example, if we add three tokens, where 1 = AAA, 2 = ABC, 3 = ASC, then running the above code will result in data, which you can see in Table A.

You can use this table to generate code in many languages that process characters from strings efficiently. The only code that gets used in a recognition program is the "accept" column for states, and the input and next state columns from the transition table. For example, in C:

#include <stdio.h>

typedef struct state state;
typedef struct trans trans;

struct state
{
    int accept;               /* token value or 0 */
    int offset;               /* offset into transition table */
    int count;
};
struct trans
{
    char input;               /* input character */
    int    next;              /* next state */
};

state statetbl[8] =
{
    {0,0,1},                  /* 0: initial state */
    {0,1,3},                  /* 1: 'A' state */
    {0,4,1},                  /* 2: 'AA' state */
    {0,5,1},                  /* 3: 'AB' state */
    {0,6,1},                  /* 4: 'AS' state */
    {1,7,0},                  /* 5: 'AAA' state */
    {2,7,0},                  /* 6: 'ABC' state */
    {3,7,0},                  /* 7: 'ASC' state */
};
trans trantbl[7] =
{
   {'A',1},                   /* 0: initial -> 'A' */
   {'A',2},                   /* 1: 'A' -> 'AA' */
   {'B',3},                   /* 2: 'A' -> 'AB' */
   {'S',4},                   /* 3: 'A' -> 'AS' */
   {'A',5},                   /* 4: 'AA' -> 'AAA' */
   {'C',6},                   /* 5: 'AB' -> 'ABC' */
   {'C',7},                   /* 6: 'AS' -> 'ASC' */
};

/* given a current state and input character, return the next state */
int move(int state,char input)
{
    int i;
    for (i = 0;i<statetbl[state].count;i++)
    {
        if (trantbl[statetbl[state].offset+i].input == input)
           return trantbl[statetbl[state].offset+i].next;
    }
    return -1;      /* unknown transition */
}

int main(int argc,char* argv[])
{
    int i,next,state;
    if (argc < 2)
    {
        fprintf(stderr,"Usage: %s \"tokens...\"\n",argv[0]);
        return 1;
    }

    /* parse the first argument */
    state = 0;
    for (i=0;argv[1][i];i++)
    {
        next = move(state,argv[1][i]);
        if (next == -1) state = 0;
        else
        {
            state = next;
            if (statetbl[state].accept)
                printf("recognized token#%d\n",statetbl[state].accept);
        }
    }
    return 0;
}

Then, you can run this program and get tokens recognized on the command line:

$ cc -o token token.c

$ token  "ABC ASC FEF ASC"
recognized token#2
recognized token#3
recognized token#3

TechRepublic's Oracle newsletter covers automating Oracle utilities, generating database alerts, solving directed graph problems, and more. Automatically subscribe today!

Editor's Picks

Free Newsletters, In your Inbox