2013年8月31日 星期六

SQLite Full-Text Searches 與 Custom Tokenizers 筆記 (FTS3/FTS4)

雖然敝公司的強項就是做 search engines 跟資料儲存管理,但研究其他家的 search 機制也是很基本的功夫。SQLite 小歸小,卻也五臟俱全,應該稱得上市面用戶最廣的資料庫系統吧?光 browser 或 mobile phone 就打趴一大片。

SQLITE Full-text Search 官方介紹網址: SQLite FTS3 and FTS4 Extensions

簡言之 FTS4 比 FTS3 更有效率,但也更吃資料空間。要留意的版本資狀態:
FTS4 is an enhancement to FTS3. FTS3 has been available since SQLite version 3.5.0 in 2007-09-04. The enhancements for FTS4 were added with SQLite version 3.7.4 on 2010-12-08.
至於 FTS3/FTS4 使用時機:
FTS4 is sometimes significantly faster than FTS3, even orders of magnitude faster depending on the query, though in the common case the performance of the two modules is similar. FTS4 also offers the enhanced matchinfo() outputs which can be useful in ranking the results of a MATCH operation. On the other hand, in the absence of a matchinfo=fts3 directive FTS4 requires a little more disk space than FTS3, though only a percent of two in most cases.
建立 FTS3/FTS4 Tables:

-- Create an FTS table named "pages" with three columns:
CREATE VIRTUAL TABLE pages USING fts4(title, keywords, body);

-- Create an FTS table named "mail" with two columns. Datatypes
-- and column constraints are specified along with each column. These
-- are completely ignored by FTS and SQLite.
CREATE VIRTUAL TABLE mail USING fts3(
  subject VARCHAR(256) NOT NULL,
  body TEXT CHECK(length(body)<10240)
);


至於 FTS Queries 範例:

-- The examples in this block assume the following FTS table:
CREATE VIRTUAL TABLE mail USING fts3(subject, body);

SELECT * FROM mail WHERE rowid = 15;                -- Fast. Rowid lookup.
SELECT * FROM mail WHERE body MATCH 'sqlite';       -- Fast. Full-text query.
SELECT * FROM mail WHERE mail MATCH 'search';       -- Fast. Full-text query.
SELECT * FROM mail WHERE rowid BETWEEN 15 AND 20;   -- Slow. Linear scan.
SELECT * FROM mail WHERE subject = 'database';      -- Slow. Linear scan.
SELECT * FROM mail WHERE subject MATCH 'database';  -- Fast. Full-text query.


以及這段:

-- Example schema
CREATE VIRTUAL TABLE mail USING fts3(subject, body);

-- Example table population
INSERT INTO mail(docid, subject, body) VALUES(1, 'software feedback', 'found it too slow');
INSERT INTO mail(docid, subject, body) VALUES(2, 'software feedback', 'no feedback');
INSERT INTO mail(docid, subject, body) VALUES(3, 'slow lunch order',  'was a software problem');

-- Example queries
SELECT * FROM mail WHERE subject MATCH 'software';    -- Selects rows 1 and 2
SELECT * FROM mail WHERE body    MATCH 'feedback';    -- Selects row 2
SELECT * FROM mail WHERE mail    MATCH 'software';    -- Selects rows 1, 2 and 3
SELECT * FROM mail WHERE mail    MATCH 'slow';        -- Selects rows 1 and 3


接著,提提要使用 FTS3 / FTS4 的用法,須要在 compile sqlite3 加上設定才行,預設沒有啟用,而啟用 FTS3 也會啟用 FTS4:

-DSQLITE_ENABLE_FTS3
-DSQLITE_ENABLE_FTS3_PARENTHESIS




CPPFLAGS="-DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS" ./configure <configure options>

此外,切 token 預設可選的機制有 "simple", "porter" (Porter Stemming algorithm), "icu" (需要SQLITE_ENABLE_ICU, 要有C版ICU Library), 和 "unicode61" (SQLite version 3.7.13之後支援)。

-- Create an FTS table named "papers" with two columns that uses
-- the tokenizer "porter".
CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);

-- Create an FTS table with a single column - "content" - that uses
-- the "simple" tokenizer.
CREATE VIRTUAL TABLE data USING fts4(tokenize=simple);

-- Create an FTS table with two columns that uses the "icu" tokenizer.
-- The qualifier "en_AU" is passed to the tokenizer implementation
CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);


其中 simple 就是會把資料都先轉成小寫,查詢時也會將 keyword 轉小寫;porter 則是支援相關字詞的搜尋,例如資料內是存 "frustrated" 字,但用 "Frustrated" 和 "Frustration” 都可以查到;icu library 全名則為 "International Components for Unicode" library;Unicode61 則跟 simple 很像,但支援 unicode,如 unicode 空白與鏢體符號(切 token 用的 delimiter)等,而 simple 只認得 ASCII 部分。

接著,進入正題 SQLite FTS 這迷人的架構就在於可自定 tokenizers 啦,也是這篇筆記的重點,Custom (User Implemented) Tokenizers,不過網站上只有片段程式碼,問了同事才知道接下來又要翻 code 才行 :P 於是建議我挑 simple_tokenizer 來看,在 sqlite3.c 中,可以搜尋 tokenizers 的描述:
The fts3 built-in tokenizers - "simple", "porter" and "icu"- are implemented in files fts3_tokenizer1.c, fts3_porter.c and fts3_icu.c respectively. The following three forward declarations are for functions declared in these files used to retrieve the respective implementations.
Calling sqlite3Fts3SimpleTokenizerModule() sets the value pointed to by the argument to point to the "simple" tokenizer implementation.
And so on.
故搜尋 fts3_tokenizer1.c 可得知以下的定義:

/*
** 2006 Oct 10
**
** The author disclaims copyright to this source code.  In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
******************************************************************************
**
** Implementation of the "simple" full-text-search tokenizer.
*/

typedef struct simple_tokenizer {
  sqlite3_tokenizer base;
  char delim[128];             /* flag ASCII delimiters */
} simple_tokenizer;

typedef struct simple_tokenizer_cursor {
  sqlite3_tokenizer_cursor base;
  const char *pInput;          /* input we are tokenizing */
  int nBytes;                  /* size of the input */
  int iOffset;                 /* current position in pInput */
  int iToken;                  /* index of next token to be returned */
  char *pToken;                /* storage for current token */
  int nTokenAllocated;         /* space allocated to zToken buffer */
} simple_tokenizer_cursor;

static int simpleDelim(simple_tokenizer *t, unsigned char c){
  return c<0x80 && t->delim[c];
}
static int fts3_isalnum(int x){
  return (x>='0' && x<='9') || (x>='A' && x<='Z') || (x>='a' && x<='z');
}

/*
** Create a new tokenizer instance.
*/
static int simpleCreate(
  int argc, const char * const *argv,
  sqlite3_tokenizer **ppTokenizer
){
// ...
}

/*
** Destroy a tokenizer
*/
static int simpleDestroy(sqlite3_tokenizer *pTokenizer){
  sqlite3_free(pTokenizer);
  return SQLITE_OK;
}

/*
** Prepare to begin tokenizing a particular string.  The input
** string to be tokenized is pInput[0..nBytes-1].  A cursor
** used to incrementally tokenize this string is returned in
** *ppCursor.
*/
static int simpleOpen(
  sqlite3_tokenizer *pTokenizer,         /* The tokenizer */
  const char *pInput, int nBytes,        /* String to be tokenized */
  sqlite3_tokenizer_cursor **ppCursor    /* OUT: Tokenization cursor */
){
// ...
}

/*
** Close a tokenization cursor previously opened by a call to
** simpleOpen() above.
*/
static int simpleClose(sqlite3_tokenizer_cursor *pCursor){
  simple_tokenizer_cursor *c = (simple_tokenizer_cursor *) pCursor;
  sqlite3_free(c->pToken);
  sqlite3_free(c);
  return SQLITE_OK;
}

/*
** Extract the next token from a tokenization cursor.  The cursor must
** have been opened by a prior call to simpleOpen().
*/
static int simpleNext(
  sqlite3_tokenizer_cursor *pCursor,  /* Cursor returned by simpleOpen */
  const char **ppToken,               /* OUT: *ppToken is the token text */
  int *pnBytes,                       /* OUT: Number of bytes in token */
  int *piStartOffset,                 /* OUT: Starting offset of token */
  int *piEndOffset,                   /* OUT: Ending offset of token */
  int *piPosition                     /* OUT: Position integer of token */
){
// ...
}

/*
** The set of routines that implement the simple tokenizer
*/
static const sqlite3_tokenizer_module simpleTokenizerModule = {
  0,
  simpleCreate,
  simpleDestroy,
  simpleOpen,
  simpleClose,
  simpleNext,
  0,
};

/*
** Allocate a new simple tokenizer.  Return a pointer to the new
** tokenizer in *ppModule
*/
SQLITE_PRIVATE void sqlite3Fts3SimpleTokenizerModule(
  sqlite3_tokenizer_module const**ppModule
){
  *ppModule = &simpleTokenizerModule;
}


從底部反過來往上看,至少要實作 Create, Destroy, Open, Close 和 Next,除了初始化跟資源釋放外,最重要的就是 Next 這個部分,簡單的說,就是把一串 byes 交給 Next,而 Next 每次被呼叫時,就從指定的位置往後找出一個 token 出來回傳,以此完成架構。

來做一個簡單的 changyy tokenizer 好了,只會把文章內有 changyy 開頭的 token 都取出來,其餘都 skip 掉:

/*
** Extract the next token from a tokenization cursor.  The cursor must
** have been opened by a prior call to simpleOpen().
*/
static int changyyNext(
  sqlite3_tokenizer_cursor *pCursor,  /* Cursor returned by simpleOpen */
  const char **ppToken,               /* OUT: *ppToken is the token text */
  int *pnBytes,                       /* OUT: Number of bytes in token */
  int *piStartOffset,                 /* OUT: Starting offset of token */
  int *piEndOffset,                   /* OUT: Ending offset of token */
  int *piPosition                     /* OUT: Position integer of token */
){
  changyy_tokenizer_cursor *c = (changyy_tokenizer_cursor *) pCursor;
  changyy_tokenizer *t = (changyy_tokenizer *) pCursor->pTokenizer;
  unsigned char *p = (unsigned char *)c->pInput;

  while( c->iOffset<c->nBytes ){
    int iStartOffset;

    /* Scan past delimiter characters */
    while( c->iOffset<c->nBytes && simpleDelim(t, p[c->iOffset]) ){
      c->iOffset++;
    }

    /* Count non-delimiter characters. */
    iStartOffset = c->iOffset;
    while( c->iOffset<c->nBytes && !simpleDelim(t, p[c->iOffset]) ){
      c->iOffset++;
    }

    if( c->iOffset>iStartOffset ){
      int i, n = c->iOffset-iStartOffset;
      if( n>c->nTokenAllocated ){
        char *pNew;
        c->nTokenAllocated = n+20;
        pNew = sqlite3_realloc(c->pToken, c->nTokenAllocated);
        if( !pNew ) return SQLITE_NOMEM;
        c->pToken = pNew;
      }
      for(i=0; i<n; i++){
        /* TODO(shess) This needs expansion to handle UTF-8
        ** case-insensitivity.
        */
        unsigned char ch = p[iStartOffset+i];
        c->pToken[i] = (char)((ch>='A' && ch<='Z') ? ch-'A'+'a' : ch);
      }
      *ppToken = c->pToken;
      *pnBytes = n;
      *piStartOffset = iStartOffset;
      *piEndOffset = c->iOffset;
      *piPosition = c->iToken++;
      if( !strncmp(c->pToken,"changyy", n > 7 ? 7 : n ) )
      return SQLITE_OK;
    }
  }
  return SQLITE_DONE;
}


上述程式取於 simple_tokenizer 架構,只在做後的 token 加上判斷而已 :) 如此一來,假設測資為 "changyy changes changyys world",那搜尋 changyy 跟 changyys 都可以找到,但搜尋 changes 和 world 都會找不到(因為不符合 token 規定 :P)

$ ./sqlite3
sqlite> .load /path/changyy.so
sqlite> CREATE VIRTUAL TABLE data USING fts3(tokenize=changyy);
sqlite> INSERT INTO data ('content') VALUES ('changyy changes changyys world');
sqlite> SELECT SNIPPET(data) FROM data WHERE data MATCH 'changyy';
<b>changyy</b> changes changyys world
sqlite> SELECT SNIPPET(data) FROM data WHERE data MATCH 'changes';
sqlite> SELECT SNIPPET(data) FROM data WHERE data MATCH 'changyys';
changyy changes <b>changyys</b> world
sqlite> SELECT SNIPPET(data) FROM data WHERE data MATCH 'world';
sqlite>


可以看到上述只有當 token 有被索引過的才能找出來,如此一來,就算完成偷懶的 custom tokenizer 學習吧 XD 更多操作請參考 https://github.com/changyy/sqlite-study 

沒有留言:

張貼留言