Japanese Full-Text Search in SQLite

Posted on Thursday, 19 May 2022 Suggest An Edit
# japanese
# sqlite
# csharp

Introduction

Recently I have been working on an android app that uses SQLite. The app has a feature where you can find an entry of a Japanese word with some details attached using a keyword that you typed in the search bar.

When building this feature, I found almost zero results on how to do it. There are some full-text search tutorials for Latin words, but I couldn’t find any for Japanese. It needs a special tokenizer called ICU from the ICU project. Since it’s not enabled by default in SQLite on my Linux machine or the one bundled with Microsoft.Data.Sqlite, I had to download the source code and compile it with some build flags to make it work. This is where the tricky part comes in. I spent a lot of time trying to figure out how to do it, but because I don’t really familiar with C, I missed things that would be obvious to someone with a fair amount of C experience.

Hopefully, by the end of this post, you’ll know how to do the exact same thing without spending hours of your life trying to figure out how to do it.

This article assumes that you’re somewhat familiar with SQLite and .NET Core so I will not go into much detail about the basics. I will also be using Linux so some parts of the compilation step will be Linux specific.

What is Full-Text Search

Full-text Search is a technique that allows you to search for a text in a database by providing only parts of them. For example, if I have a text that says “Never gonna give you up” and I type “give” in the search bar, I can find a row containing that text despite the fact that the word give is only in the middle of the text.

This is different from an exact search that usually looks like this:

SELECT text FROM rickroll WHERE text = 'give';

Using that query, you will only find a row that contains give, nothing more, nothing less.

Full-text search is similar to a wildcard search where you can use % to search for a text that contains a part of it. For example, if I type %give% in the search bar, I can find a row that contains give in it.

SELECT text FROM rickroll WHERE text LIKE '%give%';

If wildcard works then why bother with a full-text search? Well, they have some differences. Not only that wildcard is abysmally slow on a large dataset, they’re also less accurate. Wildcard only searches for a pattern while full-text-search will search based on the tokenised text. full-text search is also very fast compared to wildcard search because it already has an index of our data.

With wildcard, for example, if you have a pattern LIKE '%run%', not only that it will return any rows that have the word run, but it will also return any rows that have the word running, crunch, prune, etc. This is because wildcard search is based on the pattern, not the tokenised text. With full-text search, it will only return rows that have the word run and also running depending on the tokenizer that you use.

Why SQLite

Well, you don’t really have that much of an option on android, do you?

I found this tweet the other day…

…which inspired me to use the full-text search feature of SQLite. Initially, I was just thinking of a key-value pair database like a dictionary, but I realized that SQLite is a good fit for this because I can do so much more with it, including full-text search.

Although, the default SQLite tokenizer, which is “simple” and “porter”, doesn’t support Japanese. The one that works with CJK characters is the ICU tokenizer from the ICU Project as I mentioned earlier.

Simple Example

Setting up the project

We’re going to make a simple console app using .NET Core and SQLite to store the data. The app will have a database that contains a list of Japanese sentences and we can search through them using a prompt.

First, we need to set up the project. I will use the dotnet CLI for the entire process. We can run this simple command to get started.

dotnet new console -o ProjectName

That should give us a starting point to work with.

Disabling implicit using statements

Here’s a thing that I like to do when I start a new project. Open up the .csproj file and remove the <ImplicitUsings> line, our .csproj file should end up like this:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
    <Nullable>enable</Nullable>
  </PropertyGroup>

</Project>

We will get a warning from the compiler because our current Program.cs file doesn’t have any using statements. Let’s just ignore that for now.

The reason why I like to do this is that I don’t like the implicit using that was introduced in .NET 6. I prefer to be explicit with what I use.

Building SQLite with ICU support

To build SQLite with ICU support, we need to download the source code of SQLite and compile it with certain compilation flags. You can download the source code of SQLite from their download page. We want to extract it to a folder inside our project so it’ll be easier to find. You can name it whatever you want, but I’ll put it inside a directory and call it ./External/sqlite/.

We can then compile it using the following commands.

# configure the build
CFLAGS="-O3 -DSQLITE_ENABLE_ICU" CPPFLAGS=`icu-config --cppflags` LDFLAGS=`icu-config --ldflags` ./configure

# build the binary
make

The -DSQLITE_ENABLE_ICU is the flag that enables ICU support. You can see the full list of flags from this page. We use icu-config --ldflags to get the linker flags for ICU.

After compiling that, our directory should have the compiled binary called sqlite3. That’s not what we want for our application, though. We want to use the sqlite3 as a library, not a standalone executable. This is where it got me the first time, I didn’t know there was a directory called .libs inside the sqlite directory because it was hidden. That’s where the libraries are stored. I waste quite some time trying to make the binding in .NET works only to realise that I’m using the wrong binary. There should be an sqlite3.so file inside the .libs directory. This is what we’re going to be using for our application.

Since we put the sqlite directory inside our project, there will be a small issue. The sqlite directory contains a file called Replace.cs. This will cause the compiler to freak out because it’s trying to analyse a file called Replace.cs. We can ignore this file by adding the following block to the end of our .csproj file just before the closing tag </Project>.

<ItemGroup>
  <!-- adjust the path to suit your needs -->
  <Compile Remove="./External/sqlite/Replace.cs" />
</ItemGroup>

Of course, if you don’t put the sqlite directory inside your project, you don’t need to do this.

Preparing the database

We’re going to build a simple console app that lets us search for an example sentence containing a word that we typed in. To do that, we will need a table with three columns, Id, EnglishText, and JapaneseText. The Id column will be the primary key and will be auto-incremented. The EnglishText column will contain the English version of the text and the JapaneseText column will be the Japanese version.

In our project directory, we will create a Data directory where our database will be stored. After doing that, we will run the sqlite3 CLI to create the database. Make sure you use the sqlite3 CLI that we just compiled because the one that comes by default might not be compiled with ICU support. We can use our sqlite3 binary by typing the relative path instead of just sqlite3, in my case that would look like this:

./External/sqlite/sqlite3 ./Data/dictionary.db

It will put us in a prompt where we can create the table. I write this in multiline form so it would be easier to read, but we want to make it a single line when we’re typing it in the prompt to avoid weird things happening.

CREATE TABLE Dictionary (
  Id INTEGER PRIMARY KEY AUTOINCREMENT,
  EnglishText TEXT,
  JapaneseText TEXT
);

In order to use the full-text search feature of SQLite, we’re going to need a virtual table. It’s a different type of table, but how you create the table is quite similar.

CREATE VIRTUAL TABLE DictionaryFTS
USING fts4(
  JapaneseText,
  content='Dictionary',
  tokenize=icu ja_JP
);

The reason why I’m using FTS4 and not FTS3 or FTS5 is because FTS3 is slower and FTS5 has several improvements but it’s not available on android. For more detail on this, see the documentation for FTS3/4 and FTS5.

The tokenize parameter is the tokenizer that we’re going to use. We specify icu tokenizer and we use ja_JP as the locale because we’re working with the Japanese language.

Since we’re going to use the data from the regular Dictionary table by specifying the content column, we need to create a trigger that will update the DictionaryFTS table whenever the Dictionary table is updated. We can add a trigger using this SQL statement:

CREATE TRIGGER Dictionary_Update_FTS
AFTER INSERT ON Dictionary
BEGIN
  INSERT INTO DictionaryFTS(rowid, JapaneseText)
  VALUES (new.rowid, new.JapaneseText);
END;

This way, whenever we insert a new row in the Dictionary table, the DictionaryFTS table will be updated as well, which is exactly what we want. If you don’t add this trigger you can still query the DictionaryFTS table but you won’t be able to use the full-text search functionality.

Now, let’s add a few data to the Dictionary table. We can do that by running this statement:

INSERT INTO Dictionary (EnglishText, JapaneseText)
VALUES ("I woke up in the morning", "朝に起きた"),
       ("I was singing that song", "その歌を歌っていた"),
       ("The food was good", "食べ物はうまかった"),
       ("What time is it now?", "今何時?"),
       ("This is Japanese", "これは日本語です"),
       ("That car went too fast", "あの車が速すぎた"),
       ("It smells like spring", "春の匂いがする"),
       ("It's very cold outside", "外には寒いよ"),
       ("Today is Saturday", "今日は土曜日です"),
       ("Please tell me", "教えてくれ"),
       ("Sorry I don't understand Japanese", "日本語分からなくてごめん"),
       ("This is not an apple", "これはりんごじゃないよ"),
       ("This book is mine", "この本は私のものです"),
       ("I usually sleep around this time", "こんな時ごろに寝る"),
       ("Your Japanese is good", "日本語上手ですね"),
       ("I like cats", "私は猫が好き"),
       ("My cat is white", "私の猫が白です");
pardon my japanese :p

You can, of course, add a lot more data, but these should be enough for our purpose.

We can test if the full-text search works by typing this statement:

SELECT * FROM DictionaryFTS WHERE JapaneseText MATCH 'です';

You should see any rows that have the word です in it. In this example the result should be these rows:

これは日本語です
今日は土曜日です
この本は私のものです
日本語上手ですね

If you’re having difficulties with typing Japanese characters in SQLite prompt, you can type all of the commands in a file and then do this

# where `file.sql` is a file that contains SQLite statements
./External/sqlite/sqlite3 ./Data/dictionary.db < file.sql

# you can also do it directly by passing a string to the `sqlite3` command
./External/sqlite/sqlite3 ./Data/dictionary.db "SELECT * FROM table;"

Installing SQLite client

Since we’re going to work on a simple app, we don’t really need a full-fledged package like Entity Framework. We’re going to use Microsoft.Data.Sqlite package which is a lightweight ADO.NET provider for SQLite, but since we want to use our own version of sqlite3 that we’ve compiled earlier, we want Microsoft.Data.Sqlite.Core instead.

Make sure that you DON’T have Microsoft.Data.Sqlite installed in your project, otherwise, this wouldn’t work. The Microsoft.Data.Sqlite package bundles its own sqlite to provide a wider range of compatibility. I also spent quite some time on this step because I didn’t realise that. I missed a part of the documentation where it says that Microsoft.Data.Sqlite is a package that bundles sqlite.

The package that we’re using is built on top of SQLitePCLRaw package, which is a thin wrapper around sqlite. According to the documentation, we will need other packages to be installed as well if we want to use a custom version of sqlite.

dotnet add package Microsoft.Data.Sqlite.Core
dotnet add package SQLitePCLRaw.core
dotnet add package SQLitePCLRaw.provider.dynamic_cdecl

We should be done with the preparation.

Using a custom SQlite versionReplace.cs

Before using any of the Microsoft.Data.Sqlite functionality, we need to register our custom sqlite version. First, we need to add a native library adapter class that implements IGetFunctionPointer from SQLitePCL namespace. This class will be used to get the pointer to the native sqlite library. I will put this inside a file called NativeLibraryAdapter.cs at the root of our project.

using System;
using System.Runtime.InteropServices;
using SQLitePCL;

class NativeLibraryAdapter : IGetFunctionPointer
{
    readonly IntPtr _library;

    public NativeLibraryAdapter(string name)
        => _library = NativeLibrary.Load(name);

    public IntPtr GetFunctionPointer(string name)
        => NativeLibrary.TryGetExport(_library, name, out var address)
            ? address
            : IntPtr.Zero;
}

Next, we want to register our custom sqlite version. We can do that by adding a few lines of code to our Program.cs file.

using System.IO;
using SQLitePCL;

// adjust this path to where your sqlite3.so is stored
string sqliteLibPath = Path.GetFullPath("./External/sqlite/.libs/libsqlite3.so");

SQLite3Provider_dynamic_cdecl.Setup("sqlite3", new NativeLibraryAdapter(sqliteLibPath));
SQLitePCL.raw.SetProvider(new SQLite3Provider_dynamic_cdecl());

Make sure that we do this right at the beginning before doing anything with the database connection.

Connecting to the database

We should try to connect to the database. To do that, we can instantiate the SqliteConnection class and provide it with the path to the database. We can add this code to our Program.cs file. Bare in mind that you should put all using statements at the top of the file.

using Microsoft.Data.Sqlite;

string databasePath = Path.GetFullPath("./Data/dictionary.db");
using var connection = new SqliteConnection($"Data Source={databasePath}");

// open the database connection
connection.Open();

We would want to use the keyword using to make sure that the connection is disposed when we’re done with it.

Main logic of the app

What I want to build for this simple example is a simple search console app where the user is going to be given a prompt that ask for a word and then the app will search the database for the word and display the results. I won’t be going into the details of how to do this because the main focus is the full-text search functionality.

// close the database connection when we quit the app using Ctrl+C
Console.CancelKeyPress += delegate
{
    connection.Close();
    Environment.Exit(0);
};

// ask the user for a word to look up until they quit the app
while (true)
{
    Console.Write("== Enter a word (ctrl+c to cancel): ");
    var word = Console.ReadLine();

    // ask again if the word is empty
    if (string.IsNullOrEmpty(word))
    {
        Console.WriteLine("Word can't be empty!");
        continue;
    }

    // create a command to search the database and bind the user input
    // to the `@word` parameter
    var command = connection.CreateCommand();
    command.CommandText = [email protected]"
    SELECT
      Dictionary.EnglishText,
      Dictionary.JapaneseText
    FROM Dictionary
    WHERE Dictionary.JapaneseText IN (
        SELECT DictionaryFTS.JapaneseText FROM DictionaryFTS
        WHERE DictionaryFTS MATCH @word
    );";
    command.Parameters.Add(new SqliteParameter("@word", word));

    using var reader = command.ExecuteReader();

    // prints a message when no result was found
    if (!reader.HasRows)
    {
        Console.WriteLine("No result was found!");
        continue;
    }

    // print the results in a list
    while (reader.Read())
    {
        Console.WriteLine($"- {reader["EnglishText"]} => {reader["JapaneseText"]}");
    }
}

When you run the app by typing dotnet run, you should see something like this:

As you can see, when I search for です, I get every rows that has です in the JapaneseText field.

Highlighting the matched word

We can highlight the matched word in the result by using the snippet function. To do that, we need to change our query a bit.

SELECT
    Dictionary.EnglishText,
    JapaneseTextSnippet
FROM Dictionary
JOIN (
    SELECT
        snippet(DictionaryFTS, '[', ']') AS JapaneseTextSnippet,
        rowid
    FROM DictionaryFTS
    WHERE DictionaryFTS MATCH @word
) AS ResultFTS
    ON Dictionary.Id = ResultFTS.rowid;

I’m not sure about the performance of this query compared to using WHERE IN. If you have any suggestions, please let me know :)

This query will now select the JapaneseTextSnippet column that comes from the snippet function. The matched word should be surrounded by [ and ]. If you’re using the result for the web, you can wrap it with an html tag instead of square brackets, but since we’re making a console app, we don’t have the luxury of that and have to rely on ANSI escape sequence. You can see this gist for the ANSI escape sequence reference.

The reason why I don’t put the escape sequence inside the snippet function is because I had weird things happening when I tried to do that. So, I decided to wrap it in a bracket and then replace the bracket with the escape sequence using C# like this.

// print the results in a list
while (reader.Read())
{
  var englishText = reader.GetString(0);
  var japaneseTextSnippet = reader.GetString(1).Replace("[", "").Replace("]", "");
  Console.WriteLine($"- {englishText} => {japaneseTextSnippet}");
}

The end result should now look like this:

result with highlight

References

Here are some useful resources that helped me that might also be useful to you:

If you want the full code, you can find the repository at elianiva/foo-dictionary

Conclusion

Doing a full-text search is actually quite easy. Initially, I thought full-text search requires a fancy technology like Elasticsearch, Algolia, or something like that, but I found that I can do it with SQLite. Most of the pain I had to go through came from trying to make the SQLite binding work.

I hope this article helped you to implement full-text search for the Japanese language in your SQLite database. If you have any questions or suggestions, feel free to leave a comment :)

Have a nice day!

Comments