Little Librarian
Semantic search for Calibre libraries using vector embeddings.
Inner Workings
- Extract text content from EPUB and PDF files
- Chunk text into manageable pieces (500 words by default)
- Embed each chunk using your ONNX model
- Store embeddings in PostgreSQL with pgvector
- Search by converting queries to embeddings and finding similar chunks
- Rerank results using a cross-encoder model
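The chunking and similarity steps above can be sketched roughly as follows. This is an illustrative sketch only, not the project's actual code: the real pipeline embeds chunks with an ONNX model and does similarity search inside PostgreSQL via pgvector, while this toy version uses hand-made vectors.

```rust
/// Split text into chunks of at most `chunk_size` whitespace-separated words
/// (a deliberately naive scheme, as noted in the shortcomings below).
fn chunk_words(text: &str, chunk_size: usize) -> Vec<String> {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .chunks(chunk_size)
        .map(|c| c.join(" "))
        .collect()
}

/// Cosine similarity between two embedding vectors
/// (the kind of distance pgvector computes server-side).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Five words in chunks of two -> three chunks.
    let chunks = chunk_words("alpha beta gamma delta epsilon", 2);
    assert_eq!(chunks.len(), 3);

    // Identical vectors are maximally similar, orthogonal ones are not.
    assert!((cosine_similarity(&[1.0, 2.0], &[1.0, 2.0]) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
}
```

In the real application the query is embedded the same way as the chunks, nearest chunks are fetched from PostgreSQL, and a cross-encoder then reranks that shortlist.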
Shortcomings
This is an experiment and as such has some rough edges, including but not limited to:
- Text chunking is naive, and it is unclear whether it works passably on non-Latin scripts
- PDF text extraction is hit-and-miss
- CPU only (this is also a strength, but the option of GPU acceleration would be nice)
Requirements
All dependencies for running the application are specified in the flake.nix file within the project folder (look for the buildInputs key, though rustup might just work with the provided toolchain file). My recommendation is to use the provided nix flake directly, as it sets up everything automatically.
Additional dependencies:
- PostgreSQL with pgvector extension
- ONNX embedding model (e.g., all-MiniLM-L6-v2)
- Tokenizer file (JSON format, from the embedding model)
- ONNX reranking model (e.g. bge-reranker-base)
Usage
Use cargo run -- <ARGS> to run directly from source, specifying the needed arguments in place of <ARGS>.
Migrations
Either use sqlx-cli with sqlx migrate run, or apply the migrations manually.
General
Usage: little-librarian [OPTIONS] <COMMAND>
Commands:
ask Configuration for the ask command
scan Configuration for the scan command
help Print this message or the help of the given subcommand(s)
Options:
-d, --db-url <DB_URL>
Where to read the database connection string from ('-' is stdin) [default: -]
-t, --tokenizer-path <TOKENIZER_PATH>
Path to content tokenizer [default: ./tokenizer.json]
-m, --model-path <MODEL_PATH>
Path to ONNX model used for embeddings [default: ./model.onnx]
-c, --chunk-size <CHUNK_SIZE>
Maximum text chunk size when splitting up content. Documents are not updated if chunk size changes [default: 500]
-h, --help
Print help
-V, --version
Print version
Scan a Calibre library
This can take a long time, depending on the size of the library. It will use all available CPU cores.
Usage: little-librarian scan --library-path <LIBRARY_PATH>
Options:
-l, --library-path <LIBRARY_PATH> Root directory of calibre library
-h, --help Print help
-V, --version Print version
little-librarian scan --library-path /path/to/calibre/library
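As noted above, scanning fans out across every available CPU core. A minimal sketch of that pattern using standard-library scoped threads (illustrative only; the function names and the stand-in "work" are assumptions for the example, not the project's actual code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Process `items` with one worker thread per available core,
/// returning how many items were handled.
fn process_parallel(items: &[&str]) -> usize {
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let next = AtomicUsize::new(0); // index of the next item to claim
    let done = AtomicUsize::new(0); // count of finished items

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Each worker claims the next unprocessed item.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= items.len() {
                    break;
                }
                // Stand-in for the real work (text extraction + embedding).
                let _ = items[i].len();
                done.fetch_add(1, Ordering::Relaxed);
            });
        }
    });

    done.load(Ordering::Relaxed)
}

fn main() {
    // All three items get processed, regardless of core count.
    assert_eq!(process_parallel(&["one.epub", "two.pdf", "three.epub"]), 3);
}
```

Because every core is kept busy, scan time shrinks with more cores but still scales with the size of the library.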
Search for books
Usage: little-librarian ask [OPTIONS] --query <QUERY>
Options:
-q, --query <QUERY> Query to run
-l, --limit <LIMIT> Maximum number of results [default: 5]
-r, --reranking-path <RERANKING_PATH> Path to reranking model [default: ./reranking.onnx]
-h, --help Print help
-V, --version Print version
little-librarian ask --query "books about artificial intelligence ethics"