
Little Librarian

Semantic search for Calibre libraries using vector embeddings.

Inner Workings

  1. Extract text content from EPUB and PDF files
  2. Chunk text into manageable pieces (500 words by default)
  3. Embed each chunk using your ONNX model
  4. Store embeddings in PostgreSQL with pgvector
  5. Search by converting queries to embeddings and finding similar chunks
  6. Rerank results using a cross-encoder model
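The chunking step above (step 2) amounts to splitting text on whitespace and grouping words. A minimal sketch in Rust, assuming a plain word-count splitter (the project's actual chunker may differ):

```rust
/// Split text into chunks of at most `chunk_size` whitespace-separated words.
/// Illustrative sketch only; not the project's actual implementation.
fn chunk_text(text: &str, chunk_size: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .chunks(chunk_size)
        .map(|chunk| chunk.join(" "))
        .collect()
}
```

Splitting on whitespace is also where the non-Latin-script caveat below comes from: scripts without word separators would end up in one oversized chunk.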

Shortcomings

This is an experiment and as such has some quite rough edges, including but not limited to:

  • Text chunking is naive; no idea if it even works passably on non-Latin scripts
  • PDF text extraction is hit-and-miss
  • CPU-only (this is also a strength, but the option of GPU acceleration would be nice)

Requirements

All dependencies for running the application are specified in the flake.nix file in the project folder (see the buildInputs key; rustup may also just work with the provided rust-toolchain.toml). My recommendation is to use the provided Nix flake directly, as it sets up everything automatically.

Usage

Use cargo run -- <ARGS> to run directly from source, replacing <ARGS> with the arguments described below.

Migrations

Either use sqlx-cli with sqlx migrate run or apply the migrations manually.

General

Usage: little-librarian [OPTIONS] <COMMAND>

Commands:
  ask   Configuration for the ask command
  scan  Configuration for the scan command
  help  Print this message or the help of the given subcommand(s)

Options:
  -d, --db-url <DB_URL>
          Where to read the database connection string from ('-' is stdin) [default: -]
  -t, --tokenizer-path <TOKENIZER_PATH>
          Path to content tokenizer [default: ./tokenizer.json]
  -m, --model-path <MODEL_PATH>
          Path to ONNX model used for embeddings [default: ./model.onnx]
  -c, --chunk-size <CHUNK_SIZE>
          Maximum text chunk size when splitting up content. Documents are not updated if chunk size changes [default: 500]
  -h, --help
          Print help
  -V, --version
          Print version

Scan a Calibre library

This can take a huge amount of time, depending on the size of the library. It will use all available CPU cores.

Usage: little-librarian scan --library-path <LIBRARY_PATH>

Options:
  -l, --library-path <LIBRARY_PATH>  Root directory of calibre library
  -h, --help                         Print help
  -V, --version                      Print version

Example:

little-librarian scan --library-path /path/to/calibre/library

Search for books

Usage: little-librarian ask [OPTIONS] --query <QUERY>

Options:
  -q, --query <QUERY>                    Query to run
  -l, --limit <LIMIT>                    Maximum number of results [default: 5]
  -r, --reranking-path <RERANKING_PATH>  Path to reranking model [default: ./reranking.onnx]
  -h, --help                             Print help
  -V, --version                          Print version

Example:

little-librarian ask --query "books about artificial intelligence ethics"
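Under the hood, ask embeds the query and lets pgvector find the nearest stored chunk embeddings (its <=> operator computes cosine distance in SQL). The underlying math is plain cosine similarity; a sketch, shown only to illustrate what "finding similar chunks" means:

```rust
/// Cosine similarity between two embedding vectors (1.0 = identical direction).
/// In practice pgvector computes this in SQL; this is just the math.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```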