Little Librarian
Semantic search for Calibre libraries using vector embeddings.
Inner Workings
- Extract text content from EPUB and PDF files
- Chunk text into manageable pieces (500 words by default)
- Embed each chunk using your ONNX model
- Store embeddings in PostgreSQL with pgvector
- Search by converting queries to embeddings and finding similar chunks
- Rerank results using a cross-encoder model
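The chunking and similarity steps above can be sketched roughly as follows. This is an illustrative sketch only, not the project's actual code: the real pipeline embeds chunks with an ONNX model and does similarity search inside PostgreSQL via pgvector, while this toy version uses hand-made vectors.

```rust
/// Split text into chunks of at most `chunk_size` whitespace-separated words
/// (a deliberately naive scheme, as noted in the shortcomings below).
fn chunk_words(text: &str, chunk_size: usize) -> Vec<String> {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .chunks(chunk_size)
        .map(|c| c.join(" "))
        .collect()
}

/// Cosine similarity between two embedding vectors
/// (the kind of distance pgvector computes server-side).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Five words in chunks of two -> three chunks.
    let chunks = chunk_words("alpha beta gamma delta epsilon", 2);
    assert_eq!(chunks.len(), 3);

    // Identical vectors are maximally similar, orthogonal ones are not.
    assert!((cosine_similarity(&[1.0, 2.0], &[1.0, 2.0]) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
}
```

In the real application the query is embedded the same way as the chunks, nearest chunks are fetched from PostgreSQL, and a cross-encoder then reranks that shortlist.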
Shortcomings
This is an experiment and as such has some rough edges, including but not limited to:
- Text chunking is naive, and it is unclear whether it works passably on non-Latin scripts
- PDF text extraction is hit-and-miss
- CPU only (this is also a strength, but the option of GPU acceleration would be nice)
Requirements
All dependencies for running the application are specified in the flake.nix file within the project folder (look for the buildInputs key, though rustup might just work with the provided toolchain file). My recommendation is to use the provided nix flake directly, as it sets up everything automatically.
Additional dependencies:
- PostgreSQL with pgvector extension
- ONNX embedding model (e.g., all-MiniLM-L6-v2)
- Tokenizer file (JSON format, from the embedding model)
- ONNX reranking model (e.g. bge-reranker-base)
Usage
Use cargo run -- <ARGS> to run directly from source, specifying the needed arguments in place of <ARGS>.
Migrations
Either use sqlx-cli with sqlx migrate run, or apply the migrations manually.
General
Usage: little-librarian [OPTIONS] <COMMAND>
Commands:
ask Configuration for the ask command
scan Configuration for the scan command
help Print this message or the help of the given subcommand(s)
Options:
-d, --db-url <DB_URL>
Where to read the database connection string from ('-' is stdin) [default: -]
-t, --tokenizer-path <TOKENIZER_PATH>
Path to content tokenizer [default: ./tokenizer.json]
-m, --model-path <MODEL_PATH>
Path to ONNX model used for embeddings [default: ./model.onnx]
-c, --chunk-size <CHUNK_SIZE>
Maximum text chunk size when splitting up content. Documents are not updated if chunk size changes [default: 500]
-h, --help
Print help
-V, --version
Print version
Scan a Calibre library
This can take a long time, depending on the size of the library. It will use all available CPU cores.
Usage: little-librarian scan --library-path <LIBRARY_PATH>
Options:
-l, --library-path <LIBRARY_PATH> Root directory of calibre library
-h, --help Print help
-V, --version Print version
little-librarian scan --library-path /path/to/calibre/library
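As noted above, scanning fans out across every available CPU core. A minimal sketch of that pattern using standard-library scoped threads (illustrative only; the function names and the stand-in "work" are assumptions for the example, not the project's actual code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Process `items` with one worker thread per available core,
/// returning how many items were handled.
fn process_parallel(items: &[&str]) -> usize {
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let next = AtomicUsize::new(0); // index of the next item to claim
    let done = AtomicUsize::new(0); // count of finished items

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Each worker claims the next unprocessed item.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= items.len() {
                    break;
                }
                // Stand-in for the real work (text extraction + embedding).
                let _ = items[i].len();
                done.fetch_add(1, Ordering::Relaxed);
            });
        }
    });

    done.load(Ordering::Relaxed)
}

fn main() {
    // All three items get processed, regardless of core count.
    assert_eq!(process_parallel(&["one.epub", "two.pdf", "three.epub"]), 3);
}
```

Because every core is kept busy, scan time shrinks with more cores but still scales with the size of the library.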
Search for books
Usage: little-librarian ask [OPTIONS] --query <QUERY>
Options:
-q, --query <QUERY> Query to run
-l, --limit <LIMIT> Maximum number of results [default: 5]
-r, --reranking-path <RERANKING_PATH> Path to reranking model [default: ./reranking.onnx]
-h, --help Print help
-V, --version Print version
little-librarian ask --query "books about artificial intelligence ethics"