# Little Librarian

Semantic search for [Calibre](https://calibre-ebook.com/) libraries using vector embeddings.

## Inner Workings

1. **Extract** text content from EPUB and PDF files
2. **Chunk** text into manageable pieces (500 words by default)
3. **Embed** each chunk using your ONNX model
4. **Store** embeddings in PostgreSQL with pgvector
5. **Search** by converting queries to embeddings and finding similar chunks
6. **Rerank** results using a cross-encoder model

## Shortcomings

This is an experiment and as such has some quite rough edges, including but not limited to:

- Text chunking is naive; no idea if it even works passably on non-Latin scripts
- PDF text extraction is hit-and-miss
- CPU only (this is also a strength, but at least the possibility of using a GPU would be nice)

## Requirements

All dependencies for running the application are specified in the [flake.nix](./flake.nix) file within the project folder (look for the key `buildInputs`, though [rustup](https://rustup.rs/) might just work with the provided toolchain file). My recommendation is to use the provided [nix flake](https://nixos.wiki/wiki/Flakes) directly, as it sets up everything automatically.

Additional dependencies:

- [PostgreSQL](https://www.postgresql.org/) with the [pgvector extension](https://github.com/pgvector/pgvector)
- ONNX embedding model (e.g. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2))
- Tokenizer file (JSON format, from the embedding model)
- ONNX reranking model (e.g. [bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base))

## Usage

Use `cargo run -- <args>` to run directly from source code, specifying the needed arguments in place of `<args>`.

### Migrations

Either use [sqlx-cli](https://crates.io/crates/sqlx-cli) with `sqlx migrate run` or apply the [migrations](./migrations/) manually.
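With sqlx-cli installed, applying the migrations typically looks like this. The connection string below is only an example; sqlx-cli reads it from the `DATABASE_URL` environment variable (or a `--database-url` flag):

```shell
# Example connection string only; point this at your own database.
export DATABASE_URL="postgres://user:password@localhost/librarian"

# Applies any pending SQL files from ./migrations, in order.
sqlx migrate run
```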
### General

```
Usage: little-librarian [OPTIONS] <COMMAND>

Commands:
  ask   Configuration for the ask command
  scan  Configuration for the scan command
  help  Print this message or the help of the given subcommand(s)

Options:
  -d, --db-url <DB_URL>                  Where to read the database connection string from ('-' is stdin) [default: -]
  -t, --tokenizer-path <TOKENIZER_PATH>  Path to content tokenizer [default: ./tokenizer.json]
  -m, --model-path <MODEL_PATH>          Path to ONNX model used for embeddings [default: ./model.onnx]
  -c, --chunk-size <CHUNK_SIZE>          Maximum text chunk size when splitting up content. Documents are not updated if chunk size changes [default: 500]
  -h, --help                             Print help
  -V, --version                          Print version
```

### Scan a Calibre library

This can take a huge amount of time, depending on the size of the library. It will use all available CPU cores.

```
Usage: little-librarian scan --library-path <LIBRARY_PATH>

Options:
  -l, --library-path <LIBRARY_PATH>  Root directory of calibre library
  -h, --help                         Print help
  -V, --version                      Print version
```

```
little-librarian scan --library-path /path/to/calibre/library
```

### Search for books

```
Usage: little-librarian ask [OPTIONS] --query <QUERY>

Options:
  -q, --query <QUERY>                    Query to run
  -l, --limit <LIMIT>                    Maximum number of results [default: 5]
  -r, --reranking-path <RERANKING_PATH>  Path to reranking model [default: ./reranking.onnx]
  -h, --help                             Print help
  -V, --version                          Print version
```

```
little-librarian ask --query "books about artificial intelligence ethics"
```
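For reference, the word-based chunking described under Inner Workings can be sketched in a few lines of Rust. This is an illustration only, not the project's actual code; `chunk_words` and its signature are made up here, and the whitespace splitting also shows why chunking may behave poorly on scripts that don't separate words with spaces:

```rust
/// Split `text` into chunks of at most `chunk_size` whitespace-separated
/// words (the CLI's --chunk-size defaults to 500). Illustrative sketch only.
fn chunk_words(text: &str, chunk_size: usize) -> Vec<String> {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .chunks(chunk_size)
        .map(|words| words.join(" "))
        .collect()
}

fn main() {
    // With a chunk size of 2, five words become three chunks.
    let chunks = chunk_words("one two three four five", 2);
    assert_eq!(chunks, vec!["one two", "three four", "five"]);
}
```

Each resulting chunk would then be embedded and stored as one pgvector row, so the chunk size trades off retrieval granularity against the number of embeddings computed.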