# Little Librarian
Semantic search for [Calibre](https://calibre-ebook.com/) libraries using vector
embeddings.
## Inner Workings
1. **Extract** text content from EPUB and PDF files
2. **Chunk** text into manageable pieces (500 words by default)
3. **Embed** each chunk using your ONNX model
4. **Store** embeddings in PostgreSQL with pgvector
5. **Search** by converting queries to embeddings and finding similar chunks
6. **Rerank** results using a cross-encoder model
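
The pipeline above can be sketched in Rust. This is a simplified illustration, not the project's actual code: `chunk_text` mimics the naive word-based chunking of step 2, and `cosine_similarity` shows the kind of vector comparison that pgvector's cosine-distance operator (`<=>`) is built on in step 5. Both function names are made up for this sketch.

```rust
/// Split text into chunks of at most `chunk_size` whitespace-separated
/// words (step 2). `chunk_size` must be greater than zero.
pub fn chunk_text(text: &str, chunk_size: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .chunks(chunk_size)
        .map(|chunk| chunk.join(" "))
        .collect()
}

/// Cosine similarity between two embedding vectors; pgvector's `<=>`
/// operator computes the corresponding cosine *distance* (step 5).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```

In the real pipeline the similarity comparison happens inside PostgreSQL rather than in application code, which is what makes searching large libraries feasible.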
## Shortcomings
This is an experiment and as such has some fairly rough edges, including but not
limited to:
- Text chunking is naive; no idea if it even works passably on non-Latin scripts
- PDF text extraction is hit-and-miss
- CPU only (this is also a strength, but at least the option of using a GPU
  would be nice)
## Requirements
All dependencies for running the application are specified in the
[flake.nix](./flake.nix) file within the project folder (look for the key
`buildInputs`, though [rustup](https://rustup.rs/) might just work with the
provided toolchain file). My recommendation is to use the provided
[nix flake](https://nixos.wiki/wiki/Flakes) directly, as it sets up everything
automatically.
Additional dependencies:
- [PostgreSQL](https://www.postgresql.org/) with
[pgvector extension](https://github.com/pgvector/pgvector)
- ONNX embedding model (e.g.,
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2))
- Tokenizer file (JSON format, from the embedding model)
- ONNX reranking model (e.g.,
[bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base))
## Usage
Use `cargo run -- <ARGS>` to run directly from source code and specify needed
arguments in place of `<ARGS>`.
### Migrations
Either use [sqlx-cli](https://crates.io/crates/sqlx-cli) with `sqlx migrate run`
or apply the [migrations](./migrations/) manually.
### General
```
Usage: little-librarian [OPTIONS] <COMMAND>

Commands:
  ask   Configuration for the ask command
  scan  Configuration for the scan command
  help  Print this message or the help of the given subcommand(s)

Options:
  -d, --db-url <DB_URL>
          Where to read the database connection string from ('-' is stdin) [default: -]
  -t, --tokenizer-path <TOKENIZER_PATH>
          Path to content tokenizer [default: ./tokenizer.json]
  -m, --model-path <MODEL_PATH>
          Path to ONNX model used for embeddings [default: ./model.onnx]
  -c, --chunk-size <CHUNK_SIZE>
          Maximum text chunk size when splitting up content. Documents are not updated if chunk size changes [default: 500]
  -h, --help
          Print help
  -V, --version
          Print version
```
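As the help text notes, `--db-url` accepts `-` to mean "read the connection string from stdin", which keeps credentials out of shell history. A minimal sketch of that convention (the function name and signature are invented for illustration; this is not the project's actual code):

```rust
use std::io::Read;

/// Resolve the database connection string: a literal value is used as-is,
/// while `-` means "read it from the given reader" (stdin in practice).
/// Illustrates the `--db-url` convention; names are hypothetical.
pub fn resolve_db_url(arg: &str, input: &mut impl Read) -> std::io::Result<String> {
    if arg == "-" {
        let mut buf = String::new();
        input.read_to_string(&mut buf)?;
        Ok(buf.trim().to_string())
    } else {
        Ok(arg.to_string())
    }
}
```

With this convention, something like `pass show library-db | little-librarian scan ...` can supply the connection string without it ever appearing on the command line.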
### Scan a Calibre library
This can take a huge amount of time, depending on the size of the library. It
will use up all available CPU cores.
```
Usage: little-librarian scan --library-path <LIBRARY_PATH>

Options:
  -l, --library-path <LIBRARY_PATH>  Root directory of calibre library
  -h, --help                         Print help
  -V, --version                      Print version
```
```
little-librarian scan --library-path /path/to/calibre/library
```
### Search for books
```
Usage: little-librarian ask [OPTIONS] --query <QUERY>

Options:
  -q, --query <QUERY>                    Query to run
  -l, --limit <LIMIT>                    Maximum number of results [default: 5]
  -r, --reranking-path <RERANKING_PATH>  Path to reranking model [default: ./reranking.onnx]
  -h, --help                             Print help
  -V, --version                          Print version
```
```
little-librarian ask --query "books about artificial intelligence ethics"
```
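
The rerank step listed under Inner Workings re-orders the initial vector-search hits using cross-encoder scores before applying `--limit`. A minimal sketch of that final ordering step, assuming the cross-encoder has already produced a relevance score per candidate (the function name and tuple layout are illustrative, not the project's actual code):

```rust
/// Re-order candidate chunks by a cross-encoder relevance score and keep
/// the top `limit` results. Scores are assumed to come from an ONNX
/// cross-encoder run elsewhere; this only performs the sort-and-truncate.
pub fn rerank(mut candidates: Vec<(String, f32)>, limit: usize) -> Vec<(String, f32)> {
    candidates.sort_by(|a, b| {
        // Descending by score; treat NaN comparisons as equal.
        b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
    });
    candidates.truncate(limit);
    candidates
}
```

Because the cross-encoder sees the query and chunk text together, its ordering is usually more accurate than raw embedding distance, at the cost of running the model once per candidate.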