# Little Librarian

Semantic search for [Calibre](https://calibre-ebook.com/) libraries using vector
embeddings.

## Inner Workings

1. **Extract** text content from EPUB and PDF files
2. **Chunk** text into manageable pieces (500 words by default; see the sketch
   after this list)
3. **Embed** each chunk using your ONNX model
4. **Store** embeddings in PostgreSQL with pgvector
5. **Search** by converting queries to embeddings and finding similar chunks
6. **Rerank** results using a cross-encoder model

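Step 2 splits text by word count (500 words by default). The snippet below is a
minimal sketch of that idea; the function name and signature are illustrative
only and are not the actual code in this repository.

```rust
/// Split extracted text into chunks of at most `chunk_size` whitespace-separated
/// words. Illustrative sketch, not Little Librarian's real implementation.
fn chunk_text(text: &str, chunk_size: usize) -> Vec<String> {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .chunks(chunk_size)
        .map(|words| words.join(" "))
        .collect()
}

fn main() {
    // Nine words with a chunk size of four yields chunks of 4, 4, and 1 words.
    let chunks = chunk_text("the quick brown fox jumps over the lazy dog", 4);
    assert_eq!(chunks.len(), 3);
}
```
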
## Shortcomings

This is an experiment and as such has some quite rough edges, including but not
limited to:

- Text chunking is naive; no idea if it even works passably on non-Latin scripts
- PDF text extraction is hit-and-miss
- CPU only (this is also a strength, but at least the possibility of using a GPU
  would be nice)

## Requirements

All dependencies for running the application are specified in the
[flake.nix](./flake.nix) file within the project folder (look for the key
`buildInputs`, though [rustup](https://rustup.rs/) might just work with the
provided toolchain file). My recommendation is to use the provided
[nix flake](https://nixos.wiki/wiki/Flakes) directly, as it sets up everything
automatically.

Additional dependencies:

- [PostgreSQL](https://www.postgresql.org/) with the
  [pgvector extension](https://github.com/pgvector/pgvector)
- ONNX embedding model (e.g.,
  [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2))
- Tokenizer file (JSON format, from the embedding model)
- ONNX reranking model (e.g.,
  [bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base))

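At query time, the pgvector dependency is what step 5 leans on: similarity search
is essentially one nearest-neighbour `SELECT`. The sketch below shows the general
shape using the [pgvector](https://crates.io/crates/pgvector) crate's sqlx
integration; the connection string, table name, and column names are placeholders
and do not necessarily match this project's schema (see
[migrations](./migrations/) for the real one).

```rust
use pgvector::Vector;
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // Placeholder connection string.
    let pool = PgPoolOptions::new()
        .connect("postgres://localhost/little_librarian")
        .await?;

    // Stand-in for the query embedding produced by the ONNX model in step 3.
    let query_embedding = Vector::from(vec![0.1, 0.2, 0.3]);

    // `<=>` is pgvector's cosine distance operator; `chunks`, `content`, and
    // `embedding` are hypothetical names used only for this sketch.
    let rows: Vec<(String,)> = sqlx::query_as(
        "SELECT content FROM chunks ORDER BY embedding <=> $1 LIMIT 5",
    )
    .bind(query_embedding)
    .fetch_all(&pool)
    .await?;

    for (content,) in rows {
        println!("{content}");
    }
    Ok(())
}
```
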
## Usage

Use `cargo run -- <ARGS>` to run directly from source code and specify needed
arguments in place of `<ARGS>`.

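For example, to run the `ask` subcommand (documented below) straight from the
source tree:

```
cargo run -- ask --query "books about artificial intelligence ethics"
```
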
### Migrations

Either use [sqlx-cli](https://crates.io/crates/sqlx-cli) with `sqlx migrate run`
or apply the [migrations](./migrations/) manually.

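With sqlx-cli installed (e.g. via `cargo install sqlx-cli`), it picks up the
connection string from the `DATABASE_URL` environment variable; the database
name below is a placeholder:

```
DATABASE_URL=postgres://localhost/little_librarian sqlx migrate run
```
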
### General

```
Usage: little-librarian [OPTIONS] <COMMAND>

Commands:
  ask   Configuration for the ask command
  scan  Configuration for the scan command
  help  Print this message or the help of the given subcommand(s)

Options:
  -d, --db-url <DB_URL>
          Where to read the database connection string from ('-' is stdin) [default: -]
  -t, --tokenizer-path <TOKENIZER_PATH>
          Path to content tokenizer [default: ./tokenizer.json]
  -m, --model-path <MODEL_PATH>
          Path to ONNX model used for embeddings [default: ./model.onnx]
  -c, --chunk-size <CHUNK_SIZE>
          Maximum text chunk size when splitting up content. Documents are not updated if chunk size changes [default: 500]
  -h, --help
          Print help
  -V, --version
          Print version
```

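Because `--db-url` defaults to `-`, the connection string can simply be piped in
on stdin; the URL below is a placeholder:

```
echo "postgres://localhost/little_librarian" | little-librarian scan --library-path /path/to/calibre/library
```
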
### Scan a Calibre library

This can take a huge amount of time, depending on the size of the library. It
will use up all available CPU cores.

```
Usage: little-librarian scan --library-path <LIBRARY_PATH>

Options:
  -l, --library-path <LIBRARY_PATH>  Root directory of calibre library
  -h, --help                         Print help
  -V, --version                      Print version
```

```
little-librarian scan --library-path /path/to/calibre/library
```

### Search for books

```
Usage: little-librarian ask [OPTIONS] --query <QUERY>

Options:
  -q, --query <QUERY>                    Query to run
  -l, --limit <LIMIT>                    Maximum number of results [default: 5]
  -r, --reranking-path <RERANKING_PATH>  Path to reranking model [default: ./reranking.onnx]
  -h, --help                             Print help
  -V, --version                          Print version
```

```
little-librarian ask --query "books about artificial intelligence ethics"
```