data
Building blocks for ingesting and querying data with thedu, plus some additional utilities.
We will build a simple pipeline that ingests PDF documents into a litesearch database so they can be searched.
read_pdf
read_pdf (pth:str|os.PathLike)
Read a PDF file and return a list of page data.
| | Type | Details |
|---|---|---|
| pth | str \| os.PathLike | path to PDF file |
| Returns | AttrDictDefault | |
Some utilities for PDF processing.
pymupdf2txt
pymupdf2txt (doc)
Database.pdf_ingest
Database.pdf_ingest (pdf_doc:dict|os.PathLike, chunk_embed_pipe:chonkie.pipeline.pipeline.Pipeline= None, tbl:str='content')
Ingest PDF documents into thedu.
| | Type | Default | Details |
|---|---|---|---|
| pdf_doc | dict \| os.PathLike | | a pdf document or path. Use read_pdf to read from path |
| chunk_embed_pipe | Pipeline | None | chunking and embedding pipeline. If None, use default chonkie pipeline |
| tbl | str | content | content table name |
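To make the ingestion step concrete, here is a minimal standard-library sketch of the general idea: split each page into chunks and store them in a content table. It is not the implementation of `Database.pdf_ingest`. The fixed-size word chunker stands in for the chonkie pipeline, no embeddings are computed, and the names `chunk_words`, `ingest_pages` and the column layout are hypothetical.

```python
import sqlite3

def chunk_words(text: str, size: int = 50) -> list[str]:
    """Split text into chunks of at most `size` words (stand-in for a real chunker)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def ingest_pages(conn: sqlite3.Connection, pages: list[str], tbl: str = "content") -> int:
    """Store every chunk of every page in `tbl` and return the number of rows written."""
    # Hypothetical schema: page number, chunk index within the page, chunk text.
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{tbl}" (page INTEGER, chunk INTEGER, text TEXT)')
    rows = [
        (page_no, chunk_no, chunk)
        for page_no, page_text in enumerate(pages, start=1)
        for chunk_no, chunk in enumerate(chunk_words(page_text))
    ]
    conn.executemany(f'INSERT INTO "{tbl}" VALUES (?, ?, ?)', rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
# Page 1 has 120 words (three chunks), page 2 has two words (one chunk).
n_rows = ingest_pages(conn, ["first page " * 60, "second page"])
print(n_rows)  # 4
```

In the real pipeline the page texts would come from `read_pdf`, and each stored chunk would also carry its embedding.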
pdf_pipe
pdf_pipe ()
Return the default chunking and embedding pipeline.
pre
pre (q:str, wc=True, wide=True, extract_kw=True)
Preprocess the query for fts search.
| | Type | Default | Details |
|---|---|---|---|
| q | str | | query to be passed for fts search |
| wc | bool | True | add wild card to each word |
| wide | bool | True | widen the query with OR operator |
| extract_kw | bool | True | extract keywords from the query |
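The way the flags combine can be sketched as below. This is an illustrative re-implementation, not the source of `pre`: the function name `preprocess` is hypothetical, the order of the steps is an assumption, and the `extract_kw` step is left out to keep the sketch dependency-free.

```python
from typing import Optional

def preprocess(q: str, wc: bool = True, wide: bool = True) -> Optional[str]:
    """Turn a free-text query into an FTS-style match expression (sketch)."""
    # Clean: drop any wildcards typed by the user and collapse whitespace.
    words = q.replace("*", " ").split()
    if not words:
        return None  # nothing left to search for
    if wc:
        # Prefix matching: "pdf*" matches "pdf", "pdfs", ...
        words = [w + "*" for w in words]
    # Widening joins terms with OR; otherwise terms are space-separated.
    joiner = " OR " if wide else " "
    return joiner.join(words)

print(preprocess("pdf search"))                       # pdf* OR search*
print(preprocess("pdf search", wc=False))             # pdf OR search
print(preprocess("pdf search", wc=False, wide=False)) # pdf search
print(preprocess("   "))                              # None
```

Turning both flags on trades precision for recall, which tends to suit short, loosely worded queries.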
kw
kw (q:str)
Extract keywords from the query using YAKE library.
| | Type | Details |
|---|---|---|
| q | str | query to be passed for fts search |
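A keyword-extraction step of this kind might look like the sketch below. It calls YAKE's `KeywordExtractor` when the `yake` package is installed and otherwise falls back to a naive stop-word filter, so the example runs either way. The function name `extract_keywords`, the `top` parameter, the chosen YAKE settings and the stop-word list are assumptions for illustration, not what `kw` does internally.

```python
# Tiny stop-word list used only by the fallback path (illustrative, not exhaustive).
STOPWORDS = {"a", "an", "the", "how", "do", "i", "to", "into", "of", "in", "for", "and", "or"}

def extract_keywords(q: str, top: int = 5) -> list[str]:
    """Return up to `top` keywords from the query (sketch)."""
    try:
        import yake  # optional dependency
    except ImportError:
        # Fallback: keep the first `top` words that are not stop words.
        return [w for w in q.lower().split() if w not in STOPWORDS][:top]
    # n=1 restricts YAKE to single-word keywords; lower scores are more relevant.
    extractor = yake.KeywordExtractor(lan="en", n=1, top=top)
    return [keyword for keyword, _score in extractor.extract_keywords(q)]

keywords = extract_keywords("how do I ingest pdf documents into a search database")
print(keywords)
```

The extracted keywords can then be passed on to the widening and wildcard steps instead of the full question text.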
mk_wider
mk_wider (q:str)
Widen the query by joining words with OR operator.
| | Type | Details |
|---|---|---|
| q | str | query to be passed for fts search |
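Widening matters because SQLite FTS5 treats space-separated terms as an implicit AND, so every term has to match. Joining the terms with OR matches rows containing any of them. A minimal sketch of the idea, using the hypothetical name `widen` (not the source of `mk_wider`):

```python
def widen(q: str) -> str:
    """Join whitespace-separated terms with the OR operator."""
    return " OR ".join(q.split())

print(widen("pdf ingestion pipeline"))  # pdf OR ingestion OR pipeline
print(widen("pdf"))                     # pdf
```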
add_wc
add_wc (q:str)
Add a wildcard to each word in the query.
| | Type | Details |
|---|---|---|
| q | str | query to be passed for fts search |
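In FTS5 a trailing `*` turns a term into a prefix match, so `ingest*` also finds "ingesting" and "ingestion". The sketch below shows one way to add the wildcard. The name `add_wildcards` is hypothetical, and skipping the uppercase boolean operators is a design choice of this sketch rather than documented behaviour of `add_wc`.

```python
FTS_OPERATORS = {"AND", "OR", "NOT"}  # FTS5 boolean operators are uppercase

def add_wildcards(q: str) -> str:
    """Append '*' to every term, leaving operators and existing wildcards alone."""
    out = []
    for word in q.split():
        if word in FTS_OPERATORS or word.endswith("*"):
            out.append(word)
        else:
            out.append(word + "*")
    return " ".join(out)

print(add_wildcards("vector search"))   # vector* search*
print(add_wildcards("pdf OR ingest*"))  # pdf* OR ingest*
```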
clean
clean (q:str)
Clean the query by removing `*` and returning None for empty queries.
| | Type | Details |
|---|---|---|
| q | str | query to be passed for fts search |
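A cleaning step along these lines could be sketched as follows, assuming the characters being removed are `*` wildcards. The name `clean_query` and the whitespace normalisation are assumptions of this sketch, not a copy of `clean`.

```python
from typing import Optional

def clean_query(q: str) -> Optional[str]:
    """Strip '*' characters and return None when nothing searchable remains."""
    without_wildcards = q.replace("*", " ")
    normalized = " ".join(without_wildcards.split())  # collapse runs of whitespace
    return normalized or None

print(clean_query("pdf*  search"))  # pdf search
print(clean_query("  * "))          # None
```

Returning None rather than an empty string lets the caller skip the full-text search entirely instead of sending an empty match expression.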