data

Building blocks for ingesting and querying data with thedu, plus some additional utilities.

We will build a simple pipeline that ingests PDF documents into a litesearch database so they can be searched.


source

read_pdf

 read_pdf (pth:str|os.PathLike)

Read a PDF file and return a list of page data.

| | Type | Details |
|---|---|---|
| pth | str \| os.PathLike | path to PDF file |
| Returns | AttrDictDefault | |

Some utilities for PDF processing:


source

pymupdf2txt

 pymupdf2txt (doc)

source

Database.pdf_ingest

 Database.pdf_ingest (pdf_doc:dict|os.PathLike,
                      chunk_embed_pipe:chonkie.pipeline.pipeline.Pipeline=
                      None, tbl:str='content')

Ingest PDF documents into thedu.

| | Type | Default | Details |
|---|---|---|---|
| pdf_doc | dict \| os.PathLike | | a pdf document or path. Use read_pdf to read from path |
| chunk_embed_pipe | Pipeline | None | chunking and embedding pipeline. If None, use default chonkie pipeline |
| tbl | str | content | content table name |
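
To make the ingest flow more concrete, here is a rough standard-library sketch of the general idea: split each page's text into chunks and store them in a `content` table. It is not the litesearch implementation and uses neither chonkie nor embeddings. The names `chunk_words` and `ingest_pages`, the fixed word-count chunking, and the `page`/`chunk` columns are all assumptions made for illustration.

```python
import sqlite3

def chunk_words(text, size=50):
    """Split text into chunks of at most `size` words (stand-in for a real chunker)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def ingest_pages(con, pages, tbl="content", size=50):
    """Store the chunks of each page, keeping the 1-based page number as metadata."""
    # Hypothetical schema: the real content table may have different columns.
    con.execute(f'CREATE TABLE IF NOT EXISTS "{tbl}" (page INTEGER, chunk TEXT)')
    rows = [
        (page_no, chunk)
        for page_no, text in enumerate(pages, start=1)
        for chunk in chunk_words(text, size)
    ]
    con.executemany(f'INSERT INTO "{tbl}" (page, chunk) VALUES (?, ?)', rows)
    return len(rows)

con = sqlite3.connect(":memory:")
pages = ["first page " * 30, "second page text"]  # 60 words and 3 words
n = ingest_pages(con, pages, size=25)
print(n)  # 4 chunks: 3 from the first page, 1 from the second
```

A real pipeline would also attach an embedding to each chunk, which is the role the `chunk_embed_pipe` argument plays above.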

source

pdf_pipe

 pdf_pipe ()

Return the default chunking and embedding pipeline.


source

pre

 pre (q:str, wc=True, wide=True, extract_kw=True)

Preprocess the query for fts search.

| | Type | Default | Details |
|---|---|---|---|
| q | str | | query to be passed for fts search |
| wc | bool | True | add wild card to each word |
| wide | bool | True | widen the query with OR operator |
| extract_kw | bool | True | extract keywords from the query |
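
The three flags can be read as optional stages applied to the query in turn. The sketch below shows one plausible way such stages compose. It is an illustration, not the library's code: `pre_sketch` is a hypothetical name, the order of the stages is an assumption, and a small stopword filter stands in for the YAKE-based keyword extraction.

```python
import re

# Tiny stand-in stopword list; the documented keyword step uses YAKE instead.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "for",
             "to", "how", "what", "do", "i", "and", "with"}

def pre_sketch(q, wc=True, wide=True, extract_kw=True):
    """Hypothetical re-creation of the documented flags, for illustration only."""
    # Replace punctuation (including any existing '*') with spaces.
    q = re.sub(r"[^\w\s]", " ", q or "").strip()
    if not q:
        return None  # nothing left to search for
    words = q.split()
    if extract_kw:
        # Stand-in for keyword extraction: drop stopwords, keep all if none survive.
        words = [w for w in words if w.lower() not in STOPWORDS] or words
    if wc:
        words = [w + "*" for w in words]  # prefix-match each word
    return " OR ".join(words) if wide else " ".join(words)

print(pre_sketch("how to search the pdf documents"))
# search* OR pdf* OR documents*
```

Turning all three flags off leaves only the cleaning step, so `pre_sketch("pdf search", wc=False, wide=False, extract_kw=False)` returns the query unchanged.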

source

kw

 kw (q:str)

Extract keywords from the query using YAKE library.

| | Type | Details |
|---|---|---|
| q | str | query to be passed for fts search |

source

mk_wider

 mk_wider (q:str)

Widen the query by joining words with OR operator.

| | Type | Details |
|---|---|---|
| q | str | query to be passed for fts search |
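
Why widening matters is easiest to see against a real full-text index. The sketch below assumes the FTS backend follows SQLite FTS5 query syntax, where space-separated terms are implicitly ANDed. It uses only the standard-library `sqlite3` module and requires an SQLite build with FTS5 enabled. The `docs` table and the `widen` and `count_matches` helpers are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
con.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [("vector search with embeddings",), ("parsing pdf pages",), ("pdf vector index",)],
)

def count_matches(query):
    """Count rows whose body matches the FTS5 query string."""
    return con.execute(
        "SELECT count(*) FROM docs WHERE docs MATCH ?", (query,)
    ).fetchone()[0]

def widen(q):
    """Hypothetical equivalent of the documented behaviour: join words with OR."""
    return " OR ".join(q.split())

narrow = count_matches("pdf vector")        # implicit AND: both terms required
wide = count_matches(widen("pdf vector"))   # OR: either term is enough
print(narrow, wide)  # 1 3
```

The widened query trades precision for recall, which is usually acceptable when results are ranked afterwards.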

source

add_wc

 add_wc (q:str)

Add a wildcard to each word in the query.

| | Type | Details |
|---|---|---|
| q | str | query to be passed for fts search |
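
As an illustration of what a trailing wildcard changes, the sketch below runs an exact and a prefix query against a small index. It assumes SQLite FTS5 query syntax, where `term*` is a prefix query, and requires an SQLite build with FTS5 enabled. The `docs` table and the `with_wildcards` and `bodies` helpers are hypothetical, not the library's code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
con.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [("searching pdf files",), ("search engines",), ("research notes",)],
)

def with_wildcards(q):
    """Hypothetical equivalent of the documented behaviour: append '*' to each word."""
    return " ".join(w + "*" for w in q.split())

def bodies(query):
    """Return matching bodies in insertion order."""
    sql = "SELECT body FROM docs WHERE docs MATCH ? ORDER BY rowid"
    return [row[0] for row in con.execute(sql, (query,))]

exact = bodies("search")                   # only the exact token "search"
prefix = bodies(with_wildcards("search"))  # any token starting with "search"
print(exact)   # ['search engines']
print(prefix)  # ['searching pdf files', 'search engines']
```

Note that the wildcard matches prefixes only, so "research" is not returned by either query.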

source

clean

 clean (q:str)

Clean the query by removing * characters and returning None for empty queries.

| | Type | Details |
|---|---|---|
| q | str | query to be passed for fts search |
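
A minimal sketch of this kind of cleaning is shown below. It assumes the characters to strip are `*` wildcards. `clean_sketch` is a hypothetical name, and the whitespace collapsing is an extra step added here for tidiness.

```python
def clean_sketch(q):
    """Drop '*' characters, collapse whitespace, and return None if nothing is left."""
    if q is None:
        return None
    # Replace wildcards with spaces, then normalise runs of whitespace.
    q = " ".join(q.replace("*", " ").split())
    return q or None

print(clean_sketch("pdf*  search*"))  # pdf search
print(clean_sketch(" * "))            # None
```

Returning None rather than an empty string lets callers skip the search entirely with a simple `if q is None` check.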