Some utilities to aid data extraction and query preprocessing.

We will build a simple ingestion pipeline to ingest PDF documents into a litesearch database for searching.

Extensions to the pymupdf Document and Page classes to extract text, images and links.
Return a list of installed packages. If `nms` is provided, return only those packages.
|  | Type | Default | Details |
|---|---|---|---|
| nms | list | None | list of package names |
| **Returns** | **L** |  |  |
Get the list of installed packages in your environment using `installed_packages`. If you pass a list of package names, it returns only those that exist in your environment.
```python
installed_packages(['fstlite'])   # non-existent package
installed_packages(['fastlite'])  # existing package
installed_packages()              # all installed packages that are not stdlib
```
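A minimal sketch of how such a helper could be written on top of the standard library's `importlib.metadata`; this is an illustrative assumption, not the library's actual implementation (which returns a fastcore `L` rather than a plain list):

```python
from importlib.metadata import distributions

def installed_packages(nms=None):
    # Collect the names of all installed distributions. Stdlib modules are
    # not distributions, so they are excluded automatically.
    pkgs = {d.metadata['Name'] for d in distributions()}
    if nms is None:
        return sorted(pkgs)
    # Keep only the requested names that actually exist in the environment.
    return [n for n in nms if n in pkgs]
```

Looking up installed distributions (rather than trying to import each name) avoids side effects from package import code.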
Clean the query, returning None for empty queries.
|  | Type | Details |
|---|---|---|
| q | str | query to be passed for fts search |
You can clean queries passed into fts search using `clean`, add wildcards using `add_wc`, widen the query using `mk_wider`, and extract keywords using `kw`. You can combine all of these using the `pre` function.
```python
q = 'This is a sample query'
print('preprocessed q with defaults: `%s`' % pre(q))
print('keywords extracted: `%s`' % pre(q, wc=False, wide=False))
print('q with wild card: `%s`' % pre(q, extract_kw=False, wide=False, wc=True))
```
```
preprocessed q with defaults: `query* OR sample*`
keywords extracted: `query sample`
q with wild card: `This* is* a* sample* query*`
```
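The pipeline above can be sketched with pure-Python stand-ins for `clean`, `kw`, `add_wc`, `mk_wider` and `pre`. These are illustrative assumptions (the stop-word list is minimal and made up, and the library evidently orders keywords differently, e.g. `query* OR sample*` vs `sample* OR query*`), not the actual implementations:

```python
import re

STOPWORDS = {'this', 'is', 'a', 'an', 'the', 'of', 'for'}  # illustrative stop list

def clean(q):
    # Strip punctuation, collapse whitespace; return None for empty queries.
    q = re.sub(r'[^\w\s]', ' ', q).strip()
    return q if q else None

def kw(q):
    # Keep non-stop-word tokens as keywords.
    return ' '.join(w for w in q.split() if w.lower() not in STOPWORDS)

def add_wc(q):
    # Append a prefix wildcard to every token for fts matching.
    return ' '.join(w + '*' for w in q.split())

def mk_wider(q):
    # Widen the query by OR-ing its tokens together.
    return ' OR '.join(q.split())

def pre(q, extract_kw=True, wc=True, wide=True):
    # Compose the preprocessing steps; each can be toggled off.
    q = clean(q)
    if q is None:
        return None
    if extract_kw:
        q = kw(q)
    if wc:
        q = add_wc(q)
    if wide:
        q = mk_wider(q)
    return q
```

With this sketch, `pre('This is a sample query')` yields `'sample* OR query*'`, matching the shape of the output above up to keyword ordering.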