data

Utilities to aid data extraction and query preprocessing.

We will build a simple pipeline that ingests PDF documents into a litesearch database for searching.

Extensions to the pymupdf Document and Page classes for extracting text, images and links.


source

Document.ext_imgs

 Document.ext_imgs (st=0, end=-1)

source

Document.ext_im

 Document.ext_im (it=None)

source

Document.get_texts

 Document.get_texts (st=0, end=-1, **kw)

Code extraction utilities


source

pyparse

 pyparse (p:pathlib.Path=None, code:str=None, imports=False)

Parse a code string or Python file and return code chunks as a list of dicts with content and metadata.

Type Default Details
p Path None path to a python file
code str None code string to parse
imports bool False include import statements as code chunks
Returns L

You can use pyparse to extract code chunks from a python file or code string.

txt = """
from fastcore.all import *
a=1
class SomeClass:
    def __init__(self,x): store_attr()
    def method(self): return self.x + a
 """
pyparse(code=txt)
(#2) [{'content': 'a=1', 'metadata': {'path': None, 'uploaded_at': None, 'name': None, 'type': 'Assign', 'lineno': 3, 'end_lineno': 3}},{'content': 'class SomeClass:\n    def __init__(self,x): store_attr()\n    def method(self): return self.x + a', 'metadata': {'path': None, 'uploaded_at': None, 'name': 'SomeClass', 'type': 'ClassDef', 'lineno': 4, 'end_lineno': 6}}]

Setting imports to True will also include import statements as code chunks.

pyparse(code=txt, imports=True)
(#3) [{'content': 'from fastcore.all import *', 'metadata': {'path': None, 'uploaded_at': None, 'name': None, 'type': 'ImportFrom', 'lineno': 2, 'end_lineno': 2}},{'content': 'a=1', 'metadata': {'path': None, 'uploaded_at': None, 'name': None, 'type': 'Assign', 'lineno': 3, 'end_lineno': 3}},{'content': 'class SomeClass:\n    def __init__(self,x): store_attr()\n    def method(self): return self.x + a', 'metadata': {'path': None, 'uploaded_at': None, 'name': 'SomeClass', 'type': 'ClassDef', 'lineno': 4, 'end_lineno': 6}}]
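The chunk layout shown above (one chunk per top-level statement, with its node type and line span) can be approximated with the standard library `ast` module. This is only a sketch of the general technique, not litesearch's implementation: `parse_chunks` is a hypothetical helper, and it omits the `path` and `uploaded_at` fields that pyparse returns.

```python
import ast

def parse_chunks(code, imports=False):
    """Hypothetical helper: one chunk per top-level statement in `code`."""
    chunks = []
    for node in ast.parse(code).body:
        # Skip import statements unless the caller asks for them
        if not imports and isinstance(node, (ast.Import, ast.ImportFrom)):
            continue
        chunks.append({
            "content": ast.get_source_segment(code, node),
            "metadata": {
                "name": getattr(node, "name", None),  # only defs and classes have a name
                "type": type(node).__name__,
                "lineno": node.lineno,
                "end_lineno": node.end_lineno,
            },
        })
    return chunks

sample = "import os\na = 1\ndef f(x):\n    return x + a\n"
for chunk in parse_chunks(sample):
    meta = chunk["metadata"]
    print(meta["type"], meta["name"], meta["lineno"], meta["end_lineno"])
```

Because only the top-level body is walked, a class and its methods stay together in a single chunk, which matches the `SomeClass` output above.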

source

pkg2chunks

 pkg2chunks (pkg:str, imports:bool=False, **kw)

Return code chunks from a package with extra metadata.

Type Default Details
pkg str package name
imports bool False include import statements as code chunks
kw VAR_KEYWORD additional args to pass to pkg2files
Returns L

source

pkg2files

 pkg2files (pkg:str, file_glob:str='*.py', skip_file_glob:str='_*',
            skip_file_re='(^__init__\\.py$|^setup\\.py$|^conftest\\.py$|^test_.*\\.py$|^tests?\\.py$|^.*_test\\.py$)',
            skip_folder_re='(^tests?$|^__pycache__$|^\\.eggs$|^\\.mypy_cache$|^\\.tox$|^examples?$|^docs?$|^build$|^dist$|^\\.git$|^\\.ipynb_checkpoints$)',
            recursive:bool=True, symlinks:bool=True, file_re:str=None,
            folder_re:str=None, func:callable=<function join>,
            ret_folders:bool=False, sort:bool=True)

Return list of python files in a package excluding tests and setup files.

Type Default Details
pkg str package name
file_glob str *.py file glob to match
skip_file_glob str _* file glob to skip
skip_file_re str (^__init__\.py$|^setup\.py$|^conftest\.py$|^test_.*\.py$|^tests?\.py$|^.*_test\.py$) regex to skip files
skip_folder_re str (^tests?$|^__pycache__$|^\.eggs$|^\.mypy_cache$|^\.tox$|^examples?$|^docs?$|^build$|^dist$|^\.git$|^\.ipynb_checkpoints$) regex to skip folders
recursive bool True search subfolders
symlinks bool True follow symlinks?
file_re str None Only include files matching regex
folder_re str None Only enter folders matching regex
func callable join function to apply to each matched file
ret_folders bool False return folders, not just files
sort bool True sort files by name within each folder
Returns L list of matching files; arguments other than pkg are passed to globtastic
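To make the skip rules concrete, here is a rough standard-library sketch of the same filtering idea: keep `*.py` files, drop names starting with `_`, and prune folders and files that match the default regexes from the signature. `py_files` is a hypothetical stand-in written for illustration; pkg2files itself delegates to globtastic and may behave differently in details.

```python
import os
import re
import tempfile
from pathlib import Path

# Defaults copied from the pkg2files signature
SKIP_FILE_RE = r"(^__init__\.py$|^setup\.py$|^conftest\.py$|^test_.*\.py$|^tests?\.py$|^.*_test\.py$)"
SKIP_FOLDER_RE = (r"(^tests?$|^__pycache__$|^\.eggs$|^\.mypy_cache$|^\.tox$|^examples?$"
                  r"|^docs?$|^build$|^dist$|^\.git$|^\.ipynb_checkpoints$)")

def py_files(root, skip_file_re=SKIP_FILE_RE, skip_folder_re=SKIP_FOLDER_RE):
    """Hypothetical helper: walk `root` and return the .py files that survive the skip rules."""
    skip_file, skip_folder = re.compile(skip_file_re), re.compile(skip_folder_re)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into skipped folders
        dirnames[:] = sorted(d for d in dirnames if not skip_folder.search(d))
        for name in sorted(filenames):
            if name.endswith(".py") and not name.startswith("_") and not skip_file.search(name):
                found.append(os.path.join(dirpath, name))
    return found

# Build a throwaway package layout to try the filter on
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["core.py", "__init__.py", "_private.py", "test_core.py",
                "tests/test_x.py", "sub/utils.py"]:
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("")
    kept = [os.path.relpath(p, tmp) for p in py_files(tmp)]
    print(kept)
```

Only `core.py` and the file under `sub` should remain: the dunder and underscore-prefixed files, the test module and the whole `tests` folder are filtered out.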

pkg2chunks can be used to extract code chunks from an entire package installed in your environment.

chunks=pkg2chunks('fastlite')
chunks.filter(lambda d: d['metadata']['type']=='FunctionDef')[0]
{'content': 'def t(self:Database): return _TablesGetter(self)',
 'metadata': {'path': '/Users/71293/code/litesearch/.venv/lib/python3.13/site-packages/fastlite/core.py',
  'uploaded_at': 1752468812.9739048,
  'name': 't',
  'type': 'FunctionDef',
  'lineno': 44,
  'end_lineno': 44,
  'package': 'fastlite',
  'version': '0.2.1'}}
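The `package` and `version` keys in the metadata above are the extra fields that pkg2chunks adds on top of what pyparse returns. One plausible way to attach them, assuming the version is looked up with `importlib.metadata`, is sketched below. `with_pkg_meta` is a hypothetical name, and how litesearch actually does this is not shown in this page.

```python
from importlib.metadata import PackageNotFoundError, version

def with_pkg_meta(chunks, pkg):
    """Hypothetical helper: return copies of `chunks` with package name and version added."""
    try:
        ver = version(pkg)
    except PackageNotFoundError:
        ver = None  # package is not installed in this environment
    return [
        {**chunk, "metadata": {**chunk["metadata"], "package": pkg, "version": ver}}
        for chunk in chunks
    ]

chunk = {"content": "a=1", "metadata": {"name": None, "type": "Assign"}}
print(with_pkg_meta([chunk], "fastlite")[0]["metadata"])
```

The input chunks are left unmodified, so the same parsed chunks can be reused.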

source

installed_packages

 installed_packages (nms:list=None)

Return list of installed packages. If nms is provided, return only those packages.

Type Default Details
nms list None list of package names
Returns L

Use installed_packages to list the packages installed in your environment. If you pass a list of package names, only the names that are actually installed are returned.

installed_packages(['fstlite']) # nonexistent package
installed_packages(['fastlite']) # existing package
installed_packages() # all installed packages that are not stdlib
(#179) ['litesearch','shellingham','jiter','ipykernel','simsimd','threadpoolctl','coloredlogs','uri-template','humanfriendly','socksio','rfc3339-validator','pexpect','jupyterlab-quarto','fqdn','requests','babel','rich','traitlets','tokenizers','urllib3'...]
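For comparison, a similar listing can be produced with the standard library alone. The sketch below assumes that reading distribution names through `importlib.metadata` is an acceptable approximation. `list_installed` is a hypothetical helper, not the litesearch function, and it returns a plain sorted list rather than an `L`.

```python
from importlib.metadata import distributions

def list_installed(names=None):
    """Hypothetical helper: installed distribution names, optionally limited to `names`."""
    all_names = (dist.metadata.get("Name") for dist in distributions())
    found = sorted({name for name in all_names if name})
    if names is None:
        return found
    wanted = {name.lower() for name in names}
    # Keep only the requested names that are actually installed
    return [name for name in found if name.lower() in wanted]

print(list_installed(["fstlite"]))   # a misspelled name is simply dropped
print(len(list_installed()))         # number of installed distributions
```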

Query Preprocessing utilities


source

pre

 pre (q:str, wc=True, wide=True, extract_kw=True)

Preprocess the query for fts search.

Type Default Details
q str query to be passed for fts search
wc bool True add wild card to each word
wide bool True widen the query with OR operator
extract_kw bool True extract keywords from the query

source

kw

 kw (q:str)

Extract keywords from the query using YAKE library.

Type Details
q str query to be passed for fts search

source

mk_wider

 mk_wider (q:str)

Widen the query by joining words with OR operator.

Type Details
q str query to be passed for fts search

source

add_wc

 add_wc (q:str)

Add a wildcard (*) to each word in the query.

Type Details
q str query to be passed for fts search

source

clean

 clean (q:str)

Clean the query by removing * characters and returning None for empty queries.

Type Details
q str query to be passed for fts search

You can clean queries passed to fts search with clean, add wildcards with add_wc, widen the query with mk_wider, and extract keywords with kw. The pre function combines all of these steps.

q = 'This is a sample query'
print('preprocessed q with defaults: `%s`' %pre(q))
print('keywords extracted: `%s`' %pre(q, wc=False, wide=False))
print('q with wild card: `%s`' %pre(q, extract_kw=False, wide=False, wc=True))
preprocessed q with defaults: `query* OR sample*`
keywords extracted: `query sample`
q with wild card: `This* is* a* sample* query*`
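The wildcard and widening steps are simple string transforms, and the sketch below reproduces the outputs above for them. It is an illustration under assumptions, not the library code: the function names are hypothetical, the order of the steps is a guess, and keyword extraction is left out because it depends on YAKE.

```python
def add_wildcards(q):
    """Append * to every word so each term becomes a prefix match."""
    return " ".join(f"{word}*" for word in q.split())

def widen(q):
    """Join the words with OR so a document can match any one of them."""
    return " OR ".join(q.split())

def preprocess(q, wc=True, wide=True):
    """Hypothetical stand-in for pre, without the keyword extraction step."""
    q = q.replace("*", "").strip()  # assumed cleaning step: drop user-supplied wildcards
    if not q:
        return None
    if wc:
        q = add_wildcards(q)
    if wide:
        q = widen(q)
    return q

print(preprocess("This is a sample query", wide=False))
print(preprocess("query sample"))
```

Widening trades precision for recall: with OR, a document that contains only one of the terms can still match.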