str_matching

Utilities for finding the closest matches to a string in a list of strings.

fuzzy_match

fuzzy_match(
   query_string,
   candidate_strings,
   max_results,
   min_similarity,
   scorer
)

Find the closest fuzzy matches to a query string from a list of candidate strings using RapidFuzz.

This is a simplified wrapper around rapidfuzz.process.extract. It serves the same purpose as difflib.get_close_matches but is much faster.

Arguments:
- query_string (str): The string to match against the candidates.
- candidate_strings (Iterable[str]): The candidate strings to search.
- max_results (int): Maximum number of matches to return. Defaults to 8.
- min_similarity (float): Minimum similarity threshold (0.0-1.0). Defaults to 0.1.
- scorer (callable): Scoring function from rapidfuzz.fuzz. Defaults to rapidfuzz.fuzz.ratio. See rapidfuzz.fuzz for available scorers.

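Returns: List[Tuple[str, float, int]]: List of tuples containing (matched_string, score, index), where index is the position of the match in candidate_strings. In the example below, scores are on the 0 to 100 scale used by rapidfuzz.fuzz.ratio.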

fuzzy_match(
    "Apple Inc",
    ["Apple Inc", "Apple", "Google LLC", "Microsoft Corp"],
)

[('Apple Inc', 100.0, 0),
 ('Apple', 71.42857142857143, 1),
 ('Google LLC', 31.57894736842105, 2),
 ('Microsoft Corp', 8.695652173913048, 3)]
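For comparison, the standard-library function that fuzzy_match is described as replacing can be run on the same inputs. The sketch below uses only difflib, not fuzzy_match itself, so it does not show how the wrapper is implemented. Note that difflib.get_close_matches returns just the matched strings (no scores or indices) and takes its cutoff on a 0 to 1 scale.

```python
import difflib

candidates = ["Apple Inc", "Apple", "Google LLC", "Microsoft Corp"]

# n and cutoff play roles similar to max_results and min_similarity;
# the values mirror the documented defaults (8 and 0.1).
matches = difflib.get_close_matches("Apple Inc", candidates, n=8, cutoff=0.1)
print(matches)  # ['Apple Inc', 'Apple', 'Google LLC']
```

Here difflib drops "Microsoft Corp" because its ratio (2/23, about 0.087) falls below the 0.1 cutoff.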

get_vector_dist_matrix

get_vector_dist_matrix(vectors: list[list[float]], metric: str)

Calculate the pairwise distance matrix for a set of vectors.

Arguments:
- vectors (List[List[float]]): List of vectors to calculate distances for.
- metric (str): Distance metric to use. Defaults to "cosine". Options include "euclidean", "manhattan", "cosine", etc. See sklearn.metrics.pairwise_distances for more options.

Returns: np.ndarray: Distance matrix. Each element [i, j] represents the distance between vectors[i] and vectors[j].

vs = [
    [9,7,1,2,6],
    [1,8,3,3,2],
    [4,5,6,7,8]
]
get_vector_dist_matrix(vs)

array([[1.11022302e-16, 2.94916146e-01, 2.28848079e-01],
       [2.94916146e-01, 1.11022302e-16, 2.29985740e-01],
       [2.28848079e-01, 2.29985740e-01, 0.00000000e+00]])
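To make the numbers above easier to read, here is a dependency-free sketch of what the "cosine" metric computes: one minus the cosine of the angle between two vectors. The helpers cosine_distance and pairwise_cosine are hypothetical names used for illustration only; the documented function delegates to sklearn.metrics.pairwise_distances, which this sketch does not call.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pairwise_cosine(vectors):
    """Hypothetical helper: build the full symmetric distance matrix."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]

vs = [
    [9, 7, 1, 2, 6],
    [1, 8, 3, 3, 2],
    [4, 5, 6, 7, 8],
]
dist = pairwise_cosine(vs)
for row in dist:
    # Rounded for display; diagonal entries are zero up to floating-point error.
    print([round(d, 6) for d in row])
```

The off-diagonal values agree with the matrix shown above to the printed precision, and the tiny nonzero diagonal entries there (around 1e-16) are ordinary floating-point rounding rather than real distances.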

embedding_match

embedding_match(embedding_index, dist_matrix, num_matches)

Find the indices and distances of the closest matches to a given embedding. Use it in combination with get_vector_dist_matrix to find the closest embeddings.

Arguments:
- embedding_index (int): Index of the embedding to match.
- dist_matrix (np.ndarray): Pairwise distance matrix of embeddings. Computed using get_vector_dist_matrix.
- num_matches (int): Number of closest matches to return (excluding self). Defaults to 5.

Returns: Tuple[np.ndarray, np.ndarray]: Tuple of (indices of closest matches, distances to those matches).

from adulib.llm import async_batch_embeddings

docs = [
    "Apple Inc",
    "Apple",
    "Google LLC",
    "Microsoft Corp",
    "Apple Inc is a technology company.",
    "Google is a search engine.",
    "Microsoft develops software and hardware.",
    "Apple and Google are competitors in the tech industry."
]
# Top-level await needs an async context such as a Jupyter/IPython session.
embeddings, responses = await async_batch_embeddings(
    model="text-embedding-3-small",
    input=docs,
    batch_size=1000,
    verbose=False,
)
# Pairwise distances between all document embeddings (default metric: cosine).
dist_matrix = get_vector_dist_matrix(embeddings)
# Indices of the two documents closest to "Apple", excluding "Apple" itself.
match_indices, _ = embedding_match(docs.index("Apple"), dist_matrix, num_matches=2)
[docs[i] for i in match_indices]

['Apple Inc', 'Apple Inc is a technology company.']
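The lookup that embedding_match is described as performing can be sketched without any embedding API: take one row of the distance matrix, skip the entry for the item itself, and keep the smallest distances. The function closest_matches below is a hypothetical stand-in written for illustration, not the library's implementation, and it returns plain lists rather than np.ndarray. The matrix reuses the rounded values from the get_vector_dist_matrix example, with the diagonal set to zero.

```python
def closest_matches(index, dist_matrix, num_matches=5):
    """Hypothetical stand-in: nearest rows to `index`, excluding `index` itself."""
    row = dist_matrix[index]
    # Candidate positions are every column except the item's own.
    others = [j for j in range(len(row)) if j != index]
    # Smallest distance first.
    others.sort(key=lambda j: row[j])
    top = others[:num_matches]
    return top, [row[j] for j in top]

dist_matrix = [
    [0.0, 0.294916146, 0.228848079],
    [0.294916146, 0.0, 0.229985740],
    [0.228848079, 0.229985740, 0.0],
]
indices, distances = closest_matches(0, dist_matrix, num_matches=2)
print(indices)    # [2, 1]
print(distances)  # [0.228848079, 0.294916146]
```

Excluding the item's own index explicitly, rather than dropping the first sorted entry, keeps the result correct even when another vector sits at distance zero from the query.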