Utils for finding the closest match of a string in a list of strings.
fuzzy_match(
    query_string,
    candidate_strings,
    max_results,
    min_similarity,
    scorer
)
Find the closest fuzzy matches to a query string from a list of candidate strings using RapidFuzz.
This is a simplified wrapper around rapidfuzz.process.extract. It serves the same purpose as difflib.get_close_matches but is much faster.
Arguments:
- query_string (str): The string to match against the candidates.
- candidate_strings (Iterable[str]): The list of candidate strings to search.
- max_results (int): Maximum number of matches to return. Defaults to 8.
- min_similarity (float): Minimum similarity threshold (0.0-1.0). Defaults to 0.1.
- scorer (callable): Scoring function from rapidfuzz.fuzz. Defaults to rapidfuzz.fuzz.ratio. See rapidfuzz.fuzz for available scorers.
Returns: List[Tuple[str, float, int]]: List of tuples containing (matched_string, score, index).
fuzzy_match(
    "Apple Inc",
    ["Apple Inc", "Apple", "Google LLC", "Microsoft Corp"],
)
[('Apple Inc', 100.0, 0),
 ('Apple', 71.42857142857143, 1),
 ('Google LLC', 31.57894736842105, 2),
 ('Microsoft Corp', 8.695652173913048, 3)]
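For comparison, the standard-library function mentioned above can be called on the same inputs. This is a minimal sketch using only difflib, not the wrapper itself. Note that difflib.get_close_matches takes its cutoff on a 0-1 scale and returns only the matched strings, without scores or indices. The variable names are illustrative.

```python
import difflib

candidates = ["Apple Inc", "Apple", "Google LLC", "Microsoft Corp"]

# n caps the number of results; cutoff is a similarity ratio in [0, 1].
# The values mirror the documented defaults of max_results and min_similarity.
close = difflib.get_close_matches("Apple Inc", candidates, n=8, cutoff=0.1)
print(close)
```

The result is ordered from most to least similar, like the output of fuzzy_match.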
get_vector_dist_matrix(vectors: list[list[float]], metric: str)
Calculate the pairwise distance matrix for a set of vectors.
Arguments:
- vectors (List[List[float]]): List of vectors to calculate distances for.
- metric (str): Distance metric to use. Defaults to "cosine". Options include "euclidean", "manhattan", "cosine", etc. See sklearn.metrics.pairwise_distances for more options.
Returns: np.ndarray: Distance matrix. Each element [i, j] represents the distance between vectors[i] and vectors[j].
vs = [
    [9, 7, 1, 2, 6],
    [1, 8, 3, 3, 2],
    [4, 5, 6, 7, 8],
]
get_vector_dist_matrix(vs)
array([[1.11022302e-16, 2.94916146e-01, 2.28848079e-01],
       [2.94916146e-01, 1.11022302e-16, 2.29985740e-01],
       [2.28848079e-01, 2.29985740e-01, 0.00000000e+00]])
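The entries of this matrix can be checked by hand. Assuming the default "cosine" metric follows the usual definition (1 minus the cosine similarity), the sketch below recomputes the matrix in plain Python. The helper cosine_distance is hypothetical and is not part of the library, which relies on sklearn.metrics.pairwise_distances according to the description above.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

vs = [
    [9, 7, 1, 2, 6],
    [1, 8, 3, 3, 2],
    [4, 5, 6, 7, 8],
]

# Build the full pairwise matrix as nested lists.
dist = [[cosine_distance(a, b) for b in vs] for a in vs]
for row in dist:
    print([round(d, 6) for d in row])
```

The matrix is symmetric, and the diagonal is zero only up to floating-point rounding, which is why values such as 1.11022302e-16 can appear there.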
embedding_match(embedding_index, dist_matrix, num_matches)
Find the indices and distances of the closest matches to a given embedding. Use it in combination with get_vector_dist_matrix to find the closest embeddings.
Arguments:
- embedding_index (int): Index of the embedding to match.
- dist_matrix (np.ndarray): Pairwise distance matrix of embeddings. Computed using get_vector_dist_matrix.
- num_matches (int): Number of closest matches to return (excluding self). Defaults to 5.
Returns: Tuple[np.ndarray, np.ndarray]: Tuple of (indices of closest matches, distances to those matches).
from adulib.llm import async_batch_embeddings
docs = [
    "Apple Inc",
    "Apple",
    "Google LLC",
    "Microsoft Corp",
    "Apple Inc is a technology company.",
    "Google is a search engine.",
    "Microsoft develops software and hardware.",
    "Apple and Google are competitors in the tech industry.",
]
embeddings, responses = await async_batch_embeddings(
    model="text-embedding-3-small",
    input=docs,
    batch_size=1000,
    verbose=False,
)
dist_matrix = get_vector_dist_matrix(embeddings)
match_indices, _ = embedding_match(docs.index("Apple"), dist_matrix, num_matches=2)
[docs[i] for i in match_indices]
['Apple Inc', 'Apple Inc is a technology company.']
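One way a lookup like embedding_match could work is to take the row of the distance matrix for the chosen index, sort the other indices by distance, and keep the nearest few. The sketch below illustrates that idea in plain Python. The helper closest_in_row is hypothetical and is not the library's implementation, which returns NumPy arrays.

```python
def closest_in_row(index, dist_matrix, num_matches=5):
    """Return (indices, distances) of the nearest rows to `index`, excluding itself."""
    row = dist_matrix[index]
    # Sort every other index by its distance to `index`, nearest first.
    order = sorted((j for j in range(len(row)) if j != index), key=lambda j: row[j])
    top = order[:num_matches]
    return top, [row[j] for j in top]

# Distance matrix copied from the get_vector_dist_matrix example output.
dist_matrix = [
    [1.11022302e-16, 2.94916146e-01, 2.28848079e-01],
    [2.94916146e-01, 1.11022302e-16, 2.29985740e-01],
    [2.28848079e-01, 2.29985740e-01, 0.00000000e+00],
]

indices, distances = closest_in_row(0, dist_matrix, num_matches=2)
print(indices, distances)
```

Excluding the query index explicitly, rather than dropping the first sorted entry, avoids returning the item itself when another row happens to sit at distance zero.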