Embeddings

Finalfusion Embeddings

class finalfusion.embeddings.Embeddings(storage: finalfusion.storage.storage.Storage, vocab: finalfusion.vocab.vocab.Vocab, norms: Optional[finalfusion.norms.Norms] = None, metadata: Optional[finalfusion.metadata.Metadata] = None, origin: str = '<memory>')

Embeddings class.

Embeddings always contain a Storage and Vocab. Optional chunks are Norms corresponding to the embeddings of the in-vocab tokens and Metadata.

Embeddings can be retrieved through three methods:

Embeddings.embedding() accepts a default value and returns it if no embedding could be found.
Embeddings.__getitem__() retrieves an embedding for the query but raises an exception if it cannot retrieve an embedding.
Embeddings.embedding_with_norm() requires a Norms chunk and returns an embedding together with the corresponding L2 norm.
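
The difference between the three lookup methods can be illustrated with a small dict-backed stand-in. This is a sketch only: ToyEmbeddings, its constructor arguments, the default parameter name and the TypeError raised for missing norms are assumptions made for illustration, not the finalfusion implementation.

```python
class ToyEmbeddings:
    """Hypothetical stand-in that mimics the three lookup contracts."""

    def __init__(self, vectors, norms=None):
        self._vectors = vectors  # word -> list of floats
        self._norms = norms      # word -> L2 norm, or None if absent

    def embedding(self, word, default=None):
        # Never raises for unknown words: falls back to the default value.
        return self._vectors.get(word, default)

    def __getitem__(self, word):
        # Raises KeyError if no embedding can be retrieved.
        return self._vectors[word]

    def embedding_with_norm(self, word):
        # Needs norms; the error type used here is an assumption.
        if self._norms is None:
            raise TypeError("no norms available")
        return self._vectors[word], self._norms[word]


toy = ToyEmbeddings({"some": [0.6, 0.8]}, norms={"some": 2.5})
fallback = toy.embedding("oov", default=[0.0, 0.0])  # default is returned
vector = toy["some"]                                 # plain lookup
vector_and_norm = toy.embedding_with_norm("some")    # (embedding, norm)
```

The design choice to notice is that only the first method lets the caller decide what an unknown word should map to; the second treats a miss as an error.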

Embeddings are composed of four chunk types:

Storage (required):
- NdArray
- QuantizedArray
Vocab (required):
- SimpleVocab
- FinalfusionBucketVocab
- FastTextVocab
- ExplicitVocab
Metadata (optional)
Norms (optional)

Examples

>>> import numpy as np
>>> from finalfusion.embeddings import Embeddings
>>> from finalfusion.metadata import Metadata
>>> from finalfusion.norms import Norms
>>> from finalfusion.storage import NdArray
>>> from finalfusion.vocab import SimpleVocab
>>> storage = NdArray(np.float32(np.random.rand(2, 10)))
>>> vocab = SimpleVocab(["Some", "words"])
>>> metadata = Metadata({"Some": "value", "numerical": 0})
>>> norms = Norms(np.float32(np.random.rand(2)))
>>> embeddings = Embeddings(storage=storage, vocab=vocab, metadata=metadata, norms=norms)
>>> embeddings.vocab.words
['Some', 'words']
>>> np.allclose(embeddings["Some"], storage[0])
True
>>> try:
...     embeddings["oov"]
... except KeyError:
...     True
True
>>> _, n = embeddings.embedding_with_norm("Some")
>>> np.isclose(n, norms[0])
True
>>> embeddings.metadata
{'Some': 'value', 'numerical': 0}

__init__(storage: finalfusion.storage.storage.Storage, vocab: finalfusion.vocab.vocab.Vocab, norms: Optional[finalfusion.norms.Norms] = None, metadata: Optional[finalfusion.metadata.Metadata] = None, origin: str = '<memory>')

Initialize Embeddings.

Initializes Embeddings with the given chunks.

Conditions

The following conditions need to hold if the respective chunks are passed:

Chunks need to have the expected type.
vocab.idx_bound == storage.shape[0]
len(norms) == len(vocab) and len(norms) <= storage.shape[0]
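
The shape conditions can be spelled out as plain assertions. The following is a minimal sketch, assuming the inequality direction given for the norms setter further down (an AssertionError if storage.shape[0] < len(norms)); check_chunk_shapes and its parameter names are hypothetical and not part of finalfusion.

```python
def check_chunk_shapes(idx_bound, n_rows, n_words, n_norms=None):
    """Hypothetical helper mirroring the documented shape conditions.

    idx_bound -- vocab.idx_bound
    n_rows    -- storage.shape[0]
    n_words   -- len(vocab)
    n_norms   -- len(norms), or None if no Norms chunk is passed
    """
    # Every index the vocabulary can produce must address a storage row.
    assert idx_bound == n_rows, "vocab.idx_bound must equal storage.shape[0]"
    if n_norms is not None:
        # One norm per in-vocabulary word.
        assert n_norms == n_words, "len(norms) must equal len(vocab)"
        # Storage may hold more rows than there are norms, never fewer.
        assert n_norms <= n_rows, "len(norms) must not exceed storage.shape[0]"


# Two words, two rows, two norms.
check_chunk_shapes(idx_bound=2, n_rows=2, n_words=2, n_norms=2)
# Two words with eight additional rows, still two norms.
check_chunk_shapes(idx_bound=10, n_rows=10, n_words=2, n_norms=2)
```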

Parameters

storage (Storage) – Embeddings Storage.
vocab (Vocab) – Embeddings Vocabulary.
norms (Norms, optional) – Embeddings Norms.
metadata (Metadata, optional) – Embeddings Metadata.
origin (str, optional) – Origin of the embeddings, e.g. file name

Raises

AssertionError – If any of the conditions don’t hold.

__getitem__(item: str) → numpy.ndarray

Returns an embedding.

Parameters: item (str) – The query item.
Returns: embedding – The embedding.
Return type: numpy.ndarray
Raises: KeyError – If no embedding could be retrieved.

See also

embedding(), embedding_with_norm()

property dims

Get the embedding dimensionality.

Returns: dims – Embedding dimensionality
Return type: int

property n_words

Get the number of known words.

Returns: n_words – Number of known words
Return type: int

property storage

Get the Storage.

Returns: storage – The embeddings storage.
Return type: Storage

property vocab

The Vocab.

Returns: vocab – The vocabulary
Return type: Vocab

property norms

The Norms.

Getter

Returns None or the Norms.

Setter

Set the Norms.

Returns

norms – The Norms or None.

Return type

Norms, optional

Raises

AssertionError – if embeddings.storage.shape[0] < len(embeddings.norms) or len(embeddings.norms) != len(embeddings.vocab)
TypeError – If norms is neither Norms nor None.

property metadata

The Metadata.

Getter: Returns None or the Metadata.
Setter: Set the Metadata.
Returns: metadata – The Metadata or None.
Return type: Metadata, optional
Raises: TypeError – If metadata is neither Metadata nor None.

property origin

The origin of the embeddings.

Returns: origin – Origin of the embeddings, e.g. file name
Return type: str

chunks() → List[finalfusion.io.Chunk]

Get the Embeddings Chunks as a list.

The Chunks are ordered in the expected serialization order:

1. Metadata (optional)
2. Vocabulary
3. Storage
4. Norms (optional)

Returns: chunks – List of embeddings chunks.
Return type: List[Chunk]

write(file: Union[str, bytes, int, os.PathLike])

Write the Embeddings to the given file.

Writes the Embeddings in finalfusion format to the given path.

Parameters: file (str, bytes, int, PathLike) – Path of the output file.

bucket_to_explicit() → finalfusion.embeddings.Embeddings

Bucket to explicit Embeddings conversion.

Multiple embeddings can still map to the same bucket, but all buckets that are not indexed by in-vocabulary n-grams are eliminated. This can have a big impact on the size of the embedding matrix.

Metadata is not copied to the new embeddings since it doesn’t reflect the changes. You can manually set the metadata and update the values accordingly.
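
The idea behind the conversion can be sketched in pure Python: collect the buckets that in-vocabulary n-grams actually hit, renumber them densely, and keep only those rows. This is an illustration under stated assumptions, not the library's implementation; compact_buckets, the n-gram table and the one-dimensional rows are hypothetical.

```python
def compact_buckets(ngram_to_bucket, bucket_rows):
    """Hypothetical sketch: drop bucket rows that no in-vocabulary n-gram uses."""
    # Buckets that are actually indexed, in ascending order.
    used = sorted(set(ngram_to_bucket.values()))
    # Dense renumbering: old bucket index -> new explicit index.
    old_to_new = {old: new for new, old in enumerate(used)}
    # N-grams sharing a bucket keep sharing the same new index.
    explicit = {ngram: old_to_new[b] for ngram, b in ngram_to_bucket.items()}
    # Only the rows of used buckets survive.
    rows = [bucket_rows[old] for old in used]
    return explicit, rows


# Four n-grams hashed into 8 buckets; "<wo" and "ord" collide in bucket 5.
ngram_to_bucket = {"<wo": 5, "wor": 2, "ord": 5, "rd>": 7}
bucket_rows = [[float(i)] for i in range(8)]

explicit, rows = compact_buckets(ngram_to_bucket, bucket_rows)
# explicit == {"<wo": 1, "wor": 0, "ord": 1, "rd>": 2}
# rows == [[2.0], [5.0], [7.0]]
```

In this toy case the bucket matrix shrinks from 8 rows to 3, while the collision between "<wo" and "ord" is preserved.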

Returns: embeddings – Embeddings with an ExplicitVocab instead of a hash-based vocabulary.
Return type: Embeddings
Raises: TypeError – If the current vocabulary is not a hash-based vocabulary (FinalfusionBucketVocab or FastTextVocab)

analogy(word1: str, word2: str, word3: str, k: int = 1, skip: Set[str] = None) → Optional[List[finalfusion.embeddings.SimilarityResult]]

Perform an analogy query.

This method returns words that are close in vector space to the answer of the analogy query "word1 is to word2 as word3 is to ?". More concretely, it searches for embeddings that are similar to:

embedding(word2) - embedding(word1) + embedding(word3)

Words specified in skip are not considered as answers. If skip is None, the query words word1, word2 and word3 are excluded.

At most k results are returned. None is returned if an embedding cannot be computed for one of the query words.
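
The query can be sketched over a plain dict of word vectors. The sketch assumes candidates are ranked by dot product against the target vector, as described for the similarity methods; toy_analogy, the toy vectors and the (word, score) tuples are hypothetical, and None is returned here as soon as any query word is missing, which is one reading of the behaviour described above.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_analogy(vectors, word1, word2, word3, k=1, skip=None):
    """Hypothetical stand-in for an analogy query over word -> vector."""
    if any(w not in vectors for w in (word1, word2, word3)):
        return None
    if skip is None:
        # By default the query words are not valid answers.
        skip = {word1, word2, word3}
    # embedding(word2) - embedding(word1) + embedding(word3)
    target = [
        b - a + c
        for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])
    ]
    scored = [
        (dot(target, vec), word)
        for word, vec in vectors.items()
        if word not in skip
    ]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:k]]


vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -1.0],
}
default_skip = toy_analogy(vectors, "man", "woman", "king")
# default_skip == [("queen", 10.0)]
allow_query_words = toy_analogy(vectors, "man", "woman", "king", k=2, skip=set())
# allow_query_words == [("queen", 10.0), ("king", 9.0)]
```

Passing an empty set for skip lets "king" appear among the answers, which is why the query words are excluded by default.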

Parameters

word1 (str) – First word of the analogy: word1 is to…
word2 (str) – …word2 as…
word3 (str) – …word3 is to the returned answers.
k (int) – Number of answers to return, defaults to 1.
skip (Set[str], optional) – Set of strings which should not be considered as answers. Defaults to None, which excludes the query strings. To allow the query strings as answers, pass an empty set.

Returns

answers – List of answers. None if an embedding could not be computed for the query.

Return type

List[SimilarityResult], optional

word_similarity(query: str, k: int = 10) → Optional[List[finalfusion.embeddings.SimilarityResult]]

Retrieves the nearest neighbors of the query string.

The similarity between the embedding of the query and other embeddings is defined by the dot product of the embeddings. If the vectors are unit vectors, this is the cosine similarity.

At most, k results are returned.
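
The relation between dot product and cosine similarity can be checked with a few lines of plain Python. This is a small worked example with made-up vectors; the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_norm(v):
    return math.sqrt(dot(v, v))


def normalize(v):
    n = l2_norm(v)
    return [x / n for x in v]


def cosine(a, b):
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


a = [3.0, 4.0]  # L2 norm 5
b = [4.0, 3.0]  # L2 norm 5

raw = dot(a, b)                         # 24.0, depends on vector length
unit = dot(normalize(a), normalize(b))  # 24 / 25 = 0.96
# For unit vectors the dot product and the cosine similarity coincide.
same = math.isclose(unit, cosine(a, b))
```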

Parameters

query (str) – The query string
k (int) – The number of neighbors to return, defaults to 10.

Returns

neighbours – List of SimilarityResult, each holding a neighbour and its similarity to the query. None if no embedding can be found for query.

Return type

List[SimilarityResult], optional

embedding_similarity(query: numpy.ndarray, k: int = 10, skip: Optional[Set[str]] = None) → Optional[List[finalfusion.embeddings.SimilarityResult]]

Retrieves the nearest neighbors of the query embedding.

The similarity between the query embedding and other embeddings is defined by the dot product of the embeddings. If the vectors are unit vectors, this is the cosine similarity.

At most, k results are returned.

Parameters

query (numpy.ndarray) – The query embedding.
k (int) – The number of neighbors to return, defaults to 10.
skip (Set[str], optional) – Set of strings that should not be considered as neighbours.

Returns

neighbours – List of SimilarityResult, each holding a neighbour and its similarity to the query. None if no embedding can be found for query.

Return type

List[SimilarityResult], optional

class finalfusion.embeddings.SimilarityResult(word: str, similarity: float)

Container for a Similarity result.

The word can be accessed through result.word, the similarity through result.similarity.

finalfusion.embeddings.load_finalfusion(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → finalfusion.embeddings.Embeddings

Read embeddings from a file in finalfusion format.

Parameters

file (str, bytes, int, PathLike) – Path to a file with embeddings in finalfusion format.
mmap (bool) – Toggles memory mapping the storage buffer.

Returns

embeddings – The embeddings from the input file.

Return type

Embeddings