Embeddings
Finalfusion Embeddings
-
class finalfusion.embeddings.Embeddings(storage: finalfusion.storage.storage.Storage, vocab: finalfusion.vocab.vocab.Vocab, norms: Optional[finalfusion.norms.Norms] = None, metadata: Optional[finalfusion.metadata.Metadata] = None, origin: str = '<memory>')
Embeddings class.
Embeddings always contain a Storage and a Vocab. Optional chunks are Norms, corresponding to the embeddings of the in-vocab tokens, and Metadata.
Embeddings can be retrieved through three methods:
Embeddings.embedding() accepts a default value and returns it if no embedding can be found.
Embeddings.__getitem__() retrieves an embedding for the query but raises an exception if it cannot retrieve one.
Embeddings.embedding_with_norm() requires a Norms chunk and returns an embedding together with the corresponding L2 norm.
Embeddings are composed of four chunk types: Storage, Vocab, Norms (optional) and Metadata (optional).
Examples
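>>> import numpy as np
>>> from finalfusion.embeddings import Embeddings
>>> from finalfusion.metadata import Metadata
>>> from finalfusion.norms import Norms
>>> from finalfusion.storage import NdArray
>>> from finalfusion.vocab import SimpleVocab
>>> storage = NdArray(np.float32(np.random.rand(2, 10)))
>>> vocab = SimpleVocab(["Some", "words"])
>>> metadata = Metadata({"Some": "value", "numerical": 0})
>>> norms = Norms(np.float32(np.random.rand(2)))
>>> embeddings = Embeddings(storage=storage, vocab=vocab, metadata=metadata, norms=norms)
>>> embeddings.vocab.words
['Some', 'words']
>>> np.allclose(embeddings["Some"], storage[0])
True
>>> # Out-of-vocabulary lookups through __getitem__ raise KeyError.
>>> try:
...     embeddings["oov"]
... except KeyError:
...     True
True
>>> _, n = embeddings.embedding_with_norm("Some")
>>> np.isclose(n, norms[0])
True
>>> embeddings.metadata
{'Some': 'value', 'numerical': 0}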
-
__init__(storage: finalfusion.storage.storage.Storage, vocab: finalfusion.vocab.vocab.Vocab, norms: Optional[finalfusion.norms.Norms] = None, metadata: Optional[finalfusion.metadata.Metadata] = None, origin: str = '<memory>')
Initialize Embeddings.
Initializes Embeddings with the given chunks.
- Conditions
The following conditions need to hold if the respective chunks are passed:
Chunks need to have the expected type.
vocab.idx_bound == storage.shape[0]
len(vocab) == len(norms)
len(norms) == len(vocab) and len(norms) <= storage.shape[0]
- Parameters
storage (Storage) – Embeddings Storage.
vocab (Vocab) – Embeddings Vocabulary.
norms (Norms, optional) – Embeddings Norms.
metadata (Metadata, optional) – Embeddings Metadata.
origin (str, optional) – Origin of the embeddings, e.g. file name
- Raises
AssertionError – If any of the conditions don’t hold.
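The shape conditions above can be sketched as a standalone check. This is only an illustration, not the library's implementation: `check_chunk_shapes` is a hypothetical helper, and its plain integer arguments stand in for `storage.shape[0]`, `vocab.idx_bound`, `len(vocab)` and `len(norms)`. The bound on the number of norms follows the AssertionError condition documented for the `norms` setter.

```python
from typing import Optional


def check_chunk_shapes(storage_rows: int, idx_bound: int, vocab_len: int,
                       norms_len: Optional[int] = None) -> None:
    """Hypothetical helper mirroring the documented shape conditions."""
    # Every index the vocabulary can produce must address a storage row.
    assert idx_bound == storage_rows, "vocab.idx_bound != storage.shape[0]"
    if norms_len is not None:
        # One norm per in-vocabulary word.
        assert norms_len == vocab_len, "len(norms) != len(vocab)"
        # There cannot be more norms than storage rows.
        assert norms_len <= storage_rows, "more norms than storage rows"


# A simple vocabulary: 2 words, 2 storage rows, 2 norms.
check_chunk_shapes(storage_rows=2, idx_bound=2, vocab_len=2, norms_len=2)

# A mismatch raises AssertionError, as documented for __init__.
try:
    check_chunk_shapes(storage_rows=2, idx_bound=2, vocab_len=2, norms_len=3)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```

Norms are optional, so the two norm checks are skipped when `norms_len` is None.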
-
__getitem__(item: str) → numpy.ndarray
Returns an embedding.
- Parameters
item (str) – The query item.
- Returns
embedding – The embedding.
- Return type
numpy.ndarray
- Raises
KeyError – If no embedding could be retrieved.
-
embedding(word: str, out: Optional[numpy.ndarray] = None, default: Optional[numpy.ndarray] = None) → Optional[numpy.ndarray]
Embedding lookup.
Looks up the embedding for the input word.
If an out array is specified, the embedding is written into the array.
If no embedding can be retrieved for the input word, the default value is returned. This defaults to None. An embedding cannot be retrieved if the vocabulary cannot provide an index for word.
This method never fails. If you do not provide a default value, check the return value for None.
out is left untouched if no embedding can be found and default is None.
- Parameters
word (str) – The query word.
out (numpy.ndarray, optional) – Optional output array to write the embedding into.
default (numpy.ndarray, optional) – Optional default value to return if no embedding can be retrieved. Defaults to None.
- Returns
embedding – The retrieved embedding or the default value.
- Return type
numpy.ndarray, optional
Examples
>>> import numpy as np
>>> from finalfusion.embeddings import Embeddings
>>> from finalfusion.storage import NdArray
>>> from finalfusion.vocab import SimpleVocab
>>> matrix = np.float32(np.random.rand(2, 10))
>>> storage = NdArray(matrix)
>>> vocab = SimpleVocab(["Some", "words"])
>>> embeddings = Embeddings(storage=storage, vocab=vocab)
>>> np.allclose(embeddings.embedding("Some"), matrix[0])
True
>>> # default value is None
>>> embeddings.embedding("oov") is None
True
>>> # It's possible to specify a default value
>>> default = embeddings.embedding("oov", default=storage[0])
>>> np.allclose(default, storage[0])
True
>>> # Embeddings can be written to an output buffer.
>>> out = np.zeros(10, dtype=np.float32)
>>> out2 = embeddings.embedding("Some", out=out)
>>> out is out2
True
>>> np.allclose(out, matrix[0])
True
-
embedding_with_norm(word: str, out: Optional[numpy.ndarray] = None, default: Optional[Tuple[numpy.ndarray, float]] = None) → Optional[Tuple[numpy.ndarray, float]]
Embedding lookup with norm.
Looks up the embedding for the input word together with its norm.
If an out array is specified, the embedding is written into the array.
If no embedding can be retrieved for the input word, the default value is returned. This defaults to None. An embedding cannot be retrieved if the vocabulary cannot provide an index for word.
This method raises a TypeError if norms are not set.
- Parameters
word (str) – The query word.
out (numpy.ndarray, optional) – Optional output array to write the embedding into.
default (Tuple[numpy.ndarray, float], optional) – Optional default value to return if no embedding can be retrieved. Defaults to None.
- Returns
(embedding, norm) – Tuple with the retrieved embedding or the default value at the first index and the norm at the second index.
- Return type
EmbeddingWithNorm, optional
-
property dims
Get the embedding dimensionality.
- Returns
dims – Embedding dimensionality
- Return type
int
-
property n_words
Get the number of known words.
- Returns
n_words – Number of known words
- Return type
int
-
property norms
The Norms.
- Getter
Returns None or the Norms.
- Setter
Set the Norms.
- Returns
norms – The Norms or None.
- Return type
Norms, optional
- Raises
AssertionError – If embeddings.storage.shape[0] < len(embeddings.norms) or len(embeddings.norms) != len(embeddings.vocab).
TypeError – If norms is neither Norms nor None.
-
property metadata
The Metadata.
-
property origin
The origin of the embeddings.
- Returns
origin – Origin of the embeddings, e.g. file name
- Return type
str
-
chunks() → List[finalfusion.io.Chunk]
Get the Embeddings Chunks as a list.
The Chunks are ordered in the expected serialization order:
Metadata (optional)
Vocabulary
Storage
Norms (optional)
- Returns
chunks – List of embeddings chunks.
- Return type
List[Chunk]
-
write(file: Union[str, bytes, int, os.PathLike])
Write the Embeddings to the given file.
Writes the Embeddings in finalfusion format to the given path.
- Parameters
file (str, bytes, int, PathLike) – Path of the output file.
-
bucket_to_explicit() → finalfusion.embeddings.Embeddings
Bucket to explicit Embeddings conversion.
Multiple embeddings can still map to the same bucket, but all buckets that are not indexed by in-vocabulary n-grams are eliminated. This can have a big impact on the size of the embedding matrix.
Metadata is not copied to the new embeddings since it doesn’t reflect the changes. You can manually set the metadata and update the values accordingly.
- Returns
embeddings – Embeddings with an ExplicitVocab instead of a hash-based vocabulary.
- Return type
Embeddings
- Raises
TypeError – If the current vocabulary is not a hash-based vocabulary (FinalfusionBucketVocab or FastTextVocab)
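The idea behind the conversion can be illustrated without the library. The sketch below assumes a toy mapping from in-vocabulary n-grams to bucket indices and a small list-of-lists matrix. `compact_buckets` is a hypothetical helper, not part of finalfusion, and it ignores the rows that hold word embeddings.

```python
from typing import Dict, List, Tuple


def compact_buckets(ngram_to_bucket: Dict[str, int],
                    bucket_rows: List[List[float]]
                    ) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep only buckets that are hit by an in-vocabulary n-gram."""
    # Buckets in use, in ascending order, get new contiguous indices.
    used = sorted(set(ngram_to_bucket.values()))
    old_to_new = {old: new for new, old in enumerate(used)}
    # N-grams that shared a bucket still share the same new index.
    explicit = {ngram: old_to_new[b] for ngram, b in ngram_to_bucket.items()}
    compact_rows = [bucket_rows[old] for old in used]
    return explicit, compact_rows


# Toy data: 6 buckets, but only buckets 1 and 4 are used by known n-grams.
rows = [[float(i)] * 2 for i in range(6)]
ngrams = {"<so": 4, "som": 1, "ome": 4}
explicit, compact = compact_buckets(ngrams, rows)
```

In this toy case the matrix shrinks from six rows to two, while "<so" and "ome" keep pointing at the same row.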
-
analogy(word1: str, word2: str, word3: str, k: int = 1, skip: Set[str] = None) → Optional[List[finalfusion.embeddings.SimilarityResult]]
Perform an analogy query.
This method returns words that are close in vector space to the answer of the analogy query "word1 is to word2 as word3 is to ?". More concretely, it searches for embeddings that are similar to:
embedding(word2) - embedding(word1) + embedding(word3)
Words specified in skip are not considered as answers. If skip is None, the query words word1, word2 and word3 are excluded.
At most, k results are returned. None is returned when no embedding could be computed for any of the tokens.
- Parameters
word1 (str) – Word1 is to…
word2 (str) – word2 like…
word3 (str) – word3 is to the return value
skip (Set[str]) – Set of strings which should not be considered as answers. Defaults to None, which excludes the query strings. To allow the query strings as answers, pass an empty set.
k (int) – Number of answers to return, defaults to 1.
- Returns
answers – List of answers.
- Return type
List[SimilarityResult]
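The query can be sketched in plain Python on a hand-made vocabulary. This is an illustration under stated assumptions, not the library's code: `toy_analogy`, `dot` and the 2-dimensional vectors are invented for the example, results are plain tuples rather than SimilarityResult objects, and the sketch assumes None is returned as soon as one query word has no embedding.

```python
from typing import Dict, List, Optional, Set, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def toy_analogy(embeds: Dict[str, Vector], word1: str, word2: str, word3: str,
                k: int = 1, skip: Optional[Set[str]] = None
                ) -> Optional[List[Tuple[str, float]]]:
    """Rank words by dot product with e(word2) - e(word1) + e(word3)."""
    if any(w not in embeds for w in (word1, word2, word3)):
        return None  # assumption: one missing query word aborts the query
    if skip is None:
        skip = {word1, word2, word3}  # default: exclude the query words
    e1, e2, e3 = embeds[word1], embeds[word2], embeds[word3]
    target = [b - a + c for a, b, c in zip(e1, e2, e3)]
    scored = [(w, dot(v, target)) for w, v in embeds.items() if w not in skip]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-made 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "female".
toy = {
    "man": [0.0, 0.0],
    "woman": [0.0, 1.0],
    "king": [1.0, 0.0],
    "queen": [1.0, 1.0],
}
answers = toy_analogy(toy, "man", "woman", "king")
```

Passing `skip=set()` lets the query words themselves compete as answers, which mirrors the documented behaviour of the `skip` parameter.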
-
word_similarity(query: str, k: int = 10) → Optional[List[finalfusion.embeddings.SimilarityResult]]
Retrieves the nearest neighbors of the query string.
The similarity between the embedding of the query and other embeddings is defined by the dot product of the embeddings. If the vectors are unit vectors, this is the cosine similarity.
At most, k results are returned.
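The relationship between the dot product and cosine similarity can be checked with a few lines of plain Python. The helpers `dot`, `l2_normalize` and `cosine` are defined here only for illustration and are not part of finalfusion.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: Vector, b: Vector) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# The raw dot product depends on the vector lengths.
raw = dot(a, b)
# After L2 normalization it coincides with the cosine similarity.
unit_dot = dot(l2_normalize(a), l2_normalize(b))
```

Here `raw` is 24.0, while `unit_dot` and `cosine(a, b)` agree up to floating-point rounding.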
-
embedding_similarity(query: numpy.ndarray, k: int = 10, skip: Optional[Set[str]] = None) → Optional[List[finalfusion.embeddings.SimilarityResult]]
Retrieves the nearest neighbors of the query embedding.
The similarity between the query embedding and other embeddings is defined by the dot product of the embeddings. If the vectors are unit vectors, this is the cosine similarity.
At most, k results are returned.
- Parameters
query (numpy.ndarray) – The query array.
k (int) – The number of neighbors to return, defaults to 10.
skip (Set[str], optional) – Set of strings that should not be considered as neighbours.
- Returns
neighbours – List of tuples with neighbour and similarity measure. None if no embedding can be found for query.
- Return type
List[SimilarityResult], optional
-
class finalfusion.embeddings.SimilarityResult(word: str, similarity: float)
Container for a Similarity result.
The word can be accessed through result.word, the similarity through result.similarity.
-
finalfusion.embeddings.load_finalfusion(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → finalfusion.embeddings.Embeddings
Read embeddings from a file in finalfusion format.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in finalfusion format.
mmap (bool) – Toggles memory mapping the storage buffer.
- Returns
embeddings – The embeddings from the input file.
- Return type
Embeddings