Finalfusion in Python¶
finalfusion is a Python package for reading, writing and using finalfusion embeddings. It also supports other commonly used embedding formats such as fastText, GloVe and word2vec.
The Python package supports the same types of embeddings as the finalfusion-rust crate:
Vocabulary: no subwords, subwords
Embedding matrix: array, memory-mapped, quantized
Norms
Metadata
This package extends the (de-)serialization capabilities of finalfusion Chunks by allowing single chunks to be loaded and written. For example, a Vocab can be loaded from a finalfusion spec file without loading the Storage. Single chunks can also be serialized to their own files through write(). This differs from finalfusion-rust: loading stand-alone components is only supported by the Python package, and reading such files will fail with other tools from the finalfusion ecosystem.
It integrates nicely with numpy since its Storage types can be treated as numpy arrays.
finalfusion comes with some scripts to convert between embedding formats, run analogy and similarity queries, and turn bucket subword embeddings into explicit subword embeddings.
The package is implemented in Python with some Cython extensions; it is not based on bindings to the finalfusion-rust crate.
Quickstart¶
Package¶
Embeddings can be loaded and used as follows:
import finalfusion
# loading from different formats
w2v_embeds = finalfusion.load_word2vec("/path/to/w2v.bin")
text_embeds = finalfusion.load_text("/path/to/embeds.txt")
text_dims_embeds = finalfusion.load_text_dims("/path/to/embeds.dims.txt")
fasttext_embeds = finalfusion.load_fasttext("/path/to/fasttext.bin")
fifu_embeds = finalfusion.load_finalfusion("/path/to/embeddings.fifu")
# serialization to formats works similarly
finalfusion.compat.write_word2vec("to_word2vec.bin", fifu_embeds)
# embedding lookup
embedding = fifu_embeds["Test"]
# reading an embedding into a buffer
import numpy as np
buffer = np.zeros(fifu_embeds.storage.shape[1], dtype=np.float32)
fifu_embeds.embedding("Test", out=buffer)
# similarity and analogy query
sim_query = fifu_embeds.word_similarity("Test")
analogy_query = fifu_embeds.analogy("A", "B", "C")
# accessing the vocab and printing the first 10 words
vocab = fifu_embeds.vocab
print(vocab.words[:10])
# SubwordVocabs give access to the subword indexer:
subword_indexer = vocab.subword_indexer
print(subword_indexer.subword_indices("Test", with_ngrams=True))
# accessing the storage and calculating its dot product with an embedding
# storage has shape (n, dims) and embedding has shape (dims,), so the storage goes first
res = fifu_embeds.storage.dot(embedding)
# printing metadata
print(fifu_embeds.metadata)
finalfusion exports the most commonly used functions and types at the top level. See Top-level Exports for an overview.
The full API documentation can be found in the API section.
Conversion¶
finalfusion also comes with a conversion tool to convert between supported file formats and from bucket subword embeddings to explicit subword embeddings:
$ ffp-convert -f fasttext from_fasttext.bin -t finalfusion to_finalfusion.fifu
$ ffp-bucket-to-explicit buckets.fifu explicit.fifu
See Scripts
Similarity and Analogy¶
$ echo Tübingen | ffp-similar embeddings.fifu
$ echo Tübingen Stuttgart Heidelberg | ffp-analogy embeddings.fifu
See Scripts
Selecting Embeddings¶
It’s also possible to generate an embedding file based on an input vocabulary. For subword vocabularies, ffp-select adds computed representations for unknown words.
$ ffp-select big-embeddings.fifu small-output.fifu vocab.txt
Install¶
finalfusion is compatible with Python 3.6 and more recent versions. Direct dependencies are numpy and toml. Installing for Python 3.6 additionally requires dataclasses.
Pip¶
From PyPI:
$ pip install finalfusion
From GitHub:
$ pip install git+https://github.com/finalfusion/finalfusion-python
Source¶
Installing from source requires Cython. When finalfusion is built with pip, you don’t need to install Cython manually since the dependency is specified in pyproject.toml.
$ git clone https://github.com/finalfusion/finalfusion-python
$ cd finalfusion-python
$ pip install . # or python setup.py install
If wheel is installed, a wheel can be built from source by:
$ git clone https://github.com/finalfusion/finalfusion-python
$ cd finalfusion-python
$ python setup.py bdist_wheel
Top-level Exports¶
finalfusion re-exports some common types at the top level. These types cover the typical use cases.
Embeddings¶
Embeddings | Embeddings class.
load_finalfusion | Read embeddings from a file in finalfusion format.
Compat¶
load_fasttext | Read embeddings from a file in fastText format.
load_text | Read embeddings in text format.
load_text_dims | Read embeddings in text-dims format.
load_word2vec | Read embeddings in word2vec binary format.
Metadata¶
Metadata | Embeddings metadata.
load_metadata | Load a Metadata chunk from the given file.
Norms¶
Norms | Norms Chunk.
load_norms | Load Norms from a finalfusion file.
Storage¶
Storage | Common interface to finalfusion storage types.
load_storage | Load any storage from a finalfusion file.
Vocab¶
Vocab | Finalfusion vocabulary interface.
load_vocab | Load any vocabulary from a finalfusion file.
API¶
Embeddings¶
Finalfusion Embeddings
-
class
finalfusion.embeddings.
Embeddings
(storage: finalfusion.storage.storage.Storage, vocab: finalfusion.vocab.vocab.Vocab, norms: Optional[finalfusion.norms.Norms] = None, metadata: Optional[finalfusion.metadata.Metadata] = None, origin: str = '<memory>')[source]¶ Embeddings class.
Embeddings always contain a Storage and a Vocab. Optional chunks are Norms, corresponding to the embeddings of the in-vocab tokens, and Metadata.
Embeddings can be retrieved through three methods:
Embeddings.embedding() accepts a default value and returns it if no embedding could be found.
Embeddings.__getitem__() retrieves an embedding for the query but raises an exception if it cannot retrieve an embedding.
Embeddings.embedding_with_norm() requires a Norms chunk and returns an embedding together with the corresponding L2 norm.
Embeddings are composed of the 4 chunk types:
Storage (required)
Vocab (required)
Norms (optional)
Metadata (optional)
Examples
>>> storage = NdArray(np.float32(np.random.rand(2, 10)))
>>> vocab = SimpleVocab(["Some", "words"])
>>> metadata = Metadata({"Some": "value", "numerical": 0})
>>> norms = Norms(np.float32(np.random.rand(2)))
>>> embeddings = Embeddings(storage=storage, vocab=vocab, metadata=metadata, norms=norms)
>>> embeddings.vocab.words
['Some', 'words']
>>> np.allclose(embeddings["Some"], storage[0])
True
>>> try:
...     embeddings["oov"]
... except KeyError:
...     True
True
>>> _, n = embeddings.embedding_with_norm("Some")
>>> np.isclose(n, norms[0])
True
>>> embeddings.metadata
{'Some': 'value', 'numerical': 0}
-
__init__
(storage: finalfusion.storage.storage.Storage, vocab: finalfusion.vocab.vocab.Vocab, norms: Optional[finalfusion.norms.Norms] = None, metadata: Optional[finalfusion.metadata.Metadata] = None, origin: str = '<memory>')[source]¶ Initialize Embeddings.
Initializes Embeddings with the given chunks.
- Conditions
The following conditions need to hold if the respective chunks are passed:
Chunks need to have the expected type.
vocab.upper_bound == storage.shape[0]
len(vocab) == len(norms)
len(norms) <= storage.shape[0]
- Parameters
storage (Storage) – Embeddings Storage.
vocab (Vocab) – Embeddings Vocabulary.
norms (Norms, optional) – Embeddings Norms.
metadata (Metadata, optional) – Embeddings Metadata.
origin (str, optional) – Origin of the embeddings, e.g. file name
- Raises
AssertionError – If any of the conditions don’t hold.
-
__getitem__
(item: str) → numpy.ndarray[source]¶ Returns an embedding.
- Parameters
item (str) – The query item.
- Returns
embedding – The embedding.
- Return type
numpy.ndarray
- Raises
KeyError – If no embedding could be retrieved.
-
embedding
(word: str, out: Optional[numpy.ndarray] = None, default: Optional[numpy.ndarray] = None) → Optional[numpy.ndarray][source]¶ Embedding lookup.
Looks up the embedding for the input word.
If an out array is specified, the embedding is written into the array.
If it is not possible to retrieve an embedding for the input word, the default value is returned. This defaults to None. An embedding cannot be retrieved if the vocabulary cannot provide an index for word.
This method never fails. If you do not provide a default value, check the return value for None.
out is left untouched if no embedding can be found and default is None.
- Parameters
word (str) – The query word.
out (numpy.ndarray, optional) – Optional output array to write the embedding into.
default (numpy.ndarray, optional) – Optional default value to return if no embedding can be retrieved. Defaults to None.
- Returns
embedding – The retrieved embedding or the default value.
- Return type
numpy.ndarray, optional
Examples
>>> matrix = np.float32(np.random.rand(2, 10))
>>> storage = NdArray(matrix)
>>> vocab = SimpleVocab(["Some", "words"])
>>> embeddings = Embeddings(storage=storage, vocab=vocab)
>>> np.allclose(embeddings.embedding("Some"), matrix[0])
True
>>> # default value is None
>>> embeddings.embedding("oov") is None
True
>>> # It's possible to specify a default value
>>> default = embeddings.embedding("oov", default=storage[0])
>>> np.allclose(default, storage[0])
True
>>> # Embeddings can be written to an output buffer.
>>> out = np.zeros(10, dtype=np.float32)
>>> out2 = embeddings.embedding("Some", out=out)
>>> out is out2
True
>>> np.allclose(out, matrix[0])
True
-
embedding_with_norm
(word: str, out: Optional[numpy.ndarray] = None, default: Optional[Tuple[numpy.ndarray, float]] = None) → Optional[Tuple[numpy.ndarray, float]][source]¶ Embedding lookup with norm.
Looks up the embedding for the input word together with its norm.
If an out array is specified, the embedding is written into the array.
If it is not possible to retrieve an embedding for the input word, the default value is returned. This defaults to None. An embedding can not be retrieved if the vocabulary cannot provide an index for word.
This method raises a TypeError if norms are not set.
- Parameters
word (str) – The query word.
out (numpy.ndarray, optional) – Optional output array to write the embedding into.
default (Tuple[numpy.ndarray, float], optional) – Optional default value to return if no embedding can be retrieved. Defaults to None.
- Returns
(embedding, norm) – Tuple with the retrieved embedding or the default value at the first index and the norm at the second index.
- Return type
EmbeddingWithNorm, optional
-
property
dims
¶ Get the embedding dimensionality.
- Returns
dims – Embedding dimensionality
- Return type
int
-
property
n_words
¶ Get the number of known words.
- Returns
n_words – Number of known words
- Return type
int
-
property
norms
¶ The Norms.
- Getter
Returns None or the Norms.
- Setter
Set the Norms.
- Returns
norms – The Norms or None.
- Return type
Norms, optional
- Raises
AssertionError – If embeddings.storage.shape[0] < len(embeddings.norms) or len(embeddings.norms) != len(embeddings.vocab).
TypeError – If norms is neither Norms nor None.
-
property
metadata
¶ The Metadata.
-
property
origin
¶ The origin of the embeddings.
- Returns
origin – Origin of the embeddings, e.g. file name
- Return type
str
-
chunks
() → List[finalfusion.io.Chunk][source]¶ Get the Embeddings Chunks as a list.
The Chunks are ordered in the expected serialization order:
Metadata (optional)
Vocabulary
Storage
Norms (optional)
- Returns
chunks – List of embeddings chunks.
- Return type
List[Chunk]
-
write
(file: Union[str, bytes, int, os.PathLike])[source]¶ Write the Embeddings to the given file.
Writes the Embeddings to a finalfusion file at the given path.
- Parameters
file (str, bytes, int, PathLike) – Path of the output file.
-
bucket_to_explicit
() → finalfusion.embeddings.Embeddings[source]¶ Bucket to explicit Embeddings conversion.
Multiple embeddings can still map to the same bucket, but all buckets that are not indexed by in-vocabulary n-grams are eliminated. This can have a big impact on the size of the embedding matrix.
Metadata is not copied to the new embeddings since it doesn’t reflect the changes. You can manually set the metadata and update the values accordingly.
- Returns
embeddings – Embeddings with an ExplicitVocab instead of a hash-based vocabulary.
- Return type
Embeddings
- Raises
TypeError – If the current vocabulary is not a hash-based vocabulary (FinalfusionBucketVocab or FastTextVocab)
-
analogy
(word1: str, word2: str, word3: str, k: int = 1, skip: Set[str] = None) → Optional[List[finalfusion.embeddings.SimilarityResult]][source]¶ Perform an analogy query.
This method returns words that are close in vector space to the answer of the analogy query "word1 is to word2 as word3 is to ?". More concretely, it searches for embeddings that are similar to:
embedding(word2) - embedding(word1) + embedding(word3)
Words specified in skip are not considered as answers. If skip is None, the query words word1, word2 and word3 are excluded.
At most k results are returned. None is returned when no embedding could be computed for any of the tokens.
- Parameters
word1 (str) – Word1 is to…
word2 (str) – word2 like…
word3 (str) – word3 is to the return value
skip (Set[str]) – Set of strings which should not be considered as answers. Defaults to None, which excludes the query strings. To allow the query strings as answers, pass an empty set.
k (int) – Number of answers to return, defaults to 1.
- Returns
answers – List of answers.
- Return type
List[SimilarityResult]
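The analogy arithmetic described above can be illustrated with a minimal, dependency-free sketch. It does not use finalfusion itself: the toy vectors, the helper names (dot, normalize, toy_analogy) and the choice to return None when any of the three query words is missing are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def toy_analogy(vectors, word1, word2, word3, k=1, skip=None):
    """Rank words by similarity to embedding(word2) - embedding(word1) + embedding(word3)."""
    # Assumption: give up if any query word has no vector.
    if any(w not in vectors for w in (word1, word2, word3)):
        return None
    # Mirror the documented default: exclude the query words themselves.
    if skip is None:
        skip = {word1, word2, word3}
    target = normalize([
        b - a + c
        for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])
    ])
    scored = [(dot(target, vec), word)
              for word, vec in vectors.items() if word not in skip]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:k]]


# Hypothetical 3-dimensional unit vectors, not real embeddings.
toy = {
    "man": normalize([1.0, 0.0, 0.2]),
    "woman": normalize([1.0, 1.0, 0.2]),
    "king": normalize([1.0, 0.0, 1.0]),
    "queen": normalize([1.0, 1.0, 1.0]),
    "apple": normalize([-1.0, 0.1, 0.0]),
}
result = toy_analogy(toy, "man", "woman", "king")
print(result[0][0])  # queen
```

Passing skip=set() in this sketch lets the query words compete as answers, which corresponds to the empty-set behaviour described for the skip parameter.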
-
word_similarity
(query: str, k: int = 10) → Optional[List[finalfusion.embeddings.SimilarityResult]][source]¶ Retrieves the nearest neighbors of the query string.
The similarity between the embedding of the query and other embeddings is defined by the dot product of the embeddings. If the vectors are unit vectors, this is the cosine similarity.
At most k results are returned.
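The relationship between the dot product and cosine similarity can be checked with a small standard-library sketch. The vectors and helper names here are hypothetical and unrelated to the finalfusion API; the point is only that the two measures agree once the vectors have unit length.

```python
import math


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def unit(v):
    """Return v scaled to unit length."""
    n = l2_norm(v)
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                # 24.0, depends on vector length
unit_dot = dot(unit(a), unit(b))   # dot product of the unit vectors
cos = cosine(a, b)                 # 24 / (5 * 5) = 0.96

print(math.isclose(unit_dot, cos))  # True
```

For vectors that are not normalized, the raw dot product (24.0 here) is not a cosine, so a ranking by dot product can favour long vectors.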
-
embedding_similarity
(query: numpy.ndarray, k: int = 10, skip: Optional[Set[str]] = None) → Optional[List[finalfusion.embeddings.SimilarityResult]][source]¶ Retrieves the nearest neighbors of the query embedding.
The similarity between the query embedding and other embeddings is defined by the dot product of the embeddings. If the vectors are unit vectors, this is the cosine similarity.
At most k results are returned.
- Parameters
query (numpy.ndarray) – The query array.
k (int) – The number of neighbours to return, defaults to 10.
skip (Set[str], optional) – Set of strings that should not be considered as neighbours.
- Returns
neighbours – List of tuples with neighbour and similarity measure. None if no embedding can be found for query.
- Return type
List[SimilarityResult], optional
-
class
finalfusion.embeddings.
SimilarityResult
(word: str, similarity: float)[source]¶ Container for a Similarity result.
The word can be accessed through result.word, the similarity through result.similarity.
-
finalfusion.embeddings.
load_finalfusion
(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → finalfusion.embeddings.Embeddings[source]¶ Read embeddings from a file in finalfusion format.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in finalfusion format.
mmap (bool) – Toggles memory mapping the storage buffer.
- Returns
embeddings – The embeddings from the input file.
- Return type
Embeddings
Storage¶
finalfusion.storage
load_storage | Load any storage from a finalfusion file.
load_ndarray | Load an array chunk from the given file.
NdArray | Array storage.
load_quantized_array | Load a quantized array chunk from the given file.
QuantizedArray | QuantizedArray storage.
NdArray¶
-
class
finalfusion.storage.ndarray.
NdArray
(array: numpy.ndarray)[source]¶ Bases:
numpy.ndarray, finalfusion.storage.storage.Storage
Array storage.
Wraps a numpy matrix, either in-memory or memory-mapped.
Examples
>>> storage = NdArray(np.array([[1., 0.5], [0.5, 1.], [0.3, 0.4]],
...                            dtype=np.float32))
>>> # slicing an NdArray returns a storage backed by the same array
>>> storage[:2]
NdArray([[1. , 0.5],
         [0.5, 1. ]], dtype=float32)
>>> # NdArray storage can be treated as numpy arrays
>>> storage * 2
NdArray([[2. , 1. ],
         [1. , 2. ],
         [0.6, 0.8]], dtype=float32)
>>> # Indexing with arrays, lists or ints returns numpy.ndarray
>>> storage[0]
array([1. , 0.5], dtype=float32)
-
static
__new__
(cls, array: numpy.ndarray)[source]¶ Construct a new NdArray storage.
- Parameters
array (np.ndarray) – The storage buffer.
- Raises
TypeError – If the array is not a 2-dimensional float32 array.
-
classmethod
load
(file: BinaryIO, mmap: bool = False) → finalfusion.storage.ndarray.NdArray[source]¶ Load Storage from the given finalfusion file.
- Parameters
file (IO[bytes]) – Finalfusion file with a storage chunk
mmap (bool) – Toggles memory mapping the storage buffer as read-only.
- Returns
storage (Storage) – The first storage in the input file
- Raises
ValueError – If the file did not contain a storage.
-
static
chunk_identifier
() → finalfusion.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
ChunkIdentifier
-
static
read_chunk
(file: BinaryIO) → finalfusion.storage.ndarray.NdArray[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
NdArray
-
property
shape
¶ Tuple of array dimensions.
The shape property is usually used to get the current shape of an array, but may also be used to reshape the array in-place by assigning a tuple of array dimensions to it. As with numpy.reshape, one of the new shape dimensions can be -1, in which case its value is inferred from the size of the array and the remaining dimensions. Reshaping an array in-place will fail if a copy is required.
Examples
>>> x = np.array([1, 2, 3, 4])
>>> x.shape
(4,)
>>> y = np.zeros((2, 3, 4))
>>> y.shape
(2, 3, 4)
>>> y.shape = (3, 8)
>>> y
array([[ 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0.]])
>>> y.shape = (3, 6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: total size of new array must be unchanged
>>> np.zeros((4,2))[::2].shape = (-1,)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: Incompatible shape for in-place modification. Use `.reshape()` to make a copy with the desired shape.
See also
numpy.reshape – similar function
ndarray.reshape – similar method
-
static
mmap_chunk
(file: BinaryIO) → finalfusion.storage.ndarray.NdArray[source]¶ Memory maps the storage as a read-only buffer.
- Parameters
file (IO[bytes]) – Finalfusion file with a storage chunk
- Returns
storage – The first storage in the input file
- Return type
NdArray
- Raises
ValueError – If the file did not contain a storage.
-
finalfusion.storage.ndarray.
load_ndarray
(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → finalfusion.storage.ndarray.NdArray[source]¶ Load an array chunk from the given file.
- Parameters
file (str, bytes, int, PathLike) – Finalfusion file with a ndarray chunk.
mmap (bool) – Toggles memory mapping the array buffer as read only.
- Returns
storage – The NdArray storage from the file.
- Return type
NdArray
- Raises
ValueError – If the file did not contain an NdArray chunk.
Quantized¶
-
class
finalfusion.storage.quantized.
QuantizedArray
(pq: finalfusion.storage.quantized.PQ, quantized_embeddings: numpy.ndarray, norms: Optional[numpy.ndarray])[source]¶ Bases:
finalfusion.storage.storage.Storage
QuantizedArray storage.
QuantizedArrays support slicing, indexing with integers, lists of integers and arbitrary dimensional integer arrays. Slicing a QuantizedArray returns a new QuantizedArray but does not copy any buffers.
QuantizedArrays offer two ways of indexing:
QuantizedArray.__getitem__():
passing a slice returns a new view of the QuantizedArray.
passing an integer returns a single embedding, lists and arrays return ndims + 1 dimensional embeddings.
QuantizedArray.embedding():
embeddings can be written to an output buffer.
passing a slice returns a matrix holding reconstructed embeddings.
otherwise, this method behaves like __getitem__().
A QuantizedArray can be treated as a numpy.ndarray through numpy.asarray(). This restores the original matrix and copies it into a new buffer.
Using common numpy functions on a QuantizedArray will produce a regular ndarray in the process and is therefore an expensive operation.
-
__init__
(pq: finalfusion.storage.quantized.PQ, quantized_embeddings: numpy.ndarray, norms: Optional[numpy.ndarray])[source]¶ Initialize a QuantizedArray.
- Parameters
pq (PQ) – A product quantizer
quantized_embeddings (numpy.ndarray) – The quantized embeddings
norms (numpy.ndarray, optional) – Optional norms corresponding to the quantized embeddings. Reconstructed embeddings are scaled by their norm.
-
property
shape
¶ Get the shape of the storage.
-
embedding
(key, out: numpy.ndarray = None) → numpy.ndarray[source]¶ Get embeddings.
If key is an integer, a single reconstructed embedding is returned.
If key is a list of integers or a slice, a matrix of reconstructed embeddings is returned.
If key is an n-dimensional array, a tensor with reconstructed embeddings is returned. This tensor has one new axis in the last dimension containing the embeddings.
If out is passed, the reconstruction is written to this buffer. out.shape needs to match the dimensions described above.
- Parameters
key (int, list, numpy.ndarray, slice) – Key specifying which embeddings to retrieve.
out (numpy.ndarray) – Array to reconstruct the embeddings into.
- Returns
reconstruction – The reconstructed embedding or embeddings.
- Return type
numpy.ndarray
-
property
quantized_len
¶ Length of the quantized embeddings.
- Returns
quantized_len – Length of quantized embeddings.
- Return type
int
-
classmethod
load
(file: BinaryIO, mmap=False) → finalfusion.storage.quantized.QuantizedArray[source]¶ Load Storage from the given finalfusion file.
- Parameters
file (IO[bytes]) – Finalfusion file with a storage chunk
mmap (bool) – Toggles memory mapping the storage buffer as read-only.
- Returns
storage (Storage) – The first storage in the input file
- Raises
ValueError – If the file did not contain a storage.
-
static
read_chunk
(file: BinaryIO) → finalfusion.storage.quantized.QuantizedArray[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
QuantizedArray
-
static
mmap_chunk
(file: BinaryIO) → finalfusion.storage.quantized.QuantizedArray[source]¶ Memory maps the storage as a read-only buffer.
- Parameters
file (IO[bytes]) – Finalfusion file with a storage chunk
- Returns
storage – The first storage in the input file
- Return type
QuantizedArray
- Raises
ValueError – If the file did not contain a storage.
-
write_chunk
(file: BinaryIO)[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
static
chunk_identifier
() → finalfusion.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
ChunkIdentifier
-
class
finalfusion.storage.quantized.
PQ
(quantizers: numpy.ndarray, projection: Optional[numpy.ndarray])[source]¶ Product Quantizer
Product Quantizers are vector quantizers which decompose high dimensional vector spaces into subspaces. Each of these subspaces is a slice of the original vector space. Embeddings are quantized by assigning their ith slice to the closest centroid.
Product Quantizers can reconstruct vectors by concatenating the slices of the quantized vector.
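The quantize-then-reconstruct idea can be sketched with plain Python lists. This is an illustration of the general technique, not the PQ class: the function names and the tiny codebooks are hypothetical, and the optional projection matrix and norm scaling mentioned elsewhere in this section are left out.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, quantizers):
    """Assign the i-th slice of the vector to its closest centroid in quantizer i."""
    slice_len = len(quantizers[0][0])
    codes = []
    for i, centroids in enumerate(quantizers):
        piece = vector[i * slice_len:(i + 1) * slice_len]
        distances = [squared_distance(piece, centroid) for centroid in centroids]
        codes.append(distances.index(min(distances)))
    return codes


def reconstruct(codes, quantizers):
    """Concatenate the centroids selected by the codes."""
    out = []
    for code, centroids in zip(codes, quantizers):
        out.extend(centroids[code])
    return out


# Two subquantizers with two 2-dimensional centroids each (hypothetical values).
quantizers = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
vector = [0.9, 1.1, 0.8, 0.1]

codes = quantize(vector, quantizers)     # [1, 1]
approx = reconstruct(codes, quantizers)  # [1.0, 1.0, 1.0, 0.0]
print(codes, approx)
```

The reconstruction is only an approximation of the input: each slice is replaced by its nearest centroid, which is why the four floats above can be stored as two small integer codes.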
-
__init__
(quantizers: numpy.ndarray, projection: Optional[numpy.ndarray])[source]¶ Initializes a Product Quantizer.
- Parameters
quantizers (np.ndarray) – 3-d ndarray with dtype uint8
projection (np.ndarray, optional) – Projection matrix, must be a square matrix with shape [reconstructed_len, reconstructed_len]
- Raises
AssertionError – If the projection shape does not match the reconstructed_len
-
property
n_centroids
¶ Number of centroids per quantizer.
- Returns
n_centroids – The number of centroids per quantizer.
- Return type
int
-
property
projection
¶ Projection matrix.
- Returns
projection – Projection Matrix (2-d numpy array with datatype float32) or None.
- Return type
np.ndarray, optional
-
property
reconstructed_len
¶ Reconstructed length.
- Returns
reconstructed_len – Length of the reconstructed vectors.
- Return type
int
-
property
subquantizers
¶ Get the quantizers.
Returns a 3-d array with shape quantizers * n_centroids * reconstructed_len / quantizers
- Returns
quantizers (np.ndarray) – 3-d np.ndarray with dtype=np.uint8
-
reconstruct
(quantized: numpy.ndarray, out: numpy.ndarray = None) → numpy.ndarray[source]¶ Reconstruct vectors.
- Parameters
quantized (np.ndarray) – Batch of quantized vectors. 2-d np.ndarray with integers required.
out (np.ndarray, optional) – 2-d np.ndarray to write the output into.
- Returns
out – Batch of reconstructed vectors.
- Return type
np.ndarray
- Raises
AssertionError – If out is passed and its last dimension does not match reconstructed_len or its first n-1 dimensions do not match the first n-1 dimensions of quantized.
-
-
finalfusion.storage.quantized.
load_quantized_array
(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → finalfusion.storage.quantized.QuantizedArray[source]¶ Load a quantized array chunk from the given file.
- Parameters
file (str, bytes, int, PathLike) – Finalfusion file with a quantized array chunk.
mmap (bool) – Toggles memory mapping the array buffer as read only.
- Returns
storage – The QuantizedArray storage from the file.
- Return type
QuantizedArray
- Raises
ValueError – If the file did not contain a QuantizedArray chunk.
Storage Interface¶
-
class
finalfusion.storage.storage.
Storage
[source]¶ Common interface to finalfusion storage types.
-
abstract property
shape
¶ Get the shape of the storage.
-
classmethod
load
(file: BinaryIO, mmap: bool = False) → finalfusion.storage.storage.Storage[source]¶ Load Storage from the given finalfusion file.
- Parameters
file (IO[bytes]) – Finalfusion file with a storage chunk
mmap (bool) – Toggles memory mapping the storage buffer as read-only.
- Returns
storage (Storage) – The first storage in the input file
- Raises
ValueError – If the file did not contain a storage.
-
abstract static
mmap_chunk
(file: BinaryIO) → finalfusion.storage.storage.Storage[source]¶ Memory maps the storage as a read-only buffer.
- Parameters
file (IO[bytes]) – Finalfusion file with a storage chunk
- Returns
storage – The first storage in the input file
- Return type
Storage
- Raises
ValueError – If the file did not contain a storage.
-
finalfusion.storage.
load_storage
(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → finalfusion.storage.storage.Storage[source]¶ Load any storage from a finalfusion file.
Loads the first known storage from a finalfusion file.
- Parameters
file (str) – Path to finalfusion file containing a storage chunk.
mmap (bool) – Toggles memory mapping the storage buffer as read-only.
- Returns
storage – First finalfusion Storage in the file.
- Return type
Storage
- Raises
ValueError – If the file did not contain a storage.
Vocabularies¶
finalfusion.vocab
load_vocab | Load any vocabulary from a finalfusion file.
load_finalfusion_bucket_vocab | Load a FinalfusionBucketVocab from the given finalfusion file.
load_fasttext_vocab | Load a FastTextVocab from the given finalfusion file.
load_explicit_vocab | Load an ExplicitVocab from the given finalfusion file.
load_simple_vocab | Load a SimpleVocab from the given finalfusion file.
Vocab | Finalfusion vocabulary interface.
SimpleVocab | Simple vocabulary.
SubwordVocab | Interface for vocabularies with subword lookups.
FinalfusionBucketVocab | Finalfusion Bucket Vocabulary.
FastTextVocab | FastText vocabulary.
ExplicitVocab | A vocabulary with explicitly stored n-grams.
SimpleVocab¶
-
class
finalfusion.vocab.simple_vocab.
SimpleVocab
(*args, **kwds)[source]¶ Bases:
finalfusion.vocab.vocab.Vocab
Simple vocabulary.
SimpleVocabs provide a simple string to index mapping and index to string mapping.
-
__init__
(words: List[str])[source]¶ Initialize a SimpleVocab.
Initializes the vocabulary with the given words. The nth word in the words list is assigned index n. The word list cannot contain duplicate entries.
- Parameters
words (List[str]) – List of unique words
- Raises
AssertionError – If words contains duplicate entries.
-
property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
property
word_index
¶ Get the index of known words
-
property
upper_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
upper_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
int
-
idx
(item: str, default: Optional[Union[list, int]] = None) → Optional[Union[list, int]][source]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index –
An integer if there is a single index for a known item.
A list if the vocab can provide subword indices for an unknown item.
The provided default item if the vocab can’t provide indices.
- Return type
Optional[Union[list, int]]
-
static
read_chunk
(file: BinaryIO) → finalfusion.vocab.simple_vocab.SimpleVocab[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
SimpleVocab
-
write_chunk
(file: BinaryIO)[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
static
chunk_identifier
() → finalfusion.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
ChunkIdentifier
-
-
finalfusion.vocab.simple_vocab.
load_simple_vocab
(file: Union[str, bytes, int, os.PathLike]) → finalfusion.vocab.simple_vocab.SimpleVocab[source]¶ Load a SimpleVocab from the given finalfusion file.
- Parameters
file (str) – Path to file containing a SimpleVocab chunk.
- Returns
vocab – Returns the first SimpleVocab in the file.
- Return type
SimpleVocab
FinalfusionBucketVocab¶
-
class
finalfusion.vocab.subword.
FinalfusionBucketVocab
(*args, **kwds)[source]¶ Bases:
finalfusion.vocab.subword.SubwordVocab
Finalfusion Bucket Vocabulary.
-
__init__
(words: List[str], indexer: Optional[finalfusion.subword.hash_indexers.FinalfusionHashIndexer] = None)[source]¶ Initialize a FinalfusionBucketVocab.
Initializes the vocabulary with the given words.
If no indexer is passed, a FinalfusionHashIndexer with bucket exponent 21 is used.
The word list cannot contain duplicate entries.
- Parameters
words (List[str]) – List of unique words
indexer (FinalfusionHashIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2^21 buckets and an n-gram range of 3-6.
- Raises
AssertionError – If the indexer is not a FinalfusionHashIndexer or words contains duplicate entries.
-
to_explicit
() → finalfusion.vocab.subword.ExplicitVocab[source]¶ Return an ExplicitVocab built from this vocab.
This method iterates over the known words and extracts all n-grams within this vocab’s bounds. Each of the n-grams is hashed and mapped to an index. This index is not necessarily unique for each n-gram: if hashes collide, multiple n-grams will be mapped to the same index.
The returned vocab will be unable to produce indices for unknown n-grams.
The indices of the new vocab cover [0, vocab.upper_bound).
- Returns
explicit_vocab – The converted vocabulary.
- Return type
ExplicitVocab
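The bucket-to-explicit conversion can be pictured with a small standalone sketch. Everything in it is an assumption made for illustration: zlib.crc32 stands in for the real hash function, angle brackets are used as word boundary markers, and only subword indices are tracked (word indices are ignored). The n-gram range 3-6 and the bucket exponent 21 are the defaults mentioned above.

```python
import zlib


def bracketed_ngrams(word, min_n=3, max_n=6):
    """Extract all n-grams of the bracketed word within [min_n, max_n]."""
    token = "<" + word + ">"
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]


def bucket_index(ngram, buckets_exp):
    """Map an n-gram to one of 2**buckets_exp buckets (crc32 is a stand-in hash)."""
    return zlib.crc32(ngram.encode("utf-8")) % (2 ** buckets_exp)


def toy_to_explicit(words, buckets_exp=21, min_n=3, max_n=6):
    """Keep only buckets hit by in-vocabulary n-grams and renumber them densely."""
    bucket_to_new = {}  # old bucket index -> new dense index
    ngram_to_new = {}   # n-gram -> new dense index
    for word in words:
        for ngram in bracketed_ngrams(word, min_n, max_n):
            bucket = bucket_index(ngram, buckets_exp)
            # Colliding n-grams share a bucket and therefore share the new index.
            new_idx = bucket_to_new.setdefault(bucket, len(bucket_to_new))
            ngram_to_new[ngram] = new_idx
    return ngram_to_new, bucket_to_new


ngram_to_new, bucket_to_new = toy_to_explicit(["tübingen", "stuttgart"])
print(len(bucket_to_new), "of", 2 ** 21, "buckets are actually used")
```

In this sketch the new indices form the dense range starting at 0, so a matrix with one row per used bucket would be far smaller than one with 2^21 rows.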
-
write_chunk
(file: BinaryIO)[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
property
subword_indexer
¶ Get this vocab’s subword Indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
-
property
word_index
¶ Get the index of known words
-
static
read_chunk
(file: BinaryIO) → finalfusion.vocab.subword.FinalfusionBucketVocab[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
static
chunk_identifier
() → finalfusion.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
idx
(item: str, default: Optional[Union[int, List[int]]] = None) → Optional[Union[int, List[int]]]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index –
An integer if there is a single index for a known item.
A list if the vocab can provide subword indices for an unknown item.
The provided default item if the vocab can’t provide indices.
- Return type
-
property
max_n
¶ Get the upper bound of the range of extracted n-grams.
- Returns
max_n – upper bound of n-gram range.
- Return type
-
property
min_n
¶ Get the lower bound of the range of extracted n-grams.
- Returns
min_n – lower bound of n-gram range.
- Return type
-
subword_indices
(item: str, bracket: bool = True, with_ngrams: bool = False) → List[Union[int, Tuple[str, int]]]¶ Get the subword indices for the given item.
This list does not contain the index for known items.
-
subwords
(item: str, bracket: bool = True) → List[str]¶ Get the n-grams of the given item as a list.
The n-gram range is determined by the min_n and max_n values.
- Parameters
item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
ngrams – List of n-grams.
- Return type
List[str]
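As an illustration of n-gram extraction with bracketing, here is a minimal pure-Python sketch. The function name extract_ngrams is hypothetical, and the order in which the n-grams are emitted is an assumption; the library's Cython implementation may order them differently:

```python
from typing import List


def extract_ngrams(item: str, min_n: int = 3, max_n: int = 6,
                   bracket: bool = True) -> List[str]:
    """Return all n-grams of `item` with min_n <= n <= max_n."""
    if bracket:
        # Mark word boundaries so prefixes and suffixes get distinct n-grams.
        item = f"<{item}>"
    ngrams = []
    for start in range(len(item)):
        for n in range(min_n, max_n + 1):
            if start + n > len(item):
                break
            ngrams.append(item[start:start + n])
    return ngrams


print(extract_ngrams("hi"))                 # ['<hi', '<hi>', 'hi>']
print(extract_ngrams("hi", bracket=False))  # []
```

Without bracketing, a two-character item is shorter than min_n and yields no n-grams at all, which is one reason bracketing is enabled by default.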
-
property
upper_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
upper_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
-
finalfusion.vocab.subword.
load_finalfusion_bucket_vocab
(file: Union[str, bytes, int, os.PathLike]) → finalfusion.vocab.subword.FinalfusionBucketVocab[source]¶ Load a FinalfusionBucketVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a FinalfusionBucketVocab chunk.
- Returns
vocab – Returns the first FinalfusionBucketVocab in the file.
- Return type
ExplicitVocab¶
-
class
finalfusion.vocab.subword.
ExplicitVocab
(*args, **kwds)[source]¶ Bases:
finalfusion.vocab.subword.SubwordVocab
A vocabulary with explicitly stored n-grams.
-
__init__
(words: List[str], indexer: finalfusion.subword.explicit_indexer.ExplicitIndexer)[source]¶ Initialize an ExplicitVocab.
Initializes the vocabulary with the given words and ExplicitIndexer.
The word list cannot contain duplicate entries.
- Parameters
words (List[str]) – List of unique words
indexer (ExplicitIndexer) – Subword indexer to use for the vocabulary.
- Raises
AssertionError – If the indexer is not an ExplicitIndexer.
See also
-
property
word_index
¶ Get the index of known words
-
property
subword_indexer
¶ Get this vocab’s subword Indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
-
property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
static
chunk_identifier
() → finalfusion.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
static
read_chunk
(file: BinaryIO) → finalfusion.vocab.subword.ExplicitVocab[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
write_chunk
(file) → None[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
idx
(item: str, default: Optional[Union[int, List[int]]] = None) → Optional[Union[int, List[int]]]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index –
An integer if there is a single index for a known item.
A list if the vocab can provide subword indices for an unknown item.
The provided default item if the vocab can’t provide indices.
- Return type
-
property
max_n
¶ Get the upper bound of the range of extracted n-grams.
- Returns
max_n – upper bound of n-gram range.
- Return type
-
property
min_n
¶ Get the lower bound of the range of extracted n-grams.
- Returns
min_n – lower bound of n-gram range.
- Return type
-
subword_indices
(item: str, bracket: bool = True, with_ngrams: bool = False) → List[Union[int, Tuple[str, int]]]¶ Get the subword indices for the given item.
This list does not contain the index for known items.
-
subwords
(item: str, bracket: bool = True) → List[str]¶ Get the n-grams of the given item as a list.
The n-gram range is determined by the min_n and max_n values.
- Parameters
item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
ngrams – List of n-grams.
- Return type
List[str]
-
property
upper_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
upper_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
-
finalfusion.vocab.subword.
load_explicit_vocab
(file: Union[str, bytes, int, os.PathLike]) → finalfusion.vocab.subword.ExplicitVocab[source]¶ Load an ExplicitVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing an ExplicitVocab chunk.
- Returns
vocab – Returns the first ExplicitVocab in the file.
- Return type
FastTextVocab¶
-
class
finalfusion.vocab.subword.
FastTextVocab
(*args, **kwds)[source]¶ Bases:
finalfusion.vocab.subword.SubwordVocab
FastText vocabulary
-
__init__
(words: List[str], indexer: Optional[finalfusion.subword.hash_indexers.FastTextIndexer] = None)[source]¶ Initialize a FastTextVocab.
Initializes the vocabulary with the given words.
If no indexer is passed, a FastTextIndexer with 2_000_000 buckets is used.
The word list cannot contain duplicate entries.
- Parameters
words (List[str]) – List of unique words
indexer (FastTextIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2_000_000 buckets and range 3-6.
- Raises
AssertionError – If the indexer is not a FastTextIndexer or words contains duplicate entries.
-
to_explicit
() → finalfusion.vocab.subword.ExplicitVocab[source]¶ Return an ExplicitVocab built from this vocab.
This method iterates over the known words and extracts all ngrams within this vocab’s bounds. Each of the ngrams is hashed and mapped to an index. This index is not necessarily unique for each ngram: if hashes collide, multiple ngrams are mapped to the same index.
The returned vocab will be unable to produce indices for unknown ngrams.
The indices of the new vocab’s known ngrams cover [0, vocab.upper_bound).
- Returns
explicit_vocab – The converted vocabulary.
- Return type
-
property
subword_indexer
¶ Get this vocab’s subword Indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
-
property
word_index
¶ Get the index of known words
-
property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
static
read_chunk
(file: BinaryIO) → finalfusion.vocab.subword.FastTextVocab[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
write_chunk
(file: BinaryIO)[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
static
chunk_identifier
()[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
idx
(item: str, default: Optional[Union[int, List[int]]] = None) → Optional[Union[int, List[int]]]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index –
An integer if there is a single index for a known item.
A list if the vocab can provide subword indices for an unknown item.
The provided default item if the vocab can’t provide indices.
- Return type
-
property
max_n
¶ Get the upper bound of the range of extracted n-grams.
- Returns
max_n – upper bound of n-gram range.
- Return type
-
property
min_n
¶ Get the lower bound of the range of extracted n-grams.
- Returns
min_n – lower bound of n-gram range.
- Return type
-
subword_indices
(item: str, bracket: bool = True, with_ngrams: bool = False) → List[Union[int, Tuple[str, int]]]¶ Get the subword indices for the given item.
This list does not contain the index for known items.
-
subwords
(item: str, bracket: bool = True) → List[str]¶ Get the n-grams of the given item as a list.
The n-gram range is determined by the min_n and max_n values.
- Parameters
item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
ngrams – List of n-grams.
- Return type
List[str]
-
property
upper_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
upper_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
-
finalfusion.vocab.subword.
load_fasttext_vocab
(file: Union[str, bytes, int, os.PathLike]) → finalfusion.vocab.subword.FastTextVocab[source]¶ Load a FastTextVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a FastTextVocab chunk.
- Returns
vocab – Returns the first FastTextVocab in the file.
- Return type
Interfaces¶
-
class
finalfusion.vocab.vocab.
Vocab
(*args, **kwds)[source]¶ Bases:
finalfusion.io.Chunk
,collections.abc.Collection
,typing.Generic
Finalfusion vocabulary interface.
Vocabs provide at least a simple string to index mapping and index to string mapping. Vocab is the base type of all vocabulary types.
-
abstract property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
abstract property
word_index
¶ Get the index of known words
-
abstract property
upper_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
upper_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
abstract
idx
(item: str, default: Optional[Union[int, List[int]]] = None) → Optional[Union[int, List[int]]][source]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index –
An integer if there is a single index for a known item.
A list if the vocab can provide subword indices for an unknown item.
The provided default item if the vocab can’t provide indices.
- Return type
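The three possible outcomes of idx can be sketched without the library. Everything here is illustrative: sketch_idx, the plain dicts and the fixed trigram size are assumptions, and adding the vocabulary size as an offset follows the offset parameter documented for subword_indices:

```python
from typing import Dict, List, Optional, Union

Index = Union[int, List[int]]


def sketch_idx(item: str, word_index: Dict[str, int],
               known_ngrams: Dict[str, int],
               default: Optional[Index] = None) -> Optional[Index]:
    """Return an int for known words, a list for subword hits, else default."""
    if item in word_index:
        return word_index[item]
    bracketed = f"<{item}>"
    offset = len(word_index)  # subword rows are assumed to follow the word rows
    indices = [known_ngrams[bracketed[i:i + 3]] + offset
               for i in range(len(bracketed) - 2)
               if bracketed[i:i + 3] in known_ngrams]
    return indices if indices else default


words = {"cat": 0, "dog": 1}
ngrams = {"<ca": 0, "at>": 1}
print(sketch_idx("cat", words, ngrams))              # 0
print(sketch_idx("cab", words, ngrams))              # [2]
print(sketch_idx("xyz", words, ngrams, default=-1))  # -1
```

Callers that need to distinguish the cases can check the result with isinstance, since a known word always yields an int and a subword lookup always yields a list.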
-
abstract property
-
class
finalfusion.vocab.subword.
SubwordVocab
(*args, **kwds)[source]¶ Bases:
finalfusion.vocab.vocab.Vocab
Interface for vocabularies with subword lookups.
-
idx
(item: str, default: Optional[Union[int, List[int]]] = None) → Optional[Union[int, List[int]]][source]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index –
An integer if there is a single index for a known item.
A list if the vocab can provide subword indices for an unknown item.
The provided default item if the vocab can’t provide indices.
- Return type
-
property
upper_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
upper_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
property
min_n
¶ Get the lower bound of the range of extracted n-grams.
- Returns
min_n – lower bound of n-gram range.
- Return type
-
property
max_n
¶ Get the upper bound of the range of extracted n-grams.
- Returns
max_n – upper bound of n-gram range.
- Return type
-
abstract property
subword_indexer
¶ Get this vocab’s subword Indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
-
subwords
(item: str, bracket: bool = True) → List[str][source]¶ Get the n-grams of the given item as a list.
The n-gram range is determined by the min_n and max_n values.
- Parameters
item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
ngrams – List of n-grams.
- Return type
List[str]
-
-
finalfusion.vocab.
load_vocab
(file: Union[str, bytes, int, os.PathLike]) → finalfusion.vocab.vocab.Vocab[source]¶ Load any vocabulary from a finalfusion file.
Loads the first known vocabulary from a finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a finalfusion vocab chunk.
- Returns
vocab – First vocabulary in the file.
- Return type
- Raises
ValueError – If the file did not contain a vocabulary.
Subwords¶
finalfusion.subword
File: src/finalfusion/subword/hash_indexers.pyx (starting at line 155) |
|
|
File: src/finalfusion/subword/hash_indexers.pyx (starting at line 17) |
File: src/finalfusion/subword/explicit_indexer.pyx (starting at line 9) |
|
|
File: src/finalfusion/subword/ngrams.pyx (starting at line 7) |
FinalfusionHashIndexer¶
-
class
finalfusion.subword.hash_indexers.
FinalfusionHashIndexer
(bucket_exp=21, min_n=3, max_n=6)¶ File: src/finalfusion/subword/hash_indexers.pyx (starting at line 17)
FinalfusionHashIndexer
FinalfusionHashIndexer is a hash-based subword indexer. It hashes n-grams with the FNV-1a algorithm and maps the hash to a predetermined bucket space.
N-grams can be indexed directly through the __call__ method or all n-grams in a string can be indexed in bulk through the subword_indices method.
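To make the hashing step concrete, the following sketch implements standard 64-bit FNV-1a and folds the hash into 2**buckets_exp buckets. It is an approximation under stated assumptions: how the library feeds an n-gram into the hash and how it reduces the hash to a bucket are not described here, so hashing the UTF-8 bytes and masking the low bits are guesses, and bucket_index is a hypothetical helper. Actual indices produced by FinalfusionHashIndexer may differ:

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def bucket_index(ngram: str, buckets_exp: int = 21, offset: int = 0) -> int:
    """Fold the hash into 2**buckets_exp buckets (masking is an assumption)."""
    mask = (1 << buckets_exp) - 1
    return (fnv1a_64(ngram.encode("utf-8")) & mask) + offset


# With a word vocabulary of 100 entries, subword indices start at 100.
print(bucket_index("<he", offset=100))
```

Because the bucket space is fixed up front, any string can be indexed, at the cost of unrelated n-grams occasionally sharing a bucket.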
-
buckets_exp
¶ The bucket exponent.
The indexer has 2**buckets_exp buckets.
- Returns
buckets_exp – The buckets exponent
- Return type
-
max_n
¶ The upper bound of the n-gram range.
- Returns
max_n – Upper bound of n-gram range
- Return type
-
min_n
¶ The lower bound of the n-gram range.
- Returns
min_n – Lower bound of n-gram range
- Return type
-
subword_indices
(self, unicode word, uint64_t offset=0, bracket=True, with_ngrams=False)¶ File: src/finalfusion/subword/hash_indexers.pyx (starting at line 97)
Get the subword indices for a word.
- Parameters
word (str) – The string to extract n-grams from
offset (int) – The offset to add to the index, e.g. the length of the word-vocabulary.
bracket (bool) – Toggles bracketing the input string with < and >
with_ngrams (bool) – Toggles returning tuples of (ngram, idx)
- Returns
indices – List of n-gram indices, optionally as (str, int) tuples.
- Return type
- Raises
TypeError – If word is None.
-
FastTextIndexer¶
-
class
finalfusion.subword.hash_indexers.
FastTextIndexer
(n_buckets=2000000, min_n=3, max_n=6)¶ File: src/finalfusion/subword/hash_indexers.pyx (starting at line 155)
FastTextIndexer
FastTextIndexer is a hash-based subword indexer. It hashes n-grams with (a slightly faulty) FNV-1a variant and maps the hash to a predetermined bucket space.
N-grams can be indexed directly through the __call__ method or all n-grams in a string can be indexed in bulk through the subword_indices method.
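The text above does not say what makes the variant "slightly faulty". The sketch below assumes the behaviour of the fastText reference implementation, where each byte is treated as a signed 8-bit value and sign-extended before the XOR step of 32-bit FNV-1a. That assumption, the helper names and the modulo reduction are not taken from this documentation:

```python
FNV32_OFFSET_BASIS = 2166136261
FNV32_PRIME = 16777619


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a, shown for comparison."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


def fasttext_style_hash(data: bytes) -> int:
    """FNV-1a where bytes >= 0x80 are sign-extended to 32 bits (assumed)."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        signed = byte - 256 if byte >= 128 else byte
        h ^= signed & 0xFFFFFFFF
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


def fasttext_bucket(ngram: str, n_buckets: int = 2_000_000) -> int:
    """Reduce the hash to a bucket with a modulo (assumed reduction)."""
    return fasttext_style_hash(ngram.encode("utf-8")) % n_buckets


# ASCII input is unaffected; bytes above 0x7F hash differently.
print(fasttext_style_hash(b"abc") == fnv1a_32(b"abc"))    # True
print(fasttext_style_hash(b"\xe9") == fnv1a_32(b"\xe9"))  # False
```

Under this assumption the two hashes only diverge for non-ASCII text, so the deviation matters mainly for n-grams containing multi-byte UTF-8 characters.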
-
max_n
¶ The upper bound of the n-gram range.
- Returns
max_n – Upper bound of n-gram range
- Return type
-
min_n
¶ The lower bound of the n-gram range.
- Returns
min_n – Lower bound of n-gram range
- Return type
-
subword_indices
(self, unicode word, uint64_t offset=0, bracket=True, with_ngrams=False)¶ File: src/finalfusion/subword/hash_indexers.pyx (starting at line 219)
Get the subword indices for a word.
- Parameters
word (str) – The string to extract n-grams from
offset (int) – The offset to add to the index, e.g. the length of the word-vocabulary.
bracket (bool) – Toggles bracketing the input string with < and >
with_ngrams (bool) – Toggles returning tuples of (ngram, idx)
- Returns
indices – List of n-gram indices, optionally as (str, int) tuples.
- Return type
- Raises
TypeError – If word is None.
-
ExplicitIndexer¶
-
class
finalfusion.subword.explicit_indexer.
ExplicitIndexer
(ngrams: List[str], min_n: int = 3, max_n: int = 6, ngram_index: Optional[Dict[str, int]] = None)¶ File: src/finalfusion/subword/explicit_indexer.pyx (starting at line 9)
ExplicitIndexer
Explicit Indexers do not index n-grams through hashing but define an actual lookup table.
It can be constructed from a list of unique ngrams. In that case, the ith ngram in the list will be mapped to index i. It is also possible to pass a mapping via ngram_index which allows mapping multiple ngrams to the same value.
N-grams can be indexed directly through the __call__ method or all n-grams in a string can be indexed in bulk through the subword_indices method.
subword_indices optionally returns tuples of form (ngram, idx), otherwise a list of indices belonging to the input string is returned.
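The two construction modes can be mimicked with a plain dict. TinyExplicitIndexer is a hypothetical stand-in, not the library class, and returning None for unknown n-grams is an assumption about what a lookup miss looks like:

```python
from typing import Dict, List, Optional


class TinyExplicitIndexer:
    """Lookup-table indexer: no hashing, only n-grams it was given."""

    def __init__(self, ngrams: List[str],
                 ngram_index: Optional[Dict[str, int]] = None) -> None:
        if len(set(ngrams)) != len(ngrams):
            raise ValueError("ngrams must be unique")
        if ngram_index is None:
            # Default: the i-th n-gram in the list maps to index i.
            ngram_index = {ngram: i for i, ngram in enumerate(ngrams)}
        self.ngrams = ngrams
        self.ngram_index = ngram_index

    def __call__(self, ngram: str) -> Optional[int]:
        return self.ngram_index.get(ngram)


by_position = TinyExplicitIndexer(["<he", "hel", "lo>"])
shared = TinyExplicitIndexer(["<he", "hel", "lo>"],
                             {"<he": 0, "hel": 0, "lo>": 1})
print(by_position("hel"))            # 1
print(shared("<he"), shared("hel"))  # 0 0
```

The second form is what a conversion from a bucket vocabulary needs, because n-grams that collided in the hash space must continue to share one embedding row.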
-
max_n
¶ The upper bound of the n-gram range.
- Returns
max_n – Upper bound of n-gram range
- Return type
-
min_n
¶ The lower bound of the n-gram range.
- Returns
min_n – Lower bound of n-gram range
- Return type
-
ngram_index
¶ Get the ngram-index mapping.
Note: If you mutate this mapping you can make the indexer invalid.
-
ngrams
¶ Get the list of n-grams.
Note: If you mutate this list you can make the indexer invalid.
- Returns
ngrams – The list of in-vocabulary n-grams.
- Return type
List[str]
-
subword_indices
(self, unicode word, uint64_t offset=0, bracket=True, with_ngrams=False)¶ File: src/finalfusion/subword/explicit_indexer.pyx (starting at line 129)
Get the subword indices for a word.
- Parameters
word (str) – The string to extract n-grams from
offset (int) – The offset to add to the index, e.g. the length of the word-vocabulary.
bracket (bool) – Toggles bracketing the input string with < and >
with_ngrams (bool) – Toggles returning tuples of (ngram, idx)
- Returns
indices – List of n-gram indices, optionally as (str, int) tuples.
- Return type
- Raises
TypeError – If word is None.
-
NGrams¶
Metadata¶
finalfusion metadata
-
class
finalfusion.metadata.
Metadata
[source]¶ Bases:
dict
,finalfusion.io.Chunk
Embeddings metadata
Metadata can be used as a regular Python dict. For serialization, the contents need to be serializable through toml.dumps. Finalfusion assumes metadata to be a TOML formatted string.
Examples
>>> metadata = Metadata({'Some': 'value', 'number': 1})
>>> metadata
{'Some': 'value', 'number': 1}
>>> metadata['Some']
'value'
>>> metadata['Some'] = 'other value'
>>> metadata['Some']
'other value'
-
static
chunk_identifier
() → finalfusion.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
static
read_chunk
(file: BinaryIO) → finalfusion.metadata.Metadata[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
static
-
finalfusion.metadata.
load_metadata
(file: Union[str, bytes, int, os.PathLike]) → finalfusion.metadata.Metadata[source]¶ Load a Metadata chunk from the given file.
- Parameters
file (str, bytes, int, PathLike) – Finalfusion file with a metadata chunk.
- Returns
metadata – The Metadata from the file.
- Return type
- Raises
ValueError – If the file did not contain a Metadata chunk.
Norms¶
Norms module.
-
class
finalfusion.norms.
Norms
(array: numpy.ndarray)[source]¶ Bases:
numpy.ndarray
,finalfusion.io.Chunk
,collections.abc.Collection
,typing.Generic
Norms Chunk.
Norms subclasses numpy.ndarray, so all typical numpy operations are available.
-
static
__new__
(cls, array: numpy.ndarray)[source]¶ Construct new Norms.
- Parameters
array (numpy.ndarray) – Norms array.
- Returns
norms – The norms.
- Return type
- Raises
AssertionError – If array is not a 1-d array of float32 values.
-
static
chunk_identifier
() → finalfusion.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
static
read_chunk
(file: BinaryIO) → finalfusion.norms.Norms[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
static
-
finalfusion.norms.
load_norms
(file: Union[str, bytes, int, os.PathLike])[source]¶ Load Norms from a finalfusion file.
Loads the first Norms chunk from a finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to finalfusion file containing a Norms chunk.
- Returns
norms – First finalfusion Norms in the file.
- Return type
- Raises
ValueError – If the file did not contain norms.
IO¶
-
class
finalfusion.io.
Chunk
[source]¶ Basic building blocks of finalfusion files.
-
write
(file: Union[str, bytes, int, os.PathLike])[source]¶ Write the Chunk as a standalone finalfusion file.
- Parameters
file (Union[str, bytes, int, PathLike]) – Output path
- Raises
TypeError – If the Chunk is a Header.
-
abstract static
chunk_identifier
() → finalfusion.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
abstract static
read_chunk
(file: BinaryIO) → finalfusion.io.Chunk[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
-
class
finalfusion.io.
ChunkIdentifier
(value)[source]¶ Bases:
enum.IntEnum
Known finalfusion Chunk types.
-
Header
= 0¶
-
SimpleVocab
= 1¶
-
NdArray
= 2¶
-
BucketSubwordVocab
= 3¶
-
QuantizedArray
= 4¶
-
Metadata
= 5¶
-
NdNorms
= 6¶
-
FastTextSubwordVocab
= 7¶
-
ExplicitSubwordVocab
= 8¶
-
-
class
finalfusion.io.
FinalfusionFormatError
[source]¶ Bases:
Exception
Exception to specify that the format of a finalfusion file was incorrect.
-
class
finalfusion.io.
TypeId
(value)[source]¶ Bases:
enum.IntEnum
Known finalfusion data types.
-
u8
= 1¶
-
f32
= 10¶
-
Compat¶
finalfusion.compat
load_fasttext | Read embeddings from a file in fastText format.
write_fasttext | Write embeddings in fastText format.
load_text | Read embeddings in text format.
write_text | Write embeddings in text format.
load_text_dims | Read embeddings in text-dims format.
write_text_dims | Write embeddings in text-dims format.
load_word2vec | Read embeddings in word2vec binary format.
write_word2vec | Write embeddings in word2vec binary format.
Word2Vec¶
Word2vec binary format.
-
finalfusion.compat.word2vec.
load_word2vec
(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings[source]¶ Read embeddings in word2vec binary format.
The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is l2-normalized by default and the corresponding norms are stored in the Norms.
Files are expected to start with a line containing rows and cols in utf-8. Words are encoded in utf-8 followed by a single whitespace. After the whitespace, the embedding components are expected as little-endian single-precision floats.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in word2vec binary format.
lossy (bool) – If set to true, malformed UTF-8 sequences in words will be replaced with the U+FFFD REPLACEMENT character.
- Returns
embeddings – The embeddings from the input file.
- Return type
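The relationship between the l2-normalized storage and the stored norms can be shown with plain lists. This is a sketch of the arithmetic only; the helper names are hypothetical, and it assumes no row is a zero vector:

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def l2_normalize(rows: Matrix) -> Tuple[Matrix, List[float]]:
    """Scale each row to unit length and return the original lengths."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    unit = [[x / norm for x in row] for row, norm in zip(rows, norms)]
    return unit, norms


def unnormalize(unit: Matrix, norms: List[float]) -> Matrix:
    """Undo the normalization by scaling each row by its stored norm."""
    return [[x * norm for x in row] for row, norm in zip(unit, norms)]


unit, norms = l2_normalize([[3.0, 4.0], [0.0, 2.0]])
restored = unnormalize(unit, norms)
print(norms)  # [5.0, 2.0]
```

Keeping the norms alongside the unit vectors is what allows the writers in this module to restore the original magnitudes on serialization.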
-
finalfusion.compat.word2vec.
write_word2vec
(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings)[source]¶ Write embeddings in word2vec binary format.
If the embeddings are not compatible with the w2v format (e.g. include a SubwordVocab), only the known words and embeddings are serialized, i.e. the subword matrix is discarded.
Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by the associated norm.
The output file will contain the shape encoded in utf-8 on the first line as rows columns. This is followed by the embeddings.
Each embedding consists of:
utf-8 encoded word
single space ' ' following the word
cols single-precision floating point numbers
'\n' newline at the end of each line.
- Parameters
file (str, bytes, int, PathLike) – Output file
embeddings (Embeddings) – The embeddings to serialize.
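The byte layout described above can be reproduced with the standard library alone. The sketch below is not the package's writer or reader; write_w2v and read_w2v are hypothetical helpers operating on a plain dict, and little-endian floats are taken from the description of load_word2vec:

```python
import io
import struct
from typing import BinaryIO, Dict, List


def write_w2v(out: BinaryIO, vectors: Dict[str, List[float]]) -> None:
    """Write 'rows cols' header, then word, space, floats, newline per row."""
    rows = len(vectors)
    cols = len(next(iter(vectors.values()))) if vectors else 0
    out.write(f"{rows} {cols}\n".encode("utf-8"))
    for word, vector in vectors.items():
        out.write(word.encode("utf-8") + b" ")
        out.write(struct.pack(f"<{cols}f", *vector))
        out.write(b"\n")


def read_w2v(inp: BinaryIO) -> Dict[str, List[float]]:
    """Read the layout produced by write_w2v."""
    rows, cols = (int(x) for x in inp.readline().split())
    vectors: Dict[str, List[float]] = {}
    for _ in range(rows):
        word = bytearray()
        while True:
            char = inp.read(1)
            if not char:
                raise ValueError("unexpected end of file")
            if char == b" ":
                break
            word += char
        # The vector has a fixed size, so its bytes may contain any value.
        vector = list(struct.unpack(f"<{cols}f", inp.read(4 * cols)))
        inp.read(1)  # consume the trailing newline
        vectors[word.decode("utf-8")] = vector
    return vectors


buffer = io.BytesIO()
write_w2v(buffer, {"hello": [0.5, -1.25], "world": [2.0, 0.0]})
buffer.seek(0)
round_tripped = read_w2v(buffer)
```

The example values are exactly representable as single-precision floats, so the round trip returns them unchanged; arbitrary Python floats would be rounded to float32.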
Text¶
Text based embedding formats.
-
finalfusion.compat.text.
load_text
(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings[source]¶ Read embeddings in text format.
The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is l2-normalized by default and the corresponding norms are stored in the Norms.
Expects a file with utf-8 encoded lines with:
word at the start of the line
followed by whitespace
followed by whitespace separated vector components
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in text format.
lossy (bool) – If set to true, malformed UTF-8 sequences in words will be replaced with the U+FFFD REPLACEMENT character.
- Returns
embeddings – Embeddings from the input file. The resulting Embeddings will have a SimpleVocab, NdArray and Norms.
- Return type
-
finalfusion.compat.text.
load_text_dims
(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings[source]¶ Read embeddings in text-dims format.
The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is l2-normalized by default and the corresponding norms are stored in the Norms.
The first line contains whitespace separated rows and cols, the rest of the file contains whitespace separated word and vector components.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in text format with dimensions on the first line.
lossy (bool) – If set to true, malformed UTF-8 sequences in words will be replaced with the U+FFFD REPLACEMENT character.
- Returns
embeddings – The embeddings from the input file.
- Return type
-
finalfusion.compat.text.
write_text
(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings, sep=' ')[source]¶ Write embeddings in text format.
Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by the associated norm.
- The output consists of utf-8 encoded lines with:
word at the start of the line
followed by whitespace
followed by whitespace separated vector components
- Parameters
file (str, bytes, int, PathLike) – Output file
embeddings (Embeddings) – Embeddings to write
sep (str) – Separator of word and embeddings.
-
finalfusion.compat.text.
write_text_dims
(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings, sep=' ')[source]¶ Write embeddings in text-dims format.
Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by the associated norm.
- The output consists of utf-8 encoded lines with:
rows cols on the first line
word at the start of the line
followed by whitespace
followed by whitespace separated vector components
- Parameters
file (str, bytes, int, PathLike) – Output file
embeddings (Embeddings) – Embeddings to write
sep (str) – Separator of word and embeddings.
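For comparison with the binary format, the text-dims layout is simple enough to produce and parse by hand. This sketch uses a single space as separator throughout and works on strings rather than files; format_text_dims and parse_text_dims are hypothetical helpers, not part of the package:

```python
from typing import Dict, List


def format_text_dims(vectors: Dict[str, List[float]]) -> str:
    """Render 'rows cols' on the first line, then one word per line."""
    rows = len(vectors)
    cols = len(next(iter(vectors.values()))) if vectors else 0
    lines = [f"{rows} {cols}"]
    for word, vector in vectors.items():
        lines.append(" ".join([word] + [repr(x) for x in vector]))
    return "\n".join(lines) + "\n"


def parse_text_dims(text: str) -> Dict[str, List[float]]:
    """Parse the layout above and check it against the declared shape."""
    lines = text.splitlines()
    rows, cols = (int(x) for x in lines[0].split())
    vectors: Dict[str, List[float]] = {}
    for line in lines[1:]:
        word, *components = line.split()
        if len(components) != cols:
            raise ValueError(f"expected {cols} components for {word!r}")
        vectors[word] = [float(x) for x in components]
    if len(vectors) != rows:
        raise ValueError(f"expected {rows} rows, found {len(vectors)}")
    return vectors


text = format_text_dims({"hello": [0.5, -1.25, 3.0], "world": [2.0, 0.0, 1.0]})
parsed = parse_text_dims(text)
```

Dropping the first line of the output gives the plain text format, which carries no shape information and therefore cannot be validated this way.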
FastText¶
Fasttext IO compat module.
-
finalfusion.compat.fasttext.
load_fasttext
(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings[source]¶ Read embeddings from a file in fastText format.
The returned embeddings have a FastTextVocab, NdArray storage and a Norms chunk.
Loading embeddings with this method will precompute embeddings for each word by averaging all of its subword embeddings together with the distinct word vector. Additionally, all precomputed vectors are l2-normalized and the corresponding norms are stored in the Norms. The subword embeddings are not l2-normalized.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in fastText format.
lossy (bool) – If set to true, malformed UTF-8 sequences in words will be replaced with the U+FFFD REPLACEMENT character.
- Returns
embeddings – The embeddings from the input file.
- Return type
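The precomputation step can be illustrated on toy vectors. The sketch assumes an unweighted mean over the word's own vector and its subword vectors, which is how "averaging" is read here; precompute_word_vector is a hypothetical helper and the subsequent l2-normalization is left out:

```python
from typing import List


def precompute_word_vector(word_vector: List[float],
                           subword_vectors: List[List[float]]) -> List[float]:
    """Average the distinct word vector with all of its subword vectors."""
    stacked = [word_vector] + subword_vectors
    dims = len(word_vector)
    return [sum(vec[d] for vec in stacked) / len(stacked) for d in range(dims)]


print(precompute_word_vector([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]]))  # [3.0, 4.0]
```

Since the subword rows themselves are left untouched, the averaging can be reversed for known words when the embeddings are written back to fastText format.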
-
finalfusion.compat.fasttext.
write_fasttext
(file: Union[str, bytes, int, os.PathLike], embeds: finalfusion.embeddings.Embeddings)[source]¶ Write embeddings in fastText format.
Only embeddings with fastText vocabulary can be written to fastText format.
fastText models require values for all config keys. Some of these can be inferred from finalfusion models; others are assigned default values:
dims: inferred from model
window_size: 0
min_count: 0
ns: 0
word_ngrams: 1
loss: HierarchicalSoftmax
model: CBOW
buckets: inferred from model
min_n: inferred from model
max_n: inferred from model
lr_update_rate: 0
sampling_threshold: 0
- Some information from original fastText models gets lost e.g.:
word frequencies
n_tokens
Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by the associated norm. Additionally, the original state of the embedding matrix is restored, precomputation and l2-normalization of word embeddings is undone.
- Parameters
file (str, bytes, int, PathLike) – Output file
embeds (Embeddings) – Embeddings to write
Scripts¶
Installing finalfusion
adds some executables:
ffp-convert
for converting embeddings
ffp-similar
for similarity queries
ffp-analogy
for analogy queries
ffp-bucket-to-explicit
to convert bucket subword to explicit subword embeddings
Convert¶
ffp-convert
makes conversion between all supported embedding formats possible:
$ ffp-convert --help
usage: ffp-convert [-h] [-f FORMAT] [-t FORMAT] [-l] [--mmap] INPUT OUTPUT
Convert embeddings.
positional arguments:
INPUT Input embeddings
OUTPUT Output path
optional arguments:
-h, --help show this help message and exit
-f FORMAT, --from FORMAT
Valid choices: ['word2vec', 'finalfusion', 'fasttext',
'text', 'textdims'] Default: 'word2vec'
-t FORMAT, --to FORMAT
Valid choices: ['word2vec', 'finalfusion', 'fasttext',
'text', 'textdims'] Default: 'finalfusion'
-l, --lossy Whether to fail on malformed UTF-8. Setting this flag
replaces malformed UTF-8 with the replacement character.
Not applicable to finalfusion format.
--mmap Whether to mmap the storage. Only applicable to
finalfusion files.
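For use from Python, the documented options can be assembled into an argument list and passed to `subprocess`. This is a sketch under assumptions: the helper `convert_command` and the file names are hypothetical, and only the `-f` and `-t` flags and the format names from the help text above are used.

```python
import os
import shutil
import subprocess

# Format names as listed in the help text above.
FORMATS = ["word2vec", "finalfusion", "fasttext", "text", "textdims"]


def convert_command(src, dst, from_format="word2vec", to_format="finalfusion"):
    """Build the argument list for an ffp-convert call (hypothetical helper)."""
    for fmt in (from_format, to_format):
        if fmt not in FORMATS:
            raise ValueError(f"unknown format: {fmt}")
    return ["ffp-convert", "-f", from_format, "-t", to_format, src, dst]


cmd = convert_command("vectors.bin", "vectors.fifu")

# Run the conversion only if the executable is installed and the input exists.
if shutil.which("ffp-convert") is not None and os.path.exists("vectors.bin"):
    subprocess.run(cmd, check=True)
```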
Similar¶
ffp-similar
supports similarity queries:
$ ffp-similar --help
usage: ffp-similar [-h] [-f FORMAT] [-k K] [-l] [--mmap] EMBEDDINGS [input]
Similarity queries.
positional arguments:
EMBEDDINGS Input embeddings
input Optional input file with one word per line. If
unspecified reads from stdin
optional arguments:
-h, --help show this help message and exit
-f FORMAT, --format FORMAT
Valid choices: ['word2vec', 'finalfusion', 'fasttext',
'text', 'textdims'] Default: 'finalfusion'
-k K Number of neighbours. Default: 10
-l, --lossy Whether to fail on malformed UTF-8. Setting this flag
replaces malformed UTF-8 with the replacement character.
Not applicable to finalfusion format.
--mmap Whether to mmap the storage. Only applicable to
finalfusion files.
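Conceptually, a similarity query ranks all other words by their similarity to the query word and returns the top `K`. The sketch below assumes cosine similarity and uses a toy dictionary in place of real embeddings; it illustrates the idea and is not the script's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(embeddings, query, k=10):
    """Return up to k (word, similarity) pairs, excluding the query itself."""
    target = embeddings[query]
    scored = [(word, cosine(target, vec))
              for word, vec in embeddings.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional embeddings, for illustration only.
toy = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0]}
neighbours = most_similar(toy, "cat", k=2)
```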
Analogy¶
ffp-analogy
answers analogy queries:
$ ffp-analogy --help
usage: ffp-analogy [-h] [-f FORMAT] [-i {a,b,c} [{a,b,c} ...]] [-k K]
EMBEDDINGS [input]
Analogy queries.
positional arguments:
EMBEDDINGS Input embeddings
input Optional input file with 3 words per line. If
unspecified reads from stdin
optional arguments:
-h, --help show this help message and exit
-f FORMAT, --format FORMAT
Valid choices: ['word2vec', 'finalfusion', 'fasttext',
'text', 'textdims'] Default: 'finalfusion'
-i {a,b,c} [{a,b,c} ...], --include {a,b,c} [{a,b,c} ...]
Specify query parts that should be allowed as answers.
Valid choices: ['a', 'b', 'c']
-k K Number of neighbours. Default: 10
-l, --lossy Whether to fail on malformed UTF-8. Setting this flag
replaces malformed UTF-8 with the replacement character.
Not applicable to finalfusion format.
--mmap Whether to mmap the storage. Only applicable to
finalfusion files.
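An analogy query of the form "a is to b as c is to ?" is commonly answered by ranking words by their similarity to `b - a + c`. The sketch below assumes that formulation and cosine similarity; the `include` argument mimics the documented `--include` option by allowing the named query parts as answers. All names and toy vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def analogy(embeddings, a, b, c, k=10, include=()):
    """Rank words by similarity to b - a + c.

    Query words are excluded from the answers unless their part
    ("a", "b" or "c") is named in `include`.
    """
    target = [y - x + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    excluded = {word for part, word in (("a", a), ("b", b), ("c", c))
                if part not in include}
    scored = [(word, cosine(target, vec))
              for word, vec in embeddings.items() if word not in excluded]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional embeddings, for illustration only.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [2.0, 0.0],
    "queen": [2.0, 1.0],
    "apple": [0.0, -1.0],
}
answers = analogy(toy, "man", "woman", "king")
answers_with_c = analogy(toy, "man", "woman", "king", include=("c",))
```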
Bucket to Explicit¶
ffp-bucket-to-explicit
converts bucket subword embeddings to explicit subword embeddings:
$ ffp-bucket-to-explicit --help
usage: ffp-bucket-to-explicit [-h] [-f FORMAT] INPUT OUTPUT
Convert bucket embeddings to explicit lookups.
positional arguments:
INPUT Input embeddings
OUTPUT Output path
optional arguments:
-h, --help show this help message and exit
-f INPUT_FORMAT, --from FORMAT
Valid choices: ['finalfusion', 'fasttext'] Default:
'finalfusion'
-l, --lossy Whether to fail on malformed UTF-8. Setting this flag
replaces malformed UTF-8 with the replacement character.
Not applicable to finalfusion format.
--mmap Whether to mmap the storage. Only applicable to
finalfusion files.
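The idea behind this conversion can be sketched as follows: a bucket vocabulary hashes each n-gram to a bucket index, whereas an explicit vocabulary stores the n-grams themselves together with a row index. The sketch makes several assumptions that are not taken from the package: words are wrapped in `<` and `>` before n-grams are extracted, `zlib.crc32` stands in for the real hash function, and n-grams that collide in one bucket share a row.

```python
import zlib


def ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in '<' and '>' markers."""
    bracketed = f"<{word}>"
    return [bracketed[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(bracketed) - n + 1)]


def bucket_index(ngram, n_buckets):
    """Stand-in hash; real embedding formats use their own hash functions."""
    return zlib.crc32(ngram.encode("utf-8")) % n_buckets


def to_explicit(words, n_buckets, min_n=3, max_n=6):
    """Map each n-gram seen in `words` to a row of a compact subword matrix.

    Only buckets that are actually used receive a row, and n-grams that
    fall into the same bucket share that row.
    """
    bucket_to_row = {}
    ngram_to_row = {}
    for word in words:
        for ngram in ngrams(word, min_n, max_n):
            bucket = bucket_index(ngram, n_buckets)
            row = bucket_to_row.setdefault(bucket, len(bucket_to_row))
            ngram_to_row[ngram] = row
    return ngram_to_row, bucket_to_row


ngram_to_row, bucket_to_row = to_explicit(["cat"], n_buckets=4, min_n=3, max_n=3)
```

The explicit matrix then needs one row per used bucket (`len(bucket_to_row)`) rather than one per possible bucket.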
Embedding Selection¶
It’s also possible to generate an embedding file from an input vocabulary. For embeddings with a subword
vocabulary, ffp-select
adds computed representations for unknown words. The script thereby converts subword embeddings
into embeddings with a simple lookup. The resulting embeddings have
an array storage.
$ ffp-select --help
usage: ffp-select [-h] [-f FORMAT] INPUT OUTPUT [WORDS]
Build embeddings from list of words.
positional arguments:
INPUT Input embeddings
OUTPUT Output path
WORDS List of words to include in the embeddings. One word
per line. Spaces permitted.Reads from stdin if
unspecified.
optional arguments:
-h, --help show this help message and exit
-f FORMAT, --format FORMAT
Valid choices: ['word2vec', 'finalfusion', 'fasttext',
'text', 'textdims'] Default: 'finalfusion'
--ignore_unk, -i Skip unrepresentable words.
--verbose, -v Print which tokens are skipped because they can't be
represented to stderr.
-l, --lossy Whether to fail on malformed UTF-8. Setting this flag
replaces malformed UTF-8 with the replacement character.
Not applicable to finalfusion format.
--mmap Whether to mmap the storage. Only applicable to
finalfusion files.
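The selection logic can be sketched in a few lines. This is an illustration under assumptions, not the script's implementation: `select`, `from_trigrams` and the toy vectors are hypothetical, unknown words are represented by averaging known trigram vectors, and an unrepresentable word is assumed to raise an error unless `ignore_unk` is set.

```python
def select(known, words, subword_embedding=None, ignore_unk=False):
    """Build a word -> vector table restricted to `words`.

    known: dict holding the in-vocabulary embeddings
    subword_embedding: optional callable returning a vector for an
        unknown word, or None if the word cannot be represented
    """
    selected = {}
    for word in words:
        if word in known:
            selected[word] = known[word]
            continue
        vector = subword_embedding(word) if subword_embedding else None
        if vector is not None:
            selected[word] = vector
        elif not ignore_unk:
            raise KeyError(f"cannot represent: {word}")
    return selected


# Toy trigram vectors, for illustration only.
trigram_vectors = {"<do": [1.0, 0.0], "dog": [0.0, 1.0], "og>": [1.0, 1.0]}


def from_trigrams(word):
    """Average the known trigram vectors of `word`, or return None."""
    bracketed = f"<{word}>"
    grams = [bracketed[i:i + 3] for i in range(len(bracketed) - 2)]
    vectors = [trigram_vectors[g] for g in grams if g in trigram_vectors]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


table = select({"cat": [0.5, 0.5]}, ["cat", "dog", "xyz"],
               subword_embedding=from_trigrams, ignore_unk=True)
```

In the result, `cat` is copied from the known embeddings, `dog` is computed from its trigrams, and `xyz` is skipped because none of its trigrams are known.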