FastText

fastText I/O compatibility module.

finalfusion.compat.fasttext.load_fasttext(file: Union[str, bytes, int, os.PathLike]) → finalfusion.embeddings.Embeddings

Read embeddings from a file in fastText format.

The returned embeddings have a FastTextVocab, NdArray storage and a Norms chunk.

Loading embeddings with this method precomputes an embedding for each word by averaging all of its subword embeddings together with the distinct word vector. All precomputed vectors are then l2-normalized, and the corresponding norms are stored in the Norms chunk. The subword embeddings themselves are not l2-normalized.

Parameters: file (str, bytes, int, PathLike) – Path to a file with embeddings in fastText format.
Returns: embeddings – The embeddings from the input file.
Return type: Embeddings
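
The precomputation described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper name `precompute_word_vector` and the toy vectors are hypothetical, and it assumes a plain arithmetic mean over the word vector and its subword vectors.

```python
import math


def precompute_word_vector(word_vec, subword_vecs):
    """Average a word vector with its subword vectors, then l2-normalize.

    Hypothetical helper for illustration only. Returns the unit-length
    vector together with the norm that was divided out.
    """
    vectors = [word_vec] + list(subword_vecs)
    dims = len(word_vec)
    # Element-wise mean over the word vector and all subword vectors.
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        # Avoid dividing by zero for an all-zero mean.
        return mean, 0.0
    return [x / norm for x in mean], norm


# Toy data: one distinct word vector and two subword vectors.
word_vec = [1.0, 0.0]
subword_vecs = [[8.0, 0.0], [0.0, 12.0]]
unit, norm = precompute_word_vector(word_vec, subword_vecs)
```

Here the mean is `[3.0, 4.0]`, so the stored norm is 5 and the normalized vector is close to `[0.6, 0.8]`. The subword vectors are left untouched, matching the note that they are not l2-normalized.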

finalfusion.compat.fasttext.write_fasttext(file: Union[str, bytes, int, os.PathLike], embeds: finalfusion.embeddings.Embeddings)

Write embeddings in fastText format.

The fastText format requires Metadata containing all keys expected in a fastText config:

dims: int (inferred from model)
window_size: int (default -1)
min_count: int (default -1)
ns: int (default -1)
word_ngrams: int (default 1)
loss: one of ['HierarchicalSoftmax', 'NegativeSampling', 'Softmax'] (default Softmax)
model: one of ['CBOW', 'SkipGram', 'Supervised'] (default SkipGram)
buckets: int (inferred from model)
min_n: int (inferred from model)
max_n: int (inferred from model)
lr_update_rate: int (default -1)
sampling_threshold: float (default -1)

dims, buckets, min_n and max_n are inferred from the model. Any other numerical field that is unspecified takes the default listed above (-1 in every case except word_ngrams, which defaults to 1). loss defaults to Softmax and model to SkipGram. Unknown values for loss and model are overwritten with these defaults, since the model would otherwise be incompatible with fastText.
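
The defaulting rules above can be sketched as follows. This is an illustration, not the library's code: `fill_fasttext_config` is a hypothetical helper, the metadata is modelled as a plain dict, and the sketch assumes the inferred fields always take precedence over anything in the metadata.

```python
VALID_LOSSES = ("HierarchicalSoftmax", "NegativeSampling", "Softmax")
VALID_MODELS = ("CBOW", "SkipGram", "Supervised")

# Defaults taken from the key list above.
DEFAULTS = {
    "window_size": -1,
    "min_count": -1,
    "ns": -1,
    "word_ngrams": 1,
    "loss": "Softmax",
    "model": "SkipGram",
    "lr_update_rate": -1,
    "sampling_threshold": -1.0,
}


def fill_fasttext_config(metadata, dims, buckets, min_n, max_n):
    """Build a complete fastText-style config dict (hypothetical helper)."""
    config = dict(DEFAULTS)
    # Keep user-supplied values for the known, non-inferred keys.
    for key in DEFAULTS:
        if key in metadata:
            config[key] = metadata[key]
    # Unknown loss or model names fall back to the defaults.
    if config["loss"] not in VALID_LOSSES:
        config["loss"] = DEFAULTS["loss"]
    if config["model"] not in VALID_MODELS:
        config["model"] = DEFAULTS["model"]
    # Assumption: values inferred from the model always win.
    config.update(dims=dims, buckets=buckets, min_n=min_n, max_n=max_n)
    return config


config = fill_fasttext_config(
    {"loss": "NotALoss", "min_count": 5},
    dims=300, buckets=2_000_000, min_n=3, max_n=6,
)
```

In this example `min_count` keeps the supplied value 5, the unrecognized loss is replaced by `Softmax`, and every other field is filled from the defaults or from the inferred arguments.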

Some information from the original fastText model is lost, e.g.:

word frequencies
n_tokens

Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm. Additionally, the original state of the embedding matrix is restored, that is, the precomputation and l2-normalization of word embeddings are undone.
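
A minimal sketch of how such a restoration could work, assuming the precomputed vector is the plain mean of the word vector and its subword vectors, scaled to unit length. The helper name `restore_word_vector` and the toy values are hypothetical; the library's actual procedure may differ in detail.

```python
def restore_word_vector(unit_vec, norm, subword_vecs):
    """Recover the distinct word vector from a precomputed, normalized one.

    Hypothetical helper for illustration only.
    """
    # Step 1: undo l2-normalization by scaling with the stored norm.
    mean = [x * norm for x in unit_vec]
    # Step 2: undo the averaging. The mean covered the word vector plus
    # all subword vectors, so multiply by that count and subtract the
    # subword contributions.
    count = len(subword_vecs) + 1
    return [
        mean[i] * count - sum(v[i] for v in subword_vecs)
        for i in range(len(mean))
    ]


# Toy data: a unit vector with stored norm 5 and two subword vectors.
restored = restore_word_vector([0.6, 0.8], 5.0, [[8.0, 0.0], [0.0, 12.0]])
```

With these inputs the result is close to `[1.0, 0.0]`, up to floating-point rounding: scaling gives a mean near `[3.0, 4.0]`, and multiplying by 3 and subtracting the subword sums `[8.0, 12.0]` yields the distinct word vector.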

Only embeddings with a FastTextVocab or SimpleVocab can be serialized to this format.

Parameters

file (str, bytes, int, PathLike) – Output file
embeds (Embeddings) – Embeddings to write