FastText
fastText IO compatibility module.
finalfusion.compat.fasttext.load_fasttext(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings
Read embeddings from a file in fastText format.
The returned embeddings have a FastTextVocab, NdArray storage and a Norms chunk.
Loading embeddings with this method will precompute embeddings for each word by averaging all of its subword embeddings together with the distinct word vector. Additionally, all precomputed vectors are l2-normalized and the corresponding norms are stored in the Norms. The subword embeddings are not l2-normalized.
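The precomputation described above can be illustrated with a small pure-Python sketch. The helper name `precompute_word_embedding` and the list-based vectors are assumptions made for illustration; this is not the library's implementation, which operates on its own storage types.

```python
import math


def precompute_word_embedding(word_vec, subword_vecs):
    """Average a word vector with its subword vectors, then l2-normalize.

    Hypothetical helper. Returns (unit_vector, norm), where norm is the
    l2 norm of the averaged vector before normalization.
    """
    vectors = [word_vec] + list(subword_vecs)
    dims = len(word_vec)
    # Component-wise mean over the word vector and all subword vectors.
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        # A zero vector cannot be normalized; return it unchanged.
        return mean, 0.0
    return [x / norm for x in mean], norm


# One word vector and two subword vectors; the subword vectors stay untouched.
unit, norm = precompute_word_embedding([3.0, 0.0], [[0.0, 4.0], [6.0, 8.0]])
```

Here the mean is `[3.0, 4.0]`, so the stored norm is 5 and the precomputed embedding has unit length.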
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in fastText format.
lossy (bool) – If True, malformed UTF-8 sequences in words are replaced with U+FFFD, the REPLACEMENT CHARACTER.
- Returns
embeddings – The embeddings from the input file.
- Return type
Embeddings
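The effect of the lossy parameter can be illustrated with Python's built-in decoder. This is a sketch of the replacement behaviour only; `decode_word` is a hypothetical helper, and the strict branch shows Python's own default rather than a statement about how load_fasttext reports malformed input.

```python
def decode_word(raw, lossy=False):
    """Decode a word from bytes, optionally replacing malformed UTF-8."""
    errors = "replace" if lossy else "strict"
    return raw.decode("utf-8", errors=errors)


raw = b"caf\xff"  # the byte 0xFF never occurs in valid UTF-8

# Lossy decoding substitutes U+FFFD for the malformed byte.
lossy_word = decode_word(raw, lossy=True)

# Strict decoding of the same bytes raises in Python.
try:
    decode_word(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```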
finalfusion.compat.fasttext.write_fasttext(file: Union[str, bytes, int, os.PathLike], embeds: finalfusion.embeddings.Embeddings)
Write embeddings in fastText format.
Only embeddings with fastText vocabulary can be written to fastText format.
fastText models require values for all config keys. Some of these can be inferred from finalfusion models; the others are assigned default values:
dims: inferred from model
window_size: 0
min_count: 0
ns: 0
word_ngrams: 1
loss: HierarchicalSoftmax
model: CBOW
buckets: inferred from model
min_n: inferred from model
max_n: inferred from model
lr_update_rate: 0
sampling_threshold: 0
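As a compact summary, the config values listed above can be written as a plain dictionary. The function `fasttext_config` and the dictionary layout are hypothetical and used only for illustration; they do not reflect how the library represents the config internally.

```python
def fasttext_config(dims, buckets, min_n, max_n):
    """Combine values inferred from the model with the documented defaults."""
    return {
        "dims": dims,                   # inferred from model
        "window_size": 0,
        "min_count": 0,
        "ns": 0,
        "word_ngrams": 1,
        "loss": "HierarchicalSoftmax",
        "model": "CBOW",
        "buckets": buckets,             # inferred from model
        "min_n": min_n,                 # inferred from model
        "max_n": max_n,                 # inferred from model
        "lr_update_rate": 0,
        "sampling_threshold": 0,
    }


# Example values chosen arbitrarily for the inferred fields.
config = fasttext_config(dims=300, buckets=2_000_000, min_n=3, max_n=6)
```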
- Some information from the original fastText model is lost, e.g.:
word frequencies
n_tokens
Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by the associated norm. Additionally, the original state of the embedding matrix is restored: the precomputation and l2-normalization of word embeddings are undone.
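A minimal sketch of this restoration, assuming the precomputed embedding is the plain mean of the word vector and its subword vectors as described for load_fasttext. The helper `restore_word_vector` is hypothetical and not the library's code.

```python
def restore_word_vector(unit_vec, norm, subword_vecs):
    """Undo l2-normalization and averaging for a single word.

    Hypothetical helper: scales the unit vector by its stored norm to
    recover the mean, then solves mean = (word + sum(subwords)) / n
    for the original word vector.
    """
    n = len(subword_vecs) + 1
    mean = [x * norm for x in unit_vec]
    return [
        n * mean[i] - sum(v[i] for v in subword_vecs)
        for i in range(len(mean))
    ]


# A unit vector with stored norm 5 and the two subword vectors it was built from.
restored = restore_word_vector([0.6, 0.8], 5.0, [[0.0, 4.0], [6.0, 8.0]])
```

Up to floating-point rounding, `restored` is the original word vector `[3.0, 0.0]`.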
- Parameters
file (str, bytes, int, PathLike) – Output file
embeds (Embeddings) – Embeddings to write
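The two functions can be combined into a simple round trip, sketched below. The wrapper `roundtrip_fasttext` and its path arguments are hypothetical; the import is done inside the function so that defining the helper does not require finalfusion to be installed.

```python
def roundtrip_fasttext(src_path, dst_path, lossy=False):
    """Load fastText embeddings from src_path and write them to dst_path."""
    # Imported lazily; calling this function requires the finalfusion package.
    from finalfusion.compat.fasttext import load_fasttext, write_fasttext

    # Embeddings loaded this way carry a fastText vocabulary,
    # so they can be written back in fastText format.
    embeds = load_fasttext(src_path, lossy=lossy)
    write_fasttext(dst_path, embeds)
    return embeds
```

Because word frequencies and n_tokens are not preserved, the written file is not expected to be byte-identical to the original model.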