FastText

fastText I/O compatibility module.

finalfusion.compat.fasttext.load_fasttext(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings

Read embeddings from a file in fastText format.

The returned embeddings have a FastTextVocab, NdArray storage and a Norms chunk.

Loading embeddings with this method precomputes an embedding for each word by averaging all of its subword embeddings together with the distinct word vector. All precomputed vectors are then l2-normalized, and the corresponding norms are stored in the Norms chunk. The subword embeddings themselves are not l2-normalized.
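The precomputation described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the library's implementation: the function name `precompute` and the toy vectors are hypothetical, and the real code operates on the storage matrix rather than on Python lists.

```python
import math

def precompute(word_vec, subword_vecs):
    """Average a word vector with its subword vectors, then l2-normalize.

    Returns the unit-length vector and the norm it was divided by.
    (Hypothetical helper, for illustration only.)
    """
    vectors = [word_vec] + list(subword_vecs)
    dims = len(word_vec)
    # Component-wise mean over the word vector and all subword vectors.
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        # A zero vector cannot be normalized; return it unchanged.
        return mean, norm
    return [x / norm for x in mean], norm

# Toy data: one word vector and two subword vectors, so the mean is [2.0, 1.0].
word = [1.0, 2.0]
subwords = [[3.0, 0.0], [2.0, 1.0]]
unit, norm = precompute(word, subwords)
```

In this sketch `unit` plays the role of the row kept in the storage and `norm` the value kept in the Norms chunk.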

Parameters
  • file (str, bytes, int, PathLike) – Path to a file with embeddings in fastText format.

  • lossy (bool) – If True, malformed UTF-8 sequences in words are replaced with U+FFFD REPLACEMENT CHARACTER.

Returns

embeddings – The embeddings from the input file.

Return type

Embeddings
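The effect of the lossy parameter can be illustrated with Python's built-in UTF-8 decoder, which offers the same kind of U+FFFD substitution. This is an analogy only: how the reader decodes words internally, and that a non-lossy read rejects malformed input, are assumptions here. The byte string is made up for the example.

```python
# A word list containing one byte (0xFF) that can never occur in valid UTF-8.
raw = b"caf\xc3\xa9 \xff word"

# Strict decoding (assumed analogue of lossy=False) rejects the input.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding (analogue of lossy=True) substitutes U+FFFD for the bad byte.
lossy = raw.decode("utf-8", errors="replace")
```

Lossy reading keeps otherwise usable vocabularies loadable, at the cost of words that no longer match their original byte sequences.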

finalfusion.compat.fasttext.write_fasttext(file: Union[str, bytes, int, os.PathLike], embeds: finalfusion.embeddings.Embeddings)

Write embeddings in fastText format.

Only embeddings with fastText vocabulary can be written to fastText format.

fastText models require values for all config keys. Some of these can be inferred from the finalfusion model; the others are assigned default values:

  • dims: inferred from model

  • window_size: 0

  • min_count: 0

  • ns: 0

  • word_ngrams: 1

  • loss: HierarchicalSoftmax

  • model: CBOW

  • buckets: inferred from model

  • min_n: inferred from model

  • max_n: inferred from model

  • lr_update_rate: 0

  • sampling_threshold: 0
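The list above can be read as a mapping in which four values come from the model and the rest are fixed. The sketch below restates it as a dictionary; the function name `fasttext_config` and the example values are hypothetical, and the actual writer serializes these fields in fastText's binary layout rather than as a dict.

```python
def fasttext_config(dims, buckets, min_n, max_n):
    """Restate the written config as a dict (illustrative only)."""
    return {
        "dims": dims,                   # inferred from model
        "window_size": 0,
        "min_count": 0,
        "ns": 0,
        "word_ngrams": 1,
        "loss": "HierarchicalSoftmax",
        "model": "CBOW",
        "buckets": buckets,             # inferred from model
        "min_n": min_n,                 # inferred from model
        "max_n": max_n,                 # inferred from model
        "lr_update_rate": 0,
        "sampling_threshold": 0,
    }

# Example values chosen only for illustration.
config = fasttext_config(dims=300, buckets=2_000_000, min_n=3, max_n=6)
```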

Some information from the original fastText model is lost, e.g.:
  • word frequencies

  • n_tokens

Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm. Additionally, the original state of the embedding matrix is restored, that is, the precomputation and l2-normalization of word embeddings are undone.
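One way to picture this restoration is to invert an averaging step algebraically. The sketch assumes the stored vector is the l2-normalized mean of the word vector and its n subword vectors, so the word vector is (n + 1) times the un-normalized mean minus the sum of the subword vectors. The helper `restore_word_vector` is hypothetical, and the library may organize the computation differently.

```python
import math

def restore_word_vector(unit_vec, norm, subword_vecs):
    """Undo normalization and averaging for one word (illustrative only)."""
    count = len(subword_vecs) + 1
    # Step 1: scale by the stored norm to recover the un-normalized mean.
    mean = [x * norm for x in unit_vec]
    # Step 2: mean * count is the sum of all vectors; subtract the subwords.
    return [
        mean[i] * count - sum(sw[i] for sw in subword_vecs)
        for i in range(len(mean))
    ]

# Toy state: the mean of [1, 2], [3, 0] and [2, 1] is [2, 1], with norm sqrt(5).
norm = math.sqrt(5.0)
unit = [2.0 / norm, 1.0 / norm]
restored = restore_word_vector(unit, norm, [[3.0, 0.0], [2.0, 1.0]])
```

Up to floating-point rounding, `restored` is the original word vector `[1.0, 2.0]`.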

Parameters
  • file (str, bytes, int, PathLike) – Output file

  • embeds (Embeddings) – Embeddings to write
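A minimal round-trip sketch using the two functions documented on this page. The wrapper `roundtrip` and the file names in the comment are hypothetical; only the signatures of load_fasttext and write_fasttext are taken from the entries above.

```python
def roundtrip(src, dst):
    """Load fastText embeddings from `src` and write them back to `dst`."""
    # Imported lazily so the sketch can be defined without finalfusion installed.
    from finalfusion.compat.fasttext import load_fasttext, write_fasttext

    # lossy=True replaces malformed UTF-8 in words instead of failing.
    embeds = load_fasttext(src, lossy=True)
    write_fasttext(dst, embeds)
    return embeds

# Hypothetical paths; substitute a real fastText model file:
# embeds = roundtrip("model.bin", "model_copy.bin")
```

Because word frequencies and n_tokens are not preserved, the written file is not expected to be byte-identical to the input.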