Text

Text-based embedding formats.

finalfusion.compat.text.load_text(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings

Read embeddings in text format.

The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is l2-normalized by default and the corresponding norms are stored in the Norms chunk.

Expects a file with utf-8 encoded lines with:

  • word at the start of the line

  • followed by whitespace

  • followed by whitespace separated vector components

Parameters
  • file (str, bytes, int, PathLike) – Path to a file with embeddings in text format.

  • lossy (bool) – If set to true, malformed UTF-8 sequences in words will be replaced with U+FFFD REPLACEMENT CHARACTER.

Returns

embeddings – Embeddings from the input file. The resulting Embeddings will have a SimpleVocab, NdArray and Norms.

Return type

Embeddings
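
The line format described for load_text can be illustrated with a small standard-library parser. This is only a sketch and not the library's implementation: parse_text_embeddings is a hypothetical helper, it returns plain Python containers instead of an Embeddings object, and it assumes that lossy decoding corresponds to Python's "replace" error handler.

```python
import math
from typing import Dict, List, Tuple


def parse_text_embeddings(
    data: bytes, lossy: bool = False
) -> Tuple[Dict[str, List[float]], List[float]]:
    """Hypothetical helper: parse text-format embeddings from raw bytes."""
    # Assumption: lossy decoding maps to Python's "replace" error handler,
    # which substitutes U+FFFD for malformed UTF-8 sequences.
    errors = "replace" if lossy else "strict"
    vectors: Dict[str, List[float]] = {}
    norms: List[float] = []
    for raw_line in data.split(b"\n"):
        if not raw_line.strip():
            continue  # skip empty lines
        # The word comes first, followed by whitespace-separated components.
        parts = raw_line.decode("utf-8", errors=errors).split()
        word = parts[0]
        components = [float(x) for x in parts[1:]]
        norm = math.sqrt(sum(c * c for c in components))
        norms.append(norm)
        # Store the l2-normalized vector; leave all-zero vectors untouched.
        vectors[word] = [c / norm for c in components] if norm > 0 else components
    return vectors, norms


vectors, norms = parse_text_embeddings(b"apple 3.0 4.0\nberry 0.0 2.0\n")
print(norms)
print(vectors["apple"])
```

With lossy=False, the sketch raises UnicodeDecodeError on malformed UTF-8; with lossy=True, the affected word contains U+FFFD instead.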

finalfusion.compat.text.load_text_dims(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings

Read embeddings in text-dims format.

The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is l2-normalized by default and the corresponding norms are stored in the Norms chunk.

The first line contains the whitespace-separated number of rows and columns. Each remaining line contains a word followed by its whitespace-separated vector components.

Parameters
  • file (str, bytes, int, PathLike) – Path to a file with embeddings in text format with dimensions on the first line.

  • lossy (bool) – If set to true, malformed UTF-8 sequences in words will be replaced with U+FFFD REPLACEMENT CHARACTER.

Returns

embeddings – The embeddings from the input file.

Return type

Embeddings
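
The text-dims layout adds a shape header to the plain text format. The following sketch shows how such a header could be read and checked against the rest of the file. parse_text_dims is a hypothetical helper written for illustration; the consistency checks are an assumption and may differ from what the library does.

```python
from typing import List, Tuple


def parse_text_dims(text: str) -> Tuple[List[str], List[List[float]]]:
    """Hypothetical helper: parse text-dims content that is already decoded."""
    lines = [line for line in text.split("\n") if line.strip()]
    # First line: whitespace-separated number of rows and columns.
    rows, cols = (int(x) for x in lines[0].split())
    words: List[str] = []
    matrix: List[List[float]] = []
    for line in lines[1:]:
        parts = line.split()
        # Assumption: every row must carry exactly `cols` components.
        if len(parts) != cols + 1:
            raise ValueError(f"expected {cols} components, got {len(parts) - 1}")
        words.append(parts[0])
        matrix.append([float(x) for x in parts[1:]])
    # Assumption: the header row count must match the number of words read.
    if len(words) != rows:
        raise ValueError(f"expected {rows} rows, got {len(words)}")
    return words, matrix


words, matrix = parse_text_dims("2 3\na 1 2 3\nb 4 5 6\n")
print(words)
print(matrix)
```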

finalfusion.compat.text.write_text(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings, sep=' ')

Write embeddings in text format.

Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm.

The output consists of utf-8 encoded lines with:
  • word at the start of the line

  • followed by whitespace

  • followed by whitespace separated vector components

Parameters
  • file (str, bytes, int, PathLike) – Output file

  • embeddings (Embeddings) – Embeddings to write

  • sep (str) – Separator of word and embeddings.
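
The un-normalization step performed before writing can be sketched with plain lists. write_text_sketch is a hypothetical helper, not the library function. It assumes that sep is placed between the word and the vector, and that the components themselves are joined with single spaces.

```python
import io
from typing import List, Optional, Sequence, TextIO


def write_text_sketch(
    out: TextIO,
    words: Sequence[str],
    unit_vectors: Sequence[Sequence[float]],
    norms: Optional[Sequence[float]] = None,
    sep: str = " ",
) -> None:
    """Hypothetical helper: write one "word components..." line per word."""
    for idx, (word, vector) in enumerate(zip(words, unit_vectors)):
        row: List[float] = list(vector)
        if norms is not None:
            # Undo the l2 normalization by scaling with the stored norm.
            row = [c * norms[idx] for c in row]
        # Assumption: sep separates the word from the space-joined components.
        print(word, " ".join(str(c) for c in row), sep=sep, file=out)


buffer = io.StringIO()
write_text_sketch(buffer, ["a", "b"], [[1.0, 0.0], [0.0, 1.0]], norms=[2.0, 4.0])
print(buffer.getvalue(), end="")
```

If norms is None, the vectors are written unchanged.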

finalfusion.compat.text.write_text_dims(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings, sep=' ')

Write embeddings in text-dims format.

Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm.

The output consists of utf-8 encoded lines with:
  • rows cols on the first line

  • word at the start of each following line

  • followed by whitespace

  • followed by whitespace separated vector components

Parameters
  • file (str, bytes, int, PathLike) – Output file

  • embeddings (Embeddings) – Embeddings to write

  • sep (str) – Separator of word and embeddings.
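
The text-dims output differs from the plain text output only in its header line. A minimal sketch of that layout, using a hypothetical write_text_dims_sketch helper and plain lists in place of an Embeddings object, could look as follows. Scaling by norms is left out here to keep the focus on the header.

```python
import io
from typing import Sequence, TextIO


def write_text_dims_sketch(
    out: TextIO,
    words: Sequence[str],
    matrix: Sequence[Sequence[float]],
    sep: str = " ",
) -> None:
    """Hypothetical helper: write a "rows cols" header, then one line per word."""
    rows = len(words)
    cols = len(matrix[0]) if matrix else 0
    # Header line with the shape of the embedding matrix.
    print(rows, cols, file=out)
    for word, vector in zip(words, matrix):
        # Assumption: sep separates the word from the space-joined components.
        print(word, " ".join(str(c) for c in vector), sep=sep, file=out)


buffer = io.StringIO()
write_text_dims_sketch(buffer, ["a", "b"], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(buffer.getvalue(), end="")
```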