Text

Text-based embedding formats.

finalfusion.compat.text.load_text_dims(file: Union[str, bytes, int, os.PathLike]) → finalfusion.embeddings.Embeddings

Read embeddings in text-dims format.

The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is l2-normalized by default, and the corresponding norms are stored in the Norms chunk.

The first line contains the whitespace-separated number of rows and columns. Each remaining line contains a word followed by its whitespace-separated vector components.

Parameters

file (str, bytes, int, PathLike) – Path to a file with embeddings in text-dims format.

Returns

embeddings – The embeddings from the input file.

Return type

Embeddings
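The layout described above can be illustrated with a minimal, standard-library sketch. It is not the finalfusion implementation: the helper name parse_text_dims and the validation it performs are assumptions made for illustration, and it returns plain lists rather than an Embeddings object.

```python
import io


def parse_text_dims(stream):
    """Hypothetical helper: parse a 'rows cols' header, then one word and vector per line."""
    header = stream.readline().split()
    rows, cols = int(header[0]), int(header[1])
    words, vectors = [], []
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, components = parts[0], [float(x) for x in parts[1:]]
        if len(components) != cols:
            raise ValueError(f"expected {cols} components for {word!r}")
        words.append(word)
        vectors.append(components)
    if len(words) != rows:
        raise ValueError(f"header announced {rows} rows, found {len(words)}")
    return words, vectors


# In-memory sample: 2 rows, 3 columns.
sample = io.StringIO("2 3\nhello 1.0 0.0 0.0\nworld 0.0 3.0 4.0\n")
words, vectors = parse_text_dims(sample)
```

In the sketch, the header is used only to check that the body has the announced shape. Whether the library performs the same checks is not stated here.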

finalfusion.compat.text.load_text(file: Union[str, bytes, int, os.PathLike]) → finalfusion.embeddings.Embeddings

Read embeddings in text format.

The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is l2-normalized by default, and the corresponding norms are stored in the Norms chunk.

Expects a file with utf-8 encoded lines with:

  • word at the start of the line

  • followed by whitespace

  • followed by whitespace separated vector components

Parameters

file (str, bytes, int, PathLike) – Path to a file with embeddings in text format.

Returns

embeddings – Embeddings from the input file. The resulting Embeddings will have a SimpleVocab, NdArray and Norms.

Return type

Embeddings
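As a rough illustration of what reading one line and l2-normalizing it involves, the sketch below uses only the standard library. The helper names parse_text_line and l2_normalize are hypothetical, and the handling of zero vectors is an assumption, not documented library behaviour.

```python
import math


def parse_text_line(line):
    """Hypothetical helper: split a line into a word and its float components."""
    word, *components = line.split()
    return word, [float(x) for x in components]


def l2_normalize(vector):
    """Return (unit_vector, norm). Assumption: zero vectors are left unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector), 0.0
    return [x / norm for x in vector], norm


word, raw = parse_text_line("world 0.0 3.0 4.0")
unit, norm = l2_normalize(raw)  # the norm would be kept alongside the unit vector
```

Keeping the norm next to the unit vector is what makes it possible to recover the original vector later, which is the role the Norms chunk plays in the returned embeddings.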

finalfusion.compat.text.write_text(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings, sep=' ')

Write embeddings in text format.

Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm.

The output consists of utf-8 encoded lines with:
  • word at the start of the line

  • followed by whitespace

  • followed by whitespace separated vector components

Parameters
  • file (str, bytes, int, PathLike) – Output file

  • embeddings (Embeddings) – Embeddings to write

  • sep (str) – Separator between the word and its embedding.
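The un-normalization step and the role of sep can be sketched for a single line as follows. The helper format_text_line is hypothetical, and the sketch assumes that sep is placed only between the word and the vector, with the components themselves joined by single spaces. That reading of sep is an assumption, not something the description above confirms.

```python
def format_text_line(word, stored_vector, norm=None, sep=" "):
    """Hypothetical helper: format one output line for the text format."""
    if norm is None:
        vector = stored_vector
    else:
        # Undo l2-normalization by scaling each component with the stored norm.
        vector = [x * norm for x in stored_vector]
    # Assumption: `sep` separates the word from the vector only.
    return word + sep + " ".join(str(x) for x in vector)


line = format_text_line("world", [0.0, 0.6, 0.8], norm=5.0, sep="\t")
plain = format_text_line("hi", [1.0, 2.0])
```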

finalfusion.compat.text.write_text_dims(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings, sep=' ')

Write embeddings in text-dims format.

Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm.

The output consists of utf-8 encoded lines with:
  • rows cols on the first line

  • on each following line, a word at the start of the line

  • followed by whitespace

  • followed by whitespace separated vector components

Parameters
  • file (str, bytes, int, PathLike) – Output file

  • embeddings (Embeddings) – Embeddings to write

  • sep (str) – Separator between the word and its embedding.
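A minimal sketch of the text-dims output layout, using only the standard library, might look like the following. The function write_text_dims_sketch is hypothetical and takes plain lists of already un-normalized vectors instead of an Embeddings object. As an assumption, sep is used only between the word and its vector.

```python
import io


def write_text_dims_sketch(stream, words, vectors, sep=" "):
    """Hypothetical helper: write a 'rows cols' header, then one word and vector per line."""
    rows = len(words)
    cols = len(vectors[0]) if vectors else 0
    stream.write(f"{rows} {cols}\n")
    for word, vector in zip(words, vectors):
        stream.write(word + sep + " ".join(str(x) for x in vector) + "\n")


buffer = io.StringIO()
write_text_dims_sketch(buffer, ["hello", "world"], [[1.0, 0.0], [0.0, 5.0]])
lines = buffer.getvalue().splitlines()
```

The only difference from the plain text format in this sketch is the header line, which lets a reader know the matrix shape before parsing the rows.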