Text
Text-based embedding formats.
finalfusion.compat.text.load_text(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings
Read embeddings in text format.
The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is L2-normalized by default and the corresponding norms are stored in the Norms chunk.
Expects a file with utf-8 encoded lines with:
word at the start of the line
followed by whitespace
followed by whitespace separated vector components
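The line layout and the L2 normalization described above can be illustrated with a minimal sketch. It uses only the standard library and is not the finalfusion implementation; the helper names `parse_line` and `l2_normalize` are hypothetical, as is the choice to leave zero vectors untouched.

```python
import math


def parse_line(line):
    """Split one text-format line into a word and its vector components."""
    parts = line.split()  # splits on any run of whitespace
    return parts[0], [float(x) for x in parts[1:]]


def l2_normalize(vector):
    """Return the unit-length vector together with its original L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector), 0.0  # assumption: zero vectors are left as-is
    return [x / norm for x in vector], norm


word, vector = parse_line("apple 3.0 4.0\n")
unit, norm = l2_normalize(vector)
print(word, unit, norm)  # apple [0.6, 0.8] 5.0
```

Keeping the norm next to the unit vector is what makes it possible to restore the original vector later.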
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in text format.
lossy (bool) – If set to True, malformed UTF-8 sequences in words are replaced with the U+FFFD REPLACEMENT CHARACTER.
- Returns
embeddings – Embeddings from the input file. The resulting Embeddings will have a SimpleVocab, NdArray and Norms.
- Return type
Embeddings
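The effect of the lossy parameter can be approximated with Python's built-in UTF-8 error handlers. This is a sketch of the behaviour only, not the library's own code path: `decode_word` is a hypothetical helper, and raising an error in the non-lossy case is an assumption.

```python
def decode_word(raw, lossy=False):
    """Decode the bytes of a word, optionally replacing malformed UTF-8."""
    errors = "replace" if lossy else "strict"
    return raw.decode("utf-8", errors=errors)


raw = b"caf\xff"  # 0xff can never occur in valid UTF-8

print(repr(decode_word(raw, lossy=True)))  # 'caf\ufffd'

try:
    decode_word(raw)  # strict decoding rejects the malformed byte
except UnicodeDecodeError:
    print("malformed UTF-8 rejected")
```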
finalfusion.compat.text.load_text_dims(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings
Read embeddings in text-dims format.
The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is L2-normalized by default and the corresponding norms are stored in the Norms chunk.
The first line contains the whitespace-separated number of rows and columns. Each remaining line contains a word followed by its whitespace-separated vector components.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in text format with dimensions on the first line.
lossy (bool) – If set to True, malformed UTF-8 sequences in words are replaced with the U+FFFD REPLACEMENT CHARACTER.
- Returns
embeddings – The embeddings from the input file.
- Return type
Embeddings
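As a rough illustration of the text-dims layout, the sketch below reads the header line and then checks every row against it. It is a standalone approximation rather than what `load_text_dims` does internally; the function name `read_text_dims_sketch` and the validation rules are assumptions.

```python
import io


def read_text_dims_sketch(stream):
    """Read 'rows cols' from the first line, then one word and vector per line."""
    header = stream.readline().split()
    if len(header) != 2:
        raise ValueError("first line must contain rows and cols")
    rows, cols = int(header[0]), int(header[1])

    words, vectors = [], []
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) - 1 != cols:
            raise ValueError(f"expected {cols} components for {parts[0]!r}")
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])

    if len(words) != rows:
        raise ValueError(f"header announced {rows} rows, found {len(words)}")
    return words, vectors


sample = io.StringIO("2 3\napple 1 2 3\npear 4 5 6\n")
words, vectors = read_text_dims_sketch(sample)
print(words)    # ['apple', 'pear']
print(vectors)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```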
finalfusion.compat.text.write_text(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings, sep=' ')
Write embeddings in text format.
Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm.
The output consists of utf-8 encoded lines with:
word at the start of the line
followed by whitespace
followed by whitespace separated vector components
- Parameters
file (str, bytes, int, PathLike) – Output file
embeddings (Embeddings) – Embeddings to write
sep (str) – Separator between the word and its vector components.
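The un-normalization step and the role of sep might look like the following sketch. It works on plain lists and an in-memory stream, not on an Embeddings object. The name `write_text_sketch` is hypothetical, and joining the components with single spaces is an assumption about the output layout.

```python
import io


def write_text_sketch(stream, words, unit_vectors, norms=None, sep=" "):
    """Write one 'word<sep>components' line per embedding."""
    for idx, (word, vector) in enumerate(zip(words, unit_vectors)):
        if norms is not None:
            # Undo the L2 normalization by scaling with the stored norm.
            vector = [x * norms[idx] for x in vector]
        components = " ".join(str(x) for x in vector)
        print(word, components, sep=sep, file=stream)


out = io.StringIO()
write_text_sketch(out, ["apple", "pear"], [[1.0, 0.0], [0.0, 1.0]], norms=[2.0, 0.5])
print(out.getvalue(), end="")
# apple 2.0 0.0
# pear 0.0 0.5
```

Passing `sep="\t"` in this sketch would put a tab between the word and its components while leaving the components space-separated.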
finalfusion.compat.text.write_text_dims(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings, sep=' ')
Write embeddings in text-dims format.
Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm.
The output consists of utf-8 encoded lines with:
rows cols on the first line
on every following line, a word at the start of the line
followed by whitespace
followed by whitespace separated vector components
- Parameters
file (str, bytes, int, PathLike) – Output file
embeddings (Embeddings) – Embeddings to write
sep (str) – Separator between the word and its vector components.
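The only structural difference from the plain text format is the header line. A minimal sketch of that layout follows, assuming the vectors passed in are already un-normalized; `write_text_dims_sketch` is a hypothetical name and this is not the library's writer.

```python
import io


def write_text_dims_sketch(stream, words, vectors, sep=" "):
    """Write a 'rows cols' header followed by one embedding per line."""
    rows = len(vectors)
    cols = len(vectors[0]) if vectors else 0
    print(rows, cols, file=stream)  # header: matrix shape
    for word, vector in zip(words, vectors):
        print(word, " ".join(str(x) for x in vector), sep=sep, file=stream)


out = io.StringIO()
write_text_dims_sketch(out, ["apple", "pear"], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(out.getvalue(), end="")
# 2 3
# apple 1.0 2.0 3.0
# pear 4.0 5.0 6.0
```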