Word2Vec
Word2vec binary format.
finalfusion.compat.word2vec.load_word2vec(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings
Read embeddings in word2vec binary format.
The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is l2-normalized by default, and the corresponding norms are stored in the Norms chunk.
Files are expected to start with a line containing the number of rows and columns as UTF-8 text. Each word is encoded in UTF-8 and followed by a single whitespace character. After the whitespace, the embedding components are expected as little-endian single-precision floats.
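The layout described above can be illustrated with a small standard-library reader. This is only a sketch of the file format, not the finalfusion implementation: the name read_word2vec_sketch is hypothetical, it returns plain Python lists instead of an Embeddings object, and it assumes that each row may be terminated by a newline that precedes the next word.

```python
import io
import struct


def read_word2vec_sketch(stream, lossy=False):
    """Parse word2vec binary data from a binary stream (hypothetical helper)."""
    # Header line: "<rows> <cols>\n" as UTF-8 text.
    rows, cols = (int(field) for field in stream.readline().decode("utf-8").split())
    # lossy=True replaces malformed UTF-8 with U+FFFD instead of raising.
    errors = "replace" if lossy else "strict"
    words, vectors = [], []
    for _ in range(rows):
        raw = bytearray()
        while True:
            byte = stream.read(1)
            if not byte:
                raise EOFError("unexpected end of data while reading a word")
            if byte == b" ":
                break
            raw += byte
        # Assumption: a row-terminating newline may precede the next word.
        words.append(bytes(raw).lstrip(b"\n").decode("utf-8", errors=errors))
        data = stream.read(4 * cols)
        if len(data) != 4 * cols:
            raise EOFError("unexpected end of data while reading a vector")
        # "<" selects little-endian, "f" is a 4-byte single-precision float.
        vectors.append(struct.unpack(f"<{cols}f", data))
    return words, vectors


# Build a tiny two-word file in memory and parse it again.
payload = (
    b"2 3\n"
    + b"hello " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"world " + struct.pack("<3f", 0.5, -1.0, 0.25) + b"\n"
)
words, vectors = read_word2vec_sketch(io.BytesIO(payload))
```

Reading a fixed 4 * cols bytes per vector matters because the float bytes may themselves contain space or newline bytes, so the vector cannot be delimited by scanning for a separator.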
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in word2vec binary format.
lossy (bool) – If True, malformed UTF-8 sequences in words are replaced with U+FFFD REPLACEMENT CHARACTER.
- Returns
embeddings – The embeddings from the input file.
- Return type
Embeddings
finalfusion.compat.word2vec.write_word2vec(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings)
Write embeddings in word2vec binary format.
If the embeddings are not compatible with the word2vec format (e.g. they include a SubwordVocab), only the known words and their embeddings are serialized; the subword matrix is discarded.
Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm.
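The un-normalization step amounts to multiplying each unit-length row by its stored l2 norm. The following is a minimal pure-Python sketch of that arithmetic; the helper names l2_normalize and unnormalize are hypothetical, and the library itself operates on its own storage and Norms chunk rather than on nested lists.

```python
import math


def l2_normalize(rows):
    """Return (unit_rows, norms); all-zero rows are left unchanged."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    unit_rows = [
        [x / norm for x in row] if norm > 0.0 else list(row)
        for row, norm in zip(rows, norms)
    ]
    return unit_rows, norms


def unnormalize(unit_rows, norms):
    """Scale each unit-length row by its stored norm to recover the original."""
    return [[x * norm for x in row] for row, norm in zip(unit_rows, norms)]


original = [[3.0, 4.0], [0.0, 2.0]]
unit_rows, norms = l2_normalize(original)   # what a loader would keep
restored = unnormalize(unit_rows, norms)    # what a writer would serialize
```

Up to floating-point rounding, restored matches original, which is why a load followed by a write can preserve the vector magnitudes.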
The first line of the output file contains the shape, encoded in UTF-8 as rows columns. The embeddings follow.
Each embedding consists of:
- the UTF-8 encoded word
- a single space ' ' following the word
- cols single-precision floating point numbers
- a newline '\n' at the end of each line
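The output layout listed above can be sketched with the standard library alone. This is an illustration of the byte layout rather than the library's own writer: write_word2vec_sketch is a hypothetical name, it takes plain lists instead of an Embeddings object, and little-endian byte order is assumed so that the result matches what the reader expects.

```python
import io
import struct


def write_word2vec_sketch(stream, words, vectors):
    """Serialize words and vectors to a binary stream (hypothetical helper)."""
    rows = len(words)
    cols = len(vectors[0]) if vectors else 0
    # First line: the shape as UTF-8 text, "rows cols".
    stream.write(f"{rows} {cols}\n".encode("utf-8"))
    for word, vector in zip(words, vectors):
        stream.write(word.encode("utf-8"))               # UTF-8 encoded word
        stream.write(b" ")                               # single space
        stream.write(struct.pack(f"<{cols}f", *vector))  # cols float32 values
        stream.write(b"\n")                              # newline ends the row


buffer = io.BytesIO()
write_word2vec_sketch(buffer, ["a", "b"], [[1.0, 2.0], [0.5, -1.0]])
serialized = buffer.getvalue()
```

Each row therefore occupies len(word_bytes) + 1 + 4 * cols + 1 bytes, which is a quick way to sanity-check the size of a written file.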
- Parameters
file (str, bytes, int, PathLike) – Output file
embeddings (Embeddings) – The embeddings to serialize.