Word2Vec

Word2vec binary format.

finalfusion.compat.word2vec.load_word2vec(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings

Read embeddings in word2vec binary format.

The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is l2-normalized by default, and the original norm of each embedding is stored in the Norms chunk.
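
To illustrate what l2-normalization with stored norms means, here is a minimal sketch in plain Python. The helper name `l2_normalize` is hypothetical and is not part of finalfusion; the handling of all-zero rows is an assumption, since the text above does not say how the library treats them.

```python
import math


def l2_normalize(rows):
    """Return (unit_rows, norms) for a list of float vectors.

    Hypothetical helper for illustration only, not finalfusion API.
    """
    unit_rows, norms = [], []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        norms.append(norm)
        # Assumption: leave all-zero rows untouched to avoid dividing by zero.
        unit_rows.append([x / norm for x in row] if norm > 0 else list(row))
    return unit_rows, norms


unit_rows, norms = l2_normalize([[3.0, 4.0], [0.0, 2.0]])
```

Each row of `unit_rows` has length 1, and `norms` holds the factor needed to recover the original vector.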

Files are expected to start with a line containing rows and cols in utf-8. Words are encoded in utf-8 followed by a single whitespace. After the whitespace, the embedding components are expected as little-endian single-precision floats.
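
The layout described above can be sketched with the standard library alone. This is not the library's implementation: `read_word2vec_sketch` is a hypothetical name, it returns plain lists rather than an Embeddings object, and it assumes that each record may be preceded by the newline that ends the previous one.

```python
import io
import struct


def read_word2vec_sketch(stream, lossy=False):
    """Parse word2vec binary data from a binary stream (illustrative only)."""
    # First line: "rows cols" in utf-8.
    rows, cols = (int(field) for field in stream.readline().decode("utf-8").split())
    errors = "replace" if lossy else "strict"
    words, vectors = [], []
    for _ in range(rows):
        word_bytes = bytearray()
        while True:
            char = stream.read(1)
            if not char:
                raise EOFError("unexpected end of data while reading a word")
            if char == b" ":
                break
            word_bytes += char
        # Assumption: drop a newline left over from the previous record.
        words.append(bytes(word_bytes).lstrip(b"\n").decode("utf-8", errors=errors))
        raw = stream.read(4 * cols)
        if len(raw) != 4 * cols:
            raise EOFError("unexpected end of data while reading an embedding")
        # "<" selects little-endian, "f" a 4-byte single-precision float.
        vectors.append(list(struct.unpack(f"<{cols}f", raw)))
    return words, vectors


sample = (
    b"2 3\n"
    + b"hello " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"world " + struct.pack("<3f", 0.5, -1.0, 4.0) + b"\n"
)
words, vectors = read_word2vec_sketch(io.BytesIO(sample))
```

Reading a fixed number of bytes per embedding matters here: the float payload can contain any byte value, including spaces and newlines, so it cannot be split on delimiters.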

Parameters

file (str, bytes, int, PathLike) – Path to a file with embeddings in word2vec binary format.
lossy (bool) – If set to True, malformed UTF-8 sequences in words are replaced with U+FFFD REPLACEMENT CHARACTER.

Returns

embeddings – The embeddings from the input file.

Return type

Embeddings
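
The effect of the lossy parameter can be illustrated with Python's built-in decoder. This sketch assumes that lossy reading behaves like `bytes.decode` with `errors="replace"`; the source does not confirm how the library implements it, and the sample bytes are made up.

```python
# Latin-1 encoded "café": the trailing byte 0xE9 is not valid UTF-8 on its own.
raw_word = b"caf\xe9"

# Lossy decoding substitutes U+FFFD for the malformed sequence.
lossy_word = raw_word.decode("utf-8", errors="replace")

# Strict decoding rejects the same bytes.
try:
    raw_word.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With lossy decoding the word survives as `"caf\ufffd"`, at the cost of no longer matching the original bytes.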

finalfusion.compat.word2vec.write_word2vec(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings)

Write embeddings in word2vec binary format.

If the embeddings are not compatible with the w2v format (e.g. they include a SubwordVocab), only the known words and their embeddings are serialized; the subword matrix is discarded.

Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm.
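
Un-normalization amounts to multiplying each unit vector by its stored norm. The following is a minimal sketch with a hypothetical helper name and made-up values, not the library's code.

```python
def unnormalize(unit_rows, norms):
    """Scale each unit vector by its norm (illustrative helper, not finalfusion API)."""
    return [[x * norm for x in row] for row, norm in zip(unit_rows, norms)]


restored = unnormalize([[0.6, 0.8], [0.0, 1.0]], [5.0, 2.0])
```

Up to floating-point rounding, `restored` holds the vectors as they were before normalization.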

The output file will contain the shape encoded in utf-8 on the first line as rows columns. This is followed by the embeddings.

Each embedding consists of:

utf-8 encoded word
single space ' ' following the word
cols single-precision floating point numbers
'\n' newline at the end of each line.
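
The record layout listed above can be sketched with the standard library. `write_word2vec_sketch` is a hypothetical name and takes plain lists instead of an Embeddings object; little-endian byte order is an assumption made here to match the format expected when reading.

```python
import io
import struct


def write_word2vec_sketch(stream, words, vectors):
    """Serialize words and vectors to a binary stream (illustrative only)."""
    rows = len(words)
    cols = len(vectors[0]) if vectors else 0
    # Shape on the first line as "rows columns".
    stream.write(f"{rows} {cols}\n".encode("utf-8"))
    for word, vector in zip(words, vectors):
        stream.write(word.encode("utf-8"))
        stream.write(b" ")
        # Assumption: little-endian single-precision floats.
        stream.write(struct.pack(f"<{cols}f", *vector))
        stream.write(b"\n")


buffer = io.BytesIO()
write_word2vec_sketch(buffer, ["hi", "yo"], [[1.0, 2.0], [0.5, -1.0]])
payload = buffer.getvalue()
```

Each record here occupies the length of the encoded word, plus one space, plus four bytes per component, plus one newline.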

Parameters

file (str, bytes, int, PathLike) – Path to the output file.
embeddings (Embeddings) – The embeddings to serialize.