Word2Vec

Word2vec binary format.

finalfusion.compat.word2vec.load_word2vec(file: Union[str, bytes, int, os.PathLike], lossy: bool = False) → finalfusion.embeddings.Embeddings

Read embeddings in word2vec binary format.

The returned embeddings have a SimpleVocab, NdArray storage and a Norms chunk. The storage is l2-normalized by default and the original norm of each embedding is stored in the Norms chunk.
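The relationship between the normalized storage and the stored norms can be illustrated with a small standard-library sketch. The helper name `l2_normalize` and the handling of all-zero vectors are assumptions made for this illustration; they are not part of finalfusion.

```python
import math


def l2_normalize(rows):
    """Return (unit_rows, norms) for a list of float vectors.

    Hypothetical helper for illustration, not a finalfusion function.
    """
    unit_rows, norms = [], []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        norms.append(norm)
        # Assumption: all-zero vectors are left unchanged to avoid dividing by zero.
        unit_rows.append([x / norm for x in row] if norm > 0 else list(row))
    return unit_rows, norms


original = [[3.0, 4.0], [0.0, 2.0]]
unit_rows, norms = l2_normalize(original)

# Scaling each unit vector by its norm recovers the original embedding.
restored = [[x * n for x in row] for row, n in zip(unit_rows, norms)]
```

Keeping the norms alongside the unit vectors is what makes the normalization reversible.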

Files are expected to start with a line containing rows and cols in utf-8. Each word is encoded in utf-8 and followed by a single space. After the space, the embedding components are expected as little-endian single-precision floats.
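The layout described above can be parsed with the standard library alone. The following is a minimal sketch, not the finalfusion implementation: the function name `read_word2vec_binary` is hypothetical, and it assumes that a newline may trail each embedding row, which it strips from the start of the next word.

```python
import io
import struct


def read_word2vec_binary(stream, lossy=False):
    """Parse word2vec binary data from a binary stream (illustrative sketch)."""
    # Header line: "<rows> <cols>\n" in utf-8.
    rows, cols = (int(n) for n in stream.readline().decode("utf-8").split())
    errors = "replace" if lossy else "strict"
    words, vectors = [], []
    for _ in range(rows):
        # Read the word byte by byte up to the separating space.
        raw = bytearray()
        while True:
            byte = stream.read(1)
            if not byte:
                raise EOFError("unexpected end of data while reading a word")
            if byte == b" ":
                break
            raw += byte
        # Assumption: a '\n' left over from the previous row may precede the word.
        words.append(bytes(raw).lstrip(b"\n").decode("utf-8", errors))
        # cols little-endian single-precision floats, 4 bytes each.
        data = stream.read(4 * cols)
        if len(data) != 4 * cols:
            raise EOFError("unexpected end of data while reading an embedding")
        vectors.append(list(struct.unpack(f"<{cols}f", data)))
    return words, vectors


sample = (
    b"2 3\n"
    + b"hello " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"world " + struct.pack("<3f", 0.5, -1.0, 4.0) + b"\n"
)
words, vectors = read_word2vec_binary(io.BytesIO(sample))
```

Reading exactly `4 * cols` bytes per embedding matters because the float data may itself contain bytes that look like spaces or newlines.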

Parameters
  • file (str, bytes, int, PathLike) – Path to a file with embeddings in word2vec binary format.

  • lossy (bool) – If set to true, malformed UTF-8 sequences in words will be replaced with the U+FFFD REPLACEMENT CHARACTER.
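The effect of lossy decoding can be seen with Python's built-in UTF-8 decoder. This sketch only demonstrates the replacement behaviour on a made-up byte string; how finalfusion reports malformed input when lossy is false is not stated here, so the strict branch shows Python's own behaviour only.

```python
# A word whose last byte (0xFF) is never valid in UTF-8.
raw = b"caf\xff"

# Lossy decoding substitutes U+FFFD for the malformed byte.
lossy_word = raw.decode("utf-8", errors="replace")

# Strict decoding rejects the same input.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```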

Returns

embeddings – The embeddings from the input file.

Return type

Embeddings

finalfusion.compat.word2vec.write_word2vec(file: Union[str, bytes, int, os.PathLike], embeddings: finalfusion.embeddings.Embeddings)

Write embeddings in word2vec binary format.

If the embeddings are not compatible with the w2v format (e.g. they include a SubwordVocab), only the known words and their embeddings are serialized, i.e. the subword matrix is discarded.

Embeddings are un-normalized before serialization: if norms are present, each embedding is scaled by its associated norm.

The output file will contain the shape encoded in utf-8 on the first line as rows columns. This is followed by the embeddings.

Each embedding consists of:

  • utf-8 encoded word

  • single space ' ' following the word

  • cols single-precision floating point numbers

  • '\n' newline at the end of each line.
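The record layout listed above can be reproduced with the standard library. This is a minimal sketch rather than the finalfusion implementation: the function name `write_word2vec_binary` is hypothetical, it takes plain lists instead of an Embeddings object, and it assumes little-endian floats to match the format accepted when reading.

```python
import io
import struct


def write_word2vec_binary(stream, words, vectors):
    """Write words and vectors to a binary stream (illustrative sketch)."""
    rows = len(words)
    cols = len(vectors[0]) if vectors else 0
    # First line: the shape as "rows cols" in utf-8.
    stream.write(f"{rows} {cols}\n".encode("utf-8"))
    for word, vector in zip(words, vectors):
        stream.write(word.encode("utf-8"))
        stream.write(b" ")
        # Assumption: little-endian byte order for the single-precision floats.
        stream.write(struct.pack(f"<{cols}f", *vector))
        stream.write(b"\n")


buffer = io.BytesIO()
write_word2vec_binary(buffer, ["hello", "world"], [[1.0, 2.0], [0.5, -1.0]])
payload = buffer.getvalue()
```

Writing to any binary stream keeps the sketch testable in memory; passing a file opened with mode "wb" works the same way.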

Parameters
  • file (str, bytes, int, PathLike) – Output file

  • embeddings (Embeddings) – The embeddings to serialize.