Scripts

Installing finalfusion adds the following executables:

  • ffp-convert for converting embeddings

  • ffp-similar for similarity queries

  • ffp-analogy for analogy queries

  • ffp-bucket-to-explicit for converting bucket subword embeddings to explicit subword embeddings

  • ffp-select for building embeddings from a list of words
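
To check which of these scripts are reachable after installation, something along the following lines may help. It is a minimal sketch using only the Python standard library; the helper name find_scripts is hypothetical and not part of finalfusion.

```python
import shutil

# Script names taken from this section.
SCRIPTS = [
    "ffp-convert",
    "ffp-similar",
    "ffp-analogy",
    "ffp-bucket-to-explicit",
    "ffp-select",
]


def find_scripts(names=SCRIPTS):
    """Map each script name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_scripts().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

A script reported as missing usually means the environment that finalfusion was installed into is not the active one.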

Convert

ffp-convert converts between all supported embedding formats:

$ ffp-convert --help
usage: ffp-convert [-h] [-f FORMAT] [-t FORMAT] [-l] [--mmap] INPUT OUTPUT

Convert embeddings.

positional arguments:
  INPUT                 Input embeddings
  OUTPUT                Output path

optional arguments:
  -h, --help            show this help message and exit
  -f FORMAT, --from FORMAT
                        Valid choices: ['word2vec', 'finalfusion', 'fasttext',
                        'text', 'textdims'] Default: 'word2vec'
  -t FORMAT, --to FORMAT
                        Valid choices: ['word2vec', 'finalfusion', 'fasttext',
                        'text', 'textdims'] Default: 'finalfusion'
  -l, --lossy           Whether to fail on malformed UTF-8. Setting this flag
                        replaces malformed UTF-8 with the replacement character.
                        Not applicable to finalfusion format.
  --mmap                Whether to mmap the storage. Only applicable to
                        finalfusion files.
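
To give a feel for what a conversion and the --lossy flag involve, here is a rough sketch in plain Python. It is not the code behind ffp-convert. It assumes that the text format holds one "word v1 v2 ... vn" entry per line and that textdims is the same layout preceded by a "rows cols" header line; the function name text_to_textdims is hypothetical.

```python
def text_to_textdims(raw: bytes, lossy: bool = False) -> str:
    """Turn text-format embeddings into textdims (assumed layouts, see above).

    With lossy=False, malformed UTF-8 raises UnicodeDecodeError.
    With lossy=True, malformed bytes become U+FFFD, the replacement character.
    """
    text = raw.decode("utf-8", errors="replace" if lossy else "strict")
    rows = [line.rstrip("\r") for line in text.split("\n") if line.strip()]
    if not rows:
        raise ValueError("no embeddings found")

    # Dimensionality is the number of fields after the word.
    # Words containing spaces are not handled in this sketch.
    dims = len(rows[0].split()) - 1
    for row in rows:
        if len(row.split()) - 1 != dims:
            raise ValueError(f"inconsistent dimensionality: {row!r}")

    header = f"{len(rows)} {dims}"
    return "\n".join([header, *rows]) + "\n"


print(text_to_textdims(b"apple 0.1 0.2\npear 0.3 0.4\n"))
```

Passing lossy=True mirrors the documented behaviour of the -l flag: reading no longer fails on malformed input, at the cost of altered words.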

Similar

ffp-similar supports similarity queries:

$ ffp-similar --help
usage: ffp-similar [-h] [-f FORMAT] [-k K] [-l] [--mmap] EMBEDDINGS [input]

Similarity queries.

positional arguments:
  EMBEDDINGS            Input embeddings
  input                 Optional input file with one word per line. If
                        unspecified reads from stdin


optional arguments:
  -h, --help            show this help message and exit
  -f FORMAT, --format FORMAT
                        Valid choices: ['word2vec', 'finalfusion', 'fasttext',
                        'text', 'textdims'] Default: 'finalfusion'
  -k K                  Number of neighbours. Default: 10
  -l, --lossy           Whether to fail on malformed UTF-8. Setting this flag
                        replaces malformed UTF-8 with the replacement character.
                        Not applicable to finalfusion format.
  --mmap                Whether to mmap the storage. Only applicable to
                        finalfusion files.
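
Conceptually, a similarity query ranks all other words by how close their vectors are to the query word and returns the top K. The sketch below illustrates this with cosine similarity on a toy vocabulary. It assumes cosine similarity as the measure and that the query word itself is excluded; the names most_similar and toy are hypothetical and this is not the ffp-similar implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(embeddings, word, k=10):
    """Return the k (word, similarity) pairs closest to `word`, best first."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word  # assumption: the query word is not its own neighbour
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "dog": [0.6, 0.8],
    "car": [-1.0, 0.2],
}
for neighbour, score in most_similar(toy, "cat", k=2):
    print(f"{neighbour}\t{score:.3f}")
```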

Analogy

ffp-analogy answers analogy queries:

$ ffp-analogy --help
usage: ffp-analogy [-h] [-f FORMAT] [-i {a,b,c} [{a,b,c} ...]] [-k K]
                   EMBEDDINGS [input]

Analogy queries.

positional arguments:
  EMBEDDINGS            Input embeddings
  input                 Optional input file with 3 words per line. If
                        unspecified reads from stdin

optional arguments:
  -h, --help            show this help message and exit
  -f FORMAT, --format FORMAT
                        Valid choices: ['word2vec', 'finalfusion', 'fasttext',
                        'text', 'textdims'] Default: 'finalfusion'
  -i {a,b,c} [{a,b,c} ...], --include {a,b,c} [{a,b,c} ...]
                        Specify query parts that should be allowed as answers.
                        Valid choices: ['a', 'b', 'c']
  -k K                  Number of neighbours. Default: 10
  -l, --lossy           Whether to fail on malformed UTF-8. Setting this flag
                        replaces malformed UTF-8 with the replacement character.
                        Not applicable to finalfusion format.
  --mmap                Whether to mmap the storage. Only applicable to
                        finalfusion files.
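
An analogy query "a is to b as c is to ?" is commonly answered with the vector offset b - a + c, ranking all words by similarity to that target. The sketch below assumes this method and cosine similarity; it is not taken from the ffp-analogy source. The include parameter is a hypothetical stand-in for the -i option: query parts are excluded from the answers unless listed there.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(embeddings, a, b, c, k=10, include=()):
    """Answer "a is to b as c is to ?" with the offset b - a + c."""
    target = [
        y - x + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Query words are skipped unless their part ("a", "b" or "c") is included.
    excluded = {
        word for part, word in zip("abc", (a, b, c)) if part not in include
    }
    scored = [
        (word, cosine(target, vec))
        for word, vec in embeddings.items()
        if word not in excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [2.0, 0.0],
    "queen": [2.0, 1.0],
    "apple": [0.0, 1.0],
}
print(analogy(toy, "man", "woman", "king", k=1))
print(analogy(toy, "man", "woman", "king", k=2, include=("b",)))
```

Excluding the query words by default avoids the common case where the nearest neighbour of the target is simply one of the inputs.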

Bucket to Explicit

ffp-bucket-to-explicit converts bucket subword embeddings to explicit subword embeddings:

$ ffp-bucket-to-explicit --help
usage: ffp-bucket-to-explicit [-h] [-f FORMAT] INPUT OUTPUT

Convert bucket embeddings to explicit lookups.

positional arguments:
  INPUT                 Input embeddings
  OUTPUT                Output path

optional arguments:
  -h, --help            show this help message and exit
  -f FORMAT, --from FORMAT
                        Valid choices: ['finalfusion', 'fasttext'] Default:
                        'finalfusion'
  -l, --lossy           Whether to fail on malformed UTF-8. Setting this flag
                        replaces malformed UTF-8 with the replacement character.
                        Not applicable to finalfusion format.
  --mmap                Whether to mmap the storage. Only applicable to
                        finalfusion files.
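
The difference between the two subword schemes can be illustrated as follows. A bucket vocabulary derives an n-gram's index by hashing it into a fixed number of buckets, whereas an explicit vocabulary stores a lookup table from each known n-gram to its index. The sketch below builds such a table from the n-grams of the in-vocabulary words. It is an illustration under assumptions: zlib.crc32 stands in for the real hash function, the n-gram range defaults are chosen for the example, and the helper names are hypothetical.

```python
import zlib


def ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the bracketed word, e.g. of "<cat>" for "cat"."""
    bracketed = f"<{word}>"
    return [
        bracketed[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(bracketed) - n + 1)
    ]


def bucket_index(ngram, n_buckets):
    # Stand-in hash for illustration only; the real formats use their own hash.
    return zlib.crc32(ngram.encode("utf-8")) % n_buckets


def bucket_to_explicit(words, n_buckets, min_n=3, max_n=6):
    """Return (n-gram -> explicit index, buckets to keep in row order)."""
    bucket_to_row = {}
    explicit = {}
    for word in words:
        for gram in ngrams(word, min_n, max_n):
            bucket = bucket_index(gram, n_buckets)
            # N-grams that collided in a bucket keep sharing one row.
            row = bucket_to_row.setdefault(bucket, len(bucket_to_row))
            explicit[gram] = row
    kept_buckets = sorted(bucket_to_row, key=bucket_to_row.get)
    return explicit, kept_buckets


table, kept = bucket_to_explicit(["cat", "cats"], n_buckets=1000, min_n=3, max_n=4)
print(len(table), "n-grams mapped to", len(kept), "rows")
```

In this sketch only the buckets that some in-vocabulary n-gram hashes to are kept, so the explicit table can be much smaller than the full bucket range.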

Embedding Selection

ffp-select generates an embedding file restricted to an input vocabulary. If the input embeddings have a subword vocabulary, ffp-select computes representations for words that are not in the vocabulary and adds them to the output. The script thereby converts subword embeddings into embeddings with a simple lookup. The resulting embeddings use array storage.

$ ffp-select --help
usage: ffp-select [-h] [-f FORMAT] INPUT OUTPUT [WORDS]

Build embeddings from list of words.

positional arguments:
  INPUT                 Input embeddings
  OUTPUT                Output path
  WORDS                 List of words to include in the embeddings. One word
                        per line. Spaces permitted. Reads from stdin if
                        unspecified.

optional arguments:
  -h, --help            show this help message and exit
  -f FORMAT, --format FORMAT
                        Valid choices: ['word2vec', 'finalfusion', 'fasttext',
                        'text', 'textdims'] Default: 'finalfusion'
  --ignore_unk, -i      Skip unrepresentable words.
  --verbose, -v         Print which tokens are skipped because they can't be
                        represented to stderr.
  -l, --lossy           Whether to fail on malformed UTF-8. Setting this flag
                        replaces malformed UTF-8 with the replacement character.
                        Not applicable to finalfusion format.
  --mmap                Whether to mmap the storage. Only applicable to
                        finalfusion files.
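
The selection logic described above can be sketched roughly as follows. This is an illustration, not the ffp-select implementation. It assumes that an unknown word is represented by the average of the vectors of its known n-grams, and that an unrepresentable word is an error unless skipping is requested. The names select, word_vectors and ngram_vectors are hypothetical.

```python
import sys


def ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the bracketed word, e.g. of "<cat>" for "cat"."""
    bracketed = f"<{word}>"
    return [
        bracketed[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(bracketed) - n + 1)
    ]


def select(words, word_vectors, ngram_vectors=None, ignore_unk=False):
    """Build a plain word -> vector lookup for exactly the requested words."""
    selected = {}
    for word in words:
        if word in word_vectors:
            selected[word] = list(word_vectors[word])
            continue
        # Unknown word: fall back to its known n-grams, if there are any.
        parts = [
            ngram_vectors[g] for g in ngrams(word) if g in ngram_vectors
        ] if ngram_vectors else []
        if parts:
            dims = len(parts[0])
            selected[word] = [
                sum(p[i] for p in parts) / len(parts) for i in range(dims)
            ]
        elif ignore_unk:
            print(f"skipping unrepresentable word: {word!r}", file=sys.stderr)
        else:
            raise KeyError(word)
    return selected


word_vectors = {"cat": [1.0, 0.0]}
ngram_vectors = {"<ca": [1.0, 0.0], "ats": [0.0, 1.0]}
print(select(["cat", "cats", "xyz"], word_vectors, ngram_vectors, ignore_unk=True))
```

The result no longer depends on subword information at lookup time, which matches the description of the output as embeddings with a simple lookup.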