FastTextIndexer

class finalfusion.subword.hash_indexers.FastTextIndexer(n_buckets=2000000, min_n=3, max_n=6)

File: src/finalfusion/subword/hash_indexers.pyx (starting at line 155)

FastTextIndexer is a hash-based subword indexer. It hashes each n-gram with a (slightly faulty) variant of FNV-1a and maps the hash into a fixed number of buckets, n_buckets.

A single n-gram can be indexed directly through the __call__ method. All n-grams of a string can be indexed in bulk through the subword_indices method.
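The hashing step can be illustrated with a minimal pure-Python sketch. It assumes the "slightly faulty" variant is the one used by the fastText reference implementation, where each UTF-8 byte is sign-extended before being XORed into the 32-bit state, so the result differs from standard FNV-1a only for non-ASCII input. The names fasttext_hash and bucket_index are hypothetical helpers for illustration and are not part of finalfusion.

```python
FNV_OFFSET_BASIS = 2166136261  # 32-bit FNV offset basis
FNV_PRIME = 16777619           # 32-bit FNV prime


def fasttext_hash(ngram: str) -> int:
    """Hypothetical helper: 32-bit FNV-1a with sign-extended bytes."""
    h = FNV_OFFSET_BASIS
    for byte in ngram.encode("utf-8"):
        # Assumption: bytes >= 0x80 are treated as negative 8-bit values
        # and sign-extended to 32 bits before the XOR.
        if byte >= 0x80:
            byte |= 0xFFFFFF00
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the state at 32 bits
    return h


def bucket_index(ngram: str, n_buckets: int = 2_000_000) -> int:
    """Hypothetical helper: map the hash into the range [0, n_buckets)."""
    return fasttext_hash(ngram) % n_buckets
```

For pure ASCII n-grams this sketch agrees with standard FNV-1a, and every returned index is smaller than n_buckets.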

max_n

The upper bound of the n-gram range.

Returns

max_n – Upper bound of n-gram range

Return type

int

min_n

The lower bound of the n-gram range.

Returns

min_n – Lower bound of n-gram range

Return type

int

n_buckets

The number of buckets.

Returns

n_buckets – Number of buckets

Return type

int

subword_indices(self, unicode word, uint64_t offset=0, bracket=True, with_ngrams=False)

File: src/finalfusion/subword/hash_indexers.pyx (starting at line 219)

Get the subword indices for a word.

Parameters
  • word (str) – The string to extract n-grams from

  • offset (int) – The offset to add to each index, e.g. the length of the word vocabulary

  • bracket (bool) – Toggles bracketing the input string with < and >

  • with_ngrams (bool) – Toggles returning tuples of (ngram, idx)

Returns

indices – List of n-gram indices, or a list of (str, int) tuples of (ngram, idx) if with_ngrams is True.

Return type

list

Raises

TypeError – If word is None.
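The behaviour described above can be approximated with a short pure-Python sketch. It is not the library's Cython code: the function names, the order in which n-grams are produced, and the choice to add offset to the index inside the returned tuples are assumptions made for illustration. The hash is the same assumed fastText-style FNV-1a variant, repeated here so the block is self-contained.

```python
def _hash(ngram, n_buckets):
    """Hypothetical helper: assumed fastText-style FNV-1a, reduced to a bucket."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        if byte >= 0x80:
            byte |= 0xFFFFFF00  # assumed sign extension of non-ASCII bytes
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h % n_buckets


def ngrams_sketch(word, min_n=3, max_n=6, bracket=True):
    """Hypothetical helper: all n-grams with min_n <= n <= max_n."""
    if bracket:
        word = "<" + word + ">"
    result = []
    for start in range(len(word)):
        for n in range(min_n, max_n + 1):
            if start + n > len(word):
                break  # longer n-grams would run past the end
            result.append(word[start:start + n])
    return result


def subword_indices_sketch(word, offset=0, bracket=True, with_ngrams=False,
                           n_buckets=2_000_000, min_n=3, max_n=6):
    """Hypothetical stand-in for the documented subword_indices method."""
    if word is None:
        raise TypeError("word must be a str, not None")
    pairs = [(ng, _hash(ng, n_buckets) + offset)
             for ng in ngrams_sketch(word, min_n, max_n, bracket)]
    return pairs if with_ngrams else [idx for _, idx in pairs]
```

With bracket=True, min_n=3 and max_n=3, the word "ab" becomes "<ab>" and yields the two n-grams "<ab" and "ab>". Passing the size of the word vocabulary as offset shifts every subword index by that amount.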