FastTextVocab

class finalfusion.vocab.subword.FastTextVocab(*args, **kwds)

Bases: finalfusion.vocab.subword.SubwordVocab

FastText vocabulary

__init__(words: List[str], indexer: Optional[finalfusion.subword.hash_indexers.FastTextIndexer] = None)

Initialize a FastTextVocab.

Initializes the vocabulary with the given words.

If no indexer is passed, a FastTextIndexer with 2_000_000 buckets is used.

The word list cannot contain duplicate entries.

Parameters

words (List[str]) – List of unique words
indexer (FastTextIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2_000_000 buckets and range 3-6.

Raises

AssertionError – If the indexer is not a FastTextIndexer or words contains duplicate entries.
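
The uniqueness requirement follows from how a word list becomes a lookup table: each word's position serves as its index, so a repeated word would claim two positions. The following is a minimal pure-Python sketch of that idea. `build_word_index` is a hypothetical helper written for illustration; it is not part of finalfusion, and the library's own checks may be implemented differently.

```python
from typing import Dict, List


def build_word_index(words: List[str]) -> Dict[str, int]:
    """Map each word to its position, rejecting duplicate entries."""
    index: Dict[str, int] = {}
    for position, word in enumerate(words):
        # A repeated word would need two indices, so it is rejected.
        if word in index:
            raise AssertionError(f"duplicate word: {word!r}")
        index[word] = position
    return index


word_index = build_word_index(["tree", "house", "garden"])
```

In this sketch, `word_index["house"]` is 1, and passing `["a", "a"]` raises an AssertionError.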

to_explicit() → finalfusion.vocab.subword.ExplicitVocab

Return an ExplicitVocab built from this vocab.

This method iterates over the known words and extracts all n-grams within this vocab’s bounds. Each n-gram is hashed and mapped to an index. This index is not necessarily unique for each n-gram: if hashes collide, multiple n-grams are mapped to the same index.

The returned vocab will be unable to produce indices for unknown ngrams.

The new vocab’s known indices will cover [0, vocab.upper_bound).

Returns: explicit_vocab – The converted vocabulary.
Return type: ExplicitVocab
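
The conversion described above can be sketched in plain Python, under stated assumptions. `to_explicit_index`, `trigrams` and `toy_bucket` are hypothetical names used only for this illustration; the toy hash and the tiny bucket count are chosen to force collisions and are not what the library uses.

```python
from typing import Callable, Dict, Iterable, List


def to_explicit_index(words: Iterable[str],
                      ngrams_of: Callable[[str], List[str]],
                      bucket_of: Callable[[str], int]) -> Dict[str, int]:
    """Map every n-gram seen in `words` to a dense index.

    N-grams whose buckets collide share one dense index.
    """
    bucket_to_dense: Dict[int, int] = {}
    ngram_index: Dict[str, int] = {}
    for word in words:
        for ngram in ngrams_of(word):
            if ngram in ngram_index:
                continue
            bucket = bucket_of(ngram)
            # The first time a bucket is seen, it gets the next dense index.
            dense = bucket_to_dense.setdefault(bucket, len(bucket_to_dense))
            ngram_index[ngram] = dense
    return ngram_index


def trigrams(word: str) -> List[str]:
    """Toy extractor: bracket the word and return its trigrams only."""
    word = f"<{word}>"
    return [word[i:i + 3] for i in range(len(word) - 2)]


def toy_bucket(ngram: str) -> int:
    """Toy hash with only 4 buckets, so 5 distinct n-grams must collide."""
    return sum(ngram.encode("utf-8")) % 4


index = to_explicit_index(["cat", "car"], trigrams, toy_bucket)
```

The two words yield five distinct trigrams but at most four buckets, so at least two n-grams share an index. The resulting indices are dense, and an n-gram that never occurred in the word list has no entry at all, which mirrors why the converted vocab cannot index unknown n-grams.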

property subword_indexer

Get this vocab’s subword Indexer.

The subword indexer produces indices for n-grams.

In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.

Returns: subword_indexer – The subword indexer of the vocabulary.
Return type: ExplicitIndexer, FinalfusionHashIndexer, FastTextIndexer

property word_index

Get the index of known words.

Returns: dict – mapping from each known word to its index
Return type: Dict[str, int]

property words

Get the list of known words

Returns: words – list of known words
Return type: List[str]

static read_chunk(file: BinaryIO) → finalfusion.vocab.subword.FastTextVocab

Read the Chunk and return it.

The file must be positioned before the contents of the Chunk but after its header.

Parameters: file (BinaryIO) – a finalfusion file containing the given Chunk
Returns: chunk – The chunk read from the file.
Return type: Chunk

write_chunk(file: BinaryIO)

Write the Chunk to a file.

Parameters: file (BinaryIO) – Output file for the Chunk

static chunk_identifier()

Get the ChunkIdentifier for this Chunk.

Returns: chunk_identifier
Return type: ChunkIdentifier

idx(item: str, default: Optional[Union[int, List[int]]] = None) → Optional[Union[int, List[int]]]

Lookup the given query item.

This lookup does not raise an exception if the vocab can’t produce indices.

Parameters

item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.

Returns

index –

An integer if there is a single index for a known item.
A list if the vocab can provide subword indices for an unknown item.
The provided default item if the vocab can’t provide indices.

Return type

Optional[Union[int, List[int]]]
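
The three possible return values can be modelled with a small stand-alone function. This is a sketch of the documented behaviour only: `lookup` and `toy_subword_indices` are hypothetical names, and the toy subword function simply assigns one index per trigram after two known words.

```python
from typing import Callable, Dict, List, Optional, Union

Index = Union[int, List[int]]


def lookup(item: str,
           word_index: Dict[str, int],
           subword_indices: Callable[[str], List[int]],
           default: Optional[Index] = None) -> Optional[Index]:
    """Return a word index, else subword indices, else `default`."""
    if item in word_index:
        return word_index[item]
    indices = subword_indices(item)
    if indices:
        return indices
    return default


def toy_subword_indices(item: str) -> List[int]:
    """Hypothetical: one index per unbracketed trigram, offset past 2 words."""
    return [2 + i for i in range(len(item) - 2)]


known = {"cat": 0, "dog": 1}
print(lookup("cat", known, toy_subword_indices))     # known word
print(lookup("cart", known, toy_subword_indices))    # unknown, has trigrams
print(lookup("a", known, toy_subword_indices, -1))   # too short, falls back
```

In this model the known word returns a single integer, the unknown word returns a list, and an item too short to yield any n-gram returns the supplied default instead of raising.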

property max_n

Get the upper bound of the range of extracted n-grams.

Returns: max_n – upper bound of n-gram range.
Return type: int

property min_n

Get the lower bound of the range of extracted n-grams.

Returns: min_n – lower bound of n-gram range.
Return type: int

subword_indices(item: str, bracket: bool = True, with_ngrams: bool = False) → List[Union[int, Tuple[str, int]]]

Get the subword indices for the given item.

This list does not contain the index for known items.

Parameters

item (str) – The query item.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
with_ngrams (bool) – Toggles returning ngrams together with indices.

Returns

indices – The list of subword indices.

Return type

List[Union[int, Tuple[str, int]]]
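
As a rough illustration of how a hash-based indexer can turn n-grams into indices, the sketch below hashes each n-gram with plain 32-bit FNV-1a and reduces it modulo the bucket count. Several points here are assumptions rather than statements about the library: the exact hash variant used by FastTextIndexer may differ, `fnv1a_32` and `bucket_indices` are hypothetical helpers, and placing subword indices after the word indices (the `n_words` offset) is assumed.

```python
from typing import List, Tuple

FNV_OFFSET = 2166136261
FNV_PRIME = 16777619


def fnv1a_32(ngram: str) -> int:
    """Plain 32-bit FNV-1a over the UTF-8 bytes of `ngram`."""
    h = FNV_OFFSET
    for byte in ngram.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def bucket_indices(ngrams: List[str], n_words: int,
                   n_buckets: int) -> List[Tuple[str, int]]:
    """Pair each n-gram with a bucket index, offset past the word indices."""
    # Assumption: subword indices start at n_words, after all known words.
    return [(ng, n_words + fnv1a_32(ng) % n_buckets) for ng in ngrams]


pairs = bucket_indices(["<ca", "cat", "at>"], n_words=3, n_buckets=2_000_000)
```

The pairs have the same shape as the `with_ngrams=True` return type. Under the offset assumption, every index falls in `[n_words, n_words + n_buckets)`, and two different n-grams may still land in the same bucket.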

subwords(item: str, bracket: bool = True) → List[str]

Get the n-grams of the given item as a list.

The n-gram range is determined by the min_n and max_n values.

Parameters

item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.

Returns

ngrams – List of n-grams.

Return type

List[str]
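
A minimal sketch of bracketed n-gram extraction, written in plain Python so it runs without the library. `extract_ngrams` is a hypothetical helper; the order in which it emits n-grams is an assumption and need not match the order returned by `subwords`.

```python
from typing import List


def extract_ngrams(item: str, min_n: int = 3, max_n: int = 6,
                   bracket: bool = True) -> List[str]:
    """Return all character n-grams of `item` with min_n <= n <= max_n."""
    if bracket:
        item = f"<{item}>"
    ngrams = []
    for start in range(len(item)):
        for n in range(min_n, max_n + 1):
            if start + n > len(item):
                break
            ngrams.append(item[start:start + n])
    return ngrams


print(extract_ngrams("cat"))
# → ['<ca', '<cat', '<cat>', 'cat', 'cat>', 'at>']
print(extract_ngrams("cat", bracket=False))
# → ['cat']
```

Bracketing lets prefixes and suffixes be told apart from word-internal n-grams: `<ca` only matches at the start of a word, while `cat` can match anywhere.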

property upper_bound

The exclusive upper bound of indices in this vocabulary.

Returns: upper_bound – Exclusive upper bound of indices covered by the vocabulary.
Return type: int

write(file: Union[str, bytes, int, os.PathLike])

Write the Chunk as a standalone finalfusion file.

Parameters: file (Union[str, bytes, int, PathLike]) – Output path
Raises: TypeError – If the Chunk is a Header.

finalfusion.vocab.subword.load_fasttext_vocab(file: Union[str, bytes, int, os.PathLike]) → finalfusion.vocab.subword.FastTextVocab

Load a FastTextVocab from the given finalfusion file.

Parameters: file (str, bytes, int, PathLike) – Path to file containing a FastTextVocab chunk.
Returns: vocab – Returns the first FastTextVocab in the file.
Return type: FastTextVocab