ExplicitVocab

class finalfusion.vocab.subword.ExplicitVocab(*args, **kwds)

Bases: finalfusion.vocab.subword.SubwordVocab

A vocabulary with explicitly stored n-grams.

__init__(words: List[str], indexer: finalfusion.subword.explicit_indexer.ExplicitIndexer)

Initialize an ExplicitVocab.

Initializes the vocabulary with the given words and ExplicitIndexer.

The word list cannot contain duplicate entries.

Parameters
  • words (List[str]) – List of unique words

  • indexer (ExplicitIndexer) – Subword indexer to use for the vocabulary.

Raises

AssertionError – If the indexer is not an ExplicitIndexer.

See also

ExplicitIndexer

property word_index

Get the mapping from known words to their indices.

Returns

dict – index of known words

Return type

Dict[str, int]

property subword_indexer

Get this vocabulary’s subword indexer.

The subword indexer produces indices for n-grams.

In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.

Returns

subword_indexer – The subword indexer of the vocabulary.

Return type

ExplicitIndexer, FinalfusionHashIndexer, FastTextIndexer

property words

Get the list of known words.

Returns

words – list of known words

Return type

List[str]

static chunk_identifier() → finalfusion.io.ChunkIdentifier

Get the ChunkIdentifier for this Chunk.

Returns

chunk_identifier

Return type

ChunkIdentifier

static read_chunk(file: BinaryIO) → finalfusion.vocab.subword.ExplicitVocab

Read the Chunk and return it.

The file must be positioned before the contents of the Chunk but after its header.

Parameters

file (BinaryIO) – a finalfusion file containing the given Chunk

Returns

chunk – The chunk read from the file.

Return type

Chunk

write_chunk(file) → None

Write the Chunk to a file.

Parameters

file (BinaryIO) – Output file for the Chunk

idx(item: str, default: Optional[Union[int, List[int]]] = None) → Optional[Union[int, List[int]]]

Lookup the given query item.

This lookup does not raise an exception if the vocab can’t produce indices.

Parameters
  • item (str) – The query item.

  • default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.

Returns

index

  • An integer if there is a single index for a known item.

  • A list if the vocab can provide subword indices for an unknown item.

  • The provided default item if the vocab can’t provide indices.

Return type

Optional[Union[int, List[int]]]
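
The three return cases can be illustrated with a small pure-Python sketch. It does not call finalfusion: toy_ngrams, toy_idx, word_index and ngram_index are hypothetical names, and the layout in which n-gram indices are offset by the number of known words is an assumption made for illustration, not a statement about the library's internals.

```python
from typing import Dict, List, Optional, Union

Index = Optional[Union[int, List[int]]]


def toy_ngrams(item: str, min_n: int, max_n: int) -> List[str]:
    """Extract all n-grams of the item after bracketing it with '<' and '>'."""
    bracketed = f"<{item}>"
    return [
        bracketed[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(bracketed) - n + 1)
    ]


def toy_idx(
    item: str,
    word_index: Dict[str, int],
    ngram_index: Dict[str, int],
    min_n: int = 3,
    max_n: int = 6,
    default: Index = None,
) -> Index:
    """Mimic the documented return cases of idx() on toy data."""
    # Case 1: a known item maps to a single integer index.
    if item in word_index:
        return word_index[item]
    # Case 2: an unknown item maps to the indices of its stored n-grams.
    # Assumption: n-gram indices are placed after the word indices.
    offset = len(word_index)
    indices = [
        ngram_index[ngram] + offset
        for ngram in toy_ngrams(item, min_n, max_n)
        if ngram in ngram_index
    ]
    # Case 3: no indices at all, so the caller's default is returned.
    return indices if indices else default


word_index = {"tree": 0, "house": 1}
ngram_index = {"<tr": 0, "ree": 1, "ee>": 2}

known = toy_idx("tree", word_index, ngram_index)      # 0
unknown = toy_idx("trees", word_index, ngram_index)   # [2, 3]
missing = toy_idx("xyz", word_index, ngram_index, default=-1)  # -1
```

Because no exception is raised, a caller can tell the cases apart by checking the type of the result or by passing a sentinel as default.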

property max_n

Get the upper bound of the range of extracted n-grams.

Returns

max_n – upper bound of n-gram range.

Return type

int

property min_n

Get the lower bound of the range of extracted n-grams.

Returns

min_n – lower bound of n-gram range.

Return type

int

subword_indices(item: str, bracket: bool = True, with_ngrams: bool = False) → List[Union[int, Tuple[str, int]]]

Get the subword indices for the given item.

This list does not contain the index for known items.

Parameters
  • item (str) – The query item.

  • bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.

  • with_ngrams (bool) – Toggles returning ngrams together with indices.

Returns

indices – The list of subword indices.

Return type

List[Union[int, Tuple[str, int]]]
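
The effect of the bracket and with_ngrams toggles can be sketched without the library. The function toy_subword_indices and the n-gram table below are hypothetical, the indices are raw positions in that toy table (any offset the library may apply is left out), and the order of results may differ from what finalfusion returns.

```python
from typing import Dict, List, Tuple, Union


def toy_subword_indices(
    item: str,
    ngram_index: Dict[str, int],
    min_n: int = 3,
    max_n: int = 6,
    bracket: bool = True,
    with_ngrams: bool = False,
) -> List[Union[int, Tuple[str, int]]]:
    """Look up every stored n-gram of the item in a toy explicit n-gram table."""
    text = f"<{item}>" if bracket else item
    result: List[Union[int, Tuple[str, int]]] = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            ngram = text[start:start + n]
            # Only n-grams that are explicitly stored yield an index.
            if ngram in ngram_index:
                idx = ngram_index[ngram]
                result.append((ngram, idx) if with_ngrams else idx)
    return result


ngram_index = {"<ca": 0, "cat": 1, "at>": 2}

plain = toy_subword_indices("cat", ngram_index)
paired = toy_subword_indices("cat", ngram_index, with_ngrams=True)
unbracketed = toy_subword_indices("cat", ngram_index, bracket=False)
```

In this toy setup, plain holds only integers, paired holds (n-gram, index) tuples, and unbracketed loses the boundary n-grams because "<ca" and "at>" can no longer be formed.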

subwords(item: str, bracket: bool = True) → List[str]

Get the n-grams of the given item as a list.

The n-gram range is determined by the min_n and max_n values.

Parameters
  • item (str) – The query item to extract n-grams from.

  • bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.

Returns

ngrams – List of n-grams.

Return type

List[str]
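
As a rough illustration of how min_n, max_n and bracket interact, the following sketch enumerates candidate n-grams in plain Python. toy_subwords is a hypothetical helper, not the library's implementation; the ordering of its output and whether finalfusion filters n-grams against the stored set are not covered here.

```python
from typing import List


def toy_subwords(
    item: str, min_n: int = 3, max_n: int = 6, bracket: bool = True
) -> List[str]:
    """Enumerate all n-grams with min_n <= n <= max_n, shortest first."""
    text = f"<{item}>" if bracket else item
    return [
        text[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(text) - n + 1)
    ]


bracketed = toy_subwords("hi", min_n=3, max_n=4)
unbracketed = toy_subwords("hi", min_n=1, max_n=2, bracket=False)
```

With bracketing, the two-character item "hi" becomes "<hi>" and still yields n-grams of length 3 and 4; without it, only n-grams up to the item's own length exist.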

property upper_bound

The exclusive upper bound of indices in this vocabulary.

Returns

upper_bound – Exclusive upper bound of indices covered by the vocabulary.

Return type

int

write(file: Union[str, bytes, int, os.PathLike])

Write the Chunk as a standalone finalfusion file.

Parameters

file (Union[str, bytes, int, PathLike]) – Output path

Raises

TypeError – If the Chunk is a Header.

finalfusion.vocab.subword.load_explicit_vocab(file: Union[str, bytes, int, os.PathLike]) → finalfusion.vocab.subword.ExplicitVocab

Load an ExplicitVocab from the given finalfusion file.

Parameters

file (str, bytes, int, PathLike) – Path to file containing an ExplicitVocab chunk.

Returns

vocab – The first ExplicitVocab in the file.

Return type

ExplicitVocab