ExplicitVocab
- class finalfusion.vocab.subword.ExplicitVocab(*args, **kwds)
Bases: finalfusion.vocab.subword.SubwordVocab
A vocabulary with explicitly stored n-grams.
- __init__(words: List[str], indexer: finalfusion.subword.explicit_indexer.ExplicitIndexer)
Initialize an ExplicitVocab.
Initializes the vocabulary with the given words and ExplicitIndexer.
The word list cannot contain duplicate entries.
- Parameters
words (List[str]) – List of unique words
indexer (ExplicitIndexer) – Subword indexer to use for the vocabulary.
- Raises
AssertionError – If the indexer is not an ExplicitIndexer.
- property word_index
Get the index of known words.
- property subword_indexer
Get this vocab’s subword indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
ExplicitIndexer
- property words
Get the list of known words.
- Returns
words – list of known words
- Return type
List[str]
- static chunk_identifier() → finalfusion.io.ChunkIdentifier
Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
ChunkIdentifier
- static read_chunk(file: BinaryIO) → finalfusion.vocab.subword.ExplicitVocab
Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
ExplicitVocab
- write_chunk(file) → None
Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
- idx(item: str, default: Optional[Union[int, List[int]]] = None) → Optional[Union[int, List[int]]]
Look up the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index –
An integer if there is a single index for a known item.
A list if the vocab can provide subword indices for an unknown item.
The provided default item if the vocab can’t provide indices.
- Return type
Optional[Union[int, List[int]]]
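The three possible results of idx can be illustrated with a small stand-alone sketch. It does not use finalfusion; toy_idx, word_index and ngram_index are hypothetical names, and the choice to offset n-gram indices by the number of words is an assumption made for the illustration, not a statement about the library's internals.

```python
from typing import Dict, List, Optional, Union

Index = Optional[Union[int, List[int]]]


def toy_idx(item: str,
            word_index: Dict[str, int],
            ngram_index: Dict[str, int],
            default: Index = None,
            min_n: int = 3,
            max_n: int = 6) -> Index:
    """Hypothetical stand-in for the lookup behaviour described above."""
    # 1. Known item: a single integer index.
    if item in word_index:
        return word_index[item]
    # 2. Unknown item: indices of the n-grams that are stored explicitly.
    #    Offsetting by the number of words is an assumed layout that keeps
    #    word and n-gram indices from colliding.
    bracketed = f"<{item}>"
    offset = len(word_index)
    indices = [
        offset + ngram_index[bracketed[start:start + n]]
        for n in range(min_n, max_n + 1)
        for start in range(len(bracketed) - n + 1)
        if bracketed[start:start + n] in ngram_index
    ]
    if indices:
        return indices
    # 3. No word and no stored n-gram matched: return the fall-back value.
    return default


word_index = {"tree": 0, "house": 1}
ngram_index = {"<tr": 0, "tre": 1, "ree": 2, "ee>": 3}

print(toy_idx("tree", word_index, ngram_index))             # known word
print(toy_idx("trees", word_index, ngram_index))            # unknown, shares n-grams
print(toy_idx("xyz", word_index, ngram_index, default=-1))  # nothing matches
```

Because no exception is raised in the third case, callers that need to tell "no indices" apart from a valid result should pick a default that cannot be a real index, such as None or -1.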
- property max_n
Get the upper bound of the range of extracted n-grams.
- Returns
max_n – upper bound of n-gram range.
- Return type
int
- property min_n
Get the lower bound of the range of extracted n-grams.
- Returns
min_n – lower bound of n-gram range.
- Return type
int
- subword_indices(item: str, bracket: bool = True, with_ngrams: bool = False) → List[Union[int, Tuple[str, int]]]
Get the subword indices for the given item.
This list does not contain the index for known items.
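The return type in the signature suggests that with_ngrams switches between plain indices and (n-gram, index) pairs. The following sketch shows that shape without using finalfusion. toy_subword_indices and ngram_index are hypothetical, both n-gram bounds are treated as inclusive, and any offset the library may add to the indices is left out; all of these are assumptions.

```python
from typing import Dict, List, Tuple, Union


def toy_subword_indices(item: str,
                        ngram_index: Dict[str, int],
                        bracket: bool = True,
                        with_ngrams: bool = False,
                        min_n: int = 3,
                        max_n: int = 6
                        ) -> List[Union[int, Tuple[str, int]]]:
    """Hypothetical sketch: look up each n-gram of `item` in an explicit index."""
    if bracket:
        item = f"<{item}>"
    result: List[Union[int, Tuple[str, int]]] = []
    for n in range(min_n, max_n + 1):
        for start in range(len(item) - n + 1):
            ngram = item[start:start + n]
            index = ngram_index.get(ngram)
            if index is None:
                continue  # n-gram is not stored explicitly, so it has no index
            result.append((ngram, index) if with_ngrams else index)
    return result


ngram_index = {"<tr": 0, "tre": 1, "ree": 2, "ee>": 3}

print(toy_subword_indices("tree", ngram_index))
print(toy_subword_indices("tree", ngram_index, with_ngrams=True))
```

Note that the result contains only n-gram indices. Even if the item is itself a known word, its word index is not part of the list.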
- subwords(item: str, bracket: bool = True) → List[str]
Get the n-grams of the given item as a list.
The n-gram range is determined by the min_n and max_n values.
- Parameters
item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
ngrams – List of n-grams.
- Return type
List[str]
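As a rough illustration of bracketed n-gram extraction, the sketch below re-implements the idea in plain Python. extract_ngrams is a hypothetical helper, not the library's function. It assumes that min_n and max_n are both inclusive and emits n-grams from shortest to longest, which may differ from the order the library returns.

```python
from typing import List


def extract_ngrams(item: str, min_n: int = 3, max_n: int = 6,
                   bracket: bool = True) -> List[str]:
    """Return all character n-grams of `item` with min_n <= n <= max_n."""
    if bracket:
        # Mark the word boundaries so prefixes and suffixes yield distinct n-grams.
        item = f"<{item}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the (bracketed) item.
        for start in range(len(item) - n + 1):
            ngrams.append(item[start:start + n])
    return ngrams


print(extract_ngrams("hi", min_n=3, max_n=4))
print(extract_ngrams("hi", min_n=3, max_n=4, bracket=False))
```

With bracketing, the two-character item "hi" becomes "<hi>" and still produces n-grams. Without it, the item is shorter than min_n and the result is empty.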
- property upper_bound
The exclusive upper bound of indices in this vocabulary.
- Returns
upper_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
int
- finalfusion.vocab.subword.load_explicit_vocab(file: Union[str, bytes, int, os.PathLike]) → finalfusion.vocab.subword.ExplicitVocab
Load an ExplicitVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing an ExplicitVocab chunk.
- Returns
vocab – Returns the first ExplicitVocab in the file.
- Return type
ExplicitVocab