FinalfusionBucketVocab
-
class
finalfusion.vocab.subword.
FinalfusionBucketVocab
(words: List[str], indexer: Optional[finalfusion.subword.hash_indexers.FinalfusionHashIndexer] = None)
Bases: finalfusion.vocab.subword.SubwordVocab
Finalfusion Bucket Vocabulary.
-
__init__
(words: List[str], indexer: Optional[finalfusion.subword.hash_indexers.FinalfusionHashIndexer] = None)[source]¶ Initialize a FinalfusionBucketVocab.
Initializes the vocabulary with the given words.
If no indexer is passed, a FinalfusionHashIndexer with bucket exponent 21 is used.
The word list cannot contain duplicate entries.
- Parameters
words (List[str]) – List of unique words
indexer (FinalfusionHashIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2^21 buckets and an n-gram range of 3-6.
- Raises
AssertionError – If the indexer is not a FinalfusionHashIndexer or words contains duplicate entries.
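The documented checks can be pictured with a small stand-alone sketch. It does not import finalfusion: `ToyHashIndexer`, `validate_args` and `raises_assertion` are hypothetical names that only mirror the behaviour described above (default bucket exponent 21, n-gram range 3-6, AssertionError on a wrong indexer type or on duplicate words).

```python
from typing import List, Optional


class ToyHashIndexer:
    """Stand-in for FinalfusionHashIndexer; hypothetical, for illustration only."""

    def __init__(self, bucket_exp: int = 21, min_n: int = 3, max_n: int = 6):
        self.bucket_exp = bucket_exp
        self.min_n = min_n
        self.max_n = max_n


def validate_args(words: List[str], indexer: Optional[object] = None) -> ToyHashIndexer:
    """Apply the documented constructor checks and return the indexer to use."""
    if indexer is None:
        # Documented default: 2**21 buckets, n-gram range 3-6.
        indexer = ToyHashIndexer()
    if not isinstance(indexer, ToyHashIndexer):
        raise AssertionError("indexer must be a hash indexer")
    if len(set(words)) != len(words):
        raise AssertionError("words contains duplicate entries")
    return indexer


def raises_assertion(words: List[str], indexer: Optional[object] = None) -> bool:
    """Return True if validate_args rejects the arguments."""
    try:
        validate_args(words, indexer)
    except AssertionError:
        return True
    return False


default_indexer = validate_args(["apple", "banana"])
```

With these definitions, `raises_assertion(["apple", "apple"])` and `raises_assertion(["apple"], indexer="not an indexer")` both report a rejection.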
-
to_explicit
() → finalfusion.vocab.subword.ExplicitVocab[source]¶ Return an ExplicitVocab built from this vocab.
This method iterates over the known words and extracts all n-grams within this vocab’s bounds. Each n-gram is hashed and mapped to an index. This index is not necessarily unique for each n-gram: if hashes collide, multiple n-grams are mapped to the same index.
The returned vocab will be unable to produce indices for unknown ngrams.
The known indices of the new vocab will cover [0, vocab.upper_bound).
- Returns
explicit_vocab – The converted vocabulary.
- Return type
ExplicitVocab
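The conversion described above can be sketched without the library. This is an illustration under stated assumptions, not finalfusion's implementation: `toy_bucket` uses CRC32 as a stand-in hash (finalfusion uses its own hash function), the bucket count is kept tiny to force collisions, and all function names are hypothetical.

```python
import zlib
from typing import Dict, List


def ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """All n-grams of the bracketed word with min_n <= n <= max_n."""
    s = f"<{word}>"
    return [s[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(s) - n + 1)]


def toy_bucket(ngram: str, n_buckets: int) -> int:
    """Stand-in hash indexer (CRC32 modulo bucket count); NOT finalfusion's hash."""
    return zlib.crc32(ngram.encode("utf-8")) % n_buckets


def to_explicit_sketch(words: List[str], n_buckets: int) -> Dict[str, int]:
    """Map every n-gram of the known words to a compact index."""
    ngram_to_idx: Dict[str, int] = {}
    bucket_to_idx: Dict[int, int] = {}
    for word in words:
        for ng in ngrams(word):
            if ng in ngram_to_idx:
                continue
            bucket = toy_bucket(ng, n_buckets)
            # The first n-gram seen for a bucket claims the next free index;
            # n-grams whose hashes collide reuse that index.
            if bucket not in bucket_to_idx:
                bucket_to_idx[bucket] = len(bucket_to_idx)
            ngram_to_idx[ng] = bucket_to_idx[bucket]
    return ngram_to_idx


mapping = to_explicit_sketch(["hello", "help"], n_buckets=4)
n_indices = len(set(mapping.values()))
```

Because only four buckets are available here, several n-grams necessarily share an index, and an n-gram that never occurred in the word list (for example `"xyz"`) has no entry in `mapping` at all.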
-
write_chunk
(file: BinaryIO)[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
property
subword_indexer
¶ Get this vocab’s subword Indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
-
property
word_index
¶ Get the index of known words
-
static
read_chunk
(file: BinaryIO) → finalfusion.vocab.subword.FinalfusionBucketVocab[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
static
chunk_identifier
() → finalfusion.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
ChunkIdentifier
-
idx
(item: str, default: Union[int, List[int], None] = None) → Union[int, List[int], None] Look up the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index –
An integer if there is a single index for a known item.
A list if the vocab can provide subword indices for an unknown item.
The provided default item if the vocab can’t provide indices.
- Return type
Optional[Union[int, List[int]]]
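The three possible results of the lookup can be illustrated with a toy model. This is a sketch, not the library's code: the hash is a CRC32 stand-in, the bucket count is deliberately tiny, and the offset of subword indices past the word indices is an assumption made for the example. `idx_sketch` and `toy_subword_indices` are hypothetical names.

```python
import zlib
from typing import Dict, List, Optional, Union

WORDS = ["apple", "banana"]
WORD_INDEX: Dict[str, int] = {w: i for i, w in enumerate(WORDS)}
N_BUCKETS = 8  # tiny on purpose; the documented default is 2**21
MIN_N, MAX_N = 3, 6


def toy_subword_indices(item: str) -> List[int]:
    """Hash every n-gram of the bracketed item into a bucket index."""
    s = f"<{item}>"
    grams = [s[i:i + n]
             for n in range(MIN_N, MAX_N + 1)
             for i in range(len(s) - n + 1)]
    # Assumption for this sketch: subword indices start after the word indices.
    return [len(WORDS) + zlib.crc32(g.encode("utf-8")) % N_BUCKETS for g in grams]


def idx_sketch(item: str,
               default: Optional[Union[int, List[int]]] = None
               ) -> Optional[Union[int, List[int]]]:
    """Return a word index, a list of subword indices, or the default."""
    if item in WORD_INDEX:
        return WORD_INDEX[item]
    sub = toy_subword_indices(item)
    return sub if sub else default


known = idx_sketch("banana")           # single integer index
unknown = idx_sketch("cherry")         # list of subword indices
fallback = idx_sketch("", default=-1)  # "<>" is shorter than MIN_N, so no n-grams
```

In this toy model the fallback is only reached when the bracketed item is too short to contain any n-gram in range.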
-
property
max_n
¶ Get the upper bound of the range of extracted n-grams.
- Returns
max_n – upper bound of n-gram range.
- Return type
int
-
property
min_n
¶ Get the lower bound of the range of extracted n-grams.
- Returns
min_n – lower bound of n-gram range.
- Return type
int
-
subword_indices
(item: str, bracket: bool = True, with_ngrams: bool = False) → List[Union[int, Tuple[str, int]]]¶ Get the subword indices for the given item.
This list does not contain the index for known items.
-
subwords
(item: str, bracket: bool = True) → List[str]¶ Get the n-grams of the given item as a list.
The n-gram range is determined by the min_n and max_n values.
- Parameters
item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
ngrams – List of n-grams.
- Return type
List[str]
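A minimal sketch of n-gram extraction with optional bracketing follows. `subwords_sketch` is a hypothetical stand-alone function, not the library method, and the order in which the library returns n-grams may differ from the order produced here.

```python
from typing import List


def subwords_sketch(item: str, min_n: int = 3, max_n: int = 6,
                    bracket: bool = True) -> List[str]:
    """Extract all n-grams of `item` with min_n <= n <= max_n."""
    # Bracketing marks word boundaries so prefixes and suffixes get distinct n-grams.
    s = f"<{item}>" if bracket else item
    return [s[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(s) - n + 1)]


bracketed = subwords_sketch("cat")
unbracketed = subwords_sketch("cat", bracket=False)
```

With bracketing, "cat" becomes "<cat>" and yields n-grams such as "<ca" and "at>"; without it, the only n-gram in the range 3-6 is "cat" itself.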
-
property
upper_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
upper_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
int
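As a worked example of the arithmetic, the sketch below assumes that the upper bound of a bucket vocabulary is the number of known words plus the number of hash buckets, with word indices placed first. That layout is an assumption made for illustration and is not stated in this section.

```python
# Assumed layout: word indices first, then one index per hash bucket.
words = ["apple", "banana", "cherry"]
bucket_exp = 21                      # documented default bucket exponent
n_buckets = 2 ** bucket_exp          # 2**21 hash buckets

upper_bound = len(words) + n_buckets

word_range = range(0, len(words))               # indices of known words
subword_range = range(len(words), upper_bound)  # indices of hashed n-grams
```

Because the bound is exclusive, the largest valid index under this assumption is `upper_bound - 1`.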
-
-
finalfusion.vocab.subword.
load_finalfusion_bucket_vocab
(file: Union[str, bytes, int, os.PathLike]) → finalfusion.vocab.subword.FinalfusionBucketVocab[source]¶ Load a FinalfusionBucketVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a FinalfusionBucketVocab chunk.
- Returns
vocab – Returns the first FinalfusionBucketVocab in the file.
- Return type
FinalfusionBucketVocab
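Returning "the first" matching chunk implies scanning the file chunk by chunk and stopping at the first match. The sketch below shows that idea on an in-memory stream. It assumes a hypothetical layout in which every chunk starts with a 4-byte identifier and an 8-byte payload length (little-endian); the identifier values 100 and 200 are made up, and the real layout is defined by the finalfusion format specification.

```python
import io
import struct
from typing import BinaryIO, Optional

# Assumed chunk header for this sketch: u32 identifier, u64 payload length.
HEADER = struct.Struct("<IQ")
OTHER_ID = 100   # hypothetical identifier of an unrelated chunk
TARGET_ID = 200  # hypothetical identifier of the vocab chunk


def seek_first_chunk(file: BinaryIO, target_id: int) -> Optional[int]:
    """Position `file` just after the header of the first matching chunk.

    Returns the payload length, or None if no such chunk exists.
    """
    while True:
        raw = file.read(HEADER.size)
        if len(raw) < HEADER.size:
            return None
        chunk_id, length = HEADER.unpack(raw)
        if chunk_id == target_id:
            return length
        # Not the chunk we want: skip its payload.
        file.seek(length, io.SEEK_CUR)


def make_chunk(chunk_id: int, payload: bytes) -> bytes:
    """Build a chunk in the assumed layout."""
    return HEADER.pack(chunk_id, len(payload)) + payload


data = (make_chunk(OTHER_ID, b"other")
        + make_chunk(TARGET_ID, b"first")
        + make_chunk(TARGET_ID, b"second"))
stream = io.BytesIO(data)
length = seek_first_chunk(stream, TARGET_ID)
payload = stream.read(length)
```

After `seek_first_chunk` returns, the stream sits after the chunk header and before its contents, which is the position `read_chunk` is documented to expect.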