bit::LM Class Reference

Class for storing an n-gram language model in a bit-packed trie structure.

#include <LM.hh>

Public Types

typedef Trie< CompressedArray > Trie
 Internal Trie type.
typedef Trie::Iterator Iterator
 Iterator type.
typedef SymbolMap< std::string, int > SymbolMap
 Type for symbol map.

Public Member Functions

 LM ()
 Default constructor.
void reset ()
 Reset the model to initial state.
unsigned int order () const
 Order of the model.
u64 size () const
 The number of bytes required to store all bit-buffers.
const FloatArray & score_array (unsigned int level) const
 Access to score arrays.
const FloatArray & backoff_array (unsigned int level) const
 Access to backoff arrays.
const CompressedArray & symbol_array (unsigned int level) const
 Access to symbol arrays of the trie.
const CompressedArray & pointer_array (unsigned int level) const
 Access to pointer arrays of the trie.
const CompressedArray & child_limit_array (unsigned int level) const
 Access to child limit arrays of the trie.
void read_arpa (FILE *file, const std::string &sentence_start_str="<s>", const std::string &sentence_end_str="</s>", bool verbose=false)
 Read language model from file in ARPA format.
void write_arpa (FILE *file) const
 Write the model in ARPA format.
void write (FILE *file) const
 Write the model in binary format.
void read (FILE *file)
 Read the model from file stored in binary format.
void linear_quantization (unsigned int bits)
 Quantize all floats linearly.
void compress_trie (unsigned int level)
 Compress the arrays of the trie on the given level.
void compress_trie ()
 Compress all levels of the trie.
void uncompress_trie (unsigned int level)
 Uncompress one level of the trie.
void uncompress_trie ()
 Uncompress the trie.
void separate_leafs (unsigned int level)
 Separate the leaves of the trie on a given level, and modify the backoff array accordingly.
void unseparate_leafs (unsigned int level)
 Unseparate the leaves of the trie on a given level (removing any compression), and modify the backoff array accordingly.
void insert_ngram (const std::vector< int > &ngram, float score, float backoff)
 Insert a new n-gram into the model.
void insert_ngram (const std::string &str, float score, float backoff)
 Insert a new n-gram into the model, allowing new symbols to be inserted.
void set_start_symbol (const std::string &str)
 Set the sentence start symbol and add the string to the symbol map if it is not there already.
void set_end_symbol (const std::string &str)
 Set the sentence end symbol and add the string to the symbol map if it is not there already.
int start_symbol () const
 The symbol starting the sentence.
int end_symbol () const
 The symbol ending the sentence.
const SymbolMap & symbol_map () const
 The mapping between symbols and strings.
template<class T>
std::string ngram_str (const std::vector< T > &vec) const
 Printable string of an n-gram.
Iterator root () const
 Trie iterator pointing to the root.
float backoff (const Iterator &it) const
 Backoff weight at the iterator position.
float backoff (unsigned int level, u64 index) const
 Backoff weight from a given level.
float score (const Iterator &it) const
 Probability score at the iterator position.
float walk (Iterator &it, int symbol) const
 Walk the iterator to the given symbol, backing off if necessary.

Private Member Functions

int compare_ngrams (const std::vector< int > &a, const std::vector< int > &b)
 Compare two ngrams.

Private Attributes

SymbolMap m_symbol_map
 Mapping between symbol strings and model symbols.
int m_start_symbol
 Symbol corresponding to the sentence start symbol.
int m_end_symbol
 Symbol corresponding to the sentence end symbol.
Trie m_trie
 The internal Trie structure.
std::vector< FloatArray > m_backoff_arrays
 Arrays containing backoff weights for each n-gram level.
std::vector< FloatArray > m_score_arrays
 Arrays containing probability scores for each n-gram level.
std::vector< int > m_previous_ngram
 Previous ngram inserted in the model.


Detailed Description

Class for storing an n-gram language model in a bit-packed trie structure.

Note that the backoff weight must be zero for ngrams that do not have children.

Bug:
The score arrays waste half of the linear quantization range because FloatArray uses a sign bit even if all scores are negative.
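
To make the cost of the sign bit concrete, the following minimal sketch compares quantization step sizes. It assumes a simple symmetric scheme in which one of the bits is a sign and the rest encode magnitude; the actual encoding used by FloatArray is not shown in this reference, and the function names here are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Step size when `bits` bits cover the symmetric range [-max_abs, +max_abs]:
// one bit is spent on the sign, leaving 2^(bits-1) - 1 magnitude steps.
// (Assumed scheme, not necessarily what FloatArray does.)
double signed_step(double max_abs, unsigned bits) {
    assert(bits >= 2 && bits <= 32);
    const std::uint64_t levels = (std::uint64_t{1} << (bits - 1)) - 1;
    return max_abs / static_cast<double>(levels);
}

// Step size when all `bits` bits cover the one-sided range [-max_abs, 0],
// which would suffice if every score is known to be non-positive.
double one_sided_step(double max_abs, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    const std::uint64_t levels = (std::uint64_t{1} << bits) - 1;
    return max_abs / static_cast<double>(levels);
}
```

Under these assumptions the signed step is slightly more than twice the one-sided step for the same number of bits, which is the loss described above.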


Member Typedef Documentation

typedef Trie::Iterator bit::LM::Iterator
 

Iterator type.

typedef SymbolMap<std::string, int> bit::LM::SymbolMap
 

Type for symbol map.

typedef Trie<CompressedArray> bit::LM::Trie
 

Internal Trie type.


Constructor & Destructor Documentation

bit::LM::LM () [inline]
 

Default constructor.


Member Function Documentation

float bit::LM::backoff (unsigned int level, u64 index) const [inline]
 

Backoff weight from a given level.

Parameters:
level = level to access
index = index on the level
Returns:
the backoff value
Exceptions:
bit::invalid_argument if level exceeds the number of levels
bit::invalid_argument if index exceeds the number of elements on the level

float bit::LM::backoff (const Iterator &it) const [inline]
 

Backoff weight at the iterator position.

Exceptions:
bit::invalid_call if called at root

const FloatArray& bit::LM::backoff_array (unsigned int level) const [inline]
 

Access to backoff arrays.

const CompressedArray& bit::LM::child_limit_array (unsigned int level) const [inline]
 

Access to child limit arrays of the trie.

int bit::LM::compare_ngrams (const std::vector< int > &a, const std::vector< int > &b) [inline, private]
 

Compare two ngrams.

Returns:
-1 if a comes before b, 0 if equal, 1 if a comes after b
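
A three-way comparison with this return convention can be sketched as follows. The ordering used here (plain lexicographic order on symbol ids, with a prefix sorting before its extensions) is an assumption for illustration; this reference does not state which ordering the private method actually implements, and `compare_ngrams_sketch` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a three-way n-gram comparison.
// Returns -1 if a sorts before b, 0 if equal, 1 if a sorts after b.
int compare_ngrams_sketch(const std::vector<int> &a, const std::vector<int> &b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] < b[i]) return -1;
        if (a[i] > b[i]) return 1;
    }
    // All shared positions match: the shorter n-gram sorts first (assumed).
    if (a.size() < b.size()) return -1;
    if (a.size() > b.size()) return 1;
    return 0;
}

// Small usage example.
void compare_ngrams_examples() {
    assert(compare_ngrams_sketch({1, 2}, {1, 3}) == -1);
    assert(compare_ngrams_sketch({4, 5}, {4, 5}) == 0);
}
```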

void bit::LM::compress_trie () [inline]
 

Compress all levels of the trie.

void bit::LM::compress_trie (unsigned int level) [inline]
 

Compress the arrays of the trie on the given level.

Parameters:
level = the level to compress

int bit::LM::end_symbol () const [inline]
 

The symbol ending the sentence.

void bit::LM::insert_ngram (const std::string &str, float score, float backoff)
 

Insert a new n-gram into the model, allowing new symbols to be inserted.

Parameters:
str = white-space separated list of symbols
score = score (usually log-probability) of the ngram
backoff = the backoff weight of the ngram
Exceptions:
bit::invalid_argument if ngrams are not inserted in sorted order or if ngram is empty

void bit::LM::insert_ngram (const std::vector< int > &ngram, float score, float backoff)
 

Insert a new n-gram into the model.

Parameters:
ngram = vector of ngram symbols
score = score (usually log-probability) of the ngram
backoff = the backoff weight of the ngram
Exceptions:
bit::invalid_argument if ngrams are not inserted in sorted order or if ngram is empty

void bit::LM::linear_quantization (unsigned int bits)
 

Quantize all floats linearly.

Does nothing on levels that are already quantized.

Parameters:
bits = number of bits per float (remember that sign requires a bit)
Exceptions:
bit::invalid_argument if bits < 2 or bits > 32.
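
As a rough illustration of what linear quantization with a sign bit involves, here is a self-contained round-trip sketch. The helper names, the `max_abs` scaling parameter, and the use of `std::invalid_argument` (standing in for `bit::invalid_argument`) are all assumptions made for this example; the library's own implementation is not shown in this reference.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Number of magnitude steps available once one bit is reserved for the sign.
double magnitude_levels(unsigned bits) {
    if (bits < 2 || bits > 32)
        throw std::invalid_argument("bits must be in [2, 32]");
    return static_cast<double>((std::uint64_t{1} << (bits - 1)) - 1);
}

// Map a value in [-max_abs, max_abs] to the nearest signed integer code.
std::int64_t quantize(float value, float max_abs, unsigned bits) {
    assert(max_abs > 0.0f);
    return std::llround(value / max_abs * magnitude_levels(bits));
}

// Map an integer code back to an approximate float value.
float dequantize(std::int64_t code, float max_abs, unsigned bits) {
    assert(max_abs > 0.0f);
    return static_cast<float>(code / magnitude_levels(bits) * max_abs);
}
```

With this scheme the round-trip error is at most half a quantization step, that is, `max_abs / (2 * magnitude_levels(bits))`.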

template<class T>
std::string bit::LM::ngram_str (const std::vector< T > &vec) const [inline]
 

Printable string of an n-gram.

Parameters:
vec = ngram to convert
Returns:
string representing the ngram

unsigned int bit::LM::order () const [inline]
 

Order of the model.

const CompressedArray& bit::LM::pointer_array (unsigned int level) const [inline]
 

Access to pointer arrays of the trie.

void bit::LM::read (FILE *file)
 

Read the model from file stored in binary format.

Parameters:
file = file stream to read from
Exceptions:
bit::io_error on error

void bit::LM::read_arpa (FILE *file, const std::string &sentence_start_str="<s>", const std::string &sentence_end_str="</s>", bool verbose=false)
 

Read language model from file in ARPA format.

Parameters:
file = file stream to read from
sentence_start_str = the label of sentence start symbol
sentence_end_str = the label of sentence end symbol
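
For readers unfamiliar with the ARPA format, each n-gram line holds a log-probability, the n words, and an optional backoff weight, separated by whitespace. The sketch below parses one such line; it is not the parser used by `read_arpa`, and `ArpaEntry` and `parse_arpa_line` are hypothetical names. A missing backoff weight is treated as zero, which is the usual ARPA convention.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One parsed n-gram line (hypothetical helper type).
struct ArpaEntry {
    float score = 0.0f;              // log-probability of the n-gram
    std::vector<std::string> words;  // the n symbols
    float backoff = 0.0f;            // backoff weight, zero when omitted
    bool ok = false;                 // true if the line parsed as an n-gram
};

// Parse "score w1 ... wN [backoff]" for an n-gram section of the given order.
ArpaEntry parse_arpa_line(const std::string &line, unsigned order) {
    assert(order > 0);
    ArpaEntry e;
    std::istringstream in(line);
    if (!(in >> e.score)) return e;  // header or malformed line
    std::string word;
    for (unsigned i = 0; i < order; ++i) {
        if (!(in >> word)) return e;  // too few words for this order
        e.words.push_back(word);
    }
    if (!(in >> e.backoff)) e.backoff = 0.0f;  // backoff weight is optional
    e.ok = true;
    return e;
}
```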

void bit::LM::reset () [inline]
 

Reset the model to initial state.

Iterator bit::LM::root () const [inline]
 

Trie iterator pointing to the root.

float bit::LM::score (const Iterator &it) const [inline]
 

Probability score at the iterator position.

const FloatArray& bit::LM::score_array (unsigned int level) const [inline]
 

Access to score arrays.

void bit::LM::separate_leafs (unsigned int level)
 

Separate the leaves of the trie on a given level, and modify the backoff array accordingly.

Exceptions:
bit::invalid_call if level is separated already
bit::out_of_range if level is greater than or equal to the highest level

void bit::LM::set_end_symbol (const std::string &str) [inline]
 

Set the sentence end symbol and add the string to the symbol map if it is not there already.

Exceptions:
bit::invalid_call if set already

void bit::LM::set_start_symbol (const std::string &str) [inline]
 

Set the sentence start symbol and add the string to the symbol map if it is not there already.

Exceptions:
bit::invalid_call if set already

u64 bit::LM::size () const [inline]
 

The number of bytes required to store all bit-buffers.

int bit::LM::start_symbol () const [inline]
 

The symbol starting the sentence.

const CompressedArray& bit::LM::symbol_array (unsigned int level) const [inline]
 

Access to symbol arrays of the trie.

const SymbolMap& bit::LM::symbol_map () const [inline]
 

The mapping between symbols and strings.

void bit::LM::uncompress_trie () [inline]
 

Uncompress the trie.

void bit::LM::uncompress_trie (unsigned int level) [inline]
 

Uncompress one level of the trie.

Parameters:
level = the level to uncompress

void bit::LM::unseparate_leafs (unsigned int level)
 

Unseparate the leaves of the trie on a given level (removing any compression), and modify the backoff array accordingly.

Exceptions:
bit::invalid_call if level is not separated
bit::out_of_range if level is greater than or equal to the highest level

float bit::LM::walk (Iterator &it, int symbol) const [inline]
 

Walk the iterator to the given symbol, backing off if necessary.

The resulting iterator is also guaranteed to have children, backing off further if necessary.

Parameters:
it = iterator to walk
symbol = symbol to find
Returns:
the score of the walked path
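
The general backoff idea behind such a walk can be illustrated with a toy model stored in a `std::map` rather than a bit-packed trie. This is a sketch of standard backoff scoring, not the library's implementation: `ToyModel`, `Entry`, and `toy_walk` are hypothetical, the context vector plays the role of the iterator, and the extra step of backing off until the resulting position has children is omitted.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <vector>

// Score and backoff weight of one n-gram (log domain).
struct Entry {
    float score;
    float backoff;
};

// Toy model: n-gram (as symbol ids) -> entry.
using ToyModel = std::map<std::vector<int>, Entry>;

// Extend `context` by `symbol`. If the extended n-gram is missing, add the
// backoff weight of the current context, drop its oldest symbol, and retry.
// Returns the accumulated score and leaves `context` at the matched n-gram.
float toy_walk(const ToyModel &model, std::vector<int> &context, int symbol) {
    float total = 0.0f;
    for (;;) {
        std::vector<int> candidate = context;
        candidate.push_back(symbol);
        const auto hit = model.find(candidate);
        if (hit != model.end()) {
            context = candidate;
            return total + hit->second.score;
        }
        if (context.empty())  // not even a unigram: a choice made for this sketch
            throw std::out_of_range("symbol not in model");
        const auto ctx = model.find(context);
        if (ctx != model.end()) total += ctx->second.backoff;
        context.erase(context.begin());  // shorten the history
    }
}
```

For example, with unigrams `{1}` (score -1.0, backoff -0.5) and `{2}`, and the bigram `{1, 2}`, walking from context `{1}` with symbol 1 finds no bigram `{1, 1}`, so the result is the backoff of `{1}` plus the unigram score of `{1}`.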

void bit::LM::write (FILE *file) const
 

Write the model in binary format.

Parameters:
file = file stream to write to
Exceptions:
bit::io_error on error

void bit::LM::write_arpa (FILE *file) const
 

Write the model in ARPA format.

Parameters:
file = file stream to write to


Member Data Documentation

std::vector<FloatArray> bit::LM::m_backoff_arrays [private]
 

Arrays containing backoff weights for each n-gram level.

int bit::LM::m_end_symbol [private]
 

Symbol corresponding to the sentence end symbol.

std::vector<int> bit::LM::m_previous_ngram [private]
 

Previous ngram inserted in the model.

std::vector<FloatArray> bit::LM::m_score_arrays [private]
 

Arrays containing probability scores for each n-gram level.

int bit::LM::m_start_symbol [private]
 

Symbol corresponding to the sentence start symbol.

SymbolMap bit::LM::m_symbol_map [private]
 

Mapping between symbol strings and model symbols.

Trie bit::LM::m_trie [private]
 

The internal Trie structure.


Generated on Mon Jan 8 15:51:04 2007 for bit by  doxygen 1.4.6