Concordia
SentenceTokenizer Class Reference

#include <sentence_tokenizer.hpp>

Public Member Functions

 SentenceTokenizer (boost::shared_ptr< ConcordiaConfig > config) throw (ConcordiaException)
 
virtual ~SentenceTokenizer ()
 
TokenizedSentence tokenize (const std::string &sentence, bool byWhitespace=false)
 

Detailed Description

Class for tokenizing a sentence before generating its hash. The tokenizer ignores unnecessary symbols, HTML tags and, if the option is enabled, stop words in sentences added to the index, and it also annotates named entities. All of these have to be listed in files (see the Concordia configuration).

Constructor & Destructor Documentation

explicit SentenceTokenizer::SentenceTokenizer ( boost::shared_ptr< ConcordiaConfig > config ) throw ( ConcordiaException )

Constructor.

Parameters
config - config object, holding paths to the necessary files
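
A minimal construction sketch (not part of the generated documentation). The header names other than sentence_tokenizer.hpp, the ConcordiaConfig(const std::string &) constructor and the "concordia.cfg" path are assumptions; adjust them to the actual Concordia configuration API.

    #include <boost/shared_ptr.hpp>
    #include <concordia_config.hpp>      // assumed header name
    #include <concordia_exception.hpp>   // assumed header name
    #include <sentence_tokenizer.hpp>

    int main() {
        // "concordia.cfg" is a hypothetical path to the configuration file
        // listing the symbol, HTML tag, stop word and named entity files.
        boost::shared_ptr<ConcordiaConfig> config(
            new ConcordiaConfig("concordia.cfg"));
        try {
            SentenceTokenizer tokenizer(config);
        } catch (ConcordiaException & e) {
            // The constructor throws ConcordiaException when the configured
            // files cannot be loaded.
        }
        return 0;
    }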
virtual SentenceTokenizer::~SentenceTokenizer ( )

Destructor.

Member Function Documentation

TokenizedSentence SentenceTokenizer::tokenize ( const std::string & sentence, bool byWhitespace = false )

Tokenizes the sentence.

Parameters
sentence - input sentence
byWhitespace - whether to tokenize the sentence by whitespace
Returns
tokenized sentence object built on the input sentence
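
A minimal usage sketch (not part of the generated documentation). The variable tokenizer is a SentenceTokenizer constructed as in the example above, and the example sentences are illustrative.

    // Default tokenization: unnecessary symbols and HTML tags are removed,
    // stop words are optionally removed, and named entities are annotated.
    TokenizedSentence ts = tokenizer.tokenize("Alice visited <b>Paris</b> in 2015.");

    // Whitespace tokenization: the sentence is simply split on whitespace.
    TokenizedSentence tsByWhitespace = tokenizer.tokenize("Alice visited Paris", true);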
