Concordia
SentenceTokenizer Class Reference
#include <sentence_tokenizer.hpp>
Public Member Functions

    SentenceTokenizer (boost::shared_ptr< ConcordiaConfig > config) throw (ConcordiaException)
    virtual ~SentenceTokenizer ()
    TokenizedSentence tokenize (const std::string &sentence, bool byWhitespace=false)
Class for tokenizing a sentence before generating its hash. The tokenizer ignores unnecessary symbols, HTML tags and, if the option is enabled, stop words in sentences added to the index, and it annotates named entities. All of these have to be listed in files (see the Concordia configuration).
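A minimal usage sketch, assuming ConcordiaConfig is constructed from the path of a Concordia configuration file and that the concordia_config.hpp and tokenized_sentence.hpp headers exist under those names; the configuration path and the input sentence are placeholders, not part of this reference:

    #include <concordia_config.hpp>     // assumed header for ConcordiaConfig
    #include <sentence_tokenizer.hpp>
    #include <tokenized_sentence.hpp>   // assumed header for TokenizedSentence

    #include <boost/shared_ptr.hpp>
    #include <string>

    int main() {
        // Load the Concordia configuration, which holds the paths to the
        // symbol, HTML tag, stop word and named entity files used by the
        // tokenizer. Construction from a file path is an assumption.
        boost::shared_ptr<ConcordiaConfig> config(
            new ConcordiaConfig("concordia.cfg"));

        // The constructor is declared to throw ConcordiaException
        // (see the declaration above).
        SentenceTokenizer tokenizer(config);

        // Default tokenization (byWhitespace = false).
        TokenizedSentence ts =
            tokenizer.tokenize("Alice visited <b>Paris</b> in 2015.");

        return 0;
    }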
SentenceTokenizer::SentenceTokenizer (boost::shared_ptr< ConcordiaConfig > config) throw (ConcordiaException)  [explicit]

Constructor.

Parameters
    config    config object, holding paths to the necessary files
SentenceTokenizer::~SentenceTokenizer ()  [virtual]

Destructor.
TokenizedSentence SentenceTokenizer::tokenize (const std::string &sentence, bool byWhitespace = false)

Tokenizes the sentence.

Parameters
    sentence        input sentence
    byWhitespace    whether to tokenize the sentence by whitespace
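A short sketch of the byWhitespace flag, reusing the tokenizer object from the sketch above; the input sentence is again a placeholder:

    // Default mode: symbols, HTML tags and (optionally) stop words are
    // handled according to the files listed in the Concordia configuration.
    TokenizedSentence standard =
        tokenizer.tokenize("Alice visited <b>Paris</b> in 2015.");

    // Whitespace mode: the sentence is tokenized by whitespace instead.
    TokenizedSentence whitespaceOnly =
        tokenizer.tokenize("Alice visited <b>Paris</b> in 2015.", true);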