Concordia
TokenizedSentence Class Reference

#include <tokenized_sentence.hpp>

Public Member Functions

 TokenizedSentence (std::string sentence)
 
virtual ~TokenizedSentence ()
 
std::string getSentence () const
 
std::string getOriginalSentence () const
 
std::string getTokenizedSentence () const
 
std::list< TokenAnnotation > getAnnotations () const
 
std::vector< INDEX_CHARACTER_TYPE > getCodes () const
 
std::vector< TokenAnnotation > getTokens () const
 
void generateHash (boost::shared_ptr< WordMap > wordMap)
 
void generateTokens ()
 
void toLowerCase ()
 
void addAnnotations (std::vector< TokenAnnotation > annotations)
 

Detailed Description

A sentence after tokenizing operations. The class holds the current string representation of the sentence along with the list of its annotations. The class also allows for generating a hash; after that operation it additionally holds the list of hashed codes and the corresponding tokens.
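A minimal usage sketch, assuming only the members documented below; obtaining the WordMap instance and running the tokenizer are handled elsewhere in Concordia:

    #include <boost/shared_ptr.hpp>
    #include <tokenized_sentence.hpp>

    void processSentence(boost::shared_ptr<WordMap> wordMap) {
        // In practice the sentence is annotated by Concordia's tokenizer
        // before hashing; a freshly constructed sentence has no annotations.
        TokenizedSentence ts("Alice lives in Wonderland");
        ts.toLowerCase();                 // normalize the working copy
        ts.generateHash(wordMap);         // encode word / named-entity tokens
        std::vector<INDEX_CHARACTER_TYPE> codes = ts.getCodes();
        std::vector<TokenAnnotation> tokens = ts.getTokens();
        // codes and tokens correspond one-to-one after hashing
    }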

Constructor & Destructor Documentation

TokenizedSentence::TokenizedSentence ( std::string  sentence)
explicit

Constructor.

TokenizedSentence::~TokenizedSentence ( )
virtual

Destructor.

Member Function Documentation

void TokenizedSentence::addAnnotations ( std::vector< TokenAnnotation > annotations)

Add new annotations to the existing annotations list. Assumptions:

  1. the existing _tokenAnnotations vector contains disjoint, sorted intervals;
  2. the list of annotations to be added has the same properties.

The algorithm only adds those annotations that do not intersect with any of the existing ones (see the sketch below).

Parameters
    annotations    list of annotations to be added

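A standalone sketch of this insertion rule; Interval stands in for a TokenAnnotation's character span, and the code is an illustration rather than Concordia's actual implementation:

    #include <cstddef>
    #include <list>
    #include <vector>

    struct Interval { int start; int end; };  // stand-in for an annotation span

    // Both containers hold disjoint intervals sorted by start position;
    // intervals are treated as half-open: [start, end).
    void addDisjoint(std::list<Interval> & existing,
                     const std::vector<Interval> & toAdd) {
        std::list<Interval>::iterator it = existing.begin();
        for (std::size_t i = 0; i < toAdd.size(); ++i) {
            // skip existing intervals that end before the candidate starts
            while (it != existing.end() && it->end <= toAdd[i].start) {
                ++it;
            }
            // keep the candidate only if it also ends before the next
            // existing interval begins, i.e. it intersects nothing
            if (it == existing.end() || toAdd[i].end <= it->start) {
                existing.insert(it, toAdd[i]);
            }
        }
    }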

void TokenizedSentence::generateHash ( boost::shared_ptr< WordMap > wordMap)

Method for generating the hash based on annotations. This method takes into account annotations of type word and named entity. These are encoded and added to the codes list; the annotations corresponding to these tokens are added to the tokens list.

Parameters
    wordMap    word map to use when encoding tokens

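Conceptually, the hashing pass filters the annotation list and encodes each qualifying token. The sketch below illustrates that idea; the accessor names (getType, getValue, getWordCode) and the WORD / NE constants are assumptions for illustration, not Concordia's verified interface:

    // Illustration of the described behaviour; the accessor names and the
    // WORD / NE type constants are hypothetical.
    void sketchGenerateHash(const std::list<TokenAnnotation> & annotations,
                            boost::shared_ptr<WordMap> wordMap,
                            std::vector<INDEX_CHARACTER_TYPE> & codes,
                            std::vector<TokenAnnotation> & tokens) {
        std::list<TokenAnnotation>::const_iterator it;
        for (it = annotations.begin(); it != annotations.end(); ++it) {
            if (it->getType() == TokenAnnotation::WORD ||   // hypothetical
                it->getType() == TokenAnnotation::NE) {     // hypothetical
                codes.push_back(wordMap->getWordCode(it->getValue()));
                tokens.push_back(*it);
            }
        }
    }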

void TokenizedSentence::generateTokens ( )

Method for generating tokens based on annotations. This method takes into account annotations of type word and named entity. Unlike in generateHash, these are not encoded or added to the codes list; the annotations corresponding to these tokens are added to the tokens list.


std::list<TokenAnnotation> TokenizedSentence::getAnnotations ( ) const
inline

Getter for the full annotations list. This method returns all annotations, including those that are not considered in the hash, i.e. stop words and HTML tags.

Returns
annotations list
std::vector<INDEX_CHARACTER_TYPE> TokenizedSentence::getCodes ( ) const
inline

Getter for the codes list. This data is available after calling the generateHash method.

Returns
codes list

std::string TokenizedSentence::getOriginalSentence ( ) const
inline

Getter for the original string sentence, which was used for extracting tokens.

Returns
originalSentence
std::string TokenizedSentence::getSentence ( ) const
inline

Getter for the sentence string, which might have been modified during tokenization.

Returns
sentence

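The two getters diverge once the sentence has been modified, for example by toLowerCase (a sketch of the documented behaviour):

    TokenizedSentence ts("Alice lives in Wonderland");
    ts.toLowerCase();
    ts.getOriginalSentence();  // "Alice lives in Wonderland" (input, unchanged)
    ts.getSentence();          // "alice lives in wonderland" (working copy)
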
std::string TokenizedSentence::getTokenizedSentence ( ) const

Method for getting the tokenized sentence in string format (tokens separated by single spaces).

Returns
tokenized sentence

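For illustration, assuming the sentence has already been tokenized by Concordia (exact token boundaries depend on the tokenizer configuration):

    TokenizedSentence ts("Alice   lives,  in Wonderland");
    // ... tokenization and annotation performed by Concordia ...
    std::string joined = ts.getTokenizedSentence();
    // e.g. "alice lives in wonderland" - tokens joined by single spaces
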
std::vector<TokenAnnotation> TokenizedSentence::getTokens ( ) const
inline

Getter for the tokens list. This method returns only those annotations that are considered in the hash, i.e. words and named entities.

Returns
tokens list

void TokenizedSentence::toLowerCase ( )

Transform the sentence to lower case.

The documentation for this class was generated from the following files:

  tokenized_sentence.hpp