Concordia
This section describes a few example C++ programs that use the Concordia library. You can run them after a successful installation of Concordia (the installation process is covered in Build & installation). Their source code is located in the project's main directory, in the subfolder "examples".
The directory also contains a simple CMakeLists.txt file, which helps to perform compilation and linking of the examples. In order to compile the examples, issue the following commands from within the examples directory:
mkdir build
cd build
cmake ..
make
After these operations, three executables are created in the build directory: first, simple_search and concordia_search. A small config.hpp file is also generated to store the path to the examples folder.
This program only creates the Concordia object and prints the version of the library.
File first.cpp:
#include <concordia/concordia.hpp>
#include <iostream>
#include "config.hpp"

using namespace std;

int main() {
    Concordia concordia("/tmp", EXAMPLES_DIR "/../tests/resources/concordia-config/concordia.cfg");
    cout << concordia.getVersion() << endl;
}
This code snippet shows the basic Concordia functionality - simple substring lookup in the index.
File simple_search.cpp:
#include <concordia/concordia.hpp>
#include <concordia/matched_pattern_fragment.hpp>
#include <concordia/example.hpp>
#include "config.hpp"
#include <boost/shared_ptr.hpp>
#include <iostream>
#include <vector>

using namespace std;

int main() {
    Concordia concordia("/tmp", EXAMPLES_DIR "/../tests/resources/concordia-config/concordia.cfg");

    // adding sentences to index
    concordia.addExample(Example("Alice has a cat", 56));
    concordia.addExample(Example("Alice has a dog", 23));
    concordia.addExample(Example("New test product has a mistake", 321));
    concordia.addExample(Example("This is just testing and it has nothing to do with the above", 14));

    // generating index
    concordia.refreshSAfromRAM();

    // searching
    cout << "Searching for pattern: has a" << endl;
    vector<MatchedPatternFragment> result = concordia.simpleSearch("has a");

    // printing results
    for (vector<MatchedPatternFragment>::iterator it = result.begin();
         it != result.end(); ++it) {
        cout << "Found substring in sentence: " << it->getExampleId()
             << " at offset: " << it->getExampleOffset() << endl;
    }

    // clearing index
    concordia.clearIndex();
}
First, sentences are added to the index along with their integer IDs. The pair (sentence, id) is called an Example. Note that the IDs used in the above code are not consecutive, as there is no such requirement. A sentence ID may come from another data source, e.g. a database, and is used only as sentence meta-information.
After adding the examples, the index must be generated using the method refreshSAfromRAM. Details of this operation are covered in Concept of HDD and RAM index.
The search returns a vector of MatchedPatternFragment objects, which is then printed out. Each matched fragment represents a single match of the pattern. The pattern has to be matched within a single sentence. Information about the match consists of two integer values: the ID of the sentence where the match occurred and the word-level, 0-based offset of the matched pattern in that sentence. The above code should return the following results:
Found substring in sentence: 56 at offset: 1
Found substring in sentence: 23 at offset: 1
Found substring in sentence: 321 at offset: 3
Match (321, 3) represents matching of the pattern "has a" in the sentence 321 ("New test product has a mistake"), starting at position 3, i.e. after the third word, which is "product".
Concordia is equipped with a unique functionality, the so-called Concordia search, which is best suited for use in Computer-Aided Translation systems. This operation is aimed at finding the longest matches from the index that cover the search pattern. Each such match is called a MatchedPatternFragment. Then, out of all matched pattern fragments, the best pattern overlay is computed. A pattern overlay is a set of matched pattern fragments which do not intersect with each other. The best pattern overlay is the overlay that covers the largest part of the pattern with the fewest fragments.
Additionally, the score for this best overlay is computed. The score is a real number between 0 and 1, where 0 indicates that the pattern is not covered at all (i.e. not a single word from this pattern is found in the index). The score 1 represents a perfect match: the pattern is covered completely by just one fragment, which means that the pattern is found in the index as one of the examples.
Moreover, the example below presents the feature of retrieving a tokenized version of the example.
File concordia_searching.cpp:
#include <concordia/concordia.hpp>
#include <concordia/concordia_search_result.hpp>
#include <concordia/matched_pattern_fragment.hpp>
#include <concordia/example.hpp>
#include <concordia/tokenized_sentence.hpp>
#include "config.hpp"
#include <boost/shared_ptr.hpp>
#include <boost/foreach.hpp>
#include <iostream>

using namespace std;

int main() {
    Concordia concordia("/tmp", EXAMPLES_DIR "/../tests/resources/concordia-config/concordia.cfg");

    // adding the first sentence and printing its tokens
    TokenizedSentence ts = concordia.addExample(Example("Alice has a cat", 56));
    cout << "Added the following tokens: " << endl;
    BOOST_FOREACH(TokenAnnotation token, ts.getTokens()) {
        cout << "\"" << token.getValue() << "\"" << " at positions: ["
             << token.getStart() << "," << token.getEnd() << ")" << endl;
    }

    // adding the remaining sentences
    concordia.addExample(Example("Alice has a dog", 23));
    concordia.addExample(Example("New test product has a mistake", 321));
    concordia.addExample(Example("This is just testing and it has nothing to do with the above", 14));

    // generating index
    concordia.refreshSAfromRAM();

    // searching
    cout << "Searching for pattern: Our new test product has nothing to do with computers" << endl;
    boost::shared_ptr<ConcordiaSearchResult> result =
        concordia.concordiaSearch("Our new test product has nothing to do with computers");

    // printing results
    cout << "Printing all matched fragments:" << endl;
    BOOST_FOREACH(MatchedPatternFragment fragment, result->getFragments()) {
        cout << "Matched pattern fragment found. Pattern fragment: ["
             << fragment.getStart() << "," << fragment.getEnd() << "]"
             << " in sentence " << fragment.getExampleId()
             << ", at offset: " << fragment.getExampleOffset() << endl;
    }
    cout << "Best overlay:" << endl;
    BOOST_FOREACH(MatchedPatternFragment fragment, result->getBestOverlay()) {
        cout << "\tPattern fragment: [" << fragment.getStart() << ","
             << fragment.getEnd() << "]"
             << " in sentence " << fragment.getExampleId()
             << ", at offset: " << fragment.getExampleOffset() << endl;
    }
    cout << "Best overlay score: " << result->getBestOverlayScore() << endl;

    // clearing index
    concordia.clearIndex();
}
This program should print:
Added the following tokens:
"alice" at positions: [0,5)
"has" at positions: [6,9)
"a" at positions: [10,11)
"cat" at positions: [12,15)
Searching for pattern: Our new test product has nothing to do with computers
Printing all matched fragments:
Matched pattern fragment found. Pattern fragment: [4,9] in sentence 14, at offset: 6
Matched pattern fragment found. Pattern fragment: [1,5] in sentence 321, at offset: 0
Matched pattern fragment found. Pattern fragment: [5,9] in sentence 14, at offset: 7
Matched pattern fragment found. Pattern fragment: [2,5] in sentence 321, at offset: 1
Matched pattern fragment found. Pattern fragment: [6,9] in sentence 14, at offset: 8
Matched pattern fragment found. Pattern fragment: [3,5] in sentence 321, at offset: 2
Matched pattern fragment found. Pattern fragment: [7,9] in sentence 14, at offset: 9
Matched pattern fragment found. Pattern fragment: [8,9] in sentence 14, at offset: 10
Best overlay:
	Pattern fragment: [1,5] in sentence 321, at offset: 0
	Pattern fragment: [5,9] in sentence 14, at offset: 7
Best overlay score: 0.53695
These results list all the longest matched pattern fragments. The longest is [4,9] (length 5, as the end index is exclusive), which corresponds to the pattern fragment "has nothing to do with", found in sentence 14 at offset 6. However, this longest fragment was not chosen for the best overlay. The best overlay consists of two fragments of length 4: [1,5] "new test product has" and [5,9] "nothing to do with". Notice that if the fragment [4,9] had been chosen for the overlay, it would have eliminated the [1,5] fragment, because the two intersect at pattern word 4 ("has").
The score of this overlay is 0.53695, which can be considered satisfactory enough to serve as an aid for a translator.
The Concordia index consists of four data structures: the hashed index, the markers array, the word map and the suffix array. For searching to work, all of these structures must be present in RAM.
However, because the hashed index, markers array and word map are potentially large and generating them might take a considerable amount of time, they are backed up on hard disk. Each operation of adding to the index writes simultaneously to the hashed index, markers array and word map, both in RAM and on HDD.
The last element of the index, the suffix array, is never backed up on disk but always dynamically generated. The generation is done by the method refreshSAfromRAM(), which builds the suffix array (SA) from the current hashed index, markers array and word map in RAM. Generation of the SA for an index containing 2 000 000 sentences takes about 7 seconds on a personal computer. The reason for not backing up the SA on HDD is that it needs to be freshly generated from scratch every time the index changes. There is no way of incrementally augmenting this structure.
There is another method: loadRAMIndexFromDisk(). It loads hashed index, markers array and word map from HDD to RAM and calls refreshSAfromRAM(). The method loadRAMIndexFromDisk() is called when Concordia starts and the paths of hashed index, markers array and word map point to non-empty files on HDD (i.e. something was added to the index in previous runs of Concordia).
Concordia is configured by means of a configuration file in the libconfig format (http://www.hyperrealm.com/libconfig/). Below is the sample configuration file that comes with the library. Its path is <CONCORDIA_HOME>/tests/resources/concordia-config/concordia.cfg. Note that all the settings in this file are required.
Every option is documented in comments within the configuration file.
#----------------------------
# Concordia configuration file
#---------------------------
#
#-------------------------------------------------------------------------------
# The following settings control the sentence tokenizer mechanism. Tokenizer
# takes into account html tags, substitutes predefined symbols
# with a single space, removes stop words (if the option is enabled), as well as
# named entities and special symbols. All these have to be listed in files.

# File containing all html tags (one per line)
html_tags_path = "<CONCORDIA_HOME>/tests/resources/anonymizer/html_tags.txt"

# File containing all symbols to be replaced by spaces
space_symbols_path = "<CONCORDIA_HOME>/tests/resources/anonymizer/space_symbols.txt"

# If set to true, words from predefined list are removed
stop_words_enabled = "false"

# If stop_words_enabled is true, set the path to the stop words file
#stop_words_path = "<CONCORDIA_HOME>/tests/resources/anonymizer/stop_words.txt"

# File containing regular expressions that match named entities
named_entities_path = "<CONCORDIA_HOME>/tests/resources/anonymizer/named_entities.txt"

# File containing special symbols (one per line) to be removed
stop_symbols_path = "<CONCORDIA_HOME>/tests/resources/anonymizer/stop_symbols.txt"

### eof
After a successful build of the project (see the Build & installation procedure), the concordia-console program is available in the folder build/concordia-console.
The full list of program options is given below:
-h [ --help ]                  Display this message
-c [ --config ] arg            Concordia configuration file (required)
-i [ --index ] arg             Index directory path (required)
-s [ --simple-search ] arg     Pattern to be searched in the index
-n [ --silent ]                While searching, do not output search results
-a [ --anubis-search ] arg     Pattern to be searched by anubis search in the index
-x [ --concordia-search ] arg  Pattern to be searched by concordia search in the index
-r [ --read-file ] arg         File to be read and added to index
-t [ --test ] arg              Run performance and correctness tests on file
From <CONCORDIA_HOME> directory:
Read sentences from file sentences.txt
./build/concordia-console/concordia-console -i /tmp -c tests/resources/concordia-config/concordia.cfg -r ~/sentences.txt
Run concordia search on the index
./build/concordia-console/concordia-console -i /tmp -c tests/resources/concordia-config/concordia.cfg -x "some pattern"