Summarizer.java revision 345
/*
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/** Implements hit summarization. */

/** The number of context terms to display preceding and following matches. */

/** The total number of terms to display in a summary. */

/** Converts text to tokens. */

/**
 * Class Excerpt represents a single passage found in the
 * document, with some appropriate regions highlighted.
 */

/**
 * Return how many unique tokens we have.
 */

/**
 * How many fragments we have.
 */

/**
 * Add a frag to the list.
 */

/**
 * Return an Enum for all the fragments.
 */

/** Returns a summary for the given pre-tokenized text. */

// Simplistic implementation. Finds the first fragments in the document
// containing any query terms.
//
// TODO: check that phrases in the query are matched in the fragment

// Create a SortedSet that ranks excerpts according to
// how many query terms are present. An excerpt is
// a Vector full of Fragments and Highlights

// Iterate through all terms in the document

// If we find a term that's in the query...

// Start searching at a point SUM_CONTEXT terms back,
// and move SUM_CONTEXT terms into the future.

// Iterate from the start point to the finish, adding
// terms all the way. The end of the passage is always
// SUM_CONTEXT beyond the last query-term.
// Iterate through as long as we're before the end of
// the document and we haven't hit the max-number-of-items

// Now grab the hit-element, if present

// We found the series of search-term hits and added
// them (with intervening text) to the excerpt. Now
// we need to add the trailing edge of text.
//
// So if (j < tokens.length) then there is still trailing
// text to add. (We haven't hit the end of the source doc.)
// Add the words since the last hit-term insert.

// Remember how many terms are in this excerpt

// Store the excerpt for later sorting

// Start SUM_CONTEXT places away. The next
// search for relevant excerpts begins at i-SUM_CONTEXT

// If the target text doesn't appear, then we just
// excerpt the first SUM_LENGTH words from the document.

// Now choose the best items from the excerpt set.
// Stop when our Summary grows too large.

// Don't add fragments if it takes us over the max-limit

/**
 * Gets the terms from a query and adds them to the highlight set.
 */

/**
 * Tests Summary-generation. User inputs the name of a
 * text file and a query string.
 */
System.out.println(
    "Usage: java org.apache.nutch.searcher.Summarizer <textfile> <queryStr>");
// Load the text file into a single string.

// Convert the query string into a proper Query
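
The excerpt-selection comments above describe a window-and-rank approach. Because the code itself is missing from this listing, the following is only a minimal, self-contained sketch of that idea and not the Nutch implementation. The class name `SummarySketch`, the `Excerpt` helper, the whitespace tokenizer, the `<b>` highlight markup, the ` ... ` separator, and the values chosen for `SUM_CONTEXT` and `SUM_LENGTH` are all assumptions made for illustration. Unlike the comments, it resumes scanning at the end of the previous excerpt rather than `SUM_CONTEXT` places away.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of window-and-rank summarization; not the Nutch class. */
class SummarySketch {
    static final int SUM_CONTEXT = 2; // assumed value: context terms on each side of a match
    static final int SUM_LENGTH = 8;  // assumed value: max terms in a summary (must exceed SUM_CONTEXT)

    /** Token range [start, end) plus the number of distinct query terms inside it. */
    static final class Excerpt {
        final int start, end, uniqueHits;
        Excerpt(int start, int end, int uniqueHits) {
            this.start = start; this.end = end; this.uniqueHits = uniqueHits;
        }
        int length() { return end - start; }
    }

    /** Finds non-overlapping passages around query-term hits. */
    static List<Excerpt> findExcerpts(String[] tokens, Set<String> query) {
        List<Excerpt> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!query.contains(tokens[i])) { i++; continue; }
            // Start SUM_CONTEXT terms back, but never inside the previous excerpt.
            int start = Math.max(0, i - SUM_CONTEXT);
            if (!found.isEmpty()) start = Math.max(start, found.get(found.size() - 1).end);
            int end = Math.min(tokens.length, i + SUM_CONTEXT + 1);
            Set<String> hits = new HashSet<>();
            int j = start;
            // Each further hit pushes the end to SUM_CONTEXT beyond that hit.
            while (j < end && j - start < SUM_LENGTH) {
                if (query.contains(tokens[j])) {
                    hits.add(tokens[j]);
                    end = Math.min(tokens.length, j + SUM_CONTEXT + 1);
                }
                j++;
            }
            found.add(new Excerpt(start, j, hits.size()));
            i = j; // simplification: resume right after this excerpt
        }
        return found;
    }

    /** Builds a summary string with query terms wrapped in <b>...</b>. */
    static String summarize(String text, String... queryTerms) {
        String[] tokens = text.trim().split("\\s+"); // assumption: whitespace tokenizer, exact match
        Set<String> query = new HashSet<>(Arrays.asList(queryTerms));
        List<Excerpt> excerpts = findExcerpts(tokens, query);
        if (excerpts.isEmpty()) {
            // No query term appears: excerpt the first SUM_LENGTH words.
            int n = Math.min(SUM_LENGTH, tokens.length);
            return String.join(" ", Arrays.copyOfRange(tokens, 0, n));
        }
        // Most distinct query terms first; List.sort is stable, so ties keep document order.
        excerpts.sort((a, b) -> Integer.compare(b.uniqueHits, a.uniqueHits));
        StringBuilder out = new StringBuilder();
        int used = 0;
        for (Excerpt e : excerpts) {
            if (used + e.length() > SUM_LENGTH) break; // stop before exceeding the limit
            if (used > 0) out.append(" ... ");
            for (int k = e.start; k < e.end; k++) {
                if (k > e.start) out.append(' ');
                out.append(query.contains(tokens[k]) ? "<b>" + tokens[k] + "</b>" : tokens[k]);
            }
            used += e.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(summarize(
            "the quick brown fox jumps over the lazy dog near the river bank", "fox"));
        // prints: quick brown <b>fox</b> jumps over
    }
}
```

Ranking by distinct query terms means a later passage containing two different terms is emitted ahead of an earlier passage containing only one. The length check then drops any lower-ranked passage that would push the summary past `SUM_LENGTH`.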