@docholliday Hi I sent you a pre-release version of WPFTS to the chat. Please check it out. Thank you!
Excerpt Text Peculiarities
I have a problem with the excerpt text WPFTS is displaying in the search results. It seems to be selecting some, but not all, of the text surrounding the search term, almost as though some text in the paragraph did not belong there. By way of example, here is some text that WPFTS has indexed correctly from one of our documents:
"Tuborg Brewery with red and green straw hats, so familiar a sight on the streets of Copenhagen. JEOFFRY SPENCE
THE BRISTOL AND SOUTH WALES UNION RAILWAY, John Norris, 32 pp, 5 photo illus, 2 maps, soft covers. RCHS 1985, ISBN 0-901461-38-5 £2.40 + p&p.
The rail journey between Bristol and South Wales was shortened by the Severn Bridge in 1879 and again by the Severn Tunnel in 1886, but an earlier scheme to avoid the detour via Gloucester utilised a combination of ferry and rail travel. For that purpose the Bristol and South Wales Union Railway Company was incorporated in 1857. An existing ferry had to be improved and various difficulties overcome before the new link could be formally opened on 1 January 1864."
When I searched for "ISBN 0-901461-38-5" the excerpt was "RCHS 1985, ISBN 0-901461-38–5 £2.40 + p&p.".
When I searched for "incorporated in 1857" the excerpt was "For that purpose the Bristol and South Wales Union Railway Company was incorporated in 1857."
When I searched for "via Gloucester" the excerpt was "The rail journey between Bristol and South Wales was shortened by the Severn Bridge in 1879 and again by the Severn Tunnel in 1886, but an earlier scheme to avoid the detour via Gloucester utilised a combination of ferry and rail travel."
The text in these three examples was continuous from the index, but much shorter than the 500 characters I had specified in the WPFTS settings.
However, when I searched for "BRISTOL AND SOUTH WALES UNION RAILWAY", the excerpt was "JEOFFRY SPENCE THE BRISTOL AND SOUTH WALES UNION RAILWAY, John Norris, 32 pp, 5 photo illus, 2 maps, soft covers. For that purpose the Bristol and South Wales Union Railway Company was incorporated in 1857." So here there is a whole sentence and more missing out of the middle of the excerpt.
Could you investigate, please?
According to your information, the Smart Excerpt algorithm works correctly for your specific example.
The goal of Smart Excerpt is to show as much relevant information as possible from the content of the found post. At the same time, it tries to avoid showing irrelevant information, since the maximum length of the excerpt is limited.
First, the algorithm breaks the whole text into sentences. Then it selects the most relevant sentences from the list (those that contain the most search words). It also discards sentences that do not contain at least one of the search words (this is your case #3: all the middle sentences were dropped because they did not contain any of the queried words).
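The steps just described can be sketched roughly as follows. This is a minimal Python illustration, not the plugin's actual code: the function names (`split_sentences`, `words`, `relevant_sentences`) and the naive rule that a sentence ends at ".", "!" or "?" followed by whitespace are assumptions made for the example.

```python
import re

def split_sentences(text):
    # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def words(text):
    # Lowercased set of word tokens; punctuation is ignored.
    return set(re.findall(r"\w+", text.lower()))

def relevant_sentences(text, query):
    query_words = words(query)
    scored = []
    for idx, sentence in enumerate(split_sentences(text)):
        hits = len(query_words & words(sentence))
        if hits > 0:  # sentences without any search word are discarded
            scored.append((hits, idx, sentence))
    # Most search words first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [sentence for _, _, sentence in scored]

text = ("2 maps, soft covers. RCHS 1985, ISBN 0-901461-38-5 £2.40 + p&p. "
        "The rail journey was shortened.")
result = relevant_sentences(text, "ISBN 0-901461-38-5")
```

With this splitter, the ISBN line is treated as a sentence of its own because it sits between two periods, which is the behaviour seen in case #1.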
Your first case looks wrong, but the phrase "RCHS 1985, ISBN 0-901461-38–5 £2.40 + p&p." is actually bounded by periods at both ends, so the algorithm assumes it is a complete sentence.
Yes, this is a bad case. It is relatively rare, and I don't know how to improve it.
I could add the previous or the next sentence to "extend" the excerpt up to 500 characters, but I'm not sure that would be a good solution in the general case. Nobody likes filler text added just to make the excerpt longer, right?
By the way, what was your idea or recommendation? What should the ideal excerpt look like for you?
Again, thank you for sharing issues and suggestions!
BRISTOL AND SOUTH WALES UNION RAILWAY
To me, the main problem is that the excerpt text is not a continuous copy of the text in the original document, because it has thrown out a sentence in the middle of the relevant paragraph. For someone reading the excerpt, this missing sentence might be vital in order to understand the context of the search term within the document.
I'm sure there will always be situations where any algorithm creates anomalies, but my current view is that the excerpt should always be a continuous copy of the original.
Where to start and end the excerpt is more tricky, but paragraph breaks might be good indicators, better still a double paragraph break (i.e. a blank line in the text). In the example above, the text above the blank line (containing "Tuborg") belongs to a completely different topic, and is irrelevant to the search term.
It might also help if the specified character limit was more fully used. We have ours set to 500 (I assume this is characters), but in some cases we are getting excerpts of well under 100 characters.
I'm assuming here that WPFTS only returns a single result for a document containing the search term, even though the search term might appear several times in various parts of the document? How does it decide which excerpt to display, and would it be possible to add a flag in the search results to state something like "Search term appears a further x times in the document"?
I'm assuming here that WPFTS only returns a single result for a document containing the search term, even though the search term might appear several times in various parts of the document?
The Smart Excerpt algorithm will try to show all the found sentences as long as they fit within the excerpt length limit (e.g. 500 characters). If it finds just one sentence with the queried word(s), it will show only that sentence. It avoids adding "dummy" sentences around it in order to save screen space for other search results.
How does it decide which excerpt to display,
First, it finds the shortest sentences that contain all the queried words. Then it finds the shortest sentence that contains the remaining words. After that, it adds more sentences if they fit within the length limit.
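One possible reading of this selection order, as a rough Python sketch. It is only an approximation for illustration: `pick_sentences`, `words` and the `limit` parameter are hypothetical names, and the greedy rule (take the sentence covering the most still-missing query words, shortest first on ties) is an assumption about how the description could be implemented.

```python
import re

def words(text):
    # Lowercased set of word tokens; punctuation is ignored.
    return set(re.findall(r"\w+", text.lower()))

def pick_sentences(sentences, query, limit=500):
    query_words = words(query)
    # Only sentences with at least one query word are candidates.
    pool = [(i, s) for i, s in enumerate(sentences) if query_words & words(s)]
    remaining = set(query_words)  # query words not yet shown in the excerpt
    chosen, used = [], 0
    while pool:
        # Prefer the sentence covering the most missing words; break ties by
        # length. Once nothing is missing, this simply picks the shortest.
        best = max(pool, key=lambda c: (len(remaining & words(c[1])), -len(c[1])))
        pool.remove(best)
        cost = len(best[1]) + (1 if chosen else 0)  # +1 for the joining space
        if used + cost > limit:
            continue  # does not fit; try the other candidates
        chosen.append(best[0])
        used += cost
        remaining -= words(best[1])
    # Return the picked sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]

sentences = [
    "Article about cats.",
    "Dogs bark loudly.",
    "They are just beautiful.",
    "Beautiful cats sleep all day in the warm sun.",
]
short_excerpt = pick_sentences(sentences, "beautiful cats", limit=50)
full_excerpt = pick_sentences(sentences, "beautiful cats", limit=500)
```

With a tight limit only the sentence containing both words survives; with a generous limit the two partial matches are added as well, and the sentence with no query word is never shown.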
and would it be possible to add a flag in the search results to state something like "Search term appears a further x times in the document"?
It depends on the query mode you're using. Some people prefer "OR" logic; in that case, sentences that contain at least one of the words may be considered "good". With "AND" logic, we should consider only sentences that contain ALL the words. In real life, however, the search algorithm does not distinguish between sentences while searching. This is intentional, because in complex texts the words can be close together but placed in different sentences. For example, for the phrase "beautiful cats", the following text should be recognized as a good match:
"Article about cats. They are just beautiful." In this case, Smart Excerpt will show both sentences. So it is simply not possible to count the exact number of appearances.
What we can actually do is place something like "(...there are more appearances)" at the end of the excerpt when we were unable to show all the "good" sentences because of the length limit. Good idea. I definitely need to implement this.
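To illustrate why an exact count is hard to define, here is a small Python sketch of the "beautiful cats" example. The helper names are hypothetical and this is not how WPFTS is implemented; it only shows that a document can match under "AND" logic while no single sentence contains all the words.

```python
import re

def words(text):
    # Lowercased set of word tokens; punctuation is ignored.
    return set(re.findall(r"\w+", text.lower()))

def document_matches(text, query, mode="AND"):
    # Matching looks at the whole text and ignores sentence boundaries.
    query_words, text_words = words(query), words(text)
    if mode == "AND":
        return query_words <= text_words
    return bool(query_words & text_words)

def sentences_with_all_words(text, query):
    # Sentences that contain every query word on their own.
    query_words = words(query)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if query_words <= words(s)]

text = "Article about cats. They are just beautiful."
matched = document_matches(text, "beautiful cats", mode="AND")
complete = sentences_with_all_words(text, "beautiful cats")
```

Here `matched` is true but `complete` is empty, so a per-sentence count would report zero appearances for a document that is a valid hit.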
my current view is that the excerpt should always be a continuous copy of the original
It is mostly impossible to keep all the text between two sentences, because two "good" sentences can be far apart (and they often are), so the whole span cannot be shown within the 500-character limit.
Yes, sometimes there is not enough text to understand the context, since sentences can be quite short, as in your case #1. But people can always click on the search result item and find the context in the original post.
paragraph breaks might be good indicators, better still a double paragraph break (i.e. a blank line in the text)
It is often not simple to detect the start or the end of a paragraph, because it depends on the text type. In plain text, you can use either one or two linefeed characters to form paragraphs. In HTML, you may use <p> tags or two <br> tags instead, etc. That's why we still stick with sentence boundaries.
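For the specific cases named above (blank lines in plain text, <p> tags, doubled <br> tags), a paragraph splitter could look roughly like this Python sketch. It is an assumption-laden illustration, not a general solution: `split_paragraphs` is a hypothetical helper, it treats a single linefeed or a single <br> as a line break only, and real-world HTML has many more ways of forming paragraphs.

```python
import re

def split_paragraphs(text):
    # Opening and closing <p> tags become paragraph breaks.
    text = re.sub(r"(?i)</?p\b[^>]*>", "\n\n", text)
    # Two or more consecutive <br> tags also count as a paragraph break.
    text = re.sub(r"(?i)(?:<br\s*/?>\s*){2,}", "\n\n", text)
    # A single <br> is only a line break inside the same paragraph.
    text = re.sub(r"(?i)<br\s*/?>", "\n", text)
    # Paragraphs are separated by at least one blank line.
    parts = re.split(r"\n\s*\n", text)
    # Collapse inner whitespace and drop empty pieces.
    return [" ".join(p.split()) for p in parts if p.strip()]

plain = split_paragraphs("One.\n\nTwo.")
html_p = split_paragraphs("<p>One.</p><p>Two.</p>")
html_br = split_paragraphs("One.<br><br>Two.<br>Three.")
```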
One more good idea I have is to add delimiter characters between sentences that are not adjacent in the original text. For example, if we removed some text between two "good" sentences, we could put "..." there to indicate that part of the text was hidden. That should remove the confusion.
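A minimal Python sketch of this delimiter idea, assuming the excerpt builder knows the original position of each chosen sentence. The function name `join_with_gaps` and the `gap` parameter are hypothetical.

```python
def join_with_gaps(sentences, chosen_indices, gap=" ... "):
    # Join the chosen sentences in document order. Neighbours in the original
    # text are joined with a space; skipped text is marked with the gap string.
    pieces = []
    previous = None
    for index in sorted(chosen_indices):
        if previous is not None:
            pieces.append(" " if index == previous + 1 else gap)
        pieces.append(sentences[index])
        previous = index
    return "".join(pieces)

sentences = [
    "THE BRISTOL AND SOUTH WALES UNION RAILWAY, John Norris.",
    "RCHS 1985.",
    "The rail journey was shortened.",
    "The Bristol and South Wales Union Railway Company was incorporated in 1857.",
]
excerpt = join_with_gaps(sentences, [0, 3])
```

The result reads "THE BRISTOL AND SOUTH WALES UNION RAILWAY, John Norris. ... The Bristol and South Wales Union Railway Company was incorporated in 1857.", which makes it visible that text was left out in the middle.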
What do you think?
Your explanation helps a lot, and I think adding a summary to your documentation would help other people too.
Originally, I had assumed that each occurrence of the search term in the document would produce a separate result with a 500-character excerpt "wrapped around" the search term. However, if I understand correctly, each document containing the search term only returns one result, and the excerpt might include a number of sentences from different parts of the document, these generally being the shortest sentences found (as you've described), up to the 500-character limit?
When these excerpt sentences are not continuous in the document, perhaps they could be numbered and placed in new paragraphs, to make clear that they are not a continuous section of text from the document?
Your idea of adding text at the end of the excerpt to signify when there are further "good" sentences, would also help. Maybe "there are "X" further appearances in other parts of this document". (You could leave the number "X" out if the software can't provide the number).
I just did a quick check of how Google Search results look, and I found that they try to extend the excerpt with other (not "good") sentences even when there is only one "good" sentence and it is short (your case #1).
So yes, I think it would be a great idea to add one or two other sentences near the result to give more context information to the user.
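A rough Python sketch of how such an extension might work, assuming a single "good" sentence and a configurable number of extra sentences. The names `extend_with_context`, `limit` and `max_extra` are hypothetical, as is the choice to try the following sentence before the preceding one.

```python
def extend_with_context(sentences, hit_index, limit=500, max_extra=2):
    # Grow a window around the matching sentence, one neighbour at a time,
    # while the joined text still fits into the length limit.
    low = high = hit_index
    length = len(sentences[hit_index])
    added = 0
    while added < max_extra:
        grew = False
        for candidate in (high + 1, low - 1):  # next sentence, then previous
            if added >= max_extra:
                break
            if not 0 <= candidate < len(sentences):
                continue
            cost = 1 + len(sentences[candidate])  # joining space + sentence
            if length + cost > limit:
                continue
            length += cost
            if candidate > high:
                high = candidate
            else:
                low = candidate
            added += 1
            grew = True
        if not grew:
            break  # no neighbour fits, or we ran out of text
    return " ".join(sentences[low:high + 1])

sentences = [
    "2 maps, soft covers.",
    "RCHS 1985, ISBN 0-901461-38-5 £2.40 + p&p.",
    "The rail journey was shortened.",
]
extended = extend_with_context(sentences, 1, limit=500, max_extra=2)
unchanged = extend_with_context(sentences, 1, limit=500, max_extra=0)
```

Because the window only grows by adjacent sentences, the extended excerpt stays a continuous copy of the original text, and setting `max_extra` to 0 keeps the current short behaviour.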
At the same time, I know that some developers still want to keep excerpts as short as possible to keep more space on the screen for other results. So I think it will be a configurable option in WPFTS Settings.
Also, I would like to add a configurable option to put text like "...other 5 appearances" at the end of the excerpt. Thanks for the idea!
We also have some other things to try with Smart Excerpts (for example, lots of developers have asked me to add page numbers when an excerpt is taken from a paged document), so I guess I will add those features one by one.
Also, the documentation for the plugin needs a significant upgrade, and I will definitely include this explanation there.