How to modify word-break behaviour of the indexing engine?

EpsilonAdmin

Sometimes it is necessary to change how the indexing engine breaks text into separate words (you may know this process as "tokenization").

For example, you may want SKU numbers that contain periods, minus signs, or spaces to be treated by the search engine as a single unit. Say the article number "EBR-001-567" should be found only by the substrings "EBR" or "EBR-001", but never by the substrings "56" or "567".

By default, the WPFTS indexer's tokenizer treats the minus sign as a separator, so it places three different words, "EBR", "001" and "567", in the index. The phrase "EBR 001 567" will still have priority in the search (the engine gives a relevance bonus to whole phrases), but it will also be possible to find "567" or "001" separately, which is unacceptable in our case.

To overcome this problem, we must change the behavior of the tokenizer so that the minus sign is no longer a word separator. Note that this can be solved in at least two ways. The simple one is to exclude the minus sign from the list of separators for the entire text. The complex one is to work out which words are article numbers and turn off the splitting only for them.
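The complex way can be sketched roughly as follows. This is only an illustration under assumptions: the SKU shape (2 to 4 capital letters followed by one or more "-digits" groups), the constant `DEMO_SKU_PATTERN` and the function `demo_split_keeping_skus` are all hypothetical and not part of WPFTS. The idea is to try the SKU shape first and fall back to ordinary words everywhere else.

```php
<?php
// Hypothetical SKU shape, for illustration only:
// 2-4 capital letters followed by one or more "-digits" groups, e.g. EBR-001-567.
const DEMO_SKU_PATTERN = '[A-Z]{2,4}(?:-\d+)+';

/**
 * Split text into words, keeping anything that looks like a SKU in one piece.
 * Hypothetical helper; not a WPFTS function.
 */
function demo_split_keeping_skus(string $text): array
{
    // The order of alternatives matters: the SKU shape is tried first,
    // and plain words (with optional inner apostrophes) are the fallback.
    // The lookarounds make sure a SKU is not cut out of a longer word.
    $rule = '~(?<![\w-])' . DEMO_SKU_PATTERN . '(?![\w-])|\w+(?:\'\w+)*~u';

    $matches = array();
    if (preg_match_all($rule, $text, $matches) === false) {
        return array();
    }

    return $matches[0];
}

// "EBR-001-567" stays whole, while "well-known" is still split on the minus.
print_r(demo_split_keeping_skus('New EBR-001-567 pump, well-known brand'));
```

With this approach ordinary hyphenated words keep their default behavior, at the cost of having to maintain a pattern that matches your real article numbers.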

Here is some sample code that follows the simple way, applied to the post title only.

It uses two regular expressions to split the text. They are very similar but not identical, so look carefully: the rule for post_title additionally allows a minus (\-) inside a word.

add_filter('wpfts_split_to_words', function($words, $text)
{
    // The context stores useful information about the current post and cluster
    global $wpfts_context;

    // Check whether we are in the indexing stage
    if ($wpfts_context && ($wpfts_context->index_post > 0)) {
        // We are indexing now.
        // Apply different rules for post_title and for every other cluster
        if ($wpfts_context->index_token == 'post_title') {
            // The part number can appear in the title, so use the rule where "minus" is NOT a divider
            $rule = "~([\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w][\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w'\-]*[\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w]+|[\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w]+)~u";
        } else {
            // Other parts of the document are split with "minus" treated as a divider
            $rule = "~([\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w][\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w']*[\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w]+|[\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w]+)~u";
        }

        // Finally, split the text
        $matches = array();
        preg_match_all($rule, $text, $matches);
        $words = isset($matches[1]) ? $matches[1] : array();
    }

    return $words;
}, 10, 2); // priority 10; 2 = the callback accepts two arguments (WordPress passes only one by default)

Yes, it may look a bit complex, but there is actually nothing here that is too hard to understand.
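To see what the two rules actually do, they can be tried on a sample string outside WordPress. The snippet below is a minimal sketch: the patterns are the same as in the filter above (the repeated character class is just moved into a variable), while `$demo_chars`, `demo_split_words` and the sample text are made up for illustration.

```php
<?php
// Character class shared by both rules: the extended letter ranges plus \w.
$demo_chars = '\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w';

// Rule used for post_title: apostrophe AND minus are allowed inside a word.
$title_rule = "~([{$demo_chars}][{$demo_chars}'\\-]*[{$demo_chars}]+|[{$demo_chars}]+)~u";

// Rule used for everything else: only the apostrophe is allowed inside a word.
$other_rule = "~([{$demo_chars}][{$demo_chars}']*[{$demo_chars}]+|[{$demo_chars}]+)~u";

/**
 * Apply one of the rules to a text and return the list of words.
 * Hypothetical helper; it mirrors the splitting step of the filter.
 */
function demo_split_words(string $rule, string $text): array
{
    $matches = array();
    if (preg_match_all($rule, $text, $matches) === false) {
        return array();
    }

    return $matches[1];
}

$sample = "Pump EBR-001-567 (don't drop)";

// Title rule: the article number is kept as one word.
print_r(demo_split_words($title_rule, $sample));

// Other rule: the article number falls apart into "EBR", "001", "567".
print_r(demo_split_words($other_rule, $sample));
```

In both cases "don't" stays one word, because the apostrophe is allowed inside a word by both rules; only the handling of the minus differs.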
