How to modify the word-break behavior of the indexing engine?


  • Sometimes it is necessary to change how the indexing engine breaks text into separate words (a process you may know better as "tokenization").

    For example, you may want SKU numbers that contain periods, minus signs, or spaces to be treated by the search engine as a single word. Let's say the article number "EBR-001-567" should be found only by the substrings "EBR" or "EBR-001", but never by the substrings "56" or "567".

    By default, the WPFTS indexer's tokenizer treats the minus sign as a separator, so it places three separate words, "EBR", "001", and "567", in the index. The phrase "EBR 001 567" still has priority in the search (the engine gives a relevance bonus to whole phrases), but "567" or "001" can still be found on their own, which is unacceptable in our case.

    To overcome this problem, we must change the behavior of the tokenizer so that the minus sign is no longer a word separator. This can be solved in at least two ways: a simple one, which removes the minus sign from the list of separators for the entire text, and a complex one, which works out which words are article numbers and turns off splitting only for them.

    Here is some sample code that follows the simple approach, applied to post titles only.

    It uses two regular expressions to split the text. They are very similar but not identical: the rule used for post_title also allows a minus sign inside a word.

    add_filter('wpfts_split_to_words', function($words, $text)
    {
        // The context stores useful information about the current post and cluster
        global $wpfts_context;

        // Check if we are in the indexing stage
        if ($wpfts_context && ($wpfts_context->index_post > 0)) {
            // Ok, we are indexing now
            // Let's apply different rules for post_title and any other cluster
            if ($wpfts_context->index_token == 'post_title') {
                // The part number can be in the title, so use the rule where "minus" is NOT a divider
                $rule = "~([\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w][\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w'\-]*[\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w]+|[\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w]+)~u";
            } else {
                // Other parts of the document will be broken assuming "minus" is a divider
                $rule = "~([\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w][\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w']*[\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w]+|[\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w]+)~u";
            }

            // Finally, let's make the split
            $matches = array();
            preg_match_all($rule, $text, $matches);
            if (isset($matches[1])) {
                $words = $matches[1];
            } else {
                $words = array();
            }
        }

        return $words;
    }, 10, 2); // priority 10; accept 2 arguments so that $text reaches the callback
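
    To see the difference between the two rules in isolation, they can be run outside WordPress on a sample string. This is only a minimal sketch: the helper name demo_split and the sample text are made up for the demonstration and are not part of WPFTS, and the shared character range is factored into a variable for readability.

    ```php
    <?php
    // Character range shared by both rules, copied from the filter above.
    $r = '\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w';

    // Rule used for post_title: "-" is allowed inside a word.
    $title_rule = "~([{$r}][{$r}'\\-]*[{$r}]+|[{$r}]+)~u";
    // Rule used for other clusters: "-" is not allowed inside a word, so it splits.
    $other_rule = "~([{$r}][{$r}']*[{$r}]+|[{$r}]+)~u";

    // Hypothetical helper for this demo only (not a WPFTS function).
    function demo_split($rule, $text) {
        // preg_match_all returns the number of matches, or false on error.
        return preg_match_all($rule, $text, $m) ? $m[1] : array();
    }

    $title_words = demo_split($title_rule, 'Widget EBR-001-567');
    $other_words = demo_split($other_rule, 'Widget EBR-001-567');

    print_r($title_words); // "Widget", "EBR-001-567"
    print_r($other_words); // "Widget", "EBR", "001", "567"
    ```

    With the first rule the article number stays one word, so "567" alone is never placed in the index for the title.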
    

    Yes, the code may look a bit complex, but there is actually nothing in it that is too hard to understand.
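
    The more complex approach mentioned earlier, keeping only article numbers unbroken, could be sketched along the same lines. Everything specific here is an assumption: the SKU shape (2 to 5 capital letters followed by one or more "-NNN" groups), the helper name demo_split_sku_aware, and the sample text are hypothetical and need adjusting to your own catalogue.

    ```php
    <?php
    // Hypothetical SKU shape for this sketch, e.g. "EBR-001-567".
    $sku = '\b[A-Z]{2,5}(?:-\d{3})+\b';

    // General word rule from the filter above, where "-" is a divider.
    $r = '\x{00C0}-\x{1FFF}\x{2C00}-\x{D7FF}\w';
    $word = "[{$r}][{$r}']*[{$r}]+|[{$r}]+";

    // The SKU alternative is listed first, so it wins wherever it can match;
    // everything else falls back to the general word rule.
    $rule = "~({$sku}|{$word})~u";

    // Hypothetical helper for this demo only (not a WPFTS function).
    function demo_split_sku_aware($rule, $text) {
        return preg_match_all($rule, $text, $m) ? $m[1] : array();
    }

    $words = demo_split_sku_aware($rule, 'Well-known part EBR-001-567');
    print_r($words); // "Well", "known", "part", "EBR-001-567"
    ```

    Inside the wpfts_split_to_words callback, such a combined rule could be assigned to $rule in place of the two separate ones. If the text reaching the filter has already been lowercased, the SKU pattern would need the "i" modifier or lowercase letters; this has not been verified here.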
