Even more speedup - A patch for WB 2.8.x

Aug. 2009

The text-excerpts introduced in Website Baker 2.6.7 / 2.7 were a great improvement (at least in my opinion). Unfortunately, that feature slowed down the search noticeably.

Since Website Baker 2.7, I have occasionally tested many alternative procedures, but it seems to be very difficult to find a solution that handles all possible configurations and situations well. The key problem is WB's missing-SET-NAMES-issue!
See this hint on how to bypass this issue.

A simple search-situation takes place if the search string contains only ASCII characters, or when Website Baker uses iso8859-1.

A complex search-situation takes place when the missing-SET-NAMES-issue appears, i.e. when the search string contains non-ASCII characters and Website Baker doesn't use iso8859-1.

In both cases mySQL's charset doesn't matter.
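The distinction between the two situations can be written down as a small check. This is only a sketch in Python, not code from the patch; the function name is_simple_search and the way charset spellings are normalised are assumptions made for illustration.

```python
def is_simple_search(search_string: str, wb_charset: str) -> bool:
    """Return True for a 'simple' search-situation, False for a 'complex' one."""
    # Accept spellings such as "iso8859-1", "ISO-8859-1" or "iso_8859_1".
    normalized = wb_charset.lower().replace("-", "").replace("_", "")
    uses_latin1 = normalized in ("iso88591", "latin1")
    # str.isascii() is True when every character is below U+0080.
    return uses_latin1 or search_string.isascii()


for query, charset in [("published", "utf-8"),
                       ("gemäß", "iso-8859-1"),
                       ("gemäß", "utf-8")]:
    kind = "simple" if is_simple_search(query, charset) else "complex"
    print(f"{query!r} with {charset}: {kind}")
```

Only the last combination (non-ASCII search string, charset other than iso8859-1) is classified as complex.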

The zip-archive below contains some core-replacements with my recent improvements for the search-function.
While WB 2.7 needs the same (very long) time for every search, no matter what text was entered, I managed to increase the speed drastically for simple search-situations and noticeably for complex ones.

Download is available here.

Development

Some results from my local test site

Using Website Baker 2.8.0
(600 pages and about 150MB of WYSIWYG-data, up to 15 excerpts per page)

search words         matches (pages)  time (sec)     remark
-------------------  ---------------  -------------  ------
Nachrichtentext      1                29             -
published            ?                timeout (>30)  -
follows [AND] under  ?                timeout        -
follows [OR] under   ?                timeout        -
überfährt            ?                timeout        using iso-8859-1 in Website Baker
gemäß                ?                timeout        using iso-8859-1
gemäß                ?                timeout        using utf-8, missing-SET-NAMES-issue
überfährt            ?                timeout        using utf-8, missing-SET-NAMES-issue
öäü                  ?                timeout        using utf-8, missing-SET-NAMES-issue
Нихо́н                ?                timeout        using utf-8, missing-SET-NAMES-issue

That really sucks! Something must be broken in 2.8.0 …

However, without text-excerpts all searches finish in less than 1-2 sec!

Step One

After fixing some issues (e.g. a very time-consuming RegEx), the results look a lot better!

search words         matches (pages)  time (sec)  remark
-------------------  ---------------  ----------  ------
Nachrichtentext      1                1           -
published            100              5           -
follows [AND] under  100              5           -
follows [OR] under   130              11          -
überfährt            1                1           using iso-8859-1 in Website Baker
gemäß                100              7           using iso-8859-1
gemäß                100              10          using utf-8, missing-SET-NAMES-issue
überfährt            1                17          using utf-8, missing-SET-NAMES-issue
öäü                  3                18          using utf-8, missing-SET-NAMES-issue
Нихо́н                50               10          using utf-8, missing-SET-NAMES-issue

A patch is available here: patch (You have to register to see the attachment). Note that this patch is already included in the download below.

ToDo

  • Improve the “simple-situations” further.
  • Relax! It's fine to be meticulous, but in “complex-situations” it's OK to miss some matches if that speeds up the search.

Step Two

I relaxed, i.e. I accepted that the search may miss some matches to improve the speed further.

(old value from Step One -> new value)

search words  matches (pages)  time (sec)  remark
------------  ---------------  ----------  ------
gemäß         100              10 -> 5     using utf-8, missing-SET-NAMES-issue
überfährt     1                17 -> 1     using utf-8, missing-SET-NAMES-issue
öäü           3                18 -> 2     using utf-8, missing-SET-NAMES-issue
Нихо́н         50 -> 0          10 -> 1     using utf-8, missing-SET-NAMES-issue

Wow, that's really fast! But look, for Нихо́н there is no match any more!
That means that this will work only for Latin characters (those with an HTML-Entity), while the search will fail completely for e.g. Cyrillic characters!
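Which of the test words consist only of characters that have a named HTML entity can be checked with a short Python sketch. It relies on the standard library table html.entities.codepoint2name, which covers the HTML 4.01 entity set. The function name to_entities is made up for this illustration, and whether the patch rewrites strings in exactly this way is an assumption, not something documented on this page.

```python
from html.entities import codepoint2name  # HTML 4.01 named entities
from typing import Optional


def to_entities(text: str) -> Optional[str]:
    """Rewrite text as pure ASCII using named entities.

    Returns None as soon as a non-ASCII character has no named entity.
    ASCII characters are passed through unchanged (no escaping of & < >).
    """
    parts = []
    for ch in text:
        if ch.isascii():
            parts.append(ch)
        elif ord(ch) in codepoint2name:
            parts.append(f"&{codepoint2name[ord(ch)]};")
        else:
            return None
    return "".join(parts)


for word in ["gemäß", "überfährt", "öäü", "Нихо́н"]:
    print(word, "->", to_entities(word))
```

The three German test words can be expressed in ASCII this way, while the Cyrillic one cannot, which matches the missing match in the table above.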

ToDo

  • Check how to keep these speed improvements while making Cyrillic chars work again.

Step Three

search words  matches (pages)  time (sec)  remark
------------  ---------------  ----------  ------
Нихо́н         50               10          using utf-8, missing-SET-NAMES-issue

I managed to make Cyrillic characters work again. However, for Cyrillic chars the search is as slow as before when the missing-SET-NAMES-issue appears. This is true for all characters without their own HTML-Entities. Some East-European languages may be affected, too (Czech, Polish, Slovak, Hungarian, Estonian, Latvian, Lithuanian).


Conclusion

The fact that Website Baker doesn't make use of mySQL's SET NAMES-function is called the “missing-SET-NAMES-issue”.

Normally, a web-application must tell mySQL which client-charset it uses, so that mySQL knows how to convert that client-charset to the actual column-charset. This is done through the SET NAMES-function.
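For illustration, the statement an application would send right after connecting looks like SET NAMES 'utf8'. The Python sketch below only builds that string. The mapping table and the function name set_names_statement are hypothetical; the mySQL charset names used (utf8, latin1, latin2, cp1251) do exist, but how Website Baker would pick one is an assumption.

```python
# Hypothetical mapping from charset names as chosen in the web application
# to the charset names mySQL understands.
MYSQL_CHARSETS = {
    "utf-8": "utf8",
    "iso-8859-1": "latin1",
    "iso-8859-2": "latin2",
    "windows-1251": "cp1251",
}


def set_names_statement(client_charset: str) -> str:
    """Build the SET NAMES statement for the given client charset.

    Raises KeyError for charsets missing from the (incomplete) table above.
    """
    mysql_name = MYSQL_CHARSETS[client_charset.lower()]
    return f"SET NAMES '{mysql_name}'"


print(set_names_statement("utf-8"))  # prints: SET NAMES 'utf8'
```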

Without this information, mySQL assumes iso-8859-1 (“latin1”) all the time.
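What this assumption does to UTF-8 text can be simulated in Python: the bytes the application sends are UTF-8, but they are read as if every byte were one latin-1 character. This is a simulation with Python's latin-1 codec, not mySQL itself, and both function names are invented for this sketch.

```python
def as_seen_without_set_names(text: str) -> str:
    """Encode text as UTF-8 (what the application sends) and decode the
    bytes as latin-1 (what the server assumes)."""
    return text.encode("utf-8").decode("latin-1")


def matches_ignoring_case(stored: str, query: str) -> bool:
    """Case-insensitive substring match on the misread bytes."""
    seen_stored = as_seen_without_set_names(stored).lower()
    seen_query = as_seen_without_set_names(query).lower()
    return seen_query in seen_stored


# Every non-ASCII character turns into two unrelated latin-1 characters.
print(repr(as_seen_without_set_names("gemäß")))

# Same spelling still matches, because the byte sequences are identical ...
print(matches_ignoring_case("Gemäß Vertrag", "gemäß"))
# ... but upper/lower case of non-ASCII letters no longer correspond.
print(matches_ignoring_case("äpfel", "ÄPFEL"))
```

ASCII letters are unaffected, which is why searches with ASCII-only strings stay simple regardless of the charset.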

With these changes the speed is increased drastically for simple situations, i.e. when Website Baker's default charset is iso-8859-1, or when the search string doesn't contain non-ASCII characters.

When the missing-SET-NAMES-issue appears in combination with Latin characters (áäâ…), the speed is increased noticeably. However, the search may miss some matches (but it's highly unlikely that you will notice that).

Even when the missing-SET-NAMES-issue appears in combination with Cyrillic or East-European charsets, the search is somewhat faster than in Website Baker 2.7.

BTW: What's the best charset for Website Baker?

Short answer: utf-8, nevertheless.

Long answer:
In a perfect world, Website Baker would make use of mySQL's “SET NAMES”-function and would create the database itself and all tables using utf-8 as the default charset. Furthermore, Website Baker would use UTF-8 as its own default charset, too. There would be no other charsets to choose from at all.

But unfortunately, Website Baker doesn't make use of mySQL's “SET NAMES”-function, and doesn't set a character-set while installing the database tables. To cap it all, Website Baker provides a rich set of character-sets the user can choose from. As a result, mySQL doesn't know which character-set Website Baker uses, and assumes iso-8859-1 all the time.

So, isn't the immediate answer iso8859-1? Yes, but iso8859-1 supports only a small set of characters, insufficient for most East-/South-East-European languages, so we can't recommend iso8859-1 in general.
Furthermore, many functions inside Website Baker use UTF-8 internally (e.g. to create text-excerpts and filenames, to highlight search-results, for sorting). Using a different character-set means that the strings concerned have to be converted from WB's character-set to UTF-8 and back all the time.
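Both points, the round trip and the limited character set, can be shown with a short Python sketch. The function names to_utf8 and from_utf8 are invented here and do not refer to Website Baker's own conversion routines.

```python
def to_utf8(raw: bytes, wb_charset: str) -> bytes:
    """Convert bytes stored in the configured charset to UTF-8."""
    return raw.decode(wb_charset).encode("utf-8")


def from_utf8(raw_utf8: bytes, wb_charset: str) -> bytes:
    """Convert UTF-8 bytes back to the configured charset.

    Raises UnicodeEncodeError if a character does not exist in that charset.
    """
    return raw_utf8.decode("utf-8").encode(wb_charset)


stored = "gemäß".encode("iso-8859-1")        # 5 bytes in iso-8859-1
internal = to_utf8(stored, "iso-8859-1")     # 7 bytes in UTF-8
assert from_utf8(internal, "iso-8859-1") == stored

try:
    from_utf8("Нихо́н".encode("utf-8"), "iso-8859-1")
except UnicodeEncodeError as error:
    print("not representable in iso-8859-1:", error.reason)
```

With UTF-8 as the configured charset both functions return their input unchanged, so the conversion step has nothing to do.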

It's advisable to use UTF-8 so that WB doesn't have to convert between DEFAULT_CHARSET and UTF-8 all the time.

Hints

  • To measure Website Baker's time consumption just go to Settings –> Website Footer: and enter [PROCESS_TIME]. This will display a timer in the page footer.
  • If you need a really fast search (below ~1sec), set Settings –> Max lines of excerpt: to 0.

Installation

Download the zip-archive, unzip manually, and copy all files to your server.
You must copy all files from wb/search/, wb/modules/wysiwyg/search.php, and wb/modules/news/search.php.
You may copy wb/modules/guestbook/search.php and wb/modules/manual/search.php.

Please rename or back up the original files first.
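The backup can also be scripted. The following Python sketch copies the search directory and the four module files into a separate backup directory while keeping the directory layout. The function name backup_search_files and any paths passed to it are assumptions; adjust them to your installation.

```python
import shutil
from pathlib import Path

# Paths relative to the Website Baker root, as named in the notes above.
SEARCH_FILES = [
    "search",                        # the whole search directory
    "modules/wysiwyg/search.php",
    "modules/news/search.php",
    "modules/guestbook/search.php",
    "modules/manual/search.php",
]


def backup_search_files(wb_root, backup_root):
    """Copy the original files into backup_root and return the copied paths.

    Entries that do not exist (e.g. a module that is not installed) are
    skipped. backup_root should be a fresh directory: copytree() refuses
    to overwrite an existing target directory.
    """
    copied = []
    for rel in SEARCH_FILES:
        src = Path(wb_root) / rel
        dst = Path(backup_root) / rel
        if src.is_dir():
            shutil.copytree(src, dst)
            copied.append(dst)
        elif src.is_file():
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)   # copy2 also preserves timestamps
            copied.append(dst)
    return copied
```

It would be called with the Website Baker root and a backup location of your choice, e.g. backup_search_files("/path/to/wb", "/path/to/wb_backup"), where both paths are placeholders.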

This core-replacement is written for Website Baker 2.8.x, but it works with 2.7, too.

Download

This is experimental!

Remember to back up

  • all files from the search-directory
  • the file modules/wysiwyg/search.php
  • the file modules/news/search.php
  • the file modules/guestbook/search.php
  • the file modules/manual/search.php

Attention!

The file modules/guestbook/search.php is for Guestbook v2.8.01 or above only!

Do not copy this file if using an older Guestbook module.

search_experimental_07.zip (34.49 KiB, last updated 24 April 2010, compatible with WB 2.7.x and 2.8.x)

BTW: There is a more recent patch available: “Adding correct Character-Set handling to Website Baker”, but that patch requires changes to various core-files, too.


Archive:  search_experimental_07.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
       0  Stored        0   0% 2009-08-30 14:24 00000000  wb/
       0  Stored        0   0% 2009-08-30 17:42 00000000  wb/search/
   29056  Defl:N     7328  75% 2010-04-24 13:43 7fb453f5  wb/search/search.php
   18138  Defl:N     6141  66% 2010-04-24 13:49 fa218c2d  wb/search/search_modext.php
   64194  Defl:N    12203  81% 2009-08-30 17:42 05dabfe9  wb/search/search_convert_ul.php
    3444  Defl:N     1447  58% 2010-04-24 13:45 9267c91a  wb/search/search_convert.php
       0  Stored        0   0% 2009-08-30 15:56 00000000  wb/modules/
       0  Stored        0   0% 2009-08-30 14:24 00000000  wb/modules/wysiwyg/
    1985  Defl:N     1000  50% 2009-08-30 14:24 090b99af  wb/modules/wysiwyg/search.php
       0  Stored        0   0% 2009-08-30 15:56 00000000  wb/modules/manual/
    3574  Defl:N     1247  65% 2009-08-30 15:56 0b1b0bfe  wb/modules/manual/search.php
       0  Stored        0   0% 2009-08-30 14:24 00000000  wb/modules/news/
    5291  Defl:N     1872  65% 2009-08-30 14:24 8372a75b  wb/modules/news/search.php
       0  Stored        0   0% 2009-08-30 15:30 00000000  wb/modules/guestbook/
    3295  Defl:N     1503  54% 2009-08-30 15:30 e1e35d07  wb/modules/guestbook/search.php
--------          -------  ---                            -------
  128977            32741  75%                            15 files
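To compare a downloaded archive against the listing above, the names, sizes and CRC-32 values can be read with Python's zipfile module. This is a sketch; the function names are made up, and the archive file name in the usage note is the one given on this page.

```python
import zipfile


def list_archive(path_or_file):
    """Return (name, uncompressed size, CRC-32 as 8 hex digits) per entry."""
    with zipfile.ZipFile(path_or_file) as archive:
        return [(info.filename, info.file_size, f"{info.CRC:08x}")
                for info in archive.infolist()]


def first_corrupt_member(path_or_file):
    """Re-read every member and verify its CRC.

    Returns the name of the first bad member, or None if all are intact.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        return archive.testzip()
```

Calling list_archive("search_experimental_07.zip") on an intact download should yield entries that agree with the Length, CRC-32 and Name columns of the listing.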

projects/new_search/even_more_speedup.txt · Last modified: 2010-05-30 16:05 by Thomas "thorn" Hornik