Aug.2009
Text excerpts, introduced in Website Baker 2.6.7 / 2.7, were a great improvement (at least in my opinion). Unfortunately, that feature slowed down the search noticeably.
Since Website Baker 2.7, I have occasionally tested many alternative approaches, but it seems to be very difficult to find a solution that handles all possible configurations and situations well. The key problem is WB's missing-SET-NAMES-issue!
See this hint on how to bypass this issue.
A simple search-situation occurs if the search string contains only ASCII characters, or when Website Baker uses iso-8859-1.
A complex search-situation occurs when the missing-SET-NAMES-issue appears, that is, when the search string contains non-ASCII characters and Website Baker doesn't use iso-8859-1.
In both cases mySQL's charset doesn't matter.
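The distinction between the two situations can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not code from the patch; the function name `is_simple_search` and the way the charset name is normalised are my own assumptions.

```python
def is_simple_search(search_string: str, wb_charset: str) -> bool:
    """Return True for the 'simple' search situation described above.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Accept spellings such as "iso8859-1", "ISO-8859-1" or "iso_8859_1".
    normalised = wb_charset.lower().replace("-", "").replace("_", "")
    uses_latin1 = normalised == "iso88591"
    # str.isascii() is True when every character is 7-bit ASCII.
    return search_string.isascii() or uses_latin1


for words, charset in [
    ("published", "utf-8"),
    ("gemäß", "iso-8859-1"),
    ("gemäß", "utf-8"),
]:
    kind = "simple" if is_simple_search(words, charset) else "complex"
    print(f"{words!r} with {charset}: {kind}")
```

Only the last combination (non-ASCII search string, Website Baker not on iso-8859-1) ends up in the complex situation.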
The zip-archive below contains some core-replacements with my recent improvements for the search-function.
With WB 2.7 the search needs the same (very long) time for every search, no matter what text was entered. I managed to increase the speed drastically for simple search-situations and noticeably for complex ones.
Download is available here.
Using Website Baker 2.8.0
(600 pages and about 150MB of WYSIWYG-data, up to 15 excerpts per page)
| search words | matches (pages) | time (sec) | remark |
|---|---|---|---|
| Nachrichtentext | 1 | 29 | - |
| published | ? | timeout (>30) | - |
| follows [AND] under | ? | timeout | - |
| follows [OR] under | ? | timeout | - |
| überfährt | ? | timeout | using iso-8859-1 in Website Baker |
| gemäß | ? | timeout | using iso-8859-1 |
| gemäß | ? | timeout | using utf-8, missing-SET-NAMES-issue |
| überfährt | ? | timeout | using utf-8, missing-SET-NAMES-issue |
| öäü | ? | timeout | using utf-8, missing-SET-NAMES-issue |
| Нихо́н | ? | timeout | using utf-8, missing-SET-NAMES-issue |
That really sucks! Something must be broken in 2.8.0 …
However, without text-excerpts all searches complete in under 1-2 sec!
After fixing some issues (e.g. a very time-consuming RegEx), the results look a lot better!
| search words | matches (pages) | time (sec) | remark |
|---|---|---|---|
| Nachrichtentext | 1 | 1 | - |
| published | 100 | 5 | - |
| follows [AND] under | 100 | 5 | - |
| follows [OR] under | 130 | 11 | - |
| überfährt | 1 | 1 | using iso-8859-1 in Website Baker |
| gemäß | 100 | 7 | using iso-8859-1 |
| gemäß | 100 | 10 | using utf-8, missing-SET-NAMES-issue |
| überfährt | 1 | 17 | using utf-8, missing-SET-NAMES-issue |
| öäü | 3 | 18 | using utf-8, missing-SET-NAMES-issue |
| Нихо́н | 50 | 10 | using utf-8, missing-SET-NAMES-issue |
A patch is available here: patch. However, this patch is already included in the download below.
I relaxed the matching, i.e. I accepted that the search may miss some matches in order to improve the speed further.
| search words | matches (pages) | time (sec) | remark |
|---|---|---|---|
| gemäß | 100 | | using utf-8, missing-SET-NAMES-issue |
| überfährt | 1 | | using utf-8, missing-SET-NAMES-issue |
| öäü | 3 | | using utf-8, missing-SET-NAMES-issue |
| Нихо́н | | | using utf-8, missing-SET-NAMES-issue |
Wow, that's really fast! But look, for Нихо́н there is no match any more!
That means that this will work only for Latin characters (those with an HTML-Entity), while the search will fail completely for e.g. Cyrillic characters!
| search words | matches (pages) | time (sec) | remark |
|---|---|---|---|
| Нихо́н | 50 | 10 | using utf-8, missing-SET-NAMES-issue |
I managed to make Cyrillic characters work again. However, for Cyrillic characters the search is as slow as before when the missing-SET-NAMES-issue appears. This is true for all characters without their own HTML-Entities. Some East-European languages may be affected, too (Czech, Polish, Slovak, Hungarian, Estonian, Latvian, Lithuanian).
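To get a feeling for which search strings fall into the slow group, here is a small Python sketch. It assumes that "has an HTML-Entity" means "has a named HTML 4 entity", which is what Python's `html.entities.codepoint2name` table covers; whether the patch uses exactly that table is my assumption, and the function names are hypothetical.

```python
from html.entities import codepoint2name


def has_named_entity(char: str) -> bool:
    """True if the character has a named HTML 4 entity such as &auml;."""
    return ord(char) in codepoint2name


def needs_slow_path(search_string: str) -> bool:
    """True if any non-ASCII character lacks a named HTML 4 entity.

    Hypothetical helper: it mirrors the rule described in the text,
    not the actual implementation in the patch.
    """
    return any(
        not ch.isascii() and not has_named_entity(ch)
        for ch in search_string
    )


for words in ["published", "gemäß", "öäü", "Нихо́н", "łódź"]:
    group = "slow" if needs_slow_path(words) else "fast"
    print(f"{words!r}: {group}")
```

Under this assumption, German umlauts and ß stay in the fast group, while Cyrillic text and letters such as the Polish ł land in the slow one.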
The fact that Website Baker doesn't make use of mySQL's SET NAMES-function is called the “missing-SET-NAMES-issue”.
Normally, a web-application must tell mySQL what client-charset it uses, so that mySQL knows how to convert that client-charset to the actual column-charset. This is done through the SET NAMES-function.
Without this information, mySQL assumes iso-8859-1 (“latin1”) all the time.
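What this misinterpretation does to a string can be reproduced outside the database. The sketch below uses Python's `iso-8859-1` codec as a stand-in for the charset mySQL assumes; it only illustrates the byte-level effect and says nothing about how Website Baker or the patch handle it. The function name is hypothetical.

```python
def as_seen_by_latin1_connection(text: str) -> str:
    """Reinterpret the UTF-8 bytes of `text` as if they were ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


word = "gem\u00e4\u00df"  # "gemäß"
misread = as_seen_by_latin1_connection(word)

# Each non-ASCII character becomes two "latin1" characters,
# so the misread string is longer than the original.
print(len(word), len(misread))

# Reversing the misinterpretation recovers the original text.
restored = misread.encode("iso-8859-1").decode("utf-8")
print(restored == word)
```

A plain ASCII word passes through unchanged, which is why the search string matters as much as the configured charset.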
With these changes the speed is increased drastically for simple situations, i.e. when Website Baker's default charset is iso-8859-1, or when the search string doesn't contain non-ASCII characters.
When the missing-SET-NAMES-issue appears in combination with Latin characters (áäâ…), the speed is increased noticeably. However, the search may miss some matches (but it's highly unlikely that you will notice that).
Even when the missing-SET-NAMES-issue appears in combination with Cyrillic or East-European charsets, the search is somewhat faster than in Website Baker 2.7.
Short answer: utf-8, nevertheless.
Long answer:
In a perfect world Website Baker would make use of mySQL's “SET NAMES”-function, and would create the database itself and all tables using utf-8 as the default charset. Furthermore, Website Baker would use UTF-8 as its default charset, too. There would be no other charsets to choose from at all.
But unfortunately, Website Baker doesn't make use of mySQL's “SET NAMES”-function, and doesn't set a character-set while installing the database tables. To cap it all, Website Baker provides a rich set of character-sets the user can choose from. As a result, mySQL doesn't know which character-set Website Baker uses, and assumes iso-8859-1 all the time.
So, isn't the immediate answer iso-8859-1? Yes, but iso-8859-1 supports only a small set of characters, insufficient for most East-/South-East-European languages, so we can't recommend iso-8859-1 in general.
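The limitation of iso-8859-1 is easy to check. The following Python sketch simply tries to encode a few sample words; the helper name `fits_charset` and the sample words are mine and only serve as an illustration.

```python
def fits_charset(text: str, charset: str) -> bool:
    """True if every character of `text` can be encoded in `charset`."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


samples = ["gemäß", "Нихон", "łódź", "žluťoučký"]
for sample in samples:
    latin1 = fits_charset(sample, "iso-8859-1")
    utf8 = fits_charset(sample, "utf-8")
    print(f"{sample!r}: iso-8859-1={latin1}, utf-8={utf8}")
```

The German word fits into iso-8859-1, while the Russian, Polish and Czech samples do not; all of them fit into utf-8.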
Furthermore, many functions inside Website Baker use UTF-8 internally (e.g. to create text-excerpts and filenames, to highlight search-results, and for sorting). Using a different character-set means that the strings concerned have to be converted from WB's character-set to UTF-8 and back again all the time.
Using UTF-8 is therefore advised, so that WB doesn't have to convert between DEFAULT_CHARSET and UTF-8 all the time.
Go to Settings –> Website Footer: and enter [PROCESS_TIME]. This will display a timer in the page footer.
Set Settings –> Max lines of excerpt: to 0.
Download the zip-archive, unzip manually, and copy all files to your server.
You must copy all files from wb/search/, wb/modules/wysiwyg/search.php, and wb/modules/news/search.php.
You may copy wb/modules/guestbook/search.php and wb/modules/manual/search.php.
Please rename or back up the original files first.
This core-replacement is written for Website Baker 2.8.x, but it works with 2.7, too.
search-directory
modules/wysiwyg/search.php
modules/news/search.php
modules/guestbook/search.php
modules/manual/search.php
The file modules/guestbook/search.php is for Guestbook v2.8.01 or above only!
Do not copy this file if using an older Guestbook module.
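If you prefer to script the backup-and-copy step instead of doing it by hand, it could look roughly like the Python sketch below. It is not part of the download. The function name `install`, the `.orig` backup suffix and both directory arguments are hypothetical; only the file lists are taken from the text above.

```python
import shutil
from pathlib import Path

# Relative to the "wb/" folder of the unzipped archive (from the lists above).
REQUIRED = ["modules/wysiwyg/search.php", "modules/news/search.php"]
OPTIONAL = ["modules/guestbook/search.php", "modules/manual/search.php"]


def install(unzipped_wb: Path, wb_root: Path, include_optional: bool = False) -> list:
    """Back up existing files as *.orig, then copy the replacements over them."""
    # Everything below search/ is required, plus the listed module files.
    relative = [
        p.relative_to(unzipped_wb)
        for p in sorted((unzipped_wb / "search").rglob("*"))
        if p.is_file()
    ]
    relative += [Path(name) for name in REQUIRED]
    if include_optional:
        relative += [Path(name) for name in OPTIONAL]

    copied = []
    for rel in relative:
        source, target = unzipped_wb / rel, wb_root / rel
        if not source.is_file():
            continue  # file is not in this archive, nothing to copy
        if target.exists():
            backup = target.with_name(target.name + ".orig")
            if not backup.exists():  # never overwrite an earlier backup
                shutil.copy2(target, backup)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        copied.append(target)
    return copied
```

A call such as `install(Path("unzipped/wb"), Path("/path/to/your/wb"))` would copy only the required files. Pass `include_optional=True` only if the optional modules are installed, and keep the Guestbook version note above in mind.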
search_experimental_07.zip (34.49 KiB; last updated 24 April 2010; compatible with WB 2.7.x and 2.8.x)
BTW: There is a more recent patch available: “Adding correct Character-Set handling to Website Baker”, but that patch requires changes to various core-files, too.
Archive: search_experimental_07.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
0 Stored 0 0% 2009-08-30 14:24 00000000 wb/
0 Stored 0 0% 2009-08-30 17:42 00000000 wb/search/
29056 Defl:N 7328 75% 2010-04-24 13:43 7fb453f5 wb/search/search.php
18138 Defl:N 6141 66% 2010-04-24 13:49 fa218c2d wb/search/search_modext.php
64194 Defl:N 12203 81% 2009-08-30 17:42 05dabfe9 wb/search/search_convert_ul.php
3444 Defl:N 1447 58% 2010-04-24 13:45 9267c91a wb/search/search_convert.php
0 Stored 0 0% 2009-08-30 15:56 00000000 wb/modules/
0 Stored 0 0% 2009-08-30 14:24 00000000 wb/modules/wysiwyg/
1985 Defl:N 1000 50% 2009-08-30 14:24 090b99af wb/modules/wysiwyg/search.php
0 Stored 0 0% 2009-08-30 15:56 00000000 wb/modules/manual/
3574 Defl:N 1247 65% 2009-08-30 15:56 0b1b0bfe wb/modules/manual/search.php
0 Stored 0 0% 2009-08-30 14:24 00000000 wb/modules/news/
5291 Defl:N 1872 65% 2009-08-30 14:24 8372a75b wb/modules/news/search.php
0 Stored 0 0% 2009-08-30 15:30 00000000 wb/modules/guestbook/
3295 Defl:N 1503 54% 2009-08-30 15:30 e1e35d07 wb/modules/guestbook/search.php
-------- ------- --- -------
128977 32741 75% 15 files
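To compare a downloaded archive against a listing like the one above, the names, sizes and CRC-32 values can be read with Python's standard `zipfile` module. This is just a convenience sketch; the function name `list_archive` is hypothetical.

```python
import zipfile


def list_archive(path) -> list:
    """Return (name, size, compressed size, CRC-32 hex) for each archive entry."""
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and returns the first bad name, or None.
        bad = archive.testzip()
        if bad is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad}")
        return [
            (info.filename, info.file_size, info.compress_size, f"{info.CRC:08x}")
            for info in archive.infolist()
        ]
```

For example, `for row in list_archive("search_experimental_07.zip"): print(row)` prints one tuple per entry, whose CRC-32 column can be compared with the listing.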