OCLC Support

WorldCat Discovery release notes, Thai language search and sort

 

Release Date: August 2024

Introduction

The following release notes are for Thai language searching and sorting support in WorldCat Discovery, completed August 2024.

WorldCat Discovery now includes the following enhancements so that searching and sorting in Standard Thai behave as a native speaker expects:

Search

  • We tokenize phrases to identify individual Thai words when building and parsing queries for word indexes.
  • We maintain a list of common Thai words that we combine with adjacent words when we build word indexes and parse word index queries.

Sort

  • We apply Unicode collation to sort Thai script along with characters from all other writing systems.

These Thai language searching and sorting improvements complement WorldCat Discovery’s Thai language user interface.

Standard Thai search and sort features 

Search

Your library’s users searching for words or phrases in Standard Thai now get search results that meet the expectations of native Thai speakers. This is achieved through:

Normalization

We do not apply normalization to any Thai indexes. We always treat all Thai vowel symbols and diacritics as significant and never ignore them.

Exception: Characters with tone marks: composed/decomposed

We index Thai characters that have tone marks in both composed and decomposed forms.

Example:

[Figure: a Thai character with a tone mark shown in composed and decomposed forms]

A character entered in composed form in a search term matches both the composed and decomposed forms in the index. The reverse also holds: a character entered in decomposed form matches both the decomposed and composed forms.
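The two-way matching described above can be sketched as follows. This is a minimal illustration, not OCLC's implementation. It assumes that "composed" and "decomposed" correspond to a character and its Unicode compatibility decomposition, as with SARA AM (U+0E33), which decomposes to NIKHAHIT (U+0E4D) plus SARA AA (U+0E32); the release notes do not spell out the exact forms involved. The function names `form_variants` and `matches` are hypothetical.

```python
import unicodedata


def form_variants(term: str) -> set[str]:
    """Return the term as typed plus its compatibility-decomposed (NFKD) form.

    Assumption: NFKD stands in for the "decomposed" form in the release notes.
    """
    return {term, unicodedata.normalize("NFKD", term)}


def matches(query: str, indexed: str) -> bool:
    """True if any form of the query equals any form of the indexed term."""
    return not form_variants(query).isdisjoint(form_variants(indexed))


# The same word with a tone mark, written two ways.
composed = "\u0e19\u0e49\u0e33"          # consonant + tone mark + SARA AM
decomposed = "\u0e19\u0e49\u0e4d\u0e32"  # consonant + tone mark + NIKHAHIT + SARA AA

print(matches(composed, decomposed))  # composed query, decomposed index entry
print(matches(decomposed, composed))  # decomposed query, composed index entry
```

Only the form of the character is expanded here. Vowel symbols and diacritics are never dropped, so terms that differ in those characters still do not match each other.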

Tokenization

Thai script is written without spaces between words. To recognize and index individual Thai words, we apply tokenization when building all Thai word indexes and when parsing word index queries. We do not apply tokenization to phrase indexes or phrase index queries.

Indexing individual Thai words enables word index searching, which retrieves records that contain the query terms anywhere in the relevant indexed fields.

Example

Query

  • Thai: ti:สนทนาภาษาจีน
  • English translation: Chinese conversation

Tokenized query

  • สนทนา (talk/converse)
  • ภาษา (language)
  • จีน (China)

Matching record title

  • Thai: สนทนา 3 ภาษา ไทย-อังกฤษ-จีน โต้ตอบอย่างมั่นใจ พิชิตงานบริการในโรงแรม
  • English translation: Conversation in three languages: Thai-English-Chinese. Respond confidently and conquer service jobs in hotels.

Tokenized record title

  • Thai: สนทนา 3 ภาษา ไทย อังกฤษ จีน โต้ตอบ อย่าง มั่นใจ พิชิต งาน บริการ บริการ_ใน ใน ใน_โรงแรม โรงแรม
  • English translation: Talk/Converse 3 language Thai England China respond at/manner confident conquer work service service_in in in_hotel hotel
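To make the tokenization step concrete, here is a minimal sketch of dictionary-based longest-match word segmentation followed by a word index match. The release notes do not say which tokenizer WorldCat Discovery uses, so the technique, the tiny `DICTIONARY`, and the function names `tokenize` and `word_index_match` are all assumptions chosen for illustration.

```python
# Hypothetical mini-dictionary; a real tokenizer would rely on a full Thai lexicon.
DICTIONARY = {"สนทนา", "ภาษา", "จีน", "ไทย", "อังกฤษ"}
MAX_LEN = max(len(word) for word in DICTIONARY)
SEPARATORS = {" ", "-"}


def tokenize(text: str) -> list[str]:
    """Split text into words by taking the longest dictionary match at each position."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] in SEPARATORS:
            i += 1
            continue
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in DICTIONARY:
                tokens.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: keep the character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


def word_index_match(query: str, title: str) -> bool:
    """True if every query word occurs somewhere in the tokenized title."""
    return set(tokenize(query)) <= set(tokenize(title))


print(tokenize("สนทนาภาษาจีน"))  # ['สนทนา', 'ภาษา', 'จีน']
print(word_index_match("สนทนาภาษาจีน", "สนทนา 3 ภาษา ไทย-อังกฤษ-จีน"))  # True
```

The match succeeds even though the three query words are not adjacent in the title, which is the behavior the example above describes for word indexes.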

Common words

We do not treat the common Standard Thai words listed below as stop words, which would mean ignoring them for indexing and matching. Instead, we combine them with adjacent words when we build word indexes and parse word index queries.

When a common Thai word is ignored (treated as a stop word), the adjacent word that remains can mean something different on its own than it does next to the common word. That difference can lead to the retrieval of irrelevant records. Combining the word with the adjacent common word disambiguates its meaning and improves search precision by reducing the retrieval of irrelevant records.

We apply the above processing of common words when building and searching the following indexes:

  • se: Series
  • ti: Title
  • kw: Keyword

Example

Common word treated as a stop word

  • Query ti:มาตราการ (measures/procedure)
  • Tokenized into มาตรา and การ
  • The common word การ is removed as a stop word leaving only มาตรา
  • มาตรา has a different meaning (section or clause of law) from มาตราการ (measures/procedure) and therefore retrieves records that are not relevant to the query ti:มาตราการ

Common word combined with an adjacent word

  • Query ti:มาตราการ (measures/procedure)
  • Tokenized into มาตรา_การ (because การ is defined as a common word)
  • Titles of records containing มาตราการ are tokenized into มาตรา มาตรา_การ การ
  • Only records containing มาตราการ are retrieved.
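The difference between the two treatments can be sketched in a few lines. This is an illustration under stated assumptions, not OCLC's code: the input is already tokenized, `COMMON_WORDS` holds only two entries from the list below, and the function names are hypothetical. The index-side output follows the tokenized titles shown in these notes; the query-side rule (emit only the combined term when a word sits next to a common word) is inferred from the single query example above.

```python
COMMON_WORDS = {"การ", "ใน"}  # two entries from the Thai common word list


def is_joined(a: str, b: str) -> bool:
    """Adjacent tokens are combined when either one is a common word."""
    return a in COMMON_WORDS or b in COMMON_WORDS


def index_terms(tokens: list[str]) -> list[str]:
    """Index every token, plus a combined term for each pair involving a common word."""
    terms = []
    for i, tok in enumerate(tokens):
        terms.append(tok)
        if i + 1 < len(tokens) and is_joined(tok, tokens[i + 1]):
            terms.append(f"{tok}_{tokens[i + 1]}")
    return terms


def query_terms(tokens: list[str]) -> list[str]:
    """Assumed query rule: tokens next to a common word appear only in combined form."""
    terms = []
    for i, tok in enumerate(tokens):
        joined_prev = i > 0 and is_joined(tokens[i - 1], tok)
        joined_next = i + 1 < len(tokens) and is_joined(tok, tokens[i + 1])
        if not (joined_prev or joined_next):
            terms.append(tok)
        if joined_next:
            terms.append(f"{tok}_{tokens[i + 1]}")
    return terms


def stop_word_terms(tokens: list[str]) -> list[str]:
    """The rejected alternative: drop common words entirely."""
    return [tok for tok in tokens if tok not in COMMON_WORDS]


query = ["มาตรา", "การ"]      # tokenized form of the query in the example
relevant = ["มาตรา", "การ"]   # a title containing the full query word
irrelevant = ["มาตรา"]        # a hypothetical title containing only the first word

print(index_terms(relevant))  # ['มาตรา', 'มาตรา_การ', 'การ']
print(query_terms(query))     # ['มาตรา_การ']
print(set(query_terms(query)) <= set(index_terms(irrelevant)))          # False
print(set(stop_word_terms(query)) <= set(stop_word_terms(irrelevant)))  # True
```

The last two lines show the point of the design: with combined terms the irrelevant title is not retrieved, whereas the stop word approach would retrieve it.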

Thai common word list

กว่า

กับ

การ

ก็

ขณะ

ของ

ความ

คือ

จะ

จึง

ซึ่ง

ด้วย

ตั้งแต่

ต่างๆ

ถึง

ถ้า

ทั้ง

ทั้งนี้

ที่

นั้น

นี้

ว่า

หรือ

หาก

อะไร

อาจ

อีก

เช่น

เนื่องจาก

เป็นการ

เพื่อ

เมื่อ

เลย

เอง

แต่

และ

แล้ว

โดย

ใน

ไว้

Sort

We sort Standard Thai author, title, and call number fields using the default collation order of the Unicode Collation Algorithm, which we apply for all scripts and languages.
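One reason collation matters for Thai is that the five leading vowels (U+0E40 to U+0E44) are stored before the consonant they follow in dictionary order, so sorting by raw code points files such words in the wrong place. The sketch below illustrates only that single effect. It is not the Unicode Collation Algorithm, it ignores the lower-level weights the algorithm assigns to tone marks and other diacritics, and the function name `thai_sort_key` is hypothetical.

```python
# The five Thai leading vowels, which are stored before their consonant.
PREVOWELS = "\u0e40\u0e41\u0e42\u0e43\u0e44"


def thai_sort_key(text: str) -> str:
    """Swap each leading vowel with the character after it, then compare by code point."""
    chars = list(text)
    i = 0
    while i + 1 < len(chars):
        if chars[i] in PREVOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


words = ["ขา", "เก", "กา"]
print(sorted(words))                     # ['กา', 'ขา', 'เก']  (code point order)
print(sorted(words, key=thai_sort_key))  # ['กา', 'เก', 'ขา']  (leading vowel filed under its consonant)
```

A production system would use a full collation library rather than a hand-written key like this one.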

Alphabetical sorting is available in WorldCat Discovery when using the following features:

Sort search results:

  • Author (A-Z)
  • Title (A-Z)

The Author search filter expanded to show more:

  • The Author search filter initially displays authors sorted by matching record count, highest first. Selecting the Show More option to expand the filter sorts the authors alphabetically.
  • If the expanded and alphabetically sorted view includes author names in multiple scripts, names in Latin script are presented first followed by those in other scripts.

Browse the Shelf from the item details page:

  • Browse the Shelf uses sorting of call numbers. Call number sorting commonly differentiates items with the same call number using an alphabetical suffix. Thus, QV772 ร451 would sort before QV772 ล148ย because ร sorts before ล.

Important links

Product website 

More product information can be found here.

Support website(s) 

Support information for this product and related products can be found at:

If you have additional questions, please contact OCLC Customer Service by calling 1-800-848-5800 or 1-614-793-8682 Monday – Friday 8 a.m. – 7 p.m. ET, or email support@oclc.org. For support enquiries in the UK and Ireland, please contact the Support Desk by calling +44-(0)114-281 60 42 or emailing support-uk@oclc.org. Support is available between the hours of 09:00 and 17:30 (UK Time).

Include Request ID with problem reports

When reporting an issue with WorldCat Discovery, it is extremely helpful to include the Request ID. The Request ID is found at the bottom of the screen on which the issue occurred. Including this information allows us to directly trace what happened on the request we are troubleshooting.
