Monday, March 26, 2012

FT Index on column with multiple language doesn't work - any ideas

Hi all,
I have a table on SQL Server 2005 with a column (collation
SQL_Latin1_General_CP1_CI_AS) that can contain multiple languages. Some of
the possible languages include English, German, French, Kazakh, Arabic,
Chinese (Simplified and Traditional), Japanese, Korean, Thai, etc.
I tried querying for documents that contains 台灣 (meaning "Taiwan" in
Simplified Chinese), but it is not working as I expected.
I ran the following query:
select * from tblDocuments
where CONTAINS(DocumentTitle, '台灣', LANGUAGE 'Traditional Chinese')
With the English language breaker, I get 0 results.
With the Traditional Chinese language breaker, I get 1 result and it is the
document I was looking for.
I would've expected the LANGUAGE keyword to ignore the default language
breaker and use whatever is specified, but apparently I'm missing
something... maybe table setup or whatnot.
Any ideas would be greatly appreaciated.
Cheers,
Andry
This is correct behavior. When the document is indexed and you are
specifying the English word breaker it interprets the contents of the row to
be English words, unless you are using the xml language tags and specify
that the XML doc is in Chinese and that you are using the XML data type, or
you are using Word and marked that passage containing the character as
Chinese, or the language settings for your Word document are Chinese, or you
are using html and have set the ms.locale to Chinese and set the correct
code page. With Word Docs and HTML docs you must save them in the image or
varbinary data type columns and use a document type column with the
extension the document would have if it was stored in the file system.
If you don't follow the above the Chinese character will be interpreted as a
Unicode sequence which will not be interpreted correctly in your queries.
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Andry" <Andry@.discussions.microsoft.com> wrote in message
news:569E9782-47E3-4657-AC68-7787214D3D97@.microsoft.com...
> Hi all,
> I have a table on SQL Server 2005 with a column (collation
> SQL_Latin1_General_CP1_CI_AS) that can contain multiple languages. Some of
> the possible languages include English, German, French, Kazakh, Arabic,
> Chinese (Simplified and Traditional), Japanese, Korean, Thai, etc.
> I tried querying for documents that contains ? (meaning "Taiwan" in
> Simplified Chinese), but it is not working as I expected.
> I ran the following query:
> select * from tblDocuments
> where CONTAINS(DocumentTitle, '?', LANGUAGE 'Traditional Chinese')
> With the English language breaker, I get 0 results.
> With the Traditional Chinese language breaker, I get 1 result and it is
> the
> document I was looking for.
> I would've expected the LANGUAGE keyword to ignore the default language
> breaker and use whatever is specified, but apparently I'm missing
> something... maybe table setup or whatnot.
> Any ideas would be greatly appreaciated.
> Cheers,
> Andry

No comments:

Post a Comment