Monday, March 26, 2012

FT Index on [German|English] columns. SQL Server 2005

Hi,
sorry for the cryptic topic. I wasn't able to come up with anything
better.
We are currently working on a project thats aims to keep all students
thesis es in a database and enable searches on it via a ASP .NET web
app. The problem we have is that some students write their thesis in
german and some in english. Right now there are more germans texts than
english ones but i guess that will change in the near future. We've
heard it's possible to maintain a fulltext index with both languages at
once but we would have to specify with our queries which index should
be used. While this is ok for entering data, asking the user if he is
going to enter his thesis in english or german, asking someone that is
searching we find it to be troublesome or at least leading to
confusion. Most of the german thesis texts contain a fair number of
english words too (computer science kinda lives from english buzzwords
;-)). So languages might be even mixed inside a text.
Anyone recommendations on how to solve this kinda of problem? Would it
be reasonable to "merge" the englisch and german stopword list and
create a "new language"?
Thanks for your help! All hints/links appreciated.
Kind regards from Germany
Phil
Phil,
In SQL Server 2005 Full-Text Search (FTS) you can store and search on
language-specific words or phrases at query time via CONTAINS or
FREETEXT. So, you can store both languages (English and German are both
supported FTS languages) and use either English or German for the
"Language for Word Breaker" and then use the following CONTAINS
statement (with or without the formsof(inflectional) parameter):
select * from FTSTable where
contains(*,'formsof(inflectional,"english_word")', language 1033) OR
contains(*,'formsof(inflectional,"German_word")',l anguage 1031)
Additionally, on an advance search form, you may want to add a box for
keyword and the associated language, so that the searchers can select
the approprate language for the keyword. Note, the default language is
whatever language you defined for the "Language for Word Breaker".
Hope that helps!
John
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/
|||Such a project is doomed. German and English generate a large number of
false friends/false conjugates; wander words/worts which are common in both
languages but with different meanings. For instance Gift in English means a
present; in German its means poison.
So what will happen no matter what the language you specify your query to
occur in you will get false hits. The better way to do it is to sniff your
browser settings and check for German as the supported language and then
have the docs stored in different columns/tables/databases. Then a query
would be directed against a language segregated column/table or database.
If your queries are done using the Contains predicate the false conjugates
will be minimized.
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Philipp Stader" <pstader@.gmail.com> wrote in message
news:1131651975.271992.103620@.z14g2000cwz.googlegr oups.com...
> Hi,
> sorry for the cryptic topic. I wasn't able to come up with anything
> better.
> We are currently working on a project thats aims to keep all students
> thesis es in a database and enable searches on it via a ASP .NET web
> app. The problem we have is that some students write their thesis in
> german and some in english. Right now there are more germans texts than
> english ones but i guess that will change in the near future. We've
> heard it's possible to maintain a fulltext index with both languages at
> once but we would have to specify with our queries which index should
> be used. While this is ok for entering data, asking the user if he is
> going to enter his thesis in english or german, asking someone that is
> searching we find it to be troublesome or at least leading to
> confusion. Most of the german thesis texts contain a fair number of
> english words too (computer science kinda lives from english buzzwords
> ;-)). So languages might be even mixed inside a text.
> Anyone recommendations on how to solve this kinda of problem? Would it
> be reasonable to "merge" the englisch and german stopword list and
> create a "new language"?
> Thanks for your help! All hints/links appreciated.
> Kind regards from Germany
> Phil
>
|||Thanks for your input! This basicaly means i'll have to tell SQL Server
which language i want it to use? On inserts and searches? Meaning i
have to ask the user if the text he is going to submit is english or
german (or apply some magic to detect the language)?
|||I wish there was a german-english word breaker :-) Detecting browser
language settings won't help us much. People can choose to write their
diploma/master thesis in german or english. Browser language settings
will be german all the way since student can only enter their thesis
from within the university, where all clients are "german". Most user
searching for a thesis will use some acronyms/keywords like XML, C#,
UML etc which are kinda language indepedent. But there are some
technological terms which have a good and commonly used german
expression, usually a longer word combined from substantives so i think
the correct wordbreaker would help a lot if a user search for a
substring of the longer word. Thanks for pointing the problems out.
Seems to me that our goal is kinda not feasable right now :-/
|||You're welcome, Philipp,
Yes, you still need to define the "Language for Word Breaker" (either German
or English) and you only need to do this once when you define the Full-Text
Catalog and the specific FT-enabled column where these languages are stored.
The default will be which ever language you define, and you only need to
specific the non-default language at query time, i.e., when you execute a
CONTAINS or FREETEXT searches.
No, you don't (and cannot) define the language during INSERTS as
syntactically this is not possible with the INSERT statement. The SQL Server
2005 FTS engine (msftesql.exe) will detect and determine the language of the
mixed text in this column.
Regards,
John
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/
"Philipp Stader" <pstader@.gmail.com> wrote in message
news:1131886568.811593.201700@.o13g2000cwo.googlegr oups.com...
> Thanks for your input! This basicaly means i'll have to tell SQL Server
> which language i want it to use? On inserts and searches? Meaning i
> have to ask the user if the text he is going to submit is english or
> german (or apply some magic to detect the language)?
>

No comments:

Post a Comment