Wednesday, March 7, 2012

FreeTextTable Rank

Hi. I have a question - does anyone know on what criteria is the rank set by
FreeTextTable ? I created some records in the table "titles" containing the
words "drink", "much" and "abstinence" and... well... I can find no rhyme
or reason to the way this rank is established. I got the ranks 293, 266, 266
and 154. The funny thing is that a title like "No *abstinence* for me,
please !" has a higher rank than "ABSTINENCE for dummies." with the word in
capitals.
So... how exactly is this rank set ? How reliable is it ?
I'm "afraid" to filter the returns (for instance TOP N) and I am "afraid" to
set conditions like "where rank >..." (of course, I could always show all
the hits and page the datagrid).
Thanks a lot.
Alex.
After some searching, I found this (for SQL2005):
Ranking of FREETEXT
Freetext ranking is based on the OKAPI BM25 ranking formula. Each term in
the query is ranked, and the values are summed. Freetext queries will add
words to the query via inflectional generation (stemmed forms of the
original query terms); these words are treated as separate terms with no
special weighting or relationship with the words from which they were
generated. Synonyms generated from the Thesaurus feature are treated as
separate, equally weighted terms.
Rank = ?[Terms in Query] w ( ( ( k1 + 1 ) tf ) / ( K + tf ) ) * ( ( k3 + 1 )
qtf / ( k3 + qtf ) ) )
Where:
w is the Robertson-Sparck Jones weight.
Originally, w is defined as:
w = log10 ( ( ( r + 0.5 ) * ( N - n - R + r + 0.5 ) ) / ( ( R - r + 0.5 ) *
( n - r + 0.5 ) ) )
This was simplified to:
w = log10 ( ( ( r + 0.5 ) * ( N - R + r + 0.5 ) ) / ( ( R - r + 0.5 ) * (
n - r + 0.5 ) ) )
R is the number of documents marked relevant by a user. This is not
implemented in SQL Server 2005 full-text search, and thus is ignored.
r is the number of documents marked relevant by a user containing the term.
This is not implemented.
N is the number of documents with values for the property in the query.
n is the number of documents containing the term.
K is ( k1 * ( ( 1 - b ) + ( b * dl / avdl ) ) )
dl is the document length, in word occurrences.
avdl is the average document length of the property over which the query
spans, in word occurrences.
k1, b, and k3 are the constants 1.2, 0.75, and 8.0, respectively.
tf is the frequency of the term in a specific document.
qtf is the frequency of the term in the query.
So now I know :-)))) Anyway, apparently the ranking results have to be taken
with a *BIG" grain of salt.
Alex.
|||Hi Alex,
Yes, I do, but it is complex and you need to understand a bit about basic
Informational Retrieval theory. The Rank value from FREETEXTTABLE is based
upon what is known as OKAPI or BM25 that was developed by Stephen Robertson
(http://research.microsoft.com/users/robertson/). You can see some of the
formula documented in "SQL Server 2005 Full-Text Search: Internals and
Enhancements" at:
http://msdn.microsoft.com/library/de...05ftsearch.asp
Specifically, under "Ranking of FREETEXT" -
"Freetext ranking is based on the OKAPI BM25 ranking formula. Each term in
the query is ranked, and the values are summed. Freetext queries will add
words to the query via inflectional generation (stemmed forms of the
original query terms); these words are treated as separate terms with no
special weighting or relationship with the words from which they were
generated. Synonyms generated from the Thesaurus feature are treated as
separate, equally weighted terms."...
See also http://wickedsmrt.blogspot.com/2003_...t_archive.html
"From MS Newsgroups: However it was my understanding that rank is based on
this formula:
? W(i)=(K1+1) ?idf(i)?(K2+1) ?tf(i.j) /( K1?[(1-b)+b?dl
(j)/avdl])?K3(tf(i.j))
W(i) - rank from each term in the search phrase
idf(i) - iS the inverse document frequency of term i
tf(i,j) - is the term frequency for term i, in row j
K1,K2,K3 - are constants
dl - is row/column length in words
AVdl - is the average row/column length length in words"
Another factor is the number of rows and the number of unique non-noise
words per row as you must have a statistically significant number of rows
(at least 10,000) for the OKAPI BM25 Freetexttable Rank value to be
meaningful. How many rows are in your table "titles"? Also keep in mind that
the Rank values are specific to your freetext query and primarily useful for
ordering of the results. See SQL Server 2000 BOL title "Full-text Search
Recommendations" - "What is RANK and how is it determined when used with
CONTAINSTABLE and FREETEXTTABLE predicates?..." for more info.
Yes, OKAPI BM25 is very reliable, but complex. Note, that for US English,
SQL FTS is case insensitive and "ABSTINENCE" will have the same rank value
as "abstinence" with all other factors being equal. It is complex as you can
see, but what is your true objective? Could you provide the exact Freetext
query with sample data and results as well as the full output of SELECT
@.@.version ?
Hope that helps!
John
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/
"msnews.microsoft.com" <REMOVETHIScuca_macaii2000@.yahoo.com> wrote in
message news:#YEbgXDBFHA.3824@.TK2MSFTNGP10.phx.gbl...
> Hi. I have a question - does anyone know on what criteria is the rank set
by
> FreeTextTable ? I created some records in the table "titles" containing
the
> words "drink", "much" and "abstinence" and... well... I can find no
rhyme
> or reason to the way this rank is established. I got the ranks 293, 266,
266
> and 154. The funny thing is that a title like "No *abstinence* for me,
> please !" has a higher rank than "ABSTINENCE for dummies." with the word
in
> capitals.
> So... how exactly is this rank set ? How reliable is it ?
> I'm "afraid" to filter the returns (for instance TOP N) and I am "afraid"
to
> set conditions like "where rank >..." (of course, I could always show all
> the hits and page the datagrid).
> Thanks a lot.
> Alex.
>
|||Alex,
Yep, so you found (via Google?) the SQL 2005 FTS paper published in Dec
2004. While it documents SQL 2005, I *believe* that as far as the contains
and freetext ranking formula's that they hold true for SQL Server 2000 as
well. Even so, I'm surprised that for SQL Server 2005 (or for that matter
SQL 2000) that relevance feedback (R & r in the formula) was not implemented
as there are standard methods using T-SQL and feedback tables that can be
use to implement automatic relevance feedback...
Oh, and that *BIG" grain of salt, that you speak of, not necessary... You
now know the formula, and with your table's unique non-noise words, you can
calculate the rank values from your query by yourself!
Regards,
John
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/
"msnews.microsoft.com" <REMOVETHIScuca_macaii2000@.yahoo.com> wrote in
message news:eAAhomDBFHA.2012@.TK2MSFTNGP15.phx.gbl...
> After some searching, I found this (for SQL2005):
> Ranking of FREETEXT
> Freetext ranking is based on the OKAPI BM25 ranking formula. Each term in
> the query is ranked, and the values are summed. Freetext queries will add
> words to the query via inflectional generation (stemmed forms of the
> original query terms); these words are treated as separate terms with no
> special weighting or relationship with the words from which they were
> generated. Synonyms generated from the Thesaurus feature are treated as
> separate, equally weighted terms.
> Rank = ?[Terms in Query] w ( ( ( k1 + 1 ) tf ) / ( K + tf ) ) * ( ( k3 +
1 )
> qtf / ( k3 + qtf ) ) )
> Where:
> w is the Robertson-Sparck Jones weight.
> Originally, w is defined as:
> w = log10 ( ( ( r + 0.5 ) * ( N - n - R + r + 0.5 ) ) / ( ( R - r + 0.5 )
*
> ( n - r + 0.5 ) ) )
> This was simplified to:
> w = log10 ( ( ( r + 0.5 ) * ( N - R + r + 0.5 ) ) / ( ( R - r + 0.5 ) * (
> n - r + 0.5 ) ) )
> R is the number of documents marked relevant by a user. This is not
> implemented in SQL Server 2005 full-text search, and thus is ignored.
> r is the number of documents marked relevant by a user containing the
term.
> This is not implemented.
> N is the number of documents with values for the property in the query.
> n is the number of documents containing the term.
> K is ( k1 * ( ( 1 - b ) + ( b * dl / avdl ) ) )
> dl is the document length, in word occurrences.
> avdl is the average document length of the property over which the query
> spans, in word occurrences.
> k1, b, and k3 are the constants 1.2, 0.75, and 8.0, respectively.
> tf is the frequency of the term in a specific document.
> qtf is the frequency of the term in the query.
> So now I know :-)))) Anyway, apparently the ranking results have to be
taken
> with a *BIG" grain of salt.
> Alex.
>
|||Hello, John. Thank you for your reply.
Yes, I understand. Unfortunately, I'm using the pubs db, and I only have
about 50 titles in the 'Titles' table.
Here are the results of my query:
------
Rank/Title/Notes/First Name/Last Name
375/We don't drink that much/Essay on abstinence./Napoleon/Borcan
266/Que c'est bon, c'est bon, c'est bon !/Essai sur les joies de boire
beaucoup. Abstinents s'abstenir (eng: abstinence)./Pisica/Rindunel
121/No *abstinence* for me, please !/A smoker's paradise. Smoke-smoke-smoke,
boy, oh, isn't it a joy ?/Alberta/Curisor
121/ABSTINENCE for dummies./An epicurian's guide./Pupu/Balacarescu
------
and this is the query, which is correct:
------
strSearch = "SELECT " & _
"SearchTable.[Rank], Titles.title as Title, Titles.notes as
Notes, Authors.au_fname as [First Name], Authors.au_lname as [Last Name]" &
_
"FROM " & _
"FREETEXTTABLE(Titles, *, '" & strText & "') as SearchTable
" & _
"INNER JOIN Titles ON SearchTable.[Key] = Titles.title_id "
& _
"INNER JOIN TitleAuthor ON Titles.title_id =
TitleAuthor.title_id " & _
"INNER JOIN Authors ON TitleAuthor.au_id = Authors.au_id " &
_
"ORDER BY SearchTable.[Rank] DESC"
------
The version is:
Microsoft SQL Server 2000 - 8.00.194 (Intel X86) Aug 6 2000 00:57:48
Copyright (c) 1988-2000 Microsoft Corporation Developer Edition on Windows
NT 5.1 (Build 2600: Service Pack 2)
Alex.
"John Kane" <jt-kane@.comcast.net> wrote in message
news:Oof9C5DBFHA.3644@.TK2MSFTNGP15.phx.gbl...
> Hi Alex,
> Yes, I do, but it is complex and you need to understand a bit about basic
> Informational Retrieval theory. The Rank value from FREETEXTTABLE is based
> upon what is known as OKAPI or BM25 that was developed by Stephen
> Robertson
> (http://research.microsoft.com/users/robertson/). You can see some of the
> formula documented in "SQL Server 2005 Full-Text Search: Internals and
> Enhancements" at:
> http://msdn.microsoft.com/library/de...05ftsearch.asp
> Specifically, under "Ranking of FREETEXT" -
> "Freetext ranking is based on the OKAPI BM25 ranking formula. Each term in
> the query is ranked, and the values are summed. Freetext queries will add
> words to the query via inflectional generation (stemmed forms of the
> original query terms); these words are treated as separate terms with no
> special weighting or relationship with the words from which they were
> generated. Synonyms generated from the Thesaurus feature are treated as
> separate, equally weighted terms."...
> See also http://wickedsmrt.blogspot.com/2003_...t_archive.html
> "From MS Newsgroups: However it was my understanding that rank is based on
> this formula:
> ? W(i)=(K1+1) ?idf(i)?(K2+1) ?tf(i.j) /( K1?[(1-b)+b?dl
> (j)/avdl])?K3(tf(i.j))
> W(i) - rank from each term in the search phrase
> idf(i) - iS the inverse document frequency of term i
> tf(i,j) - is the term frequency for term i, in row j
> K1,K2,K3 - are constants
> dl - is row/column length in words
> AVdl - is the average row/column length length in words"
> Another factor is the number of rows and the number of unique non-noise
> words per row as you must have a statistically significant number of rows
> (at least 10,000) for the OKAPI BM25 Freetexttable Rank value to be
> meaningful. How many rows are in your table "titles"? Also keep in mind
> that
> the Rank values are specific to your freetext query and primarily useful
> for
> ordering of the results. See SQL Server 2000 BOL title "Full-text Search
> Recommendations" - "What is RANK and how is it determined when used with
> CONTAINSTABLE and FREETEXTTABLE predicates?..." for more info.
> Yes, OKAPI BM25 is very reliable, but complex. Note, that for US English,
> SQL FTS is case insensitive and "ABSTINENCE" will have the same rank value
> as "abstinence" with all other factors being equal. It is complex as you
> can
> see, but what is your true objective? Could you provide the exact Freetext
> query with sample data and results as well as the full output of SELECT
> @.@.version ?
> Hope that helps!
> John
> --
> SQL Full Text Search Blog
> http://spaces.msn.com/members/jtkane/
>
> "msnews.microsoft.com" <REMOVETHIScuca_macaii2000@.yahoo.com> wrote in
> message news:#YEbgXDBFHA.3824@.TK2MSFTNGP10.phx.gbl...
> by
> the
> rhyme
> 266
> in
> to
>
|||You're welcome, Alex,
Yes, I thought so. While the pubs and northwind database tables are good
examples for experimenting with SQL FTS queries, they are not large enough
to be used effectively with containstable or freetextable and RANK as you
need production level table sizes to get meaningful Ranking results from the
SQL FTS queries such as the one below.
FYI, I'd recommend that you apply the latest service pack to your SQL Server
2000 (8.00.194) Developer's Edition on WinXP SP2 as the build (194) of SQL
Server 2000 that you are using has no service packs applied and you may be
open security bugs with this RTM version.
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/
"msnews.microsoft.com" <REMOVETHIScuca_macaii2000@.yahoo.com> wrote in
message news:O2wwSFEBFHA.3472@.TK2MSFTNGP14.phx.gbl...
> Hello, John. Thank you for your reply.
> Yes, I understand. Unfortunately, I'm using the pubs db, and I only have
> about 50 titles in the 'Titles' table.
> Here are the results of my query:
> ----

> Rank/Title/Notes/First Name/Last Name
> 375/We don't drink that much/Essay on abstinence./Napoleon/Borcan
> 266/Que c'est bon, c'est bon, c'est bon !/Essai sur les joies de boire
> beaucoup. Abstinents s'abstenir (eng: abstinence)./Pisica/Rindunel
> 121/No *abstinence* for me, please !/A smoker's paradise.
Smoke-smoke-smoke,
> boy, oh, isn't it a joy ?/Alberta/Curisor
> 121/ABSTINENCE for dummies./An epicurian's guide./Pupu/Balacarescu
> ----

> and this is the query, which is correct:
> ----

> strSearch = "SELECT " & _
> "SearchTable.[Rank], Titles.title as Title, Titles.notes
as
> Notes, Authors.au_fname as [First Name], Authors.au_lname as [Last Name]"
&
> _
> "FROM " & _
> "FREETEXTTABLE(Titles, *, '" & strText & "') as
SearchTable
> " & _
> "INNER JOIN Titles ON SearchTable.[Key] = Titles.title_id
"
> & _
> "INNER JOIN TitleAuthor ON Titles.title_id =
> TitleAuthor.title_id " & _
> "INNER JOIN Authors ON TitleAuthor.au_id = Authors.au_id "
&
> _
> "ORDER BY SearchTable.[Rank] DESC"
> ----

> The version is:
> Microsoft SQL Server 2000 - 8.00.194 (Intel X86) Aug 6 2000 00:57:48
> Copyright (c) 1988-2000 Microsoft Corporation Developer Edition on
Windows[vbcol=seagreen]
> NT 5.1 (Build 2600: Service Pack 2)
> Alex.
>
>
> "John Kane" <jt-kane@.comcast.net> wrote in message
> news:Oof9C5DBFHA.3644@.TK2MSFTNGP15.phx.gbl...
basic[vbcol=seagreen]
based[vbcol=seagreen]
the[vbcol=seagreen]
http://msdn.microsoft.com/library/de...05ftsearch.asp[vbcol=seagreen]
in[vbcol=seagreen]
add[vbcol=seagreen]
http://wickedsmrt.blogspot.com/2003_...t_archive.html[vbcol=seagreen]
on[vbcol=seagreen]
rows[vbcol=seagreen]
English,[vbcol=seagreen]
value[vbcol=seagreen]
Freetext[vbcol=seagreen]
set[vbcol=seagreen]
266,[vbcol=seagreen]
word[vbcol=seagreen]
"afraid"
>

No comments:

Post a Comment