SqlSearchProvider?

Topics: General Discussion, General Questions
Dec 1, 2008 at 3:37 PM
I've noticed some mention of a SqlSearchProvider that may be coming down the pipeline...are there any rough estimates as to when this might be available?  At this point I've had to disable all searching functionality on my site because I can't get the search up and running...I tried setting the request timeout to an hour, but it still timed out before finishing generating the index...so I moved the index generation out to an asynchronous thread, but the machine ran out of memory and subsequently became unacceptably slow.  If it's not in the works yet I'd be interested in possibly writing it myself and contributing the source back to the project...is this possible?
Dec 1, 2008 at 4:13 PM
Hi,

I've been deferring this feature with each release.  IIRC, you're the first person to show any interest in it :)

I've been planning for 1.12 to at least create a file-based provider that works exactly the same as the in-memory provider, because in my experience it gets annoying having to regenerate the index each time the application restarts due to inactivity.  I'd be happy to work on the SQL provider for 1.12 as well although there's still a lot more for me to do, so I may not get to it.

If you'd like to give it a shot I'd certainly take a look at the end result and consider merging it into DocProject's source code.  You could try using CodePlex's patching feature (you'd be the first to use it so I can't say whether it's the best approach, but I'd be interested in seeing how it works :).

Making an SQL search provider should be fairly straightforward.  It's mostly a matter of creating an SqlDocSiteSearchProvider class that derives from DocSiteSearchProvider and constructing a textual SQL query based on the QueryExpression that is passed to the Search method (yes, it has an eerie resemblance to LINQ, but I wrote it before learning how LINQ works so there wasn't much influence ;).
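The QueryExpression-to-SQL translation described above can be sketched language-neutrally (the real provider would be C#; the node shapes, the `compile_query` name, and the placeholder style below are all hypothetical illustrations of the recursive-translation idea, not DocProject's actual API):

```python
# Hypothetical sketch: compiling a small query-expression tree into a
# parameterized SQL WHERE clause.  None of these names come from DocProject.

def compile_query(node, params):
    """Return a SQL fragment for `node`, appending its values to `params`."""
    kind = node[0]
    if kind == "term":                       # ("term", "keyword")
        params.append(node[1])
        return "Keyword = ?"                 # placeholder style varies by driver
    if kind in ("and", "or"):                # ("and", left, right)
        left = compile_query(node[1], params)
        right = compile_query(node[2], params)
        return f"({left} {kind.upper()} {right})"
    raise ValueError(f"unknown node kind: {kind}")

expr = ("or", ("term", "search"),
              ("and", ("term", "sql"), ("term", "provider")))
params = []
where = compile_query(expr, params)
```

Building the fragment with parameter placeholders rather than string concatenation keeps the query safe from injection and lets the server cache the plan.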

Take a look at how DocSiteMemorySearchProvider is implemented as well.  The GetQueryOperator method and the QueryOperator-related properties should be explained by their implementations.

I also recommend caching some information in memory, such as KeywordCount and DocumentCount, because once the index is generated for the first time it can be treated as immutable.
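Since the index is effectively immutable once built, aggregates like KeywordCount and DocumentCount only need to be computed once.  A minimal memoization sketch (names like `keyword_count` are hypothetical stand-ins, and the Python here only illustrates the pattern the C# provider would use):

```python
# Illustrative sketch: expensive aggregates over an immutable index are
# fetched once and cached for the lifetime of the provider.

class IndexStats:
    def __init__(self, fetch_counts):
        self._fetch = fetch_counts       # e.g. runs a COUNT(*) query
        self._cache = None

    @property
    def keyword_count(self):
        return self._counts()[0]

    @property
    def document_count(self):
        return self._counts()[1]

    def _counts(self):
        if self._cache is None:          # hit the database only once
            self._cache = self._fetch()
        return self._cache

calls = []
stats = IndexStats(lambda: calls.append(1) or (10_000_000, 92_000))
```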

If you have any questions feel free to ask.

- Dave
Dec 1, 2008 at 4:38 PM
Cool....sounds good.  The file-based provider could probably work as well as a SQL-based one as long as it doesn't attempt to load the whole thing into memory....but yeah, I'll definitely look into a SQL-based provider...probably won't get too into it for a week or two yet though....the "real" work is starting to stack up a bit :P
Dec 1, 2008 at 4:52 PM
Hi,

> The file-based provider could probably work as well as a SQL-based one as long as it doesn't attempt to load the whole thing into memory

Good point.  I could probably get away with loading small blocks of the file into memory while searching and then cache the blocks that contain matches, to increase performance for related queries.  To alleviate a build-up there would also have to be a timeout and a cache memory usage limit, of course.  The settings for this could be configurable.
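The block-caching idea can be sketched as follows, assuming the index file is divided into fixed-size blocks.  The eviction policy (a size-capped LRU) and all names here are illustrative; a real provider would also expire entries after the configurable timeout mentioned above:

```python
# Sketch: cache recently used blocks of a file-based index in memory,
# bounded by a configurable maximum number of blocks.
from collections import OrderedDict

class BlockCache:
    def __init__(self, read_block, max_blocks=64):
        self._read = read_block          # loads one block from disk
        self._max = max_blocks
        self._blocks = OrderedDict()     # block number -> block data

    def get(self, n):
        if n in self._blocks:
            self._blocks.move_to_end(n)  # mark as most recently used
            return self._blocks[n]
        data = self._read(n)
        self._blocks[n] = data
        if len(self._blocks) > self._max:
            self._blocks.popitem(last=False)   # evict least recently used
        return data

reads = []
cache = BlockCache(lambda n: reads.append(n) or f"block-{n}", max_blocks=2)
```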

> I'll definitely look into a SQL-based provider...probably won't get too into it for a week or two

Sure.  If you find the time to try this let me know and I'll set up an appropriate DB model for you.

> the "real" work is starting to stack up a bit :P

I have a love-hate relationship with "real" work :)

- Dave
Dec 6, 2008 at 8:25 PM
Soooo...just taking a quick look through the in-memory provider...I'm thinking a schema like this should suffice....the NVARCHAR sizes are most likely not right...I'm not really sure how big that data gets...what 'cha think?

- Chris


CREATE TABLE Files (
      Id INT IDENTITY(1,1) NOT NULL,
      Filename NVARCHAR(255) NOT NULL,
      Title NVARCHAR(255) NOT NULL,
      MSHelpTitle NVARCHAR(255) NOT NULL,
      MSHelpApiLocation NVARCHAR(255) NOT NULL,
      MSHelpApiType NVARCHAR(255) NOT NULL,
      MSHelpApiName NVARCHAR(255) NOT NULL,
      MSHelpLocale NVARCHAR(255) NOT NULL,
      MSHelpTopicType NVARCHAR(255) NOT NULL,
      MSHelpAbstract NVARCHAR(255) NOT NULL,
      PRIMARY KEY(Id)
)
GO
CREATE TABLE Keywords (
      FileId INT NOT NULL,
      Keyword NVARCHAR(50) NOT NULL,
      Occurances INT NOT NULL,
      Weight INT NOT NULL,
      PRIMARY KEY(FileId, Keyword)
)
GO
ALTER TABLE Keywords ADD CONSTRAINT fkFileId FOREIGN KEY (FileId) REFERENCES Files(Id)
GO
CREATE INDEX ixKeyword ON Keywords(Keyword)
GO
CREATE UNIQUE INDEX ixFilename ON Files(Filename)
GO

Dec 8, 2008 at 5:56 PM
I'm thinking it might be good to also add something like an "ExactTitleMatch BIT NOT NULL" field to the Keywords table to make the searching faster/easier.
Dec 9, 2008 at 2:20 PM
Hi Chris,

Your initial logical design looks pretty good but I do see some flaws.  Here are a few suggestions:

  1. Set the Keyword column's collation to SQL_Latin1_General_CP1_CI_AS.  (Of particular importance is CI - case-insensitive.)
  2. Make Keyword index clustered (non-unique).  Locating string keywords in a performant manner seems to be of more importance than increasing the speed of a join on a numeric column.
  3. Drop the no-longer-clustered PRIMARY KEY on Keywords.  I don't think this constraint is necessary - data integrity isn't as important as keeping the size of the database to a minimum.
  4. Occurrences (spelled correctly ;).
  5. Add ExactTitleMatch BIT NOT NULL, PartialTitleMatch BIT NOT NULL.  (Good idea :)
  6. Filename on the Files table should be renamed to FilePath.  I guess 255 is fine for the size, although technically it's possible (though not yet in managed code, without some Win32 API calls) to have longer path lengths in NTFS.
  7. Set FilePath collation to SQL_Latin1_General_CP1_CI_AS.  (Of particular importance is CI - case-insensitive.)
  8. I recommend thinking about having two threads: one that buffers keywords into an in-memory queue each time AddKeyword is called, while another dequeues in batches and sends each IndexEntry to a stored proc in the context of a single DB connection.  This may help to take advantage of dual core machines when filling the index.
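The two-thread idea in suggestion 8 can be sketched like this (Python standing in for the eventual C#; the `send_batch` callback is a hypothetical stand-in for a stored-proc call over a single DB connection):

```python
# Sketch of suggestion 8: a producer buffers keywords into a queue (as
# AddKeyword is called) while a consumer thread drains it in batches,
# so index filling and DB round-trips can overlap on a dual-core machine.
import queue
import threading

def index_keywords(entries, send_batch, batch_size=100):
    q = queue.Queue()
    done = object()                      # sentinel to stop the consumer

    def consumer():
        batch = []
        while True:
            item = q.get()
            if item is done:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                send_batch(batch)        # one round-trip per batch
                batch = []
        if batch:
            send_batch(batch)            # flush the final partial batch

    t = threading.Thread(target=consumer)
    t.start()
    for entry in entries:                # producer: the AddKeyword side
        q.put(entry)
    q.put(done)
    t.join()

batches = []
index_keywords(range(250), lambda b: batches.append(list(b)), batch_size=100)
```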
These are the things I noticed looking at your design while briefly reviewing the code in DocSiteMemorySearchProvider.cs - I could be wrong about some of my suggestions.  The best way to find out is to try to implement the provider, modify the design based on your experience, and then run some performance tests.
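For what it's worth, folding suggestions 1-7 into the earlier DDL might look roughly like this (a sketch only, untested; sizes and index names are still provisional):

```sql
-- Sketch of the Keywords table with the suggestions above applied.
CREATE TABLE Keywords (
      FileId INT NOT NULL,
      Keyword NVARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
      Occurrences INT NOT NULL,
      Weight INT NOT NULL,
      ExactTitleMatch BIT NOT NULL,
      PartialTitleMatch BIT NOT NULL
      -- no PRIMARY KEY; the clustered index below takes its place
)
GO
CREATE CLUSTERED INDEX ixKeyword ON Keywords(Keyword)
GO
-- On the Files table, Filename becomes FilePath with the same
-- case-insensitive collation:
-- FilePath NVARCHAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
```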

BTW, thanks for your effort :)

- Dave
Dec 9, 2008 at 3:44 PM
cool....sounds good....I'm thinking the MSHelp* fields probably for the most part don't have to be 255 in length...especially Locale, but I guess that's not going to add much overhead really....

hehehe....one weird thing in the code that had me confused for a bit that you may want to know about....in your ListHelper class...'Union()' actually performs an intersect operation....and 'Combine()' actually performs a union operation :P
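For reference, the expected semantics of the two operations (this is not DocProject code, just a minimal illustration of what each name should return):

```python
# A method named Union should merge both inputs; Intersect should keep
# only the elements common to both.
def union(a, b):
    return sorted(set(a) | set(b))

def intersect(a, b):
    return sorted(set(a) & set(b))
```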
Dec 9, 2008 at 4:37 PM
Hi Chris,

> I'm thinkin the MSHelp* fields probably for the most part don't have to be 255 in length

Yes, probably not, but I agree that it's fine for now.  Or better yet, if we're targeting SQL Server 2005 at a minimum, then you could use MAX instead for Title, ApiLocation, ApiType and ApiName.  We can examine the domains for the other columns in the future.
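Widening those columns on an existing database could be done with something like the following (a sketch, assuming SQL Server 2005+ and that none of these columns end up in an index, since NVARCHAR(MAX) columns can't be index keys):

```sql
-- Sketch: widen the large text columns on SQL Server 2005+.
ALTER TABLE Files ALTER COLUMN Title NVARCHAR(MAX) NOT NULL
GO
ALTER TABLE Files ALTER COLUMN MSHelpApiLocation NVARCHAR(MAX) NOT NULL
GO
ALTER TABLE Files ALTER COLUMN MSHelpApiType NVARCHAR(MAX) NOT NULL
GO
ALTER TABLE Files ALTER COLUMN MSHelpApiName NVARCHAR(MAX) NOT NULL
GO
```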

> in your ListHelper class...'Union()' actually performs an Intersect operation....and 'Combine()' actually performs a Union operation :P

I never said that math was my strong point ;)  Although maybe I just changed the implementations but forgot to rename them - I can't remember.

Anyway, thanks for pointing that out.  It will be fixed in the next release.

- Dave
Dec 9, 2008 at 4:40 PM
This discussion has been copied to a work item.
Dec 9, 2008 at 6:36 PM
another thing I noticed...the hot keywords and excluded keywords comparisons in DocSiteSearch/StringList seem to be case sensitive....I would think they probably shouldn't be...
Dec 9, 2008 at 7:16 PM
Hi Chris,

Another good observation :)

The quickest solution is to simply add the following to the StringList class constructor(s).

    : base(StringComparer.OrdinalIgnoreCase)

I'll make the change for 1.12.

Thanks, 
Dave
Dec 9, 2008 at 7:59 PM
This discussion has been copied to a work item.
Dec 19, 2008 at 5:49 PM
sooooo.....I have a SqlSearchProvider implementation done and working.  The good thing about it is that it requires no changes to existing code to use it.  It performs quite well, but there are definitely a couple of things about it that I'm not terribly happy with...but it works.  I think it's about as good as I can make it without making changes to the codebase, and it's not looking like I'm going to get any time to work on it anymore for at least a month...so I figure it's as good a time as ever to hand it over :P  I've tested it with my doc set, which is 92,000 files with 10,000,000 keywords (occurrences), and it performs decently well....not spectacular, but definitely usable.  How shall I send you the files?
Dec 19, 2008 at 6:00 PM
actually...here...I've uploaded a rar that contains all the files: http://www.virgeweb.com/redec/crap/docsite.rar ....take a look through it when you have some time....I'm sure you'll have questions about why I did certain things...there was quite a bit of trial and error to find the most performant approach
Dec 24, 2008 at 7:05 PM
Hi Chris,

Thanks for the code.  I've downloaded it and will take a look when I get some time.

Sorry for the delay, I've been quite busy myself.  If it works for me then I'll consider packaging and distributing it as an add-on for DocProject until I get some time to properly integrate it.  You'll be getting credit for your work, of course :)

- Dave