MyConText
MyConText is a Perl module for indexing text documents (Perl scalars, files, web pages, database fields) in a MySQL database, so that queries such as "which documents contain the words penguin and yellow but not red" (written +penguin +yellow -red) can be processed. You can also index phrases, so that the query "little penguin" returns the list of documents in which these two words appear in this order. Work on support for more complex queries, such as "little penguin near swim% not penguin flies", is under way -- you are welcome to send patches for this feature.
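To make the +word / -word query semantics concrete, here is a minimal in-memory sketch in plain Perl. It is an illustration only, not the module's implementation: it uses no MySQL, and the sample documents and the helper names has_word and search are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical sample documents, keyed by document name.
my %docs = (
    'a.txt' => 'The yellow penguin swims',
    'b.txt' => 'A red and yellow penguin',
    'c.txt' => 'Little penguin in the snow',
);

# Build an inverted index: word => { document name => 1 }.
my %index;
for my $name (keys %docs) {
    $index{ lc $_ }{$name} = 1 for $docs{$name} =~ /\w+/g;
}

# True if the named document contains the word (no autovivification).
sub has_word {
    my ($word, $name) = @_;
    return exists $index{$word} && exists $index{$word}{$name};
}

# Evaluate a query of +word (required) and -word (forbidden) terms
# and return the sorted names of the matching documents.
sub search {
    my ($query) = @_;
    my (@required, @forbidden);
    for my $term (split ' ', $query) {
        if    ($term =~ /^\+(.+)/) { push @required,  lc $1 }
        elsif ($term =~ /^-(.+)/)  { push @forbidden, lc $1 }
    }
    my @hits;
    DOC: for my $name (sort keys %docs) {
        for my $word (@required)  { next DOC unless has_word($word, $name) }
        for my $word (@forbidden) { next DOC if     has_word($word, $name) }
        push @hits, $name;
    }
    return @hits;
}

print join(', ', search('+penguin +yellow -red')), "\n";   # prints "a.txt"
```

Only a.txt matches: b.txt contains the forbidden word red, and c.txt lacks the required word yellow.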
A Perl interface is available. It includes the create method for creating a new index, index_document for adding (or updating) information about a document in the index, and the methods contains and econtains for fetching the list of documents (document names, such as filenames or URLs) that contain a specified set of words or a phrase.
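A typical session with these methods might look like the sketch below. It is written from the usage described in the man page and has not been checked against a particular release, so treat the exact argument forms as assumptions and confirm them in the man page. The DSN, user name, password, index name ctx_web_1, the frontend and backend values, and the document names and texts are all placeholders. It needs a running MySQL server and the DBI, DBD::mysql and MyConText modules.

```perl
use strict;
use warnings;
use DBI;
use MyConText;

# Placeholder connection parameters -- replace with your own.
my $dbh = DBI->connect('dbi:mysql:database', 'user', 'passwd',
                       { RaiseError => 1 });

# Create a new index; the index name and option values are assumptions.
my $ctx = MyConText->create($dbh, 'ctx_web_1',
                            'frontend' => 'string', 'backend' => 'blob');

# Add (or update) documents, identified here by string names.
$ctx->index_document('penguins.txt', 'The yellow penguin swims');
$ctx->index_document('parrots.txt',  'A red and yellow parrot');

# Fetch the names of documents that contain a word ...
my @with_yellow = $ctx->contains('yellow');

# ... or that satisfy a query with required and forbidden words.
my @matches = $ctx->econtains('+penguin', '+yellow', '-red');

print "$_\n" for @matches;

$dbh->disconnect;
```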
Main goals were:
- use of MySQL, since the database provides remote access and access control and is generally nice; the mixture of database indexes for speed with blobs for compact storage seems very effective;
- many flexible ways of storing the data -- you can have an index that is small but slow to update, or a bigger index that is useful for documents that change often, or an index for phrases; read the man page to understand the concept of frontend and backend;
- documents may be indexed (named) either by integers or by string names -- conversion to internal numeric form (where needed) is done by the module;
- extensible design that provides easy ways of adding new storage backends or application frontends; this makes the module suitable for tests and benchmarks of various ideas;
- perlish ways of specifying things -- you can provide your own Perl code to specify how a document is divided into words and how words are treated; using stemming algorithms and the like is easy -- just specify your Perl code that will be called to do the job;
- useful for indexing mailing list archives or small to medium web page collections, or documents in general;
- command line utility to maintain the indexes -- you do not need to write Perl code to do tests or simple things.
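As an illustration of the "provide your own Perl code" goal above, the following sketch shows the kind of word-splitting and stemming routines one might write. The function names split_words, naive_stem and word_counts are hypothetical, the stemmer is deliberately naive, and how such code is handed to the module is described in the man page, not here.

```perl
use strict;
use warnings;

# Hypothetical splitter: returns the lowercased alphabetic words of a text.
sub split_words {
    my ($text) = @_;
    return map { lc $_ } $text =~ /[[:alpha:]]+/g;
}

# Deliberately naive "stemmer": strips a trailing "s" from longer words.
sub naive_stem {
    my ($word) = @_;
    $word =~ s/s$// if length($word) > 3;
    return $word;
}

# Count occurrences of each stemmed word in a text,
# roughly what an indexer needs before storing a document.
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{ naive_stem($_) }++ for split_words($text);
    return \%count;
}

my $counts = word_counts('Penguins swim; a penguin swims.');
print "$_: $counts->{$_}\n" for sort keys %$counts;
# prints:
#   a: 1
#   penguin: 2
#   swim: 2
```

With stemming applied, "Penguins" and "penguin" end up under the same index entry, so a query for either form finds the document.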
This module is currently on CPAN and SourceForge as DBIx::FullTextSearch. I haven't been participating in the development lately. The distribution MyConText-0.49.tar.gz was the last release under the MyConText name, and it comes with a man page.
Author
Copyright: (c) 1999 Jan Pazdziora.
All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.