Creating a Personal Search Engine
Written by TecnoApoyo
Search facilities have become an expected part of every web site, but relying on an outside service to provide them is not always possible. For example, if yours is a personal web site that is not always connected to the Internet, or you are in charge of an Intranet with confidential information, you may not want to, or may not be able to, use the site indexing capabilities of commercial search engines like Beseen or Altavista. That is exactly why we set out to implement a simple text search facility with the tools we already have: an ASP-capable web server and the VBScript objects.

Our solution is based principally on two objects available to VBScript: the FileSystemObject, in charge of retrieving the target pages' text, and the RegExp object, which does the actual search and extracts each document's title. We encapsulated the search functionality in two self-contained procedures to give us flexibility in the search page design. This means that you can change the search page to match the look and feel of your site without major changes to the code.

Our program relies heavily on the RegExp object. This object allows us to perform search or search-and-replace operations using 'Regular Expressions'. A regular expression is a pattern of text that consists of ordinary characters and special characters, known as metacharacters. The pattern describes one or more strings to match when searching a body of text, so the regular expression serves as a template for matching a character pattern against the string being searched. For more information on Regular Expressions and Scripting Technologies in general, please refer to the Microsoft Scripting Technologies web site at http://msdn.microsoft.com/scripting/default.htm.
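To make the description above concrete, here is a minimal sketch of the RegExp object's three methods (Test, Execute and Replace). The sample text and the variable names are made up for illustration and are not part of the search engine itself.

```vbscript
<%
' Minimal sketch of the RegExp methods; sampleText is an invented example
Dim re, sampleText, matches, m

sampleText = "ASP pages and asp scripts"

Set re = New RegExp
re.Pattern = "asp"      ' ordinary characters only, no metacharacters
re.IgnoreCase = True    ' match "ASP" as well as "asp"
re.Global = True        ' find every match, not just the first

' Test returns True if the pattern occurs anywhere in the string
Response.Write re.Test(sampleText) & "<br>"

' Execute returns a Matches collection; each Match has Value and FirstIndex
Set matches = re.Execute(sampleText)
For Each m In matches
    Response.Write m.Value & " found at offset " & m.FirstIndex & "<br>"
Next

' Replace substitutes every match because Global is True
Response.Write re.Replace(sampleText, "XXX") & "<br>"

Set matches = Nothing
Set re = Nothing
%>
```

The search engine in this article only needs Execute, and of its result only the Count property and the first item.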

We begin with our starting procedure, the one we call to launch the search process. It takes a single parameter, SearchString, that holds our search criteria.

First, we do a standard instantiation of the objects. Second, we set up the RegExp objects, and here's where the magic begins. Setting a RegExp's Global property to True instructs the object to find every match of our search pattern. If we set it to False, as is the case for the GetTitle object, the search stops at the first match found. The IgnoreCase property should be self-explanatory: it simply instructs the object to do case-insensitive searches. The Pattern property is where we state the search expression.

Note the difference between Regex.Pattern and GetTitle.Pattern below. In the former we just feed in the content of the SearchString parameter as it came from the user, so any metacharacters the user types are interpreted as part of a regular expression. In the latter we construct a special pattern to match text enclosed in <title> tags. Observe in the code window below the metacharacters right between <title> and </title>. The parentheses group the two alternatives. The . matches any single character except the newline character (vbLf in VBScript; vbCrLf is a carriage return followed by that newline), and \n matches the newline character itself. The pipe character | between them means "or", and the asterisk * matches zero or more occurrences of the preceding group. In summary, this pattern matches anything (any number of characters or newlines) between <title> and </title>.

Third, we make sure that our path variables end with their trailing slashes, as we will be using them as the base path for our matched documents. Fourth, we start the actual search process by calling the SearchFiles procedure. Fifth and last, we display a message if no matches were found and we release the objects. Find below the code for our starting procedure.

Listing 1 - Starting Procedure

<%

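Sub Search(SearchString)

' fs, Regex, GetTitle, RootFolder and rfLen are shared with SearchFiles, so
' they must exist as page-level variables, like RootFld and the counters
' (declarations not shown)
Set fs = CreateObject("Scripting.FileSystemObject")
Set GetTitle = New RegExp
Set Regex = New RegExp

With Regex
    ' Find every occurrence of the user's search expression
    .Global = True
    .IgnoreCase = True
    .Pattern = Trim(SearchString)
End With
With GetTitle
    ' Stop at the first <title>...</title> match
    .Global = False
    .IgnoreCase = True
    .Pattern = "<title>(.|\n)*</title>"
End With

' Translate the virtual root into a physical path
RootFolder = Server.MapPath(RootFld)

' Make sure both the virtual and the physical path end with a separator
If Right(RootFld, 1) <> "/" Then
    RootFld = RootFld & "/"
End If

If Right(RootFolder, 1) <> "\" Then
    RootFolder = RootFolder & "\"
End If
' Position where the path relative to the root starts
rfLen = Len(RootFolder) + 1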

SearchFiles RootFolder

If MatchedCount = 0 Then
   Response.Write "&nbsp;&nbsp;<B>No Matches Found.</b><BR>"
End If

Set Regex = Nothing
Set GetTitle = Nothing
Set fs = Nothing
    
End Sub

%>
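Because the user's text is used directly as a pattern, input such as C++ or an unbalanced parenthesis is treated as a regular expression and may fail or match something unexpected. If a literal text search is preferred, the input can be escaped first. The following is only a sketch under that assumption: EscapePattern is a hypothetical helper, not part of the original listing.

```vbscript
<%
' Hypothetical helper: prefixes every regular expression metacharacter with
' a backslash so that the text is matched literally
Function EscapePattern(text)
    Dim metaChars, i, ch, result
    ' The backslash comes first so that the backslashes added for the
    ' other characters are not escaped a second time
    metaChars = "\^$.|?*+()[]{}"
    result = text
    For i = 1 To Len(metaChars)
        ch = Mid(metaChars, i, 1)
        ' VBScript's built-in Replace function does a plain text replacement
        result = Replace(result, ch, "\" & ch)
    Next
    EscapePattern = result
End Function

' Example: writes C\+\+ \(beta\)
Response.Write EscapePattern("C++ (beta)")
%>
```

In Listing 1 it would be applied as .Pattern = EscapePattern(Trim(SearchString)). The trade-off is that users can then no longer type regular expressions of their own into the search box.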

~~*~~

The next part of our project is the search engine itself. The engine takes the form of a self-calling procedure, otherwise known as a recursive procedure. We chose recursion to simplify the process of traversing a directory tree. Note that in a recursive procedure a new and independent set of local variables and objects is created each time it is called.

First, we get the current 'root' folder, where files and other folders may exist. Then we iterate through each file in the folder, comparing each file's extension to a global variable (not shown) holding a list of extensions for valid files (e.g. html, asp, txt, etc.). If the extension is in the list, the file is opened to get the text it contains, and the RegExp search is applied. If the search returns one or more matches, we then try to get hold of the document's title by executing the GetTitle RegExp search. This, of course, will only return something for HTML and some ASP files. If we find a title, we use it as our results entry text; otherwise we use the file name.

Note that we need to strip out the <title> tags ourselves. In version 5.5 of the scripting engine (released with Internet Explorer 5.5) a SubMatches collection is available. It returns the text of the captured matches, that is, the parts of a pattern enclosed in parentheses, which avoids the need to clean up the match manually. Unfortunately, there's no SubMatches collection in the more popular versions 4 or 5 of the scripting engine.

Anyway, once we have our entry's name, we proceed to construct the line that will be displayed on our results page. We add some miscellaneous (also known as fancy or mostly useless) information to the entry, and do some HTML formatting as we go. Check out the somewhat commented code for the recursive procedure below.

Listing 2 - Recursive Search Procedure


<%

Sub SearchFiles(FolderPath)
Dim fsFolder
Dim fsFolder2
Dim fsFile
Dim fsText
Dim FileText
Dim FileTitle
Dim FileTitleMatch
Dim MatchCount
Dim OutputLine

' Get the starting folder
Set fsFolder = fs.GetFolder(FolderPath)
' Iterate through every file in the folder
For Each fsFile In fsFolder.Files
    ' Compare the current file extension with the list of valid target files
    If InStr(1, ValidFiles, Right(fsFile.Name, 3), vbTextCompare) > 0 Then
        DocCount = DocCount + 1
        ' Open the file to read its content
        Set fsText = fsFile.OpenAsTextStream
            ' ReadAll raises an error on an empty file, so check for content first
            If fsText.AtEndOfStream Then
                FileText = ""
            Else
                FileText = fsText.ReadAll
            End If
            ' Apply the regex search and get the count of matches found
            MatchCount = Regex.Execute(FileText).Count
            MatchedCount = MatchedCount + MatchCount
            If  MatchCount > 0 Then
                DocMatchCount = DocMatchCount + 1
                ' Apply another regex to get the html document's title
                Set FileTitleMatch = GetTitle.Execute(FileText)
                If FileTitleMatch.Count > 0 Then
                    ' Strip the title tags
                    FileTitle = Trim(replace(Mid(FileTitleMatch.Item(0),8),"</title>","",1,1,1))
                    ' In case the title is empty
                    If FileTitle = "" Then
                     FileTitle = "No Title (" & fsFile.Name & ")"
                    End If
                Else
                    ' Create an alternate entry name (if no title found)
                    FileTitle = "No Title (" & fsFile.Name & ")"
                End If
                ' Create the entry line with proper formatting
                ' Add the entry number
                OutputLine = "&nbsp;&nbsp;<b>" & DocMatchCount & ".</b>&nbsp;"
                ' Add the document name and link
                OutputLine = OutputLine & "<a href=" & Chr(34) & RootFld & _
                    Replace(Mid(fsFile.Path, rfLen), "\", "/") & Chr(34) & "><b>"
                OutputLine = OutputLine & FileTitle & "</b></a>"
                ' Add the document information
                OutputLine = OutputLine & _
                    "<font size=1><br>&nbsp;&nbsp;Criteria matched " & MatchCount & _
                    " times - Size: "
                OutputLine = OutputLine & FormatNumber(fsFile.Size / 1024, 2, -1, 0, -1) & "K bytes"
                OutputLine = OutputLine & " - Last Modified: " & _
                    FormatDateTime(fsFile.DateLastModified, vbShortDate) & "</font><br>"
                ' Display entry 
                Response.Write OutputLine
                Response.Flush
            End If
        fsText.Close
    End If
Next

' Iterate through each subfolder and recursively call this procedure
For Each fsFolder2 In fsFolder.SubFolders
    SearchFiles fsFolder2.Path
Next

' Do some objects clean-up
Set FileTitleMatch = Nothing
Set fsText = Nothing
Set fsFile = Nothing
Set fsFolder2 = Nothing
Set fsFolder = Nothing
End Sub

%>
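For readers running version 5.5 or later of the scripting engine, the title extraction could lean on SubMatches instead of stripping the tags by hand. This is only a sketch: it reuses the title pattern from Listing 1 with one extra pair of parentheses around the content, and the sample HTML and variable names are invented for illustration.

```vbscript
<%
' Sketch for scripting engine 5.5 or later (SubMatches is not available
' in earlier versions). The sample HTML is made up.
Dim titleRe, html, found

html = "<html><head><TITLE> My Page </TITLE></head><body>Some text</body></html>"

Set titleRe = New RegExp
titleRe.Global = False
titleRe.IgnoreCase = True
' The added outer parentheses capture everything between the tags
titleRe.Pattern = "<title>((.|\n)*)</title>"

Set found = titleRe.Execute(html)
If found.Count > 0 Then
    ' SubMatches(0) holds the text captured by the first pair of parentheses
    Response.Write Server.HTMLEncode(Trim(found.Item(0).SubMatches(0)))
End If

Set found = Nothing
Set titleRe = Nothing
%>
```

Keep in mind that Trim only removes spaces, so a title that spans several lines would still carry its line breaks.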

 

As you can see, it is very easy to create a simple search engine without spending big bucks on third-party solutions. Bear in mind that this is a very simplistic approach to the search engine problem. Aside from the absent-minded nature of this engine (it will match text inside code procedures or inside HTML tags, which is not always desirable), a robust solution would index each file in a separate process and store the information in a database for fast retrieval. Even so, the solution presented here is sure to satisfy many web developers in need of a simple search facility, and it certainly demonstrates what can be done with the sometimes neglected tools available in every ASP developer's toolbox.
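One cheap way to reduce the matches inside HTML tags mentioned above is to remove the tags from the text before searching it. The following is a rough sketch only: StripTags is a hypothetical helper that is not part of the listings, and it will be confused by a > character inside an attribute value or inside script code.

```vbscript
<%
' Hypothetical helper: returns the text with every HTML tag replaced by a space
Function StripTags(text)
    Dim tagRe
    Set tagRe = New RegExp
    tagRe.Global = True
    ' A "<", then any run of characters other than ">", then the closing ">"
    tagRe.Pattern = "<[^>]*>"
    StripTags = tagRe.Replace(text, " ")
    Set tagRe = Nothing
End Function

' Example: the tags are replaced by spaces and only the visible text remains
Response.Write Server.HTMLEncode(StripTags("<p>Hello <b>world</b></p>"))
%>
```

In Listing 2 the search could then run against StripTags(FileText), while the title lookup keeps using the original FileText so that the <title> tags are still there to be found.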

Feel free to send your comments and suggestions to sixto@tecnoapoyo.com (threat mail is strongly discouraged).