Sunday, December 19, 2010

First Post

First I would like to talk a little bit about myself.  I am (as of the time of writing) 25 years old.  I live in Charlotte, North Carolina.  I work as a software developer for a marketing company, writing in PHP, MySQL and Javascript.  This will primarily be a tech blog, with some other stuff mixed in (basically, whatever I feel like).  I have severe (self-diagnosed) ADHD, so bear with me.

Today's post is about a project I'm working on for my job.  For a specific website, we would like to have a good domain name suggestion tool.  When someone types in a keyword or a set of keywords, we want to suggest the best possible domain name for them (based on how short it is, how it would rank in the search engines, etc).  From a programmer's point of view, this is actually extremely difficult.  It involves knowing language context to find relevant related keywords, dealing with things like prepositions, word ordering, etc.  We haven't really decided how this is going to happen yet, but we are taking it one step at a time.

My first task is to start towards the end of the process, by assuming we already have the keywords to search for.  My company has a few particularly strong abilities, one of which is access to good search data.  So we have access to a dataset that basically maps a keyword set to its search ranking (think of the keyword as the array key, and the ranking as a numeric value).  The higher the value, the stronger the keyword.  My specific task was to find the fastest possible way of accessing this data.  We ruled the database out, as this has to be FAST.  The next approach I considered was local file access.  The keyword set is so large, that in memory operations are out of the question.  So we structured the file like this:

keyword1|2600
keyword2|502
keyword3|52230

Basically a giant keyword / value store.  My coworker started on this task by sorting the file by the relevancy, and doing a full token scan until he found the word.  It took, to put it frankly, forever.  More time than to use the database, which is not good.  So I put my school training to the test, and decided to use a different approach.  My thought was to look at this thing like a giant array, and I decided to try to do a binary search.

For all those not familiar, a binary search is quite simple conceptually.  If a list is sorted, one of the quickest ways to find something specific is to start in the middle, and compare it to the value you are searching for.  If your value is greater (in the case of text, we compare the starting characters), you have search the second half of the list, otherwise search the first half.  You have essentially eliminated half the list with just one operation.  Now, do the same thing again.  Eventually, you will either find what you are searching for, or you won't, and the number of operations taken to do this is very small (think of O log(n)).

This is a bit trickier to do when scanning a basic text file.  There is no concept of the number of lines, unless you scan the whole file and index it internally.  So I decided to bust out my current favorite language, Python, and take a whack at the problem.  Here is the basic idea of my algorithm:

Take the size of the file (in bytes).  So you now have your start (0), and your end (the size).  I then take the size, and divide in half.  I use the c based command, 'fseek' (or just 'seek' in Python), to position the position indicator directly to that byte offset in the file.  I then analyze the character at that position.  If it is not a newline character ('\n'), I then backtrack one bite at a time (decrementing my byte counter) until I find a newline character, then read the whole line in.  If I have a match, awesome.  Done.  Otherwise, I do the greater than / less than comparison.  I then repeat the steps (fseek, backtrack, compare) until I find my keyword or my start position exceed my end position, which is how I know that nothing was found.  Below is the Python code I used to accomplish).  By the way, in order to simplify the testing, the file ONLY contained keywords and not the relevancy, so I didn't have to split the line by the pipe character.



def getLineBeg(FileObject, pos):

        ''' Seek to our given binary position '''
        FileObject.seek(pos)

        while pos > 0 and FileObject.read(1) != "\n":
                pos -= 1
                FileObject.seek(pos)

        ''' Return our line and position '''
        return (FileObject.readline(), pos)




def binarySearch(searchWord):

        fileName = 'keywordfile.txt'

        ''' Open our file and get the stats (used for file size) '''
        FileObject = open(fileName)
        stat = os.stat(fileName)
        first = 0

        ''' This is how I get the byte size of the file '''
        last = stat[ST_SIZE]

        ''' Convert our word to lower case, easier to just keep consistency '''
        searchWord = string.lower(searchWord)
        numIterations = 0

        while (last >= first):

                mid = int((first + last) / 2)
                line, pos = getLineBeg(FileObject, mid)
                line = string.lower(line.strip())
                numIterations += 1

                if line == searchWord:
                        print('Num iterations: ' + str(numIterations))
                        return pos
                elif line < searchWord:
                        first = pos + len(line) + 1
                else:
                        last = pos - 1

        print('Num iterations: ' + str(numIterations))
        return False


This proved to be quite fast (I don't have the stats in front of me, but the result took hundreds of milliseconds.  Beat the hell out of a database solution).  But I felt like that could be faster.  The slowest solution took about half a second, which was good, but I knew that I could make it better.  This was the fastest possible solution given the file I had.  But then I realized, how could I reduce the number of iterations?  This is a binary search!  It then occurred to me that I could break the file into smaller files by partioning it by the first character of the keyword (having like 'a.txt, b.txt, etc'.  String operations in memory are pretty damn fast in modern scripting languages.  So I take the first character of the search word, do a direct file open on that text file, and do the binary search in there.  It only took a few moments to write a script to scan the file and break it out by first character, and one extra line in my code:


        fileName = 'alpha/' + searchWord[0] + '.txt'

Cut the search down into tens of milliseconds.  Mission accomplished (at least for now).  Sometimes school does pay off.

I have considered break the files out further by doing the first 2 characters, but I think it would be a waste of time considering the solution I came up with was performing more than fast enough.

2 comments:

  1. Your algorithm is incorrect. Try searching for a word that is greater than all the words in the file. e.g. File contains one word "abc", and you search for "def". It will seek out of bounds

    ReplyDelete
  2. No, it takes care of that. Once the "last" var becomes less than the "first" var, it terminates the search and considers the search to have failed. I ran it on a single line file with "abc" as the only test, and searched for "def". It correctly displays that it ran a single iteration, and that searching failed.

    ReplyDelete