Saturday, February 11, 2012

Learning New Technologies

As a tech guy and a developer, few things excite me more than learning something new (those things are NSFW).  In the last few months alone, thanks to my new position at work, I have been fortunate enough to tinker with several new(er) technologies, beyond simply reading about them.  Those techs include node.js (quickly becoming my 2nd favorite language/platform behind Python, maybe even contending for 1st), MongoDB, Redis, Silex (for PHP), and Infobright.  We are not exploring these for the cool factor alone, but for true, practical applications at work.  Traditionally, my work has always been strictly a PHP/MySQL shop.

What worries me, however, is that there are people in technology who are extremely reluctant to step outside their comfort zone.  I do understand the need for things to run smoothly with few to no errors.  My gripe, however, is when what we use is NOT working, and we still cling to the golden hammers to which we have become accustomed.  This is, unfortunately, becoming quite a common practice as technology companies mature and senior tech workers get older.

Now, this is where I am likely to be misunderstood.  I started off my tech career writing very basic C and C++ in school.  I learned PHP (and MySQL) to solve a specific problem I was taking a stab at for my uncle and my father's company.  They needed a web presence for the company (a car-donation-based charity), and the tool for managing online donations quickly grew into a full inventory control system.  Since there were few (reasonable) options at the time for writing web applications in C, I turned to an older friend and master techie for advice.  He recommended PHP (which had just crossed version 4.0 at the time), since it would be quick to get started with, easily hosted on shared hosting, and somewhat C-like in its syntax.  So away I chipped at getting Apache, PHP, and MySQL configured locally on Windows 98.  This was no easy task.

That small project, which ballooned into a huge one, is what basically launched my career as a programmer.  For over 10 years now, I have written various applications, mostly in PHP with MySQL as the storage engine.  The company I have worked at for the past 3 years is mainly a PHP/MySQL shop.  I owe my current success, salary, and the state of my career to this wonderful combination.  I am, however, a critic of said platform.  I will be the first to admit the limitations of using (what was designed as) a strictly web-based language (the CLI was woefully hacked in).  I still think it is great for web-based applications of many different flavors; however, I'm back to exploring more traditional client/server languages and platforms.  Daemon programs, for example, are horrendously painful to write in PHP.  Given frequent memory leaks, very few blocking options (blocking sockets are very, very unmaintainable in PHP), and high resource usage, they are just not a great option.  However, daemons are very useful in many applications.  The golden hammer of PHP is not a viable solution for this particular problem.

To my original point: some techies would be very comfortable spending many extra hours hacking around PHP to get it to work, and then weeks or months dealing with the operational fallout (this is especially bad for startups that don't have money for huge server setups and aren't ready for the switch to the cloud).  I'm not.  I've recently written things in node.js in hours that would have taken me weeks to implement in PHP, and then run them with minimal hardware and operational complexity.  The other side of that coin, however, is that node.js is very young and still not considered production ready (the current "stable" version is 0.7).  So you have to take that into consideration.

This was less of an educational post than usual, and more of a rant.  My apologies.  I know, TL;DR.  Whatever.

Monday, December 12, 2011

Node.js and process.nextTick - why you don't use it

Lately I have been messing with a new tool in my hypothetical toolbox - node.js.  Node.js is a platform for developing applications on the server using javascript (based on the V8 javascript engine used in Google Chrome).  The paradigm prevalent in node is event driven programming.  Node is designed so that a node process runs inside a single thread, and all of the IO calls (network, file access, database, etc) are asynchronous.  Most of this is done "under the hood."  One thing that node does NOT do, however, is run your code in parallel (unlike thread-ready languages such as Python).  This has the benefit of freeing you from worrying about shared resources (read: memory), but the drawback of CPU-intensive code blocking the current execution.

Why is this important?  Well, one of the mantras of node.js is to make sure you don't write blocking code.  Some developers hear this and try to find a way to make CPU-intensive work run asynchronously.  Browsing the documentation, they come across a method on the process object called "nextTick", which you can pass a callback to.  The documentation says that this pushes the execution of that callback to the next turn of the event loop.  This often gets interpreted as "runs the function in parallel."  False.  It simply defers the execution of that callback until the current execution is finished (read: finished blocking the CPU).  This means that if you have some really CPU-intensive code, don't attempt to use process.nextTick to prevent it from blocking requests.  There are some ways to mitigate this, such as spawning a new node process (not terribly efficient, but it gets the job done).
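Here is a minimal sketch of what I mean (the numbers and function names are made up, just for illustration).  The deferred callback still owns the one and only thread when it finally runs, so the timer below can't fire until the heavy loop finishes:

function heavyComputation() {
    // simulate CPU-bound work: this loop blocks the single thread while it runs
    var sum = 0;
    for (var i = 0; i < 1e9; i++) {
        sum += i;
    }
    return sum;
}

console.log('scheduling work');

process.nextTick(function () {
    // this runs after the current call stack unwinds...
    heavyComputation();
    // ...but while it runs, every other callback (timers, IO) is starved
    console.log('heavy work done');
});

setTimeout(function () {
    // does not fire until heavyComputation() has finished blocking the loop
    console.log('timer fired');
}, 10);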

One important thing to note is that control flow libraries like async.js have some misleading method names.  For example, async.js has a method called "parallel."  This is very misleading, because at its core, it uses process.nextTick.  Parallel is really used to coordinate the execution of several methods that make asynchronous IO calls.  Your code, however, is always single threaded (although you cannot guarantee the order in which the callbacks run).  So if you are trying to parallelize CPU-heavy code blocks using this library, well, you are out of luck.
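A rough sketch to make that concrete (the file paths are just placeholders).  The two reads overlap at the IO level, but the javascript callbacks still run one at a time on the single thread:

var async = require('async');
var fs = require('fs');

async.parallel([
    function (callback) {
        // the IO happens "in the background"; only the callback runs on our thread
        fs.readFile('/etc/hosts', 'utf8', callback);
    },
    function (callback) {
        fs.readFile('/etc/resolv.conf', 'utf8', callback);
    }
], function (err, results) {
    // both reads have completed here, but no two lines of our own
    // javascript ever executed at the same time
    console.log(err, results && results.length);
});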

So, to sum this all up, if you are using process.nextTick, there is a better than good chance you are looking at your project the wrong way.  Remember that while IO is asynchronous, node.js code is not.  Even though it's typically very, very fast :)  I'm quickly growing to love this platform for development; it scales very well.  Nginx uses a very similar model to produce an extremely resource-efficient, highly scalable server.

Happy coding.

Monday, June 27, 2011

PHP class x has no unserializer

I apologize for not having posted in quite a while, but I have been quite busy lately.  Writing lots of new code, developing a few new systems, trying to make everyone's lives better.  And, in doing that, I would like to post a rare problem on here, because this is one of the few problems I've had where Google was no help whatsoever.  Here goes:


Where I work, we use memcache to help make things a little bit faster.  For those who don't know, memcache is a daemon (written in C, very fast) that basically accepts primitive data types (except anything equating to false) and keeps them in memory until they are needed again.  The PHP extension for memcache takes the data, serializes it (turns it into a string representation of whatever object you have given it), and stores it in memory (this post can also serve as my argument for avoiding serialization wherever possible, especially given its lack of interoperability with other programming languages).  Recently, we have been getting an intermittent error when retrieving items out of the cache:

Warning:  Class Collection has no unserializer (Cache line 73)


A quick Google search returned almost nothing useful.  A good friend and colleague of mine and I took to the memcache source code, and finally the PHP source code, to try to find a solution.  There was certainly nothing obvious, except that the error is triggered when the object is unserialized (duh).  Also, we could not reproduce this error in our development environment (only in production).

The only things we did know were that it only happened on our 'Collection' class, and that it only happened in specific systems.  My incredibly keen colleague figured out that these systems were all part of the same Zend cluster, which was one of our last clusters still on PHP 5.2 (we are in the process of upgrading).  I should also point out that our Collection class extends 'ArrayObject', a built-in SPL class that is great for treating arrays as objects.  Well, as luck would have it, ArrayObject implements an interface in 5.3 that is not available in 5.2, called 'Serializable'.  This interface allows customization of the serialize and unserialize functions.  The warning we were receiving occurred when a collection was cached by a script running on a 5.3 server and retrieved by a script running on a 5.2 server.  It seems like this should have been a fatal error, but apparently it isn't, because of the way the PHP source registers the custom serialization handlers.
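To make the mechanics a little more concrete, here is a rough sketch of the mismatch as I understand it (the class body and values are just for illustration):

<?php
// On PHP 5.3, ArrayObject (and therefore our subclass) implements Serializable,
// so serialize() emits the "custom" C: format instead of the plain O: object format:
//
//   C:10:"Collection":...:{...}
//
// A 5.2 box handed this C: payload has no unserialize() handler registered for
// the class, so it emits the "Class Collection has no unserializer" warning.

class Collection extends ArrayObject {}

$collection = new Collection(array(1, 2, 3));

$payload = serialize($collection); // "C:..." on 5.3+, "O:..." on 5.2
var_dump($payload);

// This payload sat in memcache; a 5.2 script pulling it back out and calling
// unserialize() on it is what produced the warning for us.
$copy = unserialize($payload); // fine on 5.3, warns on 5.2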

Anyways, the simple answer is to create a new memcache pool for our machines still on 5.2, separate from the 5.3 boxes.  The long term solution is to finish the upgrade :)

The lesson here: be careful with problems that don't seem like they could stem from a version upgrade.  Also, always be wary of using serialize / unserialize for transporting data.  If you ever wanted to use a different language to parse the data, you would be pretty much SOL.  I recommend json_encode / json_decode wherever possible, because then you start thinking in terms of data, as opposed to specific language constructs.
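A quick illustration of the difference (made-up data):

<?php
// serialize() ties the payload to PHP (and, as we just learned, to the PHP version),
// while json_encode() keeps it readable from any language.

$data = array('id' => 42, 'tags' => array('cars', 'charity'));

$phpOnly  = serialize($data);    // a:2:{s:2:"id";i:42;...}  -- PHP-specific
$portable = json_encode($data);  // {"id":42,"tags":["cars","charity"]} -- anything can parse this

var_dump(json_decode($portable, true) === $data); // bool(true)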

Happy debugging

Wednesday, December 29, 2010

Quick hit post

I'm pretty busy today, so I'm not going to talk long.  One thing I did learn today:  in PHP (and presumably other languages), when using the built-in mcrypt library for 2-way encryption, the mcrypt_encrypt function returns a string with non-ascii characters (i.e. - binary data).  This is a potentially big problem when setting cookies, as it may produce unexpected results (i.e. - not freaking working).  The way to resolve this is:  base64_encode.  This method will take the binary string returned from mcrypt_encrypt and turn it into an ascii-friendly string, which can be safely stored in a cookie.  When decrypting, simply run the string through base64_decode (before decrypting), and you will get the correct decryption.  Sweet!
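A rough sketch of the whole round trip (the key, cookie name, cipher, and mode here are just placeholders; in a real app you would also have to store the IV somewhere so it is available again at decrypt time):

<?php
$key       = 'example-secret-key-32-bytes-long'; // placeholder key
$plaintext = 'user_id=1234';

// mcrypt needs an IV for CBC mode; it has to be available again when decrypting
$iv = mcrypt_create_iv(mcrypt_get_iv_size(MCRYPT_RIJNDAEL_128, MCRYPT_MODE_CBC), MCRYPT_RAND);

// this comes back as raw binary -- NOT cookie safe
$binary = mcrypt_encrypt(MCRYPT_RIJNDAEL_128, $key, $plaintext, MCRYPT_MODE_CBC, $iv);

// base64_encode turns it into plain ascii, which cookies handle just fine
setcookie('secure_data', base64_encode($binary));

// later, reading it back: decode first, then decrypt, then strip the null padding
$decrypted = rtrim(
    mcrypt_decrypt(MCRYPT_RIJNDAEL_128, $key, base64_decode($_COOKIE['secure_data']), MCRYPT_MODE_CBC, $iv),
    "\0"
);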

Sunday, December 19, 2010

First Post

First I would like to talk a little bit about myself.  I am (as of the time of writing) 25 years old.  I live in Charlotte, North Carolina.  I work as a software developer for a marketing company, writing in PHP, MySQL and Javascript.  This will primarily be a tech blog, with some other stuff mixed in (basically, whatever I feel like).  I have severe (self-diagnosed) ADHD, so bear with me.

Today's post is about a project I'm working on for my job.  For a specific website, we would like to have a good domain name suggestion tool.  When someone types in a keyword or a set of keywords, we want to suggest the best possible domain name for them (based on how short it is, how it would rank in the search engines, etc).  From a programmer's point of view, this is actually extremely difficult.  It involves knowing language context to find relevant related keywords, dealing with things like prepositions, word ordering, etc.  We haven't really decided how this is going to happen yet, but we are taking it one step at a time.

My first task is to start towards the end of the process, by assuming we already have the keywords to search for.  My company has a few particularly strong abilities, one of which is access to good search data.  So we have access to a dataset that basically maps a keyword set to its search ranking (think of the keyword as the array key, and the ranking as a numeric value).  The higher the value, the stronger the keyword.  My specific task was to find the fastest possible way of accessing this data.  We ruled the database out, as this has to be FAST.  The next approach I considered was local file access.  The keyword set is so large that in-memory operations are out of the question.  So we structured the file like this:

keyword1|2600
keyword2|502
keyword3|52230

Basically a giant keyword / value store.  My coworker started on this task by sorting the file by the relevancy and doing a full token scan until he found the word.  It took, to put it frankly, forever.  More time than using the database, which is not good.  So I put my school training to the test and decided to use a different approach.  My thought was to treat this thing like a giant array, and I decided to try a binary search.

For all those not familiar, a binary search is quite simple conceptually.  If a list is sorted, one of the quickest ways to find something specific is to start in the middle and compare it to the value you are searching for.  If your value is greater (in the case of text, we compare the starting characters), you search the second half of the list; otherwise, you search the first half.  You have essentially eliminated half the list with just one operation.  Now, do the same thing again.  Eventually, you will either find what you are searching for, or you won't, and the number of operations taken to do this is very small (think O(log n)).

This is a bit trickier to do when scanning a basic text file.  There is no concept of the number of lines, unless you scan the whole file and index it internally.  So I decided to bust out my current favorite language, Python, and take a whack at the problem.  Here is the basic idea of my algorithm:

Take the size of the file (in bytes).  So you now have your start (0), and your end (the size).  (For this to work at all, the file has to be sorted alphabetically by keyword.)  I then take the size and divide it in half.  I use the C-based command 'fseek' (or just 'seek' in Python) to move the position indicator directly to that byte offset in the file.  I then analyze the character at that position.  If it is not a newline character ('\n'), I backtrack one byte at a time (decrementing my byte counter) until I find a newline character, then read the whole line in.  If I have a match, awesome.  Done.  Otherwise, I do the greater than / less than comparison.  I then repeat the steps (fseek, backtrack, compare) until I find my keyword or my start position exceeds my end position, which is how I know that nothing was found.  Below is the Python code I used to accomplish this.  By the way, in order to simplify the testing, the file ONLY contained keywords and not the relevancy, so I didn't have to split the line by the pipe character.



import os
import string
from stat import ST_SIZE


def getLineBeg(FileObject, pos):

        ''' Seek to our given binary position '''
        FileObject.seek(pos)

        ''' Walk backwards one byte at a time until we hit the start of a line '''
        while pos > 0 and FileObject.read(1) != "\n":
                pos -= 1
                FileObject.seek(pos)

        ''' Return our line and position '''
        return (FileObject.readline(), pos)


def binarySearch(searchWord):

        fileName = 'keywordfile.txt'

        ''' Open our file and get the stats (used for file size) '''
        FileObject = open(fileName)
        stat = os.stat(fileName)
        first = 0

        ''' This is how I get the byte size of the file '''
        last = stat[ST_SIZE]

        ''' Convert our word to lower case, easier to just keep consistency '''
        searchWord = string.lower(searchWord)
        numIterations = 0

        while (last >= first):

                ''' Jump to the middle of the current range and read the line there '''
                mid = int((first + last) / 2)
                line, pos = getLineBeg(FileObject, mid)
                line = string.lower(line.strip())
                numIterations += 1

                if line == searchWord:
                        print('Num iterations: ' + str(numIterations))
                        return pos
                elif line < searchWord:
                        ''' Our word sorts after this line: search the back half '''
                        first = pos + len(line) + 1
                else:
                        ''' Our word sorts before this line: search the front half '''
                        last = pos - 1

        print('Num iterations: ' + str(numIterations))
        return False


This proved to be quite fast (I don't have the stats in front of me, but the result took hundreds of milliseconds, which beat the hell out of a database solution).  But I felt like it could be faster.  The slowest searches took about half a second, which was good, but I knew I could make it better.  This was about the fastest possible solution given the file I had, so the question became: how could I reduce the number of iterations?  This is a binary search!  It then occurred to me that I could break the file into smaller files by partitioning it on the first character of the keyword (having files like 'a.txt', 'b.txt', etc.).  String operations in memory are pretty damn fast in modern scripting languages.  So I take the first character of the search word, do a direct file open on that text file, and do the binary search in there.  It only took a few moments to write a script to scan the file and break it out by first character (there's a rough sketch of that step at the bottom of this post), and one extra line in my code:


        fileName = 'alpha/' + searchWord[0] + '.txt'

That cut the search down to tens of milliseconds.  Mission accomplished (at least for now).  Sometimes school does pay off.

I have considered breaking the files out further by keying on the first 2 characters, but I think it would be a waste of time, considering the solution I came up with already performs more than fast enough.
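For reference, here is a quick sketch of the kind of splitting script I'm describing (the file and directory names match the examples above, and it assumes the keyword-only source file is already sorted, so each per-letter file stays sorted too):

import os

''' Scan the sorted keyword file once and write each line into a
    per-letter file under alpha/ (a.txt, b.txt, etc) '''

sourceFile = 'keywordfile.txt'
outputDir = 'alpha'

if not os.path.isdir(outputDir):
        os.mkdir(outputDir)

outputFiles = {}

for line in open(sourceFile):
        word = line.strip().lower()
        if not word:
                continue
        firstChar = word[0]
        if firstChar not in outputFiles:
                ''' Lazily open one output file per starting character '''
                outputFiles[firstChar] = open(os.path.join(outputDir, firstChar + '.txt'), 'w')
        outputFiles[firstChar].write(word + '\n')

''' Close everything so the data is flushed to disk '''
for handle in outputFiles.values():
        handle.close()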