Is There an Open-Source Address Scrubber?
The "Fast Lane" Answer
Slow down a second. Before we go any further, we need to determine two things. First, do you mean "open source" or "free"? Second, when you say "address scrubber," are you referring to parsing the address? Or do you mean standardizing it? The distinctions are important, because they change the answers you need.
You know what? We'll just answer everything.
Open Source vs. Free: Open source means that the source code is visible to the public. Free means there's no payment required to use it.
Parsing vs. Standardization: Parsing an address means breaking it into smaller chunks and labeling those chunks for easier processing. Standardizing means reformatting an address so that it looks the way the USPS wants it to look. The difference is a fine line, though, and most tools that do one also do the other. Some, however, go a step further and offer address validation, a process that incorporates the other two.
Below is a better explanation of the differences, and how knowing what to ask for helps you find what you want.
The "Scenic Route" Answer
Open Source vs. Free
The open dialog that exists between developers and users in an open-source format is unique—by opening the source code to the public, developers are offering users the chance to participate in development. They invite everyone to be the developers of the program, to participate in the creative process, and to help make the code (and by association the program as a whole) better.
Open source, however, does not automatically indicate "free." Linux, for example, is open source and free to use in most forms, but there are other open-source offerings that are not.
Free is a word that here means: without cost, requires no fee to participate/use, charges you nothing, etc. Free is when you receive a gift for your birthday, or find a five-dollar bill on the ground. But free doesn't always offer freedom. Sometimes it resembles a borrowing system: "Yeah, you can use my stuff, but keep it the way I had it." Adobe Acrobat Reader is free to use. But it's also proprietary software (i.e., not open source), meaning you can't pop the hood and tinker with the engine. Adobe keeps that locked down tight.
So keep in mind what you're looking for. Do you want to pay nothing? Or do you want a big sandbox, open to any sort of creation or destruction you wish to bestow upon it?
Address Parsing vs. Address Standardization
Parsing is complicated. It tries to determine the intent of content. It has many applications, not just regarding the processing of an address. At its core it's an attempt to have a computer decipher the meaning of human communication. Normally computers work best when we speak to them in their language; parsing is a computer trying to speak ours.
Time to get down to brass tacks: when a program parses input, it tries to break it into pieces and categorize them. For instance, if someone were to take an address (Spider-Man's house, maybe: 20 Ingram St. Queens, New York) and parse it, they would run it through a program that has to decide for itself which portion of the address is the city, which part is the street, and which part is the house number. Then it would label each piece: "20" is the house number, "Ingram" is the street, "Queens" is the city, and so forth.
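To make that concrete, here is a minimal sketch of a regex-based parser in Python. The pattern, the short list of street suffixes, and the `parse_address` name are all assumptions made up for this illustration; a real parser has to cope with far more variety than this one does.

```python
import re

# Hypothetical pattern: house number, street name, one of a few
# street suffixes, then "City, State". Real addresses vary far more.
ADDRESS_PATTERN = re.compile(
    r"^(?P<house_number>\d+)\s+"
    r"(?P<street>.+?)\s+"
    r"(?P<suffix>St|Ave|Blvd|Rd)\.?\s+"
    r"(?P<city>[^,]+),\s*"
    r"(?P<state>.+)$"
)


def parse_address(text):
    """Return a dict of labeled components, or None if the pattern does not fit."""
    match = ADDRESS_PATTERN.match(text.strip())
    return match.groupdict() if match else None


print(parse_address("20 Ingram St. Queens, New York"))
```

For Spidey's address, the sketch labels "20" as the house number, "Ingram" as the street, "St" as the suffix, "Queens" as the city, and "New York" as the state. Anything that doesn't fit the pattern simply comes back as `None`.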
That would be a pretty simple process (any parsing would be straightforward, really), if we as humans weren't so fond of ambiguity and repetition.
Take the above example again. As a human, it's pretty easy to assume that the city is "Queens, New York." Queens is pretty famous, and there's no "Saint Queens," New York. So it's an easy call.
But not everywhere is like that.
Take Helena, California. There are two of them: Helena and St. Helena. So to illustrate the problem here, let's relocate Spidey to the west coast. That way, when the computer goes to parse 20 Ingram St. Helena, California, it has no way of knowing if it's Ingram St. in Helena, California, or if it's Ingram whatever in St. Helena, California. That leaves the computer to try to determine intent, and hope for the best. This is a problem inherent in using regular expressions (aka regex) for parsing addresses.
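The ambiguity is easy to reproduce. The sketch below, written in Python with two patterns invented for this example, encodes both readings of "St." and collects every one that matches. The names `SUFFIX_READING`, `SAINT_READING`, and `candidate_readings` are hypothetical.

```python
import re

# Reading A: "St." is a street suffix, so the city is "Helena".
SUFFIX_READING = re.compile(
    r"^(?P<house_number>\d+)\s+(?P<street>\S+\s+St\.)\s+"
    r"(?P<city>[^,]+),\s*(?P<state>.+)$"
)

# Reading B: "St." begins the city name, so the city is "St. Helena".
SAINT_READING = re.compile(
    r"^(?P<house_number>\d+)\s+(?P<street>\S+)\s+"
    r"(?P<city>St\.\s+[^,]+),\s*(?P<state>.+)$"
)


def candidate_readings(text):
    """Return every labeling of the address that one of the patterns accepts."""
    readings = []
    for pattern in (SUFFIX_READING, SAINT_READING):
        match = pattern.match(text.strip())
        if match:
            readings.append(match.groupdict())
    return readings


for reading in candidate_readings("20 Ingram St. Helena, California"):
    print(reading["street"], "|", reading["city"])
```

Both patterns match, so the function returns two equally plausible answers: "Ingram St." in "Helena" and "Ingram" in "St. Helena". Nothing in the text itself says which one was intended. The same two patterns also both match the Queens address, which a human would never misread; the regex has no idea that "St. Queens" isn't a place.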
As hard and as complicated as that is, parsing still has value. Assuming the parser correctly guessed the intent of the entry, the parsed address can then be run through a secondary process, as is done in industries like address verification/validation. But more on that later.
So if you're looking to have your address broken down and identified, you're looking for parsing. It's a complicated method, but it has its payoffs.
Standardization is when an address is reformatted to match the standards set by the USPS so it can be processed by them. When run through a standardizing program, addresses are "cleaned up": commas are added between city and state if they're missing, words like "street" and "boulevard" are abbreviated properly, words are capitalized in keeping with the USPS system, and so on. In short, it's the process of making an address look like it was written by the postman himself.
Standardizing doesn't break anything down into components. Nothing is labeled or categorized by the end of the process. The word "street" is recognized because it's the word "street," not because of its position in the syntax of the address.
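A word-by-word standardizer can be sketched in a few lines of Python. This is only an illustration: the abbreviation table is a tiny hand-picked subset, the all-caps output is an assumption about the target format, and the `standardize` name is hypothetical.

```python
import re

# A small, hand-picked subset of street-suffix abbreviations.
SUFFIX_ABBREVIATIONS = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "ROAD": "RD",
    "DRIVE": "DR",
}


def standardize(text):
    """Uppercase every word and abbreviate known suffix words, wherever they appear."""

    def replace_word(match):
        word = match.group().upper()
        return SUFFIX_ABBREVIATIONS.get(word, word)

    collapsed = " ".join(text.split())  # squeeze repeated whitespace
    return re.sub(r"[A-Za-z]+", replace_word, collapsed)


print(standardize("20 ingram street,  queens, new york"))
```

The result is "20 INGRAM ST, QUEENS, NEW YORK". Notice that the function never decides which word is the street and which is the city. It swaps "street" for "ST" purely because of what the word is, so an input like "1 Street Road" would come out as "1 ST RD".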
So if you're looking to shape up that address list so it looks more professional, you're looking for standardization.
The ultimate form of "address scrubbing" is address validation (also known as address verification). Validation runs an address through a string of processes, and does at least three things: it parses the address, it standardizes the address, and it validates it. The validation part means that an address is compared against an authoritative database and checked to see if it is real.
An address currently receiving mail from the postal system will be returned by our system as valid, and is a real address. Addresses that, for whatever reason, do not currently receive mail are marked as invalid. That doesn't necessarily mean that the address is fictional; it just means that it's not currently in the system. An invalid address may or may not be real; a valid address is real for sure.
An address has to be standardized before being compared to the system, and the system has the addresses already broken down into parts. So, as part of the validation process, the address will be standardized, then compared, then returned with components accurately labeled.
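The standardize, compare, return sequence can be sketched as follows. Everything here is a stand-in: `REFERENCE_ADDRESSES` is a one-entry toy table playing the part of an authoritative database, and the `standardize` and `validate` functions are hypothetical simplifications, not how any particular provider works.

```python
import re

ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD"}

# Toy stand-in for an authoritative database. Each key is a standardized
# address; each value is that address already broken into labeled parts.
REFERENCE_ADDRESSES = {
    "20 INGRAM ST QUEENS NY": {
        "house_number": "20",
        "street": "INGRAM",
        "suffix": "ST",
        "city": "QUEENS",
        "state": "NY",
    },
}


def standardize(text):
    """Uppercase, drop punctuation, and abbreviate known suffix words."""
    words = re.findall(r"[A-Za-z0-9]+", text.upper())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def validate(text):
    """Standardize the input, look it up, and return the stored components."""
    key = standardize(text)
    components = REFERENCE_ADDRESSES.get(key)
    return {
        "standardized": key,
        "valid": components is not None,
        "components": components,
    }


print(validate("20 Ingram Street, Queens, NY"))
print(validate("99 Nowhere Street, Queens, NY"))
```

The first call finds a match and returns the labeled components from the table. The second comes back with `valid` set to `False` and no components, which in this sketch means only that the address isn't in the table, not that it's fictional.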
Some validation providers do parsing in-house, so that even if an address fails to validate they can return its components to you and indicate which part failed the test. Not every provider does this, and even those who do can't guarantee 100% accuracy, due to the nature of parsing. It's simply not a foolproof system. Whether or not the parsing is done in-house, they can still standardize the address.
Open Source Address Validation
Validation requires access to an authoritative database, and the owners of that data (at least in the US) don't want their stuff strewn across the internet. So for US addresses, there is no truly open-source solution for validation.
We hope this article has done its job and clarified some of the details that might be obfuscating your search for whichever of the above services you need. Along those lines, we would be remiss if we did not mention that we provide a significant amount of free address validation (which, as mentioned, includes the parsing/standardizing part), and that the stuff you pay for is so reliable and lightning fast it merits the price.
The information delivered in this article was probably overkill, but a wise woman once told us that "nuking it from orbit is the only way to be sure."
We're pretty sure we're not taking that advice out of context.
In any case, good luck and happy hunting.