There you go again – the MLS doesn’t scale

[photopress:Reagan.jpg,thumb,alignright]Ever since Zearch, I’ve been bombarded with work to update or create MLS search web sites for various brokers & agents across the country. Because of this, I’ve had the opportunity to deal with two more MLSes: one in the Bay Area (EBRDI) and one in Central Virginia (CAARMLS). Before I begin another MLS rant (and cause the ghost of the Gipper to quip one of his more famous lines), I want to say the IT staff at both the EBRDI & the NWMLS have been helpful whenever I’ve had issues, and the primary purpose of this post is to shine a light on the IT challenges that an MLS has (and the hoops that application engineers have to jump through to address them).

After working with the EBRDI and the NWMLS, I can safely say the industry faces some interesting technical challenges ahead. Both MLSes have major bandwidth issues, and the download times of data from their servers can be so slow that it makes me wonder if they are using Atari 830 acoustic modems instead of network cards.

The EBRDI provides data to members via FTP downloads. They provide a zip file of text files for all the listing data (which appears to be updated twice daily), and a separate zip file of all the images for that day’s listings (updated nightly). You can request a DVD-R of all the images to get started, but there is no online mechanism to get older images. This system is frustrating because if you miss a day’s worth of image downloads, there’s no way to recover other than bothering the EBRDI’s IT staff. If the zip file gets corrupted or the download is otherwise cut short, you get to download the multi-megabyte monstrosity again (killing any benefit that zipping the data might have had). Furthermore, zip compression of images offers no major benefit: the 2-3% size savings is offset by the inconvenience of dealing with large files. The nightly data file averages about 5 MB (big but manageable), but the nightly image file averages about 130 MB (a bit big for my liking considering the bandwidth constraints that the EBRDI is operating under).

As much as I complain about the NWMLS, I have to admit they probably have the toughest information distribution challenge. The NWMLS is probably the busiest MLS in the country (and probably one of the largest as well). According to Alexa.com, their servers get more traffic than Redfin or John L Scott. If that wasn’t load enough, the NWMLS is the only MLS that I’m aware of that offers sold listing data. If that wasn’t load enough, they offer access to live MLS data (via a SOAP-based web service) instead of the daily downloads that the EBRDI & CAARMLS offer their members. If that wasn’t load enough, I believe they allow up to 16 or 20 photos per active listing (which seems to be more than the typical MLS supports). So, you have a database with over 30,000 active listings & 300,000 sold listings, all being consumed by over 1,000 offices and 15,000 agents (and their vendors or consultants). The NWMLS also uses F5 Networks’ BIG-IP products, so they are obviously attempting to address the challenges of their overloaded information infrastructure. Unfortunately, by all appearances it isn’t enough to handle the load that brokers & their application engineers are creating.

Interestingly, the other MLS I’ve had the opportunity to deal with (the CAARMLS in Central Virginia) doesn’t appear to have a bandwidth problem. It stores its data in a manner similar to the EBRDI. However, it’s a small MLS (only 2400-ish residential listings), and I suspect the reason it doesn’t have a bandwidth problem is that it has fewer members to support and less data to distribute than the larger MLSes do. Either that, or the larger MLSes have seriously underinvested in technology infrastructure.

So what can be done to help out the large MLSes with their bandwidth woes? Here are some wild ideas…

Provide data via DB servers. The problem is that as an application developer, you only really want the differences between your copy of the data and the MLS data. Unfortunately, providing a copy of the entire database every day is not the most efficient way of doing this. I think the NWMLS has the right idea with what is essentially a SOAP front end for their listing database. Unfortunately, writing code to talk SOAP, do a data compare, and download the results is a much bigger pain than writing a SQL stored proc to do the same thing or using a product like Red Gate’s SQL Compare. Furthermore, SOAP is a lot more verbose than the proprietary protocols database servers use to talk to each other. Setting up security might be tricky, but modern DB servers allow you to set view, table, and column permissions, so I suspect that’s not a major problem. Perhaps a bigger problem is that every app developer probably uses a different back-end, and getting heterogeneous SQL servers talking to each other is probably as big a headache as SOAP is. Maybe using REST instead of SOAP would accomplish the same result?
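To make the "only the differences" point concrete, here is a minimal Python sketch of the compare step, whatever transport ends up carrying the data. It assumes both sides can be represented as dictionaries keyed by listing number; the `diff_listings` function, the `ListingDelta` type, and the `price` field are names made up for illustration, not part of any MLS's actual interface.

```python
from typing import Dict, List, NamedTuple

# One listing's fields, keyed by column name (hypothetical shape).
Listing = Dict[str, object]


class ListingDelta(NamedTuple):
    added: List[str]    # listing numbers only the MLS has
    changed: List[str]  # listing numbers whose fields differ
    removed: List[str]  # listing numbers only the local copy has


def diff_listings(local: Dict[str, Listing],
                  remote: Dict[str, Listing]) -> ListingDelta:
    """Compare two snapshots keyed by listing number."""
    added = sorted(k for k in remote if k not in local)
    removed = sorted(k for k in local if k not in remote)
    changed = sorted(k for k in remote
                     if k in local and remote[k] != local[k])
    return ListingDelta(added, changed, removed)


local_copy = {"101": {"price": 500000}, "102": {"price": 650000}}
mls_copy = {"101": {"price": 495000}, "103": {"price": 800000}}
print(diff_listings(local_copy, mls_copy))
# ListingDelta(added=['103'], changed=['101'], removed=['102'])
```

Only the three short lists would need to drive any further downloads, which is the whole appeal compared to pulling the full database every day.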

Provide images as individually downloadable files (preferably over HTTP). I think HTTP would scale better than FTP would for many reasons. HTTP is a less chatty protocol than FTP is, so there’s a lot less back & forth data exchange between the client & server. Also there’s a lot more tech industry investment in the ongoing Apache & IIS web server war than in improving FTP servers (I don’t see that changing anytime soon).

Another advantage is that most modern web development frameworks have a means of easily making HTTP requests and generating dynamic images at run time. These features mean a web application could create a custom image page that downloads the image file on the fly from the MLS server and caches it on the file system when it’s first requested. Then all subsequent requests for that image would be fast since they are served locally and, more importantly, the app would only download images for properties that were searched for. Since nearly all searches are restricted somehow (show all homes in Redmond under $800K, show all homes with at least 3 bedrooms, etc.), and paged (show only 10, 20, etc. listings at a time), an app developer’s/broker’s servers wouldn’t download images from the MLS that nobody was looking at.
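A rough Python sketch of that cache-on-first-request idea follows. It assumes the MLS exposes each photo at a predictable HTTP URL, which none of the MLSes discussed here is known to do; the `ImageCache` class, the `fetch_over_http` helper, and the example host name are all hypothetical.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def fetch_over_http(url: str) -> bytes:
    """Download one file over HTTP and return its bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


class ImageCache:
    """Fetch each listing photo from the MLS once, then serve it from disk."""

    def __init__(self, cache_dir: Path, base_url: str,
                 fetch: Callable[[str], bytes] = fetch_over_http):
        self.cache_dir = Path(cache_dir)
        self.base_url = base_url.rstrip("/")  # assumed photo URL prefix
        self.fetch = fetch
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get(self, image_name: str) -> bytes:
        # Keep only the file name so a request cannot escape the cache dir.
        safe_name = Path(image_name).name
        cached = self.cache_dir / safe_name
        if cached.exists():
            return cached.read_bytes()  # fast path: already downloaded
        data = self.fetch(f"{self.base_url}/{safe_name}")
        cached.write_bytes(data)        # remember it for the next request
        return data


# Hypothetical usage inside an image-serving page handler:
# cache = ImageCache(Path("photo_cache"), "http://mls.example.invalid/photos")
# jpeg_bytes = cache.get("12345_1.jpg")
```

The fetch function is passed in so the caching logic can be exercised without a live MLS server; a real version would also need some way to expire photos when a listing changes.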

Data push instead of pull. Instead of all the brokers constantly bombarding the MLS servers, maybe the MLS could upload data to broker servers at predefined intervals and in random order. This would prevent certain brokers from being bandwidth hogs, and perhaps it might encourage brokers to share MLS data with each other (easing the MLS bandwidth crunch), which leads to my next idea.
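As a small illustration of the "random order" part, here is a Python sketch of one push round. The `push_round` function and the `send` callback are hypothetical stand-ins; how an MLS would actually upload to a broker's server (and on what schedule) is left open.

```python
import random
from typing import Callable, Iterable, List, Optional


def push_round(brokers: Iterable[str],
               send: Callable[[str], None],
               rng: Optional[random.Random] = None) -> List[str]:
    """Push the current update to every broker once, in shuffled order."""
    order = list(brokers)
    (rng or random).shuffle(order)  # no broker is always first in line
    for broker in order:
        send(broker)                # stand-in for the actual upload
    return order


# Hypothetical usage: a scheduler would call this at each predefined interval.
push_round(["broker-a", "broker-b", "broker-c"],
           send=lambda name: print("pushing update to", name))
```

Because the MLS decides when each upload starts, it controls its own peak load instead of being at the mercy of whoever polls most aggressively.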

BitTorrents? To quote a popular BitTorrent FAQ – “BitTorrent is a protocol designed for transferring files. It is peer-to-peer in nature, as users connect to each other directly to send and receive portions of the file. However, there is a central server (called a tracker) which coordinates the action of all such peers. The tracker only manages connections, it does not have any knowledge of the contents of the files being distributed, and therefore a large number of users can be supported with relatively limited tracker bandwidth. The key philosophy of BitTorrent is that users should upload (transmit outbound) at the same time they are downloading (receiving inbound.) In this manner, network bandwidth is utilized as efficiently as possible. BitTorrent is designed to work better as the number of people interested in a certain file increases, in contrast to other file transfer protocols.”

MLS downloads fit this usage pattern well: many members all want the same large files at roughly the same time. The trick would be getting brokers to agree to it and doing it in a way that’s secure enough to prevent unauthorized people from getting at the data. At any rate, the current way of distributing data doesn’t scale. As the public’s and the industry’s appetite for web access to MLS data grows, and as MLSes across the country merge and consolidate, this problem is only going to get worse. If you ran a large MLS, what would you try (other than writing big checks for more hardware)?