Optimizing web performance with GlusterFS

Posted by Joe Julian 5 years, 8 months ago

More often than I would like, someone with twenty or more web servers servicing tens of thousands of page hits per hour comes into #gluster asking how to get the highest performance out of their storage system. They've only just now realized that storage is slow, and only because adding a network layer has made it worse. The short answer: You don't.

Put your performance enhancements as close to the user as possible.

Use a reverse proxy like Varnish or Squid to cache pages and static content first. This not only prevents filesystem lookups, but even for generated content it allows the content process to quickly dump the page and move on to the next query while the proxy streams the content over the relatively slow internet connection. Caching of dynamic content can be tuned at the proxy as well. Does this page really need to have instantaneous updates with the barrage of comments, or is a new version every 30 seconds enough? This can cut content generation time significantly.
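To put a number on the "new version every 30 seconds" idea, here is a minimal sketch of time-based caching. It is not how Varnish or Squid are configured; TTLPageCache, render and the URL are hypothetical names made up for the illustration, and the clock is faked so the example runs instantly.

```python
import time
from typing import Callable, Dict, Tuple


class TTLPageCache:
    """Tiny in-memory stand-in for a reverse proxy cache (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be simulated
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str, render: Callable[[str], str]) -> str:
        now = self.clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]       # fresh copy: the backend is never touched
        body = render(url)      # missing or stale: regenerate once
        self._store[url] = (now, body)
        return body


def demo_render_count() -> int:
    """Serve 1000 requests over about 60 simulated seconds with a 30 s TTL."""
    fake_now = [0.0]
    renders = [0]

    def render(url: str) -> str:
        # Hypothetical page generator; counts how often it is really called.
        renders[0] += 1
        return f"<html>{url} v{renders[0]}</html>"

    cache = TTLPageCache(ttl_seconds=30, clock=lambda: fake_now[0])
    for i in range(1000):
        fake_now[0] = i * 0.06
        cache.get("/article/42", render)
    return renders[0]
```

In this toy run the page generator is called twice for 1000 requests; everything else is answered from the cache without touching the application or the filesystem.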

If you can't cache your content, cache the program.

Set realpath_cache_size in php.ini to something much larger than the 16k it defaults to. This will help avoid searching the include path over and over again for the same files. Avoiding lookups for filenames that don't exist can make a huge difference (thanks to Mohammed Naser from VEXXHOST).

For PHP, I use APC (php-pecl-apc for the rest of you Red Hat based folks out there). With apc.stat=0 set, APC will only load your PHP files once. Once loaded, PHP no longer needs to go to the disk to run a script. Obviously this will vastly improve your page load times. The only downside is that if you change your scripts, you'll have to force a reload (“service httpd reload” is again the normal Red Hat way).

Specific to GlusterFS, disabling stat calls is important. Calling stat() on a file forces a self-heal check. This will add some latency to the page load if you call it on every file before open(), like PHP does. Configuring APC to avoid those stat calls can add a significant performance boost for most PHP apps.

Yes, there's no mention of Python here. I use FastCGI, myself, which works in exactly this way: the program is loaded once and stays in memory between requests.

Don't load every file every time

Using an autoloader, like Zend's, or building an autoloader into your own PHP app (maybe using the __autoload() function in PHP 5) will help prevent unnecessary file accesses. Lazy loading will wait to load a class file until it's actually used, thus possibly avoiding unnecessary require_once loads.
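The same lazy-loading idea can be sketched outside PHP. The following is a Python analogue, not a PHP autoloader: LazyModule is a hypothetical helper that defers the import (and the file access that comes with it) until an attribute of the module is first used.

```python
import importlib


class LazyModule:
    """Defers importing a module until one of its attributes is first used."""

    def __init__(self, name: str):
        self._name = name
        self._module = None  # nothing has been read from disk yet

    def __getattr__(self, attr: str):
        # Python only calls this for names not found on the wrapper itself,
        # i.e. for members of the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# Creating the wrapper costs nothing; the import happens on first real use.
lazy_json = LazyModule("json")
encoded = lazy_json.dumps({"loaded": True})
```

A request that never touches lazy_json never pays for the import, which is the point of waiting until a class is actually used.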

Don't store your session data on disk

This one I thought was obvious. Not only does that include a write(), which by its very nature is going to be expensive, but it limits your growth. Use memcached for sessions. It's fast, does the job beautifully, and allows shared sessions. While you're implementing memcached, use it for caching database queries.
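Caching database queries usually follows a check-the-cache-first pattern. Here is a minimal sketch of it, with loud assumptions: FakeMemcached is a dict-backed stand-in that only mimics get and set, cached_query is a hypothetical helper, and SQLite stands in for whatever database you actually run.

```python
import hashlib
import json
import sqlite3


class FakeMemcached:
    """Dict-backed stand-in with get/set only; a real client talks to memcached."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def cached_query(cache, conn, sql, params=()):
    """Return rows for a query, asking the database only on a cache miss."""
    # Hash the query and its parameters into a short key without whitespace.
    raw = json.dumps([sql, list(params)]).encode()
    key = "q:" + hashlib.sha256(raw).hexdigest()
    rows = cache.get(key)
    if rows is None:
        rows = [list(r) for r in conn.execute(sql, params).fetchall()]
        cache.set(key, rows)
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, title TEXT)")
conn.execute("INSERT INTO posts VALUES (1, 'hello')")
cache = FakeMemcached()
query = "SELECT id, title FROM posts WHERE id = ?"
first = cached_query(cache, conn, query, (1,))   # miss: hits the database
second = cached_query(cache, conn, query, (1,))  # hit: served from the cache
```

This sketch never expires anything, so a cached result will outlive changes to the table. A real deployment would give entries an expiry time or delete the key when the underlying data changes.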

When necessary: sure, read from disk

With GlusterFS you can improve performance in a few ways. If your reads are varied and inconsistent, you might benefit from adding more servers. The distribute translator spreads the files among all the storage in the volume (or at least, all the subvolumes given to the distribute translator) so the more distribute subvolumes you have, the more spread out your load.
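Why more distribute subvolumes spread the load can be shown with a deliberately simplified model. This is not GlusterFS's actual placement algorithm; it only assumes that each file lands on one subvolume chosen by a hash of its name. pick_subvolume, busiest_share and the file names are made up for the illustration.

```python
import hashlib
from collections import Counter
from typing import Iterable


def pick_subvolume(filename: str, subvolume_count: int) -> int:
    """Toy placement: hash the file name and map it onto one subvolume."""
    digest = hashlib.sha256(filename.encode()).digest()
    return int.from_bytes(digest[:4], "big") % subvolume_count


def busiest_share(filenames: Iterable[str], subvolume_count: int) -> float:
    """Fraction of all files that end up on the most loaded subvolume."""
    names = list(filenames)
    counts = Counter(pick_subvolume(n, subvolume_count) for n in names)
    return max(counts.values()) / len(names)


files = [f"images/{i}.jpg" for i in range(10000)]
share_with_2 = busiest_share(files, 2)  # roughly half the files per subvolume
share_with_8 = busiest_share(files, 8)  # roughly an eighth per subvolume
```

If reads are spread evenly over the files, the busiest subvolume's share of the work shrinks as subvolumes are added. A handful of very hot files would still concentrate load on whichever subvolumes hold them.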

If you have 30000 users all pulling the same file... well, that shouldn't happen if you've correctly addressed the caching above, but... if you can read the file without calling stat() on it, more replica volumes should help. Currently (3.2.6) it won't spread the load unless it starts getting really bad. There's a “bug” (“feature”?) that always polls the replicate subvolumes in the same order, allowing the first server to nearly always respond quicker. This will be changed in 3.3. This isn't as big a problem as it sounds, though, because once the first server is saturated with requests, the next replicate server will be faster to respond.
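The "first responder wins until it is saturated" behavior can be mimicked with a toy model. Nothing here is taken from the GlusterFS code: assign_reads is a hypothetical function, the microsecond figures are invented, and the model covers a single burst in which no request finishes.

```python
from typing import List


def assign_reads(read_count: int, replica_count: int = 2,
                 poll_gap_us: int = 100, queue_cost_us: int = 50) -> List[int]:
    """Toy model: each read goes to whichever replica would answer first.

    Replicas are always polled in the same order, so replica i starts
    i * poll_gap_us behind. Every request already queued on a replica
    adds queue_cost_us to its reply time. All numbers are made up.
    """
    in_flight = [0] * replica_count
    choices = []
    for _ in range(read_count):
        reply_us = [i * poll_gap_us + in_flight[i] * queue_cost_us
                    for i in range(replica_count)]
        winner = reply_us.index(min(reply_us))  # ties go to the earlier replica
        in_flight[winner] += 1
        choices.append(winner)
    return choices


light_load = assign_reads(3)
heavy_load = assign_reads(6)
```

Under light load every read lands on replica 0. Once its queue outweighs the head start from being polled first, replica 1 starts winning and the reads begin to alternate.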

Go faster!!!

Finally, of course, speed is king. Reduce latency, increase bandwidth. Use RDMA over InfiniBand if you can. Avoid stateful firewalls on your web servers. Identify your bottlenecks and throw money at them.

And make your web developers bring you a nice Venti Latte in the morning and take you out for micro-brews after work for taking such good care of them.