Flickr, Caching, and Overcomplication

I mentioned offhand in a previous post that I'd modified my photo section to use Flickr. A few days later, I realize this may have been a bad idea.

I've gotten a few tracebacks from this site where a search engine bot has hit the photos section and absolutely broken this functionality. It's a big block of "Connection reset" messages sitting in my Inbox. And they do always seem to come in large numbers, so I don't think it's just randomly crapping out.

So tonight I coded up (what I think is) a fairly elegant caching solution. It will hold on to some number of Python objects (default is 1000) by key, and delete the least-recently used data as new items come in. And the main Photos page is caching each individual photoset, so that no additional calls need to be made when you drill down to the Photoset level. The same thing happens on the Photoset page with individual Photos.
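For illustration, here is a minimal sketch of a least-recently-used cache along the lines described above. It is not the actual module from this site: the class name `LRUCache`, the `max_items` parameter, and the `photoset:` key format are all assumptions, and it relies on `OrderedDict.move_to_end`, which needs a modern Python 3.

```python
from collections import OrderedDict


class LRUCache:
    """Hold up to max_items values by key, evicting the least-recently used."""

    def __init__(self, max_items=1000):
        self.max_items = max_items
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # A hit counts as a "use", so move the key to the most-recent end.
        self._items.move_to_end(key)
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        # Drop entries from the least-recent end until we are within the limit.
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)


# Tiny limit so the eviction is easy to see (keys and values are made up).
cache = LRUCache(max_items=2)
cache.set("photoset:1", ["a.jpg", "b.jpg"])
cache.set("photoset:2", ["c.jpg"])
cache.get("photoset:1")              # touch photoset:1, so photoset:2 is now oldest
cache.set("photoset:3", ["d.jpg"])   # over the limit: photoset:2 is evicted
```

Because the cache is just an object in memory, each process that imports it gets its own independent copy.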

In testing, it hits me that Apache spawns a lot of different child processes (or maybe threads, I can't remember), and each process will necessarily have its own cache. (I count 9 child processes running right now.) So the odds of a search engine bot actually making heavy use of the caching are relatively small--unless they're just lucky and get the same child process each time, or stupid enough to request the same URL more than once. So for the amount of effort I spent, I surely could have just written a daily sync script to pull down all of my Flickr data to a database.

Of course, that would be too obvious a solution. (And, I gotta admit, writing a data sync script just doesn't feel as 1337 as doing caching.) Oh well. At least I have a caching module that I can use if this site or Chainsaw Buffet gets popular, or at least Slashdotted. (Do kids still use that term, "Slashdotted?")

... like that's ever going to happen. :)

Comments

I think that "Twitter'd" is the new "Slashdotted"...

So, you are saying that the robots are crapping out your photos b/c the web service call is taking too long? If you are doing the thumbnail + descriptions and linking to a larger images, you might want to add the "nofollow" attribute to the thumnail in order to limit the number of requests that google would make against the API. Seeing that flickr has built in paging anyway, it should keep your requests small enough that it will not crap out on you.

I started to reccomend turing on / up your output caching on this page... however, that might not work so well for you, ;) .

Spelling

Just realized that the spelling in the previous post was horrible. I am ashamed to even read it. Oh well, thats what I get for multitasking...

Hmm.

Yeah, the robots seem to be breaking the web service calls. At least that's the only thing I can think.

I didn't realize that Flickr has built-in paging. But the module I'm using for Flickr may not support it. But as far as paging goes, the caching I have set up sort of works for me the other way--if I have three pages of photos, it retrieves them all, and caches the entire list. So pages 2, 3, 4, etc. don't have to make a call back to Flickr--assuming you get the same process, that is.

Good point on the no-follows on lower-level pages. I hadn't thought of that. Of course, when I had my photos on my site, I got some pretty good search engine traffic from them if they had descriptive titles (for example, a lot of the photos from MGM Studios). Of course, (a) that's not really good traffic, since they're just here for a photo, and (b) I haven't captioned any of my Flickr photos yet.

Actually, page output caching isn't a bad idea, in a sense. The photo pages don't change that often (no sidebar boxes to deal with), so I could set up a caching module that would associate the URL with the HTML output of the page. Then, unless (a) the request was a POST, (b) the URL/query string didn't match a cached entry, or (c) the user was logged in, it would just dump that HTML output from the database.
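A rough sketch of that idea, assuming that only GET requests from anonymous visitors are safe to serve from cache. The function `render_cached`, its parameters, and the example URL are hypothetical, and a plain dict stands in for the database table:

```python
page_cache = {}  # URL (including query string) -> rendered HTML


def render_cached(method, url, logged_in, render):
    """Return cached HTML for anonymous GET requests, rendering on a miss."""
    cacheable = method == "GET" and not logged_in
    if not cacheable:
        # POSTs and logged-in users always get a freshly rendered page.
        return render()
    if url not in page_cache:
        page_cache[url] = render()
    return page_cache[url]


def render_photos():
    """Stand-in for the expensive page render that calls out to Flickr."""
    render_photos.calls += 1
    return "<ul><li>photo</li></ul>"


render_photos.calls = 0

render_cached("GET", "/photos/?page=2", False, render_photos)   # miss: renders
render_cached("GET", "/photos/?page=2", False, render_photos)   # hit: cached HTML
render_cached("POST", "/photos/?page=2", False, render_photos)  # bypasses cache
```

Keying on the full URL means each page of results is cached separately.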

For that matter, it wouldn't be hard to serialize/pickle an object and toss it in the database to cache across threads. Obviously, this would be pointless for blog posts/comments (or would it, if I stored whole lists?), but it would work really well for Flickr.
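Something like the following sketch could do it, using the standard library's `pickle` and `sqlite3` modules as a stand-in for whatever database the site really uses. The class name, table layout, and the `ttl` expiry are assumptions added for illustration. Note that `pickle.loads` should only ever be fed data the application itself wrote.

```python
import pickle
import sqlite3
import time


class DatabaseCache:
    """Store pickled Python objects in a database table, keyed by string."""

    def __init__(self, path=":memory:"):
        # ":memory:" is private to this connection; pass a file path so that
        # every server process opens the same database and shares the cache.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value BLOB, expires REAL)"
        )

    def set(self, key, value, ttl=3600):
        blob = pickle.dumps(value)
        with self.conn:  # commits on success
            self.conn.execute(
                "INSERT OR REPLACE INTO cache (key, value, expires) "
                "VALUES (?, ?, ?)",
                (key, blob, time.time() + ttl),
            )

    def get(self, key, default=None):
        row = self.conn.execute(
            "SELECT value, expires FROM cache WHERE key = ?", (key,)
        ).fetchone()
        # Treat missing and expired rows the same way: as a cache miss.
        if row is None or row[1] < time.time():
            return default
        return pickle.loads(row[0])


db_cache = DatabaseCache()
photosets = [{"id": "1", "title": "MGM Studios"}]  # made-up Flickr data
db_cache.set("flickr:photosets", photosets)
```

Storing a whole list under one key, as here, keeps each lookup to a single query.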

memcached

Take a look at memcached - since you're already on Linux this is a no-brainer. memcached is how the big boys scale up (like MySpace) - an in-memory, distributed (though you only have one server) cache.

Unlike the other post, this makes sense as it can solve the problem, and it is also a great point to have on the resume.

memcached

Cool, I'll try it out.

There seems to be a Python API available, which is what I was worried about. And memcached is available through Ubuntu's apt-get... but I'll have to compile it on my actual VPS server.

memcached ftw

OK, I just set up my photo section to use memcached.

Pretty easy to set up, despite the fact that you have to compile it on Red Hat/Fedora.

Good Python library, although with very sparse documentation.

Found an extension here that supports "namespaces," so more than one application can share the same server.

And here's the best part: it very nearly matched the interface I'd written. OK, it isn't that much of a coincidence; it's a no-brainer to implement get(key) and set(key, value) for something like this, but I barely had to change any code to get it working.
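To show how little glue that takes, here is a hedged sketch of a namespace wrapper around any client that offers `get(key)` and `set(key, value)`. This is not the extension mentioned above; `NamespacedCache` and `FakeClient` are hypothetical names, and `FakeClient` is only an in-process stand-in so the example runs without a memcached server. In practice a real client object (for example one from the python-memcached package) would take its place.

```python
class NamespacedCache:
    """Prefix every key so several applications can share one cache server."""

    def __init__(self, client, namespace):
        self.client = client
        self.prefix = namespace + ":"

    def get(self, key):
        return self.client.get(self.prefix + key)

    def set(self, key, value):
        return self.client.set(self.prefix + key, value)


class FakeClient:
    """In-process stand-in for a memcached client, for demonstration only."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        # Return None on a miss, like a typical cache client.
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        return True


shared = FakeClient()
photos = NamespacedCache(shared, "photos")
blog = NamespacedCache(shared, "blog")

# Both applications use the key "recent" without clobbering each other.
photos.set("recent", ["a.jpg"])
blog.set("recent", ["post-1"])
```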

w00t.
