Notes On Creating A Multi-user Feed Aggregator

Some time ago, I answered another user's question on Stack Overflow about database design for a multi-user feed aggregator. I also received an email from a developer asking for additional input, which I shared. But I thought I should put my response here, as well, for posterity's sake if nothing else. Note that my comments here assume MySQL as the database but should apply to any SQL database.

Basically, my emailer asked what to do about the fact that the posts table will get huge very quickly if we have multiple users and a row for each post for each user. It's actually a pretty basic relational database scenario, but if you start your project as a single user application and later decide it's going to be multi-user, you may not realize that you probably need to completely redesign your database.

So I've posted the schemae I use for my multi-user feed aggregator (a private project):

I've also posted a sql command that you can run in a cron job to remove read posts that are more than 14 days old:

As an aside, I'm amazed by how many people are writing new aggregators. Is it a common programming class exercise to write a feed aggregator or something?

8 comments

  1. Dan,
    Thanks for posting this. I’m trying to work out a similar problem, but focused on podcasts.

    Right now in my design I have users, feed_master(single entry for the feed), and a subscription table connecting the two. I’m adding a items table for items in the feed and want to add the equivalent of your postmeta table. I’d also like for users to be able to create virtual feeds by reposting items to their own feed(s). And have other users subscribe to the virtual feeds like playlists.

    I’m struggling with how to do this, as I think a normalized structure would explode to massively big tables quickly.

    I had some questions about your structure. Do you have a feeds table for the master feeds list? I assume you create duplicate post entries for every user/post combinations. Is that right? Also, do you create the postmeta entries for every post or just those that the user interacts with?

    Also, what are you putting into the subscription_cache, post_cache, post_annotations and other cache fields? And how do you use them in the application?

    Any thoughts are appreciated.

    Thanks for sharing.

    Alex

  2. Another question I just thought of. Do you create the posts as the feeds are crawled or only when the user logs in and checks their reader?

  3. I hope I understand your questions.

    1. The subscriptions table contains all the feeds to which that any user is subscribed. I used to have separate tables for feeds and user subscriptions/folders with a relational table to join them, but I couldn’t get the performance out of it that I wanted so I denormalized them.

    2. I create one database row per post. Simultaneously, I create a postmeta row for that post for each user. The postmeta table is where all the state is stored on a post for each user, e.g., read/unread, starred/unstarred, etc.

    3. The caches are simply the last fetched copy of things. Each user has a cache of the latest subscription list. Each subscription has a cache of the latest post list. Generally, they’re just used for backup/recovery purposes (i.e., almost never).

    4. The post_annotations field is left over from when the Google Reader API used to have robust sharing features.

    5. Posts are created in the database as the feeds are crawled (every 10 minutes).

    Hope that helps.

  4. Thanks for responding. I think I understand most of it, but just to clarify:

    If you and I both subscribe to the Yabfog rss feed, there will be 2 entries in subscriptions table. Will there also be 2 entries in the post table for “Notes On Creating A Multi-user Feed Aggregator” or just 1? If there are 2, why have postmeta as a separate table?

    So far, I have seen two other approaches, 1 replicates data tables for each user, which seems unworkable.

    The other has users, feeds, subscriptions(user_feed), items and tags where the tags is used for read, stars, plus labels.

    I’ve also been looking an Newsblur, which uses some combination of postgresql and mongodb. I’m not sure how that works yet, but still looking at it.

    Any thoughts on either the tag approach or using something like Mongo?

    thanks.
    Alex

  5. No, like I said: just one copy of each post. That’s why I have postmeta. 🙂

    Postmeta serves the same purpose as tags, but if I really wanted tags per se, I’d normalize a bit more, i.e., one postmeta property or tag per row.

    Mongo is something I’m experimenting with as a feed reader database, too. You need a lot more disk space and memory. Right now I have about 6000 feeds in Mongo, and 2 GB RAM is not enough. So it gets expensive very quickly.

  6. Looking at the schema above, do you use a common feed_id for the same feed when different users subscribe to it? That would make the Post table make sense if you only create one copy of the post. Not sure where you keep track of that.

    That’s good to know about Mongo. I’m starting with 25,000 feeds in the directory before I deal with user subscriptions so it might need a massive amount of memory.

    thanks again.

    Alex

  7. Yes. The feed_id is the unique identifier for a feed. That’s not a unique index in the subscriptions table, though, because it’s denormalized. In my case, the feed_id comes from the Google Reader API; I don’t generate it myself. That’s another reason why I can get away with denormalizing the user subscriptions — I don’t need a unique, autoincrementing feed_id generator.

  8. That makes things a lot clearer. It’s nice because you can drill down from user to subscription/feed to post to postmeta without any relationship tables. The post meta approach also eliminates the many to many relationship you get with tags.

    One thing that I would like to add is the ability for users to make virtual feeds that they can add posts to. Without some constraint that would mean a many to many between feeds and posts. Another way might be to put something in the postmeta data like a shared feed field. It would limit the number of virtual feeds that a user can attach a post to, but that might not be so bad.

    Thanks again for the post and comments. Its been a big help.

Comments are closed.