
I built a custom RSS hydrator for better GitHub and YouTube feeds

In addition to following my favourite blogs, I use RSS feeds to follow Today I Learned (TIL) repositories on GitHub and some YouTube channels and playlists.

The experience hasn’t been very good, though. So I decided to build a small web service that functions as a “hydrator” for the chosen feeds. When I say hydrate, I mean it takes the feed and adds in or reformats information so the feed becomes more usable when it lands in my reader.

Hydrating YouTube video feeds

YouTube uses media: namespaced elements to provide the video information. Some readers are able to read those and provide a better experience for these feeds (or so I have heard) but NetNewsWire doesn’t, so a YouTube feed just becomes the title and channel name with no content.
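For reference, an entry in a YouTube feed looks roughly like this (abbreviated from memory, so details may vary):

<entry>
  <title>Video title</title>
  <link rel="alternate" href="https://www.youtube.com/watch?v=VIDEO_ID"/>
  <media:group>
    <media:title>Video title</media:title>
    <media:content url="https://www.youtube.com/v/VIDEO_ID?version=3" width="640" height="390"/>
    <media:description>Video description</media:description>
  </media:group>
</entry>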

The title alone is good enough for knowing when a new video lands, but I’d really like to watch those directly from my reader instead of jumping to YouTube and getting distracted by the algorithm pushing more content to my face.

My hydrator has two functions: one called process_youtube_feed that parses the XML of a provided feed and another called create_youtube_embed that takes an individual video URL and turns it into a YouTube embed iframe.

import re


def create_youtube_embed(video_feed_url):
    """Convert a YouTube video URL from an RSS feed
    into a YouTube embed iframe."""
    video_id = re.findall(r"youtube\.com/v/(.*?)\?", video_feed_url)
    return f'<iframe width="1020" height="600" src="https://youtube.com/embed/{video_id[0]}"></iframe>'
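
A quick example of what it produces (with a placeholder video ID):

create_youtube_embed("https://www.youtube.com/v/VIDEO_ID?version=3")
# -> '<iframe width="1020" height="600" src="https://youtube.com/embed/VIDEO_ID"></iframe>'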


import requests
from bs4 import BeautifulSoup


class FeedNotFoundException(Exception):
    """Raised when the original feed can't be fetched."""


def process_youtube_feed(feed_url):
    """Processes a given RSS feed for a YouTube channel and
    injects the video embed and title into the feed's content
    so it can be viewed by feed readers that don't support media attributes."""

    # Get original feed
    response = requests.get(feed_url)
    if not response.ok:
        raise FeedNotFoundException(f"Feed at {feed_url} not found.")

    feed = BeautifulSoup(response.text, "xml")

    # Add content with embedded Youtube video + title & description
    # to each entry
    entries = feed.find_all("entry")

    for entry in entries:
        media_content = entry.find("media:content")
        youtube_url = media_content["url"]

        embed = create_youtube_embed(youtube_url)
        title = entry.find("media:title").text
        html = f"<![CDATA[{embed} <h1>{title}</h1>]]>"

        entry.append(
            BeautifulSoup(f"<content type='html'>{html}</content>", "html.parser")
        )

    return feed

While parsing the feed, I add a new content element into it that my feed reader then displays, and I can watch the video right away.

Hydrating GitHub commit feeds

For GitHub, the functionality needed is a bit more complex. The commit feed only contains the commit message and a link to the commit on GitHub. Since I follow TIL repositories, where people document and publicly share tech snippets they learn, I want to read the actual contents as if they were blog posts.
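(GitHub serves these commit feeds as Atom, at URLs like github.com/user/repo/commits/main.atom.)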

GitHub’s API provides these in a couple of formats.

Diffs

My initial solution was to send a request to GitHub’s API to fetch the diff for the given commit:

# Get the link for the commit; `entry` here is one <entry> element
# from the commit feed, looped over like in the YouTube processor above
link = entry.find("link")["href"]

# Change the URL so it matches the format the API uses:
# github.com/<user>/<repo>/commit/<sha> becomes
# api.github.com/repos/<user>/<repo>/commits/<sha>
link = link.replace("github.com/", "api.github.com/repos/")
link = link.replace("/commit/", "/commits/")

# Get diff by providing the correct Accept header
diff_response = requests.get(
    link, headers={"Accept": "application/vnd.github.diff"}
)
diff = diff_response.text

I also added a layer of caching to avoid unnecessary API calls, so each commit gets fetched only once, no matter how often the feed is processed or updated. Thanks to how git works, a commit doesn’t change retroactively, so it’s a very simple commit → content cache.
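The cache can be as simple as a dictionary keyed by the commit URL. Here’s a minimal sketch of the idea (the names are mine, the actual implementation may differ):

# Hypothetical sketch of a commit → content cache
_commit_cache = {}


def get_cached_diff(link):
    """Fetch a commit's diff once and reuse it on later feed updates."""
    if link not in _commit_cache:
        response = requests.get(
            link, headers={"Accept": "application/vnd.github.diff"}
        )
        _commit_cache[link] = response.text
    return _commit_cache[link]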

The issue with the diff, though, was that when placed inside pre tags, the text didn’t wrap anymore, which wasn’t the best possible experience. It was way better than just the commit message but I wanted something better.

Markdown files as HTML

Calling the commit endpoint provides links to each changed file, which made it easy to get the contents of the file as HTML to be shown in the feed. I lose access to the actual changes but in this case that’s not so important: 95% of the time I’d rather read the full file than a small snippet anyway.

It also allows me to skip files like readmes, which people use as an index and update only to add a link to the new actual entry.

Here’s the crux of how to get the HTML:

# Get commit data from API
commit_response = requests.get(link).json()

# Get all markdown files that are not readme.md
files = [
    file
    for file in commit_response.get("files", [])
    if file["filename"].endswith(".md")
    and file["filename"].lower() != "readme.md"
]

html = ""
if files:
    for file in files:
        # remove_url_params strips the query string from the URL
        # (a sketch of it is below)
        api_url = remove_url_params(file["contents_url"])
        # application/vnd.github.html tells the API
        # to return the file contents as HTML
        html_response = requests.get(
            api_url, headers={"Accept": "application/vnd.github.html"}
        )
        html += html_response.text
else:
    html = "No non-readme Markdown changes."

Attempt at hydrating Yle Areena feeds

The third kind of feed I wish to hydrate is the feeds for the Yle Areena streaming service. Unfortunately, embedding Areena videos isn’t supported for all videos (due to contract reasons), and for those where it is supported, the embed code is a React widget so it doesn’t work with RSS readers.

So for now, I need to be content with using the feed just as a reminder to make sure I don’t miss my favourite series. It’s especially useful when a new season starts, as I normally remember to watch weekly episodes while the season is running anyway.

Flask web service to provide the new feed

I wrapped the hydration functions into a single Flask endpoint:

from flask import Flask, Response, request

app = Flask(__name__)


@app.route("/<path:feed_url>")
def process(feed_url):
    if "github.com" in feed_url:
        new_feed = process_github_feed(f"https://{feed_url}")
        return Response(str(new_feed), mimetype="text/xml")
    if "youtube.com" in feed_url:
        channel_id = request.args.get("channel_id")
        new_feed = process_youtube_feed(f"https://{feed_url}?channel_id={channel_id}")
        return Response(str(new_feed), mimetype="text/xml")

    return feed_url

To use it, I take the original feed URL, let’s call it example.com/feed.xml, and call my hydrator at hydrator-url/example.com/feed.xml, which I then add back to my feed reader.
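For example, a YouTube channel feed like www.youtube.com/feeds/videos.xml?channel_id=UC… becomes hydrator-url/www.youtube.com/feeds/videos.xml?channel_id=UC… (channel id omitted here).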

Whenever I want to add a new YouTube or GitHub feed, I just add it to the end of my hydrator’s URL and tada, my feed is way better now.

In a dream world, this wouldn’t need to be a web service running on a server but rather a transform step I could add to my RSS reader, but until that becomes a possibility, I’m very happy with this project.

Right now, my hydrator supports these two services, but if I ever come across something new that I want to hydrate, all I need is a function to process the raw feed and another to hydrate the content; then I add a new if clause to my endpoint and go on with my day, happier than before.
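For illustration, a new service would be just another clause in the endpoint (process_example_feed is a hypothetical function here):

    # Hypothetical: support for a new service
    if "example.org" in feed_url:
        new_feed = process_example_feed(f"https://{feed_url}")
        return Response(str(new_feed), mimetype="text/xml")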

Hobbit software

RSS Hydrator is “hobbit software”, as coined by Dave Anderson:

Pretty chill, keeps to itself, tends to its databases, hangs out with other hobbit software at the pub, broadly unbothered by the scheming of the wizards and the orcs, oblivious to the rise and fall of software empires around them.

I’ve published the code as open source on GitHub at hamatti/rss-hydrator. It’s not built to be super robust and it’s not a drop-in service, but you can use the ideas from it to build your own for now. I might add some error handling and make it more robust one day but right now it’s built for my own use and relies on me providing good and correct input feeds.