RSS and Atom
#301 Henry, Thursday, 29 October 2015 11:30 PM (Category: Web Development)
(Tags: rss atom)

After getting my feed list imported, the next step was obvious: read those feeds and get the RSS data. I did that. I wrote PHP code to hit each URL and read the data. It's XML. Fine by me. I grabbed a converter that turned the XML into a dictionary structure, and I extracted all the fields I needed.
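To make the "XML into a dictionary structure" idea concrete, here is a minimal sketch in Python rather than the PHP used for the project. The function name `element_to_dict` and the sample feed are made up for illustration; this is not the converter the post refers to, just one simple way such a converter could work.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively turn an element into nested dicts (hypothetical helper).

    Leaf elements become their stripped text. Repeated child tags
    (such as several <item> elements) are collected into a list.
    Attributes are ignored to keep the sketch short.
    """
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Invented sample data standing in for a downloaded feed.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example Feed</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

feed = element_to_dict(ET.fromstring(SAMPLE_RSS))
print(feed["channel"]["item"][0]["title"])
```

A converter this simple assumes every element holds either text or child elements, never both mixed together.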

That proved a little trickier than I thought. There are variations between websites. I worked with the first half-dozen of my feeds and got them all working perfectly. I could retrieve all the XML data and transform it into the common things I needed, like the published date, author name, body of the post, title. Looking good.
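As a rough illustration of the kind of variation involved, the sketch below (Python, not the project's PHP) normalizes one RSS 2.0 item into the common fields. It assumes two widespread differences between feeds: the author appearing in either `author` or `dc:creator`, and the body appearing in either `content:encoded` or `description`. The helper names and the sample item are hypothetical.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Namespace URIs for the Dublin Core and RSS content modules.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}


def first_text(item, paths):
    """Return the text of the first path that exists and is non-empty."""
    for path in paths:
        found = item.find(path, NS)
        if found is not None and found.text and found.text.strip():
            return found.text.strip()
    return None


def normalize_rss_item(item):
    """Map one <item> element onto a common set of fields."""
    raw_date = first_text(item, ["pubDate"])
    return {
        "title": first_text(item, ["title"]),
        "author": first_text(item, ["author", "dc:creator"]),
        "body": first_text(item, ["content:encoded", "description"]),
        # RSS 2.0 dates use the RFC 822 style, which email.utils can parse.
        "published": parsedate_to_datetime(raw_date) if raw_date else None,
    }


# Invented sample item; a real feed would be downloaded first.
SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel><item>
<title>Hello</title>
<dc:creator>Example Author</dc:creator>
<description><![CDATA[<p>Body</p>]]></description>
<pubDate>Thu, 29 Oct 2015 23:30:00 +0000</pubDate>
</item></channel></rss>"""

post = normalize_rss_item(ET.fromstring(SAMPLE).find("channel/item"))
print(post["title"], post["author"], post["published"])
```

Listing the candidate tags in priority order keeps the per-site quirks in one place, so a new variation only needs another entry in a list.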

Then I expanded it to my whole feed list for a big trial. Crash and burn. It took time to work out what was happening. But then I found out.

I need to cater for two syndication formats: RSS and Atom. I had known about Atom before: knew the name, knew it was a syndication format, knew that its formatting might differ slightly from RSS. Yeah, I "knew" this, but there's a big difference between vaguely knowing about something and knowing it well enough to work with it. So I read up on it, and I examined the XML I was getting back. Didn't look too bad. I can work with this.

But the more I worked with it, the more problems I found. Inside the content tags, where the body of the post lives, the HTML is embedded as raw markup and is not enclosed in a CDATA section. My XML-to-array converter goes crazy trying to handle all the HTML in there. I need to rework that. I need a better converter, one that can handle Atom the way I want to handle it.
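One possible way to deal with both problems, sketched in Python rather than the project's PHP: tell the two formats apart by the root element, and for Atom content marked `type="xhtml"`, serialize the embedded markup back into a string instead of walking it as data. The function names and the sample feed are hypothetical, and this is only one approach under those assumptions, not the fix the project ended up with.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
XHTML = "{http://www.w3.org/1999/xhtml}"


def feed_kind(root):
    """Classify a parsed feed by its root element."""
    if root.tag == ATOM + "feed":
        return "atom"
    if root.tag == "rss":
        return "rss"
    return "unknown"  # for example RSS 1.0 (RDF), not handled in this sketch


def atom_content_html(entry):
    """Return an Atom entry's body as an HTML string."""
    content = entry.find(ATOM + "content")
    if content is None:
        return ""
    if content.get("type") == "xhtml":
        # Inline XHTML arrives as real child elements wrapped in a <div>.
        div = content.find(XHTML + "div")
        if div is None:
            return ""
        # Strip the namespace from each tag so the output reads as plain
        # HTML. Note that this modifies the parsed tree in place.
        for el in div.iter():
            if el.tag.startswith("{"):
                el.tag = el.tag.split("}", 1)[1]
        return (div.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in div
        )
    # For type="text" or type="html" the body is ordinary element text.
    return content.text or ""


# Invented sample feed with an XHTML body.
SAMPLE_ATOM = (
    '<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title>'
    '<entry><title>Post</title><content type="xhtml">'
    '<div xmlns="http://www.w3.org/1999/xhtml"><p>Hi <b>there</b></p></div>'
    "</content></entry></feed>"
)

root = ET.fromstring(SAMPLE_ATOM)
html = atom_content_html(root.find(ATOM + "entry"))
print(feed_kind(root), html)
```

Treating the body as an opaque string sidesteps the problem of a dictionary converter trying to interpret every `<p>` and `<b>` as feed structure.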

I'm having more struggles with this project than with anything else. Still, it's a challenge, and if it were easy, it wouldn't be as satisfying to conquer. More work required.