Friday, December 24, 2010

Parsing RSS Feeds with PHP

Holiday Parade [looks like focus was on reflections and not the boat]
© 2010 Copyright www.cssrule.com All Rights Reserved
I've neglected my subject categories long enough, so today I sat down to write a search tool that will help people find what they are looking for.

I wrote the following PHP code to pull down my CSSBakery RSS feed, then parse and dump it out:

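// Recursively print an XML node: its name, attributes, and text value, then its children.
// $d is the current indentation depth in spaces.
function dumpXmlDoc($node, $d = 0) {
    if ($d == 0) { echo "<pre>"; }    // the outermost call opens the <pre> block
    echo str_repeat(" ", $d) . $node->getName() . "\n";
    foreach ($node->attributes() as $k => $v) {
        echo str_repeat(" ", $d + 2) . "($k='$v')\n";
    }
    echo str_repeat(" ", $d + 2) . ":" . (string)$node . "\n";
    foreach ($node->children() as $child) {
        dumpXmlDoc($child, $d + 2);   // each level is indented two more spaces
    }
    if ($d == 0) { echo "</pre>"; }   // the outermost call closes it
}

// Fetch the latest 10 posts from the feed and dump the parsed tree.
$xml = simplexml_load_file('http://www.cssrule.com/feeds/posts/default?max-results=10');
dumpXmlDoc($xml);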


RSS feeds are formatted as XML documents, so I'm using PHP's SimpleXML extension. The following line of PHP code does a lot for us:

$xml = simplexml_load_file('http://www.cssrule.com/feeds/posts/default?max-results=10');

This causes SimpleXML to load the feed document from my blog, parse it, and build a corresponding data structure in memory. The function can load files from the server's filesystem, but it can also retrieve them over HTTP (provided PHP's allow_url_fopen setting is enabled), which is what we are doing here. Note that the URL includes the query parameter max-results=10, which tells Blogger to return only the latest 10 posts.
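One thing the line above does not cover is failure: both simplexml_load_file() and simplexml_load_string() return false when the document cannot be parsed. Here is a minimal sketch of a guard for that case. The helper name parseFeed() and the sample strings are made up for illustration, and it parses an inline string so it can run without a network connection.

```php
<?php
// Hypothetical helper (not part of the original code): parse an XML string
// and return null, instead of emitting warnings, when the XML is malformed.
function parseFeed(string $xmlText): ?SimpleXMLElement {
    $previous = libxml_use_internal_errors(true);  // collect parse errors quietly
    $xml = simplexml_load_string($xmlText);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);         // restore the earlier setting
    return $xml === false ? null : $xml;
}

// Made-up samples standing in for a downloaded feed.
$good = parseFeed('<rss version="2.0"><channel><title>Sample Feed</title></channel></rss>');
$bad  = parseFeed('<rss><channel></rss>');         // mismatched tags

echo ($good === null ? "parse failed" : $good->getName()), "\n";
echo ($bad === null ? "parse failed" : $bad->getName()), "\n";
```

The same check should apply to the result of simplexml_load_file() before handing it to a function that expects a node.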

Next we want to dump out the XML structure that we now have in memory. For this I wrote a small recursive function, dumpXmlDoc(), that dumps the XML tree. The function takes two parameters: a node that is the root of the subtree to dump, and an optional depth I called $d. If the second parameter is omitted, it defaults to zero.

We call dumpXmlDoc() on the root of the tree, letting the second parameter default to zero:
dumpXmlDoc($xml);

If the dump function sees that the depth ($d) is zero, meaning this is the outermost call, it wraps the output in <pre> tags so the browser preserves the line breaks and indentation:

function dumpXmlDoc($node, $d = 0) {
    if ($d == 0) { echo "<pre>"; }
    ... (do the dumping logic) ...
    if ($d == 0) { echo "</pre>"; }
}
  

The function then dumps out the name, attributes, and text value of the current node. The calls to str_repeat() indent the output by generating a number of spaces based on the current depth ($d):

echo str_repeat(" ", $d) . $node->getName() . "\n";
foreach ($node->attributes() as $k => $v) {
    echo str_repeat(" ", $d + 2) . "($k='$v')\n";
}
echo str_repeat(" ", $d + 2) . ":" . (string)$node . "\n";
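Feed content often contains HTML of its own, and echoing it straight into the page would let the browser render it rather than show it. As a sketch of one way around that, the variant below builds the same three kinds of lines as a string and passes the values through htmlspecialchars(). The helper name describeNode() and the sample element are assumptions for illustration, not part of the code above.

```php
<?php
// Hypothetical helper: returns the dump lines for a single node as a string,
// escaping values so that markup inside the feed is displayed literally.
function describeNode(SimpleXMLElement $node, int $d = 0): string {
    $out = str_repeat(" ", $d) . $node->getName() . "\n";
    foreach ($node->attributes() as $k => $v) {
        $out .= str_repeat(" ", $d + 2) . "($k='" . htmlspecialchars((string)$v) . "')\n";
    }
    $out .= str_repeat(" ", $d + 2) . ":" . htmlspecialchars((string)$node) . "\n";
    return $out;
}

// Made-up sample: an element whose text value is itself HTML.
$node = simplexml_load_string('<content type="html">&lt;b&gt;Hi&lt;/b&gt;</content>');
echo describeNode($node);
```

Returning a string instead of echoing also makes the helper easier to test.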
  

Then it loops over all children of the current node and recursively calls itself on each child, passing a depth two greater than the current one ($d+2) so that each level of the tree is indented two more spaces than its parent:

foreach ($node->children() as $child) {
    dumpXmlDoc($child, $d + 2);
}
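To see the recursion in isolation, here is a small sketch that keeps only the element names and collects them into an array of lines instead of echoing them. The function name outlineXml() and the sample document are made up for illustration.

```php
<?php
// Hypothetical variant of the recursion: returns one indented line per element.
function outlineXml(SimpleXMLElement $node, int $d = 0): array {
    $lines = [str_repeat(" ", $d) . $node->getName()];
    foreach ($node->children() as $child) {
        // Each child is processed at a depth two greater than its parent.
        $lines = array_merge($lines, outlineXml($child, $d + 2));
    }
    return $lines;
}

// Made-up sample document with three levels of nesting below the root.
$xml = simplexml_load_string(
    '<rss><channel><title>T</title><item><title>A</title></item></channel></rss>'
);
echo implode("\n", outlineXml($xml)), "\n";
```

Each line's leading spaces reflect how deep the element sits in the tree, which is the same effect the $d+2 argument has in dumpXmlDoc().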
  

And that's all there is to it. We can use this as a basis for writing some interesting applications to process RSS feeds.
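As one example of where this could go, here is a rough sketch of a title search over a parsed feed. It assumes an RSS 2.0 layout (channel, item, title); a feed with a different layout, such as Atom, would need different element names. The function name searchTitles() and the sample feed are invented for illustration.

```php
<?php
// Made-up sample feed in RSS 2.0 layout.
$sample = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample Feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>
XML;

// Hypothetical helper: returns the titles of items whose title contains $term,
// ignoring case.
function searchTitles(SimpleXMLElement $rss, string $term): array {
    $matches = [];
    foreach ($rss->channel->item as $item) {
        $title = (string)$item->title;
        if (stripos($title, $term) !== false) {
            $matches[] = $title;
        }
    }
    return $matches;
}

$xml = simplexml_load_string($sample);
print_r(searchTitles($xml, 'second'));
```

The same loop could collect links or dates instead of titles, depending on what the search tool needs to show.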

You can see the live output of this code here.
