Version 10^-1.9 of MD_Extract pushed to GitHub

I’ve completely changed the way the string representing the HTML is preprocessed before being fed to tidy: both the function and the approach are new. The function is not very elegant, but it fixes a bunch of bugs. It’s mostly character iteration and lots and lots of flags (old-school style!). But after some quick browsing of the HTML parsing algorithm provided by the WHATWG, it got me thinking about whether I shouldn’t just write my own parser (though it looks sort of hard and especially time-consuming). I’ve also been looking at the source code of tidy. The other option would be to contribute to it and help update it to HTML 5, but it would take some time for me to get to know the code base, the project seems to have been abandoned, and it might be too big for just one person to work on. Anyhow, I’m not promising anything so far.

I do understand that the library’s current approach (preprocessing and then sending to tidy) is not the most efficient one. However, there is another take on efficiency, and that’s economic efficiency. Except for really heavy-duty Microdata consumption, the library does fulfill its purpose, and the truth is that Microdata is a new spec that has yet to be widely adopted, so that’s not a real concern right now. So the question is whether it makes sense to spend the next 3 months writing a parser from scratch when the one I have fits my needs (and probably those of 99.999% of the PHP developers who may use the library). So far I don’t see the point. But then again, my geeky side keeps bugging me to do it right.

Well, anyhow, if you find any bugs (and I’m sure there may be many, simply because there are very few Microdata examples and I might be missing strange markup some user might come up with), please report them! Other than that, my next post will be on why I believe Microdata to be better than microformats, and I will probably also write a personal post that I sort of owe myself.

My first attempt at a Microdata Extractor.

I’ve just pushed version 10^-2 of MD_Extract to GitHub. It’s my first attempt at a Microdata consumer.

I based the extraction algorithm on the one published by the WHATWG, though the implementation has some variations, mainly for clarity of code and also due to the particulars of doing it in PHP. I took Tab’s suggestion: it does a first pass through the HTML tree to collect references to elements with IDs, which makes the code so much clearer and nicer than what I was originally planning to do. In fact I think the algorithm is beautiful (and it’s O(n), where n is the number of nodes in the HTML tree).
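To make the ID pre-pass concrete, here is a minimal sketch of the idea. It is written in Python rather than the library’s PHP, it assumes well-formed markup so the standard XML parser can be used, and the function names (collect_ids, resolve_itemref) and the sample markup are hypothetical, not taken from MD_Extract. It only shows how one linear walk over the tree can build an id lookup table that later makes itemref resolution a dictionary lookup.

```python
import xml.etree.ElementTree as ET


def collect_ids(root):
    """Walk the tree once and map each id value to its element.

    Hypothetical helper: every node is visited exactly once, so the
    pass is O(n) in the number of nodes.
    """
    ids = {}
    for element in root.iter():
        element_id = element.get("id")
        # Keep the first element seen for a duplicated id.
        if element_id is not None and element_id not in ids:
            ids[element_id] = element
    return ids


def resolve_itemref(item, ids):
    """Return the elements named by an item's itemref attribute.

    Tokens that match no id are skipped.
    """
    tokens = item.get("itemref", "").split()
    return [ids[token] for token in tokens if token in ids]


# Made-up sample markup, kept well-formed so ElementTree can parse it.
SAMPLE = """<div>
  <div id="band" itemscope="" itemref="a b"></div>
  <p id="a">Name: <span itemprop="name">Amanda</span></p>
  <p id="b" itemprop="genre">Jazz</p>
</div>"""

root = ET.fromstring(SAMPLE)
ids = collect_ids(root)
referenced = resolve_itemref(ids["band"], ids)
```

With the table built up front, the extraction pass never has to search the tree for a referenced element, which is what keeps the whole thing linear.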

I have versioned it at 10^-2 because I have not found many examples to test it with, there are some anticipated problems with character encodings that do not extend ASCII, and there are a couple of little things I’d like to add. But as far as I know, as regards Microdata syntax it’s 100% compliant with the latest spec.

hCard_xtract v0.0.2 released

I have just finished version 0.0.2 of my hCard Extract Application. The changes in this new version are:

  • added support for FN ORG Optimization
  • added support for mail type
  • added support for multiple nicknames and for nickname inside n
  • added support for multiple categories

I’ve also decided to release the hCard Parser Library under the LGPL; the source code can be accessed here. This is quite probably the last release of this, since I would like to do a full production version using Tidy under PHP 5 (which gives an HTML parser quite similar to the one I was using).
