REXML

REXML @ANT_VERSION@ @ANT_DATE@ http://www.germane-software.com/software/rexml rexml ruby Sean Russell

REXML is a conformant XML processor for the Ruby programming language. REXML passes 100% of the OASIS non-validating tests and includes full XPath support. It is reasonably fast, and is implemented in pure Ruby. Best of all, it has a clean, intuitive API. REXML is included in the standard library of Ruby.

This software is distributed under the Ruby license.

REXML arose out of a desire for a straightforward XML API, and is an attempt at an API that doesn't require constant referencing of documentation to do common tasks. "Keep the common case simple, and the uncommon, possible."

REXML avoids the DOM API, which violates the maxim of simplicity. It does provide a DOM model, but one that is Ruby-ized. It is an XML API oriented for Ruby programmers, not for XML programmers coming from Java.

One of the common differences is that the Ruby API relies on block enumeration rather than iterators. For example, the Java code:

    for (Enumeration e = parent.getChildren(); e.hasMoreElements(); ) {
      Element child = (Element) e.nextElement();
      // Do something with child
    }

in Ruby becomes:

    parent.each_child { |child|
      # Do something with child
    }

Can't you feel the peace and contentment in this block of code? Ruby is the language Buddha would have programmed in.
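As a slightly fuller illustration of block enumeration, here is a minimal sketch that can be run as-is. The sample XML and the variable names are made up for the example; each_element is used so that only child elements (not whitespace text nodes) are yielded.

```ruby
require 'rexml/document'

# Sample document invented for this sketch.
xml = '<inventory><item name="apple"/><item name="pear"/></inventory>'
doc = REXML::Document.new(xml)

# Visit each child element of the root with a block; no iterator object needed.
names = []
doc.root.each_element { |child| names << child.attributes['name'] }

puts names.inspect
```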

One last thing. If you use and like this software, and you're in a position of power in a company in Western Europe and are looking for a software architect or developer, drop me a line. I took a lot of French classes in college (all of which I've forgotten), and I lived in Munich long enough that I was pretty fluent by the time I left, and I'd love to get back over there.

- Four intuitive parsing APIs:
  - Intuitive, powerful, and reasonably fast tree parsing API (a la DOM)
  - Fast stream parsing API (a la SAX). This is not a SAX API.
  - SAX2-based API, in addition to the native REXML streaming API. This is slower than the native REXML API, but does a lot more work for you.
  - Pull parsing API
- Small
- Reasonably fast (for interpreted code)
- Native Ruby
- Full XPath support. Currently only available for the tree API.
- XML 1.0 conformant. REXML passes all of the non-validating OASIS tests. There are probably places where REXML isn't conformant, but I try to fix them as they're reported.
- ISO-8859-1, UNILE, UTF-16 and UTF-8 input and output; also, support for any encoding that iconv supports.
- Documentation
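To make the difference between the stream and pull APIs concrete, the following is a rough sketch rather than an excerpt from the distribution. The TagCollector class and the sample XML are hypothetical names chosen for the example; the stream API pushes events into a listener, while the pull API hands out events when the caller asks for them.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'rexml/parsers/pullparser'

# Sample input invented for this sketch.
XML_SOURCE = '<list><item>one</item><item>two</item></list>'

# Stream API: REXML calls back into a listener as it reads the input.
# TagCollector is a hypothetical listener that records start tags.
class TagCollector
  include REXML::StreamListener
  attr_reader :tags

  def initialize
    @tags = []
  end

  def tag_start(name, _attributes)
    @tags << name
  end
end

collector = TagCollector.new
REXML::Document.parse_stream(XML_SOURCE, collector)

# Pull API: the caller requests the next event when it is ready for it.
texts = []
parser = REXML::Parsers::PullParser.new(XML_SOURCE)
while parser.has_next?
  event = parser.pull
  texts << event[0] if event.text?
end

puts collector.tags.inspect
puts texts.inspect
```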

You don't have to install anything; if you're running a version of Ruby greater than 1.8, REXML is included. However, if you choose to upgrade from the REXML distribution, run the command: ruby bin/install.rb. By the way, you really should look at these sorts of files before you run them as root. They could contain anything, and since (in Ruby, at least) they tend to be mercifully short, it doesn't hurt to glance over them. If you want to uninstall REXML, run ruby bin/install.rb -u.

If you have Test::Unit installed, you can run the unit test cases. Run the command: ruby bin/suite.rb; it runs against the distribution, not against the installed version.

There is a benchmark suite in benchmarks/. To run the benchmarks, change into that directory and run ruby comparison.rb. If you have nothing else installed, only the benchmarks for REXML will be run. However, if you have any of the following installed, benchmarks for those tools will also be run:

- NQXML
- XMLParser
- Electric XML (you must copy EXML.jar into the benchmarks directory and compile flatbench.java before running the test)

The results will be written to index.html.

Please see the Tutorial.

The API documentation is available on-line, or it can be downloaded as an archive in tgz format (~70Kb) or (if you're a masochist) in zip format (~280Kb). The best solution is to download and install Dave Thomas' most excellent rdoc and generate the API docs yourself; then you'll be sure to have the latest API docs and won't have to keep downloading the doc archive.

The unit tests in test/ and the benchmarking code in benchmarks/ provide additional examples of using REXML. The Tutorial provides examples with commentary. The documentation unpacks into rexml/doc.

Kouhei Sutou maintains a Japanese version of the REXML API docs. Kou's documentation page contains links to binary archives for various versions of the documentation.

Unfortunately, NQXML is the only package REXML can be compared against; XMLParser uses expat, which is a native library, and really is a different beast altogether. So in comparing NQXML and REXML you can look at four things: speed, size, completeness, and API.

Benchmarks

REXML is faster than NQXML in some things, and slower than NQXML in a couple of things. You can see this for yourself by running the supplied benchmarks. Most of the places where REXML is slower are due to the convenience methods. (For example, element.elements[index] isn't really an array operation; index can be an Integer or an XPath, and this feature is relatively expensive in time.) On the positive side, most of the convenience methods can be bypassed if you know what you are doing. Check the benchmark comparison page for a general comparison. You can look at the benchmark code yourself to decide how much salt to take the results with.
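The dual behaviour of elements[...] can be seen in a small sketch. The sample document is invented for the example; the point is only that an Integer selects by (1-based) position, while a String is evaluated as an XPath.

```ruby
require 'rexml/document'

# Sample document invented for this sketch.
doc = REXML::Document.new('<shelf><book id="a"/><cd id="b"/><book id="c"/></shelf>')
shelf = doc.root

by_index = shelf.elements[1]                 # Integer: first child element (1-based)
by_name  = shelf.elements['cd']              # String: evaluated as an XPath
by_pred  = shelf.elements['book[@id="c"]']   # XPath with a predicate

puts by_index.attributes['id']
puts by_name.attributes['id']
puts by_pred.attributes['id']
```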

The sizes of the XML parsers are close. (As measured with ruby -nle 'print unless /^\s*(#.*|)$/' *.rb | wc -l.) NQXML 1.1.3 has 1580 non-blank, non-comment lines of code; REXML 2.0 has 2340. (REXML started out with about 1200, but that number has been steadily increasing as features are added. XPath accounts for 541 lines of that code, so the core REXML has about 1800 LOC.)
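For readers who prefer a script to a shell pipeline, the same measurement can be approximated in plain Ruby. This is only a sketch: count_code_lines is a hypothetical helper name, and it applies the same regular expression to each chomped line, which is what the -l switch does in the one-liner.

```ruby
# Count lines that are neither blank nor comment-only.
# count_code_lines is a hypothetical helper written for this sketch.
def count_code_lines(source)
  source.each_line.count { |line| line.chomp !~ /^\s*(#.*|)$/ }
end

sample = "# a comment\n\nputs 'hi'\n  x = 1 # trailing comment\n"
puts count_code_lines(sample)
```

To measure a directory, one could sum the helper over Dir.glob('*.rb'), reading each file with File.read.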

REXML is a conformant XML 1.0 parser. It supports multiple language encodings, and internal processing uses the required UTF-8 and UTF-16 encodings. It passes 100% of the OASIS non-validating tests. Furthermore, it provides a full implementation of XPath, a SAX2 API, and a PullParser API.
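The SAX2 API mentioned here can be exercised roughly as follows. This is a sketch under the assumption that block listeners are registered with listen before parse is called; the sample XML and the variable names are made up for the example.

```ruby
require 'rexml/document'
require 'rexml/parsers/sax2parser'

# Sample input invented for this sketch.
parser = REXML::Parsers::SAX2Parser.new('<list><item>one</item><item>two</item></list>')

started = []
chars = []

# Called for every start tag; the block receives the namespace URI,
# the local name, the qualified name, and the attributes.
parser.listen(:start_element) { |_uri, _localname, qname, _attributes| started << qname }

# Called for character data, restricted here to text inside <item> elements.
parser.listen(:characters, %w[item]) { |text| chars << text }

parser.parse

puts started.inspect
puts chars.inspect
```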

As of release 2.0, XPath 1.0 is fully implemented.

I fully expect bugs to crop up from time to time, so if you see any bogus XPath results, please let me know. That said, since I'm now following the XPath grammar and spec fairly closely, I suspect that you won't be surprised by REXML's XPath very often, and it should become rock solid fairly quickly.

Check the "bugs" section for known problems; there are little bits of XPath here and there that are not yet implemented, but I'll get to them soon.

Namespace support is rather odd, but it isn't my fault. I can only do so much and still conform to the specs. In particular, XPath attempts to help as much as possible. Therefore, in the trivial cases, you can pass namespace prefixes to Element.elements[...] and so on -- in these cases, XPath will use the namespace environment of the base element you're starting your XPath search from. However, if you want to do something more complex, like pass in your own namespace environment, you have to use the XPath first(), each(), and match() methods. Also, default namespaces force you to use the XPath methods, rather than the convenience methods, because there is no way for XPath to know what the mappings for the default namespaces should be. This is exactly why I loathe namespaces -- a pox on the person(s) who thought them up!
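The two cases described above might look like this in practice. It is a sketch only: the documents, the urn:example URI, and the prefix 'n' are all invented for the example, and the mapping is an ordinary Hash from prefix to namespace URI.

```ruby
require 'rexml/document'

# Trivial case: the prefix is declared on the element the search starts from,
# so the convenience method can resolve it from that element's environment.
prefixed = REXML::Document.new('<root xmlns:x="urn:example"><x:item>one</x:item></root>')
via_convenience = prefixed.root.elements['x:item'].text

# Default namespace: pass a prefix-to-URI mapping to the XPath methods.
defaulted = REXML::Document.new(
  '<root xmlns="urn:example"><item>one</item><item>two</item></root>'
)
mapping = { 'n' => 'urn:example' } # 'n' is an arbitrary prefix chosen here
first_item = REXML::XPath.first(defaulted, '//n:item', mapping).text
all_items  = REXML::XPath.match(defaulted, '//n:item', mapping).map(&:text)

puts via_convenience
puts first_item
puts all_items.inspect
```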

Namespace support is now fairly stable. One thing to be aware of is that REXML is not (yet) a validating parser. This means that some invalid namespace declarations are not caught.

There is a low-volume mailing list dedicated to REXML. To subscribe, send an empty email to ser-rexml-subscribe@germane-software.com. This list is more or less spam proof. To unsubscribe, similarly send a message to ser-rexml-unsubscribe@germane-software.com.

An RSS file for REXML is now being generated from the change log. This allows you to be alerted of bug fixes and feature additions via "pull". Another RSS is available which contains a single item: the release notice for the most recent release. This is an abuse of the RSS mechanism, which was intended to be a distribution system for headlines linked back to full articles, but it works. The headline for REXML is the version number, and the description is the change log. The links all link back to the REXML home page. The URL for the RSS itself is http://www.germane-software.com/software/rexml/rss.xml.

The changelog itself is here.

For those who are interested, there's a SLOCCount (by David A. Wheeler) file with stats on the REXML sourcecode. Note that the SLOCCount output includes the files in the test/, benchmarks/, and bin/ directories, as well as the main sourcecode for REXML itself.

- Raggle is a console-based RSS aggregator.
- getrss is an RSS aggregator.
- Ned Konz's ruby-htmltools uses REXML.
- Hiroshi NAKAMURA's SOAP4R package can use REXML as the XML processor.
- Chris Morris' XML Serializer provides a serialization mechanism for Ruby, with a bidirectional mapping between Ruby classes and XML documents.
- Much of the RubyXML site is generated with scripts that use REXML. RubyXML is a great place to find information about the intersection between Ruby and XML.

You can submit bug reports and feature requests, and view the list of known bugs, at the REXML bug report page. Please do submit bug reports. If you really want your bug fixed fast, include an runit or Test::Unit method (or methods) that illustrates the problem. At the very least, send me some XML that REXML doesn't process properly.

You don't have to send an entire test suite -- just the unit test methods. If you don't send me a unit test, I'll have to write one myself, which will mean that your bug will take longer to fix.

When submitting bug reports, please include the version of Ruby and of REXML that you're using, and the operating system you're running on. Just run: ruby -vrrexml/rexml -e 'p REXML::VERSION,RUBY_PLATFORM' and paste the results in your bug report. Include your email if you want a response about the bug.

Known bugs

- Attributes are not handled internally as nodes, so you can't perform node functions on them. This will have to change. It'll also probably mean that, rather than returning attribute values, XPath will return the Attribute nodes.
- Some of the XPath functions are untested. (Mike Stok has been testing, debugging, and implementing some of these Functions, and he's been doing a good job, so there's steady improvement in this area.) Any XPath functions that don't work are also bugs; please report them. If you send a unit test that illustrates the problem, I'll try to fix the problem within a couple of days (if I can) and send you a patch, personally.
- Accessing prefixes for which there is no defined namespace in an XPath should throw an exception. It currently doesn't -- it just fails to match.

To do

- Reparsing a tree with a pull/SAX parser
- Better namespace support in SAX
- Lazy tree parsing
- Segregate parsers, for optimized minimal distributions
- XML <-> Ruby
- Validation support
- True XML character support
- Add XPath support for streaming APIs
- XQuery support
- XUpdate support
- Make sure namespaces are supported in pull parser
- Add document start and entity replacement events in pull parser
- Better stream parsing exception handling
- I'd like to hack XMLRPC4R to use REXML, for my own purposes.

FAQ

Q: REXML is hanging while parsing one of my XML files.
A: Your XML is probably malformed. Some malformed XML, especially XML that contains literal '<' embedded in the document, causes REXML to hang. REXML should be throwing an exception, but it doesn't; this is a bug. I'm aware that it is an extremely annoying bug, and it is one I'm trying to solve in a way that doesn't significantly reduce REXML's parsing speed.

Q: I'm using the XPath '//foo' on an XML branch node X, and keep getting all of the 'foo' elements in the entire document. Why? Shouldn't it return only the 'foo' element descendants of X?
A: No. XPath specifies that '/' returns the document root, regardless of the context node. '//' also starts at the document root. If you want to limit your search to a branch, you need to use the self:: axis. E.g., 'self::node()//foo', or the shorthand './/foo'.

Q: I want to parse a document both as a tree, and as a stream. Can I do this?
A: Yes, and no. There is no mechanism that directly supports this in REXML. However, aside from writing your own traversal layer, there is a way of doing this. To turn a tree into a stream, just turn the branch you want to process as a stream back into a string, and re-parse it with your preferred API. E.g.: pp = PullParser.new( some_element.to_s ). The other direction is more difficult; you basically have to build a tree from the events. REXML will have one of these builders, eventually, but it doesn't currently exist.

Q: Why is Element.elements indexed off of '1' instead of '0'?
A: Because of XPath. The XPath specification states that the index of the first child node is '1'. Although it may be counter-intuitive to base elements on 1, it is more undesirable to have element.elements[0] == element.elements[ 'node()[1]' ]. Since I can't change the XPath specification, the result is that Element.elements[1] is the first child element.

Q: Why isn't REXML a validating parser?
A: Because validating parsers must include code that parses and interprets DTDs. I hate DTDs. REXML supports the barest minimum of DTD parsing, and even that isn't complete. There is DTD parsing code in the works, but I only work on it when I'm really, really bored. Rumor has it that a contributor is working on a DTD parser for REXML; rest assured that any such contribution will be included with REXML as soon as it is available.

Q: I'm trying to create an ISO-8859-1 document, but when I add text to the document it isn't being properly encoded.
A: Regardless of what the encoding of your document is, when you add text programmatically to a REXML document you must ensure that you are only adding UTF-8 to the tree. In particular, you can't add ISO-8859-1 encoded text that contains characters at or above 0x80 to REXML trees -- you must convert it to UTF-8 before doing so. Luckily, this is easy: text.unpack('C*').pack('U*') will do the trick. 7-bit ASCII is identical to UTF-8, so you probably won't need to worry about this.

Q: How do I get the tag name of an Element?
A: You take a look at the APIs, and notice that Element includes Namespace. Then you click on the Namespace link and look at the methods that Element includes from Namespace. One of these is name(). Another is expanded_name(). Yet another is prefix(). Then, you email the author of rdoc and ask him to extend rdoc so that it lists methods in the API that are included from other files, so that you don't have to do all of that looking around for your method.
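Several of the answers above can be checked with a short script. This is a sketch with invented sample documents and variable names, not code from the distribution. It covers the '//foo' versus './/foo' distinction, 1-based indexing, re-parsing a branch with the pull parser, the name methods from Namespace, and the ISO-8859-1 to UTF-8 conversion.

```ruby
require 'rexml/document'
require 'rexml/parsers/pullparser'

# Sample document invented for this sketch.
doc = REXML::Document.new('<root><x><foo id="1"/></x><y><foo id="2"/></y></root>')
x = doc.root.elements[1] # 1-based: the first child element, <x>

whole_doc = REXML::XPath.match(x, '//foo')  # starts at the document root
branch    = REXML::XPath.match(x, './/foo') # limited to descendants of x

# Tree to stream: serialize the branch and re-parse it with the pull parser.
started = []
pull = REXML::Parsers::PullParser.new(x.to_s)
while pull.has_next?
  event = pull.pull
  started << event[0] if event.start_element?
end

# Tag name parts come from the Namespace module mixed into Element.
el = REXML::Document.new('<p:tag xmlns:p="urn:example"/>').root
parts = [el.name, el.prefix, el.expanded_name]

# ISO-8859-1 bytes to UTF-8 before adding text to the tree.
latin1 = "caf\xE9".b # raw bytes; 0xE9 is e-acute in ISO-8859-1
utf8 = latin1.unpack('C*').pack('U*')
doc.root.add_element('note').text = utf8

puts whole_doc.size
puts branch.size
puts started.inspect
puts parts.inspect
```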

I've had help from a number of resources; if I haven't listed you here, it means that I just haven't gotten around to adding you, or that I'm a dork and have forgotten. In either case, feel free to write me and complain.

- Mike Stok has been very active, not only sending fixes for bugs (especially in Functions), but also providing unit tests and making sure REXML runs under Ruby 1.7. He also sent the most awesome hand knitted tea cozy, with "REXML" and the Ruby knitted into it.
- Kouhei Sutou translated the REXML API documentation to Japanese! Links are in the API docs section of the main documentation. He has also contributed a large number of bug reports and patches to fix bugs in REXML.
- Erik Terpstra heard my pleas and submitted several logos for REXML. After sagely procrastinating for several weeks, I finally forced my poor slave of a wife to pick one (this is what we call "delegation"). She did, with caveats; Erik quickly made the changes, and the result is what you now see at the top of this page. He also supplied a smaller version that you can include with your projects that use REXML, if you'd like.
- Ernest Ellingson contributed the sourcecode for turning UTF16 and UNILE encodings into UTF8, which allowed REXML to get the 100% OASIS valid tests rating.
- Ian Macdonald provided me with a comprehensive, well written RPM spec file.
- Oliver M. Bolzer is maintaining a Debian package distribution of REXML. He also has provided good feedback and bug reports about namespace support.
- Michael Granger supplied a patch for REXML that makes the unit tests pass under Ruby 1.7.
- James Britt contributed code that makes Document.parse_stream easier to use by allowing it to be passed either a Source, File, or String.
- Tobias Reif: Numerous bug reports, and suggestions for improvement.
- Stefan Scholl, who provided a lot of feedback and bug reports while I was trying to get ISO-8859-1 support working.
- Steven E Lumos, for volunteering information about XPath particulars.
- Fumitoshi UKAI provided some bug fixes for CData metacharacter quoting.
- TAKAHASHI Masayoshi, for information on UTF
- Robert Feldt: Bug reports and suggestions/recommendations about improving REXML. Testing is one of the most important aspects of software development.
- Electric XML: This was, after all, the inspiration for REXML. Originally, I was just going to do a straight port, and although REXML doesn't in any way, shape or form resemble Electric XML, still the basic framework and philosophy was inspired by E-XML. And I still use E-XML in my Java projects.
- NQXML: While I may complain about the NQXML API, I wrote a few applications using it that wouldn't have been written otherwise, and it was very useful to me. It also encouraged me to write REXML. Never complain about free software *slap*.
- See my technologies page for a more comprehensive list of computer technologies that I depend on for my day-to-day work.
- rdoc, an excellent JavaDoc analog. (When I was first working on REXML, rdoc wasn't, IMO, very good, so I wrote API2XML. API2XML was good enough for a while, and then there was a flurry of work on rdoc, and it quickly surpassed API2XML in features. Since I was never really interested in maintaining a JavaDoc analog, I stopped support of API2XML, and am now recommending that people use rdoc.)
- Many, many other people who've submitted bug reports, suggestions, and positive feedback. You're all co-developers!