REXML is a conformant XML processor for the Ruby programming language. REXML passes 100% of the Oasis non-validating tests and includes full XPath support. It is reasonably fast, and is implemented in pure Ruby. Best of all, it has a clean, intuitive API. REXML is included in the standard library of Ruby.
This software is distributed under the Ruby license.
REXML arose out of a desire for a straightforward XML API, and is an attempt at an API that doesn't require constant referencing of documentation to do common tasks. "Keep the common case simple, and the uncommon, possible."
REXML avoids the DOM API, which violates the maxim of simplicity. It does provide a DOM model, but one that is Ruby-ized. It is an XML API oriented for Ruby programmers, not for XML programmers coming from Java.
One common difference is that the Ruby API relies on block enumerations, rather than iterators. For example, the Java code:
in Ruby becomes:
Can't you feel the peace and contentment in this block of code? Ruby is the language Buddha would have programmed in.
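As a rough illustration of the block-enumeration style, here is a minimal sketch. The inventory document and the child_texts helper are made up for this example and are not part of REXML; only Document.new, each_element, and text are REXML calls.

```ruby
require 'rexml/document'

# Hypothetical helper: collect the text of every child element of +parent+
# by passing a block to each_element instead of driving an iterator by hand.
def child_texts(parent)
  texts = []
  parent.each_element { |child| texts << child.text }
  texts
end

# Made-up sample document, used only for illustration.
doc = REXML::Document.new('<inventory><item>apple</item><item>pear</item></inventory>')
puts child_texts(doc.root)
```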
One last thing. If you use and like this software, and you're in a position of power in a company in Western Europe and are looking for a software architect or developer, drop me a line. I took a lot of French classes in college (all of which I've forgotten), and I lived in Munich long enough that I was pretty fluent by the time I left, and I'd love to get back over there.
You don't have to install anything; if you're running a version of Ruby greater than 1.8, REXML is included. However, if you choose to upgrade from the REXML distribution, run the command:
ruby bin/install.rb
By the way, you really should look at these sorts of files before you run them as root. They could contain anything, and since (in Ruby, at least) they tend to be mercifully short, it doesn't hurt to glance over them. If you want to uninstall REXML, run:
ruby bin/install.rb -u
If you have Test::Unit installed, you can run the unit test cases. Run the command: ruby bin/suite.rb; it runs against the distribution, not against the installed version.
There is a benchmark suite in benchmarks/. To run the benchmarks, change into that directory and run ruby comparison.rb. If you have nothing else installed, only the benchmarks for REXML will be run. However, if you have any of the following installed, benchmarks for those tools will also be run:
EXML.jar into the benchmarks directory and compile flatbench.java before running the test)
The results will be written to index.html.
Please see the Tutorial.
The API documentation is available on-line, or it can be downloaded as an archive in tgz format (~70Kb) or (if you're a masochist) in zip format (~280Kb). The best solution is to download and install Dave Thomas' most excellent rdoc and generate the API docs yourself; then you'll be sure to have the latest API docs and won't have to keep downloading the doc archive.
The unit tests in test/ and the benchmarking code in benchmarks/ provide additional examples of using REXML. The Tutorial provides examples with commentary. The documentation unpacks into rexml/doc.
Kouhei Sutou maintains a Japanese version of the REXML API docs. Kou's documentation page contains links to binary archives for various versions of the documentation.
Unfortunately, NQXML is the only package REXML can be compared against; XMLParser uses expat, which is a native library, and really is a different beast altogether. So in comparing NQXML and REXML you can look at four things: speed, size, completeness, and API.
Benchmarks
REXML is faster than NQXML in some things, and slower than NQXML in a couple of things. You can see this for yourself by running the supplied benchmarks. Most of the places where REXML is slower are because of the convenience methods. For example, element.elements[index] isn't really an array operation; index can be an Integer or an XPath, and this feature is relatively time expensive.
The sizes of the XML parsers are close. The following command counts the non-blank, non-comment lines of Ruby source in a directory:
ruby -nle 'print unless /^\s*(#.*|)$/' *.rb | wc -l
REXML is a conformant XML 1.0 parser. It supports multiple language encodings, and internal processing uses the required UTF-8 and UTF-16 encodings. It passes 100% of the Oasis non-validating tests. Furthermore, it provides a full implementation of XPath, a SAX2 and a PullParser API.
As of release 2.0, XPath 1.0 is fully implemented.
I fully expect bugs to crop up from time to time, so if you see any bogus XPath results, please let me know. That said, since I'm now following the XPath grammar and spec fairly closely, I suspect that you won't be surprised by REXML's XPath very often, and it should become rock solid fairly quickly.
Check the "bugs" section for known problems; there are little bits of XPath here and there that are not yet implemented, but I'll get to them soon.
Namespace support is rather odd, but it isn't my fault. I can only do so much and still conform to the specs. In particular, XPath attempts to help as much as possible. Therefore, in the trivial cases, you can pass namespace prefixes to Element.elements[...] and so on -- in these cases, XPath will use the namespace environment of the base element you're starting your XPath search from. However, if you want to do something more complex, like pass in your own namespace environment, you have to use the XPath first(), each(), and match() methods. Also, default namespaces force you to use the XPath methods, rather than the convenience methods, because there is no way for XPath to know what the mappings for the default namespaces should be. This is exactly why I loathe namespaces -- a pox on the person(s) who thought them up!
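The two cases described above can be sketched as follows. This is only an illustration: the namespace URIs, the 'd' prefix, and the namespace_lookup helper are assumptions invented for the example. The first lookup relies on the prefix declared in the document; the second passes an explicit namespace mapping to XPath.first to reach an element in the default namespace.

```ruby
require 'rexml/document'

# Made-up sample with a default namespace and a prefixed namespace.
NS_SAMPLE = <<~DOC
  <root xmlns="http://example.com/default" xmlns:x="http://example.com/x">
    <item>default</item>
    <x:item>prefixed</x:item>
  </root>
DOC

# Hypothetical helper returning [text of x:item, text of the default-namespace item].
def namespace_lookup(xml)
  doc = REXML::Document.new(xml)

  # Trivial case: the 'x' prefix is resolved against the namespace
  # declarations in scope at the element the search starts from.
  prefixed = doc.root.elements['x:item'].text

  # Default namespace: supply our own environment, binding a prefix of our
  # choosing ('d') to the default namespace URI.
  mapping = { 'd' => 'http://example.com/default' }
  in_default = REXML::XPath.first(doc, '//d:item', mapping).text

  [prefixed, in_default]
end

p namespace_lookup(NS_SAMPLE)
```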
Namespace support is now fairly stable. One thing to be aware of is that REXML is not (yet) a validating parser. This means that some invalid namespace declarations are not caught.
There is a low-volume mailing list dedicated to REXML. To subscribe, send an empty email to ser-rexml-subscribe@germane-software.com. This list is more or less spam proof. To unsubscribe, similarly send a message to ser-rexml-unsubscribe@germane-software.com.
An RSS file for REXML is now being generated from the change log. This allows you to be alerted of bug fixes and feature additions via "pull". Another RSS is available which contains a single item: the release notice for the most recent release. This is an abuse of the RSS mechanism, which was intended to be a distribution system for headlines linked back to full articles, but it works. The headline for REXML is the version number, and the description is the change log. The links all link back to the REXML home page. The URL for the RSS itself is http://www.germane-software.com/software/rexml/rss.xml.
The changelog itself is here.
For those who are interested, there's a SLOCCount (by David A. Wheeler) file with stats on the REXML sourcecode. Note that the SLOCCount output includes the files in the test/, benchmarks/, and bin/ directories, as well as the main sourcecode for REXML itself.
You can submit bug reports and feature requests, and view the list of known bugs, at the REXML bug report page. Please do submit bug reports. If you really want your bug fixed fast, include an runit or Test::Unit method (or methods) that illustrates the problem. At the very least, send me some XML that REXML doesn't process properly.
You don't have to send an entire test suite -- just the unit test methods. If you don't send me a unit test, I'll have to write one myself, which will mean that your bug will take longer to fix.
When submitting bug reports, please include the version of Ruby and of REXML that you're using, and the operating system you're running on. Just run:
ruby -vrrexml/rexml -e 'p REXML::VERSION,RUBY_PLATFORM'
and paste the results in your bug report. Include your email if you want a response about the bug.
REXML is hanging while parsing one of my XML files.
Your XML is probably malformed. Some malformed XML, especially XML that contains literal '<' embedded in the document, causes REXML to hang. REXML should be throwing an exception, but it doesn't; this is a bug. I'm aware that it is an extremely annoying bug, and it is one I'm trying to solve in a way that doesn't significantly reduce REXML's parsing speed.
I'm using the XPath '//foo' on an XML branch node X, and keep getting all of the 'foo' elements in the entire document. Why? Shouldn't it return only the 'foo' element descendants of X?
No. XPath specifies that '/' returns the document root, regardless of the context node. '//' also starts at the document root. If you want to limit your search to a branch, you need to use the self:: axis, e.g. 'self::node()//foo', or the shorthand './/foo'.
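A small sketch of the difference, assuming a made-up document in which the branch node is called x; the foo_counts helper is hypothetical and exists only for this example.

```ruby
require 'rexml/document'

# Hypothetical helper: count 'foo' matches from the branch <x>, first with
# '//foo' (anchored at the document root) and then with './/foo' (anchored
# at the context node).
def foo_counts(xml)
  doc = REXML::Document.new(xml)
  branch = doc.root.elements['x']
  whole_document = REXML::XPath.match(branch, '//foo').size
  branch_only = REXML::XPath.match(branch, './/foo').size
  [whole_document, branch_only]
end

# One 'foo' outside the branch, two inside it.
p foo_counts('<root><foo/><x><foo/><y><foo/></y></x></root>')
```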
I want to parse a document both as a tree, and as a stream. Can I do this?
Yes, and no. There is no mechanism that directly supports this in REXML. However, aside from writing your own traversal layer, there is a way of doing this. To turn a tree into a stream, just turn the branch you want to process as a stream back into a string, and re-parse it with your preferred API. For example: pp = PullParser.new( some_element.to_s ). The other direction is more difficult; you basically have to build a tree from the events. REXML will have one of these builders, eventually, but it doesn't currently exist.
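The tree-to-stream direction might look like the sketch below. It assumes the pull parser lives at REXML::Parsers::PullParser, as it does in current REXML releases; the start_tags helper and the sample document are invented for illustration.

```ruby
require 'rexml/document'
require 'rexml/parsers/pullparser'

# Hypothetical helper: serialize one branch of a tree back to a string,
# re-parse it as a stream, and collect the names of the start tags seen.
def start_tags(element)
  parser = REXML::Parsers::PullParser.new(element.to_s)
  names = []
  while parser.has_next?
    event = parser.pull
    # For a start_element event, index 0 holds the tag name.
    names << event[0] if event.start_element?
  end
  names
end

doc = REXML::Document.new('<root><branch><leaf/><leaf/></branch><other/></root>')
p start_tags(doc.root.elements['branch'])
```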
Why is Element.elements indexed off of '1' instead of '0'?
Because of XPath. The XPath specification states that the index of the first child node is '1'. Although it may be counter-intuitive to base elements on 1, it is more undesirable to have element.elements[0] == element.elements[ 'node()[1]' ]. Since I can't change the XPath specification, the result is that Element.elements[1] is the first child element.
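A quick sketch of the 1-based indexing, using a made-up list document. The first_child_names helper is hypothetical, and the XPath '*[1]' is used here instead of 'node()[1]' so that only element children are considered.

```ruby
require 'rexml/document'

# Hypothetical helper: fetch the first child element by Integer index and by
# XPath, and report whether both lookups return the same node.
def first_child_names(xml)
  root = REXML::Document.new(xml).root
  by_integer = root.elements[1]      # Integer index, counted from 1
  by_xpath = root.elements['*[1]']   # XPath position, also counted from 1
  [by_integer.name, by_xpath.name, by_integer.equal?(by_xpath)]
end

p first_child_names('<list><a/><b/><c/></list>')
```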
Why isn't REXML a validating parser?
Because validating parsers must include code that parses and interprets DTDs. I hate DTDs. REXML supports the barest minimum of DTD parsing, and even that isn't complete. There is DTD parsing code in the works, but I only work on it when I'm really, really bored. Rumor has it that a contributor is working on a DTD parser for REXML; rest assured that any such contribution will be included with REXML as soon as it is available.
I'm trying to create an ISO-8859-1 document, but when I add text to the document it isn't being properly encoded.
Regardless of what the encoding of your document is, when you add text programmatically to a REXML document you must ensure that you are only adding UTF-8 to the tree. In particular, you can't add ISO-8859-1 encoded text that contains characters above 0x80 to REXML trees -- you must convert it to UTF-8 before doing so. Luckily, this is easy:
text.unpack('C*').pack('U*')
will do the trick. 7-bit ASCII is identical to UTF-8, so you probably won't need to worry about this.
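Put together, the conversion might be used as in this sketch. The latin1_to_utf8 helper, the menu document, and the sample string are assumptions made for the example. On Ruby 1.9 and later, String#encode('UTF-8', 'ISO-8859-1') should give the same result.

```ruby
require 'rexml/document'

# Hypothetical helper wrapping the one-liner from the text: each byte of an
# ISO-8859-1 string is also its Unicode code point, so unpacking the bytes
# and re-packing them as code points yields UTF-8.
def latin1_to_utf8(text)
  text.unpack('C*').pack('U*')
end

latin1 = "caf\xE9".b            # "cafe" with e-acute, as raw ISO-8859-1 bytes
utf8 = latin1_to_utf8(latin1)

# Only the converted UTF-8 text goes into the tree.
doc = REXML::Document.new('<menu/>')
doc.root.add_text(utf8)
puts doc.root.text
```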
How do I get the tag name of an Element?
You take a look at the APIs, and notice that Element includes Namespace. Then you click on the Namespace link and look at the methods that Element includes from Namespace. One of these is name(). Another is expanded_name(). Yet another is prefix(). Then, you email the author of rdoc and ask him to extend rdoc so that it lists methods in the API that are included from other files, so that you don't have to do all of that looking around for your method.
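For a quick look at what those three methods return, here is a minimal sketch. The tag_names helper, the x prefix, and the namespace URI are made up for the example.

```ruby
require 'rexml/document'

# Hypothetical helper gathering the three Namespace methods mentioned above.
def tag_names(element)
  {
    name: element.name,                   # local name, without the prefix
    expanded_name: element.expanded_name, # prefix and local name together
    prefix: element.prefix                # namespace prefix only
  }
end

doc = REXML::Document.new('<x:book xmlns:x="http://example.com/x"/>')
p tag_names(doc.root)
```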
I've had help from a number of resources; if I haven't listed you here, it means that I just haven't gotten around to adding you, or that I'm a dork and have forgotten. In either case, feel free to write me and complain.