From fcbf63e62c627deae76c1b8cb8c0876c536ed811 Mon Sep 17 00:00:00 2001 From: Jari Vetoniemi Date: Mon, 16 Mar 2020 18:49:26 +0900 Subject: Fresh start --- jni/ruby/test/rexml/data/tutorial.xml | 678 ++++++++++++++++++++++++++++++++++ 1 file changed, 678 insertions(+) create mode 100644 jni/ruby/test/rexml/data/tutorial.xml (limited to 'jni/ruby/test/rexml/data/tutorial.xml') diff --git a/jni/ruby/test/rexml/data/tutorial.xml b/jni/ruby/test/rexml/data/tutorial.xml new file mode 100644 index 0000000..bf5783d --- /dev/null +++ b/jni/ruby/test/rexml/data/tutorial.xml @@ -0,0 +1,678 @@ + + + + + + + REXML Tutorial + + $Revision: 1.1.2.1 $ + + *2001-296+594 + + http://www.germane-software.com/~ser/software/rexml + + + + ruby + + Sean Russell + + + + +

This is a tutorial for using REXML, + a pure Ruby XML processor.

+
+ + +

REXML was inspired by the Electric XML library for Java, which + features an easy-to-use API, small size, and speed. Hopefully, REXML, + designed with the same philosophy, has these same features. I've tried + to keep the API as intuitive as possible, and have followed the Ruby + methodology for method naming and code flow, rather than mirroring the + Java API.

+ +

REXML supports both tree and stream document parsing. Stream parsing + is faster (about 1.5 times as fast). However, with stream parsing, you + don't get access to features such as XPath.

+ +

The API documentation also + contains code snippets to help you learn how to use various methods. + This tutorial serves as a starting point and quick guide to using + REXML.

+ + +

We'll start with parsing an XML document:

+ + require "rexml/document" +file = File.new( "mydoc.xml" ) +doc = REXML::Document.new file + +

Line 3 creates a new document and parses the supplied file. You can + also do the following:

+ + require "rexml/document" +include REXML # so that we don't have to prefix everything with REXML::... +string = <<EOF + <mydoc> + <someelement attribute="nanoo">Text, text, text</someelement> + </mydoc> +EOF +doc = Document.new string + +

So parsing a string is just as easy as parsing a file. For future + examples, I'm going to omit both the require and + include lines.

+ +

Once you have a document, you can access elements in that document + in a number of ways:

+ + + The Element class itself has + each_element_with_attribute, a common way of accessing + elements. + + The attribute Element.elements is an + Elements class instance which has the each + and [] methods for accessing elements. Both methods can + be supplied with an XPath for filtering, which makes them very + powerful. + + Since Element is a subclass of Parent, you can + also access the element's children directly through the Array-like + methods Element[], Element.each, Element.find, + Element.delete. This is the fastest way of accessing + children, but note that, being a true array, XPath searches are not + supported, and that all of the element children are contained in + this array, not just the Element children. + + +

Here are a few examples using these methods. First is the source + document used in the examples. Save this as mydoc.xml before running + any of the examples that require it:

+ + <inventory title="OmniCorp Store #45x10^3"> + <section name="health"> + <item upc="123456789" stock="12"> + <name>Invisibility Cream</name> + <price>14.50</price> + <description>Makes you invisible</description> + </item> + <item upc="445322344" stock="18"> + <name>Levitation Salve</name> + <price>23.99</price> + <description>Levitate yourself for up to 3 hours per application</description> + </item> + </section> + <section name="food"> + <item upc="485672034" stock="653"> + <name>Blork and Freen Instameal</name> + <price>4.95</price> + <description>A tasty meal in a tablet; just add water</description> + </item> + <item upc="132957764" stock="44"> + <name>Grob winglets</name> + <price>3.56</price> + <description>Tender winglets of Grob. Just add water</description> + </item> + </section> +</inventory> + + doc = Document.new File.new("mydoc.xml") +doc.elements.each("inventory/section") { |element| puts element.attributes["name"] } +# -> health +# -> food +doc.elements.each("*/section/item") { |element| puts element.attributes["upc"] } +# -> 123456789 +# -> 445322344 +# -> 485672034 +# -> 132957764 +root = doc.root +puts root.attributes["title"] +# -> OmniCorp Store #45x10^3 +puts root.elements["section/item[@stock='44']"].attributes["upc"] +# -> 132957764 +puts root.elements["section"].attributes["name"] +# -> health (returns the first encountered matching element) +puts root.elements[1].attributes["name"] +# -> health (returns the FIRST child element) +root.detect {|node| node.kind_of? Element and node.attributes["name"] == "food" } + +

Notice the second-to-last line of code. Element children in REXML + are indexed starting at 1, not 0. This is because XPath itself counts + elements from 1, and REXML maintains this relationship; i.e., + root.elements['*[1]'] == root.elements[1]. The last line + finds the first child element whose name attribute is "food". As you can see + in this example, accessing attributes is also straightforward.
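The indexing rule can be checked with a short sketch. The inline document here is a cut-down, hypothetical stand-in for mydoc.xml; it has no whitespace between tags, so the Array-like child list holds only elements. The variable names are illustrative only.

```ruby
require "rexml/document"

# Hypothetical, whitespace-free stand-in for mydoc.xml.
source = "<inventory><section name='health'/><section name='food'/></inventory>"
root = REXML::Document.new(source).root

# Elements#[] counts from 1, like XPath.
first_by_index = root.elements[1]
first_by_xpath = root.elements["*[1]"]

# The Array-like accessor on the element itself counts from 0.
first_child = root[0]

# Visit only the child elements whose "name" attribute equals "food".
food_sections = []
root.each_element_with_attribute("name", "food") do |el|
  food_sections << el.attributes["name"]
end

puts first_by_index.attributes["name"]
puts first_child.attributes["name"]
```

Both puts calls print the same section name, because index 1 in elements and index 0 in the child array refer to the same node in this whitespace-free document.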

+ +

You can also evaluate XPath expressions directly via the XPath class:

+ + # The invisibility cream is the first <item> +invisibility = XPath.first( doc, "//item" ) +# Prints out all of the prices +XPath.each( doc, "//price") { |element| puts element.text } +# Gets an array of all of the "name" elements in the document. +names = XPath.match( doc, "//name" ) + +

Another way of getting an array of matching nodes is through + Element.elements.to_a(). Although this is a method on elements, if + passed an XPath it can return an array of arbitrary objects. This is + due to the fact that XPath itself can return arbitrary nodes + (Attribute nodes, Text nodes, and Element nodes).

+ + all_elements = doc.elements.to_a +all_children = doc.to_a +all_upc_strings = doc.elements.to_a( "//item/attribute::upc" ) +all_name_elements = doc.elements.to_a( "//name" ) +
+ + +

REXML attempts to make the common case simple, but this means that + the uncommon case can be complicated. This is especially true with + Text nodes.

+ +

Text nodes have a lot of behavior, and in the case of internal + entities, what you get may be different from what you expect. When + REXML reads an XML document, it parses the DTD and creates an internal + table of entities. If it finds any of these entities in the document, + it replaces them with their values:

+ + doc = Document.new '<!DOCTYPE foo [ +<!ENTITY ent "replace"> +]><a>&ent;</a>' +doc.root.text #-> "replace" + + +

When you write the document back out, REXML replaces the values + with the entity reference:

+ + doc.to_s +# Generates: +# <!DOCTYPE foo [ +# <!ENTITY ent "replace"> +# ]><a>&ent;</a> + +

But there's a problem. What happens if only some of the words are + also entity reference values?

+ + doc = Document.new '<!DOCTYPE foo [ +<!ENTITY ent "replace"> +]><a>replace &ent;</a>' +doc.root.text #-> "replace replace" + + +

Well, REXML does the only thing it can:

+ + doc.to_s +# Generates: +# <!DOCTYPE foo [ +# <!ENTITY ent "replace"> +# ]><a>&ent; &ent;</a> + +

This is probably not what you expect. However, when designing + REXML, I had a choice between this behavior, and using immutable text + nodes. The problem is that, if you can change the text in a node, + REXML can never tell which tokens you want to have replaced with + entities. There is a wrinkle: REXML will write what it gets in as long + as you don't access the text. This is because REXML does lazy + evaluation of entities. Therefore,

+ + doc = Document.new( '<!DOCTYPE foo [ +<!ENTITY ent "replace"> +]><a>replace &ent;</a>' ) +doc.to_s +# Generates: +# <!DOCTYPE foo [ +# <!ENTITY ent "replace"> +# ]><a>replace &ent;</a> +doc.root.text #-> Now accessed, entities have been resolved +doc.to_s +# Generates: +# <!DOCTYPE foo [ +# <!ENTITY ent "replace"> +# ]><a>&ent; &ent;</a> + +

There is a programmatic solution: :raw. If you set the + :raw flag on any Text or Element node, the entities + within that node will not be processed. This means that you'll have to + deal with entities yourself:

+ + doc = Document.new('<!DOCTYPE foo [ +<!ENTITY ent "replace"> +]><a>replace &ent;</a>', {:raw=>:all}) +doc.root.text #-> "replace &ent;" +doc.to_s +# Generates: +# <!DOCTYPE foo [ +# <!ENTITY ent "replace"> +# ]><a>replace &ent;</a> +
+ + +

Again, there are a couple of mechanisms for creating XML documents + in REXML. Adding elements by hand is faster than the convenience + method, but which you use will probably be a matter of aesthetics.

+ + el = someelement.add_element "myel" +# creates an element named "myel", adds it to "someelement", and returns it +el2 = el.add_element "another", {"id"=>"10"} +# does the same, but also sets attribute "id" of el2 to "10" +el3 = Element.new "blah" +el.elements << el3 +el3.attributes["myid"] = "sean" +# creates el3 "blah", adds it to el, then sets attribute "myid" to "sean" + +

If you want to add text to an element, you can do it by either + creating Text objects and adding them to the element, or by using the + convenience method text=

+ + el1 = Element.new "myelement" +el1.text = "Hello world!" +# -> <myelement>Hello world!</myelement> +el1.add_text "Hello dolly" +# -> <myelement>Hello world!Hello dolly</myelement> +el1.add Text.new("Goodbye") +# -> <myelement>Hello world!Hello dollyGoodbye</myelement> +el1 << Text.new(" cruel world") +# -> <myelement>Hello world!Hello dollyGoodbye cruel world</myelement> + +

But note that each of these text objects are still stored as + separate objects; el1.text will return "Hello world!"; + el1[2] will return a Text object with the contents + "Goodbye".

+ +

Please be aware that all text nodes in REXML are UTF-8 encoded, and + all of your code must reflect this. You may input and output other + encodings (UTF-8, UTF-16, ISO-8859-1, and UNILE are all supported, + input and output), but within your program, you must pass REXML UTF-8 + strings.

+ +

I can't emphasize this enough, because people do have problems with + this. REXML can't possibly always guess correctly how your text is + encoded, so it always assumes the text is UTF-8. It also does not warn + you when you try to add text which isn't properly encoded, for the + same reason. You must make sure that you are adding UTF-8 text. + If you're adding standard 7-bit ASCII, which is most common, you + don't have to worry. If you're using ISO-8859-1 text (characters + above 0x80), you must convert it to UTF-8 before adding it to an + element. You can do this with the snippet: + text.unpack("C*").pack("U*"). If you ignore this warning + and add 8-bit ASCII characters to your documents, your code may + work... or it may not. In either case, REXML is not at fault. + You have been warned.
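As a minimal sketch of that conversion, the following compares the unpack/pack idiom with String#encode, which is an alternative available in Ruby 1.9 and later (an assumption about your Ruby version, not something the tutorial itself relies on). The sample bytes and variable names are illustrative.

```ruby
# The bytes of "für" in ISO-8859-1; .b returns a binary-tagged copy,
# so Ruby makes no assumption about how the bytes are encoded.
latin1_bytes = "f\xfcr".b

# Transcode to UTF-8, stating the source encoding explicitly.
utf8 = latin1_bytes.encode("UTF-8", "ISO-8859-1")

# The unpack/pack idiom: each byte becomes the code point of the same value,
# which is exactly the ISO-8859-1 to Unicode mapping.
via_pack = latin1_bytes.unpack("C*").pack("U*")

puts utf8.valid_encoding?
puts utf8 == via_pack
```

Either result can then be handed to REXML as element text.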

+ +

One last thing: alternate encoding output support only works from + Document.write() and Document.to_s(). If you want to write out other + nodes with a particular encoding, you must wrap your output object + with Output:

+ + e = Element.new "a" +e.text = "f\u00fcr" # 'ü', held as UTF-8 inside the program +o = '' +e.write( Output.new( o, "ISO-8859-1" ) ) # written out as ISO-8859-1 + + +

You can pass Output any of the supported encodings.

+ +

If you want to insert an element between two elements, you can use + either the standard Ruby array notation, or + Parent.insert_before and + Parent.insert_after.

+ + doc = Document.new "<a><one/><three/></a>" +doc.root[1,0] = Element.new "two" +# -> <a><one/><two/><three/></a> +three = doc.elements["a/three"] +doc.root.insert_after three, Element.new("four") +# -> <a><one/><two/><three/><four/></a> +# A convenience method allows you to insert before/after an XPath: +doc.root.insert_after( "//one", Element.new("one-five") ) +# -> <a><one/><one-five/><two/><three/><four/></a> +# Another convenience method allows you to insert after/before an element: +four = doc.elements["//four"] +four.previous_sibling = Element.new("three-five") +# -> <a><one/><one-five/><two/><three/><three-five/><four/></a> + +

The raw flag in the Text constructor can + be used to tell REXML to leave strings which have entities defined for + them alone.

+ + doc = Document.new( "<?xml version='1.0'?> +<!DOCTYPE foo SYSTEM 'foo.dtd' [ +<!ENTITY % s 'Sean'> +]> +<a/>" ) +t = Text.new( "Sean", false, nil, false ) +doc.root.text = t +t.to_s # -> &s; +t = Text.new( "Sean", false, nil, true ) +doc.root.text = t +t.to_s # -> Sean + +

Note that, in all cases, the value() method returns + the text with entities expanded, so the raw flag only + affects the to_s() method. If the raw flag is set + for a text node, then to_s() will not + normalize (turn into entities) entity values. You cannot create raw + text nodes that contain illegal XML, so the following will generate a + parse error:

+ + t = Text.new( "&", false, nil, true ) + +

You can also tell REXML to set the Text children of given elements + to raw automatically, on parsing or creating:

+ + doc = REXML::Document.new( source, { :raw => %w{ tag1 tag2 tag3 } } ) + +

In this example, all tags named "tag1", "tag2", or "tag3" will have + any Text children set to raw text. If you want to have all of the text + processed as raw text, pass in the :all tag:

+ + doc = REXML::Document.new( source, { :raw => :all }) +
+ + +

There aren't many things that are more simple than writing a REXML + tree. Simply pass an object that supports <<( String + ) to the write method of any object. In Ruby, both + IO instances (File) and String instances support <<.

+ + doc.write $stdout +output = "" +doc.write output + +

If you want REXML to pretty-print output, pass write() + an indent value greater than -1:

+ + doc.write( $stdout, 0 ) + +

REXML will not, by default, write out the XML declaration unless + you specifically ask for it. If a document is read that contains an + XML declaration, that declaration will be written + faithfully. The other way you can tell REXML to write the declaration + is to specifically add the declaration:

+ + doc = Document.new +doc.add_element 'foo' +doc.to_s #-> <foo/> +doc << XMLDecl.new +doc.to_s #-> <?xml version='1.0'?><foo/> +
+ + +

There are four main ways of iterating over children: + Element.each, which iterates over all the children; + Element.elements.each, which iterates over just the child + Elements; Element.next_element and + Element.previous_element, which fetch the + next and previous Element siblings; and Element.next_sibling and + Element.previous_sibling, which fetch the next and + previous siblings, regardless of type.
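A small sketch of these iteration styles follows. The inline document is hypothetical and was chosen so that the root has a Text, an Element, a Comment, and another Element as children, which makes the difference between "all children" and "child Elements" visible.

```ruby
require "rexml/document"

# Hypothetical document: children of <a> are Text, <b/>, a Comment, and <c/>.
root = REXML::Document.new("<a>text<b/><!-- note --><c/></a>").root

# Element.each visits every child node, whatever its type.
child_classes = []
root.each { |child| child_classes << child.class }

# Element.elements.each visits only the child Elements.
element_names = []
root.elements.each { |el| element_names << el.name }

b = root.elements["b"]
next_el   = b.next_element      # skips the Comment and returns <c/>
prev_el   = b.previous_element  # no Element precedes <b/>
next_node = b.next_sibling      # the Comment node
prev_node = b.previous_sibling  # the Text node

puts child_classes.inspect
puts element_names.inspect
```

Use the element-only forms when whitespace text or comments between tags should be ignored, and the sibling forms when every node matters.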

+
+ + +

REXML stream parsing requires you to supply a Listener class. When + REXML encounters events in a document (tag start, text, etc.) it + notifies your listener class of the event. You can supply any subset + of the methods, but make sure you implement method_missing if you + don't implement them all. A StreamListener module has been supplied as + a template for you to use.

+ + list = MyListener.new +source = File.new "mydoc.xml" +REXML::Document.parse_stream(source, list) + +

Stream parsing in REXML is much like SAX, where events are + generated when the parser encounters them in the process of parsing + the document. When a tag is encountered, the stream listener's + tag_start() method is called. When the tag end is + encountered, tag_end() is called. When text is + encountered, text() is called, and so on, until the end + of the stream is reached. One other note: the method + entity() is called when an &entity; is + encountered in text, and only then.
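The earlier example refers to a MyListener class without defining it. A minimal sketch of such a listener might look like the following; the class name EventRecorder, its events list, and the inline document are hypothetical. Including StreamListener supplies empty defaults for the callbacks that are not overridden.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener that records the events it is notified of.
class EventRecorder
  include REXML::StreamListener

  attr_reader :events

  def initialize
    @events = []
  end

  # Called when a start tag (or an empty-element tag) is encountered.
  def tag_start(name, attrs)
    @events << "start #{name}"
  end

  # Called when an end tag is encountered.
  def tag_end(name)
    @events << "end #{name}"
  end

  # Called for text content; whitespace-only text is ignored here.
  def text(text)
    @events << "text #{text}" unless text.strip.empty?
  end
end

recorder = EventRecorder.new
REXML::Document.parse_stream("<a><b id='1'>hi</b><c/></a>", recorder)
puts recorder.events
```

The recorded list reflects document order, which is the main thing a stream listener can rely on since no tree is built.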

+ +

Please look at the StreamListener + API for more information. (You must generate the API + documentation with rdoc or download the API documentation from the + REXML website for this documentation.)

+
+ + +

By default, REXML respects whitespace in your document. In many + applications, you want the parser to compress whitespace in your + document. In these cases, you have to tell the parser which elements + you want to respect whitespace in by passing a context to the + parser:

+ + doc = REXML::Document.new( source, { :compress_whitespace => %w{ tag1 tag2 tag3 } } ) + +

Whitespace for tags "tag1", "tag2", and "tag3" will be compressed; + all other tags will have their whitespace respected. Like :raw, you + can set :compress_whitespace to :all, and have all elements have their + whitespace compressed.

+ +

You may also use the tag :respect_whitespace, which + flip-flops the behavior. If you use :respect_whitespace + for one or more tags, only those elements will have their whitespace + respected; all other tags will have their whitespace compressed.

+
+ + +

REXML does some automatic processing of entities for your + convenience. The processed entities are &, <, >, ", and '. + If REXML finds any of these characters in Text or Attribute values, it + automatically turns them into entity references when it writes them + out. Additionally, when REXML finds any of these entity references in + a document source, it converts them to their character equivalents. + All other entity references are left unprocessed. If REXML finds an + &, <, or > in the document source, it will generate a + parsing error.

+ + bad_source = "<a>Cats & dogs</a>" +good_source = "<a>Cats &amp; &#100;ogs</a>" +doc = REXML::Document.new bad_source +# Generates a parse error +doc = REXML::Document.new good_source +puts doc.root.text +# -> "Cats & &#100;ogs" +doc.root.write $stdout +# -> "<a>Cats &amp; &#100;ogs</a>" +doc.root.attributes["m"] = "x'y\"z" +puts doc.root.attributes["m"] +# -> "x'y\"z" +doc.root.write $stdout +# -> "<a m='x&apos;y&quot;z'>Cats &amp; &#100;ogs</a>" +
+ + +

Namespaces are fully supported in REXML and within the XPath + parser. There are a few caveats when using XPath, however:

+ + + If you don't supply a namespace mapping, the default namespace + mapping of the context element is used. This has its limitations, + but is convenient for most purposes. + + If you need to supply a namespace mapping, you must use the + XPath methods each, first, and + match and pass them the mapping. + + + source = "<a xmlns:x='foo' xmlns:y='bar'><x:b id='1'/><y:b id='2'/></a>" +doc = Document.new source +doc.elements["/a/x:b"].attributes["id"] # -> '1' +XPath.first(doc, "/a/m:b", {"m"=>"bar"}).attributes["id"] # -> '2' +doc.elements["//x:b"].prefix # -> 'x' +doc.elements["//x:b"].namespace # -> 'foo' +XPath.first(doc, "//m:b", {"m"=>"bar"}).prefix # -> 'y' +
+ + +

The pull parser API is not yet stable. When it settles down, I'll + fill in this section. For now, you'll have to bite the bullet and read + the PullParser + API docs. Ignore the PullListener class; it is a private helper + class.

+
+ + +

The original REXML stream parsing API is very minimal. This also + means that it is fairly fast. For a more complex, more "standard" API, + REXML also includes a streaming parser with a SAX2+ API. This API + differs from SAX2 in a couple of ways, such as having more filters and + multiple notification mechanisms, but the core API is SAX2.

+ +

The two classes in the SAX2 API are SAX2Parser + and SAX2Listener. + You can use the parser in one of five ways, depending on your needs. + Three of the ways are useful if you are filtering for a small number + of events in the document, such as just printing out the names of all + of the elements in a document, or getting all of the text in a + document. The other two ways are for more complex processing, where + you want to be notified of multiple events. The first three involve + Procs, and the last two involve listeners. The listener mechanisms are + very similar to the original REXML streaming API, with the addition of + filtering options, and are faster than the proc mechanisms.

+ +

An example is worth a thousand words, so we'll just take a look at + a small example of each of the mechanisms. The first example involves + printing out only the text content of a document.

+ + require 'rexml/parsers/sax2parser' +parser = REXML::Parsers::SAX2Parser.new( File.new( 'documentation.xml' ) ) +parser.listen( :characters ) {|text| puts text } +parser.parse + +

In this example, we tell the parser to call our block for every + characters event. "characters" is what SAX2 calls Text + nodes. The event is identified by the symbol :characters. + There are a number of these events, including + :start_element, :end_prefix_mapping, and so + on; the events are named after the methods in the + SAX2Listener API, so refer to that document for a + complete list.

+ +

You can additionally filter for particular elements by passing an + array of tag names to the listen method. In further + examples, we will not include the require or parser + construction lines, as they are the same for all of these + examples.

+ + parser.listen( :characters, %w{ changelog todo } ) {|text| puts text } +parser.parse + +

In this example, only the text content of changelog and todo + elements will be printed. The array of tag names can also contain + regular expressions which the element names will be matched + against.

+ +

Finally, as a shortcut, if you do not pass a symbol to the listen + method, it will default to :start_element:

+ + parser.listen( %w{ item }) do |uri,localname,qname,attributes| + puts attributes['version'] +end +parser.parse + +

This example prints the "version" attribute of all "item" elements + in the document. Notice that the number of arguments passed to the + block is larger than for :characters; again, check the + SAX2Listener API for a list of what arguments are passed to the blocks + for a given event.

+ +

The last two mechanisms for parsing use the SAX2Listener API. Like + StreamListener, SAX2Listener is a module, so you can + include it in your class to give you an adapter. To use + the listener model, create a class that implements some of the + SAX2Listener methods, or all of them if you don't include the + SAX2Listener module. Add them to a parser as you would blocks, and when + the parser is run, the methods will be called when events occur. + Listeners do not use event symbols, but they can filter on element + names.

+ + listener1 = MySAX2Listener.new +listener2 = MySAX2Listener.new +parser.listen( listener1 ) +parser.listen( %w{ changelog todo credits }, listener2 ) +parser.parse + +

In the previous example, listener1 will be notified of + all events that occur, and listener2 will only be + notified of events that occur in changelog, + todo, and credits elements. We also see that + multiple listeners can be added to the same parser; multiple blocks + can also be added, and listeners and blocks can be mixed together.
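MySAX2Listener is not defined in this tutorial. The following is a minimal sketch of what such a class could look like, mixed with a block on the same parser. The class name ElementNameCollector, the inline document, and the require paths (rexml/parsers/sax2parser and rexml/sax2listener, as laid out in the REXML bundled with current Ruby) are assumptions; adjust them to your installation.

```ruby
require "rexml/parsers/sax2parser"
require "rexml/sax2listener"

# Hypothetical listener: including SAX2Listener supplies no-op defaults,
# so only the events of interest need to be overridden.
class ElementNameCollector
  include REXML::SAX2Listener

  attr_reader :names

  def initialize
    @names = []
  end

  # Called once per start tag, in document order.
  def start_element(uri, localname, qname, attributes)
    @names << qname
  end
end

source = "<log><changelog>fixed</changelog><todo>more</todo></log>"
parser = REXML::Parsers::SAX2Parser.new(source)

collector = ElementNameCollector.new
parser.listen(collector)                          # listener: notified of all events

texts = []
parser.listen(:characters) { |text| texts << text }  # block: text events only

parser.parse
puts collector.names.inspect
puts texts.inspect
```

The listener and the block are notified independently during the same parse, so mixing them costs only one pass over the document.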

+ +

There is, as yet, no mechanism for recursion. Two upcoming features + of the SAX2 API will be the ability to filter based on an XPath, and + the ability to specify filtering on an element and all of its + descendants.

+ +

WARNING: The SAX2 API for dealing with doctype (DTD) + events almost certainly will change.

+
+ + +

Michael Neumann contributed some convenience functions for nodes, + and they are general enough that I've included them. Michael's use-case + examples follow:
+#
+# Starting with +root_node+, we recursively look for a node with the given
+# +tag+, the given +attributes+ (a Hash) and whose text equals or matches the
+# +text+ string or regular expression.
+#
+# To find the following node:
+#
+#   <td class='abc'>text</td>
+#
+# We use:
+#
+#   find_node(root, 'td', {'class' => 'abc'}, "text")
+#
+# Returns +nil+ if no matching node was found.
+def find_node(root_node, tag, attributes, text)
+  root_node.find_first_recursive {|node|
+    node.name == tag and
+    attributes.all? {|attr, val| node.attributes[attr] == val} and
+    text === node.text
+  }
+end
+
+#
+# Extract specific columns (specified by the position of its corresponding
+# header column) from a table.
+#
+# Given the following table:
+#
+#   <table>
+#     <tr>
+#       <td>A</td>
+#       <td>B</td>
+#       <td>C</td>
+#     </tr>
+#     <tr>
+#       <td>A.1</td>
+#       <td>B.1</td>
+#       <td>C.1</td>
+#     </tr>
+#     <tr>
+#       <td>A.2</td>
+#       <td>B.2</td>
+#       <td>C.2</td>
+#     </tr>
+#   </table>
+#
+# To extract the first (A) and last (C) column:
+#
+#   extract_from_table(root_node, ["A", "C"])
+#
+# And you get this as result:
+#
+#   [
+#     ["A.1", "C.1"],
+#     ["A.2", "C.2"]
+#   ]
+#
+def extract_from_table(root_node, headers)
+  # extract and collect all header nodes
+  header_nodes = headers.collect { |header| find_node(root_node, 'td', {}, header) }
+  raise "some headers not found" if header_nodes.compact.size < headers.size
+
+  # assert that all headers have the same parent 'header_row', which is the row
+  # in which the header_nodes are contained. 'table' is the surrounding table tag.
+  header_row = header_nodes.first.parent
+  table = header_row.parent
+  raise "different parents" unless header_nodes.all? {|n| n.parent == header_row}
+
+  # we now iterate over all rows in the table that follow the header_row.
+  # for each row we collect the elements at the same positions as the header_nodes.
+  # this is what we finally return from the method.
+  (header_row.index_in_parent+1 .. table.elements.size).collect do |inx|
+    row = table.elements[inx]
+    header_nodes.collect { |n| row.elements[ n.index_in_parent ].text }
+  end
+end

+
+ + +

This isn't everything there is to REXML, but it should be enough to + get started. Check the API + documentation for particulars and more examples. (You must generate the API documentation + with rdoc or download the API documentation from the REXML website for + this documentation.) + There are plenty of unit tests in the test/ directory, + and these are great sources of working examples.

+
+
+
+ + +

Among the people who've contributed to this document are:

+ + + Eichert, Diana (bug + fix) + +
+
\ No newline at end of file -- cgit v1.2.3