update v1.0.7.7
		
							
								
								
									
										850
									
								
								vendor/htmlpurifier/docs/enduser-customize.html
									
									
									
									
										vendored
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,850 @@ | ||||
| <?xml version="1.0" encoding="UTF-8"?> | ||||
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" | ||||
|     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> | ||||
| <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> | ||||
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> | ||||
| <meta name="description" content="Tutorial for customizing HTML Purifier's tag and attribute sets." /> | ||||
| <link rel="stylesheet" type="text/css" href="style.css" /> | ||||
|  | ||||
| <title>Customize - HTML Purifier</title> | ||||
|  | ||||
| </head><body> | ||||
|  | ||||
| <h1 class="subtitled">Customize!</h1> | ||||
| <div class="subtitle">HTML Purifier is a Swiss-Army Knife</div> | ||||
|  | ||||
| <div id="filing">Filed under End-User</div> | ||||
| <div id="index">Return to the <a href="index.html">index</a>.</div> | ||||
| <div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> | ||||
|  | ||||
| <p> | ||||
|   HTML Purifier has this quirk where if you try to allow certain elements or | ||||
|   attributes, HTML Purifier will tell you that it's not supported, and that | ||||
|   you should go to the forums to find out how to implement it. Well, this | ||||
|   document is how to implement elements and attributes which HTML Purifier | ||||
|   doesn't support out of the box. | ||||
| </p> | ||||
|  | ||||
| <h2>Is it necessary?</h2> | ||||
|  | ||||
| <p> | ||||
|   Before we write any code, it is paramount to consider whether the | ||||
|   code we're writing is necessary at all. HTML Purifier, by default, | ||||
|   contains a large set of elements and attributes: large enough so that | ||||
|   <em>any</em> element or attribute in XHTML 1.0 or 1.1 (and its HTML variants) | ||||
|   that can be safely used by the general public is implemented. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   So what needs to be implemented? (Feel free to skip this section if | ||||
|   you know what you want.) | ||||
| </p> | ||||
|  | ||||
| <h3>XHTML 1.0</h3> | ||||
|  | ||||
| <p> | ||||
|   All of the modules listed below are based off of the | ||||
|   <a href="http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/abstract_modules.html#sec_5.2.">modularization of | ||||
|   XHTML</a>, which, while technically for XHTML 1.1, is quite a useful | ||||
|   resource. | ||||
| </p> | ||||
|  | ||||
| <ul> | ||||
|   <li>Structure</li> | ||||
|   <li>Applets (deprecated)</li> | ||||
|   <li>Forms</li> | ||||
|   <li>Image maps</li> | ||||
|   <li>Objects</li> | ||||
|   <li>Frames</li> | ||||
|   <li>Events</li> | ||||
|   <li>Meta-information</li> | ||||
|   <li>Style sheets</li> | ||||
|   <li>Link (not hypertext)</li> | ||||
|   <li>Base</li> | ||||
|   <li>Name</li> | ||||
| </ul> | ||||
|  | ||||
| <p> | ||||
|   If you don't recognize it, you probably don't need it. But the curious | ||||
|   can look all of these modules up in the above-mentioned document.  Note | ||||
|   that inline scripting comes packaged with HTML Purifier (more on this | ||||
|   later). | ||||
| </p> | ||||
|  | ||||
| <h3>XHTML 1.1</h3> | ||||
|  | ||||
| <p> | ||||
|   As of HTML Purifier 2.1.0, we have implemented the | ||||
|   <a href="http://www.w3.org/TR/2001/REC-ruby-20010531/">Ruby module</a>, | ||||
|   which defines a set of tags | ||||
|   for publishing short annotations for text, used mostly in Japanese | ||||
|   and Chinese school texts, but applicable for positioning any text (not | ||||
|   limited to translations) above or below other corresponding text. | ||||
| </p> | ||||
|  | ||||
| <h3>HTML 5</h3> | ||||
|  | ||||
| <p> | ||||
|   <a href="http://www.whatwg.org/specs/web-apps/current-work/">HTML 5</a> | ||||
|   is a fork of HTML 4.01 by WHATWG, who believed that XHTML 2.0 was headed | ||||
|   in the wrong direction.  It too is a working draft, and may change | ||||
|   drastically before publication, but it should be noted that the | ||||
|   <code>canvas</code> tag has been implemented by many browser vendors. | ||||
| </p> | ||||
|  | ||||
| <h3>Proprietary</h3> | ||||
|  | ||||
| <p> | ||||
|   There are a number of proprietary tags still in the wild. Many of them | ||||
|   have been documented in <a href="ref-proprietary-tags.txt">ref-proprietary-tags.txt</a>, | ||||
|   but there is currently no implementation for any of them. | ||||
| </p> | ||||
|  | ||||
| <h3>Extensions</h3> | ||||
|  | ||||
| <p> | ||||
|   There are also a number of other XML languages out there that can | ||||
|   be embedded in HTML documents: two of the most popular are MathML and | ||||
|   SVG, and I frequently get requests to implement these.  But they are | ||||
|   expansive, comprehensive specifications, and it would take far too long | ||||
|   to implement them <em>correctly</em> (most systems I've seen go as far | ||||
|   as whitelisting tags and no further; come on, what about nesting?). | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   Word of warning: HTML Purifier is currently <em>not</em> namespace | ||||
|   aware. | ||||
| </p> | ||||
|  | ||||
| <h2>Giving back</h2> | ||||
|  | ||||
| <p> | ||||
|   As you may imagine from the details above (don't be abashed if you didn't | ||||
|   read it all: a glance over would have done), there's quite a bit that | ||||
|   HTML Purifier doesn't implement.  Recent architectural changes have | ||||
|   allowed HTML Purifier to implement elements and attributes that are not | ||||
|   safe!  Don't worry, they won't be activated unless you set %HTML.Trusted | ||||
|   to true, but they certainly help out users who need to put, say, forms | ||||
|   on their page and don't want to go through the trouble of reading this | ||||
|   and implementing it themselves. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   So any of the above that you implement for your own application could | ||||
|   help out some other poor sap on the other side of the globe.  Help us | ||||
|   out, and send back code so that it can be hammered into a module and | ||||
|   released with the core.  Any code would be greatly appreciated! | ||||
| </p> | ||||
|  | ||||
| <h2>And now...</h2> | ||||
|  | ||||
| <p> | ||||
|   Enough philosophical talk, time for some code: | ||||
| </p> | ||||
|  | ||||
| <pre>$config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| $config->set('HTML.DefinitionRev', 1); | ||||
| if ($def = $config->maybeGetRawHTMLDefinition()) { | ||||
|     // our code will go here | ||||
| }</pre> | ||||
|  | ||||
| <p> | ||||
|   Assuming that HTML Purifier has already been properly loaded (hint: | ||||
|   include <code>HTMLPurifier.auto.php</code>), this code will set up | ||||
|   the environment that you need to start customizing the HTML definition. | ||||
|   What's going on? | ||||
| </p> | ||||
|  | ||||
| <ul> | ||||
|   <li> | ||||
|     The first three lines are regular configuration code: | ||||
|     <ul> | ||||
|       <li> | ||||
|         %HTML.DefinitionID is set to a unique identifier for your | ||||
|         custom HTML definition.  This prevents it from clobbering | ||||
|         other custom definitions on the same installation. | ||||
|       </li> | ||||
|       <li> | ||||
|         %HTML.DefinitionRev is a revision integer of your HTML | ||||
|         definition.  Because HTML definitions are cached, you'll need | ||||
|         to increment this whenever you make a change in order to flush | ||||
|         the cache. | ||||
|       </li> | ||||
|     </ul> | ||||
|   </li> | ||||
|   <li> | ||||
|     The fourth line retrieves a raw <code>HTMLPurifier_HTMLDefinition</code> | ||||
|     object that we will be tweaking.  Interestingly enough, we have | ||||
|     placed it in an if block: this is because | ||||
|     <code>maybeGetRawHTMLDefinition</code>, as its name suggests, may | ||||
|     return a NULL, in which case we should skip doing any | ||||
|     initialization.  This, in fact, will correspond to when our fully | ||||
|     customized object is already in the cache. | ||||
|   </li> | ||||
| </ul> | ||||
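|  | ||||
| <p> | ||||
|   As a rough sketch of how the pieces fit together, the block below wraps the | ||||
|   template in a complete script. The include path and the sample input are | ||||
|   assumptions made for illustration, and the <code>addAttribute()</code> call | ||||
|   merely stands in for the customizations covered in the following sections: | ||||
| </p> | ||||
|  | ||||
| <pre>// Assumption: HTML Purifier's library directory is on the include path. | ||||
| require_once 'HTMLPurifier.auto.php'; | ||||
|  | ||||
| $config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| $config->set('HTML.DefinitionRev', 1); | ||||
| if ($def = $config->maybeGetRawHTMLDefinition()) { | ||||
|     // Reached only when no cached definition exists for this ID and revision. | ||||
|     // Placeholder customization; see "Add an attribute" for what it means. | ||||
|     $def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top'); | ||||
| } | ||||
|  | ||||
| // The purifier picks up the cached or freshly built definition by itself. | ||||
| $purifier = new HTMLPurifier($config); | ||||
| echo $purifier->purify('<a href="http://example.com/" target="_blank">link</a>');</pre> | ||||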
|  | ||||
| <h2>Turn off caching</h2> | ||||
|  | ||||
| <p> | ||||
|   To make development easier, we're going to temporarily turn off | ||||
|   definition caching: | ||||
| </p> | ||||
|  | ||||
| <pre>$config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| $config->set('HTML.DefinitionRev', 1); | ||||
| <strong>$config->set('Cache.DefinitionImpl', null); // TODO: remove this later!</strong> | ||||
| $def = $config->getHTMLDefinition(true);</pre> | ||||
|  | ||||
| <p> | ||||
|   A few things should be mentioned about the caching mechanism before | ||||
|   we move on.  For performance reasons, HTML Purifier caches generated | ||||
|   <code>HTMLPurifier_Definition</code> objects in serialized files | ||||
|   stored (by default) in <code>library/HTMLPurifier/DefinitionCache/Serializer</code>. | ||||
|   A lot of processing is done in order to create these objects, so it | ||||
|   makes little sense to repeat the same processing over and over again | ||||
|   whenever HTML Purifier is called. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   In order to identify a cache entry, HTML Purifier uses three variables: | ||||
|   the library's version number, the value of %HTML.DefinitionRev and | ||||
|   a serial of relevant configuration.  Whenever any of these changes, | ||||
|   a new HTML definition is generated.  Notice that there is no way | ||||
|   for the definition object to track changes to customizations: here, it | ||||
|   is up to you to supply appropriate information to DefinitionID and | ||||
|   DefinitionRev. | ||||
| </p> | ||||
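|  | ||||
| <p> | ||||
|   In practice, once caching is switched back on, this means bumping the | ||||
|   revision every time you edit your customizations. The sketch below also | ||||
|   moves the cache with %Cache.SerializerPath; the directory shown is a | ||||
|   hypothetical path and would need to be writable by your web server: | ||||
| </p> | ||||
|  | ||||
| <pre>$config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| // Was 1: increment after every change to your customizations. | ||||
| $config->set('HTML.DefinitionRev', 2); | ||||
| // Hypothetical directory; the default location is inside the library. | ||||
| $config->set('Cache.SerializerPath', '/var/cache/htmlpurifier');</pre> | ||||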
|  | ||||
| <h2 id="addAttribute">Add an attribute</h2> | ||||
|  | ||||
| <p> | ||||
|   For this example, we're going to implement the <code>target</code> attribute found | ||||
|   on <code>a</code> elements.  To implement an attribute, we have to | ||||
|   ask a few questions: | ||||
| </p> | ||||
|  | ||||
| <ol> | ||||
|   <li>What element is it found on?</li> | ||||
|   <li>What is its name?</li> | ||||
|   <li>Is it required or optional?</li> | ||||
|   <li>What are valid values for it?</li> | ||||
| </ol> | ||||
|  | ||||
| <p> | ||||
|   The first three are easy: the element is <code>a</code>, the attribute | ||||
|   is <code>target</code>, and it is not a required attribute. (If it | ||||
|   were required, we'd need to append an asterisk to the attribute name; | ||||
|   you'll see an example of this in the addElement() example.) | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   The last question is a little trickier. | ||||
|   Let's allow the special values: _blank, _self, _parent and _top. | ||||
|   The form of this is called an <strong>enumeration</strong>, a list of | ||||
|   valid values, although only one can be used at a time.  To translate | ||||
|   this into code form, we write: | ||||
| </p> | ||||
|  | ||||
| <pre>$config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| $config->set('HTML.DefinitionRev', 1); | ||||
| $config->set('Cache.DefinitionImpl', null); // remove this later! | ||||
| $def = $config->getHTMLDefinition(true); | ||||
| <strong>$def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');</strong></pre> | ||||
|  | ||||
| <p> | ||||
|   The <code>Enum#_blank,_self,_parent,_top</code> does all the magic. | ||||
|   The string is split into two parts, separated by a hash mark (#): | ||||
| </p> | ||||
|  | ||||
| <ol> | ||||
|   <li>The first part is the name of what we call an <code>AttrDef</code></li> | ||||
|   <li>The second part is the parameter of the above-mentioned <code>AttrDef</code></li> | ||||
| </ol> | ||||
|  | ||||
| <p> | ||||
|   If that sounds vague and generic, it's because it is!  HTML Purifier defines | ||||
|   an assortment of different attribute types one can use, and each of these | ||||
|   has its own specialized parameter format.  Here are some of the more useful | ||||
|   ones: | ||||
| </p> | ||||
|  | ||||
| <table class="table"> | ||||
|   <thead> | ||||
|     <tr> | ||||
|       <th>Type</th> | ||||
|       <th>Format</th> | ||||
|       <th>Description</th> | ||||
|     </tr> | ||||
|   </thead> | ||||
|   <tbody> | ||||
|     <tr> | ||||
|       <th>Enum</th> | ||||
|       <td><em>[s:]</em>value1,value2,...</td> | ||||
|       <td> | ||||
|         Attribute with a number of valid values, one of which may be used. When | ||||
|         s: is present, the enumeration is case sensitive. | ||||
|       </td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Bool</th> | ||||
|       <td>attribute_name</td> | ||||
|       <td> | ||||
|         Boolean attribute, with only one valid value: the name | ||||
|         of the attribute. | ||||
|       </td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>CDATA</th> | ||||
|       <td></td> | ||||
|       <td> | ||||
|         Attribute of arbitrary text. Can also be referred to as <strong>Text</strong> | ||||
|         (the specification makes a semantic distinction between the two). | ||||
|       </td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>ID</th> | ||||
|       <td></td> | ||||
|       <td> | ||||
|         Attribute that specifies a unique ID | ||||
|       </td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Pixels</th> | ||||
|       <td></td> | ||||
|       <td> | ||||
|         Attribute that specifies an integer pixel length | ||||
|       </td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Length</th> | ||||
|       <td></td> | ||||
|       <td> | ||||
|         Attribute that specifies a pixel or percentage length | ||||
|       </td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>NMTOKENS</th> | ||||
|       <td></td> | ||||
|       <td> | ||||
|         Attribute that specifies a number of name tokens, example: the | ||||
|         <code>class</code> attribute | ||||
|       </td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>URI</th> | ||||
|       <td></td> | ||||
|       <td> | ||||
|         Attribute that specifies a URI, example: the <code>href</code> | ||||
|         attribute | ||||
|       </td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Number</th> | ||||
|       <td></td> | ||||
|       <td> | ||||
|         Attribute that specifies a positive integer | ||||
|       </td> | ||||
|     </tr> | ||||
|   </tbody> | ||||
| </table> | ||||
|  | ||||
| <p> | ||||
|   For a complete list, consult | ||||
|   <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/AttrTypes.php"><code>library/HTMLPurifier/AttrTypes.php</code></a>; | ||||
|   more information on attributes that accept parameters can be found on their | ||||
|   respective includes in | ||||
|   <a href="http://repo.or.cz/w/htmlpurifier.git?a=tree;hb=HEAD;f=library/HTMLPurifier/AttrDef"><code>library/HTMLPurifier/AttrDef</code></a>. | ||||
| </p> | ||||
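|  | ||||
| <p> | ||||
|   To give a feel for the shorthand, here is a sketch that uses a few of the | ||||
|   types from the table. The <code>data-*</code> attribute names are made up | ||||
|   for this example, and the include path is an assumption: | ||||
| </p> | ||||
|  | ||||
| <pre>// Assumption: HTML Purifier's library directory is on the include path. | ||||
| require_once 'HTMLPurifier.auto.php'; | ||||
|  | ||||
| $config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| $config->set('HTML.DefinitionRev', 1); | ||||
| $config->set('Cache.DefinitionImpl', null); // remove this later! | ||||
| if ($def = $config->maybeGetRawHTMLDefinition()) { | ||||
|     // One value out of a fixed list, compared case-sensitively (s: prefix) | ||||
|     $def->addAttribute('div', 'data-state', 'Enum#s:open,closed'); | ||||
|     // Boolean attribute: the only valid value is the attribute's own name | ||||
|     $def->addAttribute('div', 'data-pinned', 'Bool#data-pinned'); | ||||
|     // Integer pixel length | ||||
|     $def->addAttribute('div', 'data-width', 'Pixels'); | ||||
|     // URI, the same type used for href | ||||
|     $def->addAttribute('a', 'data-source', 'URI'); | ||||
|     // Arbitrary text | ||||
|     $def->addAttribute('span', 'data-label', 'Text'); | ||||
| }</pre> | ||||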
|  | ||||
| <p> | ||||
|   Sometimes, the restrictive list in AttrTypes just doesn't cut it. Don't | ||||
|   sweat: you can also use a fully instantiated object as the value. The | ||||
|   equivalent, verbose form of the above example is: | ||||
| </p> | ||||
|  | ||||
| <pre>$config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| $config->set('HTML.DefinitionRev', 1); | ||||
| $config->set('Cache.DefinitionImpl', null); // remove this later! | ||||
| $def = $config->getHTMLDefinition(true); | ||||
| <strong>$def->addAttribute('a', 'target', new HTMLPurifier_AttrDef_Enum( | ||||
|   array('_blank','_self','_parent','_top') | ||||
| ));</strong></pre> | ||||
|  | ||||
| <p> | ||||
|   Trust me, you'll learn to love the shorthand. | ||||
| </p> | ||||
|  | ||||
| <h2>Add an element</h2> | ||||
|  | ||||
| <p> | ||||
|   Adding attributes is really small-fry stuff, though, and it was possible | ||||
|   to add them (albeit more verbosely) prior to 2.0. The real gem of | ||||
|   the Advanced API is adding elements. There are five questions to | ||||
|   ask when adding a new element: | ||||
| </p> | ||||
|  | ||||
| <ol> | ||||
|   <li>What is the element's name?</li> | ||||
|   <li>What content set does this element belong to?</li> | ||||
|   <li>What are the allowed children of this element?</li> | ||||
|   <li>What attributes does the element allow that are general?</li> | ||||
|   <li>What attributes does the element allow that are specific to this element?</li> | ||||
| </ol> | ||||
|  | ||||
| <p> | ||||
|   It's a mouthful, and you'll be slightly lost if you're not familiar with | ||||
|   the HTML specification, so let's explain them step by step. | ||||
| </p> | ||||
|  | ||||
| <h3>Content set</h3> | ||||
|  | ||||
| <p> | ||||
|   The HTML specification defines two major content sets: Inline | ||||
|   and Block.  Each of these | ||||
|   content sets contains a list of elements: Inline contains things like | ||||
|   <code>span</code> and <code>b</code> while Block contains things like | ||||
|   <code>div</code> and <code>blockquote</code>. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   These content sets amount to a macro mechanism for HTML definition. Most | ||||
|   elements in HTML are organized into one of these two sets, and most | ||||
|   elements in HTML allow elements from one of these sets.  If we had | ||||
|   to write each element verbatim into each other element's allowed | ||||
|   children, we would have ridiculously large lists; instead we use | ||||
|   content sets to compactify the declaration. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   Practically speaking, there are several useful values you can use here: | ||||
| </p> | ||||
|  | ||||
| <table class="table"> | ||||
|   <thead> | ||||
|     <tr> | ||||
|       <th>Content set</th> | ||||
|       <th>Description</th> | ||||
|     </tr> | ||||
|   </thead> | ||||
|   <tbody> | ||||
|     <tr> | ||||
|       <th>Inline</th> | ||||
|       <td>Character level elements, text</td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Block</th> | ||||
|       <td>Block-like elements, like paragraphs and lists</td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th><em>false</em></th> | ||||
|       <td> | ||||
|         Any element that doesn't fit into the mold, for example <code>li</code> | ||||
|         or <code>tr</code> | ||||
|       </td> | ||||
|     </tr> | ||||
|   </tbody> | ||||
| </table> | ||||
|  | ||||
| <p> | ||||
|   By specifying a valid value here, all other elements that use that | ||||
|   content set will also allow your element, without you having to do | ||||
|   anything. If you specify <em>false</em>, you'll have to register | ||||
|   your element manually. | ||||
| </p> | ||||
|  | ||||
| <h3>Allowed children</h3> | ||||
|  | ||||
| <p> | ||||
|   Allowed children defines the elements that this element can contain. | ||||
|   The allowed values may range from none to a complex regexp depending on | ||||
|   your element. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   If you've ever taken a look at the HTML DTDs before, you may have | ||||
|   noticed declarations like this: | ||||
| </p> | ||||
|  | ||||
| <pre><!ELEMENT LI - O (%flow;)*             -- list item --></pre> | ||||
|  | ||||
| <p> | ||||
|   The <code>(%flow;)*</code> indicates the allowed children of the | ||||
|   <code>li</code> tag: <code>li</code> allows any number of flow | ||||
|   elements as its children. (The <code>- O</code> allows the closing tag to be | ||||
|   omitted, though in XML this is not allowed.) In HTML Purifier, | ||||
|   we'd write it like <code>Flow</code> (here's where the content sets | ||||
|   we were discussing earlier come into play). There are three shorthand | ||||
|   content models you can specify: | ||||
| </p> | ||||
|  | ||||
| <table class="table"> | ||||
|   <thead> | ||||
|     <tr> | ||||
|       <th>Content model</th> | ||||
|       <th>Description</th> | ||||
|     </tr> | ||||
|   </thead> | ||||
|   <tbody> | ||||
|     <tr> | ||||
|       <th>Empty</th> | ||||
|       <td>No children allowed, like <code>br</code> or <code>hr</code></td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Inline</th> | ||||
|       <td>Any number of inline elements and text, like <code>span</code></td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Flow</th> | ||||
|       <td>Any number of inline elements, block elements and text, like <code>div</code></td> | ||||
|     </tr> | ||||
|   </tbody> | ||||
| </table> | ||||
|  | ||||
| <p> | ||||
|   This covers 90% of all the cases out there, but what about elements that | ||||
|   break the mold like <code>ul</code>? This guy requires at least one | ||||
|   child, and the only valid children for it are <code>li</code>. The | ||||
|   content model is: <code>Required: li</code>. There are two parts: the | ||||
|   first part, the type, determines what <code>ChildDef</code> will be used to | ||||
|   validate the element's children. The most common values are: | ||||
| </p> | ||||
|  | ||||
| <table class="table"> | ||||
|   <thead> | ||||
|     <tr> | ||||
|       <th>Type</th> | ||||
|       <th>Description</th> | ||||
|     </tr> | ||||
|   </thead> | ||||
|   <tbody> | ||||
|     <tr> | ||||
|       <th>Required</th> | ||||
|       <td>Children must be one or more of the valid elements</td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Optional</th> | ||||
|       <td>Children can be any number of the valid elements</td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Custom</th> | ||||
|       <td>Children must follow the DTD-style regex</td> | ||||
|     </tr> | ||||
|   </tbody> | ||||
| </table> | ||||
|  | ||||
| <p> | ||||
|   You can also implement your own <code>ChildDef</code>: this was done | ||||
|   for a few special cases in HTML Purifier such as <code>Chameleon</code> | ||||
|   (for <code>ins</code> and <code>del</code>), <code>StrictBlockquote</code> | ||||
|   and <code>Table</code>. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   The second part specifies either valid elements or a regular expression. | ||||
|   Valid elements are separated with horizontal bars (|), i.e. | ||||
|   "<code>a | b | c</code>".  Use #PCDATA to represent plain text. | ||||
|   Regular expressions follow DTD syntax: | ||||
| </p> | ||||
|  | ||||
| <ul> | ||||
|   <li>Parentheses () are used for grouping</li> | ||||
|   <li>Commas (,) separate elements that should come one after another</li> | ||||
|   <li>Horizontal bars (|) indicate one or the other elements should be used</li> | ||||
|   <li>Plus signs (+) are used for a one or more match</li> | ||||
|   <li>Asterisks (*) are used for a zero or more match</li> | ||||
|   <li>Question marks (?) are used for a zero or one match</li> | ||||
| </ul> | ||||
|  | ||||
| <p> | ||||
|   For example, "<code>a, b?, (c | d), e+, f*</code>" means "In this order, | ||||
|   one <code>a</code> element, at most one <code>b</code> element, | ||||
|   one <code>c</code> or <code>d</code> element (but not both), one or more | ||||
|   <code>e</code> elements, and any number of <code>f</code> elements." | ||||
|   Regex veterans should be able to jump right in, and those not so savvy | ||||
|   can always copy-paste W3C's content model definitions into HTML Purifier | ||||
|   and hope for the best. | ||||
| </p> | ||||
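|  | ||||
| <p> | ||||
|   To make the quantifiers concrete, here is a standalone sketch that checks | ||||
|   child sequences against the same model with an ordinary PCRE pattern. It | ||||
|   only illustrates the semantics: the function name is hypothetical, and we | ||||
|   are not claiming this is how HTML Purifier implements Custom internally. | ||||
| </p> | ||||
|  | ||||
| <pre>// Hypothetical helper: does a list of child element names satisfy | ||||
| // the model "a, b?, (c | d), e+, f*"? | ||||
| function matchesModel($children) | ||||
| { | ||||
|     // Follow every name with a comma so names cannot run together. | ||||
|     $sequence = implode(',', $children) . ','; | ||||
|     $pattern = '/^a,(?:b,)?(?:c,|d,)(?:e,)+(?:f,)*$/'; | ||||
|     return preg_match($pattern, $sequence) === 1; | ||||
| } | ||||
|  | ||||
| var_dump(matchesModel(array('a', 'c', 'e')));                // bool(true) | ||||
| var_dump(matchesModel(array('a', 'b', 'd', 'e', 'e', 'f'))); // bool(true) | ||||
| var_dump(matchesModel(array('a', 'c', 'd', 'e')));           // bool(false): both c and d | ||||
| var_dump(matchesModel(array('a', 'b')));                     // bool(false): no c, d or e</pre> | ||||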
|  | ||||
| <p> | ||||
|   A word of warning: while the regex format is extremely flexible on | ||||
|   the developer's side, it is | ||||
|   quite unforgiving on the user's side.  If the user input does not <em>exactly</em> | ||||
|   match the specification, the entire contents of the element will | ||||
|   be nuked.  This is why there are specific content model types like | ||||
|   Optional and Required: while they could be implemented as <code>Custom: | ||||
|   (valid | elements)*</code>, their dedicated classes contain special recovery | ||||
|   measures that make sure as much of the user's original content gets | ||||
|   through as possible. HTML Purifier's core, as a rule, does not use Custom. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   One final note: you can also use Content Sets inside your valid elements | ||||
|   lists or regular expressions. In fact, the three shorthand content models | ||||
|   mentioned above are just that, abbreviations: | ||||
| </p> | ||||
|  | ||||
| <table class="table"> | ||||
|   <thead> | ||||
|     <tr> | ||||
|       <th>Content model</th> | ||||
|       <th>Implementation</th> | ||||
|     </tr> | ||||
|   </thead> | ||||
|   <tbody> | ||||
|     <tr> | ||||
|       <th>Inline</th> | ||||
|       <td>Optional: Inline | #PCDATA</td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Flow</th> | ||||
|       <td>Optional: Flow | #PCDATA</td> | ||||
|     </tr> | ||||
|   </tbody> | ||||
| </table> | ||||
|  | ||||
| <p> | ||||
|   When the definition is compiled, Inline will be replaced with a | ||||
|   horizontal-bar separated list of inline elements. Also, notice that | ||||
|   it does not contain text: you have to specify that yourself. | ||||
| </p> | ||||
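|  | ||||
| <p> | ||||
|   Here is a sketch of what these content models look like in code, using the | ||||
|   <code>addElement()</code> call that is assembled at the end of this section | ||||
|   (arguments: name, content set, allowed children, attribute collection). | ||||
|   The element names <code>gallery</code>, <code>slide</code> and | ||||
|   <code>note</code> are invented for illustration, and the include path is | ||||
|   an assumption: | ||||
| </p> | ||||
|  | ||||
| <pre>// Assumption: HTML Purifier's library directory is on the include path. | ||||
| require_once 'HTMLPurifier.auto.php'; | ||||
|  | ||||
| $config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| $config->set('HTML.DefinitionRev', 1); | ||||
| $config->set('Cache.DefinitionImpl', null); // remove this later! | ||||
| if ($def = $config->maybeGetRawHTMLDefinition()) { | ||||
|     // slide: no content set (false), so only parents that name it allow it | ||||
|     $def->addElement('slide', false, 'Flow', 'Common'); | ||||
|     // gallery: block element that needs one or more slide children | ||||
|     $def->addElement('gallery', 'Block', 'Required: slide', 'Common'); | ||||
|     // note: the Inline shorthand written out in its expanded form | ||||
|     $def->addElement('note', 'Block', 'Optional: Inline | #PCDATA', 'Common'); | ||||
| }</pre> | ||||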
|  | ||||
| <h3>Common attributes</h3> | ||||
|  | ||||
| <p> | ||||
|   Congratulations: you have just gotten over the proverbial hump (Allowed | ||||
|   children). Common attributes is much simpler, and boils down to | ||||
|   one question: does your element have the <code>id</code>, <code>style</code>, | ||||
|   <code>class</code>, <code>title</code> and <code>lang</code> attributes? | ||||
|   If so, you'll want to specify the <code>Common</code> attribute collection, | ||||
|   which contains these five attributes that are found on almost every | ||||
|   HTML element in the specification. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|   There are a few more collections, but they're really edge cases: | ||||
| </p> | ||||
|  | ||||
| <table class="table"> | ||||
|   <thead> | ||||
|     <tr> | ||||
|       <th>Collection</th> | ||||
|       <th>Attributes</th> | ||||
|     </tr> | ||||
|   </thead> | ||||
|   <tbody> | ||||
|     <tr> | ||||
|       <th>I18N</th> | ||||
|       <td><code>lang</code>, possibly <code>xml:lang</code></td> | ||||
|     </tr> | ||||
|     <tr> | ||||
|       <th>Core</th> | ||||
|       <td><code>style</code>, <code>class</code>, <code>id</code> and <code>title</code></td> | ||||
|     </tr> | ||||
|   </tbody> | ||||
| </table> | ||||
|  | ||||
| <p> | ||||
|   Common is a combination of the above-mentioned collections. | ||||
| </p> | ||||
|  | ||||
| <p class="aside"> | ||||
|   Readers familiar with the modularization may have noticed that the Core | ||||
|   attribute collection differs from that specified by the <a | ||||
|   href="http://www.w3.org/TR/xhtml-modularization/abstract_modules.html#s_commonatts">abstract | ||||
|   modules of the XHTML Modularization 1.1</a>. We believe this section | ||||
|   to be in error, as <code>br</code> permits the use of the <code>style</code> | ||||
|   attribute even though it uses the <code>Core</code> collection, and | ||||
|   the DTD and XML Schemas supplied by W3C support our interpretation. | ||||
| </p> | ||||
|  | ||||
| <h3>Attributes</h3> | ||||
|  | ||||
| <p> | ||||
|   If you didn't read the <a href="#addAttribute">earlier section on | ||||
|   adding attributes</a>, read it now.  The last parameter is simply | ||||
|   an array of attribute names to attribute implementations, in the exact | ||||
|   same format as <code>addAttribute()</code>. | ||||
| </p> | ||||
|  | ||||
| <h3>Putting it all together</h3> | ||||
|  | ||||
| <p> | ||||
|   We're going to implement <code>form</code>. Before we embark, let's | ||||
|   grab a reference implementation from over at the | ||||
|   <a href="http://www.w3.org/TR/html4/sgml/loosedtd.html">transitional DTD</a>: | ||||
| </p> | ||||
|  | ||||
| <pre><!ELEMENT FORM - - (%flow;)* -(FORM)   -- interactive form --> | ||||
| <!ATTLIST FORM | ||||
|   %attrs;                              -- %coreattrs, %i18n, %events -- | ||||
|   action      %URI;          #REQUIRED -- server-side form handler -- | ||||
|   method      (GET|POST)     GET       -- HTTP method used to submit the form-- | ||||
|   enctype     %ContentType;  "application/x-www-form-urlencoded" | ||||
|   accept      %ContentTypes; #IMPLIED  -- list of MIME types for file upload -- | ||||
|   name        CDATA          #IMPLIED  -- name of form for scripting -- | ||||
|   onsubmit    %Script;       #IMPLIED  -- the form was submitted -- | ||||
|   onreset     %Script;       #IMPLIED  -- the form was reset -- | ||||
|   target      %FrameTarget;  #IMPLIED  -- render in this frame -- | ||||
|   accept-charset %Charsets;  #IMPLIED  -- list of supported charsets -- | ||||
|   ></pre> | ||||
|  | ||||
| <p> | ||||
|   Juicy! With just this, we can answer four of our five questions: | ||||
| </p> | ||||
|  | ||||
| <ol> | ||||
|   <li>What is the element's name? <strong>form</strong></li> | ||||
|   <li>What content set does this element belong to? <strong>Block</strong> | ||||
|     (this needs a little sleuthing, I find the easiest way is to search | ||||
|     the DTD for <code>FORM</code> and determine which set it is in.)</li> | ||||
|   <li>What are the allowed children of this element? <strong>Any | ||||
|     number of flow elements, but no nested <code>form</code>s</strong></li> | ||||
|   <li>What attributes does the element allow that are general? <strong>Common</strong></li> | ||||
|   <li>What attributes does the element allow that are specific to this element? <strong>A whole bunch, see ATTLIST; | ||||
|     we're going to do the vital ones: <code>action</code>, <code>method</code> and <code>name</code></strong></li> | ||||
| </ol> | ||||
|  | ||||
| <p> | ||||
|   Time for some code: | ||||
| </p> | ||||
|  | ||||
| <pre>$config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| $config->set('HTML.DefinitionRev', 1); | ||||
| $config->set('Cache.DefinitionImpl', null); // remove this later! | ||||
| $def = $config->getHTMLDefinition(true); | ||||
| $def->addAttribute('a', 'target', new HTMLPurifier_AttrDef_Enum( | ||||
|   array('_blank','_self','_parent','_top') | ||||
| )); | ||||
| <strong>$form = $def->addElement( | ||||
|   'form',   // name | ||||
|   'Block',  // content set | ||||
|   'Flow', // allowed children | ||||
|   'Common', // attribute collection | ||||
|   array( // attributes | ||||
|     'action*' => 'URI', | ||||
|     'method' => 'Enum#get,post', | ||||
|     'name' => 'ID' | ||||
|   ) | ||||
| ); | ||||
| $form->excludes = array('form' => true);</strong></pre> | ||||
|  | ||||
| <p> | ||||
|   Each of the parameters corresponds to one of the questions we asked. | ||||
|   Notice that we added an asterisk to the end of the <code>action</code> | ||||
|   attribute to indicate that it is required. If someone specifies a | ||||
|   <code>form</code> without that attribute, the tag will be axed. | ||||
|   Also, the extra line at the end sets <code>excludes</code>, which | ||||
|   prevents forms from being nested within each other. | ||||
| </p> | ||||
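|  | ||||
| <p> | ||||
|   To see the required-attribute rule in action, here is a minimal sketch. | ||||
|   It assumes HTML Purifier is on your include path, uses the uncached | ||||
|   <code>getHTMLDefinition(true)</code> form without a definition ID, and | ||||
|   trims the attribute list down to <code>action</code>; the sample markup | ||||
|   and the variable names are made up for illustration. | ||||
| </p> | ||||
|  | ||||
| <pre>require_once 'HTMLPurifier.auto.php'; // assumption: library is on the include path | ||||
|  | ||||
| $config = HTMLPurifier_Config::createDefault(); | ||||
| $def = $config->getHTMLDefinition(true); // uncached raw definition | ||||
| $form = $def->addElement( | ||||
|   'form',   // name | ||||
|   'Block',  // content set | ||||
|   'Flow',   // allowed children | ||||
|   'Common', // attribute collection | ||||
|   array('action*' => 'URI') // the asterisk marks action as required | ||||
| ); | ||||
| $form->excludes = array('form' => true); // no forms inside forms | ||||
| $purifier = new HTMLPurifier($config); | ||||
|  | ||||
| // Illustrative inputs: one form with the required attribute, one without. | ||||
| $kept    = $purifier->purify('&lt;form action="/search"&gt;&lt;p&gt;Query&lt;/p&gt;&lt;/form&gt;'); | ||||
| $dropped = $purifier->purify('&lt;form&gt;&lt;p&gt;Query&lt;/p&gt;&lt;/form&gt;');</pre> | ||||
|  | ||||
| <p> | ||||
|   If everything is wired up as described, <code>$kept</code> should still | ||||
|   contain the <code>form</code> tag, while <code>$dropped</code> should come | ||||
|   back with the tag removed and the paragraph left in place. | ||||
| </p> | ||||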
|  | ||||
| <p> | ||||
|   And that's all there is to it! Implementing the rest of the form | ||||
|   module is left as an exercise to the user; to see more examples | ||||
|   check the <a href="http://repo.or.cz/w/htmlpurifier.git?a=tree;hb=HEAD;f=library/HTMLPurifier/HTMLModule"><code>library/HTMLPurifier/HTMLModule/</code></a> directory | ||||
|   in your local HTML Purifier installation. | ||||
| </p> | ||||
|  | ||||
| <h2>And beyond...</h2> | ||||
|  | ||||
| <p> | ||||
|   Perceptive users may have realized that, to a certain extent, we | ||||
|   have simply re-implemented the facilities of XML Schema or the | ||||
|   Document Type Definition.  What you are seeing here, however, is | ||||
|   not just an XML Schema or Document Type Definition: it is a fully | ||||
|   expressive method of specifying the definition of HTML that is | ||||
|   a portable superset of the capabilities of the two above-mentioned schema | ||||
|   languages.  What makes HTMLDefinition so powerful is the fact that | ||||
|   if we don't have an implementation for a content model or an attribute | ||||
|   definition, you can supply it yourself by writing a PHP class. | ||||
| </p> | ||||
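|  | ||||
| <p> | ||||
|   As a rough illustration of supplying your own attribute definition, the | ||||
|   sketch below subclasses <code>HTMLPurifier_AttrDef</code> and hands an | ||||
|   instance to <code>addAttribute()</code>. The class name | ||||
|   <code>Tutorial_AttrDef_TabIndex</code>, the choice of the | ||||
|   <code>tabindex</code> attribute and its 0 to 32767 range are assumptions | ||||
|   made for this example, not something HTML Purifier ships with. | ||||
| </p> | ||||
|  | ||||
| <pre>require_once 'HTMLPurifier.auto.php'; // assumption: library is on the include path | ||||
|  | ||||
| // Hypothetical attribute definition: accepts whole numbers from 0 to 32767. | ||||
| class Tutorial_AttrDef_TabIndex extends HTMLPurifier_AttrDef | ||||
| { | ||||
|     public function validate($string, $config, $context) | ||||
|     { | ||||
|         $string = trim($string); | ||||
|         if (!ctype_digit($string)) { | ||||
|             return false; // not a plain non-negative integer: reject the value | ||||
|         } | ||||
|         $value = (int) $string; | ||||
|         return ($value &lt;= 32767) ? (string) $value : false; | ||||
|     } | ||||
| } | ||||
|  | ||||
| $config = HTMLPurifier_Config::createDefault(); | ||||
| $def = $config->getHTMLDefinition(true); // uncached, so no definition ID is needed | ||||
| $def->addAttribute('a', 'tabindex', new Tutorial_AttrDef_TabIndex()); | ||||
| $purifier = new HTMLPurifier($config); | ||||
|  | ||||
| // Illustrative inputs: a valid value and an invalid one. | ||||
| $good = $purifier->purify('&lt;a href="/next" tabindex="2"&gt;Next&lt;/a&gt;'); | ||||
| $bad  = $purifier->purify('&lt;a href="/next" tabindex="soon"&gt;Next&lt;/a&gt;');</pre> | ||||
|  | ||||
| <p> | ||||
|   Returning <code>false</code> from <code>validate()</code> tells HTML | ||||
|   Purifier to discard the attribute, so <code>$bad</code> should keep the | ||||
|   link but lose its <code>tabindex</code>; returning a string lets you | ||||
|   normalize the value on the way through. | ||||
| </p> | ||||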
|  | ||||
| <p> | ||||
|   There are many facets of HTMLDefinition beyond the Advanced API I have | ||||
|   walked you through today.  To find out more about these, you can | ||||
|   check out these source files: | ||||
| </p> | ||||
|  | ||||
| <ul> | ||||
|   <li><a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/HTMLModule.php"><code>library/HTMLPurifier/HTMLModule.php</code></a></li> | ||||
|   <li><a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ElementDef.php"><code>library/HTMLPurifier/ElementDef.php</code></a></li> | ||||
| </ul> | ||||
|  | ||||
| <h2 id="optimized">Notes for HTML Purifier 4.2.0 and earlier</h2> | ||||
|  | ||||
| <p> | ||||
|     Previously, this tutorial gave some incorrect template code for | ||||
|     editing raw definitions, and that template code will now produce the | ||||
|     error <q>Due to a documentation error in previous version of HTML | ||||
|     Purifier...</q>  Here is how to mechanically transform old-style | ||||
|     code into new-style code. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|     First, identify all code that edits the raw definition object, and | ||||
|     put it together.  Ensure none of this code must be run on every | ||||
|     request; if some sub-part needs to always be run, move it outside | ||||
|     this block.  Here is an example below, with the raw definition | ||||
|     object code bolded. | ||||
| </p> | ||||
|  | ||||
| <pre>$config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| $config->set('HTML.DefinitionRev', 1); | ||||
| $def = $config->getHTMLDefinition(true); | ||||
| <strong>$def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');</strong> | ||||
| $purifier = new HTMLPurifier($config);</pre> | ||||
|  | ||||
| <p> | ||||
|     Next, replace the raw definition retrieval with a | ||||
|     <code>maybeGetRawHTMLDefinition()</code> method call inside an if | ||||
|     conditional, and place the editing code inside that if block. | ||||
| </p> | ||||
|  | ||||
| <pre>$config = HTMLPurifier_Config::createDefault(); | ||||
| $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
| $config->set('HTML.DefinitionRev', 1); | ||||
| <strong>if ($def = $config->maybeGetRawHTMLDefinition()) { | ||||
|     $def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top'); | ||||
| }</strong> | ||||
| $purifier = new HTMLPurifier($config);</pre> | ||||
|  | ||||
| <p> | ||||
|     And you're done!  Alternatively, if you're OK with never caching | ||||
|     your definition, the following will still work and not emit warnings. | ||||
| </p> | ||||
|  | ||||
| <pre>$config = HTMLPurifier_Config::createDefault(); | ||||
| $def = $config->getHTMLDefinition(true); | ||||
| $def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top'); | ||||
| $purifier = new HTMLPurifier($config);</pre> | ||||
|  | ||||
| <p> | ||||
|     Old versions of HTML Purifier effectively did a slightly less | ||||
|     efficient version of this. | ||||
| </p> | ||||
|  | ||||
| <p> | ||||
|     <em>Technical notes:</em> ajh pointed out <a | ||||
|         href="http://htmlpurifier.org/phorum/read.php?5,5164,5169#msg-5169">in a forum topic</a> that | ||||
|     HTML Purifier appeared to be repeatedly writing to the cache even | ||||
|     when a cache entry already existed.  Investigation led to the | ||||
|     discovery of the following infelicity: caching of customized | ||||
|     definitions didn't actually work!  The problem was that even though | ||||
|     a cache file would be written out at the end of the process, there | ||||
|     was no way for HTML Purifier to say, <q>Actually, I've already got a | ||||
|         copy of your work, no need to reconfigure your | ||||
|         customizations</q>.  This required the API to change: placing | ||||
|     all of the customizations to the raw definition object in a | ||||
|     conditional which could be skipped. | ||||
| </p> | ||||
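|  | ||||
| <p> | ||||
|   The skippable conditional can be observed with a small experiment, | ||||
|   sketched below under a few assumptions: HTML Purifier 4.3.0 or later is | ||||
|   on your include path, the system temporary directory is writable, and the | ||||
|   helper <code>tutorial_configure()</code> is a hypothetical function | ||||
|   written for this example. It points <code>Cache.SerializerPath</code> at | ||||
|   a fresh scratch directory so that no stale cache entry interferes. | ||||
| </p> | ||||
|  | ||||
| <pre>require_once 'HTMLPurifier.auto.php'; // assumption: library is on the include path | ||||
|  | ||||
| // Hypothetical helper: returns the config plus a flag telling whether the | ||||
| // customization block actually ran (true) or was skipped (false). | ||||
| function tutorial_configure($cacheDir) | ||||
| { | ||||
|     $config = HTMLPurifier_Config::createDefault(); | ||||
|     $config->set('Cache.SerializerPath', $cacheDir); | ||||
|     $config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); | ||||
|     $config->set('HTML.DefinitionRev', 1); | ||||
|     $customized = false; | ||||
|     if ($def = $config->maybeGetRawHTMLDefinition()) { | ||||
|         // Only reached when no cached definition is available. | ||||
|         $def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top'); | ||||
|         $customized = true; | ||||
|     } | ||||
|     return array($config, $customized); | ||||
| } | ||||
|  | ||||
| // Fresh scratch directory for the serialized definition cache. | ||||
| $cacheDir = sys_get_temp_dir() . '/hp-tutorial-' . uniqid(); | ||||
| mkdir($cacheDir); | ||||
|  | ||||
| list($config, $first) = tutorial_configure($cacheDir); | ||||
| $purifier = new HTMLPurifier($config); | ||||
| $purifier->purify('&lt;a href="/x" target="_blank"&gt;x&lt;/a&gt;'); // first use finalizes the definition | ||||
|  | ||||
| list($config2, $second) = tutorial_configure($cacheDir);</pre> | ||||
|  | ||||
| <p> | ||||
|   On the first pass the block should run (<code>$first</code> is true) and | ||||
|   the finished definition should be written out once it is first used; on | ||||
|   the second pass <code>maybeGetRawHTMLDefinition()</code> should return | ||||
|   null, so the customizations are skipped (<code>$second</code> is false). | ||||
|   Bumping <code>HTML.DefinitionRev</code> is what forces the block to run | ||||
|   again after you change it. | ||||
| </p> | ||||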
|  | ||||
| </body></html> | ||||
|  | ||||
| <!-- vim: et sw=4 sts=4 | ||||
| --> | ||||