WP_HTML_Tag_Processor::escape_javascript_script_contents
Escape JavaScript and JSON script tag contents.
Ensure that the script contents cannot modify the HTML structure or break out of its containing SCRIPT element. JavaScript and JSON may both be escaped with the same rules, even though there are additional escaping measures available to JavaScript source code which aren’t applicable to serialized JSON data.
A simple method safely escapes all content except for a few extremely rare and unlikely exceptions: prevent the appearance of <script and </script within the contents by replacing the first letter of the tag name with a Unicode escape.
Example:
$plaintext = '<script>document.write( "A </script> closes a script." );</script>';
$escaped   = '<script>document.write( "A </\u0073cript> closes a script." );</script>';
This works because of how parsing changes after encountering an opening SCRIPT tag. The actual parsing comprises a complicated state machine, the result of legacy behaviors and diverse browser support. However, without these two strings in the script contents, two key things are ensured: </script> cannot appear to prematurely close the tag, and the problematic double-escaped state becomes unreachable. A JavaScript engine or JSON decoder will then decode the Unicode escape (\u0073) back into its original plaintext value, but only after having been safely extracted from the HTML.
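The decoding step can be checked in isolation. The following is a minimal sketch, assuming two hand-written strings that stand in for already-escaped script contents (they are illustrative values, not output captured from WordPress). It shows that neither string contains the dangerous sequence, and that both a JSON decoder and a JavaScript engine restore the original text.

```javascript
// Assumption: these strings stand in for script contents that were already escaped.
const escapedJson = '{"html":"A </\\u0073cript> tag"}';
const escapedJs = 'return "A </\\u0073cript> tag";';

// Helper for this sketch only: detects a literal closing-tag opener.
const containsCloser = (text) => /<\/script/i.test(text);

// A JSON decoder restores the original character while parsing the string.
const fromJson = JSON.parse(escapedJson).html;

// A JavaScript engine does the same when it evaluates the string literal.
const fromJs = new Function(escapedJs)();

console.log(containsCloser(escapedJson), containsCloser(escapedJs));
console.log(fromJson === 'A </script> tag', fromJs === 'A </script> tag');
```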
While it may seem tempting to replace the < character instead, doing so would break JavaScript syntax. The < character is used in comparison operators and other JavaScript syntax; replacing it would break valid JavaScript. Replacing only the s in <script and </script avoids modifying JavaScript syntax.
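A small sketch can make the difference concrete. It assumes a hypothetical helper, `compiles`, that reports whether a function body parses. A Unicode escape is valid inside an identifier, so `<\u0073cript` still parses as a comparison against a variable named `script`, while an escaped `<` is no longer an operator.

```javascript
// Hypothetical helper: reports whether the text parses as a function body
// that receives one parameter named "script".
function compiles(body) {
  try {
    new Function('script', body);
    return true;
  } catch (error) {
    return false;
  }
}

const original = 'return 1 <script ;';          // Comparison against a variable.
const sEscaped = 'return 1 <\\u0073cript ;';    // Only the "s" is escaped.
const ltEscaped = 'return 1 \\u003Cscript ;';   // The "<" itself is escaped.

console.log(compiles(original), compiles(sEscaped), compiles(ltEscaped));

// The "s"-escaped form still evaluates the same comparison.
const sameResult =
  new Function('script', original)(5) === new Function('script', sEscaped)(5);
console.log(sameResult);
```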
Exceptions
This _should_ work everywhere, but there are some extreme exceptions.
- Comments.
- Tagged templates, such as String.raw(), which provide access to “raw” strings.
- The source property of a RegExp object.
Each of these exceptions appears at the source code level, not at the semantic or evaluation level. Normal JavaScript will remain semantically equivalent after escaping, but any JavaScript which analyzes the raw source code will see potentially different values.
Comments
Comments are never unescaped because they aren’t parsed by the JavaScript engine. When viewing the source in a browser’s developer tools, the comments will retain their escaped text.
Example:
// A comment: "</script>"
…becomes…
// A comment: "</\u0073cript>"
Tagged templates.
Tagged templates “enable the embedding of arbitrary string content, where escape sequences may follow a different syntax.” For example, they can aid in representing a RegExp pattern or LaTeX snippet within a JavaScript string, where the string escape characters might get noisy and distracting.
Example:
console.log( 'A \notin B' ); // Prints a newline because of the "\n".
console.log( 'A \\notin B' ); // Prints "A \notin B".
console.log( String.raw`A \notin B` ); // Prints "A \notin B".
This means that if <script transforms into <\u0073cript _inside_ a raw string or tagged template literal which relies on its .raw property, the output of the code will be different after escaping.
Example:
console.log( String.raw`</script>` ); // Prematurely closes the SCRIPT element.
console.log( String.raw`</\u0073cript>` ); // Prints "</\u0073cript>".
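The distinction between the decoded and raw views of a template can be sketched with a small tag function. The name `inspect` is hypothetical and exists only for this illustration. Tag functions that read the decoded ("cooked") strings are unaffected by the escaping; only those that read the `.raw` property see the escape sequence.

```javascript
// Hypothetical tag function: returns both views of the first template chunk.
function inspect(strings) {
  return { cooked: strings[0], raw: strings.raw[0] };
}

// The template contents as they would appear after escaping.
const escapedTemplate = inspect`<\u0073cript>`;

// The cooked value has the escape decoded back to "s".
console.log(escapedTemplate.cooked === '<script>');

// The raw value keeps the backslash sequence exactly as written.
console.log(escapedTemplate.raw === '<\\u0073cript>');
```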
RegExp sources.
The RegExp object exposes its raw source in a similar way to how tagged templates and raw strings do. Thankfully, because escape sequences are decoded when compiling the pattern, escaped RegExp patterns will match the same way as the plaintext sequences would.
Example:
true === /<script>/.test( '<script>' );
true === /<\u0073cript>/.test( '<script>' );
However, as with raw strings, any code which reads the source will see the escaped value instead of the decoded one.
Example:
console.log( /<script>/.source ); // Prints "<script>".
console.log( /<\u0073cript>/.source ); // Prints "<\u0073cript>".
Unsupported escaping.
It is not possible to properly represent every possible JavaScript source file inside a SCRIPT element. As with CSS stylesheets, SVG images, and MathML, the only 100% reliable way to represent all possible inputs is to link to external files of the given content-type.
In some cases it’s possible to manually prevent escaping issues. These are not automatically handled by this function because doing so would require a full JavaScript tokenizer. Consider the following example listing various ways to manually escape a closing script tag.
Example:
console.log( String.raw`</script>` ); // !!UNSAFE!! Will be escaped.
console.log( String.raw`</\u0073cript>` ); // "</\u0073cript>"
console.log( String.raw`</scr` + String.raw`ipt>` ); // "</script>"
console.log( String.raw`</${"script"}>` ); // "</script>"
console.log( '</scr' + 'ipt>' ); // "</script>"
console.log( "\x3C/script>" ); // "</script>"
console.log( "<\/script>" ); // "</script>"
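The safe alternatives listed above can be checked mechanically. The sketch below is an assumption-laden illustration: the `dangerous` pattern is a hand-written approximation of the sequences this function escapes, and each alternative is held as source text and evaluated with `new Function`. It confirms that none of the sources contains a tag-like sequence while each still evaluates to the closing tag text.

```javascript
// Approximation (assumption) of the sequences that are dangerous in a SCRIPT element.
const dangerous = /<\/?script[ \t\f\r\n\/>]/i;

// Source text of each manual alternative from the listing.
const alternatives = [
  "String.raw`</scr` + String.raw`ipt>`",
  'String.raw`</${"script"}>`',
  "'</scr' + 'ipt>'",
  '"\\x3C/script>"',
  '"<\\/script>"',
];

const results = alternatives.map((source) => ({
  source,
  safeInHtml: !dangerous.test(source),            // No tag-like text in the source.
  value: new Function(`return ${source};`)(),     // What the expression evaluates to.
}));

for (const result of results) {
  console.log(result.source, result.safeInHtml, result.value);
}
```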
The following graph is a simplified interpretation of how HTML interprets the contents of a SCRIPT tag and identifies the closing tag. It is useful to understand what text is dangerous inside of a SCRIPT tag and why different approaches to escaping work.
                            Open script
                                 │
                                 ▼
            ╔═════════════════════════════════════════╗     <!--(…)>
            ║                                         ║   (all dashes)
            ║                 script                  ╟────────────────╮
            ║                  data                   ║                │
╭───────────╢                                         ║ ◀──────────────╯
│           ╚═╤═══════════════════════════════════════╝
│             │               ▲                    ▲
│             │ <!--          │ -->                ╰─────╮
│             ▼               │                          │
│           ┌─────────────────┴───────────────────────┐  │
│ </script¹ │                 escaped                 │  │
│           └─┬─────────────────────────────┬─────────┘  │
│             │               ▲             │            │ -->
│             │ </script¹     │ </script¹   │ <script¹   │
│             ▼               │             ▼            │
│           ╔══════════════╗  │           ┌───────────┐  │
│           ║ Close script ║  │           │  double   │  │
╰──────────▶║              ║  ╰───────────┤  escaped  ├──╯
            ╚══════════════╝              └───────────┘
¹ = Case insensitive 'script' followed by one of ' \t\f\r\n/>', known as “tag-name-terminating characters.” This sequence forms the start of what could be a SCRIPT opening or closing tag.
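The transitions in the graph can be sketched as a small scanner. This is a simplified illustration under stated assumptions, not the HTML tokenizer itself: the function name `scanScriptContents` and its return shape are hypothetical, and it models only the three states and the closing transition shown above.

```javascript
// Hypothetical scanner: walks the text and reports the final state, plus the
// index of the "<" that closes the SCRIPT element (or -1 if it never closes).
function scanScriptContents(text) {
  const tagLike = /^<(\/?)script[ \t\f\r\n\/>]/i;
  let state = 'script data';

  for (let i = 0; i < text.length; i++) {
    const rest = text.slice(i, i + 9); // "</script" plus one terminating character.

    if (state === 'script data' && rest.startsWith('<!--')) {
      state = 'escaped';
      i += 3; // Skip the rest of "<!--"; its dashes may still count toward "-->".
      continue;
    }

    if (state !== 'script data' && text[i] === '>' && text.slice(i - 2, i) === '--') {
      state = 'script data';
      continue;
    }

    const match = tagLike.exec(rest);
    if (!match) {
      continue;
    }

    const isCloser = match[1] === '/';
    if (isCloser && state !== 'double escaped') {
      return { state: 'closed', closedAt: i };
    }
    if (isCloser) {
      state = 'escaped'; // A closer inside "double escaped" only steps back.
    } else if (state === 'escaped') {
      state = 'double escaped';
    }
  }

  return { state, closedAt: -1 };
}

// Unescaped contents: the intended closing tag fails to close the element.
const unsafeContents = 'x = "<!-- <script>";';
console.log(scanScriptContents(unsafeContents + '</script>'));

// Escaped contents: the closing tag closes the element where intended.
const safeContents = 'x = "<!-- <\\u0073cript>";';
console.log(scanScriptContents(safeContents + '</script>'));
```

In the first call the opening `<script>` inside the comment-like text moves the scanner into the double-escaped state, so the real closing tag only returns it to the escaped state. Escaping the "s" keeps that state unreachable.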
Method of the class: WP_HTML_Tag_Processor{}
No hooks.
Returns
String. Escaped form of input contents which will not lead to premature closing of the containing SCRIPT element.
Usage
$result = WP_HTML_Tag_Processor::escape_javascript_script_contents( $sourcecode ): string;
- $sourcecode(string) (required)
- Raw contents intended to be serialized into an HTML SCRIPT element.
Notes
- See: https://html.spec.whatwg.org/#restrictions-for-contents-of-script-elements
- See: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals#specifications
- See: wp_html_api_script_element_escaping_diagram_source()
Changelog
| Since 7.0.0 | Introduced. |
Code of WP_HTML_Tag_Processor::escape_javascript_script_contents() WP 7.0
private static function escape_javascript_script_contents( string $sourcecode ): string {
    $at      = 0;
    $was_at  = 0;
    $end     = strlen( $sourcecode );
    $escaped = '';

    /*
     * Replace all instances of the ASCII case-insensitive match of "<script"
     * and "</script", when followed by whitespace or "/" or ">", by using a
     * character replacement for the "s" (or the "S").
     */
    while ( $at < $end ) {
        $tag_at = strpos( $sourcecode, '<', $at );
        if ( false === $tag_at ) {
            break;
        }

        $tag_name_at       = $tag_at + 1;
        $has_closing_slash = $tag_name_at < $end && '/' === $sourcecode[ $tag_name_at ];
        $tag_name_at      += $has_closing_slash ? 1 : 0;

        if ( 0 !== substr_compare( $sourcecode, 'script', $tag_name_at, 6, true ) ) {
            $at = $tag_at + 1;
            continue;
        }

        if ( 1 !== strspn( $sourcecode, " \t\f\r\n/>", $tag_name_at + 6, 1 ) ) {
            $at = $tag_name_at + 6;
            continue;
        }

        $escaped .= substr( $sourcecode, $was_at, $tag_name_at - $was_at );
        $escaped .= 's' === $sourcecode[ $tag_name_at ] ? '\u0073' : '\u0053';
        $was_at   = $tag_name_at + 1;
        $at       = $tag_name_at + 7;
    }

    if ( '' === $escaped ) {
        return $sourcecode;
    }

    if ( $was_at < $end ) {
        $escaped .= substr( $sourcecode, $was_at );
    }

    return $escaped;
}
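For readers who want to experiment with the behavior outside PHP, the same replacement rule can be approximated in JavaScript. This is a sketch, not a port of the method above: it swaps the manual scanning loop for a single regular expression, and the name `escapeScriptContents` is hypothetical.

```javascript
// Hypothetical JavaScript approximation of the replacement rule: escape the
// "s" or "S" of "<script" and "</script" when a tag-name-terminating
// character follows. Uses a regular expression instead of a scanning loop.
function escapeScriptContents(sourcecode) {
  return sourcecode.replace(
    /<(\/?)(s)(?=cript[ \t\f\r\n\/>])/gi,
    (match, slash, s) => `<${slash}${s === 's' ? '\\u0073' : '\\u0053'}`
  );
}

const contents = 'document.write( "A </script> closes a script." );';
console.log(escapeScriptContents(contents));

// Serialized JSON survives a round trip through the escaping.
const json = JSON.stringify({ html: '</script>' });
const roundTripped = JSON.parse(escapeScriptContents(json)).html;
console.log(roundTripped === '</script>');
```

Text such as `<scripts>` or `a < script` is left unchanged because no tag-name-terminating character follows a `<script` or `</script` sequence.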