Merge lp:~gz/bzr/remove_monkey_patched_elementtree_escaping_614522 into lp:bzr

Proposed by Martin Packman
Status: Merged
Approved by: John A Meinel
Approved revision: no longer in the source branch.
Merged at revision: 5407
Proposed branch: lp:~gz/bzr/remove_monkey_patched_elementtree_escaping_614522
Merge into: lp:bzr
Diff against target: 85 lines (+2/-67)
1 file modified
bzrlib/xml_serializer.py (+2/-67)
To merge this branch: bzr merge lp:~gz/bzr/remove_monkey_patched_elementtree_escaping_614522
Reviewer Review Type Date Requested Status
John A Meinel Approve
Review via email: mp+34028@code.launchpad.net

Commit message

Remove monkey patching for elementtree, we do it differently anyway

Description of the change

The monkey patching of some ElementTree escaping functions for performance purposes mangles the output with newer versions of the library.

As this kind of thing is a bad idea and it's unclear to me what improvement this provides, this branch just removes all of that.

I expect people using non-ascii characters with this code on Python 2.7 currently risk corrupting their branch or something.

To post a comment you must log in.
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Do we still use XML anywhere other than pre-2a formats?

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

(If the optimization is only relevant for pre-2a formats then it would be an easier choice to get rid of an optimization).

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 8/29/2010 1:51 PM, Jelmer Vernooij wrote:
> (If the optimization is only relevant for pre-2a formats then it would be an easier choice to get rid of an optimization).

we only use xml for bundles in 2a. So not everyday, but it does get used
from time to time.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx6ueAACgkQJdeBCYSNAAN6qgCgqm1wOHDEK9lMcbGyVYcvbK+O
hoIAmwfDioXiaVeO4di3Ia2xHHyiJfrB
=uqPs
-----END PGP SIGNATURE-----

Revision history for this message
Martin Packman (gz) wrote :

Okay, I'm confused now. Every microbenchmark I could cook up the bzrlib version of the _escape_cdata is actually slower than the original. So, I tried profiling bundle, and sure enough there's an xml escaping function high up in the list, but it's not the etree one:

         502 0 205.1471 16.3189 <C:\Python24\Lib\site-packages\bzrlib\xml8.py>:217(write_inventory)
    +2610268 0 13.1339 8.5832 +<C:\Python24\Lib\site-packages\bzrlib\xml8.py>:94(_encode_and_escape)

So, what command can I run instead to measure how much ripping this code out hurts performance?

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 8/30/2010 7:44 AM, Martin [gz] wrote:
> Okay, I'm confused now. Every microbenchmark I could cook up the bzrlib version of the _escape_cdata is actually slower than the original. So, I tried profiling bundle, and sure enough there's an xml escaping function high up in the list, but it's not the etree one:
>
> 502 0 205.1471 16.3189 <C:\Python24\Lib\site-packages\bzrlib\xml8.py>:217(write_inventory)
> +2610268 0 13.1339 8.5832 +<C:\Python24\Lib\site-packages\bzrlib\xml8.py>:94(_encode_and_escape)
>
> So, what command can I run instead to measure how much ripping this code out hurts performance?

I would guess a major factor would be "which version of Elementtree" :)
since it isn't bundled with earlier pythons.

As near as I can tell, the main change is to switch:

            text = replace(text, "&", "&amp;")
            text = replace(text, "'", "&apos;") # FIXME: overkill
            text = replace(text, "\"", "&quot;")
            text = replace(text, "<", "&lt;")
            text = replace(text, ">", "&gt;")

to using:
escape_re = re.compile("[&'\"<>]")
escape_map = {
    "&":'&amp;',
    "'":"&apos;", # FIXME: overkill
    "\"":"&quot;",
    "<":"&lt;",
    ">":"&gt;",
    }
def _escape_replace(match, map=escape_map):
    return map[match.group()]
...
  text = escape_re.sub(_escape_replace, text)

As such, I think a valid benchmark would be:

 a) grab a large inventory content from pre-2a format (1.9-rich-root,
    for example). This can be a single revision
 b) Time the different between a single re.sub() versus 5 calls to
    'string.replace'.

Anyway, as mentioned, this isn't a large perf issue for current formats,
so we probably can just revert it.

And your profiling shows... we worked around the ElementTree code
entirely in later revisions, so again, it is likely to not degrade
performance by applying your patch.

 merge: approve

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx747MACgkQJdeBCYSNAAPIwwCgykHabjXO7EuELNwHqurUY+Pc
vjUAoJ/5kqD10G1hjvTC0SRz418Kpi1I
=Sm4E
-----END PGP SIGNATURE-----

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 8/30/2010 7:44 AM, Martin [gz] wrote:
> Okay, I'm confused now. Every microbenchmark I could cook up the bzrlib version of the _escape_cdata is actually slower than the original. So, I tried profiling bundle, and sure enough there's an xml escaping function high up in the list, but it's not the etree one:
>
> 502 0 205.1471 16.3189 <C:\Python24\Lib\site-packages\bzrlib\xml8.py>:217(write_inventory)
> +2610268 0 13.1339 8.5832 +<C:\Python24\Lib\site-packages\bzrlib\xml8.py>:94(_encode_and_escape)
>
> So, what command can I run instead to measure how much ripping this code out hurts performance?

Note also that when we performance tuned it, we may have been running
under lsprof, which would also penalize the extra function calls more
than real runtime would (IME).

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx7/6EACgkQJdeBCYSNAAPrBQCgjwcTXwZnqbOdgs2iCB10qY8A
JnYAnjIFNiwViT9sdOAxodeZcVpV9TbC
=gdz3
-----END PGP SIGNATURE-----

Revision history for this message
John A Meinel (jameinel) wrote :

sent to pqm by email

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'bzrlib/xml_serializer.py'
--- bzrlib/xml_serializer.py 2010-05-25 17:27:52 +0000
+++ bzrlib/xml_serializer.py 2010-08-29 18:47:43 +0000
@@ -22,6 +22,8 @@
22# importing this module is fairly slow because it has to load several22# importing this module is fairly slow because it has to load several
23# ElementTree bits23# ElementTree bits
2424
25import re
26
25from bzrlib.serializer import Serializer27from bzrlib.serializer import Serializer
26from bzrlib.trace import mutter28from bzrlib.trace import mutter
2729
@@ -111,73 +113,6 @@
111 return ElementTree().parse(f)113 return ElementTree().parse(f)
112114
113115
114# performance tuning for elementree's serialiser. This should be
115# sent upstream - RBC 20060523.
116# the functions here are patched into elementtree at runtime.
117import re
118escape_re = re.compile("[&'\"<>]")
119escape_map = {
120 "&":'&amp;',
121 "'":"&apos;", # FIXME: overkill
122 "\"":"&quot;",
123 "<":"&lt;",
124 ">":"&gt;",
125 }
126def _escape_replace(match, map=escape_map):
127 return map[match.group()]
128
129def _escape_attrib(text, encoding=None, replace=None):
130 # escape attribute value
131 try:
132 if encoding:
133 try:
134 text = elementtree.ElementTree._encode(text, encoding)
135 except UnicodeError:
136 return elementtree.ElementTree._encode_entity(text)
137 if replace is None:
138 return escape_re.sub(_escape_replace, text)
139 else:
140 text = replace(text, "&", "&amp;")
141 text = replace(text, "'", "&apos;") # FIXME: overkill
142 text = replace(text, "\"", "&quot;")
143 text = replace(text, "<", "&lt;")
144 text = replace(text, ">", "&gt;")
145 return text
146 except (TypeError, AttributeError):
147 elementtree.ElementTree._raise_serialization_error(text)
148
149elementtree.ElementTree._escape_attrib = _escape_attrib
150
151escape_cdata_re = re.compile("[&<>]")
152escape_cdata_map = {
153 "&":'&amp;',
154 "<":"&lt;",
155 ">":"&gt;",
156 }
157def _escape_cdata_replace(match, map=escape_cdata_map):
158 return map[match.group()]
159
160def _escape_cdata(text, encoding=None, replace=None):
161 # escape character data
162 try:
163 if encoding:
164 try:
165 text = elementtree.ElementTree._encode(text, encoding)
166 except UnicodeError:
167 return elementtree.ElementTree._encode_entity(text)
168 if replace is None:
169 return escape_cdata_re.sub(_escape_cdata_replace, text)
170 else:
171 text = replace(text, "&", "&amp;")
172 text = replace(text, "<", "&lt;")
173 text = replace(text, ">", "&gt;")
174 return text
175 except (TypeError, AttributeError):
176 elementtree.ElementTree._raise_serialization_error(text)
177
178elementtree.ElementTree._escape_cdata = _escape_cdata
179
180
181def escape_invalid_chars(message):116def escape_invalid_chars(message):
182 """Escape the XML-invalid characters in a commit message.117 """Escape the XML-invalid characters in a commit message.
183118