Merge lp:~gz/bzr/remove_monkey_patched_elementtree_escaping_614522 into lp:bzr

Proposed by Martin Packman
Status: Merged
Approved by: John A Meinel
Approved revision: no longer in the source branch.
Merged at revision: 5407
Proposed branch: lp:~gz/bzr/remove_monkey_patched_elementtree_escaping_614522
Merge into: lp:bzr
Diff against target: 85 lines (+2/-67)
1 file modified
bzrlib/xml_serializer.py (+2/-67)
To merge this branch: bzr merge lp:~gz/bzr/remove_monkey_patched_elementtree_escaping_614522
Reviewer Review Type Date Requested Status
John A Meinel Approve
Review via email: mp+34028@code.launchpad.net

Commit message

Remove monkey patching for elementtree, we do it differently anyway

Description of the change

The monkey patching of some ElementTree escaping functions for performance purposes mangles the output with newer versions of the library.

As this kind of thing is a bad idea and it's unclear to me what improvement this provides, this branch just removes all of that.

I expect people using non-ascii characters with this code on Python 2.7 currently risk corrupting their branch or something.

To post a comment you must log in.
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Do we still use XML anywhere other than pre-2a formats?

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

(If the optimization is only relevant for pre-2a formats then it would be an easier choice to get rid of an optimization).

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 8/29/2010 1:51 PM, Jelmer Vernooij wrote:
> (If the optimization is only relevant for pre-2a formats then it would be an easier choice to get rid of an optimization).

we only use xml for bundles in 2a. So not everyday, but it does get used
from time to time.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx6ueAACgkQJdeBCYSNAAN6qgCgqm1wOHDEK9lMcbGyVYcvbK+O
hoIAmwfDioXiaVeO4di3Ia2xHHyiJfrB
=uqPs
-----END PGP SIGNATURE-----

Revision history for this message
Martin Packman (gz) wrote :

Okay, I'm confused now. Every microbenchmark I could cook up the bzrlib version of the _escape_cdata is actually slower than the original. So, I tried profiling bundle, and sure enough there's an xml escaping function high up in the list, but it's not the etree one:

         502 0 205.1471 16.3189 <C:\Python24\Lib\site-packages\bzrlib\xml8.py>:217(write_inventory)
    +2610268 0 13.1339 8.5832 +<C:\Python24\Lib\site-packages\bzrlib\xml8.py>:94(_encode_and_escape)

So, what command can I run instead to measure how much ripping this code out hurts performance?

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 8/30/2010 7:44 AM, Martin [gz] wrote:
> Okay, I'm confused now. Every microbenchmark I could cook up the bzrlib version of the _escape_cdata is actually slower than the original. So, I tried profiling bundle, and sure enough there's an xml escaping function high up in the list, but it's not the etree one:
>
> 502 0 205.1471 16.3189 <C:\Python24\Lib\site-packages\bzrlib\xml8.py>:217(write_inventory)
> +2610268 0 13.1339 8.5832 +<C:\Python24\Lib\site-packages\bzrlib\xml8.py>:94(_encode_and_escape)
>
> So, what command can I run instead to measure how much ripping this code out hurts performance?

I would guess a major factor would be "which version of Elementtree" :)
since it isn't bundled with earlier pythons.

As near as I can tell, the main change is to switch:

            text = replace(text, "&", "&amp;")
            text = replace(text, "'", "&apos;") # FIXME: overkill
            text = replace(text, "\"", "&quot;")
            text = replace(text, "<", "&lt;")
            text = replace(text, ">", "&gt;")

to using:
escape_re = re.compile("[&'\"<>]")
escape_map = {
    "&":'&amp;',
    "'":"&apos;", # FIXME: overkill
    "\"":"&quot;",
    "<":"&lt;",
    ">":"&gt;",
    }
def _escape_replace(match, map=escape_map):
    return map[match.group()]
...
  text = escape_re.sub(_escape_replace, text)

As such, I think a valid benchmark would be:

 a) grab a large inventory content from pre-2a format (1.9-rich-root,
    for example). This can be a single revision
 b) Time the different between a single re.sub() versus 5 calls to
    'string.replace'.

Anyway, as mentioned, this isn't a large perf issue for current formats,
so we probably can just revert it.

And your profiling shows... we worked around the ElementTree code
entirely in later revisions, so again, it is likely to not degrade
performance by applying your patch.

 merge: approve

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx747MACgkQJdeBCYSNAAPIwwCgykHabjXO7EuELNwHqurUY+Pc
vjUAoJ/5kqD10G1hjvTC0SRz418Kpi1I
=Sm4E
-----END PGP SIGNATURE-----

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 8/30/2010 7:44 AM, Martin [gz] wrote:
> Okay, I'm confused now. Every microbenchmark I could cook up the bzrlib version of the _escape_cdata is actually slower than the original. So, I tried profiling bundle, and sure enough there's an xml escaping function high up in the list, but it's not the etree one:
>
> 502 0 205.1471 16.3189 <C:\Python24\Lib\site-packages\bzrlib\xml8.py>:217(write_inventory)
> +2610268 0 13.1339 8.5832 +<C:\Python24\Lib\site-packages\bzrlib\xml8.py>:94(_encode_and_escape)
>
> So, what command can I run instead to measure how much ripping this code out hurts performance?

Note also that when we performance tuned it, we may have been running
under lsprof, which would also penalize the extra function calls more
than real runtime would (IME).

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx7/6EACgkQJdeBCYSNAAPrBQCgjwcTXwZnqbOdgs2iCB10qY8A
JnYAnjIFNiwViT9sdOAxodeZcVpV9TbC
=gdz3
-----END PGP SIGNATURE-----

Revision history for this message
John A Meinel (jameinel) wrote :

sent to pqm by email

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrlib/xml_serializer.py'
2--- bzrlib/xml_serializer.py 2010-05-25 17:27:52 +0000
3+++ bzrlib/xml_serializer.py 2010-08-29 18:47:43 +0000
4@@ -22,6 +22,8 @@
5 # importing this module is fairly slow because it has to load several
6 # ElementTree bits
7
8+import re
9+
10 from bzrlib.serializer import Serializer
11 from bzrlib.trace import mutter
12
13@@ -111,73 +113,6 @@
14 return ElementTree().parse(f)
15
16
17-# performance tuning for elementree's serialiser. This should be
18-# sent upstream - RBC 20060523.
19-# the functions here are patched into elementtree at runtime.
20-import re
21-escape_re = re.compile("[&'\"<>]")
22-escape_map = {
23- "&":'&amp;',
24- "'":"&apos;", # FIXME: overkill
25- "\"":"&quot;",
26- "<":"&lt;",
27- ">":"&gt;",
28- }
29-def _escape_replace(match, map=escape_map):
30- return map[match.group()]
31-
32-def _escape_attrib(text, encoding=None, replace=None):
33- # escape attribute value
34- try:
35- if encoding:
36- try:
37- text = elementtree.ElementTree._encode(text, encoding)
38- except UnicodeError:
39- return elementtree.ElementTree._encode_entity(text)
40- if replace is None:
41- return escape_re.sub(_escape_replace, text)
42- else:
43- text = replace(text, "&", "&amp;")
44- text = replace(text, "'", "&apos;") # FIXME: overkill
45- text = replace(text, "\"", "&quot;")
46- text = replace(text, "<", "&lt;")
47- text = replace(text, ">", "&gt;")
48- return text
49- except (TypeError, AttributeError):
50- elementtree.ElementTree._raise_serialization_error(text)
51-
52-elementtree.ElementTree._escape_attrib = _escape_attrib
53-
54-escape_cdata_re = re.compile("[&<>]")
55-escape_cdata_map = {
56- "&":'&amp;',
57- "<":"&lt;",
58- ">":"&gt;",
59- }
60-def _escape_cdata_replace(match, map=escape_cdata_map):
61- return map[match.group()]
62-
63-def _escape_cdata(text, encoding=None, replace=None):
64- # escape character data
65- try:
66- if encoding:
67- try:
68- text = elementtree.ElementTree._encode(text, encoding)
69- except UnicodeError:
70- return elementtree.ElementTree._encode_entity(text)
71- if replace is None:
72- return escape_cdata_re.sub(_escape_cdata_replace, text)
73- else:
74- text = replace(text, "&", "&amp;")
75- text = replace(text, "<", "&lt;")
76- text = replace(text, ">", "&gt;")
77- return text
78- except (TypeError, AttributeError):
79- elementtree.ElementTree._raise_serialization_error(text)
80-
81-elementtree.ElementTree._escape_cdata = _escape_cdata
82-
83-
84 def escape_invalid_chars(message):
85 """Escape the XML-invalid characters in a commit message.
86