Ticket #12 (closed enhancement: fixed)
PyYAML is slow
| Reported by: | edemaine@… | Owned by: | xi |
|---|---|---|---|
| Priority: | normal | Component: | pyyaml |
| Severity: | normal | Keywords: | |
| Cc: |
Description
Here are two simple wall-clock timings comparing PyYAML to PySyck on a Pentium 4 2.8GHz with 1MB cache and 1GB RAM:
    $ wc file1.yaml
    2036 8767 59154 file1
    $ test.py file1.yaml
    0:00:00.001419 to read the YAML via Syck
    0:00:04.029627 to read the YAML via PyYAML
    $ wc file2.yaml
    8949 35105 317342 file2
    $ test.py file2.yaml
    0:00:00.001564 to read the YAML via Syck
    0:00:19.288912 to read the YAML via PyYAML
I do not expect PyYAML to be terribly competitive with Syck: the language barrier is big, and PyYAML is written with a higher level of abstraction. But I was surprised to see a factor of 12,000 difference. I wonder if a bit of profiling and tuning might reduce this gap to just a couple of orders of magnitude (100x) instead of four? Personally, 19 seconds to read a 0.3 meg file is too slow for my application, so I'll have to switch back to Syck for now, unfortunately. Just food for thought...
Attachments
Change History
comment:2 Changed 7 years ago by edemaine@…
OK, here is a sample file on the larger size (8961 lines, 301,229 bytes), and a simple driver script generating output similar to the last example above.
Changed 7 years ago by edemaine@…
- attachment CSAIL.yaml added
A large YAML file (slightly culled to fit on Trac)
comment:3 Changed 7 years ago by xi
Sorry for the trac spam :(. I'll try to deal with it somehow.
On the bright side, I've started the LibYAML project, which will eventually allow this bug to be closed. :)
comment:4 Changed 7 years ago by xi
- Status changed from assigned to closed
- Resolution set to fixed
The libyaml bindings are now usable (though not as fast as possible).
comment:5 Changed 7 years ago by edemaine@…
I finally got to try the LibYAML bindings of PyYAML. In case you're curious, here is a repeat of the simple test from before. The improvement so far is about a factor of 10 (without Psyco), but it is still about 3 orders of magnitude away from Syck speed.
    $ python test.py CSAIL.ycard
    0:00:00.001437 to read the YAML via Syck
    0:00:13.661756 to read the YAML via PyYAML
    0:00:01.181506 to read the YAML via PyYAML/LibYAML
comment:6 Changed 7 years ago by xi
There is a problem in your test code in the line:
cards = syck.load_documents (open (sys.argv[1]))
The function load_documents is a generator, so it does not really load the documents. You should replace it with
for card in syck.load_documents (open (sys.argv[1])): pass
Please post the updated benchmarks :) PyYAML/LibYAML is 2-3 times slower than PySyck, probably because of Pyrex and PyYAML code overhead. I'm going to reduce overhead by replacing all Pyrex and some Python code with pure C.
You may also run
yaml.CLoader (open (sys.argv[1])).raw_parse()
to check pure LibYAML performance.
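The generator pitfall described in this comment can be reproduced without Syck. The following is only a sketch: `load_documents` here is a hypothetical stand-in that sleeps to simulate parsing cost, and `timed` is an illustrative helper, not part of the original test.py. It shows that merely calling a generator function takes almost no time, because no work happens until the result is iterated.

```python
import datetime
import time


def load_documents(n):
    """Hypothetical stand-in for a lazy loader: yields n fake documents."""
    for i in range(n):
        time.sleep(0.01)  # simulate the cost of parsing one document
        yield {"id": i}


def timed(make_iterable, consume):
    """Time creating the iterable and, if requested, iterating over all of it."""
    start = datetime.datetime.now()
    result = make_iterable()
    count = 0
    if consume:
        for _ in result:
            count += 1
    return datetime.datetime.now() - start, count


# Only creates the generator object; the loop body never runs.
lazy_elapsed, lazy_count = timed(lambda: load_documents(5), consume=False)
# Actually pulls every document out of the generator.
full_elapsed, full_count = timed(lambda: load_documents(5), consume=True)

print(lazy_elapsed, "without iterating")
print(full_elapsed, "after iterating over all documents")
```

The first timing measures only the creation of the generator object, which is why a benchmark written that way can make a lazy loader look far faster than it is.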
comment:7 Changed 7 years ago by edemaine@…
Whoops, you are right! Sorry about that. Now they are within a factor of 2 as you state (I am actually using PySyck):
    $ python test.py CSAIL.ycard
    0:00:00.643884 to read the YAML via Syck
    0:00:13.676710 to read the YAML via PyYAML
    0:00:01.201301 to read the YAML via PyYAML/LibYAML
Nice work! Looking forward to even more optimizations.
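For anyone wanting to repeat the PyYAML versus PyYAML/LibYAML part of this comparison, here is a rough sketch of a timing driver. It is not the original test.py: the names `time_load` and `compare` are hypothetical, the `datetime`-based timing is an assumption based on the output format above, and it uses `yaml.SafeLoader` and `yaml.CSafeLoader` (the latter exists only when PyYAML was built with the LibYAML bindings). Each loader's output is fully iterated so that lazy loading does not skew the result.

```python
import datetime

try:
    import yaml
except ImportError:  # PyYAML is not installed
    yaml = None


def time_load(text, loader):
    """Return the wall-clock time needed to load every document in text."""
    start = datetime.datetime.now()
    # load_all is lazy, so iterate to force the actual parsing.
    for _ in yaml.load_all(text, Loader=loader):
        pass
    return datetime.datetime.now() - start


def compare(text):
    """Time the pure-Python loader and, if available, the LibYAML-based one."""
    results = {}
    if yaml is None:
        return results
    results["PyYAML"] = time_load(text, yaml.SafeLoader)
    c_loader = getattr(yaml, "CSafeLoader", None)  # absent without LibYAML
    if c_loader is not None:
        results["PyYAML/LibYAML"] = time_load(text, c_loader)
    return results


if __name__ == "__main__":
    sample = "name: a\n---\nname: b\n"
    for label, elapsed in compare(sample).items():
        print(elapsed, "to read the YAML via", label)
```

On a real benchmark, `sample` would be replaced by the contents of the file under test.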

This is expected for C vs. Python, but I'm also surprised by the size of the difference. I usually get about a 200x difference on simple tests. You may attach your files and the script so I can check them.
You may try to use psyco; it might give you a speedup of about 1.5-5.0x.
The real solution is, of course, to rewrite the code in C. It's planned, but don't expect it too soon.
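The Psyco suggestion above is usually applied with a couple of lines at the top of the script. The sketch below is one way to do it, not necessarily the snippet originally posted here; `enable_psyco` is a hypothetical helper name. Psyco only supports old Python 2 interpreters, so the import is guarded and the script simply runs unaccelerated when Psyco is missing.

```python
def enable_psyco():
    """Try to switch on Psyco; return True if active, False if unavailable."""
    try:
        import psyco  # Python 2 only; absent on modern interpreters
    except ImportError:
        return False
    psyco.full()  # ask Psyco to compile all functions it can
    return True


if __name__ == "__main__":
    if enable_psyco():
        print("Psyco enabled")
    else:
        print("Psyco not available, running without it")
```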