Modeling problem in the tokenizer
I struggled a lot with a problem of modeling in the HTML tokenizer for Floki.
From one side I had performance concerns, which made me start with a more primitive approach. From another side I wish a better modeling, which would make much easier to work inside the tokenizer and inside the tree construction phase.
My worries about performance led to a poor design. I understand that now, and it’s a relief to realize it in the middle of the work. But the time I spent trying to make it faster for the first version was costly.
Tuples x Structs 🤓
The problem I was facing was: tuples are much faster than structs/maps,
but they are not easy to work with when you need to update them. You may think: “but he could use pattern
matching”. Yeah, you are right. But this is cucumbersome for tuples with more than 3 elements.
##### With input Big #####
Name ips average deviation median 99th %
update name in tuple 10.71 M 0.0934 μs ±3077.55% 0.0400 μs 0.130 μs
update name in struct 6.90 M 0.145 μs ±2035.90% 0.0700 μs 0.170 μs
Comparison:
update name in tuple 10.71 M
update name in struct 6.90 M - 1.55x slower
##### With input Middle #####
Name ips average deviation median 99th %
update name in tuple 10.43 M 0.0958 μs ±3032.20% 0.0400 μs 0.150 μs
update name in struct 6.92 M 0.145 μs ±2037.50% 0.0700 μs 0.160 μs
Comparison:
update name in tuple 10.43 M
update name in struct 6.92 M - 1.51x slower
##### With input Small #####
Name ips average deviation median 99th %
update name in tuple 10.35 M 0.0967 μs ±3082.52% 0.0400 μs 0.150 μs
update name in struct 6.92 M 0.145 μs ±2037.05% 0.0600 μs 0.170 μs
Comparison:
update name in tuple 10.35 M
update name in struct 6.92 M - 1.50x slower
The benchmark can be found here.
Structs for better domain modeling
Elixir has good data structures, and the one that I like a lot is the Struct
. It is a map
that
can have a “type”, and this characteristic make structs great for modeling the system, and give
“names” to values.
Working with simple raw data structures like tuples
can be faster and are totally recommend for
things like Eithers
(like {:ok, value}
or {:error, reason}
), but are much harder to represent more
complex ideas and values. Another key point is that you can work better with protocols and strucs,
which makes the system more extensible.
In conclusion, performance can be good, but shouldn’t be the most important thing when your are starting a project. Good modeling first will probably make your program’s life longer and prosper.
Some people may argue that a HTML parser should be as fast as possible. I agree that this is ideal, but it is not the main objective of Floki. For performance, we should offer implementations that are faster by default.