Posted onNov 5, 2020

The limits of 2 dimensions

#webdev #maps #rating #sql

This article was written to design a solution forterravigil.com articles ordering. It will start as a problem related to articles, and will end with the same problem as well. But in between...

Let's do some math:

open your favourite site with some articles on the main page -bbc.com,reddit ordev.to, it does not matter.
check what you see
probably that's latest and the best news or articles

latest ANDthe best.
These aretwo differentproperties to order. Simultaneously.
These aretwo differentdimensions to discover.

💡 There is an easy way to "experience" this problem - open anytable within some table processor (Excel) and try to sort by a field. And then another field. And then by two fields simultaneously -

😅 but there is no such operation.

Speaking in a more common language (SQL):

you canORDER BY time, rating LIMIT 10
but it will first sort bytime and only then byrating.

Want to getby time AND by rating simultaneously - that is not possible, the ordering is happening sequentially, or it's better to sayin the1D world.

However, it's easy to start the story from another example. From the actual2D world. Something we all experienced, and able to understand.

Frommaps 😉

Maps

🗺 Maps are 2D. Obviously.

Let's do some math:

open your favourite site, powered by some maps, and displaying some stuff on the map (that's important!).
check what you see.
probably there are some "dots"(Points Of Interest) on that map.

For the sake of simplicity - let's usedomain.com.au as an example.

Let's check what we actually looking at:

✅ it's a City Of Sydney
✅ majority of POIs are located in the Downtown
✅ there are almost 9000 places to rent
💩 just a few dozen are displayed

So, this is "our situation" - we havemore data than we could, or should display, so we have to apply somelimits - we have to pick "latest ANDthe best" items to display.

We have to pickthe best items with restriction byX andY. Already 1-dimensional ordering and 2-dimensional filtering.

However, I would like to point in that strange fact, that there isalmost no information outside of Downtown.

Probably nobody is selling or leasing their houses so far from the centre. People are not living in those wild areas! Sounds reasonable? 🤷‍♂️ why not?

However, let's double-check it - zoo-oo-ooming in!

😎 I would assure you, there is life outside of Sydney's CBD, we just are not able to see it due tolimitation of 2D dimensions.

I've got a video for you

👉 In short: there is A LOT of "dots" outside the Downtown, but we don't see them.

The root cause of this effect is -information tend to be displayed in the areas of higher density. And yes - it comes, and become worse and worse, with a scale.

information tend to be displayed in the areas of higher density. Every entity has an equal probability to be picked, and that's why higher "density" produces more "picks"

Actually, the root cause here can be observed as the overall "speed" of updating information on the map as well.
Every time I move the map,domain fetches the new information from the backend, and then displays it. Very often, especially within "areas of high density", the updated data does not represent the previous dataset (it might be very different), even if I am searching roughly the same area.

This causes heavy load on the backend, very slow and non-responsive frontend, as well as annoyingPOI-glittering. Really VERY annoying. Especially when some point you had a plans to inspect just disappears...

Living in 2-dimensions

We have two problems here:

Problem 1

Every request is unique. You can move a map as you want, and every time it would be absolutely unique. You will have to reach the server, spend time, return a unique result.

Two customers, looking at one area, are looking at ABSOLUTELY different ones.

There is one thing you have to understand - maps "displaying" some stuff are here for some time. For "some time" - since IE6 and pre-iPhone era. Since "before" AWS. Since the times, when you had a single database for all 1000000 customers per day.
In other words - 10 years ago you were not able to "scale up", and had to provide performant and efficient backend, oryour Frontend will DDOS your backend.

And cartographical services of the pasthave to solve the problem. And had solved it.
But cartographical services of the present are not giving a crap, resulting in the issue mention above, the fix for which is known for years.> The solution will be explained later

Problem 2

Problem number 2 consist of the other two problems:

SELECT * WHERE X and Y gives us 10.000 records. So you can select much more information than you can display. So you have toLIMIT it. But limiting needs sorting to keep "the best" records.
ORDER BY rating is keeping records in the areas of higher density. And you want to display all possibleopportunitues.If all records have the same probability to be displayed - the result would represent just information density.

And "information density" is the problem you have to fix first. Not tofix - butneutralize.

Clustering

For example, you can do N smaller requests to "cover" the whole requested area.

SELECT WHERE X and Y ORDER BY rating LIMIT 100 ⬇️⬇️⬇️100 times x(SELECT WHERE x and y ORDER BY rating LIMIT 1)

This time instead of one big select we are making a dozen smaller, every of which still tends to display the most "dense" part of it, but covering the whole block more or lessevenly.

However -it is not efficient. There is a better way -group by

SELECT WHERE X and Y ORDER BY rating LIMIT 100⬇️⬇️⬇️SELECT WHERE X and Y GROUP BY (ROUND(x/10), ROUND(y/10)) ORDER BY rating

In this case we have grouped all points in some "smaller area" together, resulting potentially up to 100 rows(0.1x0.1 grouping).
This is, again, not efficient, but could it help us with reducing dev city of the data and making coverage a bit more “flat”.

We just make a complex task a bit simpler.

However,this is not clusterization we probably should use to properly display high density information. All we did isaggregation, and that's enough! But we did it not in the best way.
There is a better one.

Problems with "random queries"

group by ROUND(x/10) is not working./10 should be different for every zoom level. With every zoom, Map becomes twice bigger, so multiplier should also adjust (x2 every zoom)> Plus grouping is actually slow. No matter how you do it - it would be slow.
WHERE x BETWEEN left and right would be unique for every user looking at your map. And that is a problem bot only for the network cache, but for the database cache as well> 👉 Two customers, looking at one area, are looking at ABSOLUTELY different ones.

Z for Z-code

The goal of aggregation is tocollapse some records, which could be grouped somehow, into one record.
For 2d data the most common solution is known asZ or Morton code, as well asgeohash,Gray code orHilbert curve - they all are some sort ofspatial filling curves, literally aline which stitches N dimension together.

it is called "Z" because every iteration it "Z" itself (and Hilbert curve "U" itself)

It might take a while to understand how such curves work, but the essence of the idea is simple:

values which are "close" on the curve, are "close" in the N-dimension as well

In our case that would sounds like

SELECT WHERE X and Y GROUP BY zcode & 0xFFFF00  <---ORDER BY rating LIMIT 100

Where0xFFFF00 is playing a role of a "mask" and groups 😎values which are "close" on the curve😎 effectively clustering elements which are close to each other in 2D.

The explanation how "clustering" work is best seen onBings Tile "coordinate" System - quadkeys (the same Z code) which uses 2 bits (0-1-2-3) per one "division". So masking the "following" bits can "group" all points with the samebeginning of the code. And all points in one "rectangle" will have the same beginning.

Word-to-vec

A bit different approach is used in search engines, which also need tofind and rank something among a billion documents.
"Similarity" search is one of key algorithms there, andword2vec is a key technique here.

As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.

Mathematically speaking it's the same case - "close" words or "documents" would end "close" to each other in the (very) multi-dimension space (for the reference it's not 2D or 3D, but closer to 1000D).

Terravigil

However let's back to our articles case, about displaying "the best and the most fresh" data.

What we actually want:

we want to see "the best of the best" news longer - more people should read them
we want to display hot news as soon as possible, sofreshness is important
but there two points violates each other

There are two possible solutions here:buckets andscore

Buckets

Selects news in predefined time internals, and order inside each group separately.
Literally "display top news of today" and then "display top news of tomorrow".

the size of a bucket can be adjusted, like more than a day or less than a day
buckets might overlap, like you can merge the top news today, and the top news from the past week.
buckets can be fixed (in time), thus cacheable, or floating, changing intervals they represent every minute or hour.

Buckets are first~~sort by time~~ bucket by time, then sort by rate.

Score

Selects news ordered by a single "score", which is the main problem of this article.
While spatial curves can help with 2D and othergeometrical cases - they are more or less useless for this particular case.

Why? Because some elements "close" on the 2D plot would be VERY far on the curve. Just think for "start" is different from the "end", while curve starts and ends in the same point.

Solution here is to use or special "vector" data formats or databases known as KNN -k-nearest neighbours search (orNearest centroid.

Javascript implementationhttps://github.com/mourner/rbush-knn
ElasticSearchhttps://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/knn.html
Postgres support it as a part ofPostGIS - remember that our problem can be seen as a geometrical problem
MySQL does not have anyefficient way to handle this, and you will probably kill your database if you try.

A sample data set of points considered as "neighbours"

Conclusion

Long story short -Terravigil usesbuckets as long as it's easy todesign them and understand how they work, so you can create a correct result.
As well it usesscore to derive a single metric from many different other signals to at least understand which article is "the best".

But all you have to do - just look at the problem from another angle. If you were told to do a 2-dimensional sorting, thinks about limits of the 2-nd dimension.

And it's not just about how to do it, it's also about how to make it FAST.