The Approximate Filter, Join, and GroupBy

In this talk we introduce approximate filter, join, and groupby operations for arrays. Flink streams typically contain primitive types and tuples, on which filter, join, and groupby operate by exact match. Exact matching can be too strict, however. For example, the arrays Array(100, 0, 100) and Array(100, 0, 101) may be “close enough” to count as a match. To handle such cases, we apply locality sensitive hashing (LSH) to arrays of numeric and string types. This technique encodes arrays as strings so that similar arrays are encoded to the same string. In other words, arrays match when they are similar, up to a degree of error. Because the encoded strings can be compared exactly, approximate filter, join, and groupby design patterns are easy to build on top of the existing exact-match operations. Finally, we highlight how Cisco Umbrella streams large signals stored in arrays and clusters them using approximate filter, join, and groupby methods to detect waves of botnets and cybercrime online.
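
To make the encoding idea concrete, here is a minimal sketch in plain Python rather than Flink. The abstract does not say which hash family the talk uses, so this example swaps in coordinate-wise quantization as a deliberately simple stand-in for a real LSH scheme. The names encode and approx_group_by and the width parameter are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def encode(values, width=10):
    """Encode a numeric array as a string key.

    Assumption: each element is quantized into a bucket of size `width`.
    This is a simplified stand-in for LSH, not the scheme from the talk.
    """
    return "_".join(str(int(v // width)) for v in values)


def approx_group_by(arrays, width=10):
    """Group arrays by exact match on their encoded string key."""
    groups = defaultdict(list)
    for arr in arrays:
        groups[encode(arr, width)].append(arr)
    return dict(groups)


a = [100, 0, 100]
b = [100, 0, 101]  # differs from `a` by 1 in the last position
c = [5, 90, 40]    # clearly different from both

groups = approx_group_by([a, b, c])
for key, members in groups.items():
    print(key, members)
```

Here a and b receive the same key and land in one group, while c gets its own. The same keys could drive an approximate filter or join, since both reduce to exact comparison of strings. One limitation of this stand-in is that values on either side of a bucket boundary (for example 99 and 100 with a width of 10) get different keys even though they are close. LSH schemes usually reduce such misses by combining several randomized hash functions.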

Speakers involved