HackYeah

1 Task intro

The amount of available data increases every day. Finding useful information – information we can actually work with – is possible, but tedious and difficult without spending a lot of resources.

Can you write software which, given a lot of uncategorized data (say: tweets, blog posts, application logs, pcaps), groups it by the similarity of the discussed topics?

The goal is that certain groups can be ignored as uninteresting while others are browsed manually. Could you score and sort the information in a useful way?

If you’ve considered trying your skills with unsupervised machine learning, this challenge is for you!
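
For the unsupervised route, here is a minimal sketch of the core idea - assuming Python with scikit-learn; all names below are illustrative, not part of the task:

# Hypothetical sketch: TF-IDF vectors + k-means, nothing task-specific.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_documents(texts, n_clusters=50):
    # Turn raw documents into TF-IDF vectors; drop very rare terms.
    vectorizer = TfidfVectorizer(max_features=50000, min_df=5)
    X = vectorizer.fit_transform(texts)
    # Group documents by topic similarity.
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)
    return labels  # labels[i] is the cluster id of texts[i]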

2 Communication

  • We update this document.
  • We are responsive on Discord (the official comms app for the hackathon) on the #exatel-task channel.
  • Our nicknames start with "Mentor_Exatel_..."

3 Task details

3.1 Input data

A publicly available, chaotic, random data set with a lot of noise and some interesting information: a dump of a publicly-facing "no paste" service. The trash bin of the Internet. Hic sunt dracones.

Around 10 GB of uncompressed data (20 GB on ext4), 3.5 million text documents, compressed down to 3.5 GB. Format: a tarball containing the entries as separate files. No metadata.
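
For orientation, iterating over the extracted tree could look like the sketch below (assuming Python; the two-level 000-fff layout matches the sneak peek further down):

import os

def iter_documents(root):
    # The dump is a two-level tree: directories 000..fff, each holding
    # files named by a hash. Encodings vary, so decode leniently.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:
                yield name, f.read().decode("utf-8", errors="replace")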

The data will be split into train/validate sets: you develop your solution on 80% of the data and we evaluate the results on 100% of it. We might have planted some data. We will publish the range of directories from which evaluation questions will be asked – around 20-30% of the data you were given.

Come to us to get a pendrive with the data.

Sneak peek:

000/0000a35e925764fc52c98aa9c593ce95d61d04a3: C source, ASCII text, with CRLF line terminators
000/0000a3ddaea0ae7c655b9c964e329e9fd32c6d66: UTF-8 Unicode text, with CRLF line terminators
000/0000a7ccdf84cac975b2601f203ed336e43d62a2: C source, ASCII text
000/0000bfd04ab0f1de2bc9d4982e262797a50b02ea: ASCII text, with very long lines, with CRLF line terminators
000/0000d132821f77b795a73fc55e373cd655b67d65: ASCII text, with CRLF line terminators
000/0000d17571af4c2d70d902a2e1fa9de432776a61: UTF-8 Unicode text, with CRLF line terminators
000/0000d7c662ce91502f098afd0fd5ef609ccda39c: ASCII text, with CRLF line terminators
000/0000d8609e938ec71e1f85f25faf7d7114103cae: UTF-8 Unicode text, with CRLF line terminators
000/0000df9429b8ef94d93948e32de7d56b8c8816d8: ASCII text, with CRLF line terminators
000/0000e50e29e59b1708e72fc8f1899276c4e91a32: C++ source, ASCII text, with very long lines, with CRLF line terminators
2ac/2ac088c05556edc22b753868456c5ac029898ba0: ASCII text
2ac/2ac0906ace389493fa8b0ad091040795ff31f07c: ASCII text, with very long lines, with no line terminators
2ac/2ac090a19829b5e7146e01a97b01f38b740091b9: ASCII text, with CRLF line terminators
2ac/2ac095be0a78b25effa7d53f05c94b34606e15ba: ASCII text, with CRLF, LF line terminators
2ac/2ac096cc2690f2f0b2ce166f06f4aa20476f868c: ASCII text
2ac/2ac099053f08a7991a84322ebbfc7b99f10ab4c2: ASCII text, with CRLF, LF line terminators
2ac/2ac099507c8a0779eaf15061564c7d9da2f62131: UTF-8 Unicode text, with CRLF line terminators
2ac/2ac09b8fa45e1f507e798f862b1586d6212ac617: C source, ASCII text, with CRLF line terminators
2ac/2ac0a066d679d5cbcdb85458a181583438ac520c: UTF-8 Unicode text, with CRLF line terminators
2ac/2ac0a968f2fab6b70fc0f5e9a16b97db40aa5a19: M3U playlist, ASCII text, with CRLF line terminators
2ac/2ac0b122b05673184a09094d6f8fa3bb89e5754d: C source, ASCII text, with CRLF line terminators
2ac/2ac0b16a7dca2bf1d02b87a28e964b4e716d2668: Java source, ASCII text


lines  words   bytes
  97     210    2579 2ac/2acf06672060f99886d594e19d3159cc2b799ced
  69     207    3464 2ac/2acf16a8dd2947bc27def8382fac659bdf3cf76c
  16      54     620 2ac/2acf16c9652f9d6f3fe6ce75e94b97cbfc916cb2
  79    1614    9999 2ac/2acf185bc36fd9574d70e0a2783e7dc19c5db958
  47    1112    6602 2ac/2acf1947c6561a07232253a760603b680c75c16b
  23      77     575 2ac/2acf19a3bb19f8ddb53f592a426618c58e4b614b
   3       8      60 2ac/2acf1fe2833301a9353b06f567d038ea4fdb8c61
   9      13     161 2ac/2acf26e7c30aa0224f2bf53b9a7e521b2660f8f0
 324    1490   10017 2ac/2acf30c67d34c923ac0c56eaa82f0ad2423361c5
  49     190    1207 2ac/2acf335abb1b3fc8e43dd864960ad501c501eb09
  51     223    1469 2ac/2acf34a4b1a8ce1822a8e8125e7aa961c9dcd1a4
  55      76     999 2ac/2acf3c48c51e1f48181bf866c65dc5b087d0dcfd
  54     191    1845 2ac/2acf4dad6536921f6360e45b05efda677e7cfb4a
 100     238    3382 2ac/2acf4ddefa6a5dc006e23bdfac5ec7149a5f8f7f
 107     360    3869 2ac/2acf52f769f8ee0a5dacfe85e5119017e6efd989
  48     314    1842 2ac/2acf5546119dacabb1d385be6176f525be3f732d
  21      55     549 2ac/2acf5b2e0d869a1ca093474b3ad043acde940ccb
  21      55     751 2ac/2acf5b9550baac938dbe2546d046eabe1325464d
  71     214    2324 2ac/2acf5c3a500d0a155dc6ceeb95347804259925f6
   5      40     389 2ac/2acf5c8464ccc80efa06eeaf8a1a98d36b645472
  35     156     761 2ac/2acf5f9bf55e7e61f42d90841054c6620469f78c

"UTF-8/ASCII text" being particularly interesting and varied.

DISCLAIMER: THE MEDIUM OWNER DID NOT AUTHOR OR THOROUGHLY ANALYSE THIS DATA. WE DISCLAIM ANY RESPONSIBILITY FOR THE CONTENTS OF THE MEDIUM AND FOR USING, PUBLISHING, OR EVEN VIEWING THEM. THE DATA IS FOR ACADEMIC USE ONLY. THE CONTENTS OF THIS MEDIUM MAY BE HAZARDOUS OR ILLEGAL IN CERTAIN JURISDICTIONS.

IN NO EVENT SHALL THE MEDIUM OWNER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DATA, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

(C) 2019 CHG

3.2 Output data

For each input file ID, depending on how you design your algorithm, output one of:

  • A set of probabilities of belonging to each category.
  • A list of tags the document belongs to.
  • A single tag/category.

If your output is probabilistic, fix the parameters used for evaluation, e.g. "assume a document is in category X if the probability is higher than 66%".

When looking through the data it's often advantageous to filter out all the categorised documents and inspect only those that don't fit any known category. Make sure your solution supports this.
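
Both points in one hypothetical sketch - the 66% cut-off from above plus an "unknown" bucket for everything that fits no category:

THRESHOLD = 0.66  # the example cut-off from above; tune it for evaluation

def assign(probabilities):
    # probabilities: dict mapping category name -> probability for one doc.
    category, p = max(probabilities.items(), key=lambda kv: kv[1])
    return category if p > THRESHOLD else "UNKNOWN"

def unknown_documents(results):
    # results: dict mapping document id -> its probability dict.
    # Everything that fits no known category lands here for manual review.
    return [doc_id for doc_id, probs in results.items()
            if assign(probs) == "UNKNOWN"]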

Probably interesting categories include:

  • Data leaks: passwords, credit cards, bitcoin accounts
  • Repetitive technical noise, to filter it out (like M3U playlists)
  • Interesting code (exploits)
  • Boring code (students' homework)
  • Ransom notes from terrorists
  • Poor poetry
  • Good poetry

The more categories you can distinguish, the better - but accuracy is important.

3.3 Task

  1. Create an algorithm which is able to categorise unlabelled data - and works with the given input data. It might need to include some kind of lexer.
  2. Allow some way to browse the resulting data by assigned category (filter a category IN, or filter a category OUT). A dedicated web interface is always nice but is not part of the task (you can use e.g. Kibana and work on the data there instead).
  3. Evaluate the algorithm on 100% of the data (or at least the 20% you will be given later) and cache the result (a caching sketch follows this list).

    We would like to see both:

    • a list of documents MATCHING a given category, so we can compare their contents and judge how well they fit the group - ideally sorted by fit, though a boolean match is fine;
    • the category data of a document given by its ID.
  4. After evaluation you can do some manual analysis, select the most common categories and "name" them. This will also help you evaluate your current results.
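
A minimal caching sketch for point 3 that also answers both lookups from the sub-list (the JSON layout is just an assumption):

import json

def save_cache(results, path="categories.json"):
    # results: dict mapping document id -> assigned category.
    with open(path, "w") as f:
        json.dump(results, f)

def load_cache(path="categories.json"):
    with open(path) as f:
        by_id = json.load(f)
    # Invert the mapping so both evaluation questions are cheap:
    # "what category has document X?" and "what else is in category Y?"
    by_category = {}
    for doc_id, category in by_id.items():
        by_category.setdefault(category, []).append(doc_id)
    return by_id, by_category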

Open topics:

  • Is it better to categorise documents as a whole, or is it more interesting to split them into smaller parts and categorise those instead? (A naive split is sketched below.)
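
A naive split, assuming fixed-size line windows (each chunk would then be categorised independently, and the document inherits all chunk tags):

def split_into_chunks(text, lines_per_chunk=50):
    # Fixed-size line windows; smarter splits (by blank lines, by
    # detected format) are left as an exercise.
    lines = text.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]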

3.4 Scoring criteria

3.4.1 Accuracy

How many categories can be accurately distinguished? How often is a document put into the wrong category? How many documents per category are required to learn it?

We will be flexible about the criteria. As a rule of thumb, if there are over 300 very similar documents it would be great if they were categorised together.

How much of the dataset can be filtered out using all known and accurate categories?

3.4.2 Speed

New data should be categorised in finite time - preferably in the lower part of the 1 ms - 1 s range. A 24 h hackathon has less than 86400 seconds left; at 1 s per document, 3.5 million documents take 3.5 million core-seconds, so you would need around 48 cores to finish in time. And that's if you start categorising NOW.
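
A back-of-the-envelope parallelisation sketch (classify() is a placeholder for whatever your per-document model call is):

from multiprocessing import Pool

def classify(document):
    # Placeholder: run your trained model on one document.
    raise NotImplementedError

def classify_all(documents, workers=48):
    # At ~1 s per document, ~48 workers keep 3.5 mln documents
    # within the remaining hackathon time.
    with Pool(workers) as pool:
        return pool.map(classify, documents)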

3.4.3 Flexibility

How does the algorithm respond to newly emerging categories? Does it need to be taught them by a human operator? How much work must be put in each time something new emerges?

3.4.4 Applicability to computer security

It's great if it can guess which university assigned the pasted homework, but that's not really interesting to us. We'd like to use it to improve the filtering of the stream of possible cyber-threat information that our Security Operations Center has to comb through on a regular basis.

3.4.5 Quality

  • Is the code quality good? Is it useful? Will maintenance be hard?
  • Does it have any licensing limitations? We won't be able to use solutions which depend on closed software with restrictive licensing, and we will score them lower than others.

3.5 Q/A

3.5.1 Too many docs! Can't do!

  • Pick 10% of the data at random.
  • Allow classifying docs on demand by hash, so the assigned categories can be checked during evaluation. (Both hints are sketched below.)
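
Both hints as a sketch (the directory-name-is-first-3-hex-digits layout comes from the sneak peek above):

import os
import random

def sample_subset(all_paths, fraction=0.10, seed=0):
    # Develop on a random 10% of the corpus.
    random.seed(seed)
    return random.sample(all_paths, int(len(all_paths) * fraction))

def classify_by_hash(root, doc_hash, classify):
    # On-demand classification: files live under a directory named by
    # the first 3 hex digits of their hash (see the sneak peek).
    path = os.path.join(root, doc_hash[:3], doc_hash)
    with open(path, "rb") as f:
        return classify(f.read().decode("utf-8", errors="replace"))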

3.5.2 Evaluation data. Update: [2019-09-15 Sun]

We are scrapping the idea of an additional 20% of data, to avoid the problem of getting pendrives to all of you.

We will select around 20% of the data you already have and do the evaluation using this data only. This way you can limit the number of documents you have to classify during the hackathon.

You can classify more if you are able to, and this can improve your pitch.

We will publish the range later - a selection of a few directories from the range 000-fff.

Update: we will ask you about documents from the directory range c00-fff.

3.5.3 Evaluation process.

  • We will look at your solution: the code @hackerearth, plus any demo/screenshots/description you attach to it.
  • During the pitch, show us:
    • your approach and results,
    • any interesting data you've found and categorised,
    • your design choices,
    • why you think your tool is useful.
    • We will also ask you about a few IDs of documents from the "evaluation data" range, to see what categories you've assigned to them - and what other documents ended up in those categories. We will publish the IDs so they can be copy-pasted from the web.
  • We won't be running your code or teaching it new data - no time for that.

3.5.4 Unattended / attended

Hint: when using an unattended approach, a huge category can be further divided, recursively, into smaller categories.
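
A sketch of that hint, with cluster_fn standing in for whatever clustering you use (e.g. k-means restricted to the subset):

def subdivide(doc_ids, cluster_fn, max_size=1000):
    # Recursively re-cluster any category too large to browse by hand.
    if len(doc_ids) <= max_size:
        return [doc_ids]
    parts = cluster_fn(doc_ids)
    if len(parts) <= 1:  # clustering made no progress; stop here
        return [doc_ids]
    groups = []
    for part in parts:
        groups.extend(subdivide(part, cluster_fn, max_size))
    return groups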

Author: Tomasz bla Fortuna

Created: 2019-09-15 Sun 08:20
