
Clustering Tutorial



Requirements

  • Have crawl segments from Apache Nutch in a consistent format.
  • Have the latest version of autoext-spark-xx-SNAPSHOT.jar. Visit Build Instructions to build the sources and obtain either an executable or spark-submit jar.
NOTE: This tutorial runs Spark in local mode, so the -master local argument is used in the following steps. To run these jobs in cluster mode, start each job with the spark-submit command instead of java -jar and omit -master local.
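For illustration, here is a minimal sketch of the two invocation styles, using the partition goal described later in this tutorial (the spark-submit line assumes the jar's manifest names the main class, otherwise pass --class explicitly, and that a YARN cluster is reachable):

  # local mode
  java -jar autoext-spark-0.2-SNAPSHOT.jar partition \
      -in nutch-segments/20151013204832/content/ -out partition1 -master local

  # cluster mode (sketch): spark-submit supplies the master, so -master local is omitted
  spark-submit --master yarn autoext-spark-0.2-SNAPSHOT.jar partition \
      -in nutch-segments/20151013204832/content/ -out partition1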

Overview

Available goals

$ java -jar autoext-spark-0.2-SNAPSHOT.jar help
Usage 
Commands::
    similarity  - Computes similarity between documents.
    createseq  - Creates a sequence file (compatible with Nutch Segment) from raw HTML files.
    partition  - Partitions Nutch Content based on host names.
    grep       - Greps for the records which contains url and content type filters.
    help       - Prints this help message.
    merge      - Merges (smaller) part files into one large sequence file.
    dedup      - Removes duplicate documents (exact url matches).
    d3export   - Exports clusters into most popular d3js format for clusters.
    keydump    - Dumps all the keys of sequence files(s).
    sncluster  - Cluster using Shared near neighbor algorithm.
    simcombine  - Combines two similarity measures on a linear scale.

Create Sequence File

This step is optional: if you are clustering the output of Apache Nutch, which produces SequenceFiles, or your data is already in SequenceFiles, please skip this step.

This tool is designed to work on a Hadoop/Spark backend, where SequenceFiles are used to store data efficiently. If you have a bunch of raw HTML files, use java -jar autoext-spark-0.2-SNAPSHOT.jar createseq to convert them into a sequence file.

Usage

java -jar autoext-spark/target/autoext-spark-0.2-SNAPSHOT.jar createseq
 -in VAL  : path to directory having html pages
 -out VAL : path to output Sequence File
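
For instance, a folder of raw HTML pages (the html-pages/ directory and pages.seq output name below are hypothetical) can be packed into a sequence file like this:

  java -jar autoext-spark/target/autoext-spark-0.2-SNAPSHOT.jar createseq \
       -in html-pages/ -out pages.seq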

Partition Data

Clustering is a computationally expensive job! So it is better to partition the dataset down to the interesting documents before clustering. For instance, there is no need to cluster images and other non-HTML web pages by DOM structure and style. This step partitions data based on domain names and content types.

Usage

$ java -jar autoext-spark-0.2-SNAPSHOT.jar partition
Option "-out" is required
 -app (--app-name) VAL  : Name for spark context. (default: ContentPartitioner)
 -in VAL                : path to a file/folder having input data
 -list VAL              : path to a file which contains many input paths (one
                          path per line).
 -locallist             : When this flag is set the -list is forced to treat as
                          local file. By default the list is read from
                          distributed filesystem when applicable (default:
                          false)
 -master (--master) VAL : Spark master. This is not required when job is
                          started with spark-submit
 -out VAL               : Path to file/folder where the output shall be stored

Example 1: To partition a single segment:

  java -jar autoext-spark-0.2-SNAPSHOT.jar partition \
              -in nutch-segments/20151013204832/content/ \
              -out partition1 -master local

Example 2: To partition multiple segments:

Note: To do this, list all the segment paths in a text file (one segment path per line) and supply that file via -list input.txt instead of -in.

  java -jar autoext-spark-0.2-SNAPSHOT.jar partition \
              -list input.txt \
              -out partition2 -master local

Note: In case you missed it, the /content/ suffix is required on the segment paths.

Compute Similarity

Pick all the paths you want to cluster from the above partition step and put them in a text file, say paths.txt. This can be done easily with find, as shown below.
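A minimal sketch (quote the regex so the shell does not expand it; '.*ml' selects directories whose paths end in ml, which covers the html and xhtml+xml partitions):

  find ../partition1/ -regex '.*ml' -type d > paths.txt
  wc -l paths.txt   # sanity check: count of selected partition directories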

Usage

java -jar autoext-spark-0.2-SNAPSHOT.jar similarity
Option "-func" is required
 -app (--app-name) VAL  : Name for spark context. (default:
                          ContentSimilarityComputer)
 -func VAL              : Similarity function. Valid function names =
                          {structure, style}
 -in VAL                : path to a file/folder having input data
 -list VAL              : path to a file which contains many input paths (one
                          path per line).
 -locallist             : When this flag is set the -list is forced to treat as
                          local file. By default the list is read from
                          distributed filesystem when applicable (default:
                          false)
 -master (--master) VAL : Spark master. This is not required when job is
                          started with spark-submit
 -out VAL               : Path to file/folder where the output shall be stored

Style Similarity

Example 1: Compute style similarity

java -jar autoext-spark-0.2-SNAPSHOT.jar similarity -func style \
       -list paths.txt -out results/style -master local

Structural Similarity

Note: This step takes a long time to complete. Pick a small dataset for testing in local mode. If you have a large dataset, then distributed mode is the way to go!

Example 1: Compute structural similarity

java -jar autoext-spark-0.2-SNAPSHOT.jar similarity -func structure \
       -list paths.txt -out results/structure -master local

Combine Structure and Style

Usage

$ java -jar autoext-spark-0.2-SNAPSHOT.jar simcombine
Option "-in1" is required
 -app (--app-name) VAL  : Name for spark context. (default: SimilarityCombiner)
 -in1 VAL               : Path to similarity Matrix 1 (Expected : saved
                          MatrixEntry RDD).
 -in2 VAL               : Path to Similarity Matrix 2 (Expected : saved
                          MatrixEntry RDD)
 -master (--master) VAL : Spark master. This is not required when job is
                          started with spark-submit
 -out VAL               : Path to output file/folder where the result
                          similarity matrix shall be stored.
 -weight N              : Weight/Scale for combining the similarities. The
                          expected is [0.0, 1.0]. The combining step is
                           out = in1 * weight + (1.0 - weight) * in2

Example

java -jar autoext-spark-0.2-SNAPSHOT.jar simcombine \
   -in1 results/structure -in2 results/style \
   -weight 0.5 -out results/combined -master local
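
For instance, with -weight 0.5 as above, a document pair with structural similarity 0.9 (in1) and style similarity 0.7 (in2) receives a combined similarity of 0.5 * 0.9 + 0.5 * 0.7 = 0.8.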

Cluster the results: Shared Near Neighbor Clustering

Usage

java -jar autoext-spark-0.2-SNAPSHOT.jar sncluster
Option "-out" is required
 -app (--app-name) VAL          : Name for spark context. (default:
                                  SharedNeighborCuster)
 -in VAL                        : path to a file/folder having input data
 -list VAL                      : path to a file which contains many input
                                  paths (one path per line).
 -locallist                     : When this flag is set the -list is forced to
                                  treat as local file. By default the list is
                                  read from distributed filesystem when
                                  applicable (default: false)
 -master (--master) VAL         : Spark master. This is not required when job
                                  is started with spark-submit
 -out VAL                       : Path to file/folder where the output shall be
                                  stored
 -share (--sharingThreshold) N  : if the percent of similar neighbors in
                                  clusters exceeds this value, then those
                                  clusters will be collapsed/merged into same
                                  cluster. Range:[0.0, 1.0] (default: 0.8)
 -sim (--similarityThreshold) N : if two items have similarity above this
                                  value, then they will be treated as
                                  neighbors. Range[0.0, 1.0] (default: 0.7)
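
To make the thresholds concrete: with -sim 0.8, two pages are treated as neighbors only when their similarity exceeds 0.8, and with -share 0.8, two clusters are merged when more than 80% of their neighbors are shared. As an illustration, two clusters with 10 neighbors each, 9 of them in common, have a sharing fraction of 0.9 and would be collapsed into one cluster.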

Example : Style clusters

java -jar target/autoext-spark-0.2-SNAPSHOT.jar sncluster \
          -in results/style \
          -out results/clusters -master local \
          -share 0.8 -sim 0.8

Example : Structural clusters

java -jar target/autoext-spark-0.2-SNAPSHOT.jar sncluster \
          -in results/structure \
          -out results/clusters -master local \
          -share 0.8 -sim 0.8

Example : Structure and style combined clusters

java -jar target/autoext-spark-0.2-SNAPSHOT.jar sncluster \
          -in results/combined \
          -ids results/sim-ids \
          -out results/clusters -master local \
          -share 0.8 -sim 0.8

Export Clusters to D3JS JSON Format

Example

java -jar autoext-spark-0.2-SNAPSHOT.jar d3export \
  -in results/clusters/ -out results/clusters.d3.json -master local
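
Assuming the export lands in a single JSON file at the -out path, you can sanity-check it with a stock pretty-printer:

  python -m json.tool results/clusters.d3.json | head -n 20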

Visualize clusters

Use the JSON file generated in the previous step and visualize it using the sample d3js charts in visuals/webapp/circles-tooltip.html

To load the charts, you may simply launch google-chrome $PWD/visuals/webapp/circles-tooltip.html from the root of the project.
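
If the page does not work over file:// in your browser, one workaround (not specific to this project) is to serve the folder with a local HTTP server and open it from localhost:

  cd visuals/webapp && python -m http.server 8000
  # then browse to http://localhost:8000/circles-tooltip.html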

Once the web page is loaded, use the file chooser dialogue to choose your clusters JSON file.

The following actions are supported in UI:

 + Left click the circle: Zoom inside the cluster
 + Hover on the circle: Shows tooltip
 + Click on the outer circle: Zooms out to the upper level
 + Right click on a circle: Opens a web page if the cluster name is an HTTP URL (which it is in this case)

---

Other tools

There are a few more tools developed to test and debug the above tasks. Hopefully they will be useful to anyone experimenting with this clustering toolkit.

Grep Content

This tool filters content in sequence files matching the specified -urlfilter AND/OR -contentfilter patterns.

Usage

java -jar autoext-spark-0.2-SNAPSHOT.jar grep
Option "-out" is required
 -app (--app-name) VAL  : Name for spark context. (default: ContentGrep)
 -contentfilter VAL     : Content type filter substring
 -in VAL                : path to a file/folder having input data
 -list VAL              : path to a file which contains many input paths (one
                          path per line).
 -locallist             : When this flag is set the -list is forced to treat as
                          local file. By default the list is read from
                          distributed filesystem when applicable (default:
                          false)
 -master (--master) VAL : Spark master. This is not required when job is
                          started with spark-submit
 -out VAL               : Path to file/folder where the output shall be stored
 -urlfilter VAL         : Url filter substring
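
For example, to keep only the records whose URL contains example.org and whose content type contains html (both filter values are hypothetical):

java -jar autoext-spark-0.2-SNAPSHOT.jar grep \
    -in partition1 -urlfilter example.org -contentfilter html \
    -out results/grepped -master local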

Merge Sequence files

Merges multiple sequence files into one large sequence file with a configurable number of parts (use the -numparts argument below).

Usage

 java -jar autoext-spark-0.2-SNAPSHOT.jar merge
Option "-out" is required
 -app (--app-name) VAL  : Name for spark context. (default: ContentMerge)
 -in VAL                : path to a file/folder having input data
 -list VAL              : path to a file which contains many input paths (one
                          path per line).
 -locallist             : When this flag is set the -list is forced to treat as
                          local file. By default the list is read from
                          distributed filesystem when applicable (default:
                          false)
 -master (--master) VAL : Spark master. This is not required when job is
                          started with spark-submit
 -numparts N            : Number of parts in the output. Ex: 1, 2, 3....
                          Optional => default
 -out VAL               : Path to file/folder where the output shall be stored
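
For example, to merge all the sequence files listed in input.txt (one path per line) into a single output part:

java -jar autoext-spark-0.2-SNAPSHOT.jar merge \
    -list input.txt -numparts 1 -out merged -master local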

De-duplicate

This tool emits only the unique records from sequence files. Uniqueness is determined by the keys (i.e. URLs) only.

Usage

 java -jar target/autoext-spark-0.2-SNAPSHOT.jar dedup
Option "-out" is required
 -app (--app-name) VAL  : Name for spark context. (default: DeDuplicator)
 -in VAL                : path to a file/folder having input data
 -list VAL              : path to a file which contains many input paths (one
                          path per line).
 -locallist             : When this flag is set the -list is forced to treat as
                          local file. By default the list is read from
                          distributed filesystem when applicable (default:
                          false)
 -master (--master) VAL : Spark master. This is not required when job is
                          started with spark-submit
 -out VAL               : Path to file/folder where the output shall be stored
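
For example, to drop the records with duplicate URLs from a partition (the paths are illustrative):

java -jar target/autoext-spark-0.2-SNAPSHOT.jar dedup \
    -in partition1 -out deduped -master local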

Dump Keys

This tool dumps the keys of sequence file(s) into a plain text file.

Usage

java -jar target/autoext-spark-0.2-SNAPSHOT.jar keydump
Option "-out" is required
 -app (--app-name) VAL  : Name for spark context. (default: KeyDumper)
 -in VAL                : path to a file/folder having input data
 -list VAL              : path to a file which contains many input paths (one
                          path per line).
 -locallist             : When this flag is set the -list is forced to treat as
                          local file. By default the list is read from
                          distributed filesystem when applicable (default:
                          false)
 -master (--master) VAL : Spark master. This is not required when job is
                          started with spark-submit
 -out VAL               : Path to file/folder where the output shall be stored
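
For example, to dump the URLs stored in a sequence file (pages.seq here is the hypothetical file from the createseq example; any sequence file works):

java -jar target/autoext-spark-0.2-SNAPSHOT.jar keydump \
    -in pages.seq -out keys -master local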