
Internationalization (DB backed core)


Important note: This page explains how to create a Spotlight model on your own server. It is a detailed tutorial that explains each step and is partially outdated. A fully automated and up-to-date script for these steps can be found here.


For this part, you need Apache Hadoop and Apache Pig. If you don't have them installed yet, we recommend following the official setup tutorials for Hadoop and for Apache Pig. The indexing can also be run on a single machine; in that case it is enough to download Apache Pig and run it in local mode (add "-x local" to every pig command to run locally without Hadoop).

For more details on Hadoop-based indexing, see Indexing with Pignlproc and Hadoop, which also contains all required versions.

In the following sections, we use Dutch as an example. If you want to run the indexing for another language, just replace nl (the language code for Dutch) with the corresponding language code. We also assume that the default working directory in HDFS is /user/hadoop.

The quick way

This section provides a quick way of creating the Spotlight model by executing the indexing script. Before running it, you need to prepare the resources described in the first two steps below. In addition, the following programs must be available on your server: hadoop, pig, mvn, git, curl.
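As a quick sanity check you can verify that all of these programs are on your PATH (a minimal sketch; command -v prints the path of an executable or fails if it is missing):

      $ for p in hadoop pig mvn git curl; do command -v $p || echo "missing: $p"; done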

  1. Create working directory and download OpenNLP models for your language:

      $ mkdir -p /data/spotlight/nl
      $ ls /data/spotlight/nl/opennlp
      nl-chunker.bin  nl-pos-maxent.bin  nl-sent.bin  nl-token.bin
    

Note: the working directory is given as an absolute path. The OpenNLP models can be downloaded from http://opennlp.sourceforge.net/models-1.5/.
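A minimal download sketch for the Dutch models, based on the URL above (not every model type is published for every language; skip any file that is not offered, e.g. the chunker model):

      $ mkdir -p /data/spotlight/nl/opennlp
      $ cd /data/spotlight/nl/opennlp
      $ wget http://opennlp.sourceforge.net/models-1.5/nl-sent.bin
      $ wget http://opennlp.sourceforge.net/models-1.5/nl-token.bin
      $ wget http://opennlp.sourceforge.net/models-1.5/nl-pos-maxent.bin
      $ wget http://opennlp.sourceforge.net/models-1.5/nl-chunker.bin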

  2. Create a list of stopwords:

      $ head -n5 /data/spotlight/nl/stopwords.nl.list 
      de
      en
      van
      ik
      te
    
  3. Run the indexing script, which will create the model /data/spotlight/nl/model_nl:

      $ cd /data/spotlight/nl
      $ wget https://raw.github.com/jodaiber/dbpedia-spotlight/master/bin/index_db.sh
      $ ./index_db.sh -o  /data/spotlight/nl/opennlp /data/spotlight/nl nl_NL /data/spotlight/nl/stopwords.nl.list Dutch /data/spotlight/nl/model_nl
    

Note: start up the Hadoop workers before running the commands above; the paths in this command are absolute. You can replace nl with another language code, but remember to also change Dutch to the name of the corresponding language-specific Lucene analyzer, e.g. English for the EnglishAnalyzer.
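For reference, the positional arguments of index_db.sh, as far as they can be read from the invocation above, are:

      $ ./index_db.sh -o <OpenNLP model directory> \
                      <working directory> \
                      <locale, e.g. nl_NL> \
                      <stopword list> \
                      <language name of the Lucene analyzer, e.g. Dutch> \
                      <output model directory>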

The detailed way

This section describes the detailed steps for creating the Spotlight model. These steps are all performed by index_db.sh.

Download the data

  1. Prepare a stopword list file under /data/spotlight/nl and the OpenNLP models under /data/spotlight/nl/opennlp, as described in the quick way above.

  2. Download the DBpedia datasets (redirects, disambiguations and instance types):

      $ mkdir -p /data/spotlight/nl/processed/
      $ cd /data/spotlight/nl/processed/
      $ curl http://nl.dbpedia.org/downloads/nlwiki/20121003/nlwiki-20121003-redirects.ttl.gz | gzcat > redirects.nt
      $ curl http://nl.dbpedia.org/downloads/nlwiki/20121003/nlwiki-20121003-disambiguations.ttl.gz | gzcat > disambiguations.nt
      $ curl http://nl.dbpedia.org/downloads/nlwiki/20121003/nlwiki-20121003-instance-types.ttl.gz | gzcat > instance_types.nt
    

Note: if gzcat is not available, then replace it with gunzip -c.
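As an optional sanity check that the files were decompressed correctly (a sketch; the exact triple counts depend on the dump you downloaded):

      $ cd /data/spotlight/nl/processed/
      $ head -n 1 redirects.nt
      $ wc -l redirects.nt disambiguations.nt instance_types.nt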

  3. Download the Wikipedia dump:

      $ cd /data/spotlight/nl
      $ wget http://dumps.wikimedia.org/nlwiki/latest/nlwiki-latest-pages-articles.xml.bz2
    

Process the data

  1. Check out and build our version of pignlproc:

     $ mkdir pig
     $ cd pig
     $ git clone git://github.com/dbpedia-spotlight/pignlproc.git
    

    Note: There are redirect definitions for most languages that have a local Wikipedia. If you are unsure whether your language is among them, check that the language is supported in the method getRedirectPatterns in AnnotatingMarkupParser.

     $ cd pignlproc
     $ mvn assembly:assembly -Dmaven.test.skip=true
    

    Note: if the build fails because the core-0.6.jar of org.dbpedia.spotlight is not available from the info-bliki repository, you need to prepare that jar yourself by downloading the Spotlight code and installing it with mvn install.
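    A sketch of that manual step, assuming the main dbpedia-spotlight repository (which contains the core module) builds with a plain Maven install:

      $ git clone https://github.com/dbpedia-spotlight/dbpedia-spotlight.git
      $ cd dbpedia-spotlight
      $ mvn install -Dmaven.test.skip=true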

  2. Split the corpus into training, tuning and test sets and move the training part into HDFS:

      $ cd /data/spotlight/nl
      $ bzcat nlwiki-latest-pages-articles.xml.bz2 | python pig/pignlproc/utilities/split_train_test.py 12000 /data/spotlight/nl/processed/test.txt | hadoop fs -put  nlwiki-latest-pages-articles.xml
    

Move the stopwords and tokenizer model into HDFS:

     $ hadoop fs -put /data/spotlight/nl/stopwords.nl.list stopwords.nl.list
     $ hadoop fs -put /data/spotlight/nl/opennlp/nl-token.bin nl.tokenizer_model
  3. Adapt examples/indexing/token_counts.pig.params and examples/indexing/names_and_entities.pig.params to your language; the Dutch versions of these two files can serve as an example.

    Note: Due to line 87 of RestrictedNGramGenerator.java, the path of the tokenizer model is fixed to ./nl.tokenizer_model. You therefore have to use the default HDFS working directory ./, i.e. /user/hadoop in our example. Otherwise the job will fail with an error about the missing nl.tokenizer_model file.
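    To verify that everything ended up in the HDFS working directory, a quick check (assuming the default /user/hadoop):

      $ hadoop fs -ls /user/hadoop

    The listing should include nlwiki-latest-pages-articles.xml, stopwords.nl.list and nl.tokenizer_model.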

  4. Run Apache Pig:

      $ cd /data/spotlight/nl/pig/pignlproc
      $ pig -m examples/indexing/token_counts.pig.params examples/indexing/token_counts.pig
      $ pig -m examples/indexing/names_and_entities.pig.params examples/indexing/names_and_entities.pig
    

    Note: If you get a "java.lang.OutOfMemoryError", try to increase the heap space as follows: 1) add this line to the script: SET mapred.child.java.opts '-Xmx2048m'; 2) comment out this line: --set io.sort.mb 1024

  5. Move the results of both jobs into the processed directory:

      $ cd /data/spotlight/nl/processed/
      $ hadoop fs -cat tokenCounts/tokenCounts/part* > tokenCounts
      $ hadoop fs -cat names_and_entities/pairCounts/part* > pairCounts
      $ hadoop fs -cat names_and_entities/uriCounts/part* > uriCounts
      $ hadoop fs -cat names_and_entities/sfAndTotalCounts/part* > sfAndTotalCounts
    

Afterwards, you should have the following files:

     $ ls /data/spotlight/nl/processed/
     pairCounts  sfAndTotalCounts  tokenCounts  uriCounts
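If you want to inspect the results, the files are plain tab-separated text (for example, pairCounts relates a surface form to a URI and a count; the exact columns are defined by the pig scripts):

     $ head -n 3 /data/spotlight/nl/processed/pairCounts
     $ wc -l /data/spotlight/nl/processed/*Counts*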

Create the Spotlight model

  1. Create the Spotlight model /data/spotlight/nl/model_nl with:

      $ java -cp dbpedia-spotlight.jar org.dbpedia.spotlight.db.CreateSpotlightModel nl_NL /data/spotlight/nl/processed/ /data/spotlight/nl/model_nl /data/spotlight/nl/opennlp /data/spotlight/nl/stopwords.nl.list None
    

    This will create the following Spotlight model folder:

      $ tree /data/spotlight/nl/model_nl/
      /data/spotlight/nl/model_nl/
      ├── model
      │   ├── candmap.mem
      │   ├── context.mem
      │   ├── res.mem
      │   ├── sf.mem
      │   └── tokens.mem
      ├── model.properties
      ├── opennlp
      │   ├── chunker.bin
      │   ├── pos-maxent.bin
      │   ├── sent.bin
      │   └── token.bin
      ├── opennlp_chunker_thresholds.txt
      └── stopwords.list
    

Run the server

That's it, you can now run the server with your newly created model.

$ java -jar dbpedia-spotlight.jar /data/spotlight/nl/model_nl http://localhost:2222/rest
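Once the server is up, you can send it a test annotation request. A minimal sketch (the parameter names follow the standard Spotlight REST interface; the Dutch example text is arbitrary):

$ curl -H "Accept: application/json" "http://localhost:2222/rest/annotate?text=Amsterdam%20is%20de%20hoofdstad%20van%20Nederland&confidence=0.2"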

Note: If you just want to quickly run the statistical backend without creating a model yourself, pre-built models are available from the download page.
