Hadoop Plugin Guide

Overview

This guide will teach you how to create Hadoop plugins in your local development environment for use in Infinit.e.  Hadoop is a MapReduce framework that allows jobs to be executed in parallel across multiple resources.  We've fused these parallel resources with the ability to access Infinit.e's analytical data.  This guide will take you from setting up your environment, through creating your first plugin, to deploying that plugin to Infinit.e.

As an alternative to writing a full Java app, a built-in "Javascript Engine" plugin is provided that allows users to write JavaScript "scriptlets" and distribute them. This is described here.

Once a custom plugin has been developed and tested, actually deploying it is much easier, using the web-based UI described here.

From August 2013 the above "plug-in GUI" includes basic debug functionality that will often make the full development environment described below unnecessary.

Setup

  1. Install Eclipse.  The latest version can be found here.  We suggest the Eclipse IDE for Java EE Developers, but most versions will be suitable.
  2. Install MongoDB.  The latest version can be found here.  Quickstart installation guides from MongoDB can be found here.
  3. (Windows only) Install Cygwin.  The latest version can be found here.  Some commands used for local testing require a Unix-friendly interface, and Cygwin allows a Windows PC to handle Unix commands.  (Note: during installation, Cygwin will ask what packages you want to install.  Stick with the default minimal packages unless you know you need something additional.  More information can be found on Cygwin's install page.)
  4. Add Cygwin to the environment Path (if on a Windows platform).  In order for Eclipse to be able to make the necessary Unix-friendly calls, Cygwin must be on the path.  To do this follow these steps (on Windows 7; if using a different OS the directions may vary):
    1. Open control panel (start -> control panel)
    2. Navigate to System
    3. Choose advanced system settings from the left pane
    4. Select the advanced tab and choose Environmental Variables
    5. In the second section labeled System variables find the Path variable. Click edit.
    6. Add the path to your Cygwin bin folder to the end, e.g. ;C:\cygwin\bin; (make sure you don't erase what was previously included; entries need to be separated by semicolons).
    7. Click OK until all windows have been closed.
    8. Restart your computer
    9. To test it worked, open a command prompt (start -> search for 'cmd' -> select cmd.exe) and type "chmod" or any other Unix command.  The command prompt should respond with something along the lines of "chmod: missing operand".  If the command prompt responds with "command not found", you have configured the Path variable incorrectly or need to restart Windows.
  5. Open Eclipse and add our sample Java project found here (File -> Import -> General -> Existing Projects into Workspace -> Select archive file -> browse to the downloaded project and import it).
  6. Turn on mongodb:
    1. Open a command prompt (start -> search for 'cmd' -> select cmd.exe ).
    2. Navigate to your mongodb bin folder (e.g. C:/mongo/bin)
    3. Start mongodb via command: mongod

To test that you have set up your environment correctly, try running our sample project:

  1. In Eclipse choose Run -> Run Configurations
  2. In the left pane find Java Application, right-click it and select New
  3. For Project, browse to the sample project you downloaded, e.g. infinit.e.hadoop.examples
  4. For Main class select our sample test harness: StandaloneTest
  5. Click the Arguments tab and type the path to the config file needed to run the job into the Program arguments box.  If this is your first time installing the Infinit.e Hadoop plugin, use this path to reach the sample config file: ${workspace_loc}\infinit.e.hadoop.examples\config\sample_data.json
  6. Give your run configuration a new name (e.g. Hadoop Sample Test)
  7. Hit Apply and, finally, Run.  This will initiate the Hadoop job and send the output into mongodb.
  8. To see the results, you will first have to look in MongoDB (the results will be stored in a collection named test.source_sum):
    1. Open another command prompt (start -> search for 'cmd' -> select cmd.exe ).
    2. Navigate to your mongodb bin folder (e.g. C:/mongo/bin)
    3. Start mongo console via command: mongo
    4. Navigate to the test DB via command: use test
    5. Check your results via command: db.source_sum.find()

The results should look similar to: 

{ "_id" : ObjectId("520a440742224744f6ac4453"), "key": "Al Jazeera: World News", "value" : 6 }
{ "_id" : ObjectId("520a440742224744f6ac4454"), "key": "Business Wire: Conference News", "value" : 1 }
{ "_id" : ObjectId("520a440742224744f6ac4455"), "key": "Business Wire: Government News", "value" : 3 }
{ "_id" : ObjectId("520a440742224744f6ac4456"), "key": "Global Security: News", "value" : 1 }
{ "_id" : ObjectId("520a440742224744f6ac4457"), "key": "Huffington Post: News", "value" : 32 }
{ "_id" : ObjectId("520a440742224744f6ac4458"), "key": "Huffington Post: Tech", "value" : 33 }
{ "_id" : ObjectId("520a440742224744f6ac4459"), "key": "Kaiser: Health News", "value" : 1 }
{ "_id" : ObjectId("520a440742224744f6ac445a"), "key": "Mashable: Blog", "value" : 1 }
{ "_id" : ObjectId("520a440742224744f6ac445b"), "key": "MedPage Today: News", "value" : 1 }
{ "_id" : ObjectId("520a440742224744f6ac445c"), "key": "NPR: Tech News", "value" : 6 }
{ "_id" : ObjectId("520a440742224744f6ac445d"), "key": "New York Times: World News", "value" : 3 }
{ "_id" : ObjectId("520a440742224744f6ac445e"), "key": "News-Medical: News", "value" : 1 }
{ "_id" : ObjectId("520a440742224744f6ac445f"), "key": "Reuters: Top News", "value" : 4 }
{ "_id" : ObjectId("520a440742224744f6ac4460"), "key": "The American Prospect: News Articles", "value" : 1 }

Note that in pre-August 2013 releases the developer (i.e. non-integrated) format was slightly different:

  • The "key" field above was named "_id" (so there was no ObjectId)

Development

Once you have the environment set up and have the sample project working, it's time to make some changes and learn what kind of things you can accomplish.  First let's explore the sample project a little bit.

Test Harness - StandaloneTest.java and InfiniteHadoopTestUtils.java

When running the sample test, you created a run configuration that pointed to StandaloneTest.java as the main class.  This is our test harness.  It allows you to run Hadoop jobs locally before uploading them to the server.  Any new Hadoop plugin projects you create need to include this file, as well as its helper file, InfiniteHadoopTestUtils.java, in order to test jobs locally.
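
For orientation, the harness boils down to something like the following sketch (the shipped StandaloneTest.java may differ in detail; args[0] is the config path supplied via the run configuration's program arguments):

// Hedged sketch of a StandaloneTest-style entry point; the file shipped with the
// sample project may differ.  Assumes SourceSumXML and InfiniteHadoopTestUtils are
// in the same package.
public class StandaloneTest {
    public static void main(String[] args) throws Exception {
        // args[0] is the config file path from the run configuration
        int exitCode = InfiniteHadoopTestUtils.runStandalone(new SourceSumXML(), args[0], args);
        System.exit(exitCode);
    }
}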

Configuration - Config.xml

We passed the location of the config.xml file to the main class, StandaloneTest.java.  When a plugin runs on the server, the Infinit.e platform automatically creates these config files (with a little information from you).  Locally we can play with the settings to achieve different results.  Some fields that may need to be changed for various Hadoop job plugins are listed below (a short sketch of setting a few of them in code follows the list):

  • mongo.job.name: The name of the job that is displayed in the system monitoring pages.
  • mongo.input.uri: The collection that the input of your map reduce job will be read from.  Typically, if you are using sample data then leaving this as test.docs is sufficient.
  • mongo.output.uri: The collection that the results of your map reduce job will be saved to in your local MongoDB instance.  We recommend setting a different test.<tablename> for each new project so that you do not write over the top of another job's results.
  • mongo.input.query: A JSON query that will be run on the input dataset.  The sample config has an empty query {}, but if you wanted to only grab documents that had a large significance score you could put any valid MongoDB query instead, e.g. {score:{$gt:200}}.  You can also specify some post-processing here; this is discussed below in one of the advanced topics.  The query should be valid XML (may require CDATA or escaping).  The following clause should be added if it is important to skip over recently deleted documents (this is automatically added when submitting via the Infinit.e API): '{ "index": { "$ne": "?DEL?" } }'
  • mongo.input.fields: The list of fields to be extracted from the input object.
  • mongo.input.limit: For debugging, the number of records for each mapper to get.
  • mongo.job.mapper: The mapper class.  If you create a different mapper you must put the package.filename$classname here.
  • mongo.job.input.format: Should be hardwired to com.ikanow.infinit.e.data_model.custom.InfiniteMongoInputFormat
  • mongo.job.output.format: Should be hardwired to com.ikanow.infinit.e.data_model.custom.InfiniteMongoOutputFormat
  • mongo.job.combiner: The combiner (or reducer) class.  If you create a different combiner you must put the package.filename$classname here.
  • mongo.job.reducer: The reducer class.  If you create a different reducer you must put the package.filename$classname here.
  • mongo.job.output.key: The reducer's output key.  Currently the values must come from the org.apache.hadoop.io package, but this could be subject to change later.  The key defines what the "_id" will be set to in the output collection (e.g. the source title in the provided sample).
  • mongo.job.output.value: The reducer's output value.  Currently the values must come from the org.apache.hadoop.io package, but this could be subject to change later.  The value defines what the "value" will be set to in the output collection (e.g. the number of documents from that source in the provided sample).
  • infinit.e.selfMerge: The collection from which a second split will be input to your mapper; typically this is the job's last output collection.
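
For reference, here is a minimal, hypothetical sketch of setting a few of these properties in code (e.g. for a quick local experiment) rather than editing config.xml - the class name and values are purely illustrative:

import org.apache.hadoop.conf.Configuration;

// Hypothetical illustration only - the property names are those described above.
public class LocalConfigOverrides {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("mongo.job.name", "SourceSumLocalTest");
        conf.set("mongo.input.uri", "mongodb://127.0.0.1/test.docs");
        conf.set("mongo.output.uri", "mongodb://127.0.0.1/test.source_sum_local");
        conf.set("mongo.input.query", "{}"); // any valid MongoDB query
        conf.set("mongo.job.mapper", "examples.SourceSumXML$TokenizerMapper");
        conf.set("mongo.job.reducer", "examples.SourceSumXML$IntSumReducer");
        return conf;
    }
}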

Sample Input Data - sample_data.json

The input data for the sample project comes from a file named sample_data.json in the config folder.  This data came from query results exported from infinite.ikanow.com.  If you want to supply your own sample data you can log onto your server (e.g. infinite.ikanow.com), run a query and then select from the upper right: options -> Create JSON for current query.  Move this file into your project and change the run configuration to point to the new file.  Once a plugin is uploaded to Infinit.e it will use the entire document data set (that you have access to and that it is pointed at) when running.  It is important to craft a query in the config file if you do not want to use the entire data set when running on Infinit.e.

Note you can also get data from the API directly, see the examples in the query documentation. The only difference is that the documents are returned in a field called "data" instead of "documents", so it will have to be renamed by hand before running the test harness.

Mapper - examples.SourceSumXML$TokenizerMapper

The mappers used in the Infinit.e Hadoop framework follow the standard mapper rules of Hadoop, with the only exception being that the input data is a BSONObject (i.e. the data from Mongo).  The mapper in the SourceSum example deserializes every document object that is passed in to the mapper and grabs the source.  The source is then emitted with a count of 1 and passed off to the reducer.  Any information you want to aggregate should be grabbed here and passed on.  If you wanted to sum up entire entity objects you could grab all of the entities from the DocumentPojo and pass them to the context as a BSONObject.  Make sure to change the last 2 arguments in the template for the Mapper if you change what you are committing to the context (e.g. in the given example change extends Mapper<Object, BSONObject, Text, IntWritable> to extends Mapper<Object, BSONObject, BSONObject, IntWritable>).

In the SourceSum example we strip out the fields "associations", "entities", and "metadata" to speed up deserialization by removing large chunks of unused data.  If you were to follow the example explained above you would need to comment out the value.removeField("entities") line so you had access to the document's entities.
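
For orientation, a hedged sketch of what a SourceSum-style mapper looks like (the class in the sample project may differ in detail, e.g. in exactly how it extracts the source title):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

// Hedged sketch only - input values are BSONObjects (one per document),
// output is (source title, 1) for the reducer to sum.
public class TokenizerMapperSketch extends Mapper<Object, BSONObject, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text sourceKey = new Text();

    @Override
    protected void map(Object key, BSONObject value, Context context)
            throws IOException, InterruptedException {
        // Strip large unused fields to speed up deserialization (as described above)
        value.removeField("associations");
        value.removeField("entities");
        value.removeField("metadata");

        Object src = value.get("source"); // may be a single title or a list of titles
        if (src instanceof List<?>) {
            for (Object s : (List<?>) src) {
                sourceKey.set(s.toString());
                context.write(sourceKey, ONE);
            }
        }
        else if (src != null) {
            sourceKey.set(src.toString());
            context.write(sourceKey, ONE);
        }
    }
}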

Reducer - examples.SourceSumXML$IntSumReducer

The reducers used in the Infinit.e Hadoop framework follow the standard reducer rules of Hadoop, with the only exception being that the output data will be written to MongoDB.  Once deployed, the results will be available via an API call.  The reducer in the SourceSum example receives a single source title and a list of counts.  The reducer simply sums up the values and passes the source title and new count on.  A reducer can and will be called multiple times until it gets to a final result for every unique key delivered by the mapper.

If you wanted to follow along with the example from the mapper where we were summing up entity objects from every document, you would follow the same method the SourceSum reducer uses, by just summing up the list of values.
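
A corresponding hedged sketch of the reducer side (the shipped IntSumReducer may differ in detail):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hedged sketch only - sums the per-source counts emitted by the mapper.
public class IntSumReducerSketch extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}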

More advanced example - infinit.e.hadoop.template

A template project is included here (TODO LINK TO GITHUB). It demonstrates a simple Infinit.e application (counting the number of documents grouped by tags) and demonstrates different combiners, user configuration etc. It is recommended to clone this project to act as the basis for most real-life Infinit.e applications.

Arguments - Advanced Topics

Custom arguments can be passed into a map reduce job to be used in the mapper or reducer.  A useful case for this would be if you wanted to sum up a specific region's geotags (say by continent).  You could reuse the same jar but have 7 jobs (1 for each continent) where you pass in the continent's name.  Arguments can be placed in the config.xml file.  When using the Infinit.e platform, arguments will be placed in the Hadoop configuration in a variable named "arguments".  In the supplied config file there is a place where you can change this for testing.  To access the arguments you can use this code in your mapper or reducer:

String args = context.getConfiguration().get("arguments");

In the given example we could include this in the mapper and use args to filter which geotags we grab from doc.entities (say only the ones contained in "North America").
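
For instance, a hedged fragment of what that filter might look like inside the mapper (the "geotag" and "continent" field names are illustrative assumptions, not the exact document schema):

String continent = context.getConfiguration().get("arguments"); // e.g. "North America"
Object entities = value.get("entities");
if (entities instanceof java.util.List<?>) {
    for (Object o : (java.util.List<?>) entities) {
        org.bson.BSONObject entity = (org.bson.BSONObject) o;
        // "continent" and "geotag" are assumed field names for illustration only
        if (continent.equals(entity.get("continent")) && (entity.get("geotag") != null)) {
            // ... emit or count this geotag ...
        }
    }
}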

Reading From Files - Advanced Topics

From March 2014, it is possible to read data in from HDFS as well as from the document metadata/content tables, entity/association feature tables, or previous custom jobs.

For "inputCollection", use "filesystem" (this is also an option in the Plugin Manager GUI), and then instead of a MongoDB/Infinit.e query you can use the JSON format from the File Harvester:

{
    "file":
    {
        "url": "string", // The URL of the directory (include the trailing /); can be an absolute path starting with "hdfs://", or a relative path from the Infinit.e working directory

        "pathInclude": "string", // Optional - regex, only files with complete paths matching the regular expression are processed further
        "pathExclude": "string", // Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed)
        "renameAfterParse": "string", // Optional, renames files after they have been ingested - the substitution variables "$name" and "$path" are supported; or "" or "." deletes the file
            // (eg "$path/processed/$name")

        "type": "string", // One of "json", "xml", "*sv"

        "XmlRootLevelValues" : [ "string" ], // The root level value of XML at which parsing should begin
            // also currently used as an optional field for JSON; if present, will create a document each time that field is encountered
            // (if left blank for JSON, assumes the file consists of a list of concatenated JSON objects and creates a document from each one)
            // (Also reused with completely different meaning for CSV - see below)
        "XmlIgnoreValues" : [ "string" ], // XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level.
            // (Also reused with completely different meaning for CSV)
        "XmlSourceName" : "string", // If present, and a primary key specified below is also found, then the URL gets built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV.
        "XmlPrimaryKey" : "string" // Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV.
    }
    // other modifiers: see below
}
  • Only HDFS is supported (it is intended to add S3 in the future)
    • All paths must be under "hdfs://user/tomcat/". If a relative path is specified then this prefix will be added.
    • Access control is then enforced by the input filename: it must be in the following format
      • ANY_STRING/COMMUNITY_LIST/ANY_PATH/
        • where COMMUNITY_LIST can be any string containing the community ids separated by any valid separator (eg '_')
        • (eg "input/530fabeee4b05f1d7d0957be_530f6f5fe4b0de2e600a57fa/subdir1/subdir2/")
  • Only JSON/CSV/XML files are currently supported, not tika/Office (it is intended to add this capability in the future)
  • Only modifiers permitted:
    • $fields
  • The records generated from file mode are in the (JSON) format of IKANOW documents, with the following fields populated:
    • All:
      • url: from the record itself, if "XmlSourceName" and "XmlPrimaryKey" are specified (and the field from "XmlPrimaryKey" is present in the data)
        • else <the URI of the file containing the records>/<record number in file>/<csv|xml|json>
      • title: the path of the filename from which the records were taken
      • sourceUrl: the URI of the filename from which the records were taken
      • created: the time the doc was created, modified: the filetime of the file containing the record
    • CSV:
      • metadata.csv: an array of size=1 containing the fields from "XmlRootLevelValues"
      • fullText: the entire line
      • description: the first 128 characters of the line
    • XML:
      • metadata.xml: an array of size=1 containing the XML object converted into JSON
      • fullText: the entire object
    • JSON:
      • metadata.json: an array of size=1 containing the JSON object
    • The "key" passed with each of the above records is an object Id (generated from the date when the file was last modified, together with the number of the record within the file split)
Split Size - Advanced Topics

Currently Hadoop jobs have a default number of input data splits and docs per split.  If you need to adjust these defaults to get better performance you can change them in your config.xml file.

max.splits
max.docs.per.split

max.splits: the maximum number of groups of data that will be grabbed at once
max.docs.per.split: the maximum number of data entries that will be grabbed for each split.

From August 2013, these can also be entered into the query object (when submitted via the API), see the API documentation.

Dependencies - Advanced Topics

Occasionally you may want to use the map reduce results as input to another job.  Infinit.e supports chaining multiple jobs together to make a series of jobs, each requiring the input from the previous job.  An example could be: job A aggregates a document's geotags by continent (like the previous example), then job B uses job A's results to calculate frequency per country mentioned in the aggregate geotags of a continent.  In the Infinit.e application, chaining together jobs like this is as easy as selecting a menu option on the upload page.  To test this locally you can edit the StandaloneTest.java file and add a second config file.  To run job A use the standard line:

int exitCode = InfiniteHadoopTestUtils.runStandalone( new SourceSumXML(), "config/config1.xml", args );

Then to run job B after that line include:

exitCode = InfiniteHadoopTestUtils.runStandalone( new WordCountXML(), "config/config2.xml", args );

Make sure to create separate config files for each, and have the 2nd config file point to the output of job A (e.g. set mongo.input.uri to mongodb://127.0.0.1/test.source_sum and the output of job B to something different).

Post Processing - Advanced Topics

If your map reduce job returns a large number of results you may want to limit the results and/or sort them to make ingesting the data easier and faster in the future.  The query parameter of a map reduce job doubles as a post-processing command object.  Note: currently the post-processing will only occur when running a deployed job, not when testing locally (we are planning to support this for local testing in the future).  A typical query parameter may look something like this:

{ "source" : "Huffington Post: Tech" }

When submitted, this would only return documents retrieved from the Huffington Post: Tech RSS feed.  If you want to limit your results and/or sort them, you can submit a post-processing object along with your query (or an empty query!).  The post-processing object has the following structure:

{
    // Output pre-processing
    "$output": { // (as above)
        "limit":int,          // a record limit, how it is applied depends on limitAllData below
        "limitAllData":boolean,   // if true, the above limit is applied to the entire collection; if false, just to this iteration's records
        "sortField":string,       // field to sort on (defaults to "_id"), supports dot notation
        "sortDirection":int,  // -1 or 1 for descending or ascending  
        "indexes": [{}] or {} // A JSON object or list of JSON objects defining simple or compound MongoDB indexes
    },
    // Other control fields
    "$limit": int,            // If specified then will process this number of records in one giant split (used for debugging)
    "$fields": {},        // Usual MongoDB specification of the fields to provide to the mapper, eg {"_id":0, "entities":1} (defaults to all if {})
    // More advanced parameters:
    "$reducers": int,     // Specifies the number of reducers to use (default: 1)
    "$mapper_key_class": string, // Allows you to use different mapper output classes than the reducer (key class name, should be fully specified)
    "$mapper_value_class": string, // Allows you to use different mapper output classes than the reducer (value class name, should be fully specified)
    // Can mostly be left at their defaults:
    "$splits": int,       // The maximum number of splits before the standard MongoInputFormat class is used (which is very inefficient when a query is applied), default 10
    "$docsPerSplit": int, // The maximum number of docs per split before the standard MongoInputFormat class is used (which is very inefficient when a query is applied), default 12.5K
    "$srctags": string or {...},  // A MongoDB query, that is applied against the source tags (not the document tags) and converts to a list of sources (ie very efficient). 
                                    // (Note this backs out if too many sources - currently >5000 - are selected, so should be treated as a recommendation - ie mappers might still be called on non-matching sources)
    "$tmin": string, "$tmax": string, // Maps Infinit.e-query style time strings (including "now", "now-1d" etc) onto an indexed field in the specified collection to support time-boxing the input
                                        // (supported across the following input types: docs (mongo query and infinit.e query), records, custom table)
 
    "$caches": [ string ], // A list of ids pointing to JARs that are then added to the classpath cache, or other shares that can be accessed via the Hadoop distributed cache
                            // (Currently JS scripting engine only: Also can be a list of ids/job titles pointing to other jobs that can then be accessed via _custom)
 
    // Record specific:
    // (ES formatted query needed, with $tmin/$tmax support)
    "$streaming": boolean, // if present then will search only live/streaming records interface if true, only stashed/demo records interface if false; searches both if not present
    "$types": string, // ,-separated list of ES-types to filter on
    //
    // The MongoDB or Infinit.e or ES query as before, eg "sourceKey": "key.to.source", or "qt": [ { "etext": "string" } ], etc
    //
}

(Note that from August 2013 this replaces the old array-based format, which is still described here)

(Infinit.e style temporal queries are described here)

The above query object is converted when a job is submitted to the custom engine via the Infinit.e API (or plugin manager GUI) - if Hadoop jobs are submitted manually then the query should just be a standard MongoDB query. The fields mostly map onto the various config XML parameters described on this page. Special cases:

  • The "$output" fields must be manually performed using MongoDB shell commands.
  • The "$caches" field maps onto the Config XML parameter "infinit.e.cache.list
  • The query modifiers "$srctags", "$tmin", "$tmax" must be manually mapped into the query parameter

A full example of a query object you could pass to do post processing:

// New example:
{
	"$output": {"limit":5,"sortField":"value","sortDirection":-1,"limitAllData":true},
	"source" : "Huffington Post: Tech"
}
 
// Old format: With a query:
[{ "source" : "Huffington Post: Tech" },{"limit":5,"sortField":"value","sortDirection":-1,"limitAllData":true}]

// Old format: Without a query:
[{},{"limit":5,"sortField":"value","sortDirection":-1,"limitAllData":true}]

It is recommended to use an indexed field when trying to select a small % of large communities:

  • Document metadata:
    • "sourceKey"
    • "_id"
    • "url"
    • "entities.index"
  • Document content:
    • "sourceKey"
    • "url"
  • Entity features
    • "index"
    • "disambiguated_name"
    • "alias"
  • Association feature
    • "index"
  • Custom metadata tables:
    • "_id"
    • Any fields specified for sorting

E.g. in the example above, "source" is not indexed and therefore would cause a full scan within the community, which for larger communities is sub-optimal. A future release will allow use of the full query API to provide more flexible queries at high speeds.

Also note: If running in standalone mode you cannot use Infinit.e-style queries, or "$srctags" or "$caches" modifiers.

Authorization - Advanced Topics

The Infinit.e custom API inserts two fields:

  • infinit.e.userid: The _id of the calling user
  • infinit.e.is.admin: Whether the calling user has admin privileges over the Infinit.e cluster (true/false)

These can be used by the custom Hadoop JARs (which currently are mostly unlimited in terms of their access controls) to decide whether to perform a given action.
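
For example, a minimal fragment (using only the two field names above) that a mapper or reducer could use to gate a privileged action:

// Minimal sketch - reads the two authorization fields from the job configuration
// inside a mapper or reducer.
String callingUserId = context.getConfiguration().get("infinit.e.userid");
boolean isAdmin = Boolean.parseBoolean(context.getConfiguration().get("infinit.e.is.admin"));
if (!isAdmin) {
    // skip or restrict any privileged action here
}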

Mapping previous output (selfmerge) - Advanced Topics

If you want to map your previous job's results back into your current job (as an extra split, 1 record per map call) you can set selfMerge to true.  If running locally, set the config variable "infinit.e.selfMerge" to a second input collection, e.g. run the job once with input: collA, output: collB, then run the job a second time with input: collA, infinit.e.selfMerge: collB, output: collC.

This can be turned on in plugins.jsp by selecting true for self merge next to the input collection dropdown.

An example of when you would want to do this: you emit every doc's entities, then cluster in the reducer based on some criteria (number of similar entities, timeframe, proximity, etc), outputting each cluster with a unique ID.  In future runs you only want to run on new docs since the last run, but you still want to cluster with old clusters and keep the existing IDs.  With selfMerge you can feed your old clusters back in, and just update them with the new documents' entities.
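
To test this locally you can reuse the standalone harness in the same way as the job-chaining example above - a hedged sketch, where the config file names are illustrative and config2.xml is assumed to set infinit.e.selfMerge to the first run's output collection:

// Run 1: input collA, output collB
int exitCode = InfiniteHadoopTestUtils.runStandalone( new SourceSumXML(), "config/config1.xml", args );
// Run 2: input collA, infinit.e.selfMerge collB, output collC
exitCode = InfiniteHadoopTestUtils.runStandalone( new SourceSumXML(), "config/config2.xml", args );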

Appending and merging records - Advanced Topics

By default each time a job is re-submitted, all the records from the previous run are deleted before new ones are created.

The job can be run in "append" mode instead (when submitting via the Infinit.e API this is the "appendResults" field); when submitting directly, non-append mode must be simulated manually by dropping the output collection before running.

There are 2 different append modes:

  • Normal append mode ("incrementalMode" null or false): new records are added to the collection with no de-duplication performed on key.
  • Incremental mode ("incrementalMode" true - corresponds to the Config XML parameter "update.incremental"): described below.

In "Incremental mode" (also called "Extra reduce" mode, eg in the plugin manager), when an object is emitted from the reducer, it is compared to objects already in the output collection. All existing objects with the same key together with the newly emitted object are then reduced one final time (in a new reducer object - developers can recognize this by checking whether "context.getOutputCommitter()" is null (if it is, then the reducer is for the "extra reduce step"). The newly emitted object is always last (normally there will just be two elements, existing and new - but if just switched from non-incremental mode then there could be more). 

If no object is emitted from the extra reduce step, then the existing object is left alone (there is currently no way of deleting the existing object, though that would be quite easy to add if needed).
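
A hedged sketch of what recognizing the extra reduce step looks like in a count-style reducer (the merge logic itself is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hedged sketch only - detects the incremental-mode "extra reduce" pass.
public class IncrementalAwareReducerSketch extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Null output committer => this is the "extra reduce" (incremental merge) pass,
        // where the existing output object(s) come first and the newly emitted one is last.
        boolean isExtraReduceStep = (context.getOutputCommitter() == null);
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        if (isExtraReduceStep) {
            // e.g. preserve identifiers from the existing object here before re-emitting
        }
        context.write(key, new IntWritable(sum));
    }
}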

Deployment

Once you have a map reduce job you want to run on the full dataset, it's time to upload it to the Infinit.e system.  A GUI tool is available for easy uploading and configuration editing here.  Most of the settings you created in the config file will need to be copied over into the appropriate fields.  To create a packaged jar for your plugin from Eclipse you need to:

  1. Create a run configuration for your plugin.  You can do this through Run Configurations, or a really easy way is to right-click on the Java file that contains the main function with ToolRunner.run() and choose Run As -> Java Application.  The application will crash, but it will also create a run configuration to use in the next step.
  2. Once you have a run configuration set up, right-click on your project and choose Export.
  3. Choose Java -> Runnable JAR file
  4. Select the launch configuration you just created (probably your java files name e.g. SourceSumXML)
  5. Choose a location to export to, leave library handling set to "Extract required libraries into generated JAR", and click Finish.

Now that you have a jar file you can upload it to the Plugin Manager referenced above.  Select the same settings as your config file and you are good to go.  A guide to using the plugin manager can be found here.

 

Development with Hadoop 2.5 (CDH5)

Some small changes need to be made to run jobs locally on a Windows machine:

  1. Pull the project infinit.e.processing.custom.library (from ikanow_infinit.e_community repo)
  2. Pull the project infinit.e.data_model (from ikanow_infinit.e_community repo)
  3. Create a system environment variable named "HADOOP_HOME" pointed at infinit.e.processing.custom.library/win_hadoop_home (or copy those files somewhere on your machine and point at that)
  4. Add "%HADOOP_HOME%\bin" to the windows PATH (and ensure it is update, eg restart eclipse/relaunch run configuration)
  5. Copy the jars in "infinit.e.processing.custom.library/standalone_libs" into your Eclipse project and add them to the build path
  6. Add the data_model project to your build path

(In order to run custom source tests via the tomcat API/UI, only steps 3 and 4 are needed).

Copyright © 2012 IKANOW, All Rights Reserved | Licensed under Creative Commons