Automated text extraction
Format
{
    "display": string, // For display purposes only: a string to help the creator keep track of this block
    "textEngine": {
        "criteria": string, // A JavaScript expression that is passed the document as _doc; if it returns false, this pipeline element is bypassed
        "engineName": string, // The name of the text engine to use: either fully qualified (e.g. "com.ikanow.infinit.e.harvest.boilerpipe") or just the name (e.g. "boilerpipe") if the engine is registered in the Infinit.e system configuration
        "engineConfig": { "config_param_name": string, ... }, // The configuration object to be passed to the engine
        "exitOnError": boolean // If true (the default), errors during text extraction cause the document to be removed from the pipeline; if false, processing continues
    }
}
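As an illustration of the fields above, the following JavaScript sketch builds a textEngine pipeline element and mimics the documented criteria behaviour, where an expression that returns false bypasses the element. The helper shouldRunElement, the sample documents, and the particular criteria expression are hypothetical, and treating a missing criteria as "always run" is an assumption; how the platform itself evaluates criteria is not shown on this page.

```javascript
// A textEngine pipeline element using the fields described above.
// The criteria expression is an illustrative assumption.
const element = {
  display: "Extract body text with boilerpipe",
  textEngine: {
    criteria: "_doc.url.indexOf('.pdf') < 0", // skip documents whose URL contains ".pdf"
    engineName: "boilerpipe",
    exitOnError: false
  }
};

// Hypothetical helper: evaluates the criteria string with the document bound to _doc.
// Assumption: an element without criteria always runs.
function shouldRunElement(pipelineElement, doc) {
  const criteria = pipelineElement.textEngine.criteria;
  if (!criteria) {
    return true;
  }
  const evaluate = new Function("_doc", "return (" + criteria + ");");
  return evaluate(doc) !== false;
}

const htmlDoc = { url: "http://www.cnn.com/2014/08/28/world/meast/isis-iraq-syria/index.html" };
const pdfDoc = { url: "http://www.rand.org/content/dam/rand/pubs/occasional_papers/2012/RAND_OP377.pdf" };

console.log(shouldRunElement(element, htmlDoc)); // true: the element runs
console.log(shouldRunElement(element, pdfDoc));  // false: the element is bypassed
```

Setting exitOnError to false here means a failed extraction would leave the document in the pipeline rather than removing it.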
Examples
Extractor Comparisons
Consider the following HTML source here.
The output varies depending on which automated text extractor is selected in the configuration.
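Since the examples on this page differ only in the engineName value, the source variants can be generated from one base source. The sketch below assumes an abbreviated copy of the example source; the helper withEngine is hypothetical and not part of the platform.

```javascript
// Abbreviated copy of the source used in the examples on this page.
const baseSource = {
  title: "cnn world",
  mediaType: "html",
  processingPipeline: [
    {
      web: {
        extraUrls: [
          { url: "http://www.cnn.com/2014/08/28/world/meast/isis-iraq-syria/index.html" }
        ]
      },
      display: ""
    },
    { textEngine: { exitOnError: true, engineName: "alchemyapi" }, display: "" }
  ]
};

// Hypothetical helper: returns a copy of the source with a different text engine.
function withEngine(source, engineName) {
  const copy = JSON.parse(JSON.stringify(source)); // deep copy; the input is left untouched
  for (const element of copy.processingPipeline) {
    if (element.textEngine) {
      element.textEngine.engineName = engineName;
    }
  }
  return copy;
}

// The engine names used in the examples on this page.
const engineNames = ["alchemyapi", "alchemyapi-metadata", "boilerpipe", "tika"];
const variants = engineNames.map((name) => withEngine(baseSource, name));
console.log(variants.map((v) => v.processingPipeline[1].textEngine.engineName));
```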
Alchemy API
Extracts text from the web page and places it in the document's fullText.
Source:
{
    "_id": "53ff4888e4b005d3891eab23",
    "communityIds": [ "53add292e4b015f8f5817611" ],
    "created": "Aug 28, 2014 03:19:36 PM UTC",
    "description": "cnn",
    "harvestBadSource": false,
    "isApproved": true,
    "key": "www.cnn.com.2014.08.28.world.meast.isis-iraq-syria.index..",
    "mediaType": "html",
    "modified": "Aug 28, 2014 03:19:36 PM UTC",
    "ownerId": "5346aa83e4b017d7e4acadb5",
    "processingPipeline": [
        {
            "web": {
                "extraUrls": [
                    {
                        "url": "http://www.cnn.com/2014/08/28/world/meast/isis-iraq-syria/index.html",
                        "title": "U.N. says peacekeepers detained in Golan Heights - CNN.com",
                        "description": "(no description)"
                    }
                ]
            },
            "display": ""
        },
        {
            "textEngine": {
                "exitOnError": true,
                "engineName": "alchemyapi"
            },
            "display": ""
        }
    ],
    "shah256Hash": "7DrUfLPz4DS3w1xLz0o1ait2DrDUPmqPOGSJDiXt77o=",
    "title": "cnn world"
}
Output:
{ "communityId": ["53add292e4b015f8f5817611"], "created": "Aug 28, 2014 03:41:16 PM UTC", "description": "(no description)", "fullText": "(CNN) -- [Breaking news update at 10:47 a.m.] \nAn armed group detained 43 U.N. peacekeepers in the Golan Heights area early Thursday, the U.N. said. \n[Previous story, published at 10:09 a.m.] \nFresh fighting in Iraq; ISIS claims mass execution in Syria \n(CNN) -- ISIS said Thursday that it has executed at least 250 Syrian soldiers at an air base in the northeastern city of Raqqa. \nThe group said on one of its official websites that it killed the soldiers Wednesday. It also claimed to have killed some 600 government soldiers in the fight for the al Tabqa air base since August 19. \nThe Syrian Observatory for Human Rights, meanwhile, reported that 200 Syrian soldiers and 346 ISIS fighters died in the fight for the air base. Hundreds more were wounded, the London-based activist group said. \nCNN could not independently confirm the claims. \nThe news comes amid reports of fresh fighting near the Mosul Dam and the burning of oil wells near the strategic town of Zummar, Iraq -- important because of its location near a main road connecting Mosul to the Syrian border. \n Signs point to U.S. airstrikes in Syria \n Car bomb detonates in Baghdad rush hour \n What would McCain do against ISIS? \nThe Peshmerga are battling the militants near the town of Zummar, the Mosul Dam and the strategic Ayn Zala oilfields, which ISIS forces seized from the Kurds this month, said Faud Hussein, chief of staff for Kurdish regional President Masoud Barzani. \nTorching the oil wells is an apparent effort by ISIS fighters to cover their tracks as Peshmerga forces press toward ISIS positions, Hussein said. \nThe extent of the damage to the oil fields wasn't immediately known. \nAt least 50 ISIS militants were killed in fighting near the Mosul Dam on Thursday, said Hemin Hawrami, head of the Foreign Relations Office of the Kurdistan Democratic Party. 
Kurdish forces also destroyed several ISIS vehicles, he said. \nOne Peshmerga fighter died and five were wounded in the fighting, Hawrami said. \nThe fighting comes nearly two weeks after thousands of Peshmerga and Iraqi commandos ousted ISIS forces for control of the dam, a crucial facility that provides electricity for millions of people in Iraq. \nKurdish officials have credited U.S. airstrikes against ISIS -- which calls itself the \"Islamic State\" -- with helping Peshmerga forces push back against ISIS forces, whose breathtaking gains and brutal tactics captured the attention of world leaders. \nMeanwhile, U.S. President Barack Obama is considering airstrikes and humanitarian airdrops to help save thousands of Iraq's Shiite Turkmen, who officials said face potential slaughter by ISIS. \nISIS fighters have besieged the town of Amerli, about 140 miles (225 kilometers) southeast of Mosul, since the Sunni extremists swept into Iraq from Syria in mid-June. The town's fewer than 20,000 residents -- half of them women and children, according to the United Nations -- are without power. \n\"Residents are enduring harsh living conditions with severe food and water shortages, and a complete absence of medical services -- and there are fears of a possible imminent massacre,\" U.N. High Commissioner for Human Rights Navi Pillay said this week. \nTheir situation echoes the ordeal of Iraq's ethnic Yazidis, whose plight after they were forced to flee into the mountains to escape ISIS militants triggered U.S. aid drops and the first U.S. airstrikes against ISIS. \nSimilar to the chaotic scenes that played out in the Sinjar Mountains, Iraqi military helicopters have been carrying out food drops and picking up Turkmen desperate to get out. \nWhich groups are at risk in Iraq? \nScant defenses \nSurrounded on four sides, the 17,400 residents have had to defend themselves with only the help of local police, Masrwr Aswad of Iraq's Human Rights Commission has said. 
\nISIS has vowed to push the Shiite Turkmen out, calling them heretics. \nTurkmen are descendants of Turkic-speaking, traditionally nomadic people who share cultural ties with Turkey. There are Sunni and Shiite Turkmen in Iraq, and they account for up to 3% of Iraq's population. \nU.N. report alleges atrocities \nOn Wednesday, U.N. human rights investigators accused ISIS and Syrian government forces of committing war crimes and atrocities in their brutal fight in Syria. \nThe U.N. report said public executions, torture and mock crucifixions have become regular fixtures in ISIS-controlled areas of Syria. It also said that the extremist group is forcing children to fight. \n\"Among the most disturbing findings in this report are accounts of large training camps, where children, mostly boys, from the age of 14 are recruited and trained to fight in the ranks of ISIS along with adults,\" said Paulo Pinheiro, the chairman of the U.N. commission of inquiry on Syria. \nThe report also accuses the regime of Syrian President Bashar al-Assad of repeatedly using chemical weapons against civilians. \nThe U.N. investigators said the Syrian government dropped what was thought to be chlorine gas on civilian areas on eight different occasions in April. \nThe government forces are believed to have made particular use of barrel bombs dropped by helicopters to unleash the gas, said Vitit Muntarbhorn, a commissioner with the inquiry. \nU.S. airstrikes in Syria? \nChelsea J. Carter and Michael Pearson reported and wrote from Atlanta. CNN's Barbara Starr, Anna Coren, Hala Gorani and Jethro Mullen contributed to this report.", "mediaType": ["html"], "modified": "Aug 28, 2014 03:41:16 PM UTC", "publishedDate": "Aug 28, 2014 03:41:16 PM UTC", "source": ["cnn world"], "sourceKey": ["www.cnn.com.2014.08.28.world.meast.isis-iraq-syria.index.."], "title": "U.N. says peacekeepers detained in Golan Heights - CNN.com", "url": "http://www.cnn.com/2014/08/28/world/meast/isis-iraq-syria/index.html" }
Alchemy API Metadata
Extracts text from the web page and also provides entities and associations.
Source:
{
    "_id": "53ff4888e4b005d3891eab23",
    "communityIds": [ "53add292e4b015f8f5817611" ],
    "created": "Aug 28, 2014 03:19:36 PM UTC",
    "description": "cnn",
    "harvestBadSource": false,
    "isApproved": true,
    "key": "www.cnn.com.2014.08.28.world.meast.isis-iraq-syria.index..",
    "mediaType": "html",
    "modified": "Aug 28, 2014 03:19:36 PM UTC",
    "ownerId": "5346aa83e4b017d7e4acadb5",
    "processingPipeline": [
        {
            "web": {
                "extraUrls": [
                    {
                        "url": "http://www.cnn.com/2014/08/28/world/meast/isis-iraq-syria/index.html",
                        "title": "U.N. says peacekeepers detained in Golan Heights - CNN.com",
                        "description": "(no description)"
                    }
                ]
            },
            "display": ""
        },
        {
            "textEngine": {
                "exitOnError": true,
                "engineName": "alchemyapi-metadata"
            },
            "display": ""
        }
    ],
    "shah256Hash": "7DrUfLPz4DS3w1xLz0o1ait2DrDUPmqPOGSJDiXt77o=",
    "title": "cnn world"
}
Output:
fullText has been omitted to provide a more succinct example.
{
    "communityId": ["53add292e4b015f8f5817611"],
    "created": "Aug 28, 2014 03:45:54 PM UTC",
    "description": "(no description)",
    "entities": [
        {
            "actual_name": "ISIS",
            "dimension": "What",
            "disambiguated_name": "ISIS",
            "doccount": 0,
            "frequency": 1,
            "index": "isis/keyword",
            "relevance": 0.975508,
            "sentiment": -0.124779,
            "totalfrequency": -1,
            "type": "Keyword"
        },
        {
            "actual_name": "U.S. airstrikes",
            "dimension": "What",
            "disambiguated_name": "U.S. airstrikes",
            "doccount": 0,
            "frequency": 1,
            "index": "u.s. airstrikes/keyword",
            "relevance": 0.830272,
            "sentiment": -0.387974,
            "totalfrequency": -1,
            "type": "Keyword"
        },
        {
            "actual_name": "ISIS militants",
            "dimension": "What",
            "disambiguated_name": "ISIS militants",
            "doccount": 0,
            "frequency": 1,
            "index": "isis militants/keyword",
            "relevance": 0.813208,
            "sentiment": -0.343482,
            "totalfrequency": -1,
            "type": "Keyword"
        },
        {
            "actual_name": "ISIS fighters",
            "dimension": "What",
            "disambiguated_name": "ISIS fighters",
            "doccount": 0,
            "frequency": 1,
            "index": "isis fighters/keyword",
            "relevance": 0.809866,
            "sentiment": -0.356031,
            "totalfrequency": -1,
            "type": "Keyword"
        }
    ],
    "mediaType": ["html"],
    "modified": "Aug 28, 2014 03:45:54 PM UTC",
    "publishedDate": "Aug 28, 2014 03:45:54 PM UTC",
    "source": ["cnn world"],
    "sourceKey": ["www.cnn.com.2014.08.28.world.meast.isis-iraq-syria.index.."],
    "title": "U.N. says peacekeepers detained in Golan Heights - CNN.com",
    "url": "http://www.cnn.com/2014/08/28/world/meast/isis-iraq-syria/index.html"
}
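The entities returned by this engine carry relevance and sentiment scores, so a consumer can rank or filter them. The following sketch uses an abbreviated copy of the entities above; the helper topEntities and the 0.82 threshold are hypothetical choices for illustration.

```javascript
// Abbreviated entities from the output above (only the fields used here).
const entities = [
  { disambiguated_name: "ISIS", type: "Keyword", relevance: 0.975508, sentiment: -0.124779 },
  { disambiguated_name: "U.S. airstrikes", type: "Keyword", relevance: 0.830272, sentiment: -0.387974 },
  { disambiguated_name: "ISIS militants", type: "Keyword", relevance: 0.813208, sentiment: -0.343482 },
  { disambiguated_name: "ISIS fighters", type: "Keyword", relevance: 0.809866, sentiment: -0.356031 }
];

// Hypothetical helper: names of entities at or above a relevance threshold, most relevant first.
function topEntities(list, minRelevance) {
  return list
    .filter((e) => e.relevance >= minRelevance)
    .sort((a, b) => b.relevance - a.relevance)
    .map((e) => e.disambiguated_name);
}

console.log(topEntities(entities, 0.82)); // [ 'ISIS', 'U.S. airstrikes' ]
```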
Boilerpipe
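Produces text very similar to the Alchemy API extraction. In the example below, the Boilerpipe output also retains the headline, byline, and story highlights.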
Source:
Same as the examples above, except with "engineName": "boilerpipe"
Output:
{ "communityId": ["53add292e4b015f8f5817611"], "created": "Aug 28, 2014 03:50:28 PM UTC", "description": "(no description)", "fullText": "U.N. says peacekeepers detained by armed group in Golan Heights\nBy Chelsea J. Carter and Michael Pearson, CNN\nupdated 10:47 AM EDT, Thu August 28, 2014\nSTORY HIGHLIGHTS\nISIS claims to have executed 250 Syrian soldiers\nMilitants set fire to oil wells amid continued fighting near Mosul Dam and a strategic town\nAbout 50 militants and one Peshmerga fighter are dead, a Kurdish official says\nThe U.S. is considering airstrikes to aid Turkmen besieged by ISIS\n(CNN) -- [Breaking news update at 10:47 a.m.]\nAn armed group detained 43 U.N. peacekeepers in the Golan Heights area early Thursday, the U.N. said.\n[Previous story, published at 10:09 a.m.]\nFresh fighting in Iraq; ISIS claims mass execution in Syria\n(CNN) -- ISIS said Thursday that it has executed at least 250 Syrian soldiers at an air base in the northeastern city of Raqqa.\nThe group said on one of its official websites that it killed the soldiers Wednesday. It also claimed to have killed some 600 government soldiers in the fight for the al Tabqa air base since August 19.\nThe Syrian Observatory for Human Rights, meanwhile, reported that 200 Syrian soldiers and 346 ISIS fighters died in the fight for the air base. Hundreds more were wounded, the London-based activist group said.\nCNN could not independently confirm the claims.\nThe news comes amid reports of fresh fighting near the Mosul Dam and the burning of oil wells near the strategic town of Zummar, Iraq -- important because of its location near a main road connecting Mosul to the Syrian border.\nSigns point to U.S. 
airstrikes in Syria\nCar bomb detonates in Baghdad rush hour\nCongresswoman questions Obama ISIS plan\nWhat would McCain do against ISIS?\nThe Peshmerga are battling the militants near the town of Zummar, the Mosul Dam and the strategic Ayn Zala oilfields, which ISIS forces seized from the Kurds this month, said Faud Hussein, chief of staff for Kurdish regional President Masoud Barzani.\nTorching the oil wells is an apparent effort by ISIS fighters to cover their tracks as Peshmerga forces press toward ISIS positions, Hussein said.\nThe extent of the damage to the oil fields wasn't immediately known.\nAt least 50 ISIS militants were killed in fighting near the Mosul Dam on Thursday, said Hemin Hawrami, head of the Foreign Relations Office of the Kurdistan Democratic Party. Kurdish forces also destroyed several ISIS vehicles, he said.\nOne Peshmerga fighter died and five were wounded in the fighting, Hawrami said.\nThe fighting comes nearly two weeks after thousands of Peshmerga and Iraqi commandos ousted ISIS forces for control of the dam, a crucial facility that provides electricity for millions of people in Iraq.\nKurdish officials have credited U.S. airstrikes against ISIS -- which calls itself the \"Islamic State\" -- with helping Peshmerga forces push back against ISIS forces, whose breathtaking gains and brutal tactics captured the attention of world leaders.\nMeanwhile, U.S. President Barack Obama is considering airstrikes and humanitarian airdrops to help save thousands of Iraq's Shiite Turkmen, who officials said face potential slaughter by ISIS.\nISIS fighters have besieged the town of Amerli , about 140 miles (225 kilometers) southeast of Mosul, since the Sunni extremists swept into Iraq from Syria in mid-June. 
The town's fewer than 20,000 residents -- half of them women and children, according to the United Nations -- are without power.\n\"Residents are enduring harsh living conditions with severe food and water shortages, and a complete absence of medical services -- and there are fears of a possible imminent massacre,\" U.N. High Commissioner for Human Rights Navi Pillay said this week.\nTheir situation echoes the ordeal of Iraq's ethnic Yazidis , whose plight after they were forced to flee into the mountains to escape ISIS militants triggered U.S. aid drops and the first U.S. airstrikes against ISIS.\nSimilar to the chaotic scenes that played out in the Sinjar Mountains, Iraqi military helicopters have been carrying out food drops and picking up Turkmen desperate to get out.\nWhich groups are at risk in Iraq?\nScant defenses\nSurrounded on four sides, the 17,400 residents have had to defend themselves with only the help of local police, Masrwr Aswad of Iraq's Human Rights Commission has said.\nISIS has vowed to push the Shiite Turkmen out, calling them heretics.\nTurkmen are descendants of Turkic-speaking, traditionally nomadic people who share cultural ties with Turkey. There are Sunni and Shiite Turkmen in Iraq, and they account for up to 3% of Iraq's population.\nU.N. report alleges atrocities\nOn Wednesday, U.N. human rights investigators accused ISIS and Syrian government forces of committing war crimes and atrocities in their brutal fight in Syria.\nThe U.N. report said public executions, torture and mock crucifixions have become regular fixtures in ISIS-controlled areas of Syria. It also said that the extremist group is forcing children to fight.\n\"Among the most disturbing findings in this report are accounts of large training camps, where children, mostly boys, from the age of 14 are recruited and trained to fight in the ranks of ISIS along with adults,\" said Paulo Pinheiro, the chairman of the U.N. 
commission of inquiry on Syria.\nThe report also accuses the regime of Syrian President Bashar al-Assad of repeatedly using chemical weapons against civilians.\nThe U.N. investigators said the Syrian government dropped what was thought to be chlorine gas on civilian areas on eight different occasions in April.\nThe government forces are believed to have made particular use of barrel bombs dropped by helicopters to unleash the gas, said Vitit Muntarbhorn, a commissioner with the inquiry.\n", "mediaType": ["html"], "modified": "Aug 28, 2014 03:50:28 PM UTC", "publishedDate": "Aug 28, 2014 03:50:28 PM UTC", "source": ["cnn world"], "sourceKey": ["www.cnn.com.2014.08.28.world.meast.isis-iraq-syria.index.."], "title": "U.N. says peacekeepers detained in Golan Heights - CNN.com", "url": "http://www.cnn.com/2014/08/28/world/meast/isis-iraq-syria/index.html" }
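One way to see how the Alchemy API and Boilerpipe outputs above differ is to compare their fullText values line by line. The sketch below uses short, abbreviated excerpts of the two outputs (intermediate lines are skipped); the helper linesOnlyIn is hypothetical.

```javascript
// Abbreviated excerpts of the two fullText values shown above.
const alchemyText =
  "(CNN) -- [Breaking news update at 10:47 a.m.] \n" +
  "An armed group detained 43 U.N. peacekeepers in the Golan Heights area early Thursday, the U.N. said. \n";

const boilerpipeText =
  "U.N. says peacekeepers detained by armed group in Golan Heights\n" +
  "By Chelsea J. Carter and Michael Pearson, CNN\n" +
  "updated 10:47 AM EDT, Thu August 28, 2014\n" +
  "STORY HIGHLIGHTS\n" +
  "(CNN) -- [Breaking news update at 10:47 a.m.]\n" +
  "An armed group detained 43 U.N. peacekeepers in the Golan Heights area early Thursday, the U.N. said.\n";

// Hypothetical helper: non-empty, trimmed lines of textA that do not appear in textB.
function linesOnlyIn(textA, textB) {
  const normalize = (text) =>
    text.split("\n").map((line) => line.trim()).filter((line) => line.length > 0);
  const inB = new Set(normalize(textB));
  return normalize(textA).filter((line) => !inB.has(line));
}

// Lines that only Boilerpipe kept: headline, byline, timestamp, and highlights header.
console.log(linesOnlyIn(boilerpipeText, alchemyText));
// Lines that only Alchemy API kept in these excerpts: none.
console.log(linesOnlyIn(alchemyText, boilerpipeText));
```

Trimming each line before comparison matters here because the Alchemy API output ends its lines with a trailing space.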
Tika
Tika can be used to extract text from PDF files located on the web.
Consider the PDF file located here.
Tika processes the PDF and adds metadata and keywords.
Source:
Same as the examples above, except with "engineName": "tika"
Output:
fullText has been omitted to provide a more succinct example.
{
    "communityId": ["53add292e4b015f8f5817611"],
    "created": "Aug 28, 2014 04:35:38 PM UTC",
    "description": "(no description)",
    "mediaType": ["pdf"],
    "metadata": {
        "_FILE_METADATA_": [
            {
                "metadata": {
                    "Author": ["Ben Connable"],
                    "Content-Type": ["application/pdf"],
                    "Creation-Date": ["2012-07-12T14:32:08Z"],
                    "GTS_PDFXConformance": ["PDF/X-1a:2001"],
                    "GTS_PDFXVersion": ["PDF/X-1:2001"],
                    "Last-Modified": ["2012-07-12T18:27:42Z"],
                    "created": ["Thu Jul 12 10:32:08 EDT 2012"],
                    "creator": ["Adobe InDesign CS5 (7.0.4)"],
                    "producer": ["Acrobat Distiller 10.1.3 (Macintosh)"],
                    "subject": ["This paper proposes a paradigm shift in how military intelligence is fused. The concept, behavioral intelligence analysis, provides a more complete picture of the complex counterinsurgency environment."],
                    "title": ["Military Intelligence Fusion for Complex Operations: A New Paradigm"],
                    "trapped": ["False"],
                    "xmpTPg:NPages": ["38"]
                }
            }
        ]
    },
    "modified": "Aug 28, 2014 04:35:38 PM UTC",
    "publishedDate": "Jul 12, 2012 02:32:08 PM UTC",
    "source": ["pdf military"],
    "sourceKey": ["www.rand.org.content.dam.rand.pubs.occasional_papers.2012."],
    "title": "Military Intelligence Fusion for Complex Operations: A New Paradigm",
    "url": "http://www.rand.org/content/dam/rand/pubs/occasional_papers/2012/RAND_OP377.pdf"
}
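In the output above, the Tika metadata is nested under metadata, then _FILE_METADATA_, then a list of objects whose metadata fields each hold an array of values. The sketch below reads single values out of that structure, using an abbreviated copy of the document; the helper fileMetadata is hypothetical.

```javascript
// Abbreviated copy of the Tika output above (only a few metadata fields kept).
const doc = {
  title: "Military Intelligence Fusion for Complex Operations: A New Paradigm",
  mediaType: ["pdf"],
  metadata: {
    _FILE_METADATA_: [
      {
        metadata: {
          Author: ["Ben Connable"],
          "Content-Type": ["application/pdf"],
          "xmpTPg:NPages": ["38"]
        }
      }
    ]
  }
};

// Hypothetical helper: first value of a file metadata field, or null if the field is absent.
function fileMetadata(document, field) {
  const entries = (document.metadata && document.metadata._FILE_METADATA_) || [];
  for (const entry of entries) {
    const values = entry.metadata && entry.metadata[field];
    if (Array.isArray(values) && values.length > 0) {
      return values[0];
    }
  }
  return null;
}

console.log(fileMetadata(doc, "Author"));                // Ben Connable
console.log(Number(fileMetadata(doc, "xmpTPg:NPages"))); // 38
console.log(fileMetadata(doc, "Keywords"));              // null
```

Note that values such as the page count arrive as strings, so numeric fields need an explicit conversion.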
Footnotes:
Legacy documentation:
- Replaces "useTextExtractor" in the Source object