NLP in the Cloud: Measuring the Quality of NLP APIs
Just how good is named entity recognition in the cloud?
Natural Language Processing seems to have become something of a commodity in recent years. More than a few companies have sprung up that offer basic NLP capabilities through a cloud API. If you’d like to know whether a text carries a positive or negative message, or what people or companies it mentions, you can just send it to one of these black boxes, and receive the answer in less than a second. Superficially, all these NLP APIs look more or less the same. Textrazor, AlchemyAPI, Aylien, MeaningCloud and Lexalytics all offer similar services (named entity recognition, sentiment analysis, keyword extraction, topic identification, etc.), and do so through similar interfaces. However, this uniform surface conceals huge differences in quality.
In this article, I evaluate five NLP APIs on one specific, seemingly simple task: named entity recognition (NER) for English. In named entity recognition, the goal is to identify so-called named entities, such as locations, people and organizations. This is useful for many applications in information extraction, such as search engines or question answering software. Most APIs offer more entity types than the three main categories (for example, dates or events), but since locations, people and organizations are the most widely accepted and available, they are the ones I will focus on.
To evaluate the output of these APIs, I collected one hundred sentences from a range of news websites, including the Guardian, BBC and CNN. All of these sentences contain at least one entity. They are well-written sentences in grammatical English, so they should pose fewer problems than, say, tweets. They cover news stories from different corners of the world and various domains (politics, sports, science, etc.), so the entities in them are pretty varied. To give an example of the output I expect, take the sentence
A diplomatic fight, usually kept behind closed doors, exploded in public Thursday as U.N. Secretary-General Ban Ki-Moon accused Saudi Arabia and its military allies of placing "undue pressure" on the international organization.
Good NER software should identify U.N. as an organization, Ban Ki-Moon as a person, and Saudi Arabia as a location.
I used this test set of sentences to evaluate the five APIs that are discussed in Robert Dale’s article “NLP meets the cloud”: Textrazor, AlchemyAPI (which is now part of IBM Watson), Aylien, MeaningCloud and Lexalytics. These all offer a free account with a limited number of calls that more than suffices for a thorough evaluation of their output. Interfacing with each API happens through manual REST calls or a custom library in your favorite programming language. Some interfaces are more confusing than others (looking at you, Lexalytics), but in general, setting this up is simple enough.
Per entity type, I use three standard metrics to measure the quality of the API:
- Precision: If an API identifies a word or word sequence as an entity, how likely is this to be correct?
- Recall: What percentage of entities in the text does the API identify correctly?
- F-score: the (harmonic) mean of precision and recall captures the quality of the API for each entity type in one single figure.
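As a minimal sketch of how these three figures relate, the Python snippet below computes them from raw counts. The function names and the example counts are illustrative assumptions, not the code or data used in this evaluation.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of the entities returned by the API that are correct."""
    returned = true_positives + false_positives
    return true_positives / returned if returned else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of the entities in the text that the API found."""
    expected = true_positives + false_negatives
    return true_positives / expected if expected else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (the balanced F1 variant)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 8 correct entities, 2 spurious ones, 4 missed ones.
p = precision(8, 2)
r = recall(8, 4)
print(f"P={p:.3f} R={r:.3f} F={f_score(p, r):.3f}")
# prints: P=0.800 R=0.667 F=0.727
```

Because the harmonic mean is pulled towards the smaller of its two inputs, an API with very high precision but low recall still ends up with a modest F-score.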
Here are some additional rules of the game:
- I use the available APIs as off-the-shelf services, without tweaking the results. Some APIs give a confidence score for every entity, which would allow users to filter out entities with a low confidence. This is potentially interesting, but it also takes quite some effort, so I just evaluate the output as-is.
- When an API misclassifies a location as an organization, or vice versa, I don’t penalize it and just follow the classification of the API. Place names are often ambiguous (as in Iran said …), so let’s not be too strict.
- When an API identifies an entity but does not classify it as a person, location or organization, I ignore it.
- Some APIs are able to identify adjectives (such as Chinese) as locations (China), others are not. If they do so, I count these entities as correct, but if they don’t, I don’t penalize them. Again, let’s not be too strict.
- Some APIs not only identify entities in a sentence, but also try to determine the real-world entity that it refers to, and give a link to the relevant Wikipedia page. Because not all APIs provide it, I ignore this reference, whether it is correct or not.
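The rules above can be sketched as a small scoring function. This is only one possible reading of them, not the script used for this article: the `Gold` class, the `PER`/`LOC`/`ORG` labels, the exact-string matching and the example API output are all assumptions made for illustration.

```python
from dataclasses import dataclass

MAIN_TYPES = {"PER", "LOC", "ORG"}


@dataclass(frozen=True)
class Gold:
    text: str
    label: str
    optional: bool = False  # e.g. adjectives such as "Chinese"


def labels_match(gold_label: str, predicted_label: str) -> bool:
    """Exact match, except that LOC/ORG confusion is not penalized."""
    if gold_label == predicted_label:
        return True
    return {gold_label, predicted_label} == {"LOC", "ORG"}


def score(gold: list[Gold], predicted: list[tuple[str, str]]) -> tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    # Entities outside the three main types are ignored altogether.
    predicted = [(text, label) for text, label in predicted if label in MAIN_TYPES]
    unmatched = list(gold)
    tp = fp = 0
    for text, label in predicted:
        hit = next(
            (g for g in unmatched if g.text == text and labels_match(g.label, label)),
            None,
        )
        if hit is None:
            fp += 1
        else:
            tp += 1
            unmatched.remove(hit)
    # Optional gold entities that were missed do not count against the API.
    fn = sum(1 for g in unmatched if not g.optional)
    return tp, fp, fn


# Gold annotation for the example sentence about Ban Ki-Moon.
gold = [Gold("U.N.", "ORG"), Gold("Ban Ki-Moon", "PER"), Gold("Saudi Arabia", "LOC")]
# Hypothetical API output: one LOC/ORG confusion and one entity of another type.
predicted = [("U.N.", "ORG"), ("Saudi Arabia", "ORG"), ("Thursday", "DATE")]
print(score(gold, predicted))
# prints: (2, 0, 1)
```

In this made-up output, Saudi Arabia still counts as correct despite being labelled an organization, the date is ignored, and the missed person is the only false negative.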
Textrazor’s offering is pretty complex, because it classifies entities with both DBPedia categories and Freebase types. Not only do these two sources have completely different lists of categories, the two classifications sometimes disagree. In a sentence such as “Iraqi troops begin operation to seize Falluja from Isis”, Isis is classified as a Freebase Organization (correct), but a DBPedia Place (incorrect). In this exercise, I chose to work with the Freebase types. Additionally, Textrazor tends to give long lists of categories for every entity, most of which are really specific and, frankly, irrelevant. Are you interested in the fact that the United States is not only a /location/location, but also a /meteorology/cyclone_affected_area, /fictional_universe/fictional_setting or /travel/travel_destination, wherever it occurs? I know I’m not.
Aylien’s API, too, is rather confusing, but in a different way: the API has two endpoints that can be used for entity extraction (Entity and Concept extraction), and their results can be very different. Aylien’s Concept extraction makes use of an external knowledge base (e.g. DBPedia), whereas its Entity extraction is purely self-contained. Concept extraction makes use of rather specific DBPedia entity types (e.g. Scientist, OfficeHolder, TennisPlayer, etc. are all types of Person), whereas Entity extraction has just the three main categories (Location, Person, Organization). Concept extraction can also return several entity types per entity, whereas Entity extraction sticks to one.
The remaining APIs are simpler, and return just one category per identified entity. The table below shows how I mapped them to the three main entity types.
| API | Location | Person | Organization |
|---|---|---|---|
| Textrazor | /location/location | /people/person, /royalty/noble_title | /organization/organization, /internet/website, /sports/sports_team |
| AlchemyAPI | City, Country, StateOrCountry, Region, Facility, GeographicFeature | Person | Company, Organization |
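In code, such a mapping could look like the sketch below. The type names are copied from the two rows of the table (Freebase types written with a leading slash throughout). The `main_type` helper and its rule of taking the first native type that maps to a main category are assumptions for illustration, not necessarily how conflicts were resolved in this evaluation.

```python
from typing import Optional

# Native entity types per API, mapped to the three main categories.
TYPE_MAP = {
    "Textrazor": {
        "/location/location": "Location",
        "/people/person": "Person",
        "/royalty/noble_title": "Person",
        "/organization/organization": "Organization",
        "/internet/website": "Organization",
        "/sports/sports_team": "Organization",
    },
    "AlchemyAPI": {
        "City": "Location",
        "Country": "Location",
        "StateOrCountry": "Location",
        "Region": "Location",
        "Facility": "Location",
        "GeographicFeature": "Location",
        "Person": "Person",
        "Company": "Organization",
        "Organization": "Organization",
    },
}


def main_type(api: str, native_types: list[str]) -> Optional[str]:
    """Return the main category of the first mappable native type, else None."""
    mapping = TYPE_MAP[api]
    for native in native_types:
        if native in mapping:
            return mapping[native]
    return None  # entity is ignored in the evaluation


# Textrazor can return many types for one entity; only one of these is mapped.
print(main_type("Textrazor", [
    "/meteorology/cyclone_affected_area",
    "/location/location",
    "/travel/travel_destination",
]))
# prints: Location
```

Entities for which `main_type` returns `None` fall outside the three main categories and are simply left out of the scores.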
Let’s now take a look at the results, starting with locations. Locations are generally the easiest type of entity to identify: four of the five APIs in this evaluation score better on locations than on organizations or people. Still, the differences in quality between the APIs are enormous. Textrazor does a great job, with an F-score of 95.9%. It finds 99% of all locations in the test sentences (recall), and when it identifies a word or word sequence as a location, this is correct in 93% of the cases (precision). AlchemyAPI is in second position, with an F-score of 90.8%. Its entities are also correct in 93% of the cases (precision), but it only finds 89% of all locations. Likewise, the other three APIs all achieve a precision of 93% or higher, but their recall suffers greatly: MeaningCloud finds 76% of all locations, Aylien’s Concept extraction 72%, its Entity extraction 68%, and Lexalytics a meagre 53%.
Let me give two examples to show what this means in practice. In the first sentence below, Aylien’s Concept extraction finds Belgium, but fails to classify it as a person, location or organization. Its Entity extraction does not identify it at all. Both endpoints miss all four people:
Belgium created the game’s first opening when Lukaku and Fellaini combined to tee up Nainggolan for a 25-yard drive that Buffon pushed away
Indian president Pranab Mukherjee arrived in Ivory Coast capital Abidjan on Tuesday for a two-day state visit, the first visit of an Indian president since the establishment of diplomatic relations between the two countries in 1960.
In the second sentence, Lexalytics misses both locations (Ivory Coast and Abidjan), as well as Pranab Mukherjee (person). That’s pretty disappointing.
The results for organizations and people follow a similar pattern. Again, Textrazor and AlchemyAPI are in first and second position, respectively. Textrazor is particularly strong at finding many entities (recall), while AlchemyAPI is better at getting them right (precision). MeaningCloud takes a solid third place, while the performance of Aylien depends on the endpoint you use: the Concepts endpoint is better at identifying well-known people, places and organizations, whereas the Entities endpoint mostly does a better job on lesser-known entities and surnames without a first name. Lexalytics achieves a very high precision, but its recall is really low. Its recall and F-score could possibly be improved by setting a lower confidence threshold for the entities.
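To illustrate what a confidence threshold does, here is a minimal sketch. The field names (`text`, `type`, `confidence`) and the scores are made up for the example and do not reflect the response format of any particular API.

```python
def filter_by_confidence(entities: list[dict], threshold: float) -> list[dict]:
    """Keep only entities whose confidence is at least `threshold`."""
    return [e for e in entities if e["confidence"] >= threshold]


# Hypothetical API output, including one wrong low-confidence entity.
entities = [
    {"text": "Pranab Mukherjee", "type": "Person", "confidence": 0.91},
    {"text": "Abidjan", "type": "Location", "confidence": 0.42},
    {"text": "Tuesday", "type": "Location", "confidence": 0.15},
]

for threshold in (0.8, 0.3, 0.1):
    kept = [e["text"] for e in filter_by_confidence(entities, threshold)]
    print(threshold, kept)
```

Lowering the threshold lets more correct entities through, which helps recall, but at some point it also admits wrong ones, which hurts precision. The best setting has to be found empirically for each API.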
Some of the more striking results can help illustrate the differences between the APIs. In the sentence
Jenson Button was an encouraging seventh for McLaren, matching the time of Williams' Valtteri Bottas in sixth.
both Lexalytics and Aylien’s Entity extraction fail to identify a single entity, although there are two people (Jenson Button and Valtteri Bottas) and two organizations (McLaren and Williams). Textrazor and MeaningCloud find all four entities correctly. Aylien’s Concept extraction finds three entities, but fails to identify Williams as an organization. AlchemyAPI finds Valtteri Bottas, but it misses Jenson Button and misclassifies McLaren and Williams as people.
In the sentence
This statement by the vice president of the NFF, Seyi Akinwunmi, is a repeat of homophobic statements made by football officials in the country in the past which were strongly condemned by FIFA.
Aylien’s endpoints do not classify a single entity as person, location or organization. MeaningCloud and Lexalytics both find FIFA (organization), but miss NFF (organization) and Seyi Akinwunmi (person). Textrazor and AlchemyAPI find all three entities, although they both map NFF to the Norwegian Football Federation, and not the Nigerian one.
Named entity recognition is a harder task than it might seem at first glance. Even the best APIs occasionally make mistakes. In many ways, I’ve tested only the most basic features of named entity recognition here. Things get considerably harder when the identified entities have to be linked to their correct real-world counterpart (such as NFF in the sentence above), or when several mentions of the same entity (“Donald Trump … Trump … he”) have to be mapped to one and the same referent. Similarly, even the best APIs struggle when they have to disambiguate an entity and infer, for example, that “Southwest” in a context such as “drier than normal conditions in the Southwest” refers to a location rather than a company. Kudos to AlchemyAPI, which was the only one in this test to get this example right. None of these harder tasks can be done without the basic entity recognition that I’ve tested here.
It is certainly true that the NLP APIs in this article have made basic NLP tasks available to the masses. However, while the offerings may look similar on the outside, their differences become very clear when we measure their performance. In my NER test, Textrazor outperformed all others, but it did so with long and often confusing lists of possible entity types. AlchemyAPI’s results were much more straightforward, but it found considerably fewer entities. MeaningCloud performed reasonably well on simple examples, but failed to perform on the more complex ones. Aylien has two endpoints that both struggle at times, and its plans to merge them should really pay off. Lexalytics has its work cut out. Of course, it is important to bear in mind that my evaluation exercise is rather informal and only looked at 100 sentences typical of one linguistic style and domain. It’s always possible that other domains may give different results. Still, it clearly pays off to compare the various APIs available, so buyer beware!