Semiocast API tutorial
Raw text analysis

About

This chapter gives examples of querying the Semiocast API for raw text analysis. What we call raw text is a short string of fewer than 256 Unicode characters.

We start with raw text because the examples are easier to read, but processing Twitter and Facebook messages works in much the same way.

Language identification

Let us run language identification on the sentence "This is an english test", using the following curl command:

curl -E semiocast-api.pem:PASSWORD -d ident=language -d data="This is an english test" "https://api.semiocast.com/1/analyze/raw.json"

The analyze/raw.json method analyzes raw text. The ident=language parameter specifies that we want language identification, and the sentence itself is passed in the data parameter. When running this command, the answer should be:

{"language":{"script_code":"latn", "language_code":"en"}}

This result contains two pieces of information:

  • the script, identified by its ISO 15924 code, which specifies the set of graphic characters: "script_code":"latn", that is, Latin script;
  • the language, identified by its ISO 639-1 code: "language_code":"en", meaning English.

In this case, then, the message was identified as an English sentence written in Latin characters.
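A client program will usually read these two codes rather than print the raw answer. The following is a minimal Python sketch of that step, working on the response text shown above. The function name describe_language and the two small name tables are illustrative assumptions, not part of the Semiocast API, and the tables only cover the codes that appear in this chapter.

```python
import json

# Illustrative lookup tables for the codes used in this chapter only;
# they are not a complete list of what the API can return.
SCRIPT_NAMES = {"latn": "Latin", "jpan": "Japanese"}
LANGUAGE_NAMES = {"en": "English", "ja": "Japanese"}


def describe_language(response_text):
    """Return (language_code, script_code) from a language identification answer."""
    result = json.loads(response_text)["language"]
    return result["language_code"], result["script_code"]


# Response text copied from the example above.
response = '{"language":{"script_code":"latn", "language_code":"en"}}'
language_code, script_code = describe_language(response)
print(LANGUAGE_NAMES.get(language_code, language_code),
      "written in",
      SCRIPT_NAMES.get(script_code, script_code),
      "script")
```

The sketch assumes the answer always contains a "language" object with both keys, as in the example. How the API answers when no language can be identified is not covered here.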

The Semiocast API can identify more than 60 languages, including Japanese, Arabic, and Hebrew. Here is an example in Japanese:

curl -E semiocast-api.pem:PASSWORD -d ident=language -d data="Twitterの投稿シェア調査に関するSemiocast社のプレスリリース:アメリカ30 % 日本15% ブラジル12%" "https://api.semiocast.com/1/analyze/raw.json"

This should return the following:

{"language":{"script_code":"jpan", "language_code":"ja"}}

The sentence is identified as Japanese (ja), written in Japanese script (jpan, which in ISO 15924 covers Han, Hiragana, and Katakana).
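The same queries can be issued from a program instead of curl. The following Python sketch uses only the standard library and mirrors the curl options used above: -d becomes a form-encoded POST body, and -E becomes a client certificate loaded into an SSL context. The helper names build_request and analyze are hypothetical. The sketch also assumes that the server accepts standard percent-encoded UTF-8 form data and that the 256-character limit is counted in Unicode code points; neither is confirmed by this chapter.

```python
import ssl
import urllib.parse
import urllib.request

API_URL = "https://api.semiocast.com/1/analyze/raw.json"


def build_request(ident, text, url=API_URL):
    """Build the POST request that the curl examples send with -d."""
    # Assumption: the limit is counted in code points, which is what len() does.
    if len(text) >= 256:
        raise ValueError("raw text must be shorter than 256 characters")
    # urlencode percent-encodes non-ASCII text as UTF-8 (assumed to be what
    # the server expects; curl -d sends the bytes unmodified).
    body = urllib.parse.urlencode({"ident": ident, "data": text}).encode("ascii")
    return urllib.request.Request(url, data=body)


def analyze(ident, text, certfile="semiocast-api.pem", password="PASSWORD"):
    """Send the request with a client certificate, as curl -E does."""
    context = ssl.create_default_context()
    context.load_cert_chain(certfile, password=password)
    request = build_request(ident, text)
    with urllib.request.urlopen(request, context=context) as reply:
        return reply.read().decode("utf-8")


# Building the request does not touch the network, so it can be inspected.
request = build_request("language", "This is an english test")
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Calling analyze("language", "This is an english test") would then perform the actual query, provided the certificate file and its password are valid.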

Location identification

The Semiocast API provides methods to identify locations in raw text. Results include both the country and the city. Countries are specified using ISO 3166-1 alpha-2 codes.

Suppose we want to know which location the text "Living in Paris" refers to. We issue the following command:

curl -E semiocast-api.pem:PASSWORD -d ident=location -d data='Living in Paris' "https://api.semiocast.com/1/analyze/raw.json"

Queries for location identification are almost the same as those for language identification, except that the ident parameter is set to location. The result should be:

{"location":{"country_code":"FR", "city_name":"Paris"}}

Locations can also be given as GPS coordinates (WGS84 format):

curl -E semiocast-api.pem:PASSWORD -d ident=location -d data="ÜT: 3.496255,99.123443" "https://api.semiocast.com/1/analyze/raw.json"

This should return:

{"location":{"country_code":"ID","city_name":"Tebingtinggi"}}

Japanese city names are recognized as well:

curl -E semiocast-api.pem:PASSWORD -d ident=location -d data="さいたまのえろいとこ付近" "https://api.semiocast.com/1/analyze/raw.json"

This should return:

{"location":{"country_code":"JP", "city_name":"さいたま市"}}
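Location answers have the same shape as language answers, so they can be read in the same way. The following Python sketch extracts the country and city from the three answers shown in this section. The function name parse_location is an illustrative assumption, and the sketch assumes both keys are always present, as they are in these examples.

```python
import json


def parse_location(response_text):
    """Return (country_code, city_name) from a location identification answer."""
    location = json.loads(response_text)["location"]
    return location["country_code"], location["city_name"]


# Response texts copied from the three examples in this section.
examples = [
    '{"location":{"country_code":"FR", "city_name":"Paris"}}',
    '{"location":{"country_code":"ID","city_name":"Tebingtinggi"}}',
    '{"location":{"country_code":"JP", "city_name":"さいたま市"}}',
]
for response in examples:
    country_code, city_name = parse_location(response)
    print(country_code, city_name)
```

Note that the city name comes back as a Unicode string, so the Japanese example needs no special handling on the client side.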

Format

If you prefer to see results in XML format, replace json with xml in the method name of any query. For instance:

curl -E semiocast-api.pem:PASSWORD -d ident=location -d data="ÜT: 3.496255,99.123443" "https://api.semiocast.com/1/analyze/raw.xml"

The expected result is:

<location><country_code>ID</country_code><city_name>Tebingtinggi</city_name></location>
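The XML answer carries the same two fields as the JSON one. As a minimal sketch, it can be read with Python's standard xml.etree.ElementTree module. The function name parse_location_xml is an illustrative assumption, and the sketch assumes the answer consists of a single location element exactly as shown above.

```python
import xml.etree.ElementTree as ET


def parse_location_xml(response_text):
    """Return (country_code, city_name) from an XML location answer."""
    # The root element is assumed to be <location>, as in the example above.
    root = ET.fromstring(response_text)
    return root.findtext("country_code"), root.findtext("city_name")


# Response text copied from the example above.
response = ("<location><country_code>ID</country_code>"
            "<city_name>Tebingtinggi</city_name></location>")
print(parse_location_xml(response))
```

If a field is missing, findtext returns None rather than raising an error, so callers should check for that case.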

Further reading

Read the micromessage analysis documentation for a complete list of features, recognized languages, accepted location formats, and results provided by this query.