Sumerian Editor: tough cookie

sumerian editor with cues

Following an age-old tradition, I’ve decided not to read the documentation and to dive deep into Sumerian instead. So here’s the first impediment I’ve encountered: it does not work in all browsers!

So far I’ve tried in an old Firefox (i.e. not Quantum) running on my Linux laptop and it did not work (but it warned me that it works better in newer versions).

On my Windows machine I struggle with Sumerian on Vivaldi. The scene kept loading (with the round thingy going round and round) for about 10 minutes, and nothing happened. Then I wanted to take a screenshot to document this post, and just by activating the Snipping Tool (the one that comes with Windows), it loaded up! Intriguing to say the least.

The first experience is good. You have dialogs explaining all the elements in the editor, and tips for beginners so that you can do stuff easily and don’t get frustrated on your first go.

Next I’ll try with IE11 and Firefox Quantum and share the results. Edge and Chrome are banned software on my work laptop. Go figure!

Sumerian: building blocks

VR personae

So, it looks like Sumerian is something like this:

  1. Web-based editor to create and edit 3D scenes (JavaScript or drag and drop thanks to its state machine that you can interact with graphically).
    1. First crazy idea: a VR application that replicates the functionality of this Web-based editor, so that you can go all Tom Cruise @ Minority Report to work with Sumerian?!?
  2. Large repository of 3D objects / scenes that you can use with the above / you can inherit from & enhance with your needs. If you have your own, you can also upload them (OBJ and FBX file formats).
  3. The result is stored on AWS CloudFront (a CDN) and it can be accessed via a WebVR-enabled browser and headset.
  4. It seems that you can also create AR applications, i.e. leverage your phone’s camera. In particular, ARKit applications for iOS phones are a possibility from the onset.

Now that I have a basic but solid mental model of what Sumerian is, I get to play with the editor. I’ll tell you all about it once I figure it out!

I’ve been invited to the Amazon Sumerian Preview

Gilgamesh king of Uruk Credit DEA-Getty Images

My inbox had good news this morning. Turns out I’ve been invited to the Amazon Sumerian Preview. AWSome 😉 But what is Sumerian? In Amazon’s words:

“Amazon Sumerian is a set of tools for creating high-quality virtual reality (VR) experiences on the web. With Sumerian, you can construct an interactive 3D scene without any programming experience, test it in the browser, and publish it as a website that is immediately available to users.”

Not really voice, eh! But I believe this is the complement to Voice Services. Touch screens will be a thing of the past. Voice will be the natural user interface of choice for commands and small snippets of information, VR or AR to get more complex information back, and the good old Kindle reader for the noble art of reading novels!

Here’s a good place to start learning Sumerian. And the whole documentation is here. Bear in mind that this documentation is live and it’s bound to change significantly.

Time to play!

Alexa is coming home for Christmas: available worldwide (>80 countries)

Berlin Hauptbahnhof

Good news! After the initial US release, followed by Germany, the UK and India, the family of Amazon Echo products can be purchased in over 80 countries. The languages supported are German and English, the latter with 3 different locales (US, Britain and India). Here’s the official news.

For Europe, the German Amazon store has a bargain Echo Dot 2nd generation for just 35 euros.

Time for me to review my skills and make sure multilanguage support is well implemented, so that I’m ready to add my own language. Surely it’s around the corner 😉

Who am I? Alexa introduces Voice Profiles

hand in hand

Privacy within the privacy of your home is a concern for users of the Amazon Echo and of any other voice assistant, especially since skills that sync it with your personal accounts were made available. I am okay with my spouse checking my calendar, but I would not be so happy to mistake her appointments for mine! Voice assistants also bring out an age-old problem that we techies detect very well, but others not so much: that of cardinality. An example: when you have one Echo (or multiple, linked Echos acting as one, if your home is bigger than mine!) but you don’t live alone, then it’s quite likely that more than one human will speak to Alexa. Why does this one-to-many relationship between machines and humans represent a cardinality problem?

Let’s continue with the example. At home, my spouse and I use our Amazon Echo. We’re both non-native English speakers and have distinct accents in English (we learnt the language on different continents). Our Echo sometimes goes crazy trying to understand one or the other. The Machine Learning element of Alexa must be very confused about supposedly the same human saying the same thing in such different ways at random moments in time! I bet Alexa would be happier if we could let her know that we’re two humans, if we could teach her to tell us apart, and then teach her to understand us better one by one.

If on top of having different voices and different accents, you wish to use individual services information (personal calendars, mail accounts…) then you need to be able to somehow link those individual services with your Echo devices – again, cardinality problem. Which one will Alexa use? Mine or my spouse’s? Why does it have to be only one? Can’t it be both?

Luckily, Amazon has just launched Voice Profiles to achieve this. You configure your Echo devices to pair with as many humans as needed. How? Through the Alexa app on your Smartphone. Here’s how:

  • The person whose Amazon account is linked with the Echo device must launch the Alexa app on their Smartphone, visit Settings -> Accounts -> Voice, and follow the instructions.
  • The second adult in the household must do the following:
    1. When both of you are at home, launch the Alexa app on the primary user’s Smartphone.
    2. Settings -> Accounts -> Household profile, and follow the instructions to set up this new user.
    3. With any of your Smartphones, log on to the Alexa app with the credential of the second adult in the household.
    4. Follow the instructions below.
  • Any other humans other than the primary account holder must do the following:
    1. Install the Alexa app on your Smartphone if you haven’t done so.
    2. Log in with your Amazon account (or create one if you’re not the second adult in the household).
    3. Provide the info that’s required to pair up with the Echo device.
    4. (You can skip Alexa calling and messaging if you don’t want to use that with your Echo.)
    5. Settings -> Accounts -> Voice, and follow the instructions.

Here are the full instructions.

New generation of Alexa-enabled devices is here!

New Alexa-enabled devices

Last week, Amazon announced the next generation of voice-enabled devices (and tools for devs!). Here’s what we could learn from the official announcement and subsequent media coverage.

Echo Plus: Same form factor as the original Echo device, but enhanced in many ways. It will act as the control center for the home. It can manage over 100 IoT home devices “out of the box” and without the Bluetooth fuss. A simple “Alexa, find my devices” will get them all hooked up. The big question is, when will we start to hear about cheeky neighbours going all Poltergeist on your living room lights, or worse?

Echo new generation: Same functionality of the original Echo device, but smaller, and covered in cloth (different colors). It will sell for $99, according to The Verge.

Echo Spot: Finally! Some years ago I fell in love with a device/idea called Chumby. It was some sort of potato shaped, Internet-enabled alarm clock. Sadly (or not!) I never got one. Echo Spot will fill that gap in my life. A device slightly bigger than a baseball with a nice screen that you can talk to, that can wake you up.

I foresee the Echo Spot being the bestseller of the 3. So for us devs, this means we must enhance our Skills with visual functionality (a.k.a. cards).

Willkommen! Amazon Echo and Alexa now speak “the Queen’s English”… and German!

German and British flags

In September 2016 (time flies…), Amazon announced that the Amazon Echo and therefore Amazon Alexa would be made available in the UK and in Germany.

One would think that this would affect two geographic areas and only one new language, but nothing could be further from the truth. Trying to make Alexa understand Geordie or Scouse makes German look crystal clear.

So, from now on, there are three languages that you should consider when you define your skill: English (US), English (GB) and German.

It’s very important that you realise that Geography and Language are different things, and you have to make decisions in both areas. For example, you can publish a Skill in Germany in English (US) and German, or you can decide that your Skill won’t apply to expats and publish it for Germany in German only. When you define the Interaction Model, you define as many models as languages you wish to implement. When you provide publishing information, you decide on the geography.

In our next post we will solve the following riddles: What happens to my “functionality”? Do I need to create one version per language? (Hint: don’t do it!!!) What are the implications of limiting my Skill to a certain geography? Then we will write a bit about predefined Slot types and multilanguage implications.

Amazon Lex, the beating heart of aLEXa, opened for conversational bot creation

Amazon Lex Logo

This morning I finally got my invitation for the beta/preview program of Amazon LEX, the heart of Alexa’s voice recognition system.

I am just browsing through the documentation, so bear with me, but it looks very exciting. There are lots of concepts that will be familiar to any Alexa skills developer, especially around the interaction definition area. Some others are brand new.

Hope to have a bot up and running in the upcoming weeks. I’ll keep you posted!

Creating “The Functionality” Part 1: Introduction and “Existing Functionality”

In the post with the overall description with the magic formula for Skills, we broke those down in two parts: The Interaction Model and “The Functionality”. My usage of quotation marks is not just for fun. In the documentation provided by Amazon, the interaction model is mentioned by name all the time. The other part, no. So I decided to coin the term myself. Any kind readers with a better suggestion please leave it in the comments!

So, it’s now time to discuss “the Functionality”. This is what I’ve already said about the matter:

“The functionality” can be an application that already exists (Fitbit, Uber, etc. were happy systems with millions of users before Alexa was invented), or one you make now to be used specifically with Alexa. In the first case, the developers for that existing system will have to develop an interface that uses the AVS API. Well, actually, a product manager will have to identify the functionality that will be used via Alexa, then the developers will encapsulate that functionality in a way that can be exposed to the AVS API so that Alexa can use it. In some cases the developer and the product manager are the same person!

If you’re creating a Skill from scratch, then Amazon recommends that you build and host “the functionality” with Amazon Web Services and they suggest you do it as a Lambda Function. We’ll speak a lot about this soon, stay tuned!

I don’t know if it was clear enough, so here it goes. “The Functionality” is the stuff that the Skill actually does (telling the time, the horoscope, telling you the status of a flight, telling you how to prepare a mojito, suggesting which wines go well with pasta, etc.) And of course this functionality can already exist and is currently used through a different format (smartphone app, Web application, wearable, plain old desktop application, etc.) or you can create something totally new.

Existing functionality: Focus is Key

This will be the most frequent case. Your bank decides to offer their services through Alexa. Your fitness tracker adds voice to the ways it interacts with you. And the list goes on. Every month new Skills with existing functionality are published in the Alexa Skills list.

So, how does it work? You already have a working system with zillions of users. How do you add it to the list of stuff that Alexa is capable of doing? Well, first of all you need to define what functionality you want to expose to Alexa (“expose” here means “make available to”). Imagine you’re a bank. What do you want your customers to be able to do with voice? You have to take a lot of things into consideration, because some things work differently with voice than with other interfaces. This list is not exhaustive:

  • Security & Privacy considerations: anyone in the house can give instructions to Alexa. It’s probably not a good idea to be able to do bank transfers via voice. And everyone around will hear what Alexa says. Is it okay to hear your account balance? Don’t even think of protecting transactions with passwords: the point of Alexa for many user personas is that they can only use voice, and the possibility of eavesdropping makes saying passwords aloud a no-no.
  • Ergonomics: Okay, this is the realm of the Voice Interaction Designer, but you really need her input to decide what will fly and what will never work. Imagine you want to interact with your fitness tracker via Alexa. Will there be any value in hearing the list of your heart rates minute by minute? Will you remember it, will you take it in? The amount of information that a human can process depends on the sense being used. Sight is okay for browsing and for finding a needle of information in a visual haystack. Hearing is not.
  • Value and coherence: You want to implement stuff that is useful to the user and that brings value to your organization. And you want to paint a coherent picture for your user. She should not get frustrated because the things you’ve implemented lead her to believe that similar or related things, ones that seem equally important to her, are also implemented when they are not.

Sounds daunting? No, not really. It’s just a lot of work. This is why you need a Product Owner, or you need to be able to act like one and devote enough time to it, when designing any kind of system. You need someone who knows well the needs of the organization, the needs of the user and who is capable of understanding the possibilities and limitations of the technologies being used.

Okay, imagine you’ve done all of that and you have a list of “services” you want to use through Alexa Voice Services. What do you have to do now? Easy. Get your Developer and your Interaction Designer together, make them read and understand the post about the Interaction Model (they probably know much more than me, so maybe they can skip this part!), and make them agree on the “contract” between Functionality and Interaction Model (the Intent Schema). Don’t let them part ways until they do this!!

Technical Implementation

Now your Developer can start work. It’s all about creating a Web service that exposes the functionality that you wish to serve via Alexa. Remember this diagram? It’s the bit at the bottom right.

AVS Overview – “The Functionality” depicted on the bottom right part of the diagram.

Your Web Service must comply with the following (extracted from here. My comments between [square brackets]):

  1. The service must be Internet-accessible. [Pretty obvious, eh! But not easy to achieve in some big organizations.]
  2. The service must adhere to the Alexa Skills Kit interface. [More on it later]
  3. The service must support HTTP over SSL/TLS, leveraging an Amazon-trusted certificate.
    • For testing, Amazon accepts different methods for providing a certificate. [i.e. you don’t have to shell out money buying a certificate when you’re just testing]. For details, see the “About the SSL Options” section of Registering and Managing Custom Skills in the Developer Portal.
    • For publishing to end users, Amazon only trusts certificates that have been signed by an Amazon-approved certificate authority. [You work with Amazon, you leverage their services, you accept their rules. Certificates are a matter of trust anyways and you should use the ones they trust!]
  4. The service must accept requests on port 443.
  5. The service must present a certificate with a subject alternate name that matches the domain name of the endpoint.
  6. The service must validate that incoming requests are coming from Alexa. [This last point is actually trying to protect you from DoS attacks]
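
To make requirements 2 and 6 a bit more concrete, here is a minimal Python sketch of what “the functionality” might look like once a request has reached your service. It is only an illustration under assumptions: the skill ID is made up, the 150-second freshness window and the JSON field names reflect my understanding of the Alexa Skills Kit request format and should be checked against the interface specification, and the checks shown are not the complete validation (a self-hosted service also has to verify the request signature, which is not shown here).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical skill ID; the real one is shown in the developer console.
EXPECTED_APP_ID = "amzn1.ask.skill.example"
# Assumed tolerance for stale requests; check the specification.
MAX_AGE = timedelta(seconds=150)


def is_from_my_skill(event, now=None):
    """Basic sanity checks: expected application ID and a recent timestamp."""
    now = now or datetime.now(timezone.utc)
    app_id = event.get("session", {}).get("application", {}).get("applicationId")
    if app_id != EXPECTED_APP_ID:
        return False
    stamp = event.get("request", {}).get("timestamp", "")
    try:
        sent = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return False
    return abs(now - sent.replace(tzinfo=timezone.utc)) <= MAX_AGE


def speak(text, end_session=True):
    """Wrap plain text in the response envelope that Alexa reads aloud."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


def handle(event, now=None):
    """Entry point: validate the request, then react to its type and Intent."""
    now = now or datetime.now(timezone.utc)
    if not is_from_my_skill(event, now):
        raise ValueError("request rejected")
    request = event["request"]
    if request["type"] == "LaunchRequest":
        return speak("Welcome. Ask me for the time or the weather.", end_session=False)
    if request["type"] == "IntentRequest":
        intent = request["intent"]
        if intent["name"] == "GetTimeIntent":
            return speak("It is " + now.strftime("%H:%M") + " UTC.")
        if intent["name"] == "GetWeatherIntent":
            slots = intent.get("slots", {})
            location = slots.get("Location", {}).get("value", "your area")
            # A real Skill would call the existing weather system here.
            return speak("Here would be the weather for " + location + ".")
    return speak("Sorry, I did not get that.")
```

If you host the code as a Lambda Function, a function like `handle` would be called from the Lambda entry point; on your own Web server it would sit behind the HTTPS endpoint on port 443.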

So, the secret of the sauce is in complying with the Alexa Skills Kit interface. And believe me, this will be tricky unless you understand how the custom skill works, what you need to do to react to Intents, how to handle slots, and so on. To do that, you need to understand the interface specification very well but, most important perhaps, have a broad picture of how everything clicks together. To do that, I recommend two things:

This will be time well spent, it will pay off with a high return rate later.

Good luck!

Creating the Interaction Model

As we said in the previous post, a Skill has two distinct parts: the Interaction Model and what I call “the functionality”. In this post we will try to describe the elements of the Interaction Model, the rationale behind them, behind the split, and the shortcomings or limitations of the model adopted by Amazon.

So, quoting what we said already:

The Interaction model is everything related with speech. It’s where you specify the Invocation name, the slots that your Skill can understand, and very important, examples of whole sentences that your Skill can process. These sentences are called “Sample Utterances” and you will spend many hours perfecting those. There’s also something called the “Intent schema” and it’s very, very important, because it defines the different tasks that Alexa will be asking to “the functionality”, based on what the user has asked Alexa to do. It’s where you define the hooks between the two parts of the Skill.

We mentioned four elements:

  • Invocation name
  • Intent Schema
  • Slots
  • Sample Utterances

Let’s start from the beginning!

Invocation name

We saw the other day that this is not the name of the Skill, but you will probably decide that they are identical. The Invocation name is made of the words that you pronounce so that Alexa can figure out which Skill you want to use. You will always use it in conjunction with the wake word (Alexa! Echo! or Amazon! at the time of writing) and some verb: start, ask, etc. So, when you’re deciding on an invocation name, it’s worth trying it out. Just imagine how the Skill will be used:

  • “Alexa, start <<name of my skill>>”
  • “Alexa, ask <<name of my skill>> to…”

Make sure that the sentences above are easy to remember, easy to say, and easy for Alexa to recognize. My two golden rules would be:

  1. Make sure that the entire sentence is semantically and syntactically correct (i.e. makes sense). E.g. if you’re going to invoke your skill in the first way (“start”), it’s best that your Skill name represents a thing (saying “start the car” sounds okay, saying “start the driver” sounds really weird). If it’s going to be the second way (“ask… to…”), then you probably want the Skill to represent a profession or a person who carries out a task (e.g. Wine Helper, Dream Catcher, things like that). Also avoid falling into language ambiguity. More on this later when we talk about Utterances, and limitations of the model adopted by Amazon.
  2. Make sure that it’s easy to pronounce. You don’t want to end up with a tongue twister, or drive those with a particular accent crazy. Using the word “think” is probably a very bad idea.

Intent Schema

This is the most technical part of the Interaction model because it’s the boundary between “the functionality” and speech. You can say it’s the “contract” between these two parts of the Skill. Once defined, voice interaction designer and developer can part ways and do their thing. Once they finish, if both complied with the Intent Schema, then everything will integrate nicely.

Not getting into code, but staying at the conceptual level, this is what happens.

The developer comes up with a list of different situations where the functionality will receive instructions from the user. Let’s call these “Intents”. Examples are: start, help, play, ask, quit, etc.

Some of those “Intents” will require a bit more info from the user. “Start” doesn’t require more information, but what about “Ask”? “Ask” what? So this must also be specified. This “what” is known as a “Slot”, and Slots are defined in their own section of the Interaction Model. They are used here in the Intent Schema, though, hence this little explanation.

So, the content of an Intent Schema will be something like this:

  • Start
  • Stop
  • Play
  • Quit
  • Ask “the time”
  • Ask “the weather” “location”
  • Ask “the weather” “location” “date”

The Intent Schema is written in JSON. This is a certain syntax or notation to write down information. To learn more about it, w3schools has a good tutorial. But the whole point of using JSON is that it’s “user friendly”, so having a good Intent Schema example is typically enough; it isn’t hard to modify it with the actual Intents and Slots for your Skill. This is the one that would represent the “Ask” questions above:

{
  "intents": [
    {
      "intent": "GetWeatherIntent",
      "slots": [
        {
          "name": "Location",
          "type": "LIST_OF_LOCATIONS"
        },
        {
          "name": "Date",
          "type": "AMAZON.DATE"
        }
      ]
    },
    {
      "intent": "GetTimeIntent"
    }
  ]
}

Two things to notice here:

  1. The Intent names (GetTimeIntent and GetWeatherIntent) are not written in human or natural language. They are code. It’s the task of the interaction designer to define the “human language” that must be mapped to those Intent names. She will do that in the Utterances section. A piece of advice with Intent names: it’s good practice to add the word Intent as a suffix (i.e. GetWeatherIntent instead of GetWeather). Don’t get lazy so that things don’t get confusing!!!
  2. The Slots have a name and a type. Type means the kind of data that will be used when the Intent is invoked (e.g. 3rd April for the “Date” slot, Barcelona for the “Location” slot). And name is… well… self explanatory. We’ll explain those in the Slots section.
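
If you want to see the “contract” from the developer’s side, a small Python sketch like the following reads an Intent Schema and lists which Intents exist and which Slots (name and type) each one carries. The schema text is the example from this post; the function name `summarize` is just something I made up for illustration.

```python
import json

# The Intent Schema from the example in this post.
INTENT_SCHEMA = """
{
  "intents": [
    {
      "intent": "GetWeatherIntent",
      "slots": [
        {"name": "Location", "type": "LIST_OF_LOCATIONS"},
        {"name": "Date", "type": "AMAZON.DATE"}
      ]
    },
    {"intent": "GetTimeIntent"}
  ]
}
"""


def summarize(schema_text):
    """Map each Intent name to a {slot name: slot type} dictionary."""
    schema = json.loads(schema_text)
    summary = {}
    for entry in schema["intents"]:
        # Intents without Slots simply have no "slots" key.
        slots = entry.get("slots", [])
        summary[entry["intent"]] = {slot["name"]: slot["type"] for slot in slots}
    return summary


for intent_name, slots in summarize(INTENT_SCHEMA).items():
    print(intent_name, slots)
```

Both the interaction designer and the developer can work from a summary like this: she needs the Intent and Slot names for the Utterances, and he needs them to know what the functionality will receive.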

Slots

[If you’re a developer, to say that Slots are just a “variable” will suffice for you to understand]

They are the placeholders for the specific information that the user will provide to Alexa when using a Skill. The Slot section is where you define them. You have to define them BEFORE you use them in the Intent Schema, or you won’t be able to save your Skill in the developer console.

There are two kinds of Slots: built-in, and custom. Built-in Slot types are those you would expect as basic data types in any programming language: numbers, dates, etc. Amazon is adding new ones all the time. This is the list as I type (obtained from here). Note they all have the prefix “AMAZON.” so that we can easily see that they are built-in:

  • AMAZON.DATE – converts words that indicate dates (“today”, “tomorrow”, or “july”) into a date format (such as “2015-07-00T9”).
  • AMAZON.DURATION – converts words that indicate durations (“five minutes”) into a numeric duration (“PT5M”).
  • AMAZON.FOUR_DIGIT_NUMBER – provides recognition for four-digit numbers, such as years.
  • AMAZON.NUMBER – converts numeric words (“five”) into digits (such as “5”).
  • AMAZON.TIME – converts words that indicate time (“four in the morning”, “two p m”) into a time value (“04:00”, “14:00”).
  • AMAZON.US_CITY – provides recognition for major cities in the United States. All cities with a population over 100,000 are included. You can extend the type to include more cities if necessary.
  • AMAZON.US_FIRST_NAME – provides recognition for thousands of popular first names, based on census and social security data. You can extend the type to include more names if necessary.
  • AMAZON.US_STATE – provides recognition for US states, territories, and the District of Columbia. You can extend this type to include more states if necessary.
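
The converted values in the list above arrive at “the functionality” as plain strings, so the developer still has to turn them into something usable. Here is a rough Python sketch of how that might be done for three of the types. It only covers the simple example values quoted in the list (for instance, durations with hours, minutes and seconds only), and the function names are my own invention, not part of any Amazon library.

```python
import re
from datetime import datetime, timedelta

# Matches simple time-only durations such as "PT5M" or "PT1H30M".
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(value):
    """Turn a duration string like "PT5M" into a timedelta."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError("unsupported duration: " + value)
    hours, minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def convert_slot(slot_type, value):
    """Convert a slot value string according to its built-in type."""
    if slot_type == "AMAZON.NUMBER":
        return int(value)
    if slot_type == "AMAZON.TIME":
        return datetime.strptime(value, "%H:%M").time()
    if slot_type == "AMAZON.DURATION":
        return parse_duration(value)
    # Custom slots and anything not handled above stay as plain text.
    return value


print(convert_slot("AMAZON.DURATION", "PT5M"))
print(convert_slot("AMAZON.TIME", "14:00"))
print(convert_slot("AMAZON.NUMBER", "5"))
```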

Custom slots are just lists of possible values (like the values in a drop down list). That’s why they are typically called LIST_OF_WHATEVER. An example would be LIST_OF_WEEKDAYS and the content would be

Monday Tuesday Wednesday Thursday Friday Saturday Sunday

Sample Utterances

This is the heart of the Interaction design. I bet you will spend many hours polishing this!!!

So, here is what you do. Remember the Intent Schema? Well, for each one of the Intents in it you have to come up with all the real-life examples you can think of. And you type them here. One by one. You will easily come up with hundreds of lines. I’ll discuss this in the shortcomings part of this post. A Sample Utterance looks like this:

NameOfIntent and a section in natural language that may contain none, one {SlotName} or many {SlotNames}

In the example we’ve been following:

  • GetTimeIntent tell me the time
  • GetTimeIntent time please
  • GetTimeIntent what time is it
  • GetWeatherIntent tell me the weather for {Location} on the {Date}
  • GetWeatherIntent tell me the weather for {Location} {Date}

In the 4th example, the interaction designer is thinking of “tell me the weather for Paris on the 3rd April”. In the 5th example, the interaction designer is thinking of “tell me the weather for Paris tomorrow”. This is like Pokémon: you gotta catch them all (all the possible examples of speech by your users)!
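
Because every Sample Utterance has to start with an Intent name from the Intent Schema and may only use Slots defined for that Intent, it’s easy to sanity-check the list before pasting it into the developer console. The Python sketch below is one possible way to do it; the `SCHEMA` dictionary is a hand-written stand-in for the Intent Schema of the running example, and the function name is hypothetical.

```python
import re

# Intent name -> set of Slot names, mirroring the example Intent Schema.
SCHEMA = {
    "GetTimeIntent": set(),
    "GetWeatherIntent": {"Location", "Date"},
}

UTTERANCES = """\
GetTimeIntent tell me the time
GetTimeIntent time please
GetTimeIntent what time is it
GetWeatherIntent tell me the weather for {Location} on the {Date}
GetWeatherIntent tell me the weather for {Location} {Date}
"""


def check_utterances(text, schema):
    """Return a list of problems found in the Sample Utterances text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        # The first word is the Intent name, the rest is the spoken phrase.
        intent, _, phrase = line.partition(" ")
        if intent not in schema:
            problems.append(f"line {number}: unknown intent {intent}")
            continue
        if not phrase.strip():
            problems.append(f"line {number}: no phrase for {intent}")
        for slot in re.findall(r"\{(\w+)\}", phrase):
            if slot not in schema[intent]:
                problems.append(f"line {number}: unknown slot {slot} for {intent}")
    return problems


print(check_utterances(UTTERANCES, SCHEMA))
```

An empty list means every line refers to an Intent and Slots that exist in the schema. It says nothing about whether you have caught all the ways your users will phrase things, which is still the hard part.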

That’s it!!! Now you know how to create your Interaction Model!!!

[The Amazon folks explain this interaction business here.]

Shortcomings or Limitations

You have to understand the big effort that Amazon is making here. The computing power that speech recognition uses is vast and you have to avoid the convoluted, complicated cases like the plague. From the days of Noam Chomsky and all his good work on Grammars, we know that natural language is inherently ambiguous, and that when you’re defining a synthetic grammar it’s quite easy to generate ambiguity and very hard (impossible in fact) to make sure you don’t.

If you don’t know what I am talking about, let’s analyze this sentence.

“In the show, I liked everything but everything but the girl girl.” If you don’t know that there is a band called “everything but the girl” with a female lead singer, you would think that the sentence above is gibberish, and discard it. Alexa would go crazy!

In order to avoid that, what AVS does is, before it accepts the Interaction Model of your skill, it runs some checks to make sure that you’re not introducing any ambiguity or crazy loops for Alexa to go crazy. In Computing Science terms, what you’re creating with the Interaction Model is a Context Free Grammar and the checks I mention are some heuristics trying to detect if the grammar is ambiguity-free. If you’re interested, some heavy reading here.

So, Amazon set very strict rules for the definition of your Interaction Model, and these generate what are, in my opinion, the main limitations: both Custom Slots and Sample Utterances are static. You have to define them beforehand and you cannot change them on the go while the Skill is live. If you want to include an extra Utterance, no matter how innocent it looks, or if you need a new value in one of your Custom Slots, you have to change the Interaction Model AND submit the Skill for re-certification. Best case possible: it will take two full working days to introduce the change.

Imagine that your Skill deals with names of people (names of players, names of friends… whatever) as Slots. You have to provide the list with all the possible names BEFOREHAND. You cannot add Anaïs, or any other person with a name you wouldn’t have thought of, on the fly through usage of the Skill. You have to add it to the Interaction Model, and re-submit for certification.

Managing the Sample Utterances as plain text is also very, very tricky. You will just lose track of what’s in there, and troubleshooting is kind of hard. My workaround is a little Access database tool with a simple but relational data model and some wonderful macros that “dump” the content of the database as a long string of text that matches the syntax of the Sample Utterances expected by AVS in the developer console. Then I copy & paste this super long string.
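
If you don’t have Access at hand, the same “keep it structured, dump it as text” idea can be sketched in a few lines of Python. This is not my actual tool, just an illustration of the principle: the `TEMPLATES` structure and the function name are made up, and each Intent holds a list of alternative fragments that get combined into full utterances.

```python
from itertools import product

# Intent name -> list of positions; each position lists alternative fragments.
TEMPLATES = {
    "GetTimeIntent": [
        ["tell me the time", "time please", "what time is it"],
    ],
    "GetWeatherIntent": [
        ["tell me", "what is"],
        ["the weather for {Location}"],
        ["on the {Date}", "{Date}"],
    ],
}


def dump_utterances(templates):
    """Expand every combination of fragments into Sample Utterance lines."""
    lines = []
    for intent in sorted(templates):
        for parts in product(*templates[intent]):
            lines.append(intent + " " + " ".join(parts))
    return "\n".join(lines)


print(dump_utterances(TEMPLATES))
```

The output is one long block of text in the “NameOfIntent phrase” format, ready to copy & paste. The structured source is what you maintain and troubleshoot.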

Everything else, I think it’s super, and I am really grateful to Amazon for opening the platform for all to explore and develop Skills.