After multiple weeks of pondering what I should do first with Apple's newly announced ARKit, I decided that I wouldn't narrow my mindset to just that one API. I had viewed multiple tutorials on CoreML/Vision's object recognition features, and I decided to give it a shot myself.
TL;DR: ARKit and Vision make an awesome combination.
What are we doing?
We're going to create an ARKit app that displays what the iOS device believes the object displayed in the camera is, whenever the screen is tapped. (See bottom of article for example pictures)
Project Setup
We begin our journey in Xcode (9 or above), where we create a new Augmented Reality App...
...give it a name... (in my case "arkit-testing-2") and set the Content Technology as SpriteKit...
...select its location on our hard drive, and start plugging away.
ViewController.swift
We're going to focus on the important pieces of code in this class, as most of it is general boilerplate.
override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)

    // Create a session configuration
    let configuration = ARWorldTrackingSessionConfiguration()

    // Run the view's session
    sceneView.session.run(configuration)
}
In viewWillAppear, the ARWorldTrackingSessionConfiguration class is created, and then the view's session is run. You can modify the configuration if you wish, but for this tutorial we won't be playing with it.
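If you do want to experiment, a typical tweak is enabling horizontal plane detection before running the session. This is just a minimal sketch, assuming the plane-detection option on the world-tracking configuration; we don't use it anywhere in this tutorial:

let configuration = ARWorldTrackingSessionConfiguration()
configuration.planeDetection = .horizontal // detect horizontal surfaces like tables and floors
sceneView.session.run(configuration)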
func view(_ view: ARSKView, nodeFor anchor: ARAnchor) -> SKNode? {
    // Create and configure a node for the anchor added to the view's session.
    let labelNode = SKLabelNode(text: "👾")
    labelNode.horizontalAlignmentMode = .center
    labelNode.verticalAlignmentMode = .center
    return labelNode
}
Inside this function, an ARSKView object is provided, along with an ARAnchor object. The ARAnchor object will be important later. Inside the function, an SKLabelNode is configured and returned. This will also be important later.
Before we jump into the other important file in this boilerplate project, let's modify our viewDidLoad method so we won't encounter a bug that I ran into when creating this project.
Replace...
// Load the SKScene from 'Scene.sks'
if let scene = SKScene(fileNamed: "Scene") {
    sceneView.presentScene(scene)
}
with...
let scene = Scene(size: self.view.frame.size)
sceneView.presentScene(scene)
I'm not sure what the bug is, or why this fixes it, but it does. You can play with the original code and find alternative fixes if need be.
Scene.swift
To begin, comment out the following code inside of touchesBegan:
// Create a transform with a translation of 0.2 meters in front of the camera
var translation = matrix_identity_float4x4
translation.columns.3.z = -0.2
let transform = simd_mul(currentFrame.camera.transform, translation)

// Add a new anchor to the session
let anchor = ARAnchor(transform: transform)
sceneView.session.add(anchor: anchor)
Yes, comment all of this out. Do not delete it, we'll come back to it later.
Vision!
Inside of the Scene.swift file, make sure you import the Vision framework before getting started:
import Vision
Now go to the Apple Developer Website's machine learning page and download the InceptionV3 model. You can download any model you'd like; this is just the one I prefer, and for what it does it's relatively small in file size.
Editor's Note: The InceptionV3 model is no longer on the site. Fortunately, you can download a different model and adapt the code accordingly.
All you have to do now is drag and drop the InceptionV3 MLModel file into your project, just like you would with any other file.
What Xcode does for you here is generate a Swift interface for the model. I would recommend watching the Vision and Introducing CoreML sessions from WWDC17 to learn more about it, located here and here, respectively.
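As a rough sketch of why that matters: the generated class exposes the underlying MLModel (which we hand to Vision in a moment) and a prediction method you could call directly. The exact names depend on your model file, and pixelBuffer below is a hypothetical CVPixelBuffer you'd have to obtain elsewhere, sized for the model's input:

// Approximate use of the Xcode-generated class (names may differ for other models).
let inception = Inceptionv3()
let mlModel = inception.model // the raw MLModel that Vision will wrap later

// Direct prediction without Vision, assuming a hypothetical pixelBuffer:
if let output = try? inception.prediction(image: pixelBuffer) {
    print(output.classLabel)       // top label
    print(output.classLabelProbs)  // label -> probability
}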
Now we're finally ready to write some code inside touchesBegan.
Let's jump onto a background thread so we don't completely wreck our application's performance when we run one of these requests (I learned this the hard way):
DispatchQueue.global(qos: .background).async {

}
Now let's create a do/catch block, and inside it create a VNCoreMLModel object from the CoreML model we downloaded moments ago (depending on your internet speeds, of course):
do {
    let model = try VNCoreMLModel(for: Inceptionv3().model)
} catch {}
Inside of our do/catch, just after our model initialization, let's create a VNCoreMLRequest with a completionHandler, like so:
let request = VNCoreMLRequest(model: model, completionHandler: { (request, error) in

})
Now, let's create a VNImageRequestHandler and perform our request (write this code after the VNCoreMLRequest's completionHandler):
let handler = VNImageRequestHandler(cvPixelBuffer: currentFrame.capturedImage, options: [:])
try handler.perform([request])
Let me explain what this code is actually doing, because it can get a little strange.
We're creating an image request handler to handle our request, and passing it a... CVPixelBuffer?!? What the heck is that? According to StackOverflow, CVPixelBuffer is part of the CoreVideo framework. Fortunately for us, we can access one from ARKit by pulling it out of the currentFrame object, saving us from doing any heavy lifting:
currentFrame.capturedImage
Then we're performing our request with handler.perform([request]).
Now let's write the code inside of the completionHandler:
// Jump onto the main thread
DispatchQueue.main.async {
    // Access the first result in the array after casting the array as a VNClassificationObservation array
    guard let results = request.results as? [VNClassificationObservation], let result = results.first else {
        print("No results?")
        return
    }
}
Awesome, we're almost done with our Scene class. Remember the code we commented out earlier? Let's paste it in after we perform that guard statement.
We're also going to modify a property to make our text appear further away from the device when we instantiate our ARKit object:
// Create a transform with a translation of 0.2 meters in front of the camera
translation.columns.3.z = -0.4 // Originally this was -0.2
If you'd like, you can update the comment to read 0.4 meters, because that comment was for the previous value of the property.
One last thing and we're done with our Scene class. Create a new Swift file called ARBridge and paste in the following code:
import UIKit
import ARKit

class ARBridge {
    static let shared = ARBridge()
    var anchorsToIdentifiers = [ARAnchor: String]()
}
The anchorsToIdentifiers property will allow us to associate an ARAnchor with its corresponding machine-learning value.
Let's add a value to this dictionary, and restructure our code so that it executes properly:
// Create a new ARAnchor
let anchor = ARAnchor(transform: transform)

// Set the identifier
ARBridge.shared.anchorsToIdentifiers[anchor] = result.identifier

// Add a new anchor to the session
sceneView.session.add(anchor: anchor)
Side note: If we save our identifier after we add the anchor to our scene, it won't appear properly. Make sure your code is in the order shown above.
We're all set! This is all of the code we just wrote inside of our touchesBegan function:
DispatchQueue.global(qos: .background).async {
    do {
        let model = try VNCoreMLModel(for: Inceptionv3().model)
        let request = VNCoreMLRequest(model: model, completionHandler: { (request, error) in
            // Jump onto the main thread
            DispatchQueue.main.async {
                // Access the first result in the array after casting the array as a VNClassificationObservation array
                guard let results = request.results as? [VNClassificationObservation], let result = results.first else {
                    print("No results?")
                    return
                }

                // Create a transform with a translation of 0.4 meters in front of the camera
                var translation = matrix_identity_float4x4
                translation.columns.3.z = -0.4
                let transform = simd_mul(currentFrame.camera.transform, translation)

                // Add a new anchor to the session
                let anchor = ARAnchor(transform: transform)

                // Set the identifier
                ARBridge.shared.anchorsToIdentifiers[anchor] = result.identifier

                sceneView.session.add(anchor: anchor)
            }
        })

        let handler = VNImageRequestHandler(cvPixelBuffer: currentFrame.capturedImage, options: [:])
        try handler.perform([request])
    } catch {}
}
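For context, all of this sits inside the template's touchesBegan method, which already unwraps the scene view and the session's current frame for us. Here's a rough sketch of the overall shape, assuming the standard SpriteKit AR boilerplate (your template may use an if-let rather than a guard):

override func touchesBegan(_ touches: Set<UITouch>, with event: UIEvent?) {
    // Provided by the boilerplate: the AR view and the camera's current frame.
    guard let sceneView = self.view as? ARSKView,
          let currentFrame = sceneView.session.currentFrame else { return }

    DispatchQueue.global(qos: .background).async {
        // ... the Vision request and anchor code shown above ...
    }
}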
(Finally) Back to ViewController.swift
The only thing we need to do now is modify our view method to retrieve the text associated with our ARAnchor, which was generated by our machine learning model.
func view(_ view: ARSKView, nodeFor anchor: ARAnchor) -> SKNode? {
    // Create and configure a node for the anchor added to the view's session.
    guard let identifier = ARBridge.shared.anchorsToIdentifiers[anchor] else {
        return nil
    }

    let labelNode = SKLabelNode(text: identifier)
    labelNode.horizontalAlignmentMode = .center
    labelNode.verticalAlignmentMode = .center
    labelNode.fontName = UIFont.boldSystemFont(ofSize: 16).fontName
    return labelNode
}
If there is no text associated with the ARAnchor, no SKNode is returned. If text exists, we create an SKLabelNode, change the font, and return it!
Testing!!!
I ran around my room pointing my camera at random objects, and this was the result:
It believed the MacBook Air on my desk was a stethoscope (that could have been the headphones or the mic), the pen on my nightstand was a revolver, and my Apple Watch sport band was a hatchet.
Other than that, it was amazing at predicting what the objects were. It thought the code for this project was a web-site, which was slightly correct. It also detected the snake pattern on my mousepad from Razer, which was pretty amazing.
With different models, I'm sure there will be different results, so try multiple models out and see what happens. It's as simple as dragging and dropping them into the project and changing the line of code that accesses the model.
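For instance, if you dragged in a different model (say, a hypothetical MobileNet download from the same page), the only line that needs to change is the one wrapping the generated class, since Xcode names that class after the model file you added:

// Hypothetical swap: replace the generated class with the one for your model file.
let model = try VNCoreMLModel(for: MobileNet().model)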
The final project can be found on GitHub here, if you just want to run it and see what happens!
Thank you so much for reading, hopefully you enjoyed my (pretty basic) endeavor into ARKit and Vision!