My team has been using Step Functions for a couple of months now to orchestrate application deployments. It is way better than our old SWF-based solution - I especially love how easy it is to implement dynamic backoff between states.
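For anyone wondering what dynamic backoff can look like, here is a minimal sketch, assuming a Lambda task computes the next delay and a Wait state reads it through SecondsPath. The state names, the attempt/wait_seconds fields, and the placeholder ARN are invented for illustration and are not taken from our deployment setup.

```python
def next_backoff(state, base=2, cap=300):
    """Lambda-style helper: bump the attempt counter and compute the next delay."""
    attempt = state.get("attempt", 0) + 1
    # Exponential backoff, capped so the wait never grows without bound.
    wait_seconds = min(cap, base ** attempt)
    return dict(state, attempt=attempt, wait_seconds=wait_seconds)


# Fragment of a state machine definition (Amazon States Language) as a dict.
# All names and the ARN below are hypothetical placeholders.
BACKOFF_STATES = {
    "ComputeBackoff": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:compute-backoff",
        "Next": "WaitBeforeRetry",
    },
    "WaitBeforeRetry": {
        "Type": "Wait",
        # The delay comes from the state input instead of being hard-coded.
        "SecondsPath": "$.wait_seconds",
        "Next": "CheckDeployment",  # hypothetical next state, not defined here
    },
}
```

The Wait state picks up whatever delay the previous task wrote into the state data, so the backoff policy lives in ordinary code rather than in the state machine definition.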
Biggest gripe: updating a state machine is still not supported (you can manage state machines through CloudFormation now, but I believe it updates with replacement, meaning the state machine gets blown away and replaced by a new one with the same name.) We've had to roll our own versioning to get around this.
As someone currently at the "prove it's a good idea" stage of moving production from SWF to SFN, I'm happy to hear it's been such a good experience in practice!
I think I'm mostly excited to stop locally representing/maintaining/enforcing the state machine -- it's been a huge source of toil for very little benefit. Hoping the cheapness of a new state machine will encourage splitting them when it makes sense instead of inflating/complicating an existing workflow...
How are the latency gains in practice? Specifically the activity -> decision -> activity interstitial latency in SFN vs the old SWF style.
I haven't actually worked with activities in SFN, just the pure Lambda state machines. But I can tell you those are fast. Like single-digit millisecond latency between states fast.
Step Functions has pre-defined parallel states, but have you come up with a solution for dynamic parallel states?
Such as: a parent state's output will determine the size of the input to pass on to multiple copies of identical parallel state children (the number of which is determined by the parent's output). These get the same input or a specific slice of the same input from their parent.
I guess these could be created dynamically within the parent, then deleted afterwards. Not as clean as declaring this in the states markup though.
Right now, we have to coordinate our own batch processing, since dynamic parallel states aren't supported, as you note. I've actually asked the Step Functions team about this feature, and I know I'm not the only person to do so, so hopefully it's on their roadmap :)
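A rough sketch of the kind of coordination meant here, assuming the parent's output is a list and each slice is handed to its own execution of a child state machine. slice_items, fan_out, and the start_child callback are hypothetical names; in practice the callback would wrap something like boto3's start_execution.

```python
import json


def slice_items(items, slice_size):
    """Split the parent's output into consecutive slices of at most slice_size items."""
    if slice_size < 1:
        raise ValueError("slice_size must be at least 1")
    return [items[i:i + slice_size] for i in range(0, len(items), slice_size)]


def fan_out(items, slice_size, start_child):
    """Call start_child once per slice with a JSON input and collect the results."""
    handles = []
    for index, chunk in enumerate(slice_items(items, slice_size)):
        payload = json.dumps({"slice_index": index, "items": chunk})
        handles.append(start_child(payload))
    return handles
```

The number of children falls out of the size of the parent's output, which is the part a static Parallel state cannot express. Collecting the results afterwards is still up to you.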
I think there may be better options than Step Functions (or ones that can be used from Step Functions) for this particular purpose. Depending on the processing required, handing the data off to something like the recently announced AWS Batch could solve this problem.
I am actually starting to look into creating a batch workflow, which Step Functions (or SWF) would be well suited for. However, I would like to know what the alternatives are that are not Amazon-specific. Are there any open source projects that cover this space of defining workflows?
Step Functions is a massive improvement over SWF. As the author said, there are major differences and areas for improvement. Lifecycle management of state machines is pretty poor, especially considering that deleting state machines is hit-or-miss. You'll need to version your state machines yourself--which is still less painful than SWF versioning. Also, because there are no "domains" as in SWF, development can be more of a pain. The last SWF feature that would let me drop SWF completely would be signaling ("sleeping" a workflow until an event occurs).
Is it possible to accept an activity task with no timeout, store the state in a DB, then when your event occurs, load the state from the DB and send the task_success signal? Not convenient, but it seems feasible.
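Roughly what I have in mind, as an untested-against-AWS sketch. It assumes sfn is a boto3 Step Functions client (get_activity_task and send_task_success are the real calls), that the Task state is configured with a long enough timeout, and that store is a dict-like stand-in for the DB. park_task, on_event, the worker name, and event_key are made-up names.

```python
import json


def park_task(sfn, activity_arn, store, event_key):
    """Accept one activity task and park its token until the event shows up."""
    task = sfn.get_activity_task(activityArn=activity_arn, workerName="parker")
    token = task.get("taskToken")
    if not token:
        return False  # the long poll ended with no task available
    # Persist the token and input; a real worker would write to a database here.
    store[event_key] = {"token": token, "input": task.get("input", "{}")}
    return True


def on_event(sfn, store, event_key, result):
    """When the external event fires, load the parked token and complete the task."""
    parked = store.pop(event_key)
    sfn.send_task_success(taskToken=parked["token"], output=json.dumps(result))
```

The workflow just sits in the activity state until on_event runs, so from the state machine's point of view it behaves like a signal.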
I've worked around this with Lambda tasks that write a result to a presigned S3 URL and a wait loop that polls it (like CloudFormation custom resources do) but it would be nice if this was built-in.
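The wait loop is roughly the following shape. This is a simplified sketch, not my actual definition: the state names, the 30-second interval, the ready/result_key fields, and the placeholder ARN are all assumptions, and object_exists stands in for whatever checks S3 for the result.

```python
# State machine definition (Amazon States Language) expressed as a dict.
POLL_LOOP = {
    "StartAt": "WaitForResult",
    "States": {
        "WaitForResult": {"Type": "Wait", "Seconds": 30, "Next": "CheckResult"},
        "CheckResult": {
            "Type": "Task",
            # Placeholder ARN for the hypothetical check-result Lambda.
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:check-result",
            "Next": "IsResultReady",
        },
        "IsResultReady": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.ready", "BooleanEquals": True, "Next": "Done"}
            ],
            # Not ready yet: go around the loop again.
            "Default": "WaitForResult",
        },
        "Done": {"Type": "Succeed"},
    },
}


def check_result(event, object_exists):
    """Body of the check-result Lambda: flag whether the result has landed."""
    return dict(event, ready=bool(object_exists(event["result_key"])))
```

Every pass through the loop costs state transitions, which is part of why a built-in signal would be nicer.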
I couldn't figure out how to return a named error from a python2.7 Lambda function; as far as I could tell it would always return UnhandledError. How'd you do it?
I don't think you can set custom error names from a Lambda function:
"Important
Custom Error Names, like those shown above ("ErrorA", "ErrorB", etc.) can be generated by an Activity, but not Lambda Functions. (See "FunctionError" field in the Lambda Response Syntax for more information.)" [0]
Just read the AWS Step Functions Developer Guide (1.0), and it's not currently possible to generate custom named errors from a Lambda task. It's only possible from Activities.
The blog post talks about triggering the workflow off an upload to S3, but skips a critical step: you need a Lambda function to start the Step Functions state machine (it's there if you look at the GitHub repo).
I wish Step Functions just supported the triggers directly tbh, i.e. could invoke a state machine on an S3 put, passing the event information to the first step in the machine.
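In the meantime the glue Lambda is small. A minimal sketch of one, not the code from the post's repo: it assumes the standard S3 event notification shape and boto3's start_execution call, and the state machine ARN is a placeholder. The extra sfn and state_machine_arn parameters are only there so the handler can be exercised without AWS.

```python
import json

# Placeholder ARN; substitute the real state machine.
DEFAULT_ARN = "arn:aws:states:REGION:ACCOUNT_ID:stateMachine:example"


def handler(event, context, sfn=None, state_machine_arn=DEFAULT_ARN):
    """Start one state machine execution per S3 object in the event."""
    if sfn is None:
        import boto3  # imported lazily so the sketch runs without AWS installed

        sfn = boto3.client("stepfunctions")
    started = []
    for record in event.get("Records", []):
        # Pass the bucket and key through as the input to the first state.
        payload = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        }
        response = sfn.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(payload),
        )
        started.append(response["executionArn"])
    return started
```

The function's role also needs permission to start executions on that state machine, which is easy to forget.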
Very much looking forward to an upcoming project with Step Functions at $DAYJOB. I suspect there are many use cases for turning complicated intersections of batch jobs and blobs of code into discrete, independent tasks like SFN requires.
Particularly interested in composing higher and lower level state machines for better separation of concerns, but haven't yet come up with a compelling use case :)
If you'd like to know more about how we are using Step Functions, as well as some specific benchmark comparisons against our old SWF architecture, feel free to check out this post: https://forrestbrazeal.com/2016/12/29/serverless-workflows-o...