Videos uploaded by user “Spark Summit”
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Michael Armbrust
 
28:26
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Slides: http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer" https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on:
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
Views: 47031 Spark Summit
A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
 
44:03
Views: 95883 Spark Summit
Spark 2.0
 
20:26
Views: 8915 Spark Summit
Performing Advanced Analytics on Relational Data with Spark SQL - Michael Armbrust (Databricks)
 
21:30
Live from Spark Summit 2014
Views: 3893 Spark Summit
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
29:25
“In Spark 2.0, we have extended DataFrames and Datasets to handle real-time streaming data. This not only provides a single programming abstraction for batch and streaming data, it also brings support for event-time based processing, out-of-order/delayed data, sessionization and tight integration with non-streaming data sources and sinks. In this talk, I will take a deep dive into the concepts and the API and show how this simplifies building complex “Continuous Applications”.” - T.D.
Slides: http://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming
Databricks Blog: "Structured Streaming In Apache Spark 2.0: A new high-level API for streaming" https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
// About the Presenter //
Tathagata Das is an Apache Spark Committer and a member of the PMC. He’s the lead developer behind Spark Streaming, and is currently employed at Databricks. Before Databricks, you could find him at the AMPLab of UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica.
Follow T.D. on:
Twitter: https://twitter.com/tathadas
LinkedIn: https://www.linkedin.com/in/tathadas
Views: 23573 Spark Summit
Mastering Spark Unit Testing (Ted Malaska)
 
31:48
Traveling to different companies and building out a number of Spark solutions, I have found that there is a lack of knowledge around how to unit test Spark applications. In this talk we will address that by walking through examples for unit testing Spark Core, Spark MLlib, Spark GraphX, Spark SQL, and Spark Streaming. We will build and run the unit tests in real time and additionally show how to debug Spark as easily as any other Java process. The end goal is to encourage more developers to build unit tests alongside their Spark applications to increase velocity of development and improve stability and production quality.
Views: 14431 Spark Summit
Deep Dive into Monitoring Spark Applications Using Web UI and SparkListeners (Jacek Laskowski)
 
30:34
During the presentation you will learn about the architecture of Spark’s web UI and the different SparkListeners that sit behind it to support its operation. You will learn what information about Spark applications the Spark UI presents and how to read it to understand the performance of your Spark applications. This talk will demo sample Spark snippets (using spark-shell) to showcase the hidden gems of the Spark UI, like queues in FAIR scheduling mode, SQL queries or Streaming jobs.
Views: 13571 Spark Summit
Spark and the future of big data applications - Eric Baldeschwieler
 
18:27
Spark and the future of big data applications Eric Baldeschwieler (Tech Advisor)
Views: 8787 Spark Summit
What's Next for BDAS? - Mike Franklin
 
21:46
What's Next for BDAS? Mike Franklin (Director, UC Berkeley AMPLab)
Views: 2743 Spark Summit
Leveraging UIMA in Spark - Philip Ogren (Oracle)
 
16:58
Views: 2433 Spark Summit
Catalyst: A Query Optimization Framework for Spark and Shark - Michael Armbrust (Databricks)
 
12:18
Live from Spark Summit 2013
Views: 3025 Spark Summit
Exceptions are the Norm: Dealing with Bad Actors in ETL: Spark Summit East talk by Sameer Agarwal
 
31:27
Stable and robust data pipelines are a critical component of the data infrastructure of enterprises. Most commonly, data pipelines ingest messy data sources with incorrect, incomplete or inconsistent records and produce curated and/or summarized data for consumption by subsequent applications. In this talk we go over new and upcoming features in Spark that enable it to better serve such workloads. Such features include isolation of corrupt input records and files, useful diagnostic feedback to users, and improved support for handling nested types, which are common in ETL jobs.
Views: 5213 Spark Summit
Music Recommendations at Scale with Spark - Christopher Johnson (Spotify)
 
26:29
Views: 10224 Spark Summit
Recipes for Running Spark Streaming Applications in Production - Tathagata Das (Databricks)
 
31:37
Live from Spark Summit West 2015 in San Francisco
Views: 9452 Spark Summit
Spark on YARN: a Deep Dive - Sandy Ryza (Cloudera)
 
22:37
Views: 20664 Spark Summit
Building Data Pipelines with Spark and StreamSets (Pat Patterson)
 
31:52
Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Metadata in upstream sources can ‘drift’ due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an Apache 2.0 licensed open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows. In this session we’ll look at how SDC’s “intent-driven” approach keeps the data flowing, with a particular focus on clustered deployment with Spark and other exciting Spark integrations in the works.
Views: 5799 Spark Summit
Get Rid of Traditional ETL, Move to Spark! (Bas Geerdink)
 
32:18
ETL has been around since the 90s, supporting a whole ecosystem of BI tools and practises. While traditional ETL has proven its value, it’s time to move on to modern ways of getting your data from A to B. Since BI moved to big data, data warehousing became data lakes, and applications became microservices, ETL is next on our list of obsolete terms. Spark provides an ideal middleware framework for writing code that gets the job done and is fast, reliable, and readable. In this session I will support this statement with some nice ‘old vs new’ diagrams, code examples and use cases. Please join if you want to know more about the NoETL paradigm, or just want to be convinced of the possibilities of Spark in this area!
Views: 60393 Spark Summit
Beyond SQL: Spark SQL Abstractions for the Common Spark Job - Michael Armbrust (Databricks)
 
30:02
Live from Spark Summit 2015
Views: 6663 Spark Summit
Advanced Apache Spark Training - Sameer Farooqui (Databricks)
 
05:58:31
Live Big Data Training from Spark Summit 2015 in New York City.
"Today I'll cover Spark core in depth and get you prepared to use Spark in your own prototypes. We'll start by learning about the big data ecosystem, then jump into RDDs (Resilient Distributed Datasets). Then we'll talk about integrating Spark with resource managers like YARN and Standalone mode. After a peek into some Spark Internals, we touch upon Accumulators and Broadcast Variables. Finally, we end with Spark Streaming and a technical explanation of how the 100 TB sort competition was won in 2014." - Sameer
Slides: https://spark-summit.org/wp-content/uploads/2015/03/SparkSummitEast2015-AdvDevOps-StudentSlides.pdf
Want to learn more about Spark? Check out my new class, "Exploring Wikipedia with Apache Spark", recorded June 2016: https://www.youtube.com/watch?v=vlVnSpJ6TDE&t=21m23s
// About the Presenter //
Sameer Farooqui is a Technology Evangelist at Databricks where he helps promote the adoption of Apache Spark. As a founding member of the training team, he created and taught advanced Spark classes at private clients, meetups and conferences globally.
Follow Sameer on:
Twitter: https://twitter.com/blueplastic
LinkedIn: https://www.linkedin.com/in/blueplastic
Views: 209461 Spark Summit
Real-time big data processing with Spark Streaming - Tathagata Das (Databricks)
 
37:22
Live from Spark Summit 2013
Views: 22116 Spark Summit
Just Enough Scala for Spark (Dean Wampler)
 
23:31
Apache Spark is written in Scala. Hence, many, if not most, data engineers adopting Spark are also adopting Scala, while Python and R remain popular with data scientists. Fortunately, you don’t need to master Scala to use Spark effectively. This session teaches you the core features of Scala you need to know to be effective with Spark’s Scala API. Topics include: 1) classes, methods, and functions, 2) immutable vs. mutable values, 3) type inference, 4) pattern matching, 5) Scala collections and the common operations on them (the basis of Spark’s RDD API), 6) really useful Scala types, like case classes, tuples, and options, 7) effective use of the Spark shell (Scala interpreter), and 8) common mistakes and how to avoid them.
Views: 11513 Spark Summit