Data engineering: How to build industrial-strength data lakes and processing platforms
This three-day course teaches practical data engineering: how to build industrial-strength data lakes and data processing platforms, and how to use them to build robust, scalable, and high-performing data processing applications.
Nota bene! The course is given online.
The focus is practical engineering, with real-life examples. The course covers architecture and development end-to-end, from data collection through batch and stream data processing, to exporting and serving data artifacts to users. In selected key areas, we will dive into implementation details. The course includes theoretical lectures as well as practical exercises in programming scalable data processing frameworks. The contents are vendor neutral, but we will present a recommended selection of open source and public cloud components that can serve as a starting point for a complete technology stack.
The course targets professionals requiring a hands-on understanding of state-of-the-art data engineering practices, such as backend engineers, data scientists, BI developers, database admins, and managers of those roles. Participants are expected to have at least three years of technical work experience.
Course participants are expected to be proficient either in a major object-oriented or functional programming language, e.g. Java, Python, C++, or Scala, or in data modelling and SQL. For the practical exercises, it is recommended that participants work in pairs, ideally one person with developer experience and one person with data modelling experience.
Practical exercises will be done in Scala. Participants who do not know Scala in advance need not worry: advanced language features are not used, so working through a tutorial beforehand is sufficient. Links will be provided.
Participants need to use their own laptops and download the exercise source code a few days in advance. Links and instructions will be provided. The exercises depend on open source libraries, downloaded as part of the preparations.
Lars is the founder of Scling, which provides data-value-as-a-service, a partnership solution for creating business value from data. He is a frequent conference speaker on big data technology and privacy protection. Before founding Scling, Lars worked at Google, at Spotify, and as an independent consultant, helping organisations create value with data processing technology. As an independent consultant, his clients ranged from startups to banks. LinkedIn profile at https://www.linkedin.com/in/larsalbertsson
The following topics will be covered in the lectures. Practical exercises will be interleaved with the lectures.
Overview and motivation. Why build a data platform, and how to use it.
Data collection. Gathering data into a data platform.
Batch processing. How to process data with scalable frameworks, such as MapReduce, Spark, Flink, etc.
Introduction to serving and NoSQL. How to export data from a data platform, and how to serve data-driven applications.
Workflow orchestration. Connecting batch processing flows into robust pipelines.
Real-time processing. Data processing with scalable stream processing frameworks.
Deployment. Deploying batch processing applications in production.
DataOps and quality assurance. Testing, continuous deployment, error handling, and engineering data quality.
Lifecycle, evolution, schemas. How to evolve data pipelines over time without breaking applications.
Privacy by design. Architecting data processing in order to comply with privacy regulations.
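As a taste of the functional style that batch frameworks such as MapReduce and Spark share, the classic word count can be sketched in plain Scala, the exercise language, with ordinary collections standing in for a distributed dataset. This is an illustrative sketch only, not course exercise code; the `lines` data and the `WordCount` name are made up for the example:

```scala
// A minimal sketch of the map/shuffle/reduce pattern that scalable
// batch frameworks express over distributed data. Plain Scala
// collections stand in for a dataset; no framework dependency.
object WordCount {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // map: split each line into words
      .filter(_.nonEmpty)                   // drop empty tokens
      .groupBy(identity)                    // shuffle: group occurrences by word
      .map { case (word, occurrences) =>    // reduce: count each group
        word -> occurrences.size
      }

  def main(args: Array[String]): Unit = {
    val lines = Seq("robust scalable pipelines", "scalable data pipelines")
    countWords(lines).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

In Spark, the same logic would run unchanged in spirit: the collection operations map onto transformations over a distributed dataset, which is why no advanced Scala features are needed for the exercises.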
The course is held in English, unless there is unanimous agreement for holding it in Swedish. The course material is in English.
Consider registering in pairs, coupling developer experience and data-modelling experience.
Maximum number of participants is 24, and the minimum for holding the course is eight.
26 May 2020 - 28 May 2020