const freeContent = `
<span class="course-bookmark" id="Introduction"><h1 class="c13" id="h.s5prgdcuygzw"><span class="c16">Course Introduction</span></h1>
<p class="c2"><span class="c3">Hello and welcome to the course! Data Engineering is a very in demand position in technology today and is a fascinating topic to explore. I hope that you continue down the path of learning what exactly a DE (Data Engineer) does on a daily basis and how you can obtain this skill set. </span></p>
</span><span class="course-bookmark" id="Course Expectations">
<h2 class="c9" id="h.nwap4ua7qpxt"><span class="c11">What can one expect out of this course?</span></h2>
<p class="c2"><span class="c3">By the end of the course you will be able to perform the basic duties of a Data Engineer. You will be able to set up data pipelines in the cloud. Along the way you&rsquo;ll learn how to ingest data and to store it efficiently, and then build analytics and visualizations on top of it. At the end you will even learn how to deploy machine learning models to learn from your data. Other related skills such as programming, DevOps and architecture will also be worked on. If your objective is to become employed in the DE field, this course will set you up for being able to apply for jobs and to perform well in the interview. The goal of this course is to not necessarily make you an expert in one area of data engineering, but to provide you the know-how and then give you the resources to further your path in what interests you.</span></p>
<h3 class="c9" id="h.3byj2cv3g9l"><span class="c11">What type of folks is this course intended for?</span></h3>
<p class="c2"><span class="c3">This course is intended for just about anyone. You might be coming from one of these backgrounds or from somewhere else in your life/career completely. Feel free to let us know!</span></p>
<h4 class="c7" id="h.yhr0wywaj6bc"><span class="c6">Data Engineer job applicants and students </span></h4>
<p class="c2"><span class="c3">This profession can often be an enigma to break into. Employers are often hesitant to hire junior employees, as the ramp up to become a fully functioning team member with little oversight often takes months, sometimes even years. College, while very useful for learning the theory and major concepts of computer science, almost never prepares a student to be directly prepared for an industry/non-academic role in software engineering, let alone a specialized data engineer. This course acts as a way to bridge that gap, whether you&rsquo;re a student or a non data software engineer. You will have the skills to directly contribute to your team.</span></p>
<h4 class="c7" id="h.fxhfpqnksj6o"><span class="c6">Current Data Engineer</span></h4>
<p class="c2"><span class="c3">Like most positions in tech, the landscape of technologies being used is ever evolving. One must continue to keep their skillset up to date in order to stay an efficient engineer. While in the past few years we&rsquo;ve seen a stabilization of the technology ecosystem in the space, there continues to be improvements and new features added to existing technologies on a regular basis, as well as some interesting greenfield ones. This course is regularly updated to reflect the latest trends in the industry. </span></p>
<h4 class="c7" id="h.weux984zpm0c"><span class="c6">Adjacent career role ie Data Analyst, DBA or Data Scientist</span></h4>
<p class="c2"><span class="c3">Learning how to build, maintain, deploy and scale data pipelines is becoming an ever increasing responsibility of adjacent roles in an organization. Oftentimes the data engineering team&rsquo;s backlog is full to the brim, and the responsibility of minor DE type tasks fall on others outside. This course will teach you how to perform these tasks and become a key player on your own team. A current trend is organizations switching over to becoming more cross functional, meaning you will need to learn key DE concepts to interact with DE team members.</span></p>
<h4 class="c7" id="h.9pqxt9ie5zs8"><span class="c6">Everybody else!</span></h4>
<p class="c2"><span class="c3">Whether you&rsquo;re a manager who has been assigned a team of Data Engineers, a product or project owner who has been given a data product to support, a recruiter who wants to know the latest DE trends or simply someone who enjoys learning new things, this course can provide useful insights into the day to day operations of a data engineer, and teach you the state of the modern data stack.</span></p>
<h3 class="c1" id="h.x47s5ph4ttgv"><span class="c5">Are there any prerequisites to this course?</span></h3>
<p class="c2"><span class="c3">In short, no. If you had the technologic competence to access this website, you should be able to get by. I know many successful DE&rsquo;s who don&rsquo;t have a college degree or who have entered the field from a different career path. With that said however, having the background on the technology side such as a SE, DBA, SysAdmin, etc. or from the more data or business analyst side will go a long way in helping one obtain the necessary skills and rounding out what they&rsquo;re missing. Don&rsquo;t sweat it if you don&rsquo;t know anything about data engineering, this course is designed for folks of all skill levels!</span></p>
</span><span class="course-bookmark" id="Modern Data Engineer">
<h2 class="c9" id="h.xwq1pjxff1qc"><span class="c11">What exactly is the role of a modern Data Engineer?</span></h2>
<p class="c2"><span class="c3">The Data Engineer title is a relatively new one, prior to the early/mid 2010s (and even at some companies today!) this role was simply called Software Engineer or was sometimes performed by a Database Administrator or similar ITOps employee. Nowadays the role is better understood, and the responsibilities of a modern Data Engineer has drastically increased. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">The main responsibility for a DE is to build and maintain data pipelines, however many skill sets are involved in this process and the DE is often expected to perform other related tasks. These include understanding and analyzing the data, becoming the subject matter expert for the datasets that are collected in order to answer questions from others in the company. Data engineers often work alongside or directly as BI (Business Intelligence) Developers, helping build out reports and visualizations that upper management up through C level employees and even shareholders use to make impactful business decisions and to gauge the performance and overall health of the company. As machine learning becomes more prominent and accessible to non machine learning experts, DEs are often expected to act as Machine Learning engineers, and assist in deploying and scaling ML models. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Gone (well mostly gone) are the days of using GUI ETL tools such as IBM DataStage, modern data engineers must learn how to write code in order to solve modern data problems. In previous decades, Data Engineers could get by with little programming skills, but today&rsquo;s challenges often require at least some grasp of high level languages such as Python or Java, and SQL is often the bread and butter for a Data Engineer.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">As the volume and velocity of data increases, modern DEs must know how to work with distributed systems to efficiently store and process this Big Data. Frameworks built to handle massive data sets such as Hadoop and Spark are essential in the modern DE&rsquo;s toolbelt. </span></p>
<p class="c2"><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://www.domo.com/learn/infographic/data-never-sleeps-5&amp;sa=D&amp;source=editors&amp;ust=1690745729706830&amp;usg=AOvVaw2PO8tsYQzK7PZpS2fUyCJ-">https://www.domo.com/learn/infographic/data-never-sleeps-5</a></span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">While the majority of current and future DE projects exist using modern technologies and are often deployed to a cloud or edge environment, many companies have legacy products on prem that must be maintained or migrated. Knowing how older technologies worked and being able to reproduce their outputs is also important for a DE. </span></p>
<h2 class="c9" id="h.mi6bii117ogr"><span class="c11">Why become a Data Engineer?</span></h2>
<p class="c2"><span class="c3">The prospect of becoming a data engineer is alluring to many as this field can be both lucrative and fulfilling. Since this role is a specialized type of software engineer, the compensation is usually on the higher end of the spectrum of the field. Data engineering problems, while different from other SE problems such as building mobile or web apps, are interesting in their own right and will keep one thinking on their feet throughout their entire career. Even when one has solved a similar problem previously, there will always be room for improvements such as creating more scalable, cheaper and efficient systems. This role is often very tied to business decisions, so one will quickly become a subject matter expert in their business and domain, and get to interact with different departments in the company such as finance, marketing and compliance. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span style="overflow: hidden; display: inline-block; margin: 0.00px 0.00px; border: 0.00px solid #000000; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px); width: 624.00px; height: 374.67px;"><img alt="" src="images/image19.png" style="width: 624.00px; height: 374.67px; margin-left: 0.00px; margin-top: 0.00px; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px);" title=""></span></p>
<p class="c2"><span>While this graph is a bit dated (2020 by dice.com), as you can see Data Engineers were one of the fastest, if not the fastest, growing roles in the tech industry. This trend continues to the day, and Dice has stated during their latest that &ldquo;Other data-related roles were among the highest in both posting volume and growth: data analysts, data scientists and data engineers&rdquo; </span><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://www.dice.com/technologists/ebooks/tech-job-report/occupations.html%23Top-Tech-Cities-by-Job-Posting-Growth&amp;sa=D&amp;source=editors&amp;ust=1690745729708020&amp;usg=AOvVaw1fNizN9l40ZLn4YcE832Ii">source</a></span><span class="c3">. 2022 did see a big dip in total volume of all tech jobs, but the industry has seen an expected resurgence in job growth and Data Engineers are still very high in demand. Now is the time to get into the industry to start gaining experience, as more senior level employees are typically the last to be laid off (and easiest to get rehired somewhere else if it does happen) when the next time mass layoffs happen again.</span></p>
<h3 class="c1" id="h.x3o7tou2pbt"><span class="c5">What makes for a good Data Engineer?</span></h3>
<p class="c2"><span class="c3">First and foremost, the ability to think analytically about problems and to have an inquisitive mindset are the biggest indicators of whether an individual will be successful at the job. A good DE will learn how the entire data ecosystem they&rsquo;re supporting works, and have a good mental mapping of how data flows through, which they can work backwards through when bugs are found or changes are needed. Oftentimes the structure and logic rules around a system are undocumented and not explained to the DE. The DE must be proactive in learning about the data and processes involved, running queries against the data and mapping out the relationships between structures and systems on their own. Having good communication and soft skills will enable a DE to effectively seek out the correct subject matter expert and learn how and why business decisions behind the system were made. Likewise for building new systems, a good DE will be able to flesh out both the functional and nonfunctional requirements, while keeping in mind best practices around their design and architecture. DE&rsquo;s need to constantly keep up with the current technology trends of the industry and decide if they need to apply them to their workload. A good DE thinks outside the box for problem solving, and seeks others input on how to create the best systems. However it&#39;s also important to not reinvent the wheel, and to go with the simplest solution that will work for the current task and avoid overengineering where possible. Finding a happy medium of simple but flexible if requirements change is often a career long pursuit, and experience brings the ability to anticipate these changes before they occur. </span></p>
</span><span class="course-bookmark" id="Data Lifecycle">
<h2 class="c9" id="h.7hvoj31jb1h0"><span class="c11">What is the Data Lifecycle?</span></h2>
<p class="c2"><span class="c3">The data lifecycle is the primary output of a DE&rsquo;s effort and refers to the stages that data goes through from its creation to its eventual disposal or archiving. This course splits out these stages into separate modules so you can learn about each one in greater detail. This fundamentals course will mostly skip over the creation process as they are normally out of scope for what a DE does and start with the collection/ingestion process. The following diagram illustrates the lifecycle stages.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span style="overflow: hidden; display: inline-block; margin: 0.00px 0.00px; border: 0.00px solid #000000; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px); width: 601.00px; height: 231.00px;"><img alt="" src="images/image20.png" style="width: 601.00px; height: 231.00px; margin-left: 0.00px; margin-top: 0.00px; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px);" title=""></span></p>
<p class="c0"><span class="c3"></span></p>
<h3 class="c1" id="h.q61irxp73972"><span class="c5">How Has the Cloud Transformed the Industry?</span></h3>
<p class="c2"><span class="c3">To say that the cloud has had an impact on the field of data engineering would be a massive understatement. The cloud has completely revolutionized how data engineering is done and brought it to a wider audience, perhaps why there is such a growth in the field today. Here are some key benefits of moving from on prem to the cloud:</span></p>
<h4 class="c7" id="h.nn7f22nodz45"><span class="c6">Scalability and elasticity</span></h4>
<p class="c2"><span class="c3">Data systems should be able to scale up easily depending on the size and complexity of the data its handling, and also be able to scale down on times when the data isn&rsquo;t needed as frequently, such as on evenings or weekends. Infrastructure (the underlying computing resources) are often ephemeral, meaning they may disappear over a certain amount of time, such as when a cluster scales down and a node and its IP are lost.</span></p>
<h4 class="c7" id="h.1fyhbga5dqgf"><span class="c6">Server Hosting </span></h4>
<p class="c2"><span class="c3">Since companies no longer have to buy and manage their own servers, they can easily spin up the resources they need for their requirements with a click of a button. Cloud providers regularly hit their SLAs (Software Level Agreements) of 99.9% and higher uptime, and allow companies to split their infrastructure into multiple availability zones in the case of the rare outage. </span></p>
<h4 class="c7" id="h.4hshwz8jdag2"><span class="c6">Storage</span></h4>
<p class="c2"><span class="c3">Storage is dirt cheap nowadays as compared to even a decade ago. Each cloud provider has their own object store which is the basis for where DEs store the data they collect. Backups and archives are much more possible given the costs. &nbsp;</span></p>
<h4 class="c7" id="h.mrdv4sksv2ra"><span class="c6">Managed Services </span></h4>
<p class="c2"><span class="c3">On top of the base machines that cloud providers &ldquo;rent&rdquo; out, they have also built specialized software that allow companies to quickly build out their information infrastructure. For data engineering, they provide tools such as EMR (Elastic Map Reduce), Glue for ETLs, Kinesis for data streaming and other tools that fit in the modern data engineering ecosystem. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Even if a company doesn&rsquo;t choose to directly move their infrastructure to a cloud provider, chances are they will at least use the cloud offering provided by the individual tool. For instance, the legacy ETL tool DataStage has an offering to move onto IBMs servers and manage it for a company. Companies often choose this route as it provides the above advantages with less hassle. The downside is the cost over time might end up being more than just self hosting. </span></p>
<p class="c2"><span class="c3">There are a ton of jobs out there primarily focused on moving legacy data pipelines from on prem to the cloud. A lot of opportunity exists around modernizing existing pipelines and making them faster, more robust and cheaper. You may be involved in one of these migrations, I have been multiple times in my career! That being said, there are negatives to moving to a cloud provider. One is that a company doesn&rsquo;t own the infra, they are just renting it meaning it might be cheaper over time to purchase and maintain it on their own. Another is a concept called vendor lock-in, which means it&#39;s often difficult to move to a different cloud provider or back to on prem since resources are specific to the cloud environment they&rsquo;re being run in.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">In this course we will focus on building our data pipelines in AWS, however you will be able to apply these same concepts to other cloud providers such as Google&rsquo;s GCP and Microsoft&rsquo;s Azure. AWS has the largest market share and the most amount of jobs available, so it is a good starting point. We will try to keep everything in the free tier, it&#39;s important you set billing limits on your account so that you never spend more money than you intend to. With big data this is especially easy to do, be careful with what resources you provision and make sure to decommission them when you&rsquo;re done using them.</span></p>
</span><span class="course-bookmark" id="Keys to Course Success">
<h2 class="c9" id="h.p02r9aqx03wc"><span class="c11">Can&rsquo;t one learn all the information of this course using ChatGPT and other free resources?</span></h2>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Well somewhat, at least for the cold hard information. In fact students of this course are encouraged to use ChatGPT or other AI alongside this guide in order to help solve problems they run into in order to make learning even more accessible. ChatGPT was even used to assist in creating some of the content for this course. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">The value of this course is that one gets a curated and streamlined list of content all in a single place. Not only are the cold hard facts presented, but the personal experience of multiple data engineers with many years of professional background in the field. You get personal anecdotes and opinions that AI can&rsquo;t yet provide. </span></p>
<h3 class="c1" id="h.wcz9oypn6psy"><span class="c5">Why choose this course over other similar online courses?</span></h3>
<p class="c2"><span class="c3">Other courses on Coursera and other sites exist that fulfill a similar goal to this site - to get one hired in the field of data. However these almost always come from the perspective of a Data </span></p>
<p class="c2"><span class="c3">Scientist or Analyst. This course was designed specifically for Data Engineers and presents the information from that viewpoint. There is also a heavy emphasis on Cloud in this course and the examples provided allow one to build their own sandbox in their own account that can be hacked on later, rather than code snippets run in the browser which aren&rsquo;t useful after the course is completed. </span></p>
<h3 class="c1" id="h.rij9vx9eyyz3"><span class="c5">What&rsquo;s next?</span></h3>
<p class="c2"><span class="c3">This course is designed to present information in a multi layered way. Since folks are coming into it with different technical literacy and subject matter expertise, the course helps you choose to learn only information that is relevant to your current level and goals. To learn how to get the most out of the course, it&#39;s recommended to view the quick guide on how to use it. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">The first skill laid out in this course is getting a basic grasp on programming languages, as this skill will be used in other sections of the course. However, you&rsquo;re free to skip it if you already have a strong programming background. It&#39;s important to know that you can skip or revisit any subject in the course, so if a certain topic sounds more interesting, feel free to jump straight to it and pick the other things up later! Get stuck on a certain section? Ask the forums or schedule a mentoring session and jump to a different section in the meantime. Ideally you will eventually go through the entire course, but as you may already have experience in certain topics, don&rsquo;t feel bad about not completing every section. </span></p>
</span><span class="course-bookmark" id="Programming Languages">
<h1 class="c13" id="h.aezu2rhd5ctm"><span class="c16">Programming Languages</span></h1>
<p class="c2"><span class="c3">This section is intended to get the basics down of commonly used programming languages in data engineering, not to become a master in any specific language. This section aims to give you the bare minimum programming skills you&rsquo;ll need to know as a DE. It&#39;s recommended that one pursues further practice in at least one of these languages on the side. Interviews, especially for junior level positions, will often have programming challenges to solve, so one should become versed in SQL and at least one other language listed below. See the future reading at the end of this section to enhance your skills. We&rsquo;ll start with SQL as it&#39;s often the most used language for a DE.</span></p>
</span><span class="course-bookmark" id="SQL">
<h2 class="c9" id="h.9fqoed733ivw"><span class="c11">SQL</span></h2>
<p class="c2"><span class="c3">SQL (Structured Query Language), pronounced Sequel or S-Q-L like the acronym, is perhaps the most useful language for a DE to learn. While SQL is based on set theory, you don&rsquo;t have to be an expert in math to understand how it works. SQL differs from most programming languages in that it is declarative and not imperative, meaning you describe to the computer what you want it to produce instead of telling it directly what to do. In this case, the computer is the DBMS (database management system) or engine. There are different types of database engines as we&rsquo;ll discuss shortly, and how they work under the hood can be very different. To make it more confusing, they even don&rsquo;t always support the same dialect, meaning some keywords won&rsquo;t be available in certain engines but will be in others. No need to fear though, the basics are the same between every one. The basic syntax is simple, however you can do many complex things with it! SQL is a very powerful tool and can accomplish a lot with little coding. Modern engines are generally optimized to make your code run quickly, pushing the hard work to the optimizer and not to you.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Code snippets in SQL are referred to as queries. A query is simply a question that you are asking the database to provide you an answer to, such as how many dollars each product in a location cost, how much increased business a marketing campaign brought in, what is the average age of your customers, etc. For ad hoc queries, ie queries that you only run once or a query that changes, you often run directly in an editor that is designed for data exploration, examples are DBeaver, MySQL workbench, Toad and DataGrip. For more automated queries such as those that are used to power reports and applications, you typically run these in the tool directly such as Tableau or PowerBI, or wrapped in code from another language which can dynamically pass in inputs to the query. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">If you&rsquo;re already a programmer but a non DE, stick around as your knowledge of SQL is probably not as extensive as a DE needs it to be. This is especially true if your background is in application development, you&rsquo;re probably used to using an ORM or similar tool and thinking of the database transactionally. A DE often performs bulk operations that insert or update many records at a time, and analytical queries can be very verbose and complex. Skip down to the appropriate sub section, a good place to start might be the &ldquo;Aggregating&rdquo; section. </span></p>
<h3 class="c1" id="h.7zt1s4kc2bj8"><span class="c5">Database Basics</span></h3>
<p class="c2"><span class="c3">To understand SQL, you must first need to know what a relational database is and how one works. A database, at its most basic concept, is a group of tables which hold information. This information is organized into rows and columns. Rows, also called records, are individual pieces of information. Columns, also called fields, define a specific attribute about that data. For instance if we had a Person table, we might have a column for name, one for age, one for email address, etc. Each row of the table would represent a different person. If you&rsquo;re familiar with Excel, you should already have a basic understanding of rows and columns, a table is just a way to group those together. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Tables are grouped into a structure called a schema. A schema lays the blueprint for tables, and are typically separated by a domain i.e. a marketing schema, product schema, user schema, etc. Multiple databases can be supported by the same engine, so the overall structure looks like this:</span></p>
<p class="c2"><span class="c3">Engine -&gt; Database -&gt; Schema -&gt; Table</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Relational means that tables can form relationships with one another. Later on we&rsquo;ll talk about the concepts of primary and foreign keys, and best practice for designing them. We&rsquo;ll learn how we can join tables together to combine data.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">As mentioned before, there are different database engines. Common ones include Postgres, MySQL and Oracle. These engines are often installed onto a computer or server, and the program will use the resources of the computer such as the hard disk and RAM to store your data and to process queries. Certain database engines are lightweight and can be held on very tiny devices like Raspberry PIs or entirely in memory of a machine, such as SQLite. There are different tradeoffs to each database engine, and you are encouraged to research on your own what the differences are. As a junior DE you will most likely not have to choose one, this is typically done by an Architect or senior engineer. </span></p>
<h3 class="c1" id="h.uvvnjdi971nz"><span class="c5">DDL vs DML</span></h3>
<p class="c2"><span class="c3">Before we go further, it&#39;s important to understand that with SQL you will define structures in your database before you access them. Later in this course we will discuss NoSQL databases, in where these data structures don&rsquo;t have to be defined as clearly as SQL databases, but for now we will be very explicit when we define what our data should look like.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span>DDL (Data definition language) is what we use to create, update and delete the </span><span class="c14">structure</span><span class="c3">&nbsp;of our data ie. tables and columns. Each column must have an associated data type. In the person table example above, we would use SQL DDL like below to create our table and columns:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Create table&hellip;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span>DML (Data manipulation language) is what we use to create, update and delete the </span><span class="c14">actual </span><span class="c3">data ie the rows. So when we want to add a new person to our database, we would use SQL DML like below.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Insert into table main.person</span></p>
<h3 class="c1" id="h.i5iesevtmbq"><span class="c5">Basic Syntax</span></h3>
<p class="c2"><span class="c3">After creating our tables and filling them with data, we are now ready to start querying and exploring the data. This is a very basic query:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select name from main.person;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select: This is the command we use to get back data</span></p>
<p class="c2"><span class="c3">Name: This is the column name we want returned in our query. We can select multiple, or use &ldquo;*&rdquo; to select all columns in the table at once.</span></p>
<p class="c2"><span class="c3">From: This precedes the location of the data</span></p>
<p class="c2"><span class="c3">Main: This is the schema name.</span></p>
<p class="c2"><span class="c3">Person: This is the table name.</span></p>
<p class="c2"><span class="c3">;: All queries end with a semicolon, although most modern database engines don&rsquo;t enforce it.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">This query returns a dataset known as the resultset. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">You can rename columns in your resultset using aliases. These use the &ldquo;as&rdquo; keyword. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select name as first_name from main.person;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">(table)</span></p>
<h3 class="c1" id="h.o03sk37wic7s"><span class="c5">Ordering and Limiting</span></h3>
<p class="c2"><span class="c3">Say we have a huge table and we only want to return a small sample size of the data. We can limit the amount of records we want back using the limit clause. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select name from main.person limit 5;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">This will only return 5 records. Limit is a keyword that not all database engines support, for instance in SQL Server we would use the following instead:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select top 5 name form main.person;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Now say we want to order the data so we only return the oldest people in our persons table. We would use the order by clause like so:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select name from main.person order by age desc limit 5;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Note that limit is always the last clause in our query, order by is second to last. The column we want to sort by comes write after &ldquo;order by&rdquo;. If we want to reverse order like we did above, ie the largest will comes, we use &ldquo;desc&rdquo; for descending. The other option is &ldquo;asc&rdquo;, but you can omit this as ascending is what gets defaulted to. </span></p>
<h3 class="c1" id="h.stdlarsecgtb"><span class="c5">Filters</span></h3>
<p class="c2"><span class="c3">Filters allow us to remove records from the output dataset by using a column in the table. We use the &ldquo;where&rdquo; clause to define them. In this example, our query will return only records where the age of the person is over 50.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select * from main.person where age &gt; 50;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Where: Begins the filter criteria</span></p>
<p class="c2"><span class="c3">Age: The column we&rsquo;re filtering on</span></p>
<p class="c2"><span class="c3">&gt;: The comparison operator.</span></p>
<p class="c2"><span class="c3">50: The number we are filtering. If this was a string/varchar column, we&rsquo;d surround it with single quotes.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Comparison operators are = (equals), &gt; (greater than), &lt; (less than), &gt;= (greater than or equal than), &lt;= (lesser than or equal to).</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">The wildcard operator is used to fill in part of the filter criteria, typically with a varchar column. This is useful when we want to fuzzy. For instance if we want to return a person&rsquo;s name that begins with a T we can do it like so:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select * from main.person where name like &lsquo;T%&rsquo;;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Note that we use the &ldquo;like&rdquo; keyword instead of a logic operator when using wildcards.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">You can use the wildcard operator at the beginning, end or both part of a string. This finds names ending in S:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select * from main.person where name like &lsquo;%S&rsquo;;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">And this finds names containing O:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select * from main.person where name like &lsquo;%O%&rsquo;;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Multiple where clauses are possible. We connect multiple filters using boolean logic, using and/or. If you&rsquo;re not familiar with boolean logic, use this chart to get an idea of how it works:</span></p>
<p class="c2"><span style="overflow: hidden; display: inline-block; margin: 0.00px 0.00px; border: 0.00px solid #000000; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px); width: 624.00px; height: 404.00px;"><img alt="" src="images/image28.png" style="width: 624.00px; height: 404.00px; margin-left: 0.00px; margin-top: 0.00px; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px);" title=""></span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">So for example, we want to return people in the person table with names that start with T and are age over 50, we&rsquo;d use this query:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select * from main.person where name like &lsquo;T%&rsquo; and age &gt;50;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Using quotes allows the engine to evaluate where clauses separately. See the following example:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select * from main.person where (name like &lsquo;T%&rsquo; and age &gt; 50) or email_address like &lsquo;%gmail.com&rsquo;;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">This will cause the engine evaluate the stuff inside the parentheses first and whats outside separately. So everything in the parentheses must be true, or anything outside true (since we used an or operator) for the record to show in the resultset. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">We can also look for things in a list. For example, if we want specific names returned, we could use a query like this.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select * from main.person where name in (&lsquo;bob&rsquo;, &lsquo;sally&rsquo;, &lsquo;ken&rsquo;);</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Make sure to use the &ldquo;in&rdquo; keyword after your column name, and surround your list with parentheses. We can use &ldquo;not in&rdquo; if we want to return records with names NOT in the list.</span></p>
<h3 class="c1" id="h.774pssa09v5h"><span class="c5">Joins</span></h3>
<p class="c2"><span class="c3">Easy enough so far right? Joins are where SQL starts getting a little more confusing for a newbie, but what make the language so powerful. They allow us to query multiple tables at once, which are related by a join key. There are different types of joins:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Inner: This the default join, inner joins only records that match on the key. All other records from either table are discarded in the resultset.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Left (outer): Along with inner join this is the other commonly used join. A left join will contain every record from the original table (left side) and only the records that join on the join key </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Right (outer): The flipside of left join, these will return all records from the right side of the join but only the joined records on the left side. These are very rarely used and are a bit redundant, as left join can give us everything we need. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Full (outer): This join returns every record from both tables regardless if they joined or not. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Cross: Also known as a cartesian join, a cross join will join each record of the left table with each record from the right, creating what&#39;s known as a cartesian product. This can be very costly to large tables and is only meant to be used with small tables. Common use cases for these are generating test data and for creating dates. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Self: Not a true join as there is no keyword for it, this is still a useful join. To self join, you do an inner join between the same table. This is used when finding duplicates in a table, or finding common patterns in a table. </span></p>
<p class="c2"><span class="c3">&nbsp;</span></p>
<p class="c2"><span class="c3">Here&rsquo;s how to visualize the different join types, I recommend printing it out and taping it somewhere on your desk!</span></p>
<p class="c2"><span style="overflow: hidden; display: inline-block; margin: 0.00px 0.00px; border: 0.00px solid #000000; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px); width: 253.00px; height: 199.00px;"><img alt="" src="images/image31.png" style="width: 253.00px; height: 199.00px; margin-left: 0.00px; margin-top: 0.00px; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px);" title=""></span></p>
<p class="c2"><span class="c3">Todo: create examples</span></p>
<h3 class="c1" id="h.v6q671dww0z9"><span class="c5">Keys</span></h3>
<p class="c2"><span class="c3">Keys are what connect tables to other tables, and what we use for joins. There are the main types of keys:</span></p>
<p class="c2"><span class="c3">Primary - Used to define what makes a record unique in a table, usually an id. </span></p>
<p class="c2"><span class="c3">Foreign - These &ldquo;point&rdquo; to the primary key of a foreign table. </span></p>
<p class="c2"><span class="c3">Composite - A primary key on multiple columns. If you don&rsquo;t have an id field and you don&rsquo;t have a single column that&rsquo;s unique, you can use a composite key to combine multiple columns that will define a unique record. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Most database engines will enforce primary keys, meaning it&rsquo;ll make sure new records will have a unique one before inserting, however some like Redshift don&rsquo;t. If you don&rsquo;t know what to use for a primary key, you can create an id column and tell the engine to auto increment it when inserting a new record, for instance the first record will be automatically populated with &ldquo;1&rdquo; for the id and the second record will be &ldquo;2&rdquo;. You can also create a hash (to simplify, a randomly generated string) as the primary key. There are pros and cons of both approaches. Both are regarded as &ldquo;surrogate&rdquo; keys as opposed to a natural key which would be a unique identifier already in your data. This link is useful if you want to dig in a little deeper: https://web.archive.org/web/20140618031501/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">While many database designs will use PK (primary keys) and FK (foreign keys) to join tables, you are not forced to do so. You can use any column to join on.</span></p>
<h3 class="c1" id="h.efq0zy55zafp"><span class="c5">Aggregating</span></h3>
<p class="c2"><span class="c3">Aggegates are a way to combine smaller pieces of data based on a common grouping and usually perform a calculation on top. Say we want find the number of people in our persons table per each state they live in, we can use the aggregate function known as count:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select state, count(*) from main.person group by state;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">count(*): the aggregate function, in this case count. Functions take in inputs, in this case it&#39;s columns, the syntax is parentheses surrounding the input. We use the * here because we&rsquo;re not concerned with a specific column. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Group by: precedes the column or columns you want to aggregate</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Common aggregate functions are count, sum, max, min, and avg. There are additional lesser used ones, but for the most part you&rsquo;ll want one of these. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">If you use a column in the group by, it must appear in the select. SQL will throw an error if you omit it. You can use any number of columns in the group by.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">The having clause is used when we want to filter using an aggregate function. These must have a group by, and also must appear in the select statement. The below example finds states having an average age greater than 25:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select state, avg(age) from main.person</span></p>
<p class="c2"><span class="c3">Group by state</span></p>
<p class="c2"><span class="c3">Having avg(age) &gt; 25;</span></p>
<h3 class="c1" id="h.hgu7jcg9qsrt"><span class="c5">Window Functions</span></h3>
<p class="c2"><span class="c3">Windows Functions are analytical functions that operate on a subset of rows within a result set or partition, all in the same query. They&rsquo;re kind of like other aggregations like group bys, but they return a value for each row in the result set instead of just a single value for the entire group. You define how you want the data to be partitioned (not to be confused with the distributed computing concept! You can think of these as groups) and how it should be ordered. Then you define a specific function you want to run against that group such as a sum, count, or raking function. Here&rsquo;s a very basic example:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">SELECT</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; customer_id,</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; order_date,</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; total_amount,</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS cumulative_amount</span></p>
<p class="c2"><span class="c3">FROM</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; orders</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Window functions can be hard to grasp for a beginner, and it&#39;s recommended that you play around a lot with these to fully comprehend their power. </span></p>
<p class="c0"><span class="c3"></span></p>
<h4 class="c7" id="h.kyyh78dcs51s"><span class="c6">Ranking Functions</span></h4>
<p class="c2"><span class="c3">row_number(), rank() and dense_rank() are the three ranking functions. They&rsquo;re a way to provide a row number or rank within a partition. They&rsquo;re useful for things such as finding what products are performing best and also for &ldquo;iterating&rdquo; in recursive CTEs by linking values together. The difference between the three are row_number provides a unique row id for each value (this can be non deterministic when there are ties), rank handles ties by making them the same value, and dense_rank handles ties but continues onto the next row number. </span></p>
<h4 class="c7" id="h.r2imizn9n0ze"><span class="c6">Aggregate Functions</span></h4>
<p class="c2"><span class="c3">sum(), avg(), min(), max() and count() are all examples of aggregate functions and act exactly as their group by counter partners do, but are contained with an individual partition instead of the entire group.</span></p>
<h4 class="c7" id="h.xsr38168xz7i"><span class="c6">Lead and Lag Functions</span></h4>
<p class="c2"><span class="c3">These are basically a way of taking ranking functions and automatically connecting the values between each other. lead() is used to find the value following the current one, and lag() finds the preceding one. </span></p>
<h4 class="c7" id="h.p46tqqo9isq6"><span class="c6">Windowing Functions</span></h4>
<p class="c2"><span class="c3">first_value() and last_value() return the value of the specified expression from the first and last row of a partition, respectively. These aren&rsquo;t very commonly used. </span></p>
<h4 class="c7" id="h.2ec5f6iwbu3o"><span class="c6">Analytical Functions</span></h4>
<p class="c2"><span class="c3">ntile() can further partition a partition into more groupings and assigns a number to them. percent_rank() returns the rank relative to the other rows. cume_dist() is used to determine the cumulative distribution as a percentage. These can be useful for things such as finding highs and lows of a weather pattern.</span></p>
<h3 class="c1" id="h.2xz6q5g1ip0i"><span class="c5">Views</span></h3>
<p class="c2"><span class="c3">Views can be thought of as tables that exist temporarily. There are concepts of temp tables, but these exist only per connection/session to the database and do not persist once that connection is closed. Views are mainly used to abstract query logic and to limit permissions from entire tables. This is a very simple example of one:</span></p>
<p class="c2"><span class="c3">Create view main.people_over_50 </span></p>
<p class="c2"><span class="c3">As select * from main.person where age &gt; 50;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">The benefit of this is a user could just query this view instead of having to reproduce the logic in the query. Of course this is a super simple and non realistic example, in practice this would probably encompass much complex logic.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">If we want to limit permissions to users to specific columns, we could do so with a view like this:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Create view main.email</span></p>
<p class="c2"><span class="c3">As select email from main.person </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">This would create a view where only the email of the person table is exposed, and one could grant certain users to access just this view and not the base table. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">It&#39;s important to note that views are recalculated every time they&rsquo;re queried. For complex views, this may take awhile.</span></p>
<h4 class="c7" id="h.awekbc4vgjhn"><span class="c6">Materialized Views</span></h4>
<h3 class="c1" id="h.b0ul372okw1n"><span class="c5">Set Operators</span></h3>
<p class="c2"><span class="c3">We can combine result sets of multiple queries in a single query using set operators. Here are the four types:</span></p>
<p class="c2"><span class="c3">Union: Combines all records from both resultsets, minus duplicates.</span></p>
<p class="c2"><span class="c3">Union all: Combines all records from both resultsets, including duplicates.</span></p>
<p class="c2"><span class="c3">Intersect: Only returns duplicate records.</span></p>
<p class="c2"><span class="c3">Except: Only returns records that exist in the first resultset but don&rsquo;t exist in the second one. </span></p>
<p class="c2"><span style="overflow: hidden; display: inline-block; margin: 0.00px 0.00px; border: 0.00px solid #000000; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px); width: 624.00px; height: 440.00px;"><img alt="" src="images/image18.png" style="width: 624.00px; height: 440.00px; margin-left: 0.00px; margin-top: 0.00px; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px);" title=""></span></p>
<p class="c2"><span class="c3">An important thing to note is that both queries must have the same columns between the two result sets. You may alias columns to make the names line up, but the same number of columns must always appear between the two.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select name, age from main.person</span></p>
<p class="c2"><span class="c3">Where email like &lsquo;%gmail%&rsquo;</span></p>
<p class="c2"><span class="c3">UNION</span></p>
<p class="c2"><span class="c3">Select name, age from main.person</span></p>
<p class="c2"><span class="c3">Where email like &lsquo;%hotmail%&rsquo;;</span></p>
<h3 class="c1" id="h.9de0kvtbirt9"><span class="c5">UDFs, Stored Procedures and Triggers</span></h3>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">UDFs (User Defined Functions) are ways one can create custom functionality in one database. These can be useful for storing business logic, encapsulating common patterns and making code more reusable. Built-in functions that are callable right out of the box are similar, UDFs take it a step further to be customizable. Each RDMS has its own implementation of UDFs, so the syntax will vary between them. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Stored Procedures are similar to UDFs, the difference being SPs (stored procedures) don&rsquo;t have to return a value whereas UDFs do. Generally SPs are used to change that state of your database, for instance applying an insert or creating an object, while UDFs are stateless meaning they do not manipulate data or objects in a database.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Triggers are pieces of code that get activated when a specific event happens, and do a specific task. Some triggers come with the RDBMs such as referential integrity checks (i.e. things like making sure primary keys are enforced). One can also create their own trigger, typically these will activate a Stored Procedure to perform an action. </span></p>
<h3 class="c1" id="h.l8p9wk2e27ro"><span class="c5">CTEs</span></h3>
<p class="c2"><span class="c3">CTEs are a </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">With orig_query as (</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select * from main.person</span></p>
<p class="c2"><span class="c3">Where age &gt; 50</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">)</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select * from orig_query;</span></p>
<h4 class="c7" id="h.5m4z758fr1ck"><span class="c6">Recursive CTEs</span></h4>
<h3 class="c1" id="h.2xr64w63w9cj"><span class="c5">Bringing it Altogether</span></h3>
<p class="c2"><span class="c3">Let&#39;s create a query together that brings in all the concepts we&rsquo;ve talked about before. We already have the building blocks to start building and understanding very complex queries! </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">With orig_query as (</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select row_number () over (partition by name order by age) as row_num, name from main.person</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">),</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Filtered_query as (Select p.* from orig_query oq</span></p>
<p class="c2"><span class="c3">Left join main.person p on og.name = og.name</span></p>
<p class="c2"><span class="c3">Where oq.row_num = 1)</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Select count(1), name from filtered_query</span></p>
<p class="c2"><span class="c3">Group by name</span></p>
<p class="c2"><span class="c3">Ordery by name desc;</span></p>
<h3 class="c1" id="h.eadm79d5xkiv"><span class="c5">Transactions</span></h3>
<p class="c2"><span class="c3">A transaction is a way to group a sequence of separate database operations that are done all at once. If there is an error, the change can be reverted, undoing the entire sequence. This is a very important and powerful concept of a DBMS as it allows the programmer to guarantee everything goes through, or nothing goes through. Transactions must follow the ACID principle. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">ACID is an acronym representing the 4 key properties of a transaction:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">A - Atomic: Each transaction is a single &ldquo;unit&rdquo; aka its own atom. Every change is committed at once or the database remains unchanged.</span></p>
<p class="c2"><span class="c3">C - Consistent: Transactions must follow the rules of the database such as key constraints before going through.</span></p>
<p class="c2"><span class="c3">I - Isolation: Since multiple transactions can happen at once, a single transaction must be isolated so that the database would look the same if it didn&rsquo;t go through. </span></p>
<p class="c2"><span class="c3">D - Durability: Once a transaction goes through i.e. committed, the database will reflect that even upon a database outage, like if someone trips on the power cord of the database server!</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Transactions start with the begin statement and end with end. Which should be easy to remember:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">BEGIN;</span></p>
<p class="c2"><span class="c3">&ndash;do sql stuff!</span></p>
<p class="c2"><span class="c3">END;</span></p>
<p class="c2"><span class="c3">&ndash;transaction is complete</span></p>
<h3 class="c1" id="h.qm0x2jp08dgt"><span class="c5">OLTP vs OLAP</span></h3>
<p class="c2"><span class="c3">Database engines are usually designed to support either transactional (OLTP) or analytical (OLAP) workloads. Some can be configured to do either one, but for the most part a specific engine is usually good for one or the other. A transactional workload centers around many writes happening at a time, typically one record at a time. Some databases are able to scale up to millions (or more) of these writes per second. Writes are first class citizens and reads are second class citizens. An analytical workload centers around processing complex read queries being run against the database. Oftentimes an OLAP DB will be columnar store, meaning the engine will store data based on the column and not record, in order to speed up read queries. Writes are typically done in large batches. Reads are first class citizens and writes are second in OLAP. As DEs, we typically only use OLAP engines, however there are times where we must read or replicate OLTP engines, so it&#39;s important to know both. </span></p>
</span><span class="course-bookmark" id="Python">
<h2 class="c9" id="h.3oo1y6nv1isf"><span class="c11">Python</span></h2>
<p class="c2"><span class="c3">Python is known as the day-to-day language and is often chosen by beginners to learn programming. It has become very popular in the field of Data Engineering/Analysis/Science as libraries such as Pandas, NumPy and SciPy have gained mass support. It&#39;s also used in other software development fields such as web development, where frameworks like Django and Flask are commonly used. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Python supports multiple programming paradigms like imperative, object oriented and functional. This gives the programmer multiple approaches to problems, and the ability to combine multiple paradigms in the same codebase. Unlike Java and similar languages, python is dynamically typed meaning you don&rsquo;t have to specify the types of variables as those are figured out by the interpreter during run time. This saves the programmer time, but the tradeoff is that variables aren&rsquo;t as explicit. Also unlike Javascript but similar to Java, Python is strongly typed meaning a variable&rsquo;s type can&rsquo;t change over time, which can create run time bugs if a variable is used incorrectly. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">The ease of use and library support of Python make it an ideal language for Data Engineering. Also popular Big Data and Machine Learning frameworks such as Apache Spark and TensorFlow provide an API for it. </span></p>
<h3 class="c1" id="h.axeeriik36ob"><span class="c5">Syntax Basics</span></h3>
<p class="c2"><span class="c3">Python is pretty close to English in terms of syntax. Unlike languages like Java, Python doesn&rsquo;t force you to surround your statements in brackets or parentheses but instead infers this based on new lines and whitespace. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span>When you have a syntax error, meaning your code isn&rsquo;t up to par with what the interpreter is expecting, the interpreter will tell you about it during its parsing stage, usually telling you the exact line number where it encountered the invalid syntax. There are also common patterns your code should follow to make it easier to read by others. This is known as a style guide, and Python&rsquo;s is called PEP. You can read more about it here: </span><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://peps.python.org/pep-0008/&amp;sa=D&amp;source=editors&amp;ust=1690745729731117&amp;usg=AOvVaw3jjp11O44f6l1EB8NRByrC">https://peps.python.org/pep-0008/</a></span><span class="c3">. </span></p>
<h4 class="c7" id="h.2q69fjtm2atz"><span class="c6">Comments</span></h4>
<p class="c2"><span class="c3">Comments are what a programmer writes to describe how the program works or give additional info, they are not executed with the rest of the code but are completely separate entities. One can write whatever they see fit, and its they are often used as a way to describe how the code works. Here&rsquo;s an example.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Foo = &ldquo;bar&rdquo; # I am a comment! I can write whatever I want here if I don&rsquo;t escape the comment. &nbsp;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Comments in python start with a &ldquo;#&rdquo; character. Anything written after the # will be part of the comment, up until you start a new line. Some languages enforce comments being on separate lines, but in Python you can include them within the same line of actual code like you see above. You can also create a multi line comment by using three quotes (single or double), and enclosing the comment with a second set of three quotes.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">&ldquo;&rdquo;&rdquo;</span></p>
<p class="c2"><span class="c3">I am a multiline comment!</span></p>
<p class="c2"><span class="c3">I will comment until the next set of quotes. </span></p>
<p class="c2"><span class="c3">Nothing in here will be executed by the interpreter. </span></p>
<p class="c2"><span class="c3">Fin.</span></p>
<p class="c2"><span class="c3">&ldquo;&rdquo;&rdquo;&rdquo;</span></p>
<h4 class="c7" id="h.j7iu53h30kap"><span class="c6">Scoping</span></h4>
<p class="c2"><span class="c3">Most modern programming languages, including Python, use static or dynamic scoping. Essentially it means you can only use variables only in the same block of code they are created, including blocks of code that are inside that same block. If there are variables that are named the same thing, the language will use the variable that was defined in the lowest level block. There is also a concept of dynamic scoping where the compiler tries to find other references to a variable, but we won&rsquo;t get into that. Here&rsquo;s a concept of how scoping works in python. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Var1 = &ldquo;a&rdquo;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Def foo():</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; Var1 = &ldquo;b&rdquo; # var1 will use the value b and not a</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; Var2 = &ldquo;x&rdquo;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Def bar():</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp;print(var2) # this will error out because var2 is outside the scope of bar</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">As you can see, there is a hierarchy where variables can be referenced from. In Python, you can use the LEGB rule to determine this hierarchy. Letters earlier in the hierarchy will be what gets used. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">1. Local: Variables declared inside a function</span></p>
<p class="c2"><span class="c3">2. Enclosing Function: When you have a function inside a function, this is the outer function.</span></p>
<p class="c2"><span class="c3">3. Global: Variables declared outside of a function</span></p>
<p class="c2"><span class="c3">4. Built-In: Python pre-assigns its own functions and variables, these are last in the order. </span></p>
<h3 class="c1" id="h.yag5grld0sik"><span class="c5">Variables</span></h3>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">A variable is simply a reference point for a stored value, it is how the language provides access to manipulate a certain piece of data. Here is an example of how to declare and use a variable: </span></p>
<p class="c2"><span class="c3">Age = 50</span></p>
<p class="c2"><span class="c3">print(&ldquo;I am this many years old:&rdquo; + str(age))</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">In python variables are mutable, meaning they can change over time. Here is an example of a variable changing its state:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Age = age + 1</span></p>
<p class="c2"><span class="c3">print(&ldquo;I am now:&rdquo; + str(age))</span></p>
<h4 class="c7" id="h.6uup7eqpqsss"><span class="c6">Variable Types</span></h4>
<p class="c2"><span class="c3">Python is able to infer the type of a variable depending on what gets assigned to it. In our previous example python would know that the variable as an integer type. The five standard data types are as follow:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Numbers - there are four numerical types in python: integer, long float and complex</span></p>
<p class="c2"><span class="c3">Int = 10 # do no have a decimal point!</span></p>
<p class="c2"><span class="c3">Long = 0x82B1 # these can be octal or hexadecimal</span></p>
<p class="c2"><span class="c3">Float = -25.689 # these represent decimals and can even including scientific notation</span></p>
<p class="c2"><span class="c3">Complex = 1.35j # represent complex/imaginary numbers and can also use scientific notation</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">String - think of these as text, a set of characters.</span></p>
<p class="c2"><span class="c3">Str = &ldquo;Hello World, I&rsquo;m learning Python!&rdquo;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">List - a sequence of variables, lists can contain different types of variables in the same list, including other lists! The syntax is square brackets around the elements, and commas to separate the elements inside. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Mylist = [&lsquo;a&rsquo;, 50, [&lsquo;b&rsquo;, 60]]</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Tuple - An immutable list, meaning you can&rsquo;t change values in the list without creating a new one. The syntax is to use parentheses around the elements. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Mytuple = (&lsquo;a&rsquo;, &lsquo;b&rsquo;, [&lsquo;b&rsquo;, 60])</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Dictionary - A way to map keys with values. These are similar to &ldquo;maps&rdquo; in other languages</span></p>
<p class="c2"><span class="c3">Mydict = {</span></p>
<p class="c2"><span class="c3">&ldquo;A&rdquo; : &ldquo;foo&rdquo;,</span></p>
<p class="c2"><span class="c3">&ldquo;B&rdquo;: &ldquo;bar&rdquo;</span></p>
<p class="c2"><span class="c3">}</span></p>
<h4 class="c7" id="h.5m1s05a1bknw"><span class="c6">Changing Types</span></h4>
<p class="c2"><span class="c3">Variables can only change values, but we can change its type by creating a new variable. This is known as converting or casting. To convert a number, you call the associated function of the desired type. For instance if we wanted to convert a int, we&rsquo;d do so like this:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Age = 50</span></p>
<p class="c2"><span class="c3">String_age = str(age) </span></p>
<h4 class="c7" id="h.61fg3hu3g52q"><span class="c6">Data Structures</span></h4>
<p class="c2"><span class="c3">Data structures are entities that tell a computer how to best organize and store data. Since efficiency is very important in computer science, using the right data structure for the task at hand is critical to compose the optimal program. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span>Lists, tuples and dictionaries are the key data structures in Python, they&rsquo;re designed to be very flexible and accomplish many things. Python doesn&rsquo;t implement all normal data structures like stacks, queues and sets right out of the box like a language like Java does, one can create them by using lists, tuples and dictionaries. See the link here for how this is done: </span><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://docs.python.org/3/tutorial/datastructures.html&amp;sa=D&amp;source=editors&amp;ust=1690745729735840&amp;usg=AOvVaw2pQt_NFWRG4gleGVVYBc7y">https://docs.python.org/3/tutorial/datastructures.html</a></span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Other important data structures to know about are:</span></p>
<p class="c0"><span class="c3"></span></p>
<ul class="c12 lst-kix_7jovr7v3kfl6-0 start">
   <li class="c2 c4 li-bullet-0"><span class="c3">Graph: a data structure where nodes are connected to each other using edges, representing relationships between each other. These can have a certain flowing direction called a DAG (Directed Acyclical Graph) or they can flow backwards. Edges can also have a weight, otherwise called an edge, which is a numerical value representing things such as length of a route, energy to move there, etc. </span></li>
   <li class="c2 c4 li-bullet-0"><span class="c3">Tree: a type of graph which represents a hierarchy, stemming from a root node and diverging into one or multiple nodes per level, depending on what type of specific tree structure is defined. These have a ton of use cases such as file systems, sorting and searching data, graphics processing and machine learning.</span></li>
   <li class="c2 c4 li-bullet-0"><span class="c3">Linked List: A data structure composed of nodes, similar to a normal array or list, where the nodes/elements of the list are aware of each other. In a normal LinkedList, nodes are only aware of the next node, and in Doubly LinkedList nodes are also aware of the previous node in the sequence. </span></li>
   <li class="c2 c4 li-bullet-0"><span class="c3">Hash Table: Also known as a hash map, this is a data structure that provides very speedy operations such as insertions, deletions and retrievals. It accomplishes this by hashing the key of a key value pair. </span></li>
</ul>
<p class="c0 c25"><span class="c3"></span></p>
<h3 class="c1" id="h.oybugvvmkphz"><span class="c5">Logic Flow</span></h3>
<p class="c2"><span class="c3">Logic Flow is the art of controlling how inputs flow through a system and using a series of actions based on decisions to arrive at an output. In python the basic building blocks for logic flow are if statements. Here&rsquo;s what they look like:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">If expression:</span></p>
<p class="c2"><span class="c3">&nbsp; statement</span></p>
<p class="c2"><span class="c3">Elif expression:</span></p>
<p class="c2"><span class="c3">&nbsp;statement</span></p>
<p class="c2"><span class="c3">Else:</span></p>
<p class="c2"><span class="c3">&nbsp;Statement</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">The elif and else are optional. You can also chain multiple elif statements in the same block if needed. If statements can be nested in other statements, creating a chain of decisions that the program can make. Here&rsquo;s what a nested if looks like:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">If expression:</span></p>
<p class="c2"><span class="c3">&nbsp; If other expression:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; Other Statement</span></p>
<p class="c2"><span class="c3">&nbsp; Else:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; Other Statement</span></p>
<p class="c2"><span class="c3">Else:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp;Statement</span></p>
<h4 class="c7" id="h.wmo4cr7a272s"><span class="c6">Comparison Operators</span></h4>
<p class="c2"><span class="c3">Like SQL, Python uses comparison operators as well as and/or statements to act as the decisions in our control flow. See the SQL section around filters to get a refresher. The only difference is SQL uses a single = symbol for equals, since Python uses = as the assignment operator, the comparison operator becomes == like so:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">If age == 50:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; print(&ldquo;I&rsquo;m 50 years old!&rdquo;)</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">You can also look up values in lists and tuples using the in and not in operators:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">If &lsquo;a&rsquo; in [&lsquo;x&rsquo;, &lsquo;y&rsquo;, &lsquo;z&rsquo;]:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; print(&lsquo;yep!&rsquo;) &nbsp;</span></p>
<h4 class="c7" id="h.l1sozt33tshi"><span class="c6">Guard Clauses</span></h4>
<p class="c2"><span class="c3">An if statement without an else is known as a guard clause. These can simplify your logic flow as often the else is redundant. They also can prevent an anti-pattern (something to avoid) called a pyramid of doom where you have a really messy and over complicated set of nested if statements. Once we learn about functions, we can start using guard clauses to return the function when a certain criteria is met. Here&rsquo;s an example of changing our code to use a guard clause to simplify.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3"># original</span></p>
<p class="c2"><span class="c3">If num &lt;&gt; 5:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; print(&ldquo;its 5 yo&rdquo;)</span></p>
<p class="c2"><span class="c3">Elif num == 5:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; print(&ldquo;its not 5 yo&rdquo;)</span></p>
<p class="c2"><span class="c3">Else:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; print(&ldquo;how did i even get here?&rdquo;)</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3"># using guard clause</span></p>
<p class="c2"><span class="c3">If num == 5:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; print(&ldquo;I got what I wanted, peace!&rdquo;)</span></p>
<h4 class="c7" id="h.58czexhohbzu"><span class="c6">Truthiness</span></h4>
<p class="c2"><span class="c3">Since we can compare values of different data types. Expressions that evaluate False are called Falsy and True ones are called Truthy. Empty values and None (null) values including empty lists and similar are Falsy. Also numbers that represent 0 are Falsy. Most other values will be Truthy. You can use the bool() function to determine truthiness. </span></p>
<h4 class="c7" id="h.emde0gshb46y"><span class="c6">Switch Statements</span></h4>
<p class="c2"><span class="c3">Switch statements have been recently added to Python and are a short hand way to represent multiple if/elif cases. Here&rsquo;s what they look like:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Match expression:</span></p>
<p class="c2"><span class="c3">&nbsp; Case pattern-1:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; Statement-1</span></p>
<p class="c2"><span class="c3">&nbsp; Case pattern-2:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; Statement-2</span></p>
<p class="c2"><span class="c3">&nbsp; Case _:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; Default statement # aka our else block</span></p>
<h3 class="c1" id="h.vzz4j1ltyvzd"><span class="c5">Iterating</span></h3>
<p class="c2"><span class="c3">With logic flow under your belt, you&rsquo;re already becoming a powerful programmer and can start writing code that actually does stuff! The other tool you&rsquo;ll need to start writing useful code is iterating, in Python the main way to do this is with for loops. There are also while loops but we&rsquo;ll ignore those for now as for loops are far more common and useful. Here&rsquo;s a basic for loop:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">My_list = [&lsquo;a&rsquo;, &lsquo;b&rsquo;, 5]</span></p>
<p class="c2"><span class="c3">For i in my_list:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp;print(str(i))</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Lists and tuples are common data structures that we iterate over. What this means is we take every item in the list and apply whatever action or multiple actions is listed inside the for block. </span></p>
<h4 class="c7" id="h.e77risgs75e5"><span class="c6">Ranges</span></h4>
<p class="c2"><span class="c3">Ranges allow the programmer to quickly create a range of values and are often used with iterating. Here&rsquo;s an example of a &ldquo;clock&rdquo; which sends an alert every 5 seconds. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">For x in range(1..120):</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; If </span></p>
<h4 class="c7" id="h.27sggcx51276"><span class="c6">Iterating Over Dictionaries</span></h4>
<p class="c2"><span class="c3">Lists are the most common data structure to iterate over in Python, but it&#39;s also possible to use dictionaries. You can access both the key and the value at the same time like so:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">for key, value in my_dict.items():</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; print(key, value)</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Keep in mind that Dictionaries are unordered by default in Python, if you want to guarantee the dictionary is sorted before you use it, use the data structure OrderedDict which is in the native library collections. OrderedDicts can mutate (change state) during use though. </span></p>
<h4 class="c7" id="h.e751tvrqmmqk"><span class="c6">Lambdas and List Comprehensions</span></h4>
<p class="c2"><span class="c3">These are both more functional approaches to problems. In practice they&rsquo;re syntactic sugar, you can do what they do with simple for statements. To keep things simple, this course will gloss over them but they are definitely worth revisiting in the future as you&rsquo;ll see some programmers use them frequently. </span></p>
<h3 class="c1" id="h.xpaihba18u9s"><span class="c5">Functions</span></h3>
<p class="c2"><span class="c3">Functions encapsulate certain repeatable functionality in our program. They give the ability for the programmer to define a set of inputs (called arguments), do some type of action or multiple actions, and then return an output or multiple outputs (this is optional but encouraged as a best practice). Here is what the syntax looks like for them:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Def myfunc(input1, input2):</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; If (input1 &gt; input2):</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; return &ldquo;yes!&rdquo;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">After a function gets defined, they&rsquo;re used by calling them, ie . You can set their returned value directly into a new variable and are encouraged to do so. As mentioned, functions should always have a return value, if you can&rsquo;t think of one to use, have it simply return True.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Input_a = 5</span></p>
<p class="c2"><span class="c3">Input_b = 3</span></p>
<p class="c2"><span class="c3">Answer = myfunc(input_a, input_b)</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span>Sometimes inputs to a function can be optional in Python. This has powerful implications as it allows one to dynamically pass in a list or dictionary of values, known as args and </span><span>kwargs</span><span class="c3">&nbsp;respectively. We won&rsquo;t go too far into these in the fundamentals course, but if you see an argument of a function with a * or ** in front of it, you&rsquo;ll know that&rsquo;s what happening. </span></p>
<h3 class="c1" id="h.uya15oitixmm"><span class="c5">Classes</span></h3>
<p class="c2"><span class="c3">Classes are the basis for all OOP (Object Oriented Programming) languages, Python included. Classes define what an object is and does which includes variables and functions (called method when inside a class), think of it as a template. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Class MyFirstClass:</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; </span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; def __init__(self):</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; self.data = []</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; Def method(self):</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; print(&ldquo;something!&rdquo;)</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">my_first_object = MyFirstClass() # this is how we create an object from a class!</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Notice that creating a class looks the same as calling a function, a class is basically just a function! This empowers Functional Programming in Python. Also note that the __init__ method is what holds the state of the object and what is created first. Other methods can update the state of the object. The &ldquo;self&rdquo; keyword references the state of an object, and is required when describing state in classes. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">OOP is a powerful paradigm that can make code more concise and extensible. Here are the core tenants of OOP:</span></p>
<p class="c0"><span class="c3"></span></p>
<ol class="c12 lst-kix_d6ezbczafj8g-0 start" start="1">
   <li class="c21 c4 li-bullet-0"><span class="c3">Encapsulation: data is bundled to the scope of the class, state is handled within the individual object and unexposed to the outside world.</span></li>
   <li class="c21 c4 li-bullet-0"><span class="c3">Inheritance: classes/objects can be created in a hierarchical fashion where child classes/objects inherit properties from their parents.</span></li>
   <li class="c21 c4 li-bullet-0"><span class="c3">Polymorphism: similar to inheritance, polymorphism means classes can inherit attributes and data from other classes, but might not necessarily have a parent/child relationship, they&rsquo;re more for describing a blueprint. Some languages like Java have interfaces which is how it implements polymorphism. </span></li>
   <li class="c4 c21 li-bullet-0"><span class="c3">Abstraction: objects not only hide data from the outside world, but also functionality.</span></li>
</ol>
<h3 class="c1" id="h.v6j2ztf1q6i"><span class="c5">Logging and Error Handling</span></h3>
<h3 class="c1" id="h.op6v7cuwvoog"><span class="c5">Installing and Importing Libraries</span></h3>
<p class="c2"><span class="c3">Creating your own libraries </span></p>
<h3 class="c1" id="h.ygpu1cchqkwb"><span class="c5">Pandas</span></h3>
<p class="c2"><span class="c3">Pandas is a library that provides support for manipulating and analyzing data, and is important to learn as a DE as it is commonly used in analysis and in conjunction with other data tools.</span></p>
</span><span class="course-bookmark" id="Java">
<h2 class="c9" id="h.sx1mbu46r6ks"><span class="c11">Java</span></h2>
<p class="c2"><span class="c3">Java has a reputation for being verbose and being the boring &ldquo;corporate&rdquo; language. However Java has shown through the years that it is very reliable, and the sheer amount of libraries and support Java has is second to none. Additionally, it gets regular releases that bring in changes to address complaints and pain points that its users face.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Since we learned the basics of programming in the Python section, this section will revolve more around the differences between Java and Python and when to use one over the other. In this course the examples will mostly be in Python and SQL, so feel free to jump past this topic for now if you want to get to the other DE subjects. &nbsp;</span></p>
<h3 class="c1" id="h.ipv2444nu4mw"><span class="c5">Object Oriented Programming</span></h3>
<p class="c2"><span class="c3">Python can implement Object Oriented Programming styles since it has support for classes and objects, however Java was created with OOP being the first choice of solving problems. In fact Java goes as far as to force one to use it, every piece of executed code must be done in the context of a class. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">class HelloWorld {</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; public static void main(String[] args) {</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; System.out.println(&quot;Hello, World!&quot;); </span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; }</span></p>
<p class="c2"><span class="c3">}</span></p>
<p class="c2"><span class="c3">(Notice how even just printing &ldquo;hello world&rdquo; requires writing a class to perform)</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">The four main principles of OOP are abstraction, encapsulation, polymorphism and inheritance.</span></p>
<h4 class="c7" id="h.1l9wti2vmd3d"><span class="c6">Abstraction</span></h4>
<p class="c2"><span class="c3">The first principle of OOP is abstraction which makes writing complex logic much simpler by hiding said logic, and only exposing their input and output. Java uses Abstract classes and Interfaces to achieve this. Abstract classes are classes that can&rsquo;t be instantiated (ie can&rsquo;t create objects) but inherited from. We&rsquo;ll learn more about inheritance in a bit. Interfaces are very similar, they force a class to use certain methods/functions by using the &ldquo;implement&rdquo; keyword, but don&rsquo;t actually do anything on their own. </span></p>
<h4 class="c7" id="h.sf9y408bqx69"><span class="c6">Encapsulation</span></h4>
<p class="c2"><span class="c3">Encapsulation describes how data and functions can be accessed in other points of the program. Java, like Python, implements this through classes, which we went into detail in the Python section. Objects don&rsquo;t expose their state (data) without having to use a Getter or Setter method. Modifiers (public, protected, private, default) describe where in the code base that other objects can interact with them. </span></p>
<h4 class="c7" id="h.6dhuo864ivji"><span class="c6">Polymorphism</span></h4>
<p class="c2"><span class="c3">Polymorphism is a feature that allows objects to be treated as other objects, as long as they share a parent class or interface. Java implements this principle in two ways, method overloading and overriding. Overloading is when there are multiple methods to the same class but they take in different parameters and can output different things from each other. Overriding is implemented with the @Override decorator and provides a different implementation of an inherited method from a parent class. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">class Calculator {</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; int add(int a, int b) {</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; return a + b;</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; }</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; double add(double a, double b) {</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; return a + b;</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; }</span></p>
<p class="c2"><span class="c3">}</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">public class Main {</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; public static void main(String[] args) {</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; Calculator calc = new Calculator();</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; int result1 = calc.add(5, 10); &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Calls the int add(int a, int b) method</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; double result2 = calc.add(2.5, 3.7); &nbsp; &nbsp; &nbsp; &nbsp;// Calls the double add(double a, double b) method</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; System.out.println(&quot;Result 1: &quot; + result1); // Output: Result 1: 15</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; &nbsp; &nbsp; System.out.println(&quot;Result 2: &quot; + result2); // Output: Result 2: 6.2</span></p>
<p class="c2"><span class="c3">&nbsp; &nbsp; }</span></p>
<p class="c2"><span class="c3">(Example of method overloading, the Calculator class has two methods both named add, but their return types are different therefore one can get different output based on the data type that gets input into it.)</span></p>
<h4 class="c7" id="h.ui9b0fe39keh"><span class="c6">Inheritance</span></h4>
<p class="c2"><span class="c3">The final OOP principle, inheritance, is perhaps the most important and powerful. Inheritance allows classes to have a hierarchy, allowing a programmer to structure their program around rules in this hierarchy. For instance, a &ldquo;car&rdquo; class might be a subclass to the parent &ldquo;vehicle&rdquo;. The parent class &ldquo;vehicle&rdquo; tells the &ldquo;car&rdquo; class it must define how many wheels it has and a &ldquo;move&rdquo; method to make it implement logic around how it moves on a road. </span></p>
<h4 class="c7" id="h.clrub6q806kr"><span class="c6">Gang of Four Design Patterns</span></h4>
<p class="c2"><span>With the four main principles listed above at our hands, we can start creating complex but useful design patterns based on them (the Gang of Four refers to the four publishers of the original book, not the four principles). We won&rsquo;t go into too much detail on these, but they&rsquo;re good to know as many OOP codebases will implement at least a few of them. Here is a site that lists all of them out: </span><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://www.javatpoint.com/gof-design-pattern-java&amp;sa=D&amp;source=editors&amp;ust=1690745729747061&amp;usg=AOvVaw2_sfqz1I1stT8OdS35rK1c">https://www.javatpoint.com/gof-design-pattern-java</a></span><span class="c3">. </span></p>
<h3 class="c1" id="h.adoq9u9jwaee"><span class="c5">Compiled</span></h3>
<p class="c2"><span class="c3">The conception of Java was based on creating a language that can run directly on any hardware and operating system, which was revolutionary at the time. The JVM (Java Virtual Machine) was created to serve this purpose. It translates the Java code that a programmer writes into a lower level and platform independent source code called Java Bytecode, serving as an intermediary. The Java Bytecode is then translated directly into machine readable code by the JVM which emulates a virtual environment and has access to the underlying computer&rsquo;s resources directly. This entire process is called Compiling/Compilation.</span></p>
<h4 class="c7" id="h.bqk2iru8vbof"><span class="c6">Performance</span></h4>
<p class="c2"><span class="c3">Since Java is a compiled language, Java code is often faster than Python as the compiler is able to optimize the code prior to the runtime execution of it. Python does compile into bytecode, but doesn&rsquo;t separate this step from the Interpreter, meaning it&#39;s compiling the code as it&#39;s being run. This is dependent on the implementation of Python, Python itself is just a spec, and multiple implementations of it exist such as CPython, Jython and IronPython. &nbsp;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">You&rsquo;ll often hear that Java is more performant than Python, but keep in mind this is an overgeneralization. It&rsquo;s very dependent on how the code is written, which compiler/VM is being used, and the individual task at hand. It&#39;s also important to keep in mind that Java requires code to be compiled which takes time on its own, so Python often has the edge in terms of speed of development. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span style="overflow: hidden; display: inline-block; margin: 0.00px 0.00px; border: 0.00px solid #000000; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px); width: 596.00px; height: 274.00px;"><img alt="" src="images/image22.png" style="width: 596.00px; height: 274.00px; margin-left: 0.00px; margin-top: 0.00px; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px);" title=""></span></p>
<p class="c2"><span class="c3">(example of performance benchmarks between Java and Python: while this is an extreme example, Java&rsquo;s Just-In Time compiler often gives it an edge in performance over Python.</span></p>
<p class="c2"><span class="c3">Source: https://medium.com/swlh/a-performance-comparison-between-c-java-and-python-df3890545f6d)</span></p>
<h4 class="c7" id="h.nlygty40opsl"><span class="c6">Memory Management</span></h4>
<p class="c2"><span class="c3">While both Python and Java have memory automatically managed by the Python Interpter/JVM, how they approach it is different. Specifically when it comes to how it does garbage collection (the deallocation of unreferenced memory in order to re-allocate it). Python uses an approach called reference counting, where it continually checks to see if the count of an object&rsquo;s reference/dereference is 0. GC is Java is more efficient and reliable as it uses an approach called CMS (concurrent mark sweep), which does everything in one big sweep, often in another thread. Java does use more memory to perform this action though, and there are more possibilities of out of memory and memory leak errors</span></p>
<h4 class="c7" id="h.hohev9qr36ut"><span class="c6">Multithreading and Concurrency</span></h4>
<p class="c2"><span class="c3">An important concept in computer science is concurrency. This is when two or more tasks run alongside each other (not to be confused with parallelism, which means they start at the same time, but can end at different times). Multithreading is a way to achieve concurrency, its when a set of instructions is given to a core of a multicore cpu to run a task separate from other tasks (although a single core can be split into multiple processes, so even a single core cpu can achieve multithreading). Multithreading, while allowing for faster processing as it distributes the workload to multiple workers, can cause certain problems like race conditions and deadlocks because the workers are running on the same hardware and have access to the same data. We won&rsquo;t go too in detail on these in this fundamental course, but be aware they exist. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Python, in its default interpreter CPython, doesn&rsquo;t truly implement multithreading because of a concept called GIL (Global Interpreter Lock). Think of this as a traffic cop, the traffic cop stops other traffic while one stream of traffic is happening. Python does have a multithreading library, but these aren&rsquo;t truly running concurrently. Java on the other hand is not limited by a GIL and can implement true multithreading. This gives Java another edge on performance if handled correctly. It is however often difficult to write good multithreaded code in Java, and it&#39;s often recommended to use an actor model based language built on the JVM such as Scala or Akka if a lot of multithreading is required in a program.</span></p>
<h3 class="c1" id="h.4hz8g18bgkue"><span class="c5">So is Java better than Python?</span></h3>
<p class="c2"><span class="c3">Well that&rsquo;s extremely subjective and comes down to the task at hand, as well as personal preference. It&#39;s always a good idea to pick the best tool for the job. This section has highlighted the advantages of Java, however Python is still more commonly chosen for Data Engineering tasks and for a reason. Python has a lesser learning curve and more native support for data analysis libraries. Even though Java code is usually more performant and less error prone due to its type system, Python&#39;s ability to get up and running quicker and to be more approachable to newer programmers have allowed it to reign as king in the data world. </span></p>
</span><span class="course-bookmark" id="Other Languages">
<h2 class="c9" id="h.10wzcsxzr1cy"><span class="c11">Other Languages</span></h2>
<p class="c2"><span class="c3">The following languages are important to at least be aware of, but there&rsquo;s a good chance you will only use these sparingly, or possibly not at all. I&rsquo;d recommend not spending too much time learning about them unless they sound interesting to you. Still, it&#39;s good to know what they are in case you have a good use for them. </span></p>
<h3 class="c1" id="h.q11lh5r56ra8"><span class="c5">Scala</span></h3>
<p class="c2"><span class="c3">Scala is another popular choice for Data Engineers. In fact a bunch of tools that DEs use daily are written in Scala including Spark and Kafka (and also some Java). Scala is generally compiled to Java bytecode and run on the JVM, making it as performant as Java. It seeks to fix the woes that Java creates such as verbosity and forced OOP. Scala can import Java libraries and work in conjunction with Java, since it will eventually be compiled altogether anyway. Unlike Java, Scala was designed to support functional programming features as first class. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Overall Scala is a good choice if one favors the reliability and performance of the JVM but wants a more concise language which offers support for the functional paradigm. Scala is less commonly known and used than Java, so hiring managers might have a tough time finding competent Scala programmers. However the transition between the two isn&rsquo;t particularly difficult and can be worth the tradeoff. </span></p>
<h4 class="c7" id="h.rq2nk9hkujdd"><span class="c6">Fun Fun Functions!</span></h4>
<p class="c2"><span class="c3">FP (Functional programming) is another approach/style to programming, Scala was designed with FP in mind as Java was designed with OOP. It&#39;s possible to use either approach in both language, but you&rsquo;ll get more benefit by using the style it was designed with in mind. Like OOP, FP has its own tenets:</span></p>
<p class="c0"><span class="c3"></span></p>
<ol class="c12 lst-kix_c05ahjvic05g-0 start" start="1">
   <li class="c2 c4 li-bullet-0"><span class="c3">Pure functions: functions on their own take in an input and produce the same consistent output. They do not change state or depend on the state of a program (say the state of an object). This is called referential transparency. </span></li>
   <li class="c2 c4 li-bullet-0"><span class="c3">Immutability: Data is not modified in a FP program, rather new data is created.</span></li>
   <li class="c2 c4 li-bullet-0"><span class="c3">First class functions: in FP, data and functions are the same thing. Meaning a function can be passed into or returned from another function. High order functions are functions that do one or the other of these. </span></li>
</ol>
<h4 class="c7" id="h.4mfryy63bjgk"><span class="c6">Currying</span></h4>
<p class="c2"><span class="c3">Currying is a FP feature that allows a function to take multiple arguments into a sequence of functions, as long as they all have a single argument. The first function won&#39;t fully apply all transformations until all parameters are eventually met. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">def add(x: Int)(y: Int): Int = x + y</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">val addOne = add(1) // returns a new function that takes one parameter</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">val result = addOne(2) // invokes the new function with the parameter</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">println(result) // prints 3</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">(notice how we can &ldquo;chain&rdquo; function calls by using currying, making functions very modular!)</span></p>
<h4 class="c7" id="h.v8dhoo3q3khw"><span class="c6">Immutability</span></h4>
<p class="c2"><span class="c3">Immutability is something that is supported by Scala, but not enforced (unlike &ldquo;pure&rdquo; functional programming languages such as Haskell). Immutability means that variables can&rsquo;t change state without creating a new variable. Since in an ideal functional world there will only ever be one input and one output for the entire program (not usually practical as things like I/O operations are almost always needed), immutability is a way to guarantee that state will not change over time via side effects. Monads are a data type in Scala and other FP languages that provide a more elegant approach to necessary side effects.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span style="overflow: hidden; display: inline-block; margin: 0.00px 0.00px; border: 0.00px solid #000000; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px); width: 624.00px; height: 412.00px;"><img alt="" src="images/image7.png" style="width: 624.00px; height: 412.00px; margin-left: 0.00px; margin-top: 0.00px; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px);" title=""></span></p>
<h4 class="c7" id="h.4vs954dalwvf"><span class="c6">Lazy Evaluation</span></h4>
<p class="c2"><span class="c3">Lazy Evaluation is another functional programming feature that Scala supports. To simplify what that means, Scala only computes values when necessary. An example is a data structure known as a LazyList, which is a list where the elements only get evaluated on demand when they are needed. It also only computes enough values as necessary, it won&rsquo;t keep calculating values if they are not needed further down the program, as determined by the compiler. Scala also keeps computed values in a lookup table to easily recompute when necessary. This can make certain algorithms very speedy and is a perfect match for recursion which is a common tool of a functional programmer. </span></p>
<h4 class="c7" id="h.rl2fano7nqvg"><span class="c6">Recursion</span></h4>
<p class="c2"><span class="c3">In the Python subsection we learned about iterating by using for and while loops. In a pure FP language, these concepts don&rsquo;t exist. Instead everything is done through recursion, which is when a function calls itself, and passes in a new, smaller input every call. Using this with immutability provides very elegant and concise code.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">def factorial(n: Int): Int = {</span></p>
<p class="c2"><span class="c3">&nbsp; if (n &lt;= 1)</span></p>
<p class="c2"><span class="c3">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1</span></p>
<p class="c2"><span class="c3">&nbsp; else</span></p>
<p class="c2"><span class="c3">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;n * factorial(n - 1)</span></p>
<p class="c2"><span class="c3">}</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">val result = factorial(5)</span></p>
<p class="c2"><span class="c3">println(result) // Output: 120</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">(notice how factorial will keep calling itself in the &ldquo;else&rdquo; statement until it reaches n = 1, in which case the result variable will finally be populated.)</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">https://books.underscore.io/scala-with-cats/scala-with-cats.html</span></p>
<h3 class="c1" id="h.j8oph9bkuz49"><span class="c5">Shell</span></h3>
<p class="c2"><span class="c3">While shell will not be your primary language, it&#39;s important to pick up the basics of it as you will at some point need it to interact with your servers. A lot of times your databases and data processing tools will be installed on a unix/linux box, or a cluster of them. You&rsquo;ll need to have some basic scripting skills to assist in installing and supporting them. Also once we&rsquo;ll discuss infrastructure code later on, and shell scripts are often used as glue code here to connect functionality where your IaaS platform doesn&rsquo;t provide it. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">The purpose of shell is to provide a way to interact with the underlying operating system. Things such as managing files, processes and devices are all done by the kernel of the OS, and shell provides a way for you to interact with the kernel. There are different types of shells, all providing different features and slightly different ways of doing things. The most commonly used one is probably bash, although zsh and fish are more recent and very common too. It&#39;s also a good idea to get an idea of the linux distribution you&rsquo;re working with, as certain features are supported in some and not others. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Shell scripting provides basic imperative code syntax. It allows you to do things such as arithmetic, iterating and logic flow. Pipes are used to connect outputs as input to other commands by using the character &ldquo;|&rdquo;. This allows you to chain multiple commands together without having to issue a new statement or line of code in your script. Be careful not to write very cryptic code. You&rsquo;ll want to add a lot of comments in your shell scripts as it&#39;s often hard for others to read and understand how it works. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Commands are what you issue the shell to do certain things. Most take in arguments. They also use flags like so to specify what type of argument you&rsquo;re passing in: &ldquo;&ndash;recursive&rdquo;. Some have shorthand syntax like so: &ldquo;-R&rdquo;. Using the man command will tell you what these all do for a given command. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Here are some common commands to become familiar with:</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Cd: changes your directory to what you specify</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Chmod: changes permissions of a file or directory</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Cp: copies a file or directory to another location. Mv moves it without copying. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Df/du: provides info on the system&rsquo;s disk usage and free space</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Echo: Displays a line of text</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">LS: lists files and directories under your current directory.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Man: Pulls up the manual for a certain command. You would use it like &ldquo;man ps&rdquo;</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Mkdir: creates a new directory</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">PS: lists all running processes</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">PWD: prints the working directory aka which directory/folder you are currently in.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Sudo: Superuser do, this allows you to run commands with elevated privileges. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Vi/Vim/Emacs/Nano: These are all text editors which allow you to . Not all distributions come with them right out the box, but you should be able to install your favorite one on any modern distribution. Matlab is a proprietary technology while R is open source. </span></p>
<h3 class="c1" id="h.e6kntuxduzw9"><span class="c5">Matlab and R</span></h3>
<p class="c2"><span class="c3">Both Matlab and R were designed to support mathematical functions firsthand. Matlab is a proprietary language which focuses on numerical computations and matrix operations, while R is an open source technology more well suited for statistical analysis.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">We won&rsquo;t go further into these in the fundamentals course. One should be aware that they are used in Data Analysis and Science but aren&rsquo;t super common since general purpose languages like Python have libraries that accomplish similar goals. Typically as a DE you&rsquo;ll play more of a support role for these and help the analyst or scientist deploy and optimize their Matlab or R code.</span></p>
<h3 class="c1" id="h.p1ipjs6a0kam"><span class="c5">More Languages</span></h3>
<p class="c2"><span class="c3">While SQL, Python, Java, Scala, Shell, Matlab and R are the most common languages a DE will encounter, here are a list of other languages to be aware of in case they come up:</span></p>
<p class="c0"><span class="c3"></span></p>
<ul class="c12 lst-kix_dgklcad1ntb8-0 start">
   <li class="c2 c4 li-bullet-0"><span class="c3">C and C++: These lower level languages are good for writing efficient algorithms. As a DE you&rsquo;ll probably never use them directly, but other C or C++ services in your company might create or use the data that you maintain. </span></li>
   <li class="c2 c4 li-bullet-0"><span class="c3">C#: This language is very similar to Java, and might be chosen over it by companies that favor Windows environments (sometimes called Windows shops). The common language runtime ie what translates the language into runnable code for the computer is called .NET (Pronounced DotNet) in C#, you can think of this as similar to Java&rsquo;s JVM but isn&rsquo;t virtualized since it has bindings directly to the Windows Kernel. .NET supports other languages like VisualBasic and F#. A lot of Big Data frameworks like Spark have APIs for .NET these days, so its usage in DE has been increasing.</span></li>
   <li class="c2 c4 li-bullet-0"><span class="c3">JavaScript: originally designed to be the glue language of the web, JavaScript has taken the Software Engineering industry by storm in the past decade or so and is now used for just about everything. With the inception of NodeJS JavaScript can be run server side and be run on any hardware that has the resources to run its interpreter. A DE might choose this language to build certain tooling and possibly even data pipelines. It might also be used in collecting data such as web scraping. </span></li>
   <li class="c2 c4 li-bullet-0"><span class="c3">Go: short for Golang, this language was developed and then open sourced by Google. It&#39;s typically used for lower level type tasks as it runs closer to the bare metal of hardware than higher level languages like Python. This can give a DE more flexibility in designing data pipelines that are very efficient. It&#39;s still relatively a new player so may see more use in data engineering in the near future.</span></li>
   <li class="c2 c4 li-bullet-0"><span class="c3">Kotlin: this language was created by JetBrains and is meant to be a more concise and approachable flavor of Java and also provide better support for the functional paradigm. It runs on the JVM and has access to Java libraries, meaning you can basically use it anywhere where you&rsquo;d use Java. The Kotlin team has been good with building and maintaining their own APIs to Big Data frameworks like Spark to make it even easier to use as a DE. JetBrains ships it with all of their popular IDEs so it can be a good choice if you&rsquo;re a fan of their tooling. </span></li>
   <li class="c2 c4 li-bullet-0"><span class="c3">Powershell: if you&rsquo;re working in a Windows environment, this will be used over shell. It&#39;s a more modern scripting language, but the syntax is a little different than a lot of other languages and takes time getting used to.</span></li>
   <li class="c2 c4 li-bullet-0"><span>Solidity and other smart contract languages: with Blockchain technologies becoming very relevant and more widespread in the last few years, it&#39;s important to be aware of the programming languages designed to directly interface and manipulate them. Solidity was created to be the primary language of the Ethereum Blockchain and uses &ldquo;smart contracts&rdquo; which are a reactive set of instructions when certain events occur on the blockchain. In the DE field Blockchains are having increasing use cases (albeit very specific ones) such as data governance and integrity, data sharing and collaboration, and monetization. Blockchains tend to be used with smaller data points, so they don&rsquo;t scale well with the Big Data technologies that a DE normally uses. </span></li>
</ul>
</span><span class="course-bookmark" id="Section Recap">
<h2 class="c9" id="h.5i7y12mu1pic"><span class="c11">Section Recap</span></h2>
<p class="c2"><span class="c3">Feeling overwhelmed? Try not to! Programming is an important tool for a DE, however most of what we do with it is relatively basic. Once you get the basics down you will have the ability to work in most DE codebases. A lot of times a DE uses a language mostly as a way to interact with the API provided by tools such as Spark. So it&#39;s often more important to understand how these tools and APIs work rather than being a pro at a specific language. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c22 c14">Personal Anecdote</span><span class="c14 c17">: An important tip I&rsquo;ve gotten in my career and try to use at every new job is to read the codebase. Become familiar with how the flow of the logic works. Try to understand how everything works at a high level while having the ability to dig down at a more micro level when necessary. If you don&rsquo;t understand how something works or why a choice was made, ask! This will set you up to be able to contribute to the code directly, even if it&#39;s just small bug fixes. At some point the codebase will probably be refactored, so knowing how it works in detail will set you up for being able to recreate its functionality.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c3">Our next section we&rsquo;ll dive into ingesting data, as this is usually the starting point for building data pipelines. From there we&rsquo;ll move onto how to store and transform this data, and then eventually how to use it. Having some programming skills will be useful in doing hands-on work related to these topics. </span></p>
<h3 class="c1" id="h.gvj00rfalo5p"><span class="c5">Additional Resources</span></h3>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://www.w3schools.com/&amp;sa=D&amp;source=editors&amp;ust=1690745729755926&amp;usg=AOvVaw2SidoVhNT4FSP2h1wLkXsH">https://www.w3schools.com/</a></span><span class="c3">&nbsp;this is a great resource for learning the basics of every language covered above as well as others. It&#39;s a nice reference point even once you&rsquo;ve tackled the basics. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://sqlzoo.net/wiki/SQL_Tutorial&amp;sa=D&amp;source=editors&amp;ust=1690745729756370&amp;usg=AOvVaw2LqvRZ0kvb4fufd243W0Sb">https://sqlzoo.net/wiki/SQL_Tutorial</a></span><span class="c3">&nbsp;this site is a good way to continue practicing sql problems. It provides a textbox where you can submit your sql code and see the result, and will even give you the answer if you can&rsquo;t figure it out.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://www.learnpython.org/&amp;sa=D&amp;source=editors&amp;ust=1690745729756642&amp;usg=AOvVaw33inFdZNesSn3A4F1iwBqJ">https://www.learnpython.org/</a></span><span class="c3">&nbsp;similar to SQLZoo, this site provides a python shell where you can submit your code to see if it&#39;s correct or not. The site also offers a tutorial for Java, Scala, SQL and Shell</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://leetcode.com/&amp;sa=D&amp;source=editors&amp;ust=1690745729756900&amp;usg=AOvVaw274vyT_goDTBLv4vZDRK4i">https://leetcode.com/</a></span><span class="c3">&nbsp;once you start feeling strong in a certain language, head over to leetcode to start solving problems. These problems are commonly used in junior programming interviews, so the more you&rsquo;re able to solve and memorize the higher your chances of getting hired will be. Don&rsquo;t stress too hard on them though, as long as you can talk through a problem the interviewer will generally show leniency even if you don&rsquo;t come to the most efficient solution.</span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://www.geeksforgeeks.org/sorting-algorithms/&amp;sa=D&amp;source=editors&amp;ust=1690745729757212&amp;usg=AOvVaw2ZnjGBFr1_B8txz9XF9Aqw">https://www.geeksforgeeks.org/sorting-algorithms/</a></span><span class="c3">&nbsp;Sorting algorithms are important to understand as these are also commonly seen in interviews. Usually if you get a firm understanding of at least one of them, you&rsquo;re good to go. On the job this knowledge will mostly be useless though as all major languages have libraries that implement them. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-o-notation&amp;sa=D&amp;source=editors&amp;ust=1690745729757612&amp;usg=AOvVaw32evMMvfbrVhTOyI0FY64u">https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-o-notation</a></span><span class="c3">&nbsp;Big-O notation is important to understand as it is commonly brought up in interviews. Once you have a decent understanding of data structures and algorithms, you should learn how to use Big-O notation to describe how efficient an algorithm is. </span></p>
<p class="c0"><span class="c3"></span></p>
<p class="c2"><span class="c15"><a class="c8" href="https://www.google.com/url?q=https://www.garfieldtech.com/blog/language-tradeoffs&amp;sa=D&amp;source=editors&amp;ust=1690745729757978&amp;usg=AOvVaw0PS8oSoWPx4vTi1fqMNIrh">https://www.garfieldtech.com/blog/language-tradeoffs</a></span><span class="c3">&nbsp;this blogpost provides more insight into different programming paradigms and their tradeoffs. </span></p>
<p class="c0"><span class="c3"></span></p>
</span>
`

export default freeContent;