Statistics is the discipline that studies the collection, organization, display, analysis, interpretation, and presentation of data. It is a branch of mathematics and a recommended prerequisite for data science and machine learning. Statistics is a very broad field, but in this section we will focus only on the most relevant parts. After completing this challenge, you may go on to the web development, data analysis, machine learning, or data science path. Whichever path you follow, at some point in your career you will receive data to work on. Having some statistical knowledge will help you make decisions based on data; as the saying goes, data tells.
What is data? Data is any set of characters that is gathered and translated for some purpose, usually analysis. It can be any character, including text and numbers, pictures, sound, or video. If data is not put into context, it doesn't make any sense to a human or a computer. To make sense of data, we need to work on it using different tools.
The workflow of data analysis, data science, or machine learning starts with data. Data can be provided by a data source or it can be created. Data can be structured or unstructured.
Data can come in small or large volumes. Most of the data types we will encounter have been covered in the file handling section.
The Python statistics module provides functions for calculating mathematical statistics of numerical data. The module is not intended to be a competitor to third-party libraries such as NumPy and SciPy, or to proprietary full-featured statistics packages aimed at professional statisticians, such as Minitab, SAS, and MATLAB. It is aimed at the level of graphing and scientific calculators.
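As a small illustration of the kind of calculator-level work the statistics module is meant for, here is a minimal sketch. The `ages` list is made-up sample data used only for this example.

```python
import statistics

# Hypothetical sample data, chosen only for illustration
ages = [22, 25, 25, 27, 30, 31, 35]

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequent value
spread = statistics.stdev(ages)       # sample standard deviation

print('Mean:', round(mean_age, 2))
print('Median:', median_age)
print('Mode:', mode_age)
print('Standard deviation:', round(spread, 2))
```

All four functions ship with the standard library, so nothing needs to be installed to try this.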
In the first section we described Python as a great general-purpose programming language on its own, but with the help of other popular libraries (NumPy, SciPy, Matplotlib, pandas, etc.) it becomes a powerful environment for scientific computing.
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with arrays.
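To get a first feel for the multidimensional array object, here is a minimal sketch. It assumes NumPy is already installed, and the values in `matrix` are made up for illustration.

```python
import numpy as np

# A 2-dimensional array (2 rows, 3 columns) built from a nested list;
# the values are hypothetical sample data
matrix = np.array([[1, 2, 3], [4, 5, 6]])

shape = matrix.shape                # (rows, columns)
doubled = matrix * 2                # arithmetic applies to every element
column_means = matrix.mean(axis=0)  # average of each column

print('Shape:', shape)
print('Doubled:', doubled.tolist())
print('Column means:', column_means.tolist())
```

Note that `matrix * 2` multiplies every element without an explicit loop, which is one of the conveniences arrays offer over plain Python lists.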
So far, we have been using VS Code, but from now on I would recommend using Jupyter Notebook. To access Jupyter Notebook, let's install Anaconda. If you are using Anaconda, most of the common packages are included, so you don't have to install them separately.
🎉 CONGRATULATIONS ! 🎉