Setting Up a Data Science Environment on Linux
A quick guide to setting up a Data Science environment on Linux.
This article is a quick guide showing how to set up a minimal Linux environment for your Data Science projects.
You should be familiar with the most common bash commands for navigating your file structure and installing programs.
Here is what I will cover in this guide:
- The terminal with zsh
- The Python programming language
- The conda virtual environment
- The VS Code editor
- The git and GitHub version control
- The Jupyter notebook
- The Cookiecutter Data Science project organization
Then, I will conclude with a simple and typical workflow to follow when starting a new Data Science project.
The terminal with zsh
A terminal is used to interact with your computer via text.
The shell we’re going to set up is zsh, together with the Oh My Zsh framework for managing its configuration.
To install the Oh My Zsh framework, you first have to install zsh itself:
sudo apt install zsh
You have to exit out of your terminal and open it up again for the changes to take effect.
To verify the zsh installation, simply run :
zsh --version
Then, to install Oh My Zsh, copy and paste the following command into your terminal:
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
Exit out of the terminal and open it back up again. We’re done !
The Python programming language
If you’re here, you already know some Python, so this part will be short.
Here, I recommend installing a recent version of Python using Miniconda.
Go to the Miniconda home page and download either the “Miniconda3 Linux 64-bit” or “Miniconda3 Linux 32-bit” installer for Python 3.8.
Open up a terminal and navigate to where you downloaded the file.
Then run the installer (the file name below is the 64-bit one; use the name of the file you downloaded):
bash Miniconda3-latest-Linux-x86_64.sh
Close your terminal and re-open, then run :
conda env list
You should get a short list printed out.
Finally, check which Python is your default:
which python
You should see a path printed out that has “miniconda” in it. We’re done !
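The same check can be made from inside Python. The following is only a minimal sketch: the function name describe_interpreter is hypothetical, and the test for the substring "miniconda" in the path is an assumption that only holds if you kept the default install directory.

```python
import sys


def describe_interpreter():
    """Return basic facts about the Python interpreter running this script."""
    executable = sys.executable or ""
    return {
        "executable": executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Assumption: a default Miniconda install lives in a directory whose
        # name contains "miniconda"; a custom install prefix will not match.
        "looks_like_miniconda": "miniconda" in executable.lower(),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

If looks_like_miniconda is False even though which python shows a Miniconda path, the script was probably started with a different interpreter than your default one.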
The conda virtual environment
A good Data Science practice is to use virtual environments, which are basically separate installations of Python, each with its own set of packages.
This means each project you work on can have its own set of packages, and you don’t have to worry about your projects having conflicting package requirements.
Using conda to create a virtual environment :
Run this command to create a new conda environment named “ds-projects” with pandas installed :
conda create -n ds-projects pandas
If you run conda env list again, you should now see “ds-projects” listed.
To activate your conda virtual environment, simply run:
conda activate ds-projects
You are now using a new virtual environment to work in.
From here, you can install a new package (seaborn, for example) with the command:
conda install seaborn
To deactivate your conda environment :
conda deactivate
Finally, you can remove your virtual environment by running:
conda remove --name ds-projects --all
It’s a best practice to create a new virtual environment for each project that you work on.
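When you juggle several environments, it is easy to forget which one is active. Here is a small sketch of how a script could check this for itself. It relies on the CONDA_DEFAULT_ENV variable that conda sets on activation; the function name active_conda_env is hypothetical.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None.

    conda sets the CONDA_DEFAULT_ENV variable when an environment is
    activated; if it is absent or empty, no environment is active.
    """
    if environ is None:
        environ = os.environ
    return environ.get("CONDA_DEFAULT_ENV") or None


if __name__ == "__main__":
    name = active_conda_env()
    if name is None:
        print("No conda environment is active.")
    else:
        print(f"Active conda environment: {name}")
```

After conda activate ds-projects, running this script should report ds-projects. In a fresh terminal it will typically report base or no environment at all, depending on how conda was initialized.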
The VS Code editor
The VS Code editor is popular among data scientists and developers. Right out of the box, it has the features you would expect, such as multi-line select, an integrated terminal, and debugging tools. You can use extensions to make it as powerful as you need it to be.
Installing VS Code is very easy. Simply go to the VS Code website, click “Download”, and install using the instructions.
To use it, open a terminal in your project directory and type:
code .
Find the “extensions” icon on the left-hand bar and click it. You should see a search bar where you can search for extensions.
Search for a Python extension (you can just search “python”), and install it.
If you’re new to VS Code, I suggest familiarizing yourself with it by creating a small file structure and a Hello World Python program that you run in the integrated terminal.
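A Hello World program for this exercise could look like the sketch below. The file name hello.py and the function name greet are only suggestions.

```python
def greet(name="World"):
    """Build the greeting so it can be reused or tested separately."""
    return f"Hello, {name}!"


if __name__ == "__main__":
    print(greet())
```

Save it as hello.py, open the integrated terminal in VS Code, and run python hello.py to see the greeting printed.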
The git and GitHub version control
A best practice is to create a new git “repository” for each project you work on.
A git repo is simply where all of your code for a project lives.
To install git, simply type:
sudo apt install git-all
Exit your terminal and open a new one.
Then, run git --version to ensure that the install worked correctly.
In your terminal, run these two lines to configure your git for the first time so that git knows who you are :
git config --global user.name "Your Name Here"
git config --global user.email youremail@example.com
Create a project with some code to use with our first git repo.
For example, create a file structure like this:
README.md
.gitignore
test_script.py
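If you prefer to create these three files from a script rather than by hand, the sketch below shows one way to do it. The starter contents (including the .gitignore entries) and the function name create_starter_project are assumptions made for illustration, not something this guide prescribes.

```python
import tempfile
from pathlib import Path

# Assumption: starter contents chosen for illustration only.
STARTER_FILES = {
    "README.md": "# My first project\n",
    ".gitignore": "__pycache__/\n.ipynb_checkpoints/\n",
    "test_script.py": 'print("Hello from test_script.py")\n',
}


def create_starter_project(project_dir):
    """Create the three starter files in project_dir, keeping existing files."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for name, content in STARTER_FILES.items():
        path = project_dir / name
        if not path.exists():  # never overwrite work that is already there
            path.write_text(content, encoding="utf-8")
            created.append(name)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        print(create_starter_project(Path(tmp) / "my-first-repo"))
```

To use it for real, call create_starter_project with the path of your own project directory instead of the temporary one.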
Initialize a git repo for your project, and take a snapshot (“commit”) of your code.
Run git init in your terminal to initialize a git repository in your project directory.
Run git add . in your terminal, which adds all of the files in the current directory to the “staging” area of git.
Run git commit -m "First commit." in your terminal to record the snapshot.
GitHub Usage Instructions :
Create a GitHub account.
Create a repo on GitHub.
Push your code to GitHub.
Your code and commits are now safely stored on GitHub.
The Jupyter notebook
Jupyter notebooks have become incredibly popular with data scientists over the last few years, and for good reason: they’re a great way to analyze data, run some experiments, and document your results in a way that others can follow along with. With notebooks, you create individual cells where you can either write and run Python code, or write Markdown to document your findings.
Create a new conda environment with Jupyter installed, and activate it.
In your terminal, run :
jupyter notebook
You should see some text printed to the console telling you that the Jupyter Notebook server is starting up and running.
Create a new notebook and play around with it. We’re almost done !
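To make the idea of code cells and Markdown cells concrete, it helps to know that a notebook file is plain JSON on disk. The sketch below builds a minimal two-cell notebook as a Python dict. It assumes the version 4 notebook layout as I understand it; the function name build_minimal_notebook and the cell contents are made up for illustration.

```python
import json


def build_minimal_notebook():
    """Return a dict describing a notebook with one Markdown and one code cell."""
    markdown_cell = {
        "cell_type": "markdown",
        "metadata": {},
        "source": ["# Findings\n", "Notes about the experiment go here."],
    }
    code_cell = {
        "cell_type": "code",
        "execution_count": None,  # the cell has not been run yet
        "metadata": {},
        "outputs": [],
        "source": ["print('Hello from a notebook cell')"],
    }
    return {
        "cells": [markdown_cell, code_cell],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


if __name__ == "__main__":
    print(json.dumps(build_minimal_notebook(), indent=1))
```

In practice you will create notebooks from the Jupyter interface rather than by hand; the point here is only that each cell carries a type (code or markdown) and its source text.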
The Cookiecutter Data Science project organization
Cookiecutter Data Science is essentially a template project directory that you can use when you start new data science projects.
A good suggestion for starting out is to use only the main directories and scripts that you need.
A good starting place is to only use the “data”, “notebooks”, and “src” folders.
To install via conda, run this command in your terminal :
conda install -c conda-forge cookiecutter
Use cookiecutter to download and create a data science project template :
cookiecutter https://github.com/drivendata/cookiecutter-data-science
Delete anything you don’t want.
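Before deleting anything, you may want to see what would go. The sketch below is a dry run that lists the top-level entries of a project that are not on a keep list. The keep list (the three folders mentioned above plus README.md, .gitignore, and .git) and the function name entries_to_delete are assumptions; adjust them to your own needs.

```python
import tempfile
from pathlib import Path

# Assumption: the folders named in the text plus a few entries most projects keep.
DEFAULT_KEEP = frozenset(
    {"data", "notebooks", "src", "README.md", ".gitignore", ".git"}
)


def entries_to_delete(project_dir, keep=DEFAULT_KEEP):
    """List top-level entries of project_dir that are not in keep.

    This is a dry run: nothing is deleted, the names are only reported.
    """
    return sorted(
        entry.name for entry in Path(project_dir).iterdir() if entry.name not in keep
    )


if __name__ == "__main__":
    # Build a small fake project in a temporary directory for demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in ("data", "notebooks", "src", "docs", "models"):
            (Path(tmp) / folder).mkdir()
        print(entries_to_delete(tmp))  # prints ['docs', 'models']
```

Because the function only reports names, you stay in control of the actual deletion.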
A typical workflow for a Data Science project
Open up your terminal.
Create a new conda environment for your project :
conda create -n my-project-env pandas jupyter scikit-learn matplotlib seaborn
conda activate my-project-env
Create a new project directory using cookiecutter :
cookiecutter https://github.com/drivendata/cookiecutter-data-science
Open up your new project directory in VS Code :
code my-project-directory
Open up a terminal in VS Code, initialize a new git repo, and take a first snapshot:
git init
git add .
git commit -m "First commit."
Create a new repo on GitHub, then follow the instructions to push your code from the project directory on your computer to that repo:
git remote add origin https://www.github.com/yourname/your-repo-name.git
git push origin master
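Before you start working, it can be worth confirming that the packages requested in the conda create step are importable from the active environment. The sketch below is one way to do that. The function name missing_packages is hypothetical, and the import names are my assumption of how those packages are imported (scikit-learn as sklearn, and notebook standing in for the jupyter package).

```python
import importlib.util

# Assumption: import names for the packages requested in the conda create step.
# scikit-learn is imported as "sklearn"; "notebook" stands in for jupyter.
WORKFLOW_IMPORTS = ("pandas", "notebook", "sklearn", "matplotlib", "seaborn")


def missing_packages(import_names=WORKFLOW_IMPORTS):
    """Return the import names that cannot be found in the current environment."""
    return [
        name for name in import_names if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All workflow packages are available.")
```

If anything is reported as missing, check that my-project-env is the active environment before installing the package again.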
And now you’re ready to go for your next data science project.
Happy coding!