Programming Languages Data Scientists Must Know

Becoming a data scientist does not require an IT background. From the networks I follow on LinkedIn, many people have changed careers from various backgrounds to become data scientists. And that’s only natural, considering that companies around the world are flocking to the digital industry. In early January 2023, I tried to transition from designer to data scientist.

Alright, let’s get straight to the topic. Here are the programming languages that data scientists need to know. What are they?

Python

Python, created by Guido van Rossum, is a high-level programming language with built-in high-level data structures. It is simple, clear, and logical, yet effective for object-oriented programming (OOP).

It is recommended to install Python 3.10 or higher to get full support. Python is a multi-paradigm programming language, and it also supports a functional style similar to Lisp through features such as filter, map, reduce (found in the functools module in Python 3), sets, and generator expressions.
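The functional features named above can be sketched in a few lines. This is only a minimal illustration: the list `readings` and the variable names are made up for the example, and nothing beyond the standard library is assumed.

```python
from functools import reduce  # reduce lives in functools in Python 3

# Made-up sample data, only for illustration.
readings = [3, 8, 8, 15, 4, 15, 23]

# filter: keep only the even values.
evens = list(filter(lambda x: x % 2 == 0, readings))

# map: square every value.
squares = list(map(lambda x: x * x, readings))

# reduce: fold the whole list into a single sum.
total = reduce(lambda acc, x: acc + x, readings, 0)

# set: drop duplicate values.
unique = set(readings)

# generator expression: values are produced lazily and consumed by sum().
sum_of_odd_squares = sum(x * x for x in readings if x % 2 == 1)

print(evens)
print(squares)
print(total)
print(sorted(unique))
print(sum_of_odd_squares)
```

In everyday Python, list comprehensions and the built-in sum() are often preferred over map, filter, and reduce, but the functional versions are handy to recognize when reading other people's code.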

Python also offers a wide range of libraries for data science needs, for example pandas, numpy, scipy, scikit-learn, matplotlib, seaborn, and many more. You can try Python and practice your programming skills every day.

python
Python 3.10.9 (main, Dec  6 2022, 18:44:57) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print("Hello World")
Hello World

SQL

SQL (pronounced: sequel), or Structured Query Language, is a language for accessing data in relational database management systems (RDBMS). Almost all relational database servers use SQL for data management.

Data scientists will inevitably deal with databases: creating them, processing the data they hold, and translating that data into something more understandable. This is necessary for informing stakeholder decisions.
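The tasks described above (creating a database, processing data, and summarizing it) can be sketched with Python's built-in sqlite3 module. Note the assumptions: SQLite is swapped in here only because it needs no server, the `sales` table and its rows are invented for illustration, and the SQL itself is the part that carries over to other database servers.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table (hypothetical schema, for illustration only).
cur.execute(
    "CREATE TABLE sales (product TEXT, quantity INTEGER, unit_price REAL)"
)

# Insert a few made-up rows.
rows = [
    ("notebook", 3, 2.5),
    ("pen", 10, 0.5),
    ("notebook", 1, 2.5),
    ("stapler", 2, 6.0),
]
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
conn.commit()

# Summarize: total revenue per product, highest first.
cur.execute(
    """
    SELECT product, SUM(quantity * unit_price) AS revenue
    FROM sales
    GROUP BY product
    ORDER BY revenue DESC
    """
)
summary = cur.fetchall()

for product, revenue in summary:
    print(f"{product}: {revenue:.2f}")

conn.close()
```

A grouped summary like this is usually easier for stakeholders to read than the raw rows.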

For SQL, I use MariaDB (a fork of MySQL) for data management because it is developed in the open and is not tied to a proprietary licensed product (Oracle). MariaDB also has more storage engines than MySQL.

sudo mysql -u root -p
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 10.6.11-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| hervymart          |
| hervypraktek       |
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
6 rows in set (0,097 sec)

MariaDB [(none)]>

R

The R programming language is best known as a language for statistics and graphical visualization. Created by Ross Ihaka and Robert Gentleman at the University of Auckland, it is now developed by the R Core Team.

The R language, released under the GNU GPL license, has become the de facto standard among statisticians for developing statistical software, and is widely used for data analysis.

The R language can be used with Jupyter or RStudio (from Posit) to make it easier for data scientists to process data.

R
R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> print("HelloWorld")
[1] "HelloWorld"

Julia

The Julia programming language is one of the languages used for efficient numerical analysis and data visualization. Similar to Python, Julia is a high-level language with syntax that is relatively easy for beginners.

Julia is arguably Python’s competitor, as it is often much faster than pure Python. This is because Julia uses the LLVM framework for just-in-time (JIT) compilation, which can bring its speed close to that of C. In addition, Julia can be integrated with Vim, Jupyter, and VS Code (through the Julia extension). Julia also has Julia Packages for the libraries data scientists need, and interestingly, Julia has FluxML, a library specifically for machine learning.

julia
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.5 (2023-01-08)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

julia> print("Hello World")
Hello World
julia>

Scala

Scala (Scalable language) was started in 2001 by Martin Odersky and is a general-purpose, high-level, multi-paradigm programming language.

Scala, like Java, is an object-oriented, type-safe programming language, and it also supports functional programming. Scala runs on the Java platform (Java virtual machine) and is compatible with existing Java programs.

object HelloWorld extends App {
  println("Hello, World!")
}

Compiling:

scalac HelloWorld.scala

Running the compiled program:

scala HelloWorld
Hello, World!

Conclusion

Regardless of the language used, tailor your choice to the needs and characteristics of the data at hand for greater effectiveness. The programming examples above are organized by subheading, starting with the easiest and most common languages, making them easier for beginners to learn.

What is the best language?

Any programming language can be the best one as long as it meets your needs and specifications, because each language has its own advantages and disadvantages. Personally, I recommend not focusing on the language itself but first understanding the basic concepts of programming and algorithms, which matters more.