Information Management for Data Science (305-0-20)

Instructors

Emre Besler

Meeting Info

Frances Searle Building 1421: Mon, Wed, Fri 10:00AM - 10:50AM

Overview of class

The Information Management for Data Science course aims to give students an extensive skill set for loading, cleaning, processing, storing, and using data from various sources. Starting with the main libraries and data structures in Python, it moves on to more advanced techniques for obtaining data: scraping HTML text from websites using CSS selectors and XPath, and interacting with Application Programming Interfaces (APIs) that return JavaScript Object Notation (JSON) data, using the corresponding libraries. Students are expected to have fundamental Python skills from STAT 303-1 or CS 110. The course then moves on to relational databases and how to store data in them and retrieve data from them using Structured Query Language (SQL). Students are not expected to have any prior knowledge of SQL; it will be introduced from scratch and applied during the lectures. Once a working understanding of SQL is established, database design will be the last main topic of the course.

Registration Requirements

Prerequisites: STAT 201-0 or COMP_SCI 110-0 and STAT 202-0 or STAT 210-0 or STAT 232-0

Enrollment in this course is restricted to Data Science majors

Learning Objectives

At the completion of this course, students should be able to:

Identify parts of the data that are misleading, wrong, irrelevant, or redundant for the task at hand, and process the loaded dataset accordingly.
Create new variables from the data they have, in a new data type if necessary.
Visualize the data in an interactive and visually aesthetic manner.
Scrape different types of data from online sources and process it for further analysis.
Write SQL queries to obtain data that is spread across multiple relational databases.
Obtain data from a mobile application or a website and process it for numerical analysis.
Design relational databases according to the needs of the datasets at hand.
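
As a rough illustration of how several of these objectives fit together, the sketch below parses a small JSON payload, drops redundant and incomplete records, converts a text field into a new integer variable, and retrieves the result with a SQL join. It is only a minimal sketch using the Python standard library (json and sqlite3); the payload, the field names, the table names, and the helper functions clean and load_and_query are hypothetical and are not taken from the course materials.

```python
import json
import sqlite3

# Hypothetical API response; the records and field names are invented.
payload = """[
  {"id": 1, "name": "Ada", "score": "91"},
  {"id": 2, "name": "Ben", "score": null},
  {"id": 1, "name": "Ada", "score": "91"},
  {"id": 3, "name": "Cy",  "score": "78"}
]"""


def clean(records):
    """Drop rows with a missing score or a repeated id; store score as int."""
    seen, rows = set(), []
    for rec in records:
        if rec["score"] is None or rec["id"] in seen:
            continue
        seen.add(rec["id"])
        rows.append((rec["id"], rec["name"], int(rec["score"])))
    return rows


def load_and_query(rows):
    """Store the rows in two related tables and join them back together."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE scores ("
        "student_id INTEGER REFERENCES students(id), score INTEGER)"
    )
    conn.executemany(
        "INSERT INTO students VALUES (?, ?)", [(r[0], r[1]) for r in rows]
    )
    conn.executemany(
        "INSERT INTO scores VALUES (?, ?)", [(r[0], r[2]) for r in rows]
    )
    result = conn.execute(
        "SELECT s.name, sc.score FROM students AS s "
        "JOIN scores AS sc ON sc.student_id = s.id "
        "ORDER BY s.name"
    ).fetchall()
    conn.close()
    return result


rows = clean(json.loads(payload))
print(load_and_query(rows))
```

Splitting the data into a students table and a scores table is one possible design choice, made here only to show a join; a real dataset may call for a different schema.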

Teaching Method

The lectures will be in the following format:

1. The first part will focus on the main concepts and theory of the lecture's topic. The big-picture idea and the necessary mathematical background will be introduced here.

2. The second part will be an in-class coding session. The instructor will code some examples in class, after which the students will have their own time to work on their assignments with the instructor's help.

3. After the lecture, the students will be encouraged to post their questions on an online platform of the instructor's choice, such as Piazza, Canvas or Campuswire. The instructor will go through them and prepare the first part of the next lecture.

Evaluation Method

There will be 6 homework assignments (10% each), an in-class midterm exam (20%), and an in-class final exam (20%).

Class Materials (Required)

No textbook. A laptop for coding is necessary.

Class Notes

The whole course is based on the Python and SQL programming languages. We recommend that everyone install the Anaconda distribution. Once Anaconda is installed, you need to find Jupyter Notebook and Spyder in it. Installation of SQLiteStudio is also necessary.

Class Attributes

Formal Studies Distro Area

Enrollment Requirements

Registration in this course is reserved for Data Science majors only.