Information Management for Data Science (305-0-20)
Instructors
Emre Besler
Meeting Info
University Hall 122: Tues, Thurs 3:30PM - 4:50PM
Overview of class
The Information Management for Data Science course aims to give students an extensive skillset to upload, clean, process, store and utilize data from various sources. Starting with the main libraries and data structures in Python, it will focus on Python functions, methods and attributes for processing the data at hand. The course then moves on to advanced techniques for obtaining data: extracting HTML text from websites using XPath, and reading JavaScript Object Notation (JSON) files available online. Finally, the course covers relational databases and how to obtain data from them using Structured Query Language (SQL). Students are not expected to have any prior knowledge of SQL; it will be introduced from scratch and applied during the lectures.
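As a rough illustration of the data-gathering techniques named above, the sketch below pulls values out of a small piece of markup with an XPath-style expression and reads a JSON string. The sample markup, the JSON record and the function names are made up for illustration and are not course material. It uses only the Python standard library, whose `xml.etree.ElementTree` module supports a limited subset of XPath and requires well-formed markup; real web pages usually call for a dedicated HTML parser.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded web page.
PAGE = """<html><body>
<ul>
  <li class="course">STAT 201-0</li>
  <li class="course">STAT 202-0</li>
  <li class="note">not a course</li>
</ul>
</body></html>"""

# Hypothetical JSON record standing in for a file obtained online.
RECORD = '{"course": "305-0-20", "topics": ["Python", "XPath", "JSON", "SQL"]}'


def extract_courses(markup):
    """Return the text of every <li> element whose class is 'course'."""
    root = ET.fromstring(markup)
    # ElementTree understands a subset of XPath, including attribute predicates.
    return [li.text for li in root.findall(".//li[@class='course']")]


def extract_topics(text):
    """Parse a JSON string and return the list stored under 'topics'."""
    return json.loads(text)["topics"]


if __name__ == "__main__":
    print(extract_courses(PAGE))
    print(extract_topics(RECORD))
```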
Registration Requirements
Prerequisites: STAT 201-0 or COMP_SCI 110-0 and STAT 202-0 or STAT 210-0 or STAT 232-0
Enrollment in this course is restricted to Data Science majors
Learning Objectives
At the completion of this course, students should be able to:
Identify parts of the data that are misleading, wrong, irrelevant or redundant for the task at hand, and process the dataset they uploaded accordingly.
Create new variables from the data they have, in a new data type if necessary.
Visualize the data in an interactive and visually aesthetic manner.
Scrape different types of data from online sources and process it for further analysis.
Write SQL queries to obtain data that is spread across multiple relational databases.
Obtain data from a mobile application or a website and process it for numerical analysis.
Design relational databases according to the needs of the datasets at hand.
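To make the SQL-related objectives above more concrete, here is a minimal sketch of designing two related tables and querying across them with a join, using Python's built-in `sqlite3` module. The table names, column names and sample rows are hypothetical and chosen only for illustration; they are not taken from the course.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory SQLite database with two related tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE students (
            student_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE grades (
            grade_id   INTEGER PRIMARY KEY,
            student_id INTEGER NOT NULL REFERENCES students(student_id),
            score      REAL
        );
        INSERT INTO students (student_id, name) VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO grades (student_id, score) VALUES (1, 90.0), (1, 80.0), (2, 70.0);
        """
    )
    return conn


def average_scores(conn):
    """Join the two tables and return (name, average score) per student."""
    query = """
        SELECT s.name, AVG(g.score) AS avg_score
        FROM students AS s
        JOIN grades AS g ON g.student_id = s.student_id
        GROUP BY s.student_id, s.name
        ORDER BY s.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    for name, avg in average_scores(connection):
        print(name, avg)
    connection.close()
```

Keeping the scores in their own table, linked to students by `student_id`, is one common way to avoid repeating student details on every row; other designs are possible depending on the dataset.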
Teaching Method
Each 80-minute class will be divided into a 70-minute lecture, in which concepts are introduced, and a final 10 minutes for working on the weekly set of exercises. Students are encouraged to ask questions and collaborate during the in-class work time. Everyone must bring their own laptop to each class, as coding in Python and SQL will be required. A Python environment (Anaconda Navigator will be the one used in class) and SQLite Studio must be installed.
Evaluation Method
There will be 5 homework assignments (35%), 9 weekly exercise sets (35%), a midterm exam (15%), and a final exam (15%).
Class Materials (Required)
No textbook; a laptop for coding is necessary.
Class Notes
The whole course is based on the Python and SQL programming languages. We recommend that everyone install the Anaconda distribution. Once Anaconda is installed, you need to find Jupyter Notebook and Spyder in it. Installation of SQLite Studio is also necessary.
Enrollment Requirements
Enrollment Requirements: Preregistration is reserved for Data Science Majors only.
Prerequisite: STAT 201-0 or GEN_ENG 150-0 or GEN_ENG 151-0 and STAT 202-0 or STAT 210-0 or STAT 232-0 or PSYCH 201-0 or IEMS 201-0 or IEMS 303-0