Projects

Machine Learning Engineer Nanodegree

In December 2017, I decided to enroll in the Udacity Machine Learning Engineer Nanodegree program. At first, I was unsure whether it would be the right fit, since I had substantial training in supervised learning in grad school. Nonetheless, I hoped to familiarize myself further with unsupervised learning, reinforcement learning, and deep learning; I also hoped to get more experience using machine learning methods in Python. My overall experience was positive. The course solidified existing knowledge and acquainted me with more advanced topics such as reinforcement learning and convolutional neural networks. I challenged myself to go beyond applying familiar algorithms such as regression trees and lasso regressions and to dig into new methods, including stochastic gradient descent and gradient boosting. My main complaint about the program is that the Python coding for the projects was very limited, since much of the code was pre-written. Similarly, the projects did not delve deeply into some of the more technical aspects of the methods. All my submissions can be found on GitHub.

For my capstone project, I chose the Kaggle dataset from the DonorsChoose.com Application Screening competition. DonorsChoose.com empowers public school teachers from across the country to request much-needed materials and experiences for their students. It receives hundreds of thousands of project proposals each year for classroom projects in need of funding. The objective of this project was to predict proposal acceptance based on the proposal information. The data included essays as well as metadata such as subject, resources requested, and state for each proposal. I chose the project because it aligned with my interest in education and because it allowed me to apply text analysis methods to extract features from the proposals, unsupervised learning methods for dimensionality reduction, and supervised learning algorithms to predict acceptance.

I chose two algorithms to reduce the dimensionality of the text data: non-negative matrix factorization and principal component analysis. As for supervised learning algorithms, I used random forests, stochastic gradient descent with support vector machines, and, as a baseline model, logistic regression. The model parameters were determined by grid search using the area under the ROC curve, and the final models were tested on an independent test sample. Overall, the models did not perform well. The baseline model, logistic regression, reached an accuracy similar to that of a simple majority rule and an area under the ROC curve similar to random classification. The random forest and the SGD model with principal components extracted from the text data performed slightly better based on the area under the ROC curve but worse based on accuracy. Given that the majority of proposals were approved, this suggests that these models put more weight on classifying rejected proposals correctly than approved proposals. Although the models were not particularly successful, the project was a great learning experience. All Jupyter notebooks are on GitHub.
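As a rough illustration of how such a pipeline fits together in scikit-learn, here is a minimal sketch with made-up example data: TF-IDF text features, NMF for dimensionality reduction, and a logistic-regression baseline tuned by grid search on the area under the ROC curve. The essays, labels, and grid values are illustrative only; the actual capstone notebooks are on GitHub.

```python
# Sketch only: the data and parameter grid are invented for illustration.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-ins for the proposal essays and approval labels
essays = ["We need microscopes for our biology class.",
          "Requesting novels for a classroom library."] * 50
approved = [1, 0] * 50  # 1 = proposal approved

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # text -> features
    ("nmf", NMF(n_components=5, max_iter=500)),        # dimensionality reduction
    ("clf", LogisticRegression(max_iter=1000)),        # baseline classifier
])

# Tune over a small grid, scoring by area under the ROC curve
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]},
                    scoring="roc_auc", cv=3)

X_train, X_test, y_train, y_test = train_test_split(
    essays, approved, test_size=0.2, random_state=0, stratify=approved)
grid.fit(X_train, y_train)
test_auc = grid.score(X_test, y_test)  # AUC on the held-out test sample
```

Swapping `NMF` for `PCA` (via `TruncatedSVD` on sparse TF-IDF matrices) or the logistic regression for a random forest or SGD classifier only changes the pipeline steps and the parameter grid.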

Scraping survey files from Qualtrics with Splinter

Our lab works a lot with the online survey platform Qualtrics. Recently, we had to archive survey files and response data for more than 200 surveys. To save our research assistants from manually going to each survey and downloading the files, I volunteered to write a Python script that would download everything via the Qualtrics API. At that point, I did not realize that the API has no functionality for downloading survey files; it only allows you to pull response data. To still make good on my word, I wrote a web scraper based on the Splinter module that simulates interactions with the Qualtrics website to download the files. The program opens Firefox, goes to the Qualtrics site, logs on, and then downloads each survey file specified in a list. The script can be found on GitHub.
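The overall shape of the scraper can be sketched as follows. This is a hypothetical outline, not the actual script: the organization subdomain, login-form field names, button id, and export URL pattern are all illustrative stand-ins for the real Qualtrics page elements.

```python
# Hypothetical sketch of driving Firefox with Splinter to download survey files.
# All URLs and page-element names below are assumptions for illustration.

def export_url(survey_id, org="yourorg"):
    """Assumed URL pattern that triggers a survey-file download."""
    return f"https://{org}.qualtrics.com/survey/export/{survey_id}"

def download_survey_files(username, password, survey_ids, org="yourorg"):
    # Imported here so the URL helper above works without Splinter installed
    from splinter import Browser

    with Browser("firefox") as browser:
        # Log on to the Qualtrics site
        browser.visit(f"https://{org}.qualtrics.com/login")
        browser.fill("username", username)
        browser.fill("password", password)
        browser.find_by_id("loginButton").first.click()
        # Visit each survey's export URL; Firefox saves the file to disk
        for survey_id in survey_ids:
            browser.visit(export_url(survey_id, org))
```

Splinter's `Browser` wraps a real browser session, so the script sees exactly what a research assistant would see, including pages the API never exposes.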

Sending gift card codes to Qualtrics online survey respondents with Python

Social science researchers collect much of their data through online surveys, and in many cases they offer incentives to participants. At CEPA Labs at Stanford, our survey participants receive an Amazon gift card code. However, sending these gift card codes to respondents is challenging. In Qualtrics, our online survey platform, we can embed a code for each potential respondent and then trigger an email with the code attached once their survey is completed. While this is a very convenient feature, it has one substantial drawback: we need to purchase a code for each potential respondent up front, yet many participants may never respond. Alternatively, we could wait until the survey is closed and then purchase the codes for all respondents at once. However, respondents tend to become impatient if they do not receive their code in a timely manner and start reaching out to us, and possibly to the IRB office. This creates administrative work and might reduce response rates if potential respondents talk to each other about their experience.

Given these problems, we decided to send codes using Python and the Qualtrics API. This way, we can send codes immediately but do not need to purchase a code for each potential respondent up front. We used the Amazon Incentives API, which allows its users to request gift card codes on demand: codes are generated on the fly, and the amount is charged to our account. I wrote a Python script that continuously checks Qualtrics for new responses and then sends each new respondent a code. In a loop, it downloads new responses, writes their contact information to a SQL database, assigns each respondent a code, records it in the database, and then sends the codes out by email or text message.
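The core of one pass through that loop can be sketched like this. The sketch is hypothetical: `fetch_responses`, `request_code`, and `send_code` stand in for the Qualtrics API call, the Amazon Incentives API call, and the email/text delivery step, and the table layout is invented for illustration.

```python
# Hypothetical sketch of one polling pass: record new respondents in SQLite,
# assign each a gift-card code, and hand the code off for delivery.
import sqlite3

def process_new_responses(db, fetch_responses, request_code, send_code):
    db.execute("""CREATE TABLE IF NOT EXISTS rewards
                  (response_id TEXT PRIMARY KEY, email TEXT, code TEXT)""")
    for resp in fetch_responses():  # e.g. pulled via the Qualtrics API
        # Skip respondents we have already rewarded
        seen = db.execute("SELECT 1 FROM rewards WHERE response_id = ?",
                          (resp["id"],)).fetchone()
        if seen:
            continue
        code = request_code()  # e.g. an Amazon Incentives API call
        db.execute("INSERT INTO rewards VALUES (?, ?, ?)",
                   (resp["id"], resp["email"], code))
        db.commit()  # persist before sending, so no code is sent twice
        send_code(resp["email"], code)
```

In production, this function would run inside an outer loop with a pause between polls; keeping the database write ahead of the send step ensures a respondent never receives two codes even if a pass is interrupted and retried.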

You can find more detail in my blog post, which describes a Python program that reads the codes from a .csv file and sends them out by email. The original program fetched the codes from the Amazon Incentives API and sent them via the EZtexting.com API. Both versions, along with other scripts for survey management, can be found on GitHub.