T-61.5910 - Research Project in Computer and Information Science P
Fall 2015
List of suggested projects
Project #1
Big data analysis for crime and fraud detection (Nordea)
Description
This project will give students the opportunity to apply data mining and machine learning techniques for the purpose of detecting crime and fraud in large, multi-typed, financial data.
The students will be given access to Nordea data in a controlled environment
and will be supervised by Nordea data-analysis experts.
Supervisors
From the Nordea side:
Sauli Pahlman (Sauli.Pahlman@nordea.com),
Dmitri Guirenko (dmitri.guirenko@nordea.com),
and from the Aalto side: Aristides Gionis (aristides.gionis@aalto.fi)
Student
Albert Arockiasamy
Project #2
Analysis of recruitment data
Description
tyopaikat.oikotie.fi
is a job-market portal, where employers post job advertisements,
and candidates create profiles and apply to job openings.
The service belongs to Sanoma Corporation.
In this project, students will be given the opportunity to analyze data from the
tyopaikat.oikotie.fi system,
in order to discover interesting patterns in the data,
and develop ideas that could potentially improve the services of the portal.
The students will be supervised by Sanoma data-analysis experts.
Possible projects include:
- Identifying jobs/industries that are "trending", i.e., promising in the near future.
- Identifying discrepancies between jobs and applicants.
- Extracting typical career paths.
- Improving advanced search -- what keywords should be included in a job listing to attract candidates?
- Finding what features make a job listing more attractive.
- Correlating job listing/searches data with unemployment data and other economic indicators.
- Performing discrimination analysis.
Note that the data is mostly in Finnish, so understanding Finnish is a plus.
Supervisors
From the Sanoma side:
Mika Ruokonen (mika.ruokonen@sanoma.fi),
and from the Aalto side:
Michael Mathioudakis (michael.mathioudakis@hiit.fi),
Indre Zliobaite (indre.zliobaite@aalto.fi).
Student(s)
Project #3
Molecular-level modelling and visualisation of custom-designed 3D DNA nanostructures
Description
DNA nanotechnology
is a multidisciplinary, emerging field of
technology that aims to create nanoscale structures by a process of
self-organisation, using DNA as the construction material.
A recent development within this field is the synthesis of 3D DNA
nanostructures from very general mesh-based designs, by a process that
could be compared to
"3D printing at the nanoscale".
This technique was pioneered by our research group at the Aalto
Computer Science department, in collaboration with Prof.
Björn Högberg's biochemistry laboratory at the Karolinska Institutet in Stockholm.
The design tool is currently implemented as a module of the
"vHelix"
DNA nanostructure design package developed at the Högberg Lab.
While we now have a complete design pipeline that leads from a
high-level graphical mesh model to the eventual DNA sequences that
self-organise into the desired structure, we are still lacking a good
understanding of a number of molecular-level defects that appear in
electron microscopy images of the nanoscale structures. To study
these, and consequently improve the methodology, it would be highly
useful to export the static nucleotide-level designs from the vHelix
software to a dynamic, molecular-level modelling and simulation
environment such as the oxDNA package,
that provides also more advanced simulation and visualisation
capabilities.
Tasks
- Study the so-called
"DNA origami"
technique and the algorithmic
methods involved in translating a 3D graphical mesh model into the DNA
strands that hybridise into the desired structure.
- Learn about the vHelix and oxDNA modelling and simulation packages.
- Implement a software module that exports a vHelix nanostructure model
into the oxDNA package.
- Experiment numerically and visually, using oxDNA, with some of the
DNA nanostructure designs presented in the article mentioned in the first task above.
Supervisors
Pekka Orponen (Pekka.Orponen@aalto.fi), Abdulmelik Mohammed (Abdulmelik.Mohammed@aalto.fi)
Student
Project #4
Gaussian processes
Description
A Gaussian process is a flexible non-parametric model useful in probabilistic machine learning. The GPstuff Matlab/Octave toolbox is a versatile collection of Gaussian process models and the computational tools required for inference (it is among the top 5% most downloaded machine learning open-source software at
mloss.org).
There are many new ideas waiting to be implemented in the software.
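Although GPstuff itself is a MATLAB/Octave toolbox, the core computation behind Gaussian process regression can be sketched in a few lines of Python. The sketch below shows the posterior mean under a squared-exponential kernel; the data and hyperparameters are illustrative assumptions, not part of GPstuff.

```python
# A numpy sketch of basic Gaussian process regression: posterior mean
# of f at a test input under a squared-exponential kernel. Data and
# hyperparameters are illustrative placeholders.
import numpy as np

def sq_exp(a, b, ell=1.0, sf=1.0):
    """Squared-exponential covariance between 1-D input vectors a and b."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # training inputs
y = np.sin(x)                                # noisy-free observations of sin
noise = 0.01                                 # observation noise variance

K = sq_exp(x, x) + noise * np.eye(len(x))    # kernel matrix plus noise
x_star = np.array([0.5])                     # test input
k_star = sq_exp(x_star, x)                   # cross-covariances, shape (1, 5)
mean = k_star @ np.linalg.solve(K, y)        # GP posterior mean at x_star

print(float(mean))  # close to sin(0.5) ~ 0.48
```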
Prerequisites
Bayesian statistics, Programming skills (MATLAB)
Supervisor
Aki Vehtari (aki.vehtari@aalto.fi)
Student
Project #5
Large scale positive unlabeled learning for biological compounds
Description
Our knowledge of compounds of biological interest is very limited.
Out of the 50,000,000 compounds in
PubChem (a molecule database), we only know that around 300,000 are of biological interest; the remaining
compounds may or may not be of biological interest. The goal of this project is to predict, among the
remaining ones, which are of biological interest, based on their molecular fingerprints (features). The problem
perfectly suits the Positive-Unlabeled (PU) learning setting, where the label information for most of the
data examples is missing and we only know a small portion of the data with positive labels (no negative labels).
To deal with ~50,000,000 training instances, large-scale machine learning techniques or platforms such as
Spark could be used.
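To make the PU setting concrete, here is a minimal sketch of one standard approach (the Elkan-Noto adjustment): train a classifier to separate labeled positives from everything else, then rescale its scores. The 2-D synthetic data stands in for molecular fingerprints; all parameters are illustrative assumptions.

```python
# Sketch of positive-unlabeled (PU) learning via the Elkan-Noto
# adjustment, on synthetic data standing in for molecular fingerprints.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# True positives and negatives (unknown to the learner).
X_pos = rng.normal(loc=+2.0, size=(2000, 2))
X_neg = rng.normal(loc=-2.0, size=(8000, 2))

# Only a small fraction of positives is labeled; the rest is unlabeled.
labeled = rng.rand(len(X_pos)) < 0.3
X = np.vstack([X_pos, X_neg])
s = np.concatenate([labeled.astype(int), np.zeros(len(X_neg), dtype=int)])

# Step 1: train a "non-traditional" classifier g(x) ~ P(s=1 | x)
# that separates labeled positives from everything else.
g = LogisticRegression().fit(X, s)

# Step 2: estimate c = P(s=1 | y=1) as the mean score on labeled positives.
c = g.predict_proba(X_pos[labeled])[:, 1].mean()

# Step 3: the corrected posterior is P(y=1 | x) = P(s=1 | x) / c.
p_y = np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)

pos_unlabeled = p_y[:len(X_pos)][~labeled]   # hidden positives
neg = p_y[len(X_pos):]                       # true negatives
print(pos_unlabeled.mean() > neg.mean())     # → True
```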
Supervisor
Huibin Shen (huibin.shen@aalto.fi)
Student
Paul Tardy
Project #6
Bayesian networks for big data
Description
Bayesian networks are a widely-used class of probabilistic graphical models. They are used to represent joint distributions of several random variables. Typically, they are learned from data. Learning Bayesian networks is NP-hard and thus exact algorithms do not scale up to hundreds of variables. Therefore, in a big data setting one has to resort to heuristic algorithms which do not give any quality guarantees. But how well do these algorithms perform in practice?
The task in this project is to conduct a small simulation study on learning Bayesian networks with big data. This includes generating data, learning networks with selected algorithms and assessing performance.
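The simulate-learn-assess loop above can be sketched in miniature. The toy below generates data from a known two-variable network and checks whether a BIC score recovers the true edge; the network, parameters, and sample size are illustrative assumptions, far smaller than the big-data setting of the project.

```python
# A minimal sketch of the simulate-learn-assess loop for Bayesian
# network structure learning: two binary variables, BIC scoring.
import math
import random

random.seed(0)
N = 5000

# Generate data from a known ground-truth network A -> B.
data = []
for _ in range(N):
    a = 1 if random.random() < 0.5 else 0
    b_prob = 0.9 if a == 1 else 0.1          # strong dependence on A
    b = 1 if random.random() < b_prob else 0
    data.append((a, b))

def loglik(counts, n):
    """Multinomial maximum log-likelihood from a count dictionary."""
    return sum(c * math.log(c / n) for c in counts.values() if c > 0)

def bic_independent(data):
    # Model: A and B independent; 2 free parameters.
    ll = 0.0
    for i in (0, 1):
        counts = {}
        for row in data:
            counts[row[i]] = counts.get(row[i], 0) + 1
        ll += loglik(counts, len(data))
    return ll - (2 / 2) * math.log(len(data))

def bic_edge(data):
    # Model: A -> B; parameters: P(A=1), P(B=1|A=0), P(B=1|A=1).
    counts_a = {}
    for a, _ in data:
        counts_a[a] = counts_a.get(a, 0) + 1
    ll = loglik(counts_a, len(data))
    for a_val in (0, 1):
        counts_b = {}
        sub = [b for a, b in data if a == a_val]
        for b in sub:
            counts_b[b] = counts_b.get(b, 0) + 1
        ll += loglik(counts_b, len(sub))
    return ll - (3 / 2) * math.log(len(data))

# Assessment: the true structure should win on enough data.
print(bic_edge(data) > bic_independent(data))  # → True
```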
Supervisor
Pekka Parviainen
(pekka.parviainen@aalto.fi)
Student
Emre Celikten
Project #7
Detection of emotional affect in images
Description
Every image conveys an affective message. An image can cause you to feel happy or unhappy depending on its content. Detecting this affective message and the emotional state of the person exposed to the image has been an interesting research question. One approach to this problem is to use people's eye movements in order to detect the influence of the image and the affective state. In this project, we plan to study various classification approaches for this purpose.
We explore both scenarios: in the first part, we analyze the contribution of classifiers to the task of detecting an image's emotional content and try to identify the influential features. Then, using the results of the first part of the project, we try to build a model that predicts the emotion of an observer watching the images used in the first part of the study.
The steps towards successful completion of this project are the following:
- Literature review, to address the following questions:
- What are the existing methods for the task?
- How does the eye movement contribute?
- What features can be used, and what motivates the choice of features?
- Analysis of the features and their correlation in the data.
- Creating a baseline system.
- Studying various classifiers for the task.
- Analysis of the performance of the classifiers.
- Building a model based on the findings of the first part of the project.
- Implementing an interface for user interaction with the photos and recording their gaze.
- Testing the model in real-time.
- Writing a report on the findings.
The project requires interest in HCI, knowledge of machine learning, programming skills, and patience.
You may need to gather some data for the second part.
Supervisors
Hamed Tavakoli (hamed.r-tavakoli@aalto.fi),
Jorma Laaksonen (jorma.laaksonen@aalto.fi)
Note
Preference will be given to students from the EIT Digital Master School's Human Computer Interaction and Design
(HCID) major for this project.
Student
Filippo Forti
Project #8
Analysis of jigsaw puzzle solving using gaze
Description
Analysis of human gaze while performing various tasks has been an interesting topic for psychologists and human-computer interaction scientists over the past decades. There has been huge interest in task decoding using machine learning approaches in recent years. In this project, we aim to analyze gaze behavior while solving a jigsaw puzzle. What are the characteristics of a user's gaze while picking up a piece and solving the problem? Can we find what causes a user to make a mistake by analyzing his/her gaze?
To answer these questions, we first need to implement a user-friendly gaze-based jigsaw game. Then, we can proceed with the analysis of the data that we gather. In this project, we mostly focus on the first part, since successful analysis highly depends on it.
The steps towards completion of this project are the following:
- Literature review.
- Proposing an intuitive user interface design for the puzzle.
- Implementing the interface and making sure it works without errors.
- Gathering data from users solving the game.
- Analyzing the users' gaze data.
- Reporting the findings.
The project requires interest in HCI, statistical analysis, machine learning, programming skills, and patience.
Note: there could be two types of interfaces: one that relies on a multi-touch display and eye-tracking glasses for solving the problem, and a desktop interface that uses a desktop eye tracker. The first one is more challenging, though it offers the possibility of utilizing touch information as well.
Supervisors
Hamed Tavakoli (hamed.r-tavakoli@aalto.fi),
Jorma Laaksonen (jorma.laaksonen@aalto.fi)
Note
Preference will be given to students from the EIT Digital Master School's Human Computer Interaction and Design (HCID) major for this project.
Student
Project #9
Video annotation using Gaze
Description
Video annotation is time-consuming work that requires a lot of effort. Traditionally, one must manually select the object of interest in several key-frames and track it within those key-frames. In this project, in order to reduce the amount of interaction and speed up the process, we annotate videos using gaze as an input: the user looks for a specific object while watching a video. The project involves challenges in the domains of object detection and tracking.
The steps towards completion of this project are the following:
- Literature review.
- Making an object detector that detects objects of interest.
- Building a tracker to keep track of the detected object.
- Implementing a gaze interface to record the user's gaze.
- Combining the gaze information and object detector information to perform semi-automatic annotation and improve the annotation results.
The project requires interest in HCI, knowledge of machine learning, programming skills, and patience. Some knowledge of computer vision is highly recommended.
Supervisors
Hamed Tavakoli (hamed.r-tavakoli@aalto.fi),
Jorma Laaksonen (jorma.laaksonen@aalto.fi)
Note
Preference will be given to students from the EIT Digital Master School's Human Computer Interaction and Design (HCID) major for this project.
Student
Mohamed Soliman
Project #10
Multimedia recommender system using Linked Open Data
Description
Over recent years, the Web has evolved from a collection of mostly text-based content into a large multimedia database. Today we are able to access many different types of multimedia, such as images, video, audio, live TV streams, as well as animated and interactive content. On the other hand, current systems that recommend multimedia content have the limitation that they recommend only a specific type of multimedia object rather than blending different types in a personalized manner. This limitation is due to the complexity of analyzing and correlating multiple types of multimedia objects with the user's interests.
The Linked Open Data initiative opens new possibilities for the implementation of multimedia recommender systems. Linked Data is a way of exposing and sharing data as resources on the Web and interlinking them with semantically related resources. For multimedia content, metadata is a key factor for efficient management, organization, and retrieval. Metadata is used not only to describe low-level attributes of multimedia objects (such as length, resolution, or color depth), but also to describe high-level semantic features (such as genre classification or information about depicted persons).
The objective of this project is to explore the potential of Linked Open Data for the task of recommending diverse multimedia content.
Supervisors
Cristina Gonzalez-Caro (cristina.gonzalez-caro@aalto.fi),
Jorma Laaksonen (jorma.laaksonen@aalto.fi)
Student
Project #11
Analysis of citation networks
Description
We have a collection of citation data from Google Scholar covering almost 5 months, together with a citation network of scientists. The task in this project is to study the pattern of growth of citations in connection with various network-related factors (such as centrality).
Background
We are looking for a motivated student with experience in Python.
Background in some sort of large scale network analysis would be a plus.
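The kind of analysis intended can be sketched with the standard library alone: compute a centrality measure for each scientist and relate it to citation growth. The toy edge list and growth figures below are invented placeholders for the Google Scholar data.

```python
# A stdlib-only sketch: degree centrality on a toy scientist network,
# compared against (hypothetical) citation growth over the window.
from collections import defaultdict

# Undirected citation-network edges between scientists (toy data).
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]

degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Normalized degree centrality: degree / (n - 1).
n = len(degree)
centrality = {v: d / (n - 1) for v, d in degree.items()}

# Hypothetical citation growth over the observation window.
growth = {"a": 120, "b": 60, "c": 55, "d": 40, "e": 10}

# Compare the rankings induced by centrality and by citation growth.
by_centrality = sorted(centrality, key=centrality.get, reverse=True)
by_growth = sorted(growth, key=growth.get, reverse=True)
print(by_centrality[0], by_growth[0])  # both "a" in this toy example
```

A real analysis would replace degree with richer measures (betweenness, PageRank) and the toy comparison with a proper rank correlation.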
Supervisor
Kiran Garimella (kiran.garimella@aalto.fi)
Student
Siddharth Rao
Project #12
Timeseries-methods for forecasting the economy
Description
We have a dataset of key macroeconomic variables from the past 20 years, ranging from the GDP of major economies to commodity prices and even weather data. Being able to forecast these variables ahead of time would be interesting and potentially even profitable. The task of the student is to analyze this data using relevant methods, including traditional time-series models (e.g., ARIMA) and common machine-learning methods (e.g., random forests). During the project, the student will become familiar with various time-series models as well as machine-learning methods and techniques.
Supervisor
Pyry Takala (pyry.takala@gmail.com)
Student
Pablo Alonso and Ronnie Pereira
Project #13
Automatic essay scoring
Description
Most essays are still corrected by hand today, which represents a significant effort for teachers and teaching assistants. Recently, Kaggle organized a competition on essay scoring.
The student will apply a set of standard machine learning / natural language processing methods to this task.
Time permitting, the student can also take advantage of recently developed advanced methods,
for instance analyzing the essays using word embeddings created by neural networks.
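A common baseline for this task is hand-crafted features fed to a linear model, which the sketch below illustrates. The essays and human grades are tiny invented placeholders, not the Kaggle data.

```python
# A feature-based baseline sketch for essay scoring: simple surface
# features plus least-squares regression on toy data.
import numpy as np

essays = [
    "short essay with few words",
    "a somewhat longer essay that develops its argument in more detail",
    "a long and carefully structured essay that states a thesis supports "
    "it with varied vocabulary and concludes with a clear summary",
]
scores = np.array([1.0, 2.0, 3.0])  # hypothetical human grades

def features(text):
    words = text.split()
    return [
        len(words),                          # essay length
        sum(map(len, words)) / len(words),   # average word length
        len(set(words)) / len(words),        # vocabulary richness
    ]

X = np.array([features(e) for e in essays])
w, *_ = np.linalg.lstsq(X, scores, rcond=None)
preds = X @ w
print(preds)  # fits the three toy grades closely
```

Real systems would add spelling, syntax, and discourse features, and evaluate with quadratic weighted kappa as in the Kaggle competition.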
Supervisor
Pyry Takala (pyry.takala@gmail.com)
Student
Project #14
Text classification
Description
A very common, practical natural language processing task is to classify text into different categories.
This is necessary for instance when recommending a document based on its contents, when automating office workflow, when grading exams automatically, or, classically, when writing a spam filter.
Classification can be between two or multiple classes, classes can have hierarchies, and in some tasks a document can fall under multiple classes.
The student's task will be to apply state-of-the-art language processing / classification methods to the
20NG dataset
and report results on this task.
A tutorial that can help the student get started can be found on the scikit-learn website.
Time permitting, the student can test more methods or test methods also on a different dataset,
e.g., the hierarchical LSHTC-challenge.
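A classic baseline for this task is multinomial naive Bayes, sketched below with the standard library only. The tiny two-class corpus is an invented stand-in for the 20NG data; the scikit-learn tutorial mentioned above implements the same idea at full scale.

```python
# A stdlib-only multinomial naive Bayes text classifier with add-one
# smoothing, trained on a toy two-class corpus.
import math
from collections import Counter, defaultdict

train = [
    ("sport", "the team won the match"),
    ("sport", "a great goal in the final match"),
    ("tech", "the new gpu trains the model fast"),
    ("tech", "a neural model beats the old baseline"),
]

word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for label, text in train:
    class_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log P(class) + sum of log P(word | class), add-one smoothed
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) /
                              (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("the model trains fast"))  # → tech
```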
Supervisor
Pyry Takala (pyry.takala@gmail.com)
Student
Project #15
Clustering of Android Applications
Description
We are provided with a big database of more than 100,000 samples.
Files are represented as strings, and the dimensionality is quite high: more than 1 million features.
The research could also involve grouping the features, dimensionality reduction, and applying deep learning techniques.
Benefits
- Getting to know the research group and potentially being hired as a summer intern.
- Collaboration with F-Secure and attending meetings with them.
- Supervision by Luiza Sayfullina
and Emil Eirola.
- Working with a real dataset on a challenging problem.
We give priority to second-year students, as some basic knowledge of machine learning is required.
We might select two students for this task.
Supervisors
Luiza Sayfullina (sayfullina.luiza@gmail.com) and Emil Eirola (emil.eirola@arcada.fi)
Student(s)
Antti Savolainen
Project #16
Analysis of first name popularities in historical Finland
Description
The popularity of common Finnish first names (such as Maria and Johannes) goes in cycles of more than 100 years (source: Wikipedia), and different names are popular in different parts of Finland. Your task is to analyze and discover these kinds of spatio-temporal patterns from a dataset consisting of about 5 million birth records between the years 1600 and 1850.
Supervisor
Eric Malmi (eric.malmi@aalto.fi)
Student
Teemu Kärkkäinen
Project #17
Topic modeling for rap lyrics analysis
Description
What kind of topics do rappers talk about in their lyrics? Your task is to address this question computationally using topic modeling techniques such as Latent Dirichlet allocation. Additionally, you can study how the topics of the lyrics have evolved over time. For instance, does the amount of profane language decrease if a rapper becomes a parent?
Supervisor
Eric Malmi (eric.malmi@aalto.fi)
Student
Project #18
Speech synthesis for building an automated rap bot
Description
DeepBeat is a rap lyrics generator developed at Aalto University and the University of Helsinki,
which has recently gained a lot of media attention.
Your task is to give voice to DeepBeat by applying speech synthesis.
Your project consists of two steps:
- Do a literature review on singing synthesis.
- Implement a "rap synthesizer" by applying any existing, free speech synthesis software such as eSpeak.
The main challenge in the second part is to be able to adjust the timing of the words so that the rhyming syllables are spoken out at a certain time of the bar. Additionally, you may want to adjust pitch for certain words in order to emphasize the rhyming parts.
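The timing problem alone can be sketched as simple arithmetic: given a tempo and the beats on which the rhyming words must land, compute when each word should start. The line and anchor positions below are invented; feeding the resulting timings to a synthesizer such as eSpeak (e.g., via its speed settings) is left open.

```python
# A sketch of rap timing: schedule word start times so that rhyming
# words land exactly on chosen beats of the bar (toy line and anchors).
BPM = 90
beat = 60.0 / BPM                     # seconds per beat

line = "my rhymes are stacked and they never cracked".split()

# Rhyming words must start exactly on beats 1 and 3 (0-indexed beats).
anchors = {3: 1 * beat, 7: 3 * beat}  # word index -> target start time

# Interpolate: spread the words between anchors evenly in time.
times = [0.0] * len(line)
points = sorted([(0, 0.0)] + list(anchors.items()))
for (i0, t0), (i1, t1) in zip(points, points[1:]):
    for j in range(i0, i1 + 1):
        times[j] = t0 + (t1 - t0) * (j - i0) / (i1 - i0)

print([round(t, 2) for t in times])
# → [0.0, 0.22, 0.44, 0.67, 1.0, 1.33, 1.67, 2.0]
```

Each word's allotted duration (the gap to the next start time) would then determine the speech rate requested from the synthesizer.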
Prerequisites
Good programming skills.
Knowledge of signal processing and an interest in rap are considered an advantage but are not required.
References
E. Malmi, P. Takala, H. Toivonen, T. Raiko, A. Gionis
"DopeLearning: A Computational Approach to Rap Lyrics Generation".
Supervisor
Eric Malmi (eric.malmi@aalto.fi)
Student
Project #19
Smartphone User Behavior
Description
We use our smartphones all the time and are afraid of running out of battery while using Google Maps in a remote place. But can smartphone usage behavior be quantified? Does the battery level impact usage? Do users perform "manual" power management? When do people charge their phones?
This topic involves analyzing traces from the
DeviceAnalyzer dataset.
Required expertise includes a machine learning background, with knowledge about mobile energy consumption as a plus.
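One of the questions above, "when do people charge their phones?", reduces to detecting charging sessions in battery traces. The sketch below finds the start of each session in an invented one-day trace; the real analysis would run on DeviceAnalyzer data.

```python
# A stdlib sketch: detect charging-session starts in a battery trace,
# i.e., points where the level turns from falling to rising (toy data).
trace = [  # (hour of day, battery %) sampled over one day
    (8, 80), (10, 65), (12, 50), (14, 35), (15, 20),
    (16, 45), (17, 70), (18, 95), (21, 90), (23, 60),
    (23.5, 75),
]

starts = []
for i in range(1, len(trace) - 1):
    _, prev = trace[i - 1]
    hour, cur = trace[i]
    _, nxt = trace[i + 1]
    if cur <= prev and nxt > cur:     # falling (or flat), then rising
        starts.append(hour)

print(starts)  # → [15, 23]
```

Aggregating such session starts over many users and days would give the charging-time distributions the project asks about.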
Supervisors
Mario Di Francesco (mario.di.francesco@aalto.fi),
Jesse Read (jesse.read@aalto.fi)
Student
Aleksandra Neupokoeva
Project #20
Analyzing V(D)J recombination in human T-Cell Receptors
Description
T-Cell Receptor (TCR) diversity is a primary component of the Adaptive Immune Response.
This diversity is produced by a process of random genomic rearrangement known as V(D)J recombination.
This involves events such as single nucleotide mutations as well as insertions and deletions in a fixed region of the genome responsible for coding these receptors.
In our group, we have sequencing data from a few patients targeted on the hyper-variable CDR3 domain of the TCRs.
These data sets contain both the genomic and amino-acid sequences, as well as the number of cells containing each of these sequences. In total, there are up to a million unique genomes in each data set.
The student has already familiarized himself with the data while working on his Master's thesis, and has analyzed the diversity of the CDR3 domain based on the counts of unique genome instances.
The next step in this analysis would be to understand how much these unique sequences actually differ from each other (due to recombination).
This problem can be approached by computing distance measures, such as the Hamming and Levenshtein (edit) distances, between pairs of sequences.
Given the large number of unique genome sequences, one cannot simply compute the distances between all pairs of sequences.
Therefore, filtering or alternative techniques, such as locality-sensitive hashing, must be used.
For obtaining network-level understanding of the differences, we can use different clustering techniques and/or graph filtering methods such as the minimal spanning tree.
The precise methods to be used are not yet fully decided (they will be refined along the way), but the final goal of the project is to produce network visualizations of the genome differences that could help in understanding how the recombination of the T-cell receptors takes place (for example, which sequences could function as precursors for others).
Supervisor
Rainer Kujala, Doctoral Student, Complex Systems Group
Student
Kunal Aggarwal
Project #21
User demographics prediction
Description
How well can we predict user demographics (age, gender, income, etc.)
based on which apps the user has installed and which device the user owns?
This information would be valuable, e.g., to app developers who wish to better understand their user base.
The dataset for this project is provided by Verto Analytics which is a
quickly growing media research company based in Otaniemi and New York City.
This project gives you a chance to work on a real-world problem and a unique dataset.
Supervisor
Eric Malmi (eric.malmi@aalto.fi), Aalto University / Verto
Analytics
Student
Janaki Koirala
Project #22
Making deep networks more robust to sensor malfunction
Description
Imagine an automation system that has multiple sensors with enough
redundancy to operate in principle, even if any one of the sensors is
not working. Feedforward neural networks are in trouble when part of the
input is missing. A possible solution would be to have a separate
network for each sensor, only combining the signals just before the
softmax nonlinearity at the output (assuming a classification task). One
could first train only the bias at the softmax layer and then each
network separately in combination with the fixed bias. This would
closely correspond to the naive Bayes classifier that treats inputs
independently. One could continue exploring possibilities, including
training the network with artificial patterns of malfunctions
(corresponding to dropout regularization), and comparing combining the
signals from different sensors at early, middle, or late stages of the
neural network. Experimentation could be done using the standard MNIST
classification benchmark test, while the different sensors would be
simulated by treating four quadrants of the input image as four sensors.
The idea is novel (to my knowledge) and could lead to a publishable
research paper.
Supervisor
Tapani Raiko (tapani.raiko@aalto.fi)
Student
Project #23
Automatic caption generation for images using recurrent neural networks
Description
Supervisor
Jorma Laaksonen (jorma.laaksonen@aalto.fi)
Student
Shetty Rakshith
Project #24
Visualizing manifold data using limited steps geodesic distance
Description
Supervisor
Teemu Roos
Student
Zhao Yang
Project #25
Fraud detection on Holvi data
Description
Supervisor
Hongyu Su
Student
Sérgio Isidoro