T-61.5910 - Research Project in Computer and Information Science P
Fall 2015
List of suggested projects
Project #1
Big data analysis for crime and fraud detection (Nordea)
Description
This project will give students the opportunity to apply data mining and machine learning techniques for the purpose of detecting crime and fraud in large, multi-typed, financial data.
The students will be given access to Nordea data in a controlled environment
and will be supervised by Nordea data-analysis experts.
Supervisors
From the Nordea side:
Sauli Pahlman (Sauli.Pahlman@nordea.com),
Dmitri Guirenko (dmitri.guirenko@nordea.com),
and from the Aalto side: Aristides Gionis (aristides.gionis@aalto.fi)
Student
Albert Arockiasamy
Project #2
Analysis of recruitment data
Description
tyopaikat.oikotie.fi
is a job-market portal, where employers post job advertisements,
and candidates create profiles and apply to job openings.
The service belongs to Sanoma Corporation.
In this project, students will be given the opportunity to analyze data from the
tyopaikat.oikotie.fi system,
in order to discover interesting patterns in the data,
and develop ideas that could potentially improve the services of the portal.
The students will be supervised by Sanoma data-analysis experts.
Possible projects include:
- Identifying jobs/industries that are "trending", i.e., promising in the near future.
- Identifying discrepancies between jobs and applicants.
- Extracting typical career paths.
- Improving advanced search -- what keywords should be included in a job listing to attract candidates?
- Finding what features make a job listing more attractive.
- Correlating job listing/searches data with unemployment data and other economic indicators.
- Performing discrimination analysis.
Note that the data is mostly in Finnish, so understanding Finnish is a plus.
Supervisors
From the Sanoma side:
Mika Ruokonen (mika.ruokonen@sanoma.fi),
and from the Aalto side:
Michael Mathioudakis (michael.mathioudakis@hiit.fi),
Indre Zliobaite (indre.zliobaite@aalto.fi).
Student(s)
Project #3
Molecular-level modelling and visualisation of custom-designed 3D DNA nanostructures
Description
DNA nanotechnology
is a multidisciplinary, emerging field of
technology that aims to create nanoscale structures by a process of
self-organisation, using DNA as the construction material.
A recent development within this field is the synthesis of 3D DNA
nanostructures from very general mesh-based designs, by a process that
could be compared to
"3D printing at the nanoscale".
This technique was pioneered by our research group at the Aalto
Computer Science department, in collaboration with Prof.
Björn Högberg's biochemistry laboratory at the Karolinska Institutet in Stockholm.
The design tool is currently implemented as a module of the
"vHelix"
DNA nanostructure design package developed at the Högberg Lab.
While we now have a complete design pipeline that leads from a
high-level graphical mesh model to the eventual DNA sequences that
self-organise into the desired structure, we are still lacking a good
understanding of a number of molecular-level defects that appear in
electron microscopy images of the nanoscale structures. To study
these, and consequently improve the methodology, it would be highly
useful to export the static nucleotide-level designs from the vHelix
software to a dynamic, molecular-level modelling and simulation
environment such as the oxDNA package,
that provides also more advanced simulation and visualisation
capabilities.
Tasks
- Study the so-called
"DNA origami"
technique and the algorithmic
methods involved in translating a 3D graphical mesh model into the DNA
strands that hybridise into the desired structure.
- Learn about the vHelix and oxDNA modelling and simulation packages.
- Implement a software module that exports a vHelix nanostructure model
into the oxDNA package.
- Experiment numerically and visually, using oxDNA, with some of the
DNA nanostructure designs presented in the article mentioned in the first task above.
Supervisors
Pekka Orponen (Pekka.Orponen@aalto.fi), Abdulmelik Mohammed (Abdulmelik.Mohammed@aalto.fi)
Student
Project #4
Gaussian processes
Description
A Gaussian process is a flexible non-parametric model useful in probabilistic machine learning. The GPstuff Matlab/Octave toolbox is a versatile collection of Gaussian process models and the computational tools required for inference (it is among the top 5% most downloaded machine learning open-source software at
mloss.org).
There are many new ideas waiting to be implemented in the software.
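Although GPstuff itself is a MATLAB/Octave toolbox, the core computation behind Gaussian process regression can be sketched in a few lines of Python. The sketch below shows the posterior mean under a squared-exponential kernel; the data and hyperparameters are illustrative assumptions, not part of GPstuff.

```python
# A numpy sketch of basic Gaussian process regression: posterior mean
# of f at a test input under a squared-exponential kernel. Data and
# hyperparameters are illustrative placeholders.
import numpy as np

def sq_exp(a, b, ell=1.0, sf=1.0):
    """Squared-exponential covariance between 1-D input vectors a and b."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # training inputs
y = np.sin(x)                                # noisy-free observations of sin
noise = 0.01                                 # observation noise variance

K = sq_exp(x, x) + noise * np.eye(len(x))    # kernel matrix plus noise
x_star = np.array([0.5])                     # test input
k_star = sq_exp(x_star, x)                   # cross-covariances, shape (1, 5)
mean = k_star @ np.linalg.solve(K, y)        # GP posterior mean at x_star

print(float(mean))  # close to sin(0.5) ~ 0.48
```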
Prerequisites
Bayesian statistics, Programming skills (MATLAB)
Supervisor
Aki Vehtari (aki.vehtari@aalto.fi)
Student
Project #5
Large scale positive unlabeled learning for biological compounds
Description
Our knowledge of compounds of biological interest is very limited.
Out of the 50,000,000 compounds in
PubChem (a molecule database), we only know that around 300,000 are of biological interest; the remaining
compounds may or may not be of biological interest. The goal of this project is to predict, among the
remaining ones, which are of biological interest, based on their molecular fingerprints (features). The problem
perfectly suits the Positive-Unlabeled (PU) learning setting, where the label information for most of the
data examples is missing and we only know a small portion of the data with positive labels (no negative labels).
To deal with ~50,000,000 training instances, large-scale machine learning techniques or platforms such as
Spark could be used.
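To make the PU setting concrete, here is a minimal sketch of one standard approach (the Elkan-Noto adjustment): train a classifier to separate labeled positives from everything else, then rescale its scores. The 2-D synthetic data stands in for molecular fingerprints; all parameters are illustrative assumptions.

```python
# Sketch of positive-unlabeled (PU) learning via the Elkan-Noto
# adjustment, on synthetic data standing in for molecular fingerprints.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# True positives and negatives (unknown to the learner).
X_pos = rng.normal(loc=+2.0, size=(2000, 2))
X_neg = rng.normal(loc=-2.0, size=(8000, 2))

# Only a small fraction of positives is labeled; the rest is unlabeled.
labeled = rng.rand(len(X_pos)) < 0.3
X = np.vstack([X_pos, X_neg])
s = np.concatenate([labeled.astype(int), np.zeros(len(X_neg), dtype=int)])

# Step 1: train a "non-traditional" classifier g(x) ~ P(s=1 | x)
# that separates labeled positives from everything else.
g = LogisticRegression().fit(X, s)

# Step 2: estimate c = P(s=1 | y=1) as the mean score on labeled positives.
c = g.predict_proba(X_pos[labeled])[:, 1].mean()

# Step 3: the corrected posterior is P(y=1 | x) = P(s=1 | x) / c.
p_y = np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)

pos_unlabeled = p_y[:len(X_pos)][~labeled]   # hidden positives
neg = p_y[len(X_pos):]                       # true negatives
print(pos_unlabeled.mean() > neg.mean())     # → True
```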
Supervisor
Huibin Shen (huibin.shen@aalto.fi)
Student
Paul Tardy
Project #6
Bayesian networks for big data
Description
Bayesian networks are a widely-used class of probabilistic graphical models. They are used to represent joint distributions of several random variables. Typically, they are learned from data. Learning Bayesian networks is NP-hard and thus exact algorithms do not scale up to hundreds of variables. Therefore, in a big data setting one has to resort to heuristic algorithms which do not give any quality guarantees. But how well do these algorithms perform in practice?
The task in this project is to conduct a small simulation study on learning Bayesian networks with big data. This includes generating data, learning networks with selected algorithms and assessing performance.
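The simulate-learn-assess loop above can be sketched in miniature. The toy below generates data from a known two-variable network and checks whether a BIC score recovers the true edge; the network, parameters, and sample size are illustrative assumptions, far smaller than the big-data setting of the project.

```python
# A minimal sketch of the simulate-learn-assess loop for Bayesian
# network structure learning: two binary variables, BIC scoring.
import math
import random

random.seed(0)
N = 5000

# Generate data from a known ground-truth network A -> B.
data = []
for _ in range(N):
    a = 1 if random.random() < 0.5 else 0
    b_prob = 0.9 if a == 1 else 0.1          # strong dependence on A
    b = 1 if random.random() < b_prob else 0
    data.append((a, b))

def loglik(counts, n):
    """Multinomial maximum log-likelihood from a count dictionary."""
    return sum(c * math.log(c / n) for c in counts.values() if c > 0)

def bic_independent(data):
    # Model: A and B independent; 2 free parameters.
    ll = 0.0
    for i in (0, 1):
        counts = {}
        for row in data:
            counts[row[i]] = counts.get(row[i], 0) + 1
        ll += loglik(counts, len(data))
    return ll - (2 / 2) * math.log(len(data))

def bic_edge(data):
    # Model: A -> B; parameters: P(A=1), P(B=1|A=0), P(B=1|A=1).
    counts_a = {}
    for a, _ in data:
        counts_a[a] = counts_a.get(a, 0) + 1
    ll = loglik(counts_a, len(data))
    for a_val in (0, 1):
        counts_b = {}
        sub = [b for a, b in data if a == a_val]
        for b in sub:
            counts_b[b] = counts_b.get(b, 0) + 1
        ll += loglik(counts_b, len(sub))
    return ll - (3 / 2) * math.log(len(data))

# Assessment: the true structure should win on enough data.
print(bic_edge(data) > bic_independent(data))  # → True
```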
Supervisor
Pekka Parviainen
(pekka.parviainen@aalto.fi)
Student
Emre Celikten
Project #7
Detection of emotional affect in images
Description
Every image conveys an affective message. An image can cause you to feel happy or unhappy depending on its content. Detecting this affective message and the emotional state of the person exposed to the image has been an interesting research question. One approach to this problem is to use people's eye movements in order to detect the influence of the image and the affective state. In this project, we plan to study various classification approaches for this purpose.
We explore both scenarios: in the first part, we analyze the contribution of classifiers to the task of detecting an image's emotional content and try to identify the influential features. Then, using the results of the first part of the project, we try to build a model that predicts the emotion of an observer watching the images used in the first part of the study.
The steps towards successful completion of this project are the following:
- Literature review, to address the following questions:
- What are the existing methods for the task?
- How does the eye movement contribute?
- What features can be used, and what motivates the choice of features?
- Analysis of the features and their correlation in the data.
- Creating a baseline system.
- Studying various classifiers for the task.
- Analysis of the performance of the classifiers.
- Building a model based on the findings of the first part of the project.
- Implementing an interface for user interaction with the photos and recording their gaze.
- Testing the model in real-time.
- Writing a report on the findings.
The project requires interest in HCI, knowledge of machine learning, programming skills, and patience.
You may need to gather some data for the second part.
Supervisors
Hamed Tavakoli (hamed.r-tavakoli@aalto.fi),
Jorma Laaksonen (jorma.laaksonen@aalto.fi)
Note
Preference will be given to students from the EIT Digital Master School's Human Computer Interaction and Design
(HCID) major for this project.
Student
Filippo Forti
Project #8
Analysis of jigsaw puzzle solving using gaze
Description
Analysis of human gaze while performing various tasks has been an interesting topic for psychologists and human-computer interaction scientists over the past decades. There has been huge interest in task decoding using machine learning approaches in recent years. In this project, we aim to analyze gaze behavior while solving a jigsaw puzzle. What are the characteristics of a user's gaze while picking up a piece and solving the problem? Can we find what causes a user to make a mistake by analyzing his/her gaze?
To answer these questions, we first need to implement a user-friendly gaze-based jigsaw game. Then, we can proceed with the analysis of the data that we gather. In this project, we mostly focus on the first part, since successful analysis highly depends on it.
The steps towards completion of this project are the following:
- Literature review.
- Proposing an intuitive user interface design for the puzzle.
- Implementing the interface and making sure it works without errors.
- Gathering data from users solving the game.
- Analyzing the users' gaze data.
- Reporting the findings.
The project requires interest in HCI, statistical analysis, machine learning, programming skills, and patience.
Note: there could be two types of interfaces: one that relies on a multi-touch display and eye-tracking glasses for solving the problem, and a desktop interface that uses a desktop eye tracker. The first one is more challenging, though it offers the possibility of utilizing touch information as well.
Supervisors
Hamed Tavakoli (hamed.r-tavakoli@aalto.fi),
Jorma Laaksonen (jorma.laaksonen@aalto.fi)
Note
Preference will be given to students from the EIT Digital Master School's Human Computer Interaction and Design (HCID) major for this project.
Student
Project #9
Video annotation using Gaze
Description
Video annotation is time-consuming work that requires a lot of effort. Traditionally, one must manually select the object of interest in several key-frames and track it within those key-frames. In this project, in order to reduce the amount of interaction and speed up the process, we annotate videos using gaze as an input: the user looks for a specific object while watching a video. The project involves challenges in the domains of object detection and tracking.
The steps towards completion of this project are the following:
- Literature review.
- Making an object detector that detects objects of interest.
- Building a tracker to keep track of the detected object.
- Implementing a gaze interface to record the user's gaze.
- Combining the gaze information and object detector information to perform semi-automatic annotation and improve the annotation results.
The project requires interest in HCI, knowledge of machine learning, programming skills, and patience. Some knowledge of computer vision is highly recommended.
Supervisors
Hamed Tavakoli (hamed.r-tavakoli@aalto.fi),
Jorma Laaksonen (jorma.laaksonen@aalto.fi)
Note
Preference will be given to students from the EIT Digital Master School's Human Computer Interaction and Design (HCID) major for this project.
Student
Mohamed Soliman
Project #10
Multimedia recommender system using Linked Open Data
Description
Over recent years, the Web has evolved from a collection of mostly text-based content into a large multimedia database. Today we are able to access many different types of multimedia, such as images, video, audio, live TV streams, as well as animated and interactive content. On the other hand, current systems that recommend multimedia content have the limitation that they recommend only a specific type of multimedia object rather than blending different types in a personalized manner. This limitation is due to the complexity of analyzing and correlating multiple types of multimedia objects with the user's interests.
The Linked Open Data initiative opens new possibilities for the implementation of multimedia recommender systems. Linked Data is a way of exposing and sharing data as resources on the Web and interlinking them with semantically related resources. For multimedia content, metadata is a key factor for efficient management, organization, and retrieval. Metadata is used not only to describe low-level attributes of multimedia objects (such as length, resolution, or color depth), but also to describe high-level semantic features (such as genre classification or information about depicted persons).
The objective of this project is to explore the potential of Linked Open Data for the task of recommending diverse multimedia content.
Supervisors
Cristina Gonzalez-Caro (cristina.gonzalez-caro@aalto.fi),
Jorma Laaksonen (jorma.laaksonen@aalto.fi)
Student
Project #11
Analysis of citation networks
Description
We have a collection of citation data from Google Scholar covering almost 5 months, together with a citation network of scientists. The task in this project is to study the pattern of growth of citations in connection with various network-related factors (such as centrality).
Background
We are looking for a motivated student with experience in Python.
Background in some sort of large scale network analysis would be a plus.
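The kind of analysis intended can be sketched with the standard library alone: compute a centrality measure for each scientist and relate it to citation growth. The toy edge list and growth figures below are invented placeholders for the Google Scholar data.

```python
# A stdlib-only sketch: degree centrality on a toy scientist network,
# compared against (hypothetical) citation growth over the window.
from collections import defaultdict

# Undirected citation-network edges between scientists (toy data).
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]

degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Normalized degree centrality: degree / (n - 1).
n = len(degree)
centrality = {v: d / (n - 1) for v, d in degree.items()}

# Hypothetical citation growth over the observation window.
growth = {"a": 120, "b": 60, "c": 55, "d": 40, "e": 10}

# Compare the rankings induced by centrality and by citation growth.
by_centrality = sorted(centrality, key=centrality.get, reverse=True)
by_growth = sorted(growth, key=growth.get, reverse=True)
print(by_centrality[0], by_growth[0])  # both "a" in this toy example
```

A real analysis would replace degree with richer measures (betweenness, PageRank) and the toy comparison with a proper rank correlation.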
Supervisor
Kiran Garimella (kiran.garimella@aalto.fi)
Student
Siddharth Rao
Project #12
Timeseries-methods for forecasting the economy
Description
We have a dataset of key macroeconomic variables from the past 20 years, ranging from the GDP of major economies to commodity prices and even weather data. Being able to forecast these variables ahead of time would be interesting and potentially even profitable. The task of the student is to analyze this data using relevant methods, including traditional time-series models (e.g., ARIMA) and common machine-learning methods (e.g., random forests). During the project, the student will become familiar with various time-series models as well as machine-learning methods and techniques.
Supervisor
Pyry Takala (pyry.takala@gmail.com)
Student
Pablo Alonso and Ronnie Pereira
Project #13
Automatic essay scoring
Description
Most essays are still corrected by hand today, which represents a significant effort for teachers and teaching assistants. Recently, Kaggle organized a competition on essay scoring.
The student will apply a set of standard machine learning / natural language processing methods to this task.
Time permitting, the student can also take advantage of recently developed advanced methods,
for instance analyzing the essays using word embeddings created by neural networks.
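A common baseline for this task is hand-crafted features fed to a linear model, which the sketch below illustrates. The essays and human grades are tiny invented placeholders, not the Kaggle data.

```python
# A feature-based baseline sketch for essay scoring: simple surface
# features plus least-squares regression on toy data.
import numpy as np

essays = [
    "short essay with few words",
    "a somewhat longer essay that develops its argument in more detail",
    "a long and carefully structured essay that states a thesis supports "
    "it with varied vocabulary and concludes with a clear summary",
]
scores = np.array([1.0, 2.0, 3.0])  # hypothetical human grades

def features(text):
    words = text.split()
    return [
        len(words),                          # essay length
        sum(map(len, words)) / len(words),   # average word length
        len(set(words)) / len(words),        # vocabulary richness
    ]

X = np.array([features(e) for e in essays])
w, *_ = np.linalg.lstsq(X, scores, rcond=None)
preds = X @ w
print(preds)  # fits the three toy grades closely
```

Real systems would add spelling, syntax, and discourse features, and evaluate with quadratic weighted kappa as in the Kaggle competition.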
Supervisor
Pyry Takala (pyry.takala@gmail.com)
Student
Project #14
Text classification
Description
A very common, practical natural language processing task is to classify text into different categories.
This is necessary for instance when recommending a document based on its contents, when automating office workflow, when grading exams automatically, or, classically, when writing a spam filter.
Classification can be between two or multiple classes, classes can have hierarchies, and in some tasks a document can fall under multiple classes.
The student's task will be to apply state-of-the-art language processing / classification methods to the
20NG dataset
and report results on this task.
A tutorial that can help the student get started can be found on the scikit-learn website.
Time permitting, the student can test more methods or test methods also on a different dataset,
e.g., the hierarchical LSHTC-challenge.
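A classic baseline for this task is multinomial naive Bayes, sketched below with the standard library only. The tiny two-class corpus is an invented stand-in for the 20NG data; the scikit-learn tutorial mentioned above implements the same idea at full scale.

```python
# A stdlib-only multinomial naive Bayes text classifier with add-one
# smoothing, trained on a toy two-class corpus.
import math
from collections import Counter, defaultdict

train = [
    ("sport", "the team won the match"),
    ("sport", "a great goal in the final match"),
    ("tech", "the new gpu trains the model fast"),
    ("tech", "a neural model beats the old baseline"),
]

word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for label, text in train:
    class_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log P(class) + sum of log P(word | class), add-one smoothed
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) /
                              (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("the model trains fast"))  # → tech
```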
Supervisor
Pyry Takala (pyry.takala@gmail.com)
Student
Project #15
Clustering of Android Applications
Description
We are provided with a big database of more than 100,000 samples.
Files are represented as strings, and the dimensionality is quite high: more than 1 million features.
The research could also involve grouping the features, dimensionality reduction, and applying deep learning techniques.
Benefits
- Getting to know the research group and potentially being hired as a summer intern.
- Collaboration with F-Secure and attending meetings with them.
- Supervision by Luiza Sayfullina
and Emil Eirola.
- Working with a real dataset on a challenging problem.
We give priority to second-year students, as some basic knowledge of machine learning is required.
We might select two students for this task.
Supervisors
Luiza Sayfullina (sayfullina.luiza@gmail.com) and Emil Eirola (emil.eirola@arcada.fi)
Student(s)
Antti Savolainen
Project #16
Analysis of first name popularities in historical Finland
Description
The popularity of common Finnish first names (such as Maria and Johannes) goes in cycles of more than 100 years (source: Wikipedia), and different names are popular in different parts of Finland. Your task is to analyze and discover these kinds of spatio-temporal patterns from a dataset consisting of about 5 million birth records between the years 1600 and 1850.
Supervisor
Eric Malmi (eric.malmi@aalto.fi)
Student
Teemu Kärkkäinen
Project #17
Topic modeling for rap lyrics analysis
Description
What kind of topics do rappers talk about in their lyrics? Your task is to address this question computationally using topic modeling techniques such as Latent Dirichlet allocation. Additionally, you can study how the topics of the lyrics have evolved over time. For instance, does the amount of profane language decrease if a rapper becomes a parent?
Supervisor
Eric Malmi (eric.malmi@aalto.fi)
Student
Project #18
Speech synthesis for building an automated rap bot
Description
DeepBeat is a rap lyrics generator developed at Aalto University and the University of Helsinki,
which has recently gained a lot of media attention.
Your task is to give voice to DeepBeat by applying speech synthesis.
Your project consists of two steps:
- Do a literature review on singing synthesis.
- Implement a "rap synthesizer" by applying any existing, free speech synthesis software such as eSpeak.
The main challenge in the second part is to be able to adjust the timing of the words so that the rhyming syllables are spoken out at a certain time of the bar. Additionally, you may want to adjust pitch for certain words in order to emphasize the rhyming parts.
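The timing problem alone can be sketched as simple arithmetic: given a tempo and the beats on which the rhyming words must land, compute when each word should start. The line and anchor positions below are invented; feeding the resulting timings to a synthesizer such as eSpeak (e.g., via its speed settings) is left open.

```python
# A sketch of rap timing: schedule word start times so that rhyming
# words land exactly on chosen beats of the bar (toy line and anchors).
BPM = 90
beat = 60.0 / BPM                     # seconds per beat

line = "my rhymes are stacked and they never cracked".split()

# Rhyming words must start exactly on beats 1 and 3 (0-indexed beats).
anchors = {3: 1 * beat, 7: 3 * beat}  # word index -> target start time

# Interpolate: spread the words between anchors evenly in time.
times = [0.0] * len(line)
points = sorted([(0, 0.0)] + list(anchors.items()))
for (i0, t0), (i1, t1) in zip(points, points[1:]):
    for j in range(i0, i1 + 1):
        times[j] = t0 + (t1 - t0) * (j - i0) / (i1 - i0)

print([round(t, 2) for t in times])
# → [0.0, 0.22, 0.44, 0.67, 1.0, 1.33, 1.67, 2.0]
```

Each word's allotted duration (the gap to the next start time) would then determine the speech rate requested from the synthesizer.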
Prerequisites
Good programming skills.
Knowledge of signal processing and an interest in rap are considered an advantage but are not required.
References
E. Malmi, P. Takala, H. Toivonen, T. Raiko, A. Gionis
"DopeLearning: A Computational Approach to Rap Lyrics Generation".
Supervisor
Eric Malmi (eric.malmi@aalto.fi)
Student
Project #19
Smartphone User Behavior
Description
We use our smartphones all the time and are afraid of running out of battery while using Google Maps in a remote place. But can smartphone usage behavior be quantified? Does the battery level impact usage? Do users perform "manual" power management? When do people charge their phones?
This topic involves analyzing traces from the
DeviceAnalyzer dataset.
Required expertise includes a machine learning background, with knowledge about mobile energy consumption as a plus.
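One of the questions above, "when do people charge their phones?", reduces to detecting charging sessions in battery traces. The sketch below finds the start of each session in an invented one-day trace; the real analysis would run on DeviceAnalyzer data.

```python
# A stdlib sketch: detect charging-session starts in a battery trace,
# i.e., points where the level turns from falling to rising (toy data).
trace = [  # (hour of day, battery %) sampled over one day
    (8, 80), (10, 65), (12, 50), (14, 35), (15, 20),
    (16, 45), (17, 70), (18, 95), (21, 90), (23, 60),
    (23.5, 75),
]

starts = []
for i in range(1, len(trace) - 1):
    _, prev = trace[i - 1]
    hour, cur = trace[i]
    _, nxt = trace[i + 1]
    if cur <= prev and nxt > cur:     # falling (or flat), then rising
        starts.append(hour)

print(starts)  # → [15, 23]
```

Aggregating such session starts over many users and days would give the charging-time distributions the project asks about.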
Supervisors
Mario Di Francesco (mario.di.francesco@aalto.fi),
Jesse Read (jesse.read@aalto.fi)
Student
Aleksandra Neupokoeva
Project #20
Analyzing V(D)J recombination in human T-Cell Receptors
Description
T-Cell Receptor (TCR) diversity is a primary component of the Adaptive Immune Response.
This diversity is produced by a process of random genomic rearrangement known as V(D)J recombination.
This involves events such as single nucleotide mutations as well as insertions and deletions in a fixed region of the genome responsible for coding these receptors.
In our group, we have sequencing data from a few patients targeted on the hyper-variable CDR3 domain of the TCRs.
These data sets contain both the genomic and amino-acid sequences, as well as the number of cells containing each of these sequences. In total, there are up to a million unique genomes in each data set.
The student has already familiarized himself with the data while working on his Master's thesis, and has analyzed the diversity of the CDR3 domain based on the counts of unique genome instances.
The next step in this analysis would be to understand how much these unique sequences actually differ from each other (due to recombination).
This problem can be approached by computing distance measures, such as the Hamming and Levenshtein (edit) distances, between pairs of sequences.
Given the large number of unique genome sequences, one cannot simply compute the distances between all pairs of sequences.
Therefore, filtering or alternative techniques, such as locality-sensitive hashing, must be used.
For obtaining network-level understanding of the differences, we can use different clustering techniques and/or graph filtering methods such as the minimal spanning tree.
The precise methods to be used are not yet fully decided (they will be refined along the way), but the final goal of the project is to produce network visualizations of the genome differences that could help in understanding how the recombination of the T-cell receptors takes place (for example, which sequences could function as precursors for others).
Supervisor
Rainer Kujala, Doctoral Student, Complex Systems Group
Student
Kunal Aggarwal
Project #21
User demographics prediction
Description
How well can we predict user demographics (age, gender, income, etc.)
based on which apps the user has installed and which device the user owns?
This information would be valuable, e.g., to app developers who wish to better understand their user base.
The dataset for this project is provided by Verto Analytics which is a
quickly growing media research company based in Otaniemi and New York City.
This project gives you a chance to work on a real-world problem and a unique dataset.
Supervisor
Eric Malmi (eric.malmi@aalto.fi), Aalto University / Verto
Analytics
Student
Janaki Koirala
Project #22
Making deep networks more robust to sensor malfunction
Description
Imagine an automation system that has multiple sensors with enough
redundancy to operate in principle, even if any one of the sensors is
not working. Feedforward neural networks are in trouble when part of the
input is missing. A possible solution would be to have a separate
network for each sensor, only combining the signals just before the
softmax nonlinearity at the output (assuming a classification task). One
could first train only the bias at the softmax layer and then each
network separately in combination with the fixed bias. This would
closely correspond to the naive Bayes classifier that treats inputs
independently. One could continue exploring possibilities, including
training the network with artificial patterns of malfunctions
(corresponding to dropout regularization), and comparing combining the
signals from different sensors at early, middle, or late stages of the
neural network. Experimentation could be done using the standard MNIST
classification benchmark test, while the different sensors would be
simulated by treating four quadrants of the input image as four sensors.
The idea is novel (to my knowledge) and could lead to a publishable
research paper.
Supervisor
Tapani Raiko (tapani.raiko@aalto.fi)
Student
Project #23
Automatic caption generation for images using recurrent neural networks
Description
Supervisor
Jorma Laaksonen (jorma.laaksonen@aalto.fi)
Student
Shetty Rakshith
Project #24
Visualizing manifold data using limited steps geodesic distance
Description
Supervisor
Teemu Roos
Student
Zhao Yang
Project #25
Fraud detection on Holvi data
Description
Supervisor
Hongyu Su
Student
Sérgio Isidoro