Friday, April 23, 2010

Currently in Oak Ridge

Just before Easter, I was notified that my submission to CSIIRW '10, the 6th Annual Cyber Security and Information Intelligence Research Workshop, had been accepted. With funding approved less than two weeks before the start of the workshop, organising travel from Australia to the United States was a bit hectic. But I'm pleased to say everything worked out, and after over 20 hours of flying, I am here in Tennessee.

Location: The workshop itself is held at Oak Ridge National Laboratory, a national research centre with an interesting history. Established in 1943, ORNL was part of the secret Manhattan Project, pioneering a method for producing and separating plutonium. Apparently I'm sitting near a nuclear reactor right now? The laboratory is in Oak Ridge, where the whole town seems to have been built in support of the research laboratories in the area.

Currently, the lab facilitates six major areas of research: neutron science, energy, high-performance computing, systems biology, materials science at the nanoscale and national security. The workshop that I presented at falls under national security. While at ORNL, attendees had the opportunity to take a tour around the facilities and have a look at both Jaguar and Kraken, the first and third fastest supercomputers in the world. We were also shown the type of simulations the computers run to support the research performed by other parts of the laboratory. Very amazing indeed. Feel free to read up more about the research lab on their official website: http://www.ornl.gov/

Content: As the title of the workshop suggests, the focus was on cyber security and information intelligence. The plenary speakers spoke on a range of issues including national security, system security and web security. Keynote bios can be found here: http://www.ioc.ornl.gov/csiirw/keynotebios.html. As some of these areas haven't been the primary focus of my research in the past several years, the talks raised many interesting issues that I had not considered. What strategy should be taken to make security less beneficial to the "bad guys" and more in favour of the "good guys"? An aikido approach to redirecting threats can be taken: use the force of the attacker to beat them at their own game. We should be making detection systems online and capable of analysing larger volumes of data. Design for failure and have a recovery plan! The keynote speakers really made this conference for me.

The paper sessions looked at design, malware, network, privacy and metrics, enterprise, survivability, formal methods and trust. Most of the time I had difficulty deciding which room to go to. I usually ended up in the network/malware stream, listening to talks on malware classification and any sort of categorisation that used data mining tools.

There were also some interesting posters out in the lobby area, available to be read at your leisure during the entire event.

My Work: The paper I had accepted and presented this morning was titled Graph Based Strategies to Role Engineering. It lays the foundations of my current research in graph based role engineering: defining a set of roles that accurately reflects the internal functionality of an enterprise for RBAC. To identify the roles, we first map users, permissions and roles to nodes, and user-to-permission, user-to-role, role-to-role and role-to-permission assignments to edges in a directed acyclic graph (DAG). There are three graphs:

  • UPGraph
  • URPGraph
  • UHRPGraph

There are three different cost models:
  • Role minimisation: cost(G) = c1|VR|
  • Edge minimisation: cost(G) = c2|E|
  • Role and edge minimisation: cost(G) = c1|VR| + c2|E|
where c1 and c2 are the static costs of role and assignment administration respectively, |VR| is the number of role nodes in the graph and |E| is the number of edges in the graph.
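
As an illustration, here is a minimal Python sketch of the three cost models over a toy role graph. The graph encoding (a set of role nodes plus an edge list) and the default weights are my own assumptions for the example, not the implementation from the paper.

```python
# Hypothetical sketch of the three cost models over a toy role graph.
# The representation (set of role nodes plus an edge list) is an
# illustrative assumption, not the implementation from the paper.

def cost(role_nodes, edges, c1=1.0, c2=1.0, model="role_and_edge"):
    """Return the cost of a role graph under one of the three models."""
    if model == "role":            # cost(G) = c1 * |VR|
        return c1 * len(role_nodes)
    if model == "edge":            # cost(G) = c2 * |E|
        return c2 * len(edges)
    # cost(G) = c1 * |VR| + c2 * |E|
    return c1 * len(role_nodes) + c2 * len(edges)

# Toy UPGraph-style example: two users, three permissions, one role.
role_nodes = {"r1"}
edges = [("u1", "r1"), ("u2", "r1"),                 # user-to-role assignments
         ("r1", "p1"), ("r1", "p2"), ("r1", "p3")]   # role-to-permission assignments

print(cost(role_nodes, edges, model="role"))           # 1.0
print(cost(role_nodes, edges, model="edge"))           # 5.0
print(cost(role_nodes, edges, model="role_and_edge"))  # 6.0
```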

Using both the graph model and the cost metrics, we propose a heuristic strategy for optimisation. Please check the paper for more details on the heuristic and some preliminary results.

Wednesday, March 03, 2010

RMIT and The University of Melbourne's Google Sponsored Girl Geek Coffee Club

My how time flies. It's Semester 1 2010 already. While we had a little bit of a break in second semester 2009, RMIT and UoM's Google Sponsored Girl Geek Coffee Club is back and ready to chill, chit chat and consume coffee (coffee provided for free by Google of course!).

We have booked it in for the afternoon of the 24th of March, the 4th week back at uni, at Bar Commercio. The ICT building is at 111 Barry Street and the new, slick looking commerce building is right behind it. Why not make use of the new facilities?

If you're female and currently studying a technically related degree at RMIT or UoM, come along, have some coffee and meet some other girls in your field. With the Anita Borg Scholarship applications currently open, it's also a great opportunity to speak to past winners about their experiences.

You can find the full details of the club here: http://sites.google.com/site/melbournecoffeeclub/

But the important stuff you need to know is as follows:
Date : 24th March 2010 (Wednesday)
Time : 3 - 4 pm
Venue : Bar Commercio, Gnd Floor 198 Berkeley Street, Carlton 3053, The University of Melbourne
Cost : Free of charge


Please RSVP through http://bit.ly/aQd6X9.

See you on the 24th!

P.S. Girl Geek Coffees are also held at Monash University. Register to be a member to hear about their next event.

The 2010 Google Australia and New Zealand Anita Borg Memorial Scholarship

The Anita Borg Scholarship for women in computing is now open!
Check out the website here: http://www.google.com.au/anitaborg/

It's a great opportunity to discuss the issues faced by women in technology and meet some like-minded individuals. If you're female and enrolled to study a degree in a technical field, please consider applying. Applications are due May 1st 2010, but don't leave it until the last minute!


Monday, May 04, 2009

Training and Learning Bonus for Postgraduate Scholarship Holders

As part of the Australian Government’s Household Stimulus Package, postgraduate scholarship holders are now also eligible for a one-off payment of $950. More information can be found through Centrelink:
http://www.centrelink.gov.au/internet/internet.nsf/individuals/hsp_postgrad.htm

Since scholarships are tax free, many students missed out on the regular $900 package as no tax was paid during the 2007-2008 financial year. Hoorah PhD friends, you can now get $950 ($50 more than regular folk)! The only catch that's made some fellow PhD colleagues ineligible yet again is the $20,427 lower bound on other (non-APA/non-APAI/non-NHMRC) scholarships. The Melbourne University Scholarships Office is emailing/has emailed letters of proof to scholarship recipients.

As for me, I received my $900 a couple of weeks ago; I worked for 4 months at Google during the 2007-2008 financial year, bumping me into the range where I was eligible for the regular stimulus incentive. I don't think I can receive two packages, although the rules don't explicitly exclude me, as I am quite sure I did not receive a Training and Learning Bonus or a Back to School Bonus payment. I tried calling Centrelink but the English lines are constantly busy and it felt like I was put on hold indefinitely on the multilingual/Chinese language lines.

Tuesday, April 07, 2009

Leveraging Lattices to Improve Role Mining

@inproceedings{colantonio08leveraging,
author = {Alessandro Colantonio and Roberto Di Pietro and Alberto Ocello},
title = {Leveraging Lattices to Improve Role Mining},
booktitle = {Proceedings of the IFIP TC 11 23rd International Information Security Conference (SEC'08)},
year = {2008},
isbn = { 978-0-387-09698-8},
pages = {333--347},
location = {Milano, Italy},
publisher = {Springer},
address = {Boston},
}

There have been recent works that use lattices for role mining [1][2]; this paper analyses the lattice properties relevant to role mining. The findings are used to remove data redundancies and compress the lattice representation, which can speed up the search for a role set. The optimisations are tested using Apriori and rationalised using RBAM (Role Based Association Rule Mining). Fewer roles are found, faster.

One of the most basic ways to generate frequent patterns is by using lattices. Applied to role engineering, the lattice represents all possible roles from the given user permission assignments. However, lattices can become very large. This paper maps frequent pattern concepts to RBAC and creates an RBAC representation for the data mining concepts. Lattice properties are described using the new RBAC representation, and it is discussed when roles can be deleted. For example, lattices can produce multiple related roles of the same frequency; only the role of maximal size needs to be kept.
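
To make the lattice idea concrete, here is a brute-force Python sketch that enumerates candidate roles as permission sets shared by enough users and then keeps only the maximal ones; the toy data and the naive enumeration are my own simplification, not the paper's Apriori/RB-Apriori implementation.

```python
from itertools import combinations

# Toy user -> permission assignments (illustrative only).
upa = {
    "alice": {"p1", "p2", "p3"},
    "bob":   {"p1", "p2", "p3"},
    "carol": {"p1", "p2"},
    "dave":  {"p4"},
}

def candidate_roles(upa, min_support=2):
    """Enumerate permission sets shared by at least min_support users (the lattice)."""
    permissions = sorted(set().union(*upa.values()))
    roles = {}
    for size in range(1, len(permissions) + 1):
        for combo in combinations(permissions, size):
            members = {u for u, perms in upa.items() if set(combo) <= perms}
            if len(members) >= min_support:
                roles[frozenset(combo)] = members
    return roles

def keep_maximal(roles):
    """Drop a role if a strictly larger role has exactly the same users (same frequency)."""
    return {r: users for r, users in roles.items()
            if not any(r < r2 and users == u2 for r2, u2 in roles.items())}

lattice = candidate_roles(upa)
print(len(lattice), len(keep_maximal(lattice)))  # the compressed lattice is smaller
```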

Compression techniques are then applied to Apriori for RBAC and tested on real data from an undisclosed domain. In the dataset, there are 954 users and 1108 permissions. Using Apriori with a minimum support of 10%, 299 roles covering 16 permissions were identified; 890 users were assigned these 16 permissions. Using RB-Apriori, 109 roles were found, faster. However, the quality of the RB-Apriori roles is not discussed, and 299 roles over 16 permissions? Doesn't sound ideal.

Sunday, February 01, 2009

Back Again

I have recently returned from a stint at Amazon.com in Seattle, where I was offered a software development position on the security team. During my time there, I was introduced to the Amazon.com infrastructure and their core values. There is a lot of reliance on who you know and what is known by the people you know; this is more reliable than the available documentation. Most things focused on the bottom line: errors are measured against it and motivation is driven by it. It's very different to my other industry experience.

In terms of the project, I was placed on what can be called a practical data mining project. It wasn't quite what I had expected. I performed hands-on data warehouse queries that resulted in large volumes of data, within which I was to find correlations. It was an interesting experience. I worked mostly by myself on this isolated project.

Tuesday, August 19, 2008

Recognition of Authorship at Melbourne University

What constitutes a substantial contribution sufficient to warrant recognition as an author/co-author?

Minimum requirement for authorship should be in accord with the 'Vancouver Protocol'. Authorship is substantial participation, where the following conditions are met:
  1. conception and design, or analysis and interpretation of data; and
  2. drafting the article or revising it critically for important intellectual content; and
  3. final approval of the version to be published.
Participation solely in the acquisition of funding or the collection of data does not justify authorship.

From Student FAQ in relation to research at The University of Melbourne.

Tuesday, June 24, 2008

A Cost-Driven Approach to Role Engineering

@inproceedings{1364198,
author = {Alessandro Colantonio and Roberto Di Pietro and Alberto Ocello},
title = {A cost-driven approach to role engineering},
booktitle = {SAC '08: Proceedings of the 2008 ACM symposium on Applied computing},
year = {2008},
month = {March},
isbn = {978-1-59593-753-7},
pages = {2129--2136},
location = {Fortaleza, Ceara, Brazil},
publisher = {ACM},
address = {New York, NY, USA}
}
This paper proposes association rule mining with cost analysis for role engineering (RBAM - Role Based Association Rule Mining). A cost function based on the number of roles and role relationships, as well as an attribute cost of each role, is used. The attribute cost represents the business semantics that are available. In the absence of high level information, only the role and role relationship costs are used. Association mining is performed on roles to identify inheritance relationships.

The following metrics are presented:
  • support of a role - percentage of users assigned the permissions in the role
  • actual support of a role - percentage of users assigned the role
  • grade of a role - number of permissions assigned to the role
  • confidence of two hierarchically related roles - ratio of the number of users assigned to the superrole to the number of users assigned to the subrole
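
For clarity, here is a small Python sketch of how these four metrics might be computed from toy assignment data; the data structures and the exact reading of each metric are my assumptions, not the paper's definitions verbatim.

```python
# Toy data (illustrative assumptions): which permissions each role contains,
# which permissions each user holds, and which roles each user is directly assigned.
role_perms = {"r_super": {"p1", "p2", "p3"}, "r_sub": {"p1", "p2"}}
user_perms = {"u1": {"p1", "p2", "p3"}, "u2": {"p1", "p2"}, "u3": {"p4"}}
user_roles = {"u1": {"r_super"}, "u2": {"r_sub"}, "u3": set()}

def support(role):
    """Percentage of users whose permissions include all of the role's permissions."""
    hits = sum(1 for perms in user_perms.values() if role_perms[role] <= perms)
    return 100.0 * hits / len(user_perms)

def actual_support(role):
    """Percentage of users directly assigned the role."""
    hits = sum(1 for roles in user_roles.values() if role in roles)
    return 100.0 * hits / len(user_roles)

def grade(role):
    """Number of permissions assigned to the role."""
    return len(role_perms[role])

def confidence(super_role, sub_role):
    """Ratio of users covering the superrole's permissions to users covering the subrole's."""
    sup = sum(1 for p in user_perms.values() if role_perms[super_role] <= p)
    sub = sum(1 for p in user_perms.values() if role_perms[sub_role] <= p)
    return sup / sub if sub else 0.0

print(support("r_sub"), actual_support("r_sub"), grade("r_sub"), confidence("r_super", "r_sub"))
```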

Cost components are analysed and the cost of deleting a role is evaluated in accordance with their cost model.

Their approach is as follows:
Using Apriori, generate a lattice of all possible combinations of assigned permissions as roles above a frequency threshold, removing roles with low support. Remove roles that no users are directly assigned to. Remove a role if doing so does not modify the access control matrix and the cost improves.

Monday, June 23, 2008

Migrating to Optimal RBAC with Minimal Perturbation

Jaideep Vaidya, Vijayalakshmi Atluri and Qi Guo. Migrating to Optimal RBAC with Minimal Perturbation. In SACMAT’08: Proceedings of the thirteenth ACM symposium on Access control models and technologies, Estes Park, Colorado, June 2008.

A new variation to the role mining problem (RMP) is introduced: minimal perturbation RMP. When roles within an enterprise already exist, minimal perturbation RMP aims to identify an optimal set of roles that is also similar to the current configuration. A role migration cost based on role similarity is incorporated with their existing subset enumeration algorithm.

A method for measuring role similarity using the Jaccard coefficient is proposed. In general terms, the Jaccard coefficient is calculated as the ratio of the intersection of two sets to the union of those two sets, producing a value between 0 and 1 that measures similarity: 0 for no similarity (different) and 1 for high similarity (same). In terms of roles, a role is a set of permissions. The similarity of two roles then becomes the ratio of the intersection of their permission sets to the union of their permission sets, again a value between 0 and 1.

Given a proposed role and the collection of existing roles, the similarity metric of the role is the best Jaccard coefficient between the proposed role and a role that exists within the collection. That is, the Jaccard coefficient between the proposed role and every role in the collection is calculated and the best coefficient is used as the similarity.

To measure the similarity between the collection of proposed roles and the collection of existing roles, the similarity metric of each role in the proposed role collection is calculated and averaged.
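
A short Python sketch of the similarity measures as I read them: the per-role similarity is the best Jaccard coefficient against the existing roles, and the set-level similarity is the average of those values. The toy roles are my own example.

```python
def jaccard(a, b):
    """Jaccard coefficient of two permission sets: |intersection| / |union|, in [0, 1]."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def role_similarity(candidate, existing_roles):
    """Best Jaccard coefficient between a proposed role and any existing role."""
    return max(jaccard(candidate, r) for r in existing_roles)

def roleset_similarity(proposed_roles, existing_roles):
    """Average of the per-role similarities over the proposed role set."""
    return sum(role_similarity(r, existing_roles) for r in proposed_roles) / len(proposed_roles)

existing = [{"p1", "p2"}, {"p3", "p4", "p5"}]   # current configuration (toy)
proposed = [{"p1", "p2", "p3"}, {"p4", "p5"}]   # mined candidate roles (toy)
print(roleset_similarity(proposed, existing))    # value between 0 and 1
```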

The metric is combined with FastMiner, an iterative role mining process that looks at user permission assignment intersections. A greedy heuristic that chooses roles with the most coverage (user + permission assignments) and highest similarity, until all permission assignments are covered, is used.

Trade-offs between minimising only the number of roles and searching for only the most similar roles are also analysed.

There are lots of examples in the paper.

Tuesday, May 20, 2008

On Formalizing and Normalizing Role-Based Access Control Systems

@article{power08formalizing,
author = {David Power and Mark Slaymaker and Andrew Simpson},
title = {On Formalizing and Normalizing Role-Based Access Control Systems},
journal = {The Computer Journal},
year = {2008},
publisher = {Oxford University Press},
url = {http://comjnl.oxfordjournals.org/cgi/content/abstract/bxn016},
}

This paper presents a formal model, using Z, of core RBAC components as well as hierarchical RBAC and exclusive role constraints. Different types of inheritance are discussed, as well as how equivalence between RBAC systems can be defined using the model and how normalisation to produce simpler yet semantically equivalent RBAC systems can be performed.

The motivation for creating a new formalization is based on limitations of the ANSI standard for RBAC as well as of Li et al.'s model for RBAC [1]. Most of the work stems from Li et al.'s model, which is refactored using the Z schema language and then manipulated.

The first manipulation that is explored is normalisation, the process of redefining the structure of the system to a simpler form without removing any meaning. The process for core RBAC is as follows:

  1. Reduce infrastructure to a flat user-permission relation

  2. Assign each user a unique role

  3. Merge roles with identical permission sets into one role
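
The three steps translate almost directly into code. Below is a minimal Python sketch of core RBAC normalisation using plain dictionaries in place of the paper's Z schemas; the data layout is my assumption.

```python
def normalise_core_rbac(user_roles, role_perms):
    """Normalise a core RBAC state following the three steps described above."""
    # 1. Reduce the infrastructure to a flat user-permission relation.
    user_perms = {u: frozenset(p for r in roles for p in role_perms[r])
                  for u, roles in user_roles.items()}
    # 2. Assign each user a unique role carrying exactly their permissions ...
    # 3. ... then merge roles with identical permission sets into one role.
    new_role_perms, new_user_roles = {}, {}
    for user, perms in user_perms.items():
        role = "role_" + "_".join(sorted(perms))   # one role per distinct permission set
        new_role_perms[role] = set(perms)
        new_user_roles[user] = {role}
    return new_user_roles, new_role_perms

# Toy example: two structurally different states normalise to the same simple form.
user_roles = {"u1": {"r1", "r2"}, "u2": {"r3"}}
role_perms = {"r1": {"p1"}, "r2": {"p2"}, "r3": {"p1", "p2"}}
print(normalise_core_rbac(user_roles, role_perms))
```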


Role hierarchies are described (including role dominance, derived relationship, immediate predecessor and limited role hierarchy) and added to the schema. The normalisation for hierarchical RBAC is as follows:

  1. Reduce infrastructure to a flat user-permission relation

  2. Assign each user a unique role

  3. Merge roles with identical permission sets into one role

  4. Place two roles in an ordering iff one is an immediate predecessor of the other

  5. Remove redundant permissions

  6. Remove roles that have no permissions


Alternative normalisations without step 2 of the process and modifications to hierarchy construction are also proposed.

Exclusive role constraints are described and included in the normalisation. Exclusive role constraints are similar to separation of duty constraints, except they are motivated by least privilege rather than by enforcing separation of duties. Static mutually exclusive role (SMER) constraints define the number of roles from a set that a subject can be assigned. Dynamic mutually exclusive role (DMER) constraints define the number of roles from a set that a subject can activate in a session. Equivalences of constraints are compared as sets of roles in the absence of users. Alternative equivalences using permission constraints instead of role constraints, which are more compatible with the proposed models, are also discussed.
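
As an illustration, here is a small Python sketch of checking a SMER-style constraint: given a set of mutually exclusive roles and a cardinality t, a user may be assigned fewer than t roles from the set. The (roles, t) encoding follows the usual SMER formulation and is my assumption, not the paper's Z definition.

```python
def violates_smer(assigned_roles, smer_roles, t):
    """True if the user is assigned t or more roles from the mutually exclusive set."""
    return len(assigned_roles & smer_roles) >= t

smer = ({"purchaser", "approver", "auditor"}, 2)   # no user may hold 2+ of these roles
print(violates_smer({"purchaser", "clerk"}, *smer))       # False
print(violates_smer({"purchaser", "approver"}, *smer))    # True
# A DMER check would be analogous, but applied to the roles activated in a session
# rather than the roles assigned to the user.
```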

Finally, different types of inheritance are specified more clearly and discussed.

[1] Li, N., Byun, J.W. and Bertino, E. (2005) A Critique of the ANSI Standard on Role-Based Access Control. Technical Report CERIAS TR 2005-29, Department of Computer Science, Purdue University.

Tuesday, May 06, 2008

Mining Roles with Semantic Meaning

@inproceedings{molloy08semantic,
author = {Ian Molloy and Hong Chen and Tiancheng Li and Qihua Wang and Ninghui Li and Elisa Bertino and Seraphin Calo and Jorge Lobo},
title = {Mining Roles with Semantic Meanings},
booktitle = {SACMAT'08: Proceedings of the thirteenth ACM symposium on Access control models and technologies},
year = {2008},
month = {June},
address = {Estes Park, Colorado},
}

In this paper:
  1. What semantic analysis can be performed based on data availability/dimension.
  2. Using "Formal Concept Analysis", a hierarchical miner is developed.
  3. Performs role mining with user attributes as well as user permission information.
Data Dimension: user permission, user attribute (user's job title), permission parameter (database permission, read access), permission update (logs of how permissions have changed over time), permission usage (what users are using what permissions and when). What additional information can be offered with each extra dimension is discussed.

Formal Concept Analysis: applied to role mining, a formal context is a triple (G, M, I) where G is the set of users, M the set of permissions and I the relationship between users and permissions. A concept of the context (G, M, I) is a pair (X, Y) where Y is the set of all properties shared by all objects in X and X is the set of all objects that share all properties in Y. X is the extent and Y the intent. (X, Y) is a subconcept of (X', Y') iff X ⊆ X' (equivalently Y' ⊆ Y).

For example, {{u1, u2, u3}, {p1, p2, p3, p4, p5}} can be a concept, allowing concepts to represent roles. Each user is assigned exactly one role and each permission is assigned exactly one role. To reduce large concept lattices, weighted structural complexity is introduced. Weighted structural complexity gives different costs/weights to different components of RBAC (wr × number of roles, wu × number of user assignments, wp × number of permission assignments, and so on). The optimal RBAC state has minimal weighted structural complexity.
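
A rough Python sketch of both ideas on a toy context: brute-force enumeration of formal concepts (each a candidate role) and a simplified weighted structural complexity score. The enumeration strategy and the weights are illustrative assumptions only.

```python
from itertools import chain, combinations

upa = {"u1": {"p1", "p2"}, "u2": {"p1", "p2"}, "u3": {"p2", "p3"}}

def concepts(upa):
    """All (extent, intent) pairs: the users in the extent share exactly the intent."""
    users = list(upa)
    found = set()
    for subset in chain.from_iterable(combinations(users, n) for n in range(1, len(users) + 1)):
        intent = frozenset.intersection(*(frozenset(upa[u]) for u in subset))
        extent = frozenset(u for u in users if intent <= upa[u])
        found.add((extent, intent))
    return found

def weighted_structural_complexity(roles, user_assignments, perm_assignments, wr=1, wu=1, wp=1):
    """Simplified score: wr*|roles| + wu*|user assignments| + wp*|permission assignments|."""
    return wr * len(roles) + wu * len(user_assignments) + wp * len(perm_assignments)

for extent, intent in concepts(upa):
    print(set(extent), "->", set(intent))   # each concept is a candidate role

# Example score for a toy two-role state.
print(weighted_structural_complexity(["r1", "r2"], [("u1", "r1"), ("u2", "r1")], [("r1", "p1")]))
```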

Hierarchical Miner: a greedy algorithm that iterates over all possible roles and prunes roles if doing so reduces the cost of the RBAC state. The algorithm terminates when no more operations can be performed.

Attribute mining creates "attribute roles" using a candidate role set and user attributes to help describe roles. Each role/permission set can be turned into multiple attribute roles (a role can be given multiple attribute descriptions; each attribute description is an attribute role). Because of this, role-to-user assignments are based on an edge-to-complexity ratio metric.

Monday, May 05, 2008

Fast Exact and Heuristic Methods for Role Minimization Problems

@inproceedings{ene08fast,
author = {Alina Ene and William Horne and Nikola Milosavljevic and Prasad Rao and Robert Schreiber and Robert E. Tarjan},
title = {Fast Exact and Heuristic Methods for Role Minimization Problems},
booktitle = {SACMAT'08: Proceedings of the thirteenth ACM symposium on Access control models and technologies},
year = {2008},
month = {June},
address = {Estes Park, Colorado},
}

This paper addresses the RMP (Role Mining Problem) as a graph problem. To minimise the number of roles while ensuring no user-to-permission assignments are modified, user permission assignments are graphed and reduced. Role set solutions are then derived from the reduced graph. To address the issue of minimising both the number of roles and the number of edges, my graph optimisation approach was implemented, tested on their data sets and analysed. (Hoorah.)

The basic access control matrix is presented as an undirected bipartite graph G. Vertices are either in U (users) or P (permissions). It was noted that for a role, users and permissions assigned to the role induce a biclique in G. Since roles cover all permissions and users, the set of all roles R is a biclique cover of G. The role minimisation problem maps to finding the minimal biclique cover (MBC) of G, which is also NP-hard and known to be hard to approximate.

MBC is then reduced to minimum clique partition and chromatic number. A graph G' is created where the edges of G are the vertices of G'. An edge is present in G' iff the endpoints of the corresponding edges of G induce a biclique in G. A clique in G' corresponds to a biclique in G, and the clique cover number of G' corresponds to the biclique cover number of G. The biclique cover number of G is the chromatic number of G'.
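
A hedged Python sketch of the reduction as I understand it: the vertices of G' are the edges of G, and two of them are joined when their endpoints induce a biclique in G (they share an endpoint, or both crossed user-permission pairs are also edges of G). This rule and the toy data are my reading of the summary above, not code from the paper.

```python
from itertools import combinations

# Toy bipartite user-permission graph G as a set of (user, permission) edges.
G = {("u1", "p1"), ("u1", "p2"), ("u2", "p1"), ("u2", "p2"), ("u3", "p3")}

def induces_biclique(e1, e2, G):
    """True if the endpoints of edges e1 and e2 induce a biclique in G."""
    (u1, p1), (u2, p2) = e1, e2
    if u1 == u2 or p1 == p2:                 # shared endpoint: trivially a biclique
        return True
    return (u1, p2) in G and (u2, p1) in G   # both crossed pairs must also be edges

# Build G': one vertex per edge of G, adjacency by the biclique rule.
G_prime = {e: set() for e in G}
for e1, e2 in combinations(G, 2):
    if induces_biclique(e1, e2, G):
        G_prime[e1].add(e2)
        G_prime[e2].add(e1)

# Cliques in G' now correspond to bicliques (candidate roles) in G.
print({e: len(adj) for e, adj in G_prime.items()})
```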

Based on these proofs/reductions/mappings of problems, lower bound biclique cover algorithms are presented that can find the best clique covers when each biclique is a star (equivalent to finding the minimal number of roles). Rank adjacency is also used to estimate the biclique cover number. When the proposed iterative method for exact solutions for finding bicliques is too time consuming, greedy algorithms for finding bicliques are also proposed. Reducing the number of roles based on edge count is also used. Interesting results from applying the techniques to their test data are then discussed. Assignment of users to groups before assignment to roles can also be done to reduce the size of the graph.

Algorithms were tested on 7 data sets, none of which seem to have RBAC implemented. Exact details of the datasets, such as the number of users and number of permissions, were not disclosed (probably for privacy reasons). Results of the algorithms on the data are tabulated. One role that is found in their largest dataset is assigned to 2804 users and has 20 permissions. Four-fifths of the roles consist of a single user and all of their permissions. The heuristic approaches are quite fast and a combination of their approaches comes close to the exact calculations. Bounds for the heuristics are measured based on their test results.

They note that their results still have somewhat under-defined meaning for roles semantically. What do the roles mean, and why is the infrastructure good? Also, in some of their data sets, the number of roles is the same as the number of users or permissions.

Thursday, May 01, 2008

Approximation Algorithm vs Heuristic

Approximation algorithm: identifies approximate solutions to problems (most often NP-complete and NP-hard problems) within a certain bound of the optimum, or may provide optimality only in certain situations.

Heuristic: A computational method that uses trial and error methods to approximate a solution for computationally difficult problems. It does not aim to find the optimal solution, sacrificing optimality for improved runtime.

Optimal Boolean Matrix Decomposition: Application to Role Engineering

@inproceedings{lu08optimal,
author = {Haibing Lu and Jaideep Vaidya and Vijayalakshmi Atluri},
title = {Optimal Boolean Matrix Decomposition: Application to Role Engineering},
booktitle = {IEEE 24th International Conference on Data Engineering},
year = {2008},
month = {April},
address = {Cancun, Mexico},
}

This paper models boolean matrix decomposition through binary integer programming. The problem is mostly described in the context of role engineering and RMP, the Role Mining Problem. The same problems are described with the addition of edge-RMP. Edge-RMP sounds like the goal of my graph optimisation: a graphing approach that minimises the number of roles as well as the number of role assignments/edges. This paper re-describes RMP, min-noise RMP and δ-approximate RMP, now together with edge-RMP. The contribution of the paper is the transformation of the problems into sets of equalities and inequalities as models for binary integer programming. Different constraints are applied to the models to represent the different variations of RMP.

Once again, the access control matrix A is described as the composition B ⊗ C. That is, B and C are the decomposition of A. These are translated into binary integer programming models and heuristics for solving the problems are described.
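
To make the decomposition concrete, here is a small Python sketch that forms the boolean product B ⊗ C and checks that it reproduces A; the 0/1 list-of-lists encoding is my own assumption.

```python
def boolean_product(B, C):
    """Boolean matrix product: (B ⊗ C)[i][j] = OR over k of (B[i][k] AND C[k][j])."""
    rows, inner, cols = len(B), len(C), len(C[0])
    return [[int(any(B[i][k] and C[k][j] for k in range(inner))) for j in range(cols)]
            for i in range(rows)]

# Toy access control matrix A (users x permissions) and a candidate decomposition.
A = [[1, 1, 0],
     [1, 1, 1]]
B = [[1, 0],        # user-to-role assignments
     [1, 1]]
C = [[1, 1, 0],     # role-to-permission assignments
     [0, 0, 1]]

print(boolean_product(B, C) == A)   # True: B and C are a valid decomposition of A
```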

Basic RMP approach: use FastMiner to identify candidate roles and create a role-to-permission matrix based on the results. Prune the permission sets of users so that each permission set is unique and create the access control matrix. Use constraints (to ensure no addition/removal of users/permissions) to identify the user-to-permission assignment matrix.

Similar/slightly modified approaches are discussed for the edge-RMP and δ-approximate RMP and min-noise RMP. The approaches rely on FastMiner for producing good candidate roles.

The main contribution of the paper is the theoretical representation of the problem and how greedy heuristics can be used in the modelled binary integer problems when candidate roles are provided. Experimental results show good run times and accuracy for their generated test sets.

Tuesday, April 15, 2008

Another Internship?

My interviews over the past 6 months have landed me 2 internships. I have recently returned from my first where I had a fantastic time. During my internship, another offer was been made to me. This time, the temptation comes from Amazon.com in Seattle. I had a few interview screens before the offer came. The conversations were mostly technical but at the same time, I gained some insight into the functionalities and products Amazon offered. They don't only sell books. They are so much more! I was glad with the opportunity to work for such a well known and established company. I had worked with their Web services platform in the past but after my talks with various members of their different and diverse software development teams, I have decided I wish to work on their security and identity management team. This also ties in nicely with my current area of research. My last internship was in data mining and this one will be in security; perfect.

After this second offer, I withdrew my applications from the other companies that I had been in negotiations with. Two internships for a 3-4 year postgraduate research degree is quite sufficient, if not excessive according to Australian standards. I will have to put my research on hold again for another 3 months. I have set my start data at Amazon to the end of the year so I may have some time now to do some solid research. When I come back, I will tidy everything up and write everything up to create the final thesis.

Monday, April 14, 2008

Role Mining - Revealing Business Roles for Security Administration using Data Mining Technology

@inproceedings{kuhlmann03rolemining,
author = {Martin Kuhlmann and Dalia Shohat and Gerhard Schimpf},
title = {Role Mining - Revealing Business Roles for Security Administration using Data Mining Technology},
booktitle = {SACMAT '03: Proceedings of the eighth ACM symposium on Access control models and technologies},
year = {2003},
address = {Como, Italy},
publisher = {ACM Press},
}

This is the earliest work that discusses the application of data mining techniques to assist role engineering. The paper reads more as a case study of possible RBAC deployment within different organisations using SAM (Security Administration Manager), a software framework. How the data mining works is not explored in great detail. The contribution of this paper focuses more on the feasibility of applying data mining to role engineering. There is no discussion of which data mining techniques would be better, no analysis of their mining results and no explanation of what the data mining actually does/means. Their "data miner" is a black box that produces statistical and semantic information that is used to assist role definition.

One example case study was a bank that has 45 000+ employees, distributed across 1 400 branches, with 40 types of systems supporting 65 000 user ids and 47 000 user groups. I found it interesting that there are about 20 000 more user ids than there are employees. New users are assigned roles based on user attributes.

Other case studies show their technique was capable of finding existing SAM models from basic data. The models took 2 months to define manually and 2 hours to identify using data mining. (What about accuracy? Were other, incorrect models identified?) A cost evaluation was performed, stating potential cost savings of 60% during role creation and 50% during role maintenance, given the cost of manual analysis and some growth assumptions.

In their system, roles are separated into two categories: organisational roles and functional roles. Organisational roles define basic access privileges and functional roles describe access rights in relation to additional functions or tasks. Roles also contain attributes or rules that are true for all users who are assigned to the role.

Process:
The data mining techniques that are mentioned are association rules and clustering from the IBM Intelligent Miner for Data. Iterative role finding uses only a fixed set of users and assumes each user has only one account on each system. It uses user-to-system information, user attributes per system, the existence of groups or roles in each system, resource authorisations and global user information. It assumes all data used is correct (pre-processing removes incorrect data).

Clustering is performed on user attributes to obtain organisational roles. Association mining is performed to create group connections for organisational roles and functional roles.

Reports from clustering and association rules are used to create roles.

Issues: how is the subset of users on which to perform data mining selected? How does the data mining happen? What kind of clustering is used? What are the associations found over? It is not clear how the data mining is performed, as the IBM data miner was used as a black box and its results were used as is.

It's nice to see results of deployment in real businesses.

Monday, April 07, 2008

Back to PhD

I have returned from my internship in Sydney and am now back to research. I had a fantastic time. I personally think I had the best project to work on out of all the other interns. It tied in perfectly with my data mining interest. There was a slight administrative hurdle that we never got over (just around) but I completed the project and was even able to produce a demo for my final presentation. This would not have happened without the help and support of my mentor and development team. It was a fantastic learning experience and I'm glad I had the opportunity to be involved on such a fun project with such a cool team. I was even offered supportive feedback after the project, which was helpful and sincere.

In terms of infrastructure, the code base was well documented and maintained, with high visibility within the company. It was easy to access tools and find assistance. The focus felt like it was on creating functional and innovative products, making the big G an enjoyable environment to work in.

However, even after this experience, I am still uncertain about the direction I wish to follow after my PhD. Both academia and industry are equally appealing. Luckily I still have a couple of years to decide. In the meantime, it is back to my PhD and my research topic. It'll take a while to get back into things but I'm sure I'll get there in the end. I'm currently looking at the code that I was working on 6 months ago. I'm glad I wrote comments but I am still more confused than anything else.

The toughest thing about being away from the research is coming back to realise the new and recent developments have made some of my work invalid. That's a shame. I guess the next step is to try and catch up with where the research area has progressed to and carry on from there. Time to do a lot of reading!

Monday, December 10, 2007

Internship

After my interview trials, I have been successful in securing at least one position: between December 2007 and March 2008, I will intern as a Software Engineer at Google, Sydney. During my time here, I will work on a project related to Google Maps, but still data mining related, which is where my interests lie.

The internship means I will be taking a break from my PhD. Instead of reading papers, I will be writing code. Instead of writing papers, I will be writing code. This will be a change indeed. The coding that I have done in the past has all been for very standalone applications. Programming in the large has never been something I have been familiar with. I look forward to the opportunity to work on software that is part of a larger application.

Hopefully this experience will help me decide on what I would like to do after my PhD; would I like to remain in research or do I want to work in industry?

Internships are a big thing in the US. From what I hear, most PhD students work in companies over the summer. I'm not so sure about Australia though. There are some vacation work positions but they are more limited in number than in the US.

I think internships are a great way to see what life can be like after university so students can make a more informed decision about their future when they finish their degree. Seeing the practical side of what is taught at university can ground understanding of fundamentals and enhance skill development in the future.

I look forward to the opportunity of working for one of the biggest and best companies in the world. Hopefully I won't eat too much while I am here. It is difficult to say no to free food, especially for a PhD student.

Sunday, November 18, 2007

Interview Questions

Over the past 12 months, I have done several face-to-face as well as phone interviews for industry based internships. As someone who loves interviews and interesting problems, most of the interviews were quite enjoyable. Admittedly I feel more relaxed during non-technical interviews but technical ones are still fun because they present an opportunity to be exposed to possibly challenging problems that you may not have encountered before. Or they may even be the same problem but in an entirely different context.

Anyway, in no way alluding to which companies interviewed me, the following were a few of the fun mind challenges I was faced with:

Battleships:
  • If you were implementing a battle ship game, what data structures would you use to store information about the grid, information about the ships, information on the scoring?
  • Initially think about a grid that's 20x20 with 12 ships of different sizes. How would implementation change if the grid is 1000x1000 or 100000x100000 with 12 ships of different sizes? Now consider the larger grids with 100 ships or more of different sizes?
  • How do you determine when a player has won the game? How should you keep track of the scoring? Make sure the other player can't cheat! For example, if a player hits a particular position that contained a part of a ship, continuously hitting (spamming) that position causes him to win. This could happen if you had a counter for the number of hits and didn't keep track of distinct battleship hits.
Reverse:
  • Given a string, reverse it.
  • Given a number, reverse it.
  • Given a byte, reverse it.
  • Can you do it in place?
  • Can you make it more efficient?
  • How can you test your functionality?
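
For the byte variant, here is one possible answer sketched in Python, reversing the bits one at a time; strings and integers can be handled with similar loops.

```python
def reverse_byte(b):
    """Reverse the 8 bits of a byte, e.g. 0b00000110 -> 0b01100000."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (b & 1)   # shift result left, append the lowest bit of b
        b >>= 1
    return result

assert reverse_byte(0b00000110) == 0b01100000
assert reverse_byte(reverse_byte(0xA7)) == 0xA7    # reversing twice is the identity
```
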
Latex:
  • My interviewer noticed I used LaTeX to generate my resume. He asked how I felt about LaTeX and what I found annoying. I mentioned that straight double quotes (") did not turn out right if I just typed them; to get quotation marks at the correct angles, `` and '' need to be written. The problem comes from this. Given a tex file, go through and replace all the " characters with `` and '' as appropriate, assuming the text is well formed. That is, there is an even number of " characters in the text.
  • Can you do it in place and in O(n) where n is the number of characters in the text?
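
One possible single-pass answer, sketched in Python: walk the text once and replace straight double quotes alternately with `` and '', assuming the quotes are balanced. In a language with mutable buffers this can be done in place (after pre-counting the extra characters needed); here an output list stands in for that.

```python
def fix_tex_quotes(text):
    """Replace balanced straight quotes with LaTeX `` ... '' pairs in one O(n) pass."""
    out, opening = [], True
    for ch in text:
        if ch == '"':
            out.append("``" if opening else "''")
            opening = not opening
        else:
            out.append(ch)
    return "".join(out)

print(fix_tex_quotes('He said "hello" and "goodbye".'))
# He said ``hello'' and ``goodbye''.
```
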
Division:
  • Do some division without the division operator.
  • Why is recursion suboptimal?
  • What would be more efficient?
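
One common answer, sketched in Python: shift-and-subtract long division, which runs in time proportional to the number of bits rather than to the size of the quotient, unlike repeated subtraction or naive recursion. Non-negative integer operands are assumed for brevity.

```python
def divide(dividend, divisor):
    """Integer division via shift-and-subtract; assumes dividend >= 0 and divisor > 0."""
    quotient = 0
    for shift in range(dividend.bit_length(), -1, -1):
        if (divisor << shift) <= dividend:
            dividend -= divisor << shift   # subtract the largest shifted divisor that fits
            quotient |= 1 << shift         # set the corresponding bit of the quotient
    return quotient                        # the remaining dividend is the remainder

assert divide(37, 5) == 7
assert divide(1024, 32) == 32
```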

Thursday, November 15, 2007

Melbourne University Postgraduate Scholarships (local students)

Most people who do postgraduate research degrees at the University of Melbourne are on scholarships. More information can be found here. There are links there to apply online for the coming year. Deadlines are usually the 31st of October the previous year. The timeline for local students can be found here and the international one is not too difficult to locate. There are different yet similar scholarships for international students. This entry will mostly talk about local students.

Depending on how fantastic you are, your general area of research and your personal situation, there are many types of scholarships to apply for. Generally, the most common one is the APA - Australian Postgraduate Award. It's about $19 000 per annum. No guarantees, but if you get over an 85 average, you're in with a good chance. Over 90, a better chance. Less than those, not so good. If you miss out on those, there are the MRS, the Melbourne Research Scholarships. These are worth slightly less than the APA but are also about $19 000 per annum. People around the 85 mark who missed out on the APA might get an MRS.

You apply for the MRS and APA at the same time online. They prefer you to apply for enrolment as well, although I have heard of students who have received a scholarship but hadn't applied for the degree yet. Obviously they can't have the money until they enrol. I think they prefer/force you to do it so it's easier on the paperwork.

MRS and APA are university wide. In computer science, another main source of funding that I have had experience with is NICTA. In my year and the years before it, NICTA were quite generous with their scholarships (plentiful). The amount is also about $19 000 (maybe slightly less) if you missed out on an APA or MRS, and about an $8000 top up if you got an APA or an MRS. These work in conjunction with the MRS/APA. Applications for NICTA scholarships are separate, from what I remember. If you get an APA/MRS, they are happy to give you a top up because they get some IP rights (you sell your research soul, as some of my friends say). If you don't get an APA/MRS, they will give you just the normal $19 000 for your research IP. But starting last year, new NICTA recipients must also take some coursework. Not a lot, just maybe one subject a semester for each semester of your degree. NICTA is also monetarily generous with travel stipends; they cover a lot.

They are the common ones. The other ones are more dependent on your supervisor/project area. There are departmental scholarships and also APAI ones if you are interested in a project that a supervisor happens to have the funding for. These are usually made by private arrangement between the student and a supervisor, around a project that both parties are interested in and that has been allocated funding. It usually has more to do with the supervisor. There are industry based scholarships as well, and many different ways to arrange these. With my industry scholarship, the company contacted my supervisor because he was an expert in the area they wanted research in, and I was a student who was particularly interested in that area, so it was a fit. There's usually more money involved with APAI and industry scholarships, about $25 000 to $30 000 per annum. Industry ones usually don't sit well with NICTA, because both want IP and don't like to share. So for me, I chose industry and let my NICTA go. Depending on your industry partner, they may also provide some travel allowance.

Top ups: as I mentioned before with NICTA, they give you top ups if you have MRS/APA. CSIRO can also offer the same deal but once again, it's often very project/supervisor dependent. You usually have to have a collaborator at CSIRO who is interested. CSIRO also doesn't work well with NICTA or industry (IP again).

All research students are entitled to department and graduate school funding. I've written more information about these types of funding here and here.

All PhD scholarships are tax free. And tuition is not really an issue: there are a certain number of government places for research students and everyone I've ever known to do a research degree has been given a placement. So the learning is free and you have enough to live on.

Tuesday, October 09, 2007

APAC07 - APAC Conference and Exhibition

Advanced Computing, Grid Applications and eResearch
8-12 October 2007 - Rendezvous Observation City Hotel, Perth, Western Australia


Opening Ceremony

John O'Callaghan, Chair, talked about the origins of APAC and introduced the addresses. John Zillman, APAC board, welcomed everyone to Perth, bantered about the weather and spoke of the past, present and future of APAC. Francis Logan discussed the future of supercomputing. Finally, Mal Bryce talked about the history of the Internet. After proclaiming that he was the "oldest rat in the barn", he reminisced about past technologies that he personally lived through. Mal described iVEC and APAC as some sophisticated computer concepts: big pipes, big grunt, big warehouses and a great deal of people development.

The first keynote was Thom Dunning, who had also presented on the previous day at the Student Forum. Similar topics (in truncated form) were covered. The stress was on future strategic directions of supercomputing and data intensive computing at the petascale. The examples used were mainly Blue Waters and the synoptic survey telescope.

The second keynote was Mike Netzband. He spoke about high performance computing at Chevron: the challenges of HPC, innovations and frontier projects. From an HPC point of view, Chevron's most costly processes are seismic imaging and processing, and reservoir modelling. Processing and storage started with IBM mainframes and punch cards in 1982, moved to supercomputers, and is now done on Linux desktops and other mixed environments with multiple tiers of vendors for equipment, with dual and quad core clustering. Challenges that never seem to go away are expandability, and improving the system is a constant process. Innovations need to be adaptive to constant change and effective with a clear purpose. Current targets include collaboration with third party partnerships and more investigation into grid computing and alternative high performance processing.

Overall

The most obvious and distinctive aspect of this conference was the large number of industry sponsors. While I knew there would be some key industry representatives present, I was not prepared for the large number of displays, tables and marketing stands at the venue. Lunches were sponsored by Intel, IBM had a strong presence, SGI were giving away USBs, caps as well as ice cream, while ISA seemed to be everywhere. While the first three I had heard about (big names), the last I wasn't so sure. These were the gold sponsors and after lunch, there was a whole session dedicated to presentations from the gold sponsors. Of course there were other sponsors; the most memorable for me were Cray and Sun Microsystems. After some discussions with other attendees, it was suggested that APAC conference attendees come from the organisations that account for the majority of the big buyers of large processing power. So it makes sense that many of the suppliers of large processing power would want to be here promoting their hardware; it is mostly hardware on display/advertised. We decided the organisers have done very well. Marketers get what they want (advertising to their target demographic) and conference attendees get what they want (sharing of work/research, more information on product availability that may help their work/research, and free ice cream, golf tees and USB keys).

Other than the marketing from sponsors, the conference ran as per the norm: parallel sessions, pre and post conference workshops, presentations of work, promotion of collaboration... It was very interesting to see what other people were doing in industry with the techniques developed in research. Even before the conference started, there were iVEC and AARC open houses where we were shown demonstrations of some existing projects and HPC applications. It's a very good conference for large scale applications, processing of large amounts of data and distributed management of information and resources in industry.

Monday, October 08, 2007

APAC07 Student Forum

8 October 2007 - Rendezvous Observation City Hotel, Perth, Western Australia

Keynote: Thom Dunning - NCSA Director

Thom's background was originally in chemistry and he is now the director of NCSA (the National Center for Supercomputing Applications). For his keynote, he talked about many of the projects that are currently being worked on at NCSA. Many of the applications had chemistry or biology backgrounds, but the need for management, transportation and manipulation of the data has created the need for high powered processing and computing in these environments. Many interesting applications were discussed:

  • modelling of infectious diseases and the initial challenge of identifying viruses
  • modelling earthquakes (MAEviz), predicting when they will occur and efficient recovery
  • telescope imagery using LSST (large synoptic survey telescope) - 3 Gigapixel camera, transportation and data processing of images for identification/detection of interesting events (e.g. supernovas, meteors coming our way)
  • automated genome comparison and matching. Manual matching can take up to 8 months for each match.
  • tornado simulation

Sessions: student presentations

After the keynote, 13 students from around the country presented their research. Presentations ranged from eScience research topics in biology and chemistry to load balancing and large scale P2P optimisation. I presented my work on role engineering for role based access control.

Information about the forum can be found here. I am uncertain how persistent this link will be. The main purpose was to bring students from around Australia working in related areas together to network.

Monday, September 17, 2007

The Role Mining Problem: Finding a Minimal Descriptive Set of Roles

@inproceedings{vaidya07rmp,
author = {Jaideep Vaidya and Vijayalakshmi Atluri and Qi Guo},
title = {The Role Mining Problem: Finding a Minimal Descriptive Set of Roles},
booktitle = {SACMAT '07: Proceedings of the twelfth ACM symposium on Access control models and technologies},
year = {2007},
address = {Sophia Antipolis, France},
publisher = {ACM Press},
}

This paper formally describes the aim of role engineering with data mining through RMP, the role mining problem as well as two variations of the problem: δ-approx RMP and the minimal noise RMP. All three problems are shown to be NP-complete.

The RMP is defined as the problem of finding the optimal set of roles from existing user permissions. The paper defines the optimal/good set of roles as the minimal set of roles that represents the initial access control matrix. Like me, they define the discovery of roles as a decomposition of the access control matrix (A = B ⊗ C).

To measure matrices, difference metrics based on matrix norms are employed; that is, a count of the differences between binary matrix values. In RMP, the composition of the user-role matrix and the role-permission matrix should be exactly the same as the access control matrix; that is, the decomposition of the access control matrix should not grant extra permissions or remove existing ones. In δ-approx RMP, the access control matrix resulting from the mining is allowed to differ from the original access control matrix by δ; the mining is bound by δ. In minimal noise RMP, δ is minimised while ensuring the number of roles does not exceed a certain threshold; the mining is bound by the number of roles.
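
A short Python sketch of the difference metric as described: count the entries where the original access control matrix and the reconstruction B ⊗ C disagree. The 0/1 list encoding and the toy matrices are my own assumptions.

```python
def boolean_product(B, C):
    """(B ⊗ C)[i][j] = 1 if any role k has B[i][k] = C[k][j] = 1."""
    return [[int(any(B[i][k] and C[k][j] for k in range(len(C))))
             for j in range(len(C[0]))] for i in range(len(B))]

def delta(A, B, C):
    """Number of user-permission entries where A and B ⊗ C differ (the 'noise')."""
    BC = boolean_product(B, C)
    return sum(A[i][j] != BC[i][j] for i in range(len(A)) for j in range(len(A[0])))

A = [[1, 1, 0], [0, 1, 1]]
B = [[1], [1]]          # one role assigned to both users
C = [[0, 1, 0]]         # the role grants only the shared permission
print(delta(A, B, C))   # 2: exact RMP requires delta = 0, δ-approx RMP allows delta <= δ
```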

The RMP and its variants are then mapped to the set basis problem and shown to be NP-complete. The RMP problem and its variants are then mapped onto existing problems (minimum tiling in databases and the discrete basis problem) and analysed. A solution for the minimum tiling problem is a greedy approximation algorithm that finds the largest tiles first. This is synonymous with finding roles of the largest size first. Semantically, this is probably not quite right for RBAC. This approximation is bound to within O(log mn) of optimal, where m is the number of users and n is the number of permissions. The minimal noise RMP is mapped to the discrete basis problem and the basis usage problem. The discrete basis problem has been shown to have no constant factor approximation in polynomial time. The discrete basis problem is synonymous with the optimal assignment of minimal noise roles to users in the minimal noise RMP context.

Thursday, August 09, 2007

Role Engineering using Graph Optimisation

@inproceedings{zhang07graph,
author = {Dana Zhang and Ramamohanarao Kotagiri and Tim Ebringer},
title = {Role Engineering using Graph Optimisation},
booktitle = {SACMAT '07: Proceedings of the twelfth ACM symposium on Access control models and technologies},
year = {2007},
address = {Sophia Antipolis, France},
publisher = {ACM Press},
}

Role engineering is the definition of roles for role based access control. Initial approaches used elicitation of job functionalities and business requirements for role creation. Due to the costly and time consuming nature of the manual analysis, more recent approaches have moved to automated extraction. While most automated approaches use data mining techniques, this paper explores the optimal decomposition of the access control matrix through graphing techniques.

All user permission assignments can be represented as an access control matrix. Role based access control can be described as the decomposition of the access control matrix into a user-role matrix and a role-permission matrix. That is, A = B ⊗ C, where A is the access control matrix, B is the user-to-role assignment matrix and C is the role-to-permission assignment matrix. Many decompositions exist; the challenge comes from producing the optimal user-role and role-permission matrices. Optimality is dependent on the given metrics.

In this paper, the problem is described as a matrix decomposition problem and the solution is produced by specifying metrics that reduce the user-role and role-permission relationships (synonymous with a reduction of the administration requirements on user permission management) and reduce the number of roles (synonymous with a reduction in the administration requirements of roles).

The problem can easily be represented as a graph, and the optimisation process is a series of graphing operations with the aim of reducing the number of nodes and edges in the graph (or the number of roles and role relationships respectively).

The algorithm was tested on user permission assignments within a public domain to produce role based access control infrastructures that offer improved access control administration for the system. The test set used was of small to medium size and the problem of local minima has not yet been addressed.

Friday, June 22, 2007

SACMAT Discussion Panels

22 June 2007 - Sophia Country Club, Sophia Antipolis, France

Panel Discussion - I: Access Control for Assured Information Sharing

Solutions for access control for assured information systems were discussed as a panel session. In general, it was agreed that a variety of solutions need to be provided for sharing needs. Considerations include:

  • Trust
  • Ownership - how this issue is dealt with
  • Responsibility - to share the knowledge and protect everyone


This aspect of research has become especially important after 9/11. We potentially had all the information, spread across different departments, to prevent, respond to or reduce the severity of the attacks. But the information was restricted and not shared between systems.

The solution also needs to be adaptive - generalised event based management.

Some questions and issues raised during the panel:

  • Why can't you have 1 solution for different scenarios, does it have to be a case by case basis?
  • Enforcing sharing obligations infringes decision making
  • Inherently, people do not trust computers
  • DAC - restrict access as much as possible.
  • Selective data sharing - share when you can get credit


Panel Discussion - II: Directions for Access Control and Policy Management

The area that I paid the most attention to was role engineering. There is a lot of interest in this area, particularly in industry. The main issue for role engineering is the definition of a structure that is correct and that is good. Measuring correctness is simple; measuring goodness is more difficult. On the day, it was agreed that generally you should have fewer roles than users; otherwise the infrastructure is useless and it would be more optimal to assign permissions to users directly. However, the presence of abstract roles, that is, roles that are not assigned to any users, was also discussed. Are they still useful? They may assist the design of the infrastructure in hierarchical RBAC. In retrospect, if abstract roles exist, it may be acceptable for the number of roles to be larger than the number of users.

SACMAT Keynote

20 June 2007 - Sophia Country Club, Sophia Antipolis, France

Keynote: Jorge Cuellar - Siemens Corporate Technology, Munich, Germany
Thoughts on Application Layer Access Control


In this keynote, Jorge presented some formal methods representation for the application layer of access control. He motivated the need for formal representation through various research initiatives currently in progress at Siemens, Germany:

  • eHealth Care - ensure confidentiality of patient records
  • Planes - distrusted software
  • Citizen's portal - card to authenticate, management of access to different software.

He also discussed the rationale for the need for security using UML system diagrams.

eTransactions were also discussed. Current security for eTransactions is not enough for eCommerce needs; SSL was not made for eTransactions. What kind of access control do we need? Whatever is chosen, ease of implementation is an important consideration. The hidden logic needs to consider confidentiality, integrity and atomicity.

Monday, May 28, 2007

Closet: an efficient algorithm for mining frequent closed itemsets

@inproceedings{pei00closet,
author = {Jian Pei and Jiawei Han and Runying Mao},
title = {Closet: an efficient algorithm for mining frequent closed itemsets},
booktitle = {ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000)},
year = {2000},
}

Mining frequent closed itemsets has the same power as mining the complete set of frequent itemsets. This reduces the number of redundant rules generated and increases both the efficiency and effectiveness of mining. If frequent pattern mining is used to generate roles, this effectively means the generation of redundant roles is reduced.

The idea behind this approach is to use conditional databases in a divide and conquer fashion. Given a list of all frequent items above min_sup, find the conditional database of each item in reverse order of support. Closed itemsets can be extracted iteratively from these conditional databases, and each conditional database may be further divided into more conditional databases.
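
For intuition, here is a brute-force Python sketch of mining frequent closed itemsets from a toy transaction database; CLOSET itself works recursively on FP-tree conditional databases, which this deliberately does not reproduce.

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"b", "c"}]
min_sup = 2

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Enumerate all frequent itemsets, then keep the closed ones: those with no
# proper superset having the same support.
items = sorted(set().union(*transactions))
frequent = [frozenset(c) for n in range(1, len(items) + 1)
            for c in combinations(items, n) if support(frozenset(c)) >= min_sup]
closed = [s for s in frequent
          if not any(s < t and support(s) == support(t) for t in frequent)]

print(len(frequent), len(closed))   # the closed itemsets are a compressed summary
print([set(s) for s in closed])
```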

The proposed approach extends the FP-growth tree approach for pattern discovery. The implementation requires an FP-tree and pruning for increased efficiency. The comparison with A-Close, which uses Apriori to identify closed itemsets, is basically a comparison between the FP-tree and Apriori.

The paper is not the easiest to read. It is filled with more theory and proofs than needed for a comprehensive understanding of the algorithm.

Thursday, May 17, 2007

Java Swing

This week I am working with Java GUIs. This will tie in with prototypes and applications that other people have built. While my code does not need to fit in nicely or interact with a GUI at all, it might be nice to take some time and look at the Swing aspect of Java.

I do teach Java, so learning something extra about it won't hurt. That's the best thing about a PhD, I think: the ability to take a week and go through Java tutorials. Or any tutorials.

Thursday, May 10, 2007

Role Mining with Orca

@inproceedings{schlegelmilch05orca,
author = {Jurgen Schlegelmilch and Ulrike Steffens},
title = {Role Mining with ORCA},
booktitle = {SACMAT '05: Proceedings of the tenth ACM symposium on Access control models and technologies},
year = {2005},
address = {Stockholm, Sweden},
publisher = {ACM Press},
}

This paper proposes ORCA, a Java visualisation tool that performs hierarchical clustering on permission assignments for the definition of role concepts/role engineering. User interaction can add information during hierarchy construction to assist clustering.

Mentioned caveats in role mining:
  1. Noise in data: anomalies must be cleansed/removed
  2. Multi-role systems: systems support multiple roles
  3. Multi-role users: users can be assigned multiple roles
  4. Multiple identities per user
  5. No semantics in data mined roles
These issues are mentioned but not solved (with the exception of 3) by ORCA.

Method ideology: place permissions in a cluster if a significant number of users have them. Each cluster has a set of permissions and a set of users assigned to the cluster.

Technique (a minimal sketch of this merge loop follows the list):
  1. Each permission starts as its own permission set cluster in C.
  2. Identify the pair of clusters from C with the maximum intersection of user assignments and the maximum union of permissions.
  3. Create a role from the new permission set union. If more than one such pair exists, randomly pick one.
  4. Remove the original pair of clusters from C so they can no longer be selected to create new roles.
  5. Add the new role to C as a super cluster of the previous clusters.
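Here is a simplified sketch of the idea (not the actual ORCA implementation; only the user-intersection part of the pair-selection criterion is modelled, and the toy assignments and names are my own). Each cluster holds a permission set plus the users who have all of those permissions, and the pair with the largest user overlap is merged each round.

import java.util.*;

// Simplified ORCA-style agglomerative clustering of permissions into candidate roles.
public class OrcaStyleClustering {

    record Cluster(Set<String> permissions, Set<String> users) {}

    public static void main(String[] args) {
        // Toy user-to-permission assignments.
        Map<String, Set<String>> userPerms = Map.of(
                "alice", Set.of("p1", "p2"),
                "bob",   Set.of("p1", "p2", "p3"),
                "carol", Set.of("p3"));

        // 1. Each permission starts as its own cluster, with the users who hold it.
        List<Cluster> clusters = new ArrayList<>();
        Set<String> allPerms = new TreeSet<>();
        userPerms.values().forEach(allPerms::addAll);
        for (String p : allPerms) {
            Set<String> users = new TreeSet<>();
            userPerms.forEach((u, ps) -> { if (ps.contains(p)) users.add(u); });
            clusters.add(new Cluster(new TreeSet<>(Set.of(p)), users));
        }

        // 2-5. Repeatedly merge the pair of clusters with the largest user overlap.
        while (clusters.size() > 1) {
            int bestI = -1, bestJ = -1, bestOverlap = -1;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    Set<String> common = new HashSet<>(clusters.get(i).users());
                    common.retainAll(clusters.get(j).users());
                    if (common.size() > bestOverlap) {
                        bestOverlap = common.size(); bestI = i; bestJ = j;
                    }
                }
            if (bestOverlap <= 0) break; // no shared users left: stop merging

            Cluster a = clusters.get(bestI), b = clusters.get(bestJ);
            Set<String> mergedPerms = new TreeSet<>(a.permissions());
            mergedPerms.addAll(b.permissions());
            Set<String> mergedUsers = new TreeSet<>(a.users());
            mergedUsers.retainAll(b.users());

            // Remove the original pair and add the super cluster (a candidate role).
            clusters.remove(bestJ);
            clusters.remove(bestI);
            clusters.add(new Cluster(mergedPerms, mergedUsers));
            System.out.println("candidate role: " + mergedPerms + " users=" + mergedUsers);
        }
    }
}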
Problems in approach:
  • each permission can only belong to one path of the role hierarchy; modifications to relax this constraint can produce inconsistencies and a large number of additional, unneeded roles
  • time consuming: all cluster pair intersections must be generated at each iteration
  • choosing one pair can remove the possibility of other pairings, making ordering important
  • when more than one pair has merging potential (more than one maximal pair for new cluster creation), one is chosen at random
Novelty:
  • At the time, it was the first role mining approach that did not use a generic set of tools developed for general purpose data mining; it was role engineering specific. It applied hierarchical clustering to permissions to create roles. While no new data mining contribution was made, it was a new application of data mining to enterprise security.
  • The visualisation tool sounds pretty, describing what is being done at each step and what is in each cluster, with different colour intensities used for clusters with more user assignments. It sounds pretty but also confusing to use in situations with many entities. Some learning is required to understand what the data mining does and what it means in the context of the different possible clusters. Some options for finding different correlations are present, such as highlighting clusters whose users fit certain criteria. Less information is given on exactly how this information can be used to create roles.
  • The visualisation also allows for the concept of neighbouring clusters, where clusters with similar permissions are placed next to each other for viewing in ORCA. Similarity between clusters is measured by the proportion of permissions that are the same between two clusters.

Thursday, May 03, 2007

Poster Day

Every year, all postgrad researchers and early-career research staff are encouraged to present a poster for Poster Day. Poster afternoon was on the 27th April and I returned to research on the 23rd April. The final copy was due while I was away, so I submitted a copy of what I wanted printed and laminated before I left. It was a nice session for all the postgrad research students and some members of the research staff to display their work and have discussions. I got some interesting comments, especially about how access control can be ensured in P2P gaming environments. I will have a play with a P2P gaming technology to see what I think. I haven't had much experience with it.

Conference Notes

There were some interesting comments raised by other researchers at ICDE in relation to my work. The data mining community is generally interested in what model of data mining I can produce. It is an interesting perspective from a data mining view as opposed to a security applications one. After learning about my problem, they wanted to know the novelty of my approach, that is, how it benefits the data mining community. I think their general view is that data mining benefits everyone, so people should be working towards creating interesting novel approaches for new types of data mining; applying mining techniques to some problem will be easy after the idea is developed. I guess my next step is to show how my approach is novel and to modify it to suit access control data specifically. Special data mining for RBAC...

ICDE 2007 Keynotes


IEEE 23rd International Conference on Data Engineering
17-19 April 2007 - The Marmara Hotel, Istanbul, Turkey


Keynote: Yannis Ioannidis

On the first day, Yannis Ioannidis gave a presentation on Emerging Open Agoras of Data and Information. He spoke about competitive markets, sharing of information and the motivation behind allowing the dissemination of information between parties through bargaining or other types of negotiation. I think the use of the term "Agora" was particularly appropriate given our location (Istanbul). At the end of the presentation, he was asked to compare an Agora model with that of a Mall (US shopping center). Fun times.

Keynote: Ricardo Baeza-Yates

On the second day, Ricardo Baeza-Yates presented Challenges in Distributed Web Retrieval and discussed the issues of increasing Web data and using distributed machines to cope with scalability.

Keynote: Laura M. Haas

On the final day, Laura M. Haas spoke about Information for People, a presentation targeted at ICDE researchers to consider the HCI component of research development. I think this was the presentation I remember most vividly. She gave visual presentations and examples of algorithm researchers who added GUI components to their research for representation of results. This visually helped anyone who was unfamiliar with the data to make clearer judgements. For example, instead of just looking at the numbers, standard deviations and means, plot them and allow dynamic modification of the graph to analyse trends in numerical data better. This of course was done in colour. Pretty. And she was right, it did help the comprehension of what was going on a lot.

However, I'm not sure how well it was received on my side of the room. There was some general consensus that while it's a nice idea to make our algorithms "pretty", the reason why HCI and Data Management are two distinct areas is because they are two completely distinct areas. I don't think there's another way to put it. The best example I have is the OSI model: you're trying to merge the physical layer with the application layer. It's not quite that extreme, but it's close. At ICDE, there were sessions and industry reports on more efficient caching techniques. How do we make that more GUI/user friendly? I think at this point in research there is still a strong distinction between these things, and there needs to be, otherwise the scope you start dealing with is just enormous. The point is abstraction and working with manageable portions. It's a nice idea, but I'm not sure how practical it would be.

Sunday, April 15, 2007

DMBI 2007 Keynote


Workshop on Data Mining and Business Intelligence
15 April 2007 - The Marmara Hotel, Istanbul, Turkey


Keynote: Jiawei Han - University of Illinois at Urbana-Champaign
Research Frontiers in Advanced Data Mining and Business Intelligence


The purpose of this keynote address was to give a general overview of current research areas in Data Mining. Technical detail was not covered. Instead, the talk highlighted the challenges and directions of existing data mining related topics that Jiawei has personal experience with.

Pattern mining was the first topic discussed. Frequent pattern mining was the first approach, with Apriori, then FP-growth, Eclat and so on. Closed itemset mining developed from this due to the large number of patterns that can be generated by frequent pattern approaches; FPClose, Charm and max pattern mining were mentioned. Also mentioned were correlation mining (PageRank from Google), compression of patterns and more compact representations of large result sets.

Information network analysis was also discussed. Graph mining and mining data with links or cross-relational mining fall into this category. One recent approach developed by one of Jiawei's students was based on the disambiguation of different people with the same name. In DBLP, there are 14 distinct people all with the name Wei Wang. All of them are in computer science and a large proportion are in data mining. It is difficult to determine which Wei Wang is being referred to just by looking at publication type. The proposed solution is to analyse co-author data. Based on this, fairly successful classifications were made: 13 distinct people were identified (one was misclassified as someone else) and there were 2-3 paper misclassifications for the other 13 people.

Another area of data mining is stream data mining such as internet network traffic. The aim of stream data mining is prompt classification such as in network intrusion detection systems. One recent work is the prompt update of stream data cubes based on statistical analysis.

Mining moving object data is also interesting. While the focus is not yet on prompt classification, analysis of moving ships, cars and other objects can identify anomalies or predict future directions. For example, analysis of ship movements can potentially identify anomalies that could be terrorist vehicles or illegal shipping operations. Analysis of hurricane trajectories can predict movement, and studies of seasonal animal movement can be done before construction of major highways to cause the least disturbance to wildlife.

Spatial, temporal and multimedia data mining need to consider obstacles before correct classification can be performed.

Text and Web mining consider links between pages and keywords. It is important to extract the correct information; there is a lot of information available. One way to extract important information is to identify key structural components. Graphics algorithms can be used to identify sections of useful information as indicators for the actual information.

Data mining system and software engineering deals with the clustering and classification of software bugs, where they are detected and where the location of the bug actually resides.

Finally, data cube oriented and multi-dimensional online analytical processing was discussed and separated into four areas: regression cubes, prediction cubes, integration cubes and ranking query processing, and high dimensional OLAP.

Thursday, April 05, 2007

Preparation for Turkey

This week, I prepared slides and gave a department/postgrad seminar on the work that I plan to present at SEIW (declarative monitoring of declarative Web services). This presentation, while being good preparation, was also for fulfilment of departmental funding requirements.

After the presentation, I received interesting questions relating mostly to the performance impact of having a monitoring tool. The extra data collection will increase the response times of service requests, as well as the quantity of returned data. But with WSOL and WSOI, the returned information is piggybacked onto SOAP headers, so the additional data returned is minimal, and the additional time requirements can be minimised if measurement modules are placed within the client or server (as opposed to a third party). The department presentation attracted related researchers from both Swinburne and RMIT. While I'm not working in this area as much anymore, it was nice to bring people with similar interests together.

Researchers from Monash and RMIT who were interested in this area of research attended the presentation.

Paper reference:
Kerry Taylor, Paul Brebner, Michael Kearney, Dana Zhang, Kelly Lam and Vladimir Tosic. Towards Declarative Monitoring of Declarative Service Compositions. In Proceedings of The 2nd International Workshop on Services Engineering (SEIW07) in conjunction with 23rd International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey, April 2007.

Thursday, March 29, 2007

First PhD Paper

Last week, I received notification that my paper to SACMAT07 has been accepted as a short paper. My industry sponsor should be able to fund me for this travel as it is directly related to my PhD research; some money has been set aside for me every year for travel purposes by my industry sponsor. SACMAT have emailed the paper requirements, so I will try to fix the paper asap. The latex template is as expected and they have given advice and information on terms, categories and crdata. I think it is due on the 15th, but I will try to get it out of the way this weekend to distribute to relevant parties and allow time for fixes before I leave. I have shortened the paper and addressed most of the issues, but there is still a page of writing to delete before it meets the page limit requirements. Deleting words that you have put thought into writing is the most difficult.

Friday, March 23, 2007

Official PhD Student

This week I received notification of PhD confirmation.

The application for conversion from Masters to PhD stated a 6-8 week processing time. I needed to be confirmed before my date of travel, which was 11th April. Since I was not ready to convert before my leave in February, I had to do it immediately after my return to allow for the 6-8 week processing. However, I found out on Wednesday through the School of Graduate Studies that I have been confirmed. The confirmation process requires a departmental seminar, a conversion report, a progress report and a meeting with the student's committee (along with all the relevant forms). I fulfilled all the criteria and handed in the forms on the 27th of February, and on Wednesday I went to SGS to discover that I had been confirmed on their records. I was pleasantly surprised that in less than 3 weeks I had been confirmed and registered officially as a PhD student.

Synthetic Datasets vs Real Data

In research, one part of a new approach is to justify it mathematically. Another part is to experiment to show that it matches your predictions.

For my previous algorithms, I've mostly been testing with the /etc/passwd and /etc/group files located on the department unix systems. After testing on this data, the only real way to show the usefulness of our results is via manual analysis. On other industry data (which I will get to eventually), this is also the only form of analysis if the enterprise has no original RBAC implementation.
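For reference, pulling the user-to-group assignments out of /etc/group is straightforward. A rough Java sketch (the field layout is the standard name:passwd:gid:member-list format; everything else is illustrative):

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Build a user -> groups map from /etc/group for manual analysis.
public class EtcGroupParser {
    public static void main(String[] args) throws IOException {
        Map<String, Set<String>> userGroups = new TreeMap<>();
        for (String line : Files.readAllLines(Path.of("/etc/group"))) {
            if (line.isBlank() || line.startsWith("#")) continue;
            String[] fields = line.split(":", -1);      // name:passwd:gid:members
            if (fields.length < 4 || fields[3].isEmpty()) continue;
            for (String user : fields[3].split(","))
                userGroups.computeIfAbsent(user, k -> new TreeSet<>()).add(fields[0]);
        }
        userGroups.forEach((u, gs) -> System.out.println(u + " -> " + gs));
    }
}

Note that this only captures supplementary group memberships; each user's primary group sits in /etc/passwd.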

For simulated data, I have completed the synthetic data generator that creates flat role hierarchies given the number of users, permissions and roles and the maximum number of roles per user and permissions per role. Testing our frequent pattern approach exposed one of the issues with FPRM that I partly knew about. However, I haven't added noise or hierarchical layers to the data for additional testing analysis. The test data confirmed my suspicion that an enormous number of candidate roles are generated if each user is assigned many permissions, with each set of permissions similar to another user's but not exactly the same. I need to analyse closed/maximal algorithms. I have not yet used the synthetic data on my graph optimisation approach.

Ideally, the best data to work with would be access control permissions from an enterprise that has RBAC implemented. That way, we could compare our results with those of the actual implementation and contrast the differences.

Thursday, March 15, 2007

LDIF Files

While coding the synthetic data generator, I have also started analysing LDIF Active Directory export files from various companies. Access to these files is granted by my industry sponsors as helpful resources for my PhD.

In analysing the LDIF files, I am having some difficulty loading them into JXplorer. The error I get is javax.naming.CommunicationsException: connection closed [Root exception is java.io.IOException: connection closed] remaining name 'DC=ca, DC=com'. I think one of my co-workers suggested that it was because some of the objects were missing object classes. O_O. I guess there are also schema issues, because each enterprise has its own AD schema and exporting the file simply extracts the data, not the schema. But you'd think it would be able to work out a general idea of the tree.

I will keep working on how to load it into the directory on my VM if I can. Otherwise, I can manually (via a perl/python script) extract the user/group information that I need. There are perl scripts and modules that allow you to directly connect to an Active Directory or an X.500 directory, but they will not install properly on the university machines (I have limited privileges) and refuse to do so on my own Unix environment. I have successfully got them working on my Windows machine, but I have not had a play with them yet. It might just end up being easier to extract information from the files manually with a parser.
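If it does come down to a hand-rolled parser, something along these lines would probably do for a first pass. This is a rough Java sketch: the file name and the choice of dn/objectClass/member as the interesting attributes are my assumptions, and LDIF continuation lines and base64-encoded values are not handled.

import java.io.*;
import java.util.*;

// First-pass LDIF extractor: collects the member lists of entries with objectClass: group.
public class LdifExtract {
    public static void main(String[] args) throws IOException {
        Map<String, List<String>> groupMembers = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("export.ldif"))) {
            String line, dn = null;
            boolean isGroup = false;
            List<String> members = new ArrayList<>();
            while ((line = in.readLine()) != null) {
                if (line.isBlank()) {                       // a blank line ends an entry
                    if (isGroup && dn != null) groupMembers.put(dn, members);
                    dn = null; isGroup = false; members = new ArrayList<>();
                } else if (line.startsWith("dn: ")) {
                    dn = line.substring(4);
                } else if (line.equalsIgnoreCase("objectClass: group")) {
                    isGroup = true;
                } else if (line.startsWith("member: ")) {
                    members.add(line.substring(8));
                }
            }
            if (isGroup && dn != null) groupMembers.put(dn, members);  // last entry
        }
        groupMembers.forEach((g, ms) -> System.out.println(g + " -> " + ms));
    }
}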

Thursday, March 08, 2007

Synthetic Test Data Generation

This week I have mostly been working on the synthetic data generation test suite. The purpose of the data is to create RBAC infrastructures, extract only the user-permission assignments from them and mine back the original roles.

While I'm not entirely convinced of the usefulness of this (do skewed, random roles represent real enterprises?), the research community seems to think it's useful. So the plan for my algorithm is as follows:

Inputs:
  1. number of users
  2. number of roles
  3. number of permissions
  4. average number of roles per user
  5. standard deviation of roles per user
  6. average number of permissions per role
  7. standard deviation of permissions per role


Assigning roles to users and permissions to roles, then extracting the fundamental user-permission relationships, should not be too difficult. The only issue is how realistic our model is. What kind of distribution should our model take? To start with, the number of roles per user and permissions per role are normally (Gaussian) distributed. (Most things are normal, right?) After the number of roles and permissions is picked for each user and role respectively, that many roles and permissions are selected, again according to a normal distribution given the mean and std. That is, which permission and which role is picked is also based on a Gaussian number generator.

I initially had some problems with the Gaussian distribution, requiring an inverse error function to transform randomly generated numbers into a normal distribution. But then I was directed to some C code that does magical stuff. I was also made aware that Java's normalised Gaussian number generator can easily be scaled to whatever mean and std I give it, which is making the coding process easier.

Alternatively, permissions or roles could be assigned uniformly at random to roles and users respectively, given a maximum number of permissions per role and roles per user. Either option is feasible since the actual distribution is unknown.
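A minimal sketch of what I have in mind, where all parameter values are placeholders: the counts are drawn as mean + std * nextGaussian() and clamped to a sensible range, while which particular roles and permissions get picked is uniform at random, i.e. the simpler alternative just described.

import java.util.*;

// Generate a flat synthetic RBAC model and print the flattened user -> permission
// view that a bottom-up role mining algorithm would take as input.
public class SyntheticRbacData {
    public static void main(String[] args) {
        int numUsers = 100, numRoles = 10, numPerms = 50;
        double rolesPerUserMean = 2, rolesPerUserStd = 1;
        double permsPerRoleMean = 5, permsPerRoleStd = 2;
        Random rng = new Random(42);

        // Role -> permission assignments.
        Map<Integer, Set<Integer>> rolePerms = new HashMap<>();
        for (int r = 0; r < numRoles; r++) {
            int k = clamp(permsPerRoleMean + permsPerRoleStd * rng.nextGaussian(), 1, numPerms);
            Set<Integer> perms = new HashSet<>();
            while (perms.size() < k) perms.add(rng.nextInt(numPerms));
            rolePerms.put(r, perms);
        }

        // User -> role assignments, then the derived user -> permission assignments.
        for (int u = 0; u < numUsers; u++) {
            int k = clamp(rolesPerUserMean + rolesPerUserStd * rng.nextGaussian(), 1, numRoles);
            Set<Integer> roles = new HashSet<>();
            while (roles.size() < k) roles.add(rng.nextInt(numRoles));
            Set<Integer> perms = new TreeSet<>();
            for (int r : roles) perms.addAll(rolePerms.get(r));
            System.out.println("user" + u + " roles=" + roles + " perms=" + perms);
        }
    }

    // Round a Gaussian draw to an int and clamp it into [lo, hi].
    static int clamp(double x, int lo, int hi) {
        return Math.max(lo, Math.min(hi, (int) Math.round(x)));
    }
}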

Thursday, March 01, 2007

12 month project summary

The main focus of this project is the investigation of various techniques of bottom up role engineering for the development of a comprehensive infrastructure for Role Based Access Control. Since commencement of the research project, detailed analysis of the research area has been performed and two algorithms that address open issues in existing algorithms have been proposed.

Future work exists for both of the proposed algorithms. The first approach is a frequent pattern approach for role engineering, and it can be extended to identify closed or maximal frequent itemsets. This reduces the number of generated candidate permission sets. Using this approach would also eliminate the need for removing roles that were not explicitly assigned to any users. Results returned from frequent closed permission set mining could then be used to produce the role hierarchy.

The second algorithm is the graph optimisation approach and can be tested using different optimisation rules with different metrics. Doing so may identify interesting properties of each algorithm.

Both of these approaches can also be tested on more data. The current project is sponsored by CA (formerly Computer Associates). CA has provided some policy data that gives information on users and access rights from large corporations. Investigations on actual permission data can be done to justify the practicality of our proposed approaches. This testing on CA-gathered data for extraction of roles from real companies may involve the creation of a more general purpose prototype that moves the theoretical ideas of the project onto more solid practical foundations.

While practical applications of the algorithms are important, more testing on theoretical foundations also need to be done. This requires a large set of synthetic test data. This test data can be generated artificially by creating permissions, users and roles in a RBAC model and extracting the direct user permission relationships. This data can then be manipulated by different algorithms in the attempt to recreate the original artificial RBAC model with users, permissions and roles.

The lack of sufficient test data needs to be addressed within the whole area of role engineering, not just for our algorithms. One of the future goals of this project is to create this large synthetic data test suite. This data can be used for our testing to verify our algorithms. The data can also be made available to other researchers in role engineering to test their algorithms.

The result of such a publicly available test suite is a set of results that are comparable. Different approaches to a comprehensive comparison technique can be proposed. One such technique is to place each result into a graph and compare subtree sections. Another proposal is to create an algorithm that can map from one RBAC infrastructure to another for comparison.

The majority of existing algorithms assume clean, noiseless data. Our proposed frequent pattern approach deals with incorrect assignments to an extent, but analysis of access logs can further clean our data. Preprocessing algorithms may involve analysis of access logs to remove inactive permissions before bottom up role engineering algorithms are performed.

The improvement of previous approaches and the creation of pre and post processing for the role engineering techniques will be our main focus for future research in this project.

The intention is to convert the current Masters by Research into a PhD. A department seminar has been given and approval has been granted by the research committee. Due to the conversion, final thesis preparation will occur during 2009 for the new PhD thesis deadline.

Masters to PhD Conversion

I returned to work on the 18th of February (I went on leave for the first half of Feb) and since my return, I have been working furiously towards my official conversion from Masters by Research to PhD. At The University of Melbourne, the process usually takes place between 12-16 months after the commencement of a Masters by Research. I officially started with the university on 1st March 2006, so I thought I'd like to have it done by 1st March 2007. Also, to get extra funding for my next conference/workshop, I have to be a confirmed PhD student before I can apply. So anyway, today is 1st March 2007 and I have been given the okay to continue as a PhD student. The forms are in the works and it is now just a waiting game.

So how does the process work? Well, there needs to be a department seminar that is advertised as a conversion talk. Then there is a meeting with your committee members. For my conversion, I had the meeting right after the seminar. Friends who have converted or are converting right now say they've had their meeting already but haven't given a seminar, or they've given their conversion seminar but haven't had their meeting yet. I just did everything in one go so it was all fresh in everyone's mind, and I think that's the best way to go about it. Either way, you have to have written and given your conversion report to your committee before the meeting. This is what kept me busy for the past week or so. I originally wrote 5-6 pages for my conversion report, but James, my new supervising member, suggested it would be better if I put more effort into it. So I changed it to 65 pages. Yeah, a bit of a jump, but as long as you talk about the work that you've accomplished in the past 12 months and what your plans for the next 2-2.5 years are, it's all good. You also need a project timeline. I drew up a Gantt chart for what I hope to accomplish in the next 2 years and then discussed it. I also talked about when I want to accomplish what, future work on my current algorithms, and what I can do about generating test data and such. So that was the report. I sent it to Rao, James and Tim (my committee).

After my report, I prepared my slides. I didn't have enough slides and I regret not making more. I finished in about 35 minutes for a 50 minute presentation. However, questions went for about 15 minutes afterwards, so that was good. I had lots of interest from the audience so I'm very happy. At first I was a little bit worried about the turnout. I was hoping for about 5-6 people but there were at least 30 there. This meant there was a wider range of questions from different research backgrounds as well. Some of the questions were very interesting and raised issues I had not previously thought about. Usually, the person giving the conversion seminar has to take notes on all the questions that are asked.

Straight after my seminar, I went upstairs to a little meeting room with my committee. They asked me more questions, which I answered more satisfactorily than my seminar questions. They asked me to leave the room and then, about 5 minutes later, they asked me to return and said congratulations. They asked if I had forms for them and I said no. If I were to do the conversion again, that's what I would have done differently: to speed up the form process, I would have filled out my sections before my meeting, and during the meeting, if they agreed to let me convert, they could sign off on it right then and there. But anyway, I filled out the forms later that day and appended the committee meeting minutes as well as a list of questions that were asked at my seminar. I left these forms at reception yesterday and now I'm just sitting tight and hoping things go through before 15th April (the Istanbul workshop).

Thursday, January 25, 2007

NP complexities

As an extension to the work that I completed in January for my last paper submission, I wish to show that the problem itself is NP-hard. But first things first, I have to learn what this NP business is and analyse some proofs, so I am currently reading about NP problems. I borrowed a very old looking book by Johnson from the library that looks like it will be very useful. In the decomposition of my A matrix into B and C, we would like to show that the problem itself is NP-hard. To do so would be very good because we could stop looking for exact algorithms to solve it and work on heuristics instead.

In order to show the problem is NP-Hard, there are two steps that need to be performed:

  1. Show problem X is in NP
  2. Reduce an existing NP-Complete problem to it.


Task 1 is often simple. To accomplish task 2, an arbitrary instance of an NP-Complete problem needs to be transformed, in polynomial time, into an instance of X such that the answer to the NP-Complete instance is positive if and only if the answer to the corresponding instance of X is positive.
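Written in the usual notation, with the polynomial-time requirement made explicit (that is what makes the reduction meaningful), reducing a known NP-complete problem Y to X means:

\[
  Y \le_p X \iff \exists\, f \text{ computable in polynomial time such that } \forall y:\ \bigl( y \in Y \Leftrightarrow f(y) \in X \bigr)
\]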

This is where the Johnson book comes in, to find a problem to reduce.

Thursday, January 18, 2007

Graduate School Funding

The School of Graduate Studies has a Melbourne Abroad Travelling Scholarship. They give up to $1250. That would cover everything.

The main three criteria for the graduate school scholarships are as follows

  • be a confirmed PhD student
  • be travelling overseas and for at least 10 days
  • entail at least two distinct purposes that are clearly relevant to the thesis


MAT supplies additional funding to students who wish to travel overseas for purposes related but not essential to the completion of their PhD. Some general eligibility criteria include never having been granted a MAT before, being a confirmed PhD student, having received some but not sufficient funding from other sources, and having more than one purpose for a trip of at least 10 days. The student also needs permission to study overseas/take leave for the period. The scholarship is offered in two rounds every year; applications close 31st March and 31st August. The most difficult part of the application is writing your justification, working out your exact expenditure and getting all the relevant parties to sign off on everything. There's the justification for travel, approval from your supervisor, approval from the head of department, the travel risk analysis forms and finally the permission to have research leave/overseas study and such. But it's not too tedious a process if the additional funding can be granted. About 160 scholarships are offered every year, so 80 per cut-off date. Looking at the website for the last round of offers, about 100 people applied, so only 20 people were unsuccessful.

Wednesday, January 17, 2007

Departmental Travel Funding

During the period between my undergraduate and postgraduate studies, I applied and got accepted for vacation work in Web services at CSIRO in Canberra. Most people take the 3 months off to relax, drink beer and take a break from studies. I went to Canberra to read about and implement Web services.

This time twelve months ago, I was working towards a paper on monitoring composite Web services, focusing on the challenges, considerations and difficulties of doing so. The paper was submitted twice before it was accepted this time round to SEIW, in conjunction with ICDE in Istanbul, Turkey. I will be the representative going half way around the world to present it. It has also been requested as a journal paper, but we don't think we have the time or the resources to do the additional implementation and new work.

As the work is not directly related to my PhD, I will refrain from using my industry sponsors as a source of funding, although they said that if I put their institution on it, it might be okay. For now, I am trying to get all the funding to go over there from the department. The Department of Computer Science and Software Engineering will give me $1500 if I meet the following requirements:

  • I am an author of the paper
  • My university affiliation under my name is The University of Melbourne, Department of Computer Science and Software Engineering
  • The paper will go into a publication that has an ISBN
  • I give a department/postgraduate seminar explaining my work.


Mine is booked in for 30th March, the Friday before I leave for Turkey that isn't a public holiday. Postgraduate seminars are preferred to be on a Friday. I contacted CSIRO and they said it wasn't a problem for me to put my own affiliation instead of CSIRO under my name, and they can cover registration for me.

Hopefully I can get some more money from the school of graduate studies for the additional expenses.

Paper reference:
Kerry Taylor, Paul Brebner, Michael Kearney, Dana Zhang, Kelly Lam and Vladimir Tosic. Towards Declarative Monitoring of Declarative Service Compositions. In Proceedings of The 2nd International Workshop on Services Engineering (SEIW07) in conjunction with 23rd International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey, April 2007.

Friday, January 05, 2007

Paper Submission

I'm currently still writing my next paper. I have an extension until 15th January, a week after the original submission date of 8th January. My fellow research friends suspected there might be an extension because the submission site was still down a week before the 8th. This gives me more time to work on my results and improve them for the paper.

I've gotten most of the sections of writing worked out and am now working on the implementation. Some of the transitive closures with the graphs are a little tricky. There are many exceptions: for example, if a role is connected to other roles and it is to merge with another role, what happens to all the roles beneath it? How can they automatically be merged? I haven't quite decided what I want to do yet. But as I write, I think through the problems more and work out what's going on as I go.

Another way to describe a tree is as a directed acyclic graph. Rao was drawing this out: the nodes we have originally are users and permissions, and the edges are assignments of permissions to users. It is the pictorial representation of what we have when we enter an existing environment before the implementation of role based access control. As you can imagine, there are edges everywhere; each user can have an edge to any number of permissions. If we place nodes in between permissions and users, we can reduce the number of edges. The graph with the least number of edges produces the best role hierarchy structure. How can we prove this? Graph theory to the rescue. These additionally inserted nodes between permissions and users are the roles.

Access control can also be represented using matrices. The most common representation is the access control matrix, which lists users and permissions. Matrices can also represent graphs, so this access control matrix can be directly translated into the initial graph that contains users and permissions as nodes, with edges between them. Each row or column of the matrix is a user or permission, and each value within the matrix represents an edge between a user and a permission if the value is 1, and no edge otherwise. The role hierarchy after edge reduction can be represented by two matrices, one that relates users and roles and one that relates roles and permissions. Mathematically speaking, there is an operation on the two matrices that produces our initial one. The problem is identifying the best user-role matrix and role-permission matrix given the operation. The "best" two matrices can be defined as the combination of two matrices that has the least number of edges, i.e. the least number of 1's across the matrices. That's something interesting to think about.
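As a concrete, made-up example of this matrix view: the operation is Boolean matrix multiplication, where a user holds a permission if some role connects them. The small sketch below checks a candidate decomposition against the original matrix and counts the 1's as the cost.

// Check that a candidate user-role (UA) and role-permission (PA) decomposition
// reproduces the original user-permission matrix (UPA) under the Boolean product,
// and count the 1's in UA and PA as a simple cost metric. Matrices are made up.
public class RbacMatrixCheck {

    // Boolean product: result[u][p] = OR over r of (ua[u][r] AND pa[r][p]).
    static boolean[][] boolProduct(boolean[][] ua, boolean[][] pa) {
        int users = ua.length, roles = pa.length, perms = pa[0].length;
        boolean[][] upa = new boolean[users][perms];
        for (int u = 0; u < users; u++)
            for (int r = 0; r < roles; r++)
                if (ua[u][r])
                    for (int p = 0; p < perms; p++)
                        upa[u][p] |= pa[r][p];
        return upa;
    }

    static int ones(boolean[][] m) {
        int count = 0;
        for (boolean[] row : m) for (boolean b : row) if (b) count++;
        return count;
    }

    public static void main(String[] args) {
        boolean[][] upa = {            // original user-permission matrix (3 users x 3 permissions)
                {true, true, false},
                {true, true, true},
                {false, false, true}};
        boolean[][] ua = {             // candidate user-role assignments (3 users x 2 roles)
                {true, false},
                {true, true},
                {false, true}};
        boolean[][] pa = {             // candidate role-permission assignments (2 roles x 3 permissions)
                {true, true, false},
                {false, false, true}};

        boolean ok = java.util.Arrays.deepEquals(upa, boolProduct(ua, pa));
        System.out.println("decomposition reproduces UPA: " + ok
                + ", cost (1's in UA + PA) = " + (ones(ua) + ones(pa)));
    }
}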

Laurence sat in on the meeting because he was doing something very similar with his graphs. He wasn't doing it for role discovery in a security-based environment, but the addition of nodes within a graph to reduce the number of edges was something he was attempting in his own field of research. It was also noted that the concept of different levels in the hierarchy can be achieved by allowing self-referencing values in one of our operation matrices. In the role-permission matrix, roles can be related to other roles, which implies a hierarchy.

Update, ideas formalised in:
Dana Zhang, Kotagiri Ramamohanarao and Tim Ebringer. Role Engineering using Graph Optimisation. In Proceedings of the twelfth ACM Symposium on Access Control Models and Technologies (SACMAT'07), Sophia Antipolis, France, June 2007.