Friday, March 23, 2007

Synthetic Datasets vs Real Data

In research, one part of presenting a new approach is justifying it mathematically. Another part is running experiments to show that it behaves as predicted.

For my previous algorithms, I've mostly been testing with /etc/passwd and /etc/group files from the department's Unix systems. After testing on this data, the only real way to show the usefulness of our results is manual analysis. The same will be true for other industry data (which I will get to eventually) whenever the enterprise has no existing RBAC implementation to compare against.
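For concreteness, the extraction boils down to treating each group membership as a permission, roughly along the lines of the sketch below (the function name and field handling are illustrative, not my actual script):

```python
# Sketch: build a binary user-permission assignment from /etc/passwd and
# /etc/group, treating each group membership as a "permission".
from collections import defaultdict

def load_user_permissions(passwd_path="/etc/passwd", group_path="/etc/group"):
    gid_to_group = {}
    user_perms = defaultdict(set)

    # /etc/group lines look like  name:password:GID:member1,member2,...
    with open(group_path) as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            name, _, gid, members = line.rstrip("\n").split(":")[:4]
            gid_to_group[gid] = name
            for user in filter(None, members.split(",")):
                user_perms[user].add(name)      # secondary group memberships

    # /etc/passwd lines look like  name:password:UID:GID:GECOS:home:shell
    with open(passwd_path) as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split(":")
            user, gid = fields[0], fields[3]
            user_perms[user].add(gid_to_group.get(gid, gid))  # primary group

    return dict(user_perms)   # user -> set of group names ("permissions")
```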

For simulated data, I have completed a synthetic data generator that creates flat role hierarchies given the number of users, permissions and roles, plus the maximum number of roles per user and permissions per role. Testing our frequent pattern approach on this data exposed one of the issues with FPRM that I half expected: an enormous number of candidate roles is generated when each user holds many permissions and each user's permission set is similar to, but not exactly the same as, another's. I need to investigate closed/maximal frequent itemset algorithms to cut this down. I haven't yet added noise or hierarchical layers to the data for further testing, and I have not yet run the synthetic data through my graph optimisation approach.
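The generator itself is simple; something like the following sketch captures the idea (parameter names and the uniform random sampling are illustrative assumptions rather than the exact code):

```python
# Sketch of a flat-hierarchy synthetic RBAC data generator: random roles over
# the permission set, random role assignments to users, UPA = union of role perms.
import random

def generate_synthetic_rbac(n_users, n_perms, n_roles,
                            max_roles_per_user, max_perms_per_role, seed=None):
    # Assumes max_perms_per_role <= n_perms and max_roles_per_user <= n_roles.
    rng = random.Random(seed)

    # Each role is a random subset of permissions (flat: no role inheritance).
    roles = [set(rng.sample(range(n_perms), rng.randint(1, max_perms_per_role)))
             for _ in range(n_roles)]

    # Each user gets a random subset of roles; the user's permissions (the row
    # of the user-permission assignment, UPA) are the union of those roles.
    user_roles, upa = [], []
    for _ in range(n_users):
        assigned = rng.sample(range(n_roles), rng.randint(1, max_roles_per_user))
        user_roles.append(assigned)
        upa.append(set().union(*(roles[r] for r in assigned)))

    return roles, user_roles, upa
```

A user-permission assignment built this way is what the mining algorithms take as input, while the generated roles and user-role assignments are kept as ground truth for checking what gets mined back.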

Ideally, the best data to work with would be access control permissions from an enterprise that already has RBAC implemented. That way, we could compare the roles we mine against those of the actual implementation and analyse the differences.
