A number of key challenges have been identified for the next stage of TCIDA delivery. The three key business problems are supplier onboarding, analytics delivery, and credit risk analysis. Intrinsic to all these problems is the notion of matching of one document or record to another. Critical to the performance of any matching algorithm is the apposite selection of features (or the evaluation of present characteristics), which engenders questions of data standardisation and canonical or normalised forms (to improve the evaluation of those characteristics).
A principal goal of TCIDA research is to assimilate external data for the purposes of data validation and data enrichment. Such data may be available through application programming interfaces (APIs) or may require extraction from HTML-formatted web pages. On a small scale, this can be achieved with hand-written interfaces or wrappers (converting web pages into relational form). However, large-scale extraction will require learning the structure of web sites and web pages in order to automate the process, which leads to questions of natural language processing (NLP), such as named entity resolution, and ontological reasoning.
The processes of data validation (and cleansing) and data enrichment from multiple data sources should be supported with a robust data provenance model. This should allow for the appropriate methods to audit source and data transformations and provide measures of accuracy, completeness, credibility and relevancy. The aim is to create a system for the automatic aggregation and collation of data to support the core activities of Tungsten.
TCIDA work will focus on improving the supplier onboarding process, with a view to both analytics delivery and credit risk analysis, by developing these underlying technologies. In the first instance we will look primarily at addresses as a key attribute of supplier entities, and at how varying representations may best be evaluated for comparison. This will be combined with supporting work to access external data sources for the purposes of validating and, where possible, verifying addresses. Together with our earlier work on matching and on standardising transformations of fields such as postcode, VAT registration and telephone number, this will give accurate identification of suppliers requiring minimal human intervention in the onboarding process. Additionally, we will look at collecting primary contact details from sources such as LinkedIn, combined with other company details (from various web resources) pertaining to due diligence for credit risk analysis, to feed into the sales and approvals (for TEP) processes.
There is also a redevelopment of the Tungsten infrastructure which, in the first instance, is moving customer relationship management to a new cloud-based system: Salesforce. This complements the aims TCIDA had for developing the SmartAlec 4.0 infrastructure, although it presents us with something of a moving target in terms of how to interface with the new system. Our initial discussions with the teams involved have been positive and we believe there are sufficient mechanisms to interface with the new platform. This should provide a flexible approach, allowing us to adapt to changes in business processes which support a more streamlined onboarding of suppliers.
Future work will be more aligned to the delivery of analytics. Preliminary work would perhaps address product matching and the apposite selection of features in this context, supported by external data for enrichment. Purchase order line item matching (perhaps mainly from a combinatorial perspective) may prove to be a useful exercise and offer a variety of descriptions for the items ordered. This would then be followed by the more technically challenging clustering, time series analysis and product ontology generation, delivering substantial benefits to analytics delivery while allowing for such areas as market analysis and product classification to be addressed.
The redevelopment of the Tungsten infrastructure again complements our planned delivery for this. It is envisaged that data for the analytics delivery will be fed from the core data systems (for invoice processing) into a separate data store, and this will be provided to TCIDA by the relevant Tungsten teams. TCIDA will take an incremental approach to delivering functionality to Tungsten through scalable web interfaces. This may simply begin by providing company and product search functionality that is expanded to show progressively more information as the mechanisms become available.
Data Standardisation (or normalization) is the process of reducing data to a canonical form. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure. This applies to formatted fields (such as VAT, Postcode) and unformatted fields (where there may be abbreviations or units of measure).
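As a minimal sketch of standardising formatted fields, the following reduces UK postcodes and VAT numbers to one canonical form each. The function names, the chosen canonical forms and the simplified patterns are our own assumptions for illustration; in particular, no checksum or registry validation is attempted.

```python
import re
from typing import Optional


def normalise_postcode(raw: str) -> Optional[str]:
    """Reduce a UK postcode to upper-case 'OUTWARD INWARD' form.

    Returns None when the input does not look like a postcode.
    The pattern is a simplification (an assumption), not the full official rule set.
    """
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    # The inward code is a digit followed by two letters; the rest is the outward code.
    match = re.fullmatch(r"([A-Z]{1,2}[0-9][A-Z0-9]?)([0-9][A-Z]{2})", compact)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


def normalise_vat(raw: str) -> Optional[str]:
    """Reduce a standard nine-digit UK VAT number to 'GB' plus the digits.

    Only the format is checked (an assumption); the check digits are not verified.
    """
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    if compact.startswith("GB"):
        compact = compact[2:]
    if re.fullmatch(r"[0-9]{9}", compact) is None:
        return None
    return "GB" + compact


print(normalise_postcode("sw1a1aa"))      # SW1A 1AA
print(normalise_vat("gb 123 4567 89"))    # GB123456789
```

Once both sides of a comparison have been passed through the same functions, differences in spacing, case and punctuation no longer affect matching.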
Record Matching (or linkage) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier.
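Where no common identifier exists, one simple way to link records is to score candidate pairs on several fields and accept the best candidate above a threshold. The sketch below is illustrative only: the field names, the weights and the threshold are hypothetical, and the string similarity comes from the Python standard library rather than from any matching method developed by TCIDA.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted combination of field similarities (weights are assumptions)."""
    name = similarity(rec_a["name"], rec_b["name"])
    address = similarity(rec_a["address"], rec_b["address"])
    postcode = 1.0 if rec_a["postcode"] == rec_b["postcode"] else 0.0
    return 0.5 * name + 0.3 * address + 0.2 * postcode


def best_match(record: dict, candidates: list, threshold: float = 0.8):
    """Return the highest-scoring candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    score, candidate = max(
        ((match_score(record, c), c) for c in candidates),
        key=lambda pair: pair[0],
    )
    return candidate if score >= threshold else None


supplier = {"name": "Acme Supplies Ltd", "address": "1 High Street", "postcode": "SW1A 1AA"}
known = [
    {"name": "ACME Supplies Limited", "address": "1 High Street", "postcode": "SW1A 1AA"},
    {"name": "Zebra Quartz", "address": "9 Hill Road", "postcode": "M1 1AE"},
]
print(best_match(supplier, known))
```

The postcode comparison is exact, so it assumes that both records have already been standardised.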
Entity Resolution (here) refers to the process that further identifies the components of a record; that is, identifying the various features within a record for matching.
Combinatorics is a branch of mathematics concerning the study of finite or countable discrete structures. Aspects of combinatorics will prove useful for various specific problems pertaining to Analytics and Matching in general; for example, where there are numerous data constraints to be satisfied.
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields. Clustering can be fine-grained (many small groups) or coarse-grained (fewer, larger groups).
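The difference between fine-grained and coarse-grained clustering can be illustrated with a small sketch. Here product descriptions are grouped by single-link clustering on Jaccard similarity of their words; both the similarity measure and the threshold values are assumptions chosen for illustration, not a recommendation for production use.

```python
def tokens(description: str) -> frozenset:
    """Split a description into a set of lower-case words."""
    return frozenset(description.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(descriptions: list, threshold: float) -> list:
    """Single-link clustering: an item joins a cluster if it is similar
    enough to any one member. Implemented with a small union-find."""
    token_sets = [tokens(d) for d in descriptions]
    parent = list(range(len(descriptions)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(descriptions)):
        for j in range(i + 1, len(descriptions)):
            if jaccard(token_sets[i], token_sets[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, description in enumerate(descriptions):
        groups.setdefault(find(i), []).append(description)
    return list(groups.values())


items = [
    "blue ballpoint pen",
    "black ballpoint pen",
    "red gel pen",
    "a4 copier paper",
    "a3 copier paper",
]
print(cluster(items, threshold=0.5))   # fine-grained: three groups
print(cluster(items, threshold=0.15))  # coarse-grained: pens and paper
```

Lowering the threshold merges the gel pen with the ballpoint pens, giving fewer and larger groups.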
Classification is a general process related to categorization, the process in which ideas and objects are recognized, differentiated, and understood. A classification system is an approach to accomplishing classification. In machine learning it is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analysing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. It is a very natural approach to reducing dimensionality for feature selection and feature extraction; other methods, such as principal component analysis (PCA) will also be considered.
Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input. This work will mainly be focused on Information Retrieval and Information Extraction and their various components; for example, coreference resolution, word sense disambiguation, relationship extraction, etc.
Ontologies provide a common vocabulary of an area and define, with different levels of formality, the meaning of the terms and the relationships between them. Ontology Generation is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between those concepts from a corpus of natural language text, and encoding them with an ontology language for easy retrieval.
A web spider (or crawler) is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. Web scraping is a computer software technique of extracting information from websites. This (scraping) will require sophisticated techniques to automatically determine the structure of a web site and the structure of its pages so that the information can be identified, stored and utilised.
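For the small-scale case, a hand-written wrapper that converts one known page layout into relational form might look like the following sketch. It uses only the Python standard library; the class name and the assumption that the data sits in a plain HTML table are ours. The sophisticated techniques referred to above would aim to learn such structure automatically rather than hard-code it.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collect the cells of every table row as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html: str) -> list:
    """Return the table rows found in an HTML fragment."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    return parser.rows


page = (
    "<table>"
    "<tr><th>Name</th><th>Postcode</th></tr>"
    "<tr><td>Acme Ltd</td><td> SW1A 1AA </td></tr>"
    "</table>"
)
print(extract_rows(page))
```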
An application programming interface (API) is a set of routines, protocols, and tools for building software applications. There are numerous Web APIs which present valuable sources of information. We will evaluate those which seem potentially most useful and implement appropriate technology to harness these data sources.
There is a commercial advantage in reporting spend analysis in real time. This will allow purchasing managers to raise concerns about particular invoices before payment is made. With appropriate consideration of the information architecture this can be achieved, allowing incremental improvements to be reflected in the reporting as they arise.
As an example of the system's functionality we were shown how it reports potential savings to buyers on the same product supplied by the same supplier. At the prototype demonstration it appeared that the system effectively estimates potential savings by computing how much would have been paid if all products had been bought at the minimum price (and then subtracting that amount from that which was actually paid). The team remain concerned that such an approach may (at least occasionally) give rise to an exaggerated view of potential savings to buyers. For example, among many transactions the data set may contain a single outlier, representing, say, a 'special offer' at a much cheaper price, and it will normally be unreasonable to use this singleton as a basis for comparing all other purchases. Furthermore, the price of items may fluctuate seasonally and it would be unreasonable to expect to pay the summer price for tomatoes in the winter.
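The concern about outliers can be made concrete with a small sketch. The first function reproduces the minimum-price calculation as we understood it from the demonstration; the second uses the median price as the baseline, which is purely our assumption of one possible more robust alternative. The data are invented for illustration.

```python
from statistics import median


def savings_vs_minimum(purchases: list) -> float:
    """Amount paid minus what would have been paid at the minimum unit price.

    purchases is a list of (unit_price, quantity) pairs.
    """
    floor = min(price for price, _ in purchases)
    return sum((price - floor) * quantity for price, quantity in purchases)


def savings_vs_median(purchases: list) -> float:
    """Only count spend above the median unit price (an assumed robust baseline)."""
    typical = median(price for price, _ in purchases)
    return sum(max(price - typical, 0) * quantity for price, quantity in purchases)


# Four ordinary orders and one small 'special offer' at a much lower price.
purchases = [(10.0, 100), (10.0, 100), (10.5, 100), (11.0, 100), (6.0, 10)]

print(savings_vs_minimum(purchases))  # 1750.0
print(savings_vs_median(purchases))   # 150.0
```

The single special-offer transaction inflates the minimum-price estimate by more than a factor of ten relative to the median-based figure.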
We suspect that if a SAB system merely highlighted variations from the minimum price, this feature might eventually be ignored by its users. We suggest that for customers to take price variation seriously a more sophisticated approach is required; one that can take all of the above factors into account.
In much the same way that the Google 'PageRank' algorithm gave rise to a better reflection of the importance of specific web pages (and hence prompted the long-term shift of web search services from AltaVista to Google), we believe that a similarly clever algorithm for ordering possible savings could offer a much better reflection of the importance of individual price variation to the user.
As soon as the SAB system is extended to include less specific analysis (e.g. the task of comparing prices of an identical product supplied to the buyer by different suppliers), the application of advanced artificial intelligence techniques (from areas such as 'quantum linguistics', 'data mining', 'machine-learning' and 'clustering') cannot easily be avoided. For example, instead of simply reporting variances from minimum prices, more sophisticated algorithms could inform buyers which products were most likely to yield the largest savings (taking into account seasonal fluctuations etc.) and offer the user the chance to ignore outliers in performing the analysis. In addition, we suggest that inflation and other market forces should also be taken into account in presenting more accurate estimated potential savings to buyers.
It would also seem natural to allow the users to influence the overall reporting of possible savings. This could include, for example, the ability to upload cost centre codes, accounting codes, their own product classifications, or contract data (agreed pricing of products from various suppliers). In this context the team suggest investigating the extent to which AI technology could be used in a predictive manner to help reduce the burden of maintaining such dimensions as new product items are supplied or new suppliers are engaged.
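One simple way to take seasonal fluctuation into account is to compare each purchase only with prices paid in the same season. The sketch below uses the calendar month as the season and the monthly median as the baseline; both choices, and the data, are assumptions for illustration rather than a proposed design.

```python
from collections import defaultdict
from statistics import median


def seasonal_savings(purchases: list) -> float:
    """Potential savings measured against the median price of the same month.

    purchases is a list of (month, unit_price, quantity) triples.
    """
    prices_by_month = defaultdict(list)
    for month, price, _ in purchases:
        prices_by_month[month].append(price)
    baseline = {month: median(prices) for month, prices in prices_by_month.items()}
    return sum(
        max(price - baseline[month], 0) * quantity
        for month, price, quantity in purchases
    )


# Tomatoes: cheaper in July (month 7) than in January (month 1).
tomatoes = [
    (7, 1.0, 10), (7, 1.0, 10), (7, 1.5, 10),
    (1, 2.0, 10), (1, 2.0, 10), (1, 2.5, 10),
]
print(seasonal_savings(tomatoes))  # 10.0
```

Measured against the overall minimum price of 1.0, the same data would suggest savings of 40.0, most of which simply reflects the winter price.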
Clustering is essential to useful spend analytics. Moving from identical items to similar items is a very different task and requires a variety of techniques, many of which fall under the general heading of artificial intelligence. For example, a SAB system may be required to perform a more general analysis about pens. To do this, we need to find all products in our system which come under that category. This is a very hard task and can never be performed to 100% accuracy except with very small data sets. Indeed, to help solve the problem we may have to look outside our local data set, possibly even resorting to spidering the Web in search of hints about how to classify products whose internal descriptions are not sufficiently helpful.
Without clustering, the same product supplied by a different supplier will be regarded as a different product; identifying them as the same is a very different problem. When are two products produced by different suppliers in fact the same? This sort of question is addressed using algorithms from artificial intelligence and can only be answered probabilistically.
Furthermore, clustering is essential whenever we want to ask questions in a more general way. Without clustering, we may be able to ask questions like 'How is supplier X performing this month?', but if we want to ask questions like 'How is supplier X performing this month compared to other similar suppliers?' things become much more complex. We need to be able to find ways of clustering 'similar' suppliers. Presumably, inter alia, 'similar' suppliers sell 'similar' products. Deciding whether two different products are similar, however, is an even more difficult problem than deciding whether they are identical. This problem may require the use of external data produced as the result of spidering, and state-of-the-art semantic text analysis such as 'quantum linguistics'.
Improved search functionality
A nice feature demonstrated was the search functionality when filtering the result set by product. However, this relied purely on selecting products containing the search terms in their invoice descriptions. The system has many possibilities for improvement, but these require a degree of 'semantic' understanding (e.g. that the word 'transit' should be treated synonymously with 'carriage'). The search functionality would also be improved by using enriched data from external sources, such as fuller product descriptions from supplier catalogues. Similarly, a fine-grained clustering or classification of products could be used to broaden searches over specific types of product. We see this as a series of incremental steps to provide the buyers with the search functionality that they require.
User behaviour to improve results
We can also add knowledge by analysing user-supplied data and user behaviour. User-supplied data allows for a more tailored interface to the user, but when aggregated across all users it also gives semantic information from the human perspective, which may be leveraged in many ways. Similarly, user behaviour can be mined, providing an important feedback loop for the relevant learning algorithms. For example, noting which products are most frequently grouped together for comparison gives an additional mechanism for determining which products are similar. This can then be used to adjust the parameters for the ranking of products in a search.
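The synonym example given under improved search functionality ('transit' treated synonymously with 'carriage') can be sketched as a simple query expansion. The synonym table here is hand-written and hypothetical; in practice it might be derived from external data or from the aggregated user behaviour described above.

```python
# Hypothetical, hand-written synonym table for illustration only.
SYNONYMS = {
    "transit": {"carriage"},
    "carriage": {"transit"},
}


def expand(term: str) -> set:
    """Return the term together with any known synonyms."""
    return {term} | SYNONYMS.get(term, set())


def search(query: str, descriptions: list) -> list:
    """Return descriptions containing every query term or one of its synonyms."""
    term_groups = [expand(term) for term in query.lower().split()]
    hits = []
    for description in descriptions:
        words = set(description.lower().split())
        if all(group & words for group in term_groups):
            hits.append(description)
    return hits


invoice_lines = ["Carriage charge UK", "Transit insurance", "Blue ballpoint pen"]
print(search("transit", invoice_lines))
```

A plain substring search for 'transit' would return only the second line; with expansion the carriage charge is found as well.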
The data for one product item from one supplier to one buyer is generally so sparse that accurate analysis and predictions are not possible (a problem statisticians might call over-fitting). Of course, something is probably better than nothing, but the value will be limited and, without care, expectations could be artificially raised. The notion of similarity, as discussed under clustering or classification of products, allows for hierarchical modelling of products. High-level groupings with many members have more data and smoother behaviour, giving rise to better models. Lower-level groupings have fewer members and less data, but their models should be influenced (and smoothed) by the models of the higher-level groups to which they belong. This enables better predictions to be made at these lower levels by allowing influence from above.
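The idea of a lower-level model being smoothed by its parent group can be sketched with a simple shrinkage estimate: the item's own mean price is pulled towards the group mean, more strongly when the item has few observations. The function name and the `prior_weight` parameter are hypothetical, and this weighting scheme is only one of several possible ways of realising hierarchical modelling.

```python
def shrunk_mean(item_prices: list, group_mean: float, prior_weight: float = 5.0) -> float:
    """Blend an item's mean price with the mean of its parent group.

    prior_weight (an assumed tuning parameter) acts like a number of
    pseudo-observations at the group mean: sparse items stay close to the
    group, well-observed items follow their own data.
    """
    n = len(item_prices)
    if n == 0:
        return group_mean
    item_mean = sum(item_prices) / n
    weight = n / (n + prior_weight)
    return weight * item_mean + (1 - weight) * group_mean


group_mean = 10.0
print(shrunk_mean([], group_mean))            # no data: falls back to the group
print(shrunk_mean([20.0] * 5, group_mean))    # five observations: halfway
print(shrunk_mean([20.0] * 95, group_mean))   # many observations: close to 20
```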
The market modelling will enable comparative analysis of the various products and product groupings, building up a network of correlations over the market place, with strong correlations between closely related parts of the market but also some between parts which are more distant (and perhaps unexpected). Temporal properties may also be examined; for example, where growth in one is usually followed by growth in another. This, together with standard time series techniques, should provide a rich toolbox for trend analysis and predictive forecasting.
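A temporal relationship of the kind described, where movement in one series tends to be followed by movement in another, can be explored with a lagged correlation. The sketch below computes the Pearson correlation at each lag and reports the lag at which it is highest; the series are invented and the function names are our own.

```python
from math import sqrt
from statistics import mean


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return covariance / sqrt(var_x * var_y)


def lagged_correlation(leader: list, follower: list, lag: int) -> float:
    """Correlate leader at time t with follower at time t + lag."""
    if lag > 0:
        leader, follower = leader[:-lag], follower[lag:]
    return pearson(leader, follower)


def best_lag(leader: list, follower: list, max_lag: int) -> int:
    """Lag (in periods) at which the follower tracks the leader most closely."""
    return max(
        range(max_lag + 1),
        key=lambda lag: lagged_correlation(leader, follower, lag),
    )


# Invented monthly figures: the second series repeats the first two months later.
steel = [1, 3, 2, 5, 4, 6, 8, 7]
fasteners = [0, 0, 1, 3, 2, 5, 4, 6]
print(best_lag(steel, fasteners, max_lag=3))  # 2
```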