The major steps involved in fitting the data after it was collected were the following:
- Remove trades of Size 0 from datasets
- Convert all trade timestamps to UNIX format
- Add a random millisecond delay to trade timestamps
- Classify trades as “small” or “large”
- Pass desired datasets through R for modeling
A noticeable feature of some of the datasets was that they listed trades with a quantity of 0 being exchanged. Since a trade of zero size makes no sense, the datasets were first cleaned to remove these records. No dataset ever contained more than 6 such trades.
A potential explanation for these records is busted trades. Occasionally a clear electronic error involving price or quantity occurs at the exchange. A busted trade is one that the exchange no longer honors because of a clearly unfair discrepancy in price or quantity.
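The cleaning step can be illustrated with a minimal Python sketch. The record layout (timestamp, price, size) and the sample values are hypothetical and are not taken from the datasets used here.

```python
# Hypothetical trade records: (timestamp string, price, size).
trades = [
    ("2015-03-02 09:30:00", 101.25, 5),
    ("2015-03-02 09:30:00", 101.25, 0),   # size-0 record, possibly a busted trade
    ("2015-03-02 09:30:01", 101.50, 12),
]


def remove_zero_size(records):
    """Return only the records whose size (third field) is greater than zero."""
    return [r for r in records if r[2] > 0]


cleaned = remove_zero_size(trades)
removed = len(trades) - len(cleaned)
print(f"removed {removed} zero-size trade(s), {len(cleaned)} remain")
```

Counting the removed records, as done here, gives a quick check that only a handful of trades were dropped.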
Only the timestamps needed to be passed to the R program to obtain parameter estimates. The R modeling routine requires that all timestamps be passed in UNIX format, which is the number of seconds since January 1, 1970 (00:00:00 UTC). The dataset timestamps were converted to UNIX format using Microsoft Excel.
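The same conversion can be sketched in Python rather than Excel. The timestamp format string and the assumption that the stamps are in UTC are illustrative only; exchange timestamps recorded in local time would need the appropriate offset applied first.

```python
from datetime import datetime, timezone


def to_unix(stamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a timestamp string (assumed to be in UTC) to UNIX seconds."""
    dt = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


print(to_unix("1970-01-01 00:00:00"))  # 0, the epoch itself
print(to_unix("1970-01-02 00:00:00"))  # 86400, one day later
```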
Another requirement of the R modeling is that timestamps be unique. None of the datasets satisfied this, because multiple trades can occur in the same second, so I used a strategy common among practitioners: adding a random millisecond delay to every trade timestamp, using the RAND function in Microsoft Excel.
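A minimal Python sketch of the jittering idea follows. It is not the Excel procedure itself: the function name is hypothetical, and the retry on collision is an added safeguard that a plain RAND draw does not provide. It assumes fewer than 1000 trades share any one second, since otherwise no unique millisecond would remain.

```python
import random


def add_millisecond_jitter(unix_seconds, seed=None):
    """Add a random 0-999 ms delay to each whole-second timestamp.

    Works in integer milliseconds and redraws on collision, so the
    returned values (float seconds, sorted ascending) are all distinct.
    """
    rng = random.Random(seed)
    used_ms = set()
    for t in unix_seconds:
        ms = int(t) * 1000 + rng.randrange(1000)
        while ms in used_ms:          # redraw until the millisecond is unused
            ms = int(t) * 1000 + rng.randrange(1000)
        used_ms.add(ms)
    return [ms / 1000.0 for ms in sorted(used_ms)]


stamps = [100, 100, 100, 101]         # three trades share second 100
jittered = add_millisecond_jitter(stamps, seed=1)
print(jittered)
```

Each jittered value stays within the second of its original timestamp, so the ordering of trades across different seconds is preserved.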
The data were classified using MATLAB, with the Gaussian Mixture Model (GMM) method of classification. The GMM fits a two-component (bimodal) distribution and assigns each data point a probability of belonging to each of the two clusters. This suited our case well because the two clusters we wanted were “small” and “large” trades. The resulting labels were then written to the datasets so that they could be split into datasets of just “large” or just “small” trades for day or night. Below is a visual representation of how a distribution looks for the Gaussian Mixture Model.
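As a rough code-level illustration of the classification step, the sketch below fits a two-component, one-dimensional Gaussian mixture by expectation-maximization in plain Python. It is not the MATLAB routine used here. The function names, the initialisation at the data extremes, the 0.5 probability cut-off, and the sample trade sizes are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, weights, means, variances):
    """Probability that x belongs to each of the two components."""
    p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    total = p[0] + p[1]
    if total == 0.0:                  # both densities underflowed
        return [0.5, 0.5]
    return [p[0] / total, p[1] / total]


def fit_two_component_gmm(xs, iterations=200):
    """Fit a two-component 1-D Gaussian mixture by EM."""
    n = len(xs)
    overall_mean = sum(xs) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in xs) / n, 1e-6)
    weights = [0.5, 0.5]
    means = [min(xs), max(xs)]        # start the components at the extremes
    variances = [overall_var, overall_var]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point
        resp = [posteriors(x, weights, means, variances) for x in xs]
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)   # floor avoids a zero variance
    return weights, means, variances


def classify(xs, weights, means, variances):
    """Label a trade 'large' if the higher-mean component is more probable."""
    large = 0 if means[0] > means[1] else 1
    return ["large" if posteriors(x, weights, means, variances)[large] > 0.5
            else "small" for x in xs]


sizes = [1, 2, 1, 3, 2, 1, 2, 3, 50, 55, 60, 45, 52]   # hypothetical trade sizes
weights, means, variances = fit_two_component_gmm(sizes)
labels = classify(sizes, weights, means, variances)
print(list(zip(sizes, labels)))
```

The labels can then be written back alongside each trade so that the "small" and "large" subsets can be split into separate datasets.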
Once trades were classified and separated by size, the timestamps of each class could be passed through R. The R function would then return parameter estimates for the Hawkes model. The R package used, which supports a self-exciting Hawkes function, was ptproc. The function used on the datasets in R was defined as follows:
Once this function was defined in R, a dataset could be passed through it and parameter values for the Hawkes process were returned. The evalCIF function could then be used to return the fitted conditional intensities of a dataset based on the estimated parameters; these were what was compared to the empirical intensity.
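To make the idea of a fitted conditional intensity concrete, the sketch below evaluates a Hawkes intensity in Python. It assumes the commonly used exponential kernel, lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i)), which may differ from the function actually defined in R. The function name and parameter values are hypothetical, and this is not the ptproc API.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity at time t under an exponential kernel.

    mu is the baseline rate; each event strictly before t adds a jump
    of size alpha that decays at rate beta.
    """
    return mu + sum(alpha * math.exp(-beta * (t - ti))
                    for ti in event_times if ti < t)


events = [1.0, 1.5, 4.0]              # hypothetical event times in seconds
mu, alpha, beta = 0.2, 0.8, 1.5       # hypothetical parameter estimates

for t in (0.5, 1.6, 3.9, 4.1):
    print(t, round(hawkes_intensity(t, events, mu, alpha, beta), 4))
```

Before the first event the intensity equals the baseline mu, and it jumps after each event before decaying back, which is the self-exciting behaviour that the fitted intensities are meant to capture.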