Glossary
Although every effort has been made to communicate clearly, this report uses some domain-specific terms. The most common of these are defined below as they are used in this paper.
Word | Definition |
Dry Bulb Temperature | Air temperature in Kelvin |
Feature | An attribute of the dataset. Also known in database theory as an attribute, and commonly seen in the form of a column. Some features are inherent in the data; others must be computed (these are often called labels). |
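Foreign Key (FK) | An attribute in a table that ties a record uniquely to a record in another table in a one-to-one or one-to-many relationship. |
Mean Absolute Error (MAE) | A performance statistic computed as the average absolute difference between predicted and observed values: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $n$ is the number of predictions, $y_i$ is the observed value, and $\hat{y}_i$ is the predicted value. |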
Primary Key (PK) | An attribute in a table that uniquely identifies a record. Commonly implemented as an unsigned integer for low storage and retrieval costs. |
Relation | In database theory a set of tuples which share a set of labeled attributes. Commonly viewed as a table. |
Tuple | In database theory a set of related facts; commonly viewed as a row. Also referred to as a record, which is an atomic piece of information. |
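As a small illustration of the Mean Absolute Error entry in the glossary above, the following sketch computes MAE as the average of the absolute differences between observed and predicted values. The function name and the sample temperatures are hypothetical, chosen only for illustration; they are not taken from the project's code or data.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical dry bulb temperatures (illustrative values only).
observed = [280.0, 282.5, 279.0, 285.0]
predicted = [281.0, 282.0, 277.0, 288.0]

# Absolute errors are 1.0, 0.5, 2.0 and 3.0, so the mean is 6.5 / 4.
mae = mean_absolute_error(observed, predicted)
print(mae)  # 1.625
```

Because MAE is expressed in the same units as the predicted quantity, a lower value in the mae column of the performance table indicates predictions that are, on average, closer to the observed values.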
Database Design
1st Iteration Intra Model Performance Table
subset | model | operation | projection | mae | stddev | iqr |
100 | linreg | average | 24 | 4.631831543 | 7.267661744 | 7.376935522 |
1000 | linreg | average | 24 | 2.143428653 | 3.899018518 | 3.351450163 |
10000 | linreg | average | 24 | 2.030554289 | 2.657135622 | 3.2303772 |
100000 | linreg | average | 24 | 1.994362735 | 2.637388048 | 3.142809138 |
100 | linreg | average | 168 | 5.913616302 | 8.252230627 | 9.359491752 |
1000 | linreg | average | 168 | 2.572654771 | 5.303453545 | 4.023931935 |
10000 | linreg | average | 168 | 2.338230608 | 3.025636025 | 3.790127042 |
100000 | linreg | average | 168 | 2.32681975 | 3.00391192 | 3.781893953 |
100 | linreg | max | 24 | 50.64446932 | 67.76835863 | 77.90457765 |
1000 | linreg | max | 24 | 14.13080713 | 25.57563651 | 18.26589205 |
10000 | linreg | max | 24 | 13.32337275 | 24.25696593 | 17.37206546 |
100000 | linreg | max | 24 | 12.58745831 | 23.29352607 | 16.12725384 |
100 | linreg | max | 168 | 55.96865409 | 78.26706052 | 80.61073331 |
1000 | linreg | max | 168 | 20.35751872 | 52.43354441 | 26.1918654 |
10000 | linreg | max | 168 | 17.64412576 | 27.90933131 | 23.93082071 |
100000 | linreg | max | 168 | 17.5438013 | 28.00114329 | 23.63088294 |
100 | linreg | min | 24 | 8.94821248 | 11.8499813 | 14.48369101 |
1000 | linreg | min | 24 | 3.630860988 | 12.80957702 | 5.53195469 |
10000 | linreg | min | 24 | 3.317300065 | 4.279692232 | 5.36504644 |
100000 | linreg | min | 24 | 3.286721203 | 4.247615987 | 5.312396743 |
100 | linreg | min | 168 | 10.40786872 | 13.34390973 | 16.92560316 |
1000 | linreg | min | 168 | 826951148.6 | 33334747900 | 6.951128047 |
10000 | linreg | min | 168 | 4.040246507 | 5.175617132 | 6.678711019 |
100000 | linreg | min | 168 | 4.009690602 | 5.121298401 | 6.603335997 |
100 | rf | average | 24 | 2.760442915 | 3.531526854 | 4.399495536 |
1000 | rf | average | 24 | 2.232302384 | 2.914016873 | 3.564670586 |
10000 | rf | average | 24 | 2.11029389 | 2.801957525 | 3.301544268 |
100 | rf | average | 168 | 2.735508517 | 3.517131784 | 4.450532264 |
1000 | rf | average | 168 | 2.480721127 | 3.187435833 | 4.0349717 |
10000 | rf | average | 168 | 2.480108425 | 3.205892087 | 3.992762653 |
100 | rf | max | 24 | 14.22043586 | 27.76593003 | 13.11885 |
1000 | rf | max | 24 | 9.674497262 | 20.62727379 | 9.433325 |
10000 | rf | max | 24 | 8.130446313 | 16.22976069 | 8.78005 |
100 | rf | max | 168 | 27.00145684 | 43.0775496 | 33.286 |
1000 | rf | max | 168 | 14.86749773 | 27.88180684 | 15.347525 |
10000 | rf | max | 168 | 13.05649929 | 23.04437046 | 14.00775 |
100 | rf | min | 24 | 3.633476838 | 4.615422392 | 5.9898 |
1000 | rf | min | 24 | 3.3779545 | 4.376241856 | 5.448625 |
10000 | rf | min | 24 | 3.279107259 | 4.262930004 | 5.274225 |
100 | rf | min | 168 | 4.70533 | 6.064141173 | 7.57625 |
1000 | rf | min | 168 | 4.16066327 | 5.368543541 | 6.71275 |
10000 | rf | min | 168 | 4.05572153 | 5.271970466 | 6.51225 |
100 | ridge | average | 24 | 3.098310021 | 4.147110535 | 5.160014631 |
1000 | ridge | average | 24 | 2.15449149 | 4.769758985 | 3.31390058 |
10000 | ridge | average | 24 | 2.234714442 | 2.986644779 | 3.605391636 |
100000 | ridge | average | 24 | 2.007740807 | 2.628262112 | 3.19764219 |
100 | ridge | average | 168 | 3.079077471 | 4.269716253 | 4.80456551 |
1000 | ridge | average | 168 | 2.418976865 | 3.427145717 | 3.872898081 |
10000 | ridge | average | 168 | 2.356801016 | 3.038834447 | 3.809819451 |
100000 | ridge | average | 168 | 2.336227296 | 3.011133991 | 3.800794591 |
100 | ridge | max | 24 | 24.47592318 | 38.14257757 | 35.00257022 |
1000 | ridge | max | 24 | 12.5754371 | 37.926706 | 14.65336718 |
10000 | ridge | max | 24 | 14.36215771 | 26.3561325 | 17.8562204 |
100000 | ridge | max | 24 | 12.50123122 | 23.20131167 | 15.97997729 |
100 | ridge | max | 168 | 34.54528896 | 49.97080019 | 50.25217266 |
1000 | ridge | max | 168 | 19.28877169 | 49.93754326 | 25.24681336 |
10000 | ridge | max | 168 | 17.7943851 | 28.01418666 | 24.21251711 |
100000 | ridge | max | 168 | 17.47328446 | 27.87214109 | 23.60672248 |
100 | ridge | min | 24 | 4.106146543 | 5.199621811 | 6.824983235 |
1000 | ridge | min | 24 | 3.488770104 | 4.476913878 | 5.636792866 |
10000 | ridge | min | 24 | 3.348207229 | 4.348413345 | 5.402443727 |
100000 | ridge | min | 24 | 3.275361169 | 4.227956557 | 5.314580584 |
100 | ridge | min | 168 | 4.949920094 | 6.284755109 | 8.321237635 |
1000 | ridge | min | 168 | 4.206105848 | 5.48801513 | 6.8713299 |
10000 | ridge | min | 168 | 4.039781869 | 5.154984591 | 6.67984543 |
100000 | ridge | min | 168 | 4.018035449 | 5.133184139 | 6.618504286 |
100 | sgd | average | 24 | 3.056592705 | 3.890190526 | 4.913693062 |
1000 | sgd | average | 24 | 2.144314574 | 2.995426857 | 3.373765784 |
10000 | sgd | average | 24 | 1.969138831 | 2.859141633 | 3.028862511 |
100000 | sgd | average | 24 | 2.016884972 | 2.645805447 | 3.206370431 |
100 | sgd | average | 168 | 3.09022719 | 4.258520113 | 4.943657472 |
1000 | sgd | average | 168 | 2.517679952 | 3.482076453 | 4.072744993 |
10000 | sgd | average | 168 | 2.334856811 | 3.119236849 | 3.707557699 |
100000 | sgd | average | 168 | 2.337246015 | 3.011933936 | 3.795810279 |
100 | sgd | max | 24 | 18.23503843 | 34.29850597 | 23.89585424 |
1000 | sgd | max | 24 | 13.47896046 | 24.62646673 | 17.30176546 |
10000 | sgd | max | 24 | 10.67221511 | 29.67377625 | 10.22257095 |
100 | sgd | max | 168 | 35.23448871 | 47.5295067 | 54.83831383 |
1000 | sgd | max | 168 | 19.08934225 | 30.49607477 | 25.80554427 |
10000 | sgd | max | 168 | 972.385479 | 1355.847489 | 1543.708972 |
100 | sgd | min | 24 | 3.924853796 | 5.005220747 | 6.527904439 |
1000 | sgd | min | 24 | 3.638021589 | 4.676746577 | 6.033199488 |
10000 | sgd | min | 24 | 3.28900021 | 4.656878363 | 5.186712738 |
100 | sgd | min | 168 | 4.58477292 | 5.894275553 | 7.51592956 |
1000 | sgd | min | 168 | 4.503050078 | 5.707805696 | 7.477669474 |
10000 | sgd | min | 168 | 4.044766103 | 5.214737283 | 6.639052405 |
100000 | sgd | min | 168 | 4.012253118 | 5.126297319 | 6.596457717 |
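A few rows in the table above have MAE values that are orders of magnitude larger than the other subset sizes for the same model, operation, and projection. The sketch below is one possible way to flag such rows when reading the table; it is not part of the project's code. The helper name `flag_divergent` and the threshold of 10 times the group median are assumptions made for illustration, and only a handful of rows are copied from the table.

```python
from collections import defaultdict
from statistics import median

# (subset, model, operation, projection, mae), copied from the table above.
rows = [
    (100, "linreg", "min", 168, 10.40786872),
    (1000, "linreg", "min", 168, 826951148.6),
    (10000, "linreg", "min", 168, 4.040246507),
    (100000, "linreg", "min", 168, 4.009690602),
    (100, "sgd", "max", 168, 35.23448871),
    (1000, "sgd", "max", 168, 19.08934225),
    (10000, "sgd", "max", 168, 972.385479),
]


def flag_divergent(rows, factor=10.0):
    """Return rows whose MAE exceeds `factor` times the median MAE of
    their (model, operation, projection) group. The factor is an
    arbitrary illustrative threshold, not a value from the report."""
    groups = defaultdict(list)
    for _subset, model, operation, projection, mae in rows:
        groups[(model, operation, projection)].append(mae)

    flagged = []
    for row in rows:
        _subset, model, operation, projection, mae = row
        if mae > factor * median(groups[(model, operation, projection)]):
            flagged.append(row)
    return flagged


for row in flag_divergent(rows):
    print(row)
```

With the rows shown, this flags the linreg, min, 168 entry at subset 1000 and the sgd, max, 168 entry at subset 10000, which are the two visibly anomalous results in these groups.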
Code Repository
The code, along with example files, can be found on GitHub at the following public repository:
https://github.com/danieledwardknudsen/weather
Summary of Tools Used
Tool | Brief Description | Summary and Use |
Python3 (Python) | Programming Language | Coordination and analytics code was written in the Python 3.6 programming language. Python was chosen because it is intuitive and allows rapid scripting, and because it has mature data science and ML libraries. |
SQL Server Express (MSSQL) | Database | MSSQL is a database management system (DBMS) used to store and manipulate bulk data. |
Transact-SQL (T-SQL) | Programming Language | Code that directly manipulated the database was written in T-SQL, Microsoft's dialect of SQL. |
Pandas | Python Library | Pandas is an open-source data science library that allows processing of large datasets. |
Scikit-Learn (sklearn) | Python Library | Sklearn is an open-source Python library that allows customization and training of dozens of types of regressors and other models. |
Jupyter Notebook | Scripting Tool | Jupyter Notebook is a tool that allows rapid script development and data manipulation from within a web-browser, and was used for data exploration and analysis. |
Microsoft Excel | Spreadsheet Tool | Excel was used to work with flat-file spreadsheets and to produce data visualizations. |
Table 1 – Summary of Software Tools Used. Python tools were used for analytics, and SQL tools were used for data movement and manipulation.