Data – the source of truth that governs our decision making these days. With accurate data, we can make nearly flawless decisions about how a website’s UX should behave, how well an idea matches the target audiences or segments (in marketing terms), and which features a website or product needs to win the hearts of customers.
But have you ever thought about how to collect and handle freshly created data?
Typically, data comes from many sources:
- server logs
- application logs (whether home-grown web applications or canned products)
- IoT / smart devices (such as health metrics recorded by smartphone apps)
- manual / human input entries
Because of this variety of sources, you need a way to collect the raw data, and the ideal approach is a generic tool that can absorb all of them with a fairly simple setup. As luck would have it, Logstash and the Beats family actually serve this scenario quite well. In a coming blog post I will explore the capabilities of Logstash and Beats; but for now, let’s get back to our “data” problem.
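Before reaching for a full-blown tool, it helps to picture what the collection step boils down to: a generic reader that turns any file-like source into raw records. Here is a minimal Python sketch of that idea (the names are purely illustrative and not part of Logstash or Beats):

```python
# A minimal sketch of a generic "collector": read raw lines from any
# file-like source (server log, app log, manual input) and emit records.
import io

def collect(source):
    """Yield one stripped raw line per record from a file-like source."""
    for line in source:
        line = line.rstrip("\n")
        if line:  # skip empty lines
            yield {"raw": line}

# The same function works for a real file, a socket wrapper, or, as here,
# an in-memory sample.
sample = io.StringIO("GET /index.html 200\nPOST /api/login 401\n")
records = list(collect(sample))
```

Real tools add batching, retries and backpressure on top, but the core contract is exactly this: raw bytes in, structured-ish records out.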
A pain point: sometimes the quality of the captured data is not in a “pleasant” state for further processing.
For example, what if a single data entry or log event spans multiple lines? How do you expect your data ingestion tool to handle that? Another example: when the data contains non-English characters such as Japanese or Chinese, can your data ingestion tool handle the encoding correctly?
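Both pain points can be made concrete with a small Python sketch: stitching multi-line entries (such as stack traces) back together, and making the text encoding explicit rather than guessed. The timestamp pattern below is an assumption for illustration; real log formats vary:

```python
# Sketch: merge continuation lines under the entry they belong to.
# Assumption: a new log entry starts with an ISO-like date prefix.
import re

NEW_ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2} ")

def merge_multiline(lines):
    """Group continuation lines (e.g. stack traces) with their first line."""
    entry = []
    for line in lines:
        if NEW_ENTRY.match(line) and entry:
            yield "\n".join(entry)
            entry = []
        entry.append(line)
    if entry:
        yield "\n".join(entry)

raw = [
    "2023-12-01 ERROR something failed",
    "  at com.example.Foo.bar(Foo.java:42)",   # continuation line
    "2023-12-01 INFO ログ収集",                  # Japanese text is fine once decoded correctly
]
entries = list(merge_multiline(raw))
# entries[0] now carries the stack-trace line; entries[1] is the Japanese line
```

For the encoding side, the key habit is to always open files with an explicit encoding, e.g. `open(path, encoding="utf-8")`, instead of relying on the platform default.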
These are real challenges, especially when you are digging into data sources that produce tens of thousands of log lines per minute, where a single log line with abnormal data could crash the entire ingestion process.
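The usual defence against that single poisonous line is to parse defensively: quarantine lines that fail to parse instead of letting the exception abort the run. A hedged sketch (the `METHOD PATH STATUS` format is an assumed toy schema):

```python
# Defensive parsing: one malformed line must not crash the whole run.
def parse_status(line):
    """Expect 'METHOD PATH STATUS'; raise ValueError on anything else."""
    method, path, status = line.split()
    return {"method": method, "path": path, "status": int(status)}

def ingest(lines):
    good, bad = [], []
    for line in lines:
        try:
            good.append(parse_status(line))
        except ValueError:
            bad.append(line)  # quarantine instead of crashing
    return good, bad

good, bad = ingest(["GET / 200", "garbage!!", "POST /login 401"])
```

Counting or persisting the quarantined lines also gives you a data-quality metric for free.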
CSV – angels and devils
The CSV format is one of the most convenient interchange file formats in the industry, and many big-data websites offer their test data as CSV downloads. Data in CSV files is often “raw” and original, which is good in most cases, except that multi-line values, non-ASCII encodings and special characters can turn raw data into “dirty” data. Dirty data can introduce side effects into the ingestion process:
- a log line with abnormal data is simply skipped → data inconsistencies
- the ingestion tool may unfortunately skip many log lines due to abnormal data → large-scale data inconsistencies
- some ingestion tools actually crash when they cannot handle abnormal data → system-level error, service down
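That said, for the two CSV-specific troublemakers above – quoted fields spanning multiple lines and non-ASCII text – Python's standard `csv` module already copes, as long as the encoding is explicit. A small sketch:

```python
# Python's csv module handles quoted multi-line fields and embedded
# commas; non-ASCII survives as long as the encoding is declared.
import csv
import io

data = 'id,comment\n1,"first line\nsecond line"\n2,"café, non-ASCII"\n'
rows = list(csv.DictReader(io.StringIO(data)))
# rows[0]["comment"] keeps its embedded newline intact;
# rows[1]["comment"] keeps both the comma and the accented character.
```

When reading from disk, the same applies via `open(path, newline="", encoding="utf-8")`, as the `csv` documentation recommends.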
It is clear that dirty data should be avoided, so we should try to “clean” it up! In a future blog entry I will look into how to purify raw / dirty data; again, there is no bullet-proof answer to this scenario. All I want to do is share some approaches, and feel free to discuss by leaving comments~
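As a teaser for that future post, a first-pass cleaning step often looks something like the sketch below: normalise the Unicode form, strip control characters, and collapse stray whitespace. This is one possible approach under my own assumptions, not a silver bullet:

```python
# A first-pass "cleaning" sketch: normalise Unicode, strip control
# characters (keeping tabs), and collapse runs of whitespace.
import re
import unicodedata

def clean(line):
    line = unicodedata.normalize("NFC", line)  # canonical Unicode form
    line = "".join(
        ch for ch in line
        if ch == "\t" or unicodedata.category(ch)[0] != "C"  # drop control chars
    )
    return re.sub(r"\s+", " ", line).strip()   # collapse whitespace

print(clean("  value\x00 with\tjunk  "))  # → "value with junk"
```

Validation rules (expected column counts, value ranges, date formats) would come next, but they are necessarily schema-specific.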
By the way, wish all a Merry Xmas~