Taming filebeat on elasticsearch (part 3)

This is a multi-part series on using filebeat to ingest data into Elasticsearch. In the first 2 parts, we installed Elasticsearch 5.x (aliased to es5 below) and Filebeat, then broke the csv contents down into fields with an ingest node, experimenting with our first ingestion pipeline. In part 3, we are going to handle multiple input sources (e.g. different csv files, each representing a certain stock's data), and use kibana to visualize what we have ingested into es5.

related links:

sources for this blog are available at:

Speaking of approaches… a tale of 2 cities

When handling our stocks data, there is a concern about data separation: should we create a brand-new index per stock symbol (e.g. stocks-ibm, stocks-alphabets), or should we mix everything into 1 single index and add an extra field to identify the symbol (e.g. stock-symbol: "ibm")?

index per stock symbol
advantages
  • to query multiple stock symbols at once, use the following syntax
    GET stocks-alphabets,stocks-ibm/_search

  • each index can have its own settings (e.g. a different no. of shards, or a different text analyzer, per index)
  • removing a certain stock's data entirely (based on its symbol) is easy and straightforward
1 index for all
advantages
  • to query multiple stock symbols at once, use a MATCH query stating the symbols you are interested in (see the sketch right after this list)
  • if all stocks data should be treated the same way (i.e. no special handling based on symbol), then it makes sense to group the data together
  • it limits the no. of indices, hence index management is much easier
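A minimal sketch of such a MATCH query; the index name stocks_all is an assumption (it depends on the index suffix you pick later), and a match query on the "symbol" field returns documents matching either term by default:

GET stocks_all/_search
{
  "query": {
    "match": { "symbol": "ibm netflix" }
  }
}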

So which approach is best? There is no "silver bullet" answer; to be fair, it depends a lot on your use case. If you treat the individual stocks data in a generic way, it is suggested to group it into 1 single index and add a "symbol" field to identify which is which. However, if your use case involves special operations on a certain stock based on its "symbol" (e.g. special scoring applied ONLY to IBM stocks, or a specific text analyzer applied ONLY to Tesla), then it might be better to use separate indices instead.

The following sections cover how to make both approaches work; in the end you can simply pick one based on your actual needs.

approach: separate index

1. start es5 and kibana
2. open a browser at http://localhost:5601 and pick the "console" app. Create a pipeline as follows:
[screenshot: creating the ingest pipeline in the console app]
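In case the screenshot is unreadable, here is a minimal sketch of what such a pipeline could look like. The pipeline name matches the one referenced by filebeat.yml below; the grok pattern and every field name except price_close (which we aggregate on later) are assumptions about the csv layout:

PUT _ingest/pipeline/blog_stock_break_down_pipeline
{
  "description": "break a csv line of stock quotes into fields (sketch)",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "(?<trade_date>%{YEAR}-%{MONTHNUM}-%{MONTHDAY}),%{NUMBER:price_open:float},%{NUMBER:price_high:float},%{NUMBER:price_low:float},%{NUMBER:price_close:float},%{NUMBER:volume:int}"
        ]
      }
    }
  ]
}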
3. open a command console and create a bash script (for windows, it would be a batch file with slightly different commands); the script file for ingesting google-alphabet stocks would look like this:
[screenshot: 00_ingest_alphabet.sh]
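The script boils down to the 3 exports plus launching filebeat; a minimal sketch, in which the csv path and the SYMBOL value are assumptions (INDEX_SUFFIX=goog matches the stocks_goog index queried later):

#!/bin/bash
# 00_ingest_alphabet.sh (sketch)
export FILE_NAME=/path/to/alphabet.csv   # csv file for filebeat to read (assumed path)
export SYMBOL=googl                      # value for the extra "symbol" field (assumed)
export INDEX_SUFFIX=goog                 # makes the target index "stocks_goog"
./filebeat -e -c filebeat.yml            # run filebeat in the foreground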
4. open filebeat.yml and configure the following areas (a sketch of this section follows the list below):
[screenshot: the prospector section of filebeat.yml]

  • paths => tells filebeat where to find the file to ingest; note that ${FILE_NAME} is a variable set in the script file earlier (remember the 3 lines of "export XXX=YYY"?)
  • fields => a collection of additional fields to add to each document; again, a reference to ${SYMBOL} is used as the value of the "symbol" field
  • document_type => the document type for the index
  • fields_under_root => means our additional field (i.e. symbol in this case) is added at the root level of the document; if not set, the symbol field would be added under a "fields" object instead
  • exclude_lines => an array of patterns; any line matching one of them is excluded
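Putting those bullet points together, the prospector section could look like this sketch (filebeat 5.x syntax; the document_type value and the header-skipping pattern are assumptions):

filebeat.prospectors:
- input_type: log
  paths:
    - ${FILE_NAME}             # expanded from the env variable exported in the script
  document_type: stocks        # assumed type name
  exclude_lines: ['^Date']     # assumed: skip the csv header row
  fields:
    symbol: ${SYMBOL}          # extra field carrying the stock symbol
  fields_under_root: true      # put "symbol" at the root of the document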

[screenshot: the output.elasticsearch section of filebeat.yml]

  • pipeline => points to the ingestion pipeline to use during indexing (in this case, the "blog_stock_break_down_pipeline" created in step 2)
  • index => the target index name, again referencing the ${INDEX_SUFFIX} set in the script file earlier
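A sketch of the corresponding output section (the hosts value assumes a local single-node es5):

output.elasticsearch:
  hosts: ["localhost:9200"]                  # assumed: local es5
  index: "stocks_${INDEX_SUFFIX}"            # e.g. stocks_goog when INDEX_SUFFIX=goog
  pipeline: blog_stock_break_down_pipeline   # run every document through the pipeline from step 2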
5. execute the script file and start ingesting… then switch to kibana (browser) and check whether any documents have been ingested

e.g. “./00_ingest_alphabet.sh”

[screenshot: ingested documents showing up in the console]
continue to run the scripts (ibm, amazon, alphabet, netflix and tesla) until all the data is ingested.

queries

[screenshot: querying across multiple stock indices]
PS. the important point is that you can query based on an index pattern (e.g. query both ibm and google data by using "stocks_goog,stocks_ibm" as the index pattern)
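For example, a simple match_all across both indices could look like this:

GET stocks_goog,stocks_ibm/_search
{
  "query": { "match_all": {} }
}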

approach: 1 index for all with a “symbol” identification field

1. start es5 and kibana
2. open a browser at http://localhost:5601 and pick the "console" app. Create the same pipeline as in the previous approach:
[screenshot: creating the ingest pipeline in the console app]
3. open a command console and create a bash script (for windows, it would be a batch file with slightly different commands); the script file for ingesting google-alphabet stocks would look like this:
[screenshot: 01_ingest_alphabet_all.sh]
PS. the INDEX_SUFFIX is set to "all" for all the scripts, therefore in the end all the data is diverted to the same index
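A minimal sketch of this variant; only INDEX_SUFFIX changes compared to the earlier script (the csv path and SYMBOL value are again assumptions):

#!/bin/bash
# 01_ingest_alphabet_all.sh (sketch)
export FILE_NAME=/path/to/alphabet.csv   # assumed csv path
export SYMBOL=googl                      # assumed symbol value
export INDEX_SUFFIX=all                  # every script now targets "stocks_all"
./filebeat -e -c filebeat.yml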
4. open filebeat.yml and configure the same areas as in the separate-index approach: paths, fields, document_type, fields_under_root and exclude_lines on the prospector side, plus pipeline and index on the output side (see the sketches in the previous walkthrough); nothing changes here, except that ${INDEX_SUFFIX} now resolves to "all", so every run writes to the same index
5. execute the script file and start ingesting… then switch to kibana (browser) and check whether any documents have been ingested

e.g. “./01_ingest_alphabet_all.sh”

[screenshot: ingested documents showing up in the console]
continue to run the scripts (ibm, amazon, alphabet, netflix and tesla) until all the data is ingested.

queries

[screenshot: querying the combined index with a symbol filter]
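With everything in one index, the "symbol" field drives both filtering and grouping. As a sketch, here is an average closing price per symbol (the index name stocks_all is an assumption, and "symbol.keyword" assumes es5's default dynamic mapping for strings):

GET stocks_all/_search
{
  "size": 0,
  "aggs": {
    "by_symbol": {
      "terms": { "field": "symbol.keyword" },
      "aggs": {
        "avg_close": { "avg": { "field": "price_close" } }
      }
    }
  }
}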

visualize, make your data dance~

Now we finally have the data ingested, but it becomes far more useful once trends can be spotted in it. Luckily, kibana is a great tool for visualizing your data. Let's go through a few easy steps to get your data ready to be visualized.

1. set up the index pattern in the management app; this is important if you want to be able to "discover" your data in the discover app
[screenshots: configuring the index pattern in the management app]
2. after the index pattern is configured, pick the "visualize" app and choose the "line chart"; by following the diagram, you can create a nice line chart showing the trend of the IBM stock
[screenshot: line chart configuration in the visualize app]

when the (time)lion roars, the data dances even more~

Cool~ We have our first visualization of the trend for a particular stock. However, wouldn't it be even better if we could combine several stocks in 1 single visualization for deeper analysis? If you are running es5 version 5.1.x or above, you should have the Timelion app available in your kibana sidebar.

1. the first thing you should learn about Timelion is how to plot a single line chart. Try it by typing:

.es(index=stocks_ibm,metric='avg:price_close').label('ibm trend')

[screenshot: Timelion line chart for the ibm trend]
everything starts with the ".es()" function; by providing different parameters, a line chart is created:

  • index => the index providing the data for the visualization
  • metric => the kind of aggregation to run on the data

PS. the line chart is identical to what we created earlier in kibana!
PS. you might sometimes see no data at all; that's because you need to pick the right time range! (by default Timelion and kibana display the last 15 minutes of data; in this case, our data starts from 2016-03, hence you need to widen the time range before any data shows up)

2. now, add another line chart. By adding another ".es()", you can embed a new line chart using another dataset (index). Try typing:

.es(index=stocks_ibm,metric='avg:price_close').label('ibm trend'), .es(index=stocks_netflix,metric='avg:price_close').label('netflix trend')

[screenshots: two Timelion series plotted on the same chart]

3. you can even try to predict how a stock would go by using ".holt()" (Holt-Winters)
[screenshot: Timelion chart with a holt() forecast]
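A sketch of what that expression could look like; the smoothing parameters here are arbitrary picks for illustration, not tuned values:

.es(index=stocks_ibm,metric='avg:price_close').label('ibm trend'), .es(index=stocks_ibm,metric='avg:price_close').holt(alpha=0.5, beta=0.5).label('ibm forecast')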
PS. in fact, Timelion has far more features to explore; remember that there is online documentation available.
[screenshot: the Timelion help panel]

a short summary

Finally, we are there. We started this journey by kickstarting our es5 and filebeat; then we went through data massaging using ingestion pipelines; next we explored how useful the ingested data could be through visualizations; in the end the data is no longer just static records, but an asset for discovering trends and a reference for decision making (is it the right time to invest in a particular stock? 😛 )

Even though this is the end of the tutorial series, the end is just the beginning of exploring how es5 could help solve big data use cases. Happy coding and data digging~~~
